SSD: Spatially Speculative Decoding Accelerates Autoregressive Image Generation

Rutgers University
*Equal advising
Figure 1 Accelerating Autoregressive Vision via 2D Spatial Anticipation. (a) Standard AR flattens the visual world into a 1D sequence and predicts one token at a time, which takes 𝒪(n²) steps for an n×n token grid. (b) Speculative Decoding accelerates generation locally but remains fundamentally constrained by this linear raster-scan geometry (𝒪(n²)). (c) Our SSD aligns the predictive objective with the intrinsic geometry of images. By factorizing 2D anticipation into two 1D directions, we draft entire spatial blocks in parallel, collapsing inference complexity to 𝒪(n). Right: Applied to Emu3 (8B), this geometric shift yields a 13.3× speedup while preserving high-resolution visual fidelity.

Abstract

Autoregressive models excel in visual generation by treating images as 1D sequences of discrete tokens, mirroring language modeling. However, this flattening discards the intrinsic 2D spatial locality of visual signals, creating severe computational bottlenecks during inference.

We introduce Spatially Speculative Decoding (SSD), a framework that aligns the predictive objective with the natural geometry of images. Rather than predicting only the immediate next token in a 1D sequence, our model simultaneously predicts the adjacent horizontal token and the token directly below it. By capitalizing on this 2D spatial correlation, spatially speculative decoding overcomes the memory wall in visual inference.

Our approach accelerates autoregressive image generation by up to 13.3× while maintaining high fidelity on DPG-Bench and GenEval. Our results suggest that respecting the underlying geometry of vision unlocks massive computational efficiencies, paving the way for real-time, high-resolution autoregressive generative models.
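To make the step-count argument concrete, here is a minimal sketch of an idealized schedule. It assumes, purely for illustration, that a token can be drafted as soon as its left neighbor and the token directly above it are available, and that every draft is accepted. Under those assumptions all tokens on the same anti-diagonal become available together, so an n×n grid needs 2n − 1 parallel waves instead of n² sequential steps. The function names are hypothetical, and this is not the authors' decoding algorithm.

```python
def raster_steps(n: int) -> int:
    """Sequential steps when one token is predicted per step."""
    return n * n


def wavefront_schedule(n: int) -> list[list[tuple[int, int]]]:
    """Group grid positions (row, col) into anti-diagonal waves.

    Idealized assumption: position (r, c) depends only on (r, c - 1)
    and (r - 1, c), so every position with the same r + c can be
    produced in the same parallel step.
    """
    waves: list[list[tuple[int, int]]] = [[] for _ in range(2 * n - 1)]
    for r in range(n):
        for c in range(n):
            waves[r + c].append((r, c))
    return waves


n = 48  # grid side length; 48x48 is the grid size quoted for Lumina-mGPT-7B
waves = wavefront_schedule(n)
print("raster-scan steps:", raster_steps(n))  # 2304
print("wavefront steps:  ", len(waves))       # 95
```

In practice some drafts are rejected during verification, so the real step count sits between these two extremes.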

13.3× Inference Speedup
O(n) Decoding Complexity
2D Multi-Token Prediction

Motivation: Predictive Dependency is 2D

Figure 2 The Two-Dimensional Nature of Predictive Dependency. To demonstrate that spatial correlations are inherently 2D, we corrupt the sequential context during Janus-Pro-7B generation by replacing the second half of each row with random tokens (red outlines). Despite this severe disruption to the 1D sequence, visual coherence is preserved wherever the token directly above was accurately generated (blue outlines). This confirms that vertical prediction relies fundamentally on spatial adjacency rather than position in the flattened raster-scan order. (Right) Acceptance rates of horizontal vs. vertical drafting heads at matching spatial offsets confirm that predictability is governed by 2D spatial locality.
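The corruption step described in the caption can be sketched as follows. This is a minimal illustration under stated assumptions: the helper name, the vocabulary size, and the row width are hypothetical, and the sketch covers only the token replacement, not the Janus-Pro-7B generation loop that would consume the corrupted context.

```python
import random


def corrupt_row_tail(row: list[int], vocab_size: int, rng: random.Random):
    """Replace the second half of one generated row with random token ids.

    Returns the corrupted row and a boolean mask marking the replaced
    positions (the "red outline" region in the figure).
    """
    half = len(row) // 2
    corrupted = list(row)
    for c in range(half, len(row)):
        corrupted[c] = rng.randrange(vocab_size)
    mask = [c >= half for c in range(len(row))]
    return corrupted, mask


rng = random.Random(0)
row = list(range(100, 108))  # stand-in for 8 generated token ids
corrupted, mask = corrupt_row_tail(row, vocab_size=16384, rng=rng)

# Columns whose token directly above stays intact for the next row.
intact_columns = [c for c, replaced in enumerate(mask) if not replaced]
print(intact_columns)  # [0, 1, 2, 3]
```

Applying this after each row is generated breaks the flattened 1D context for every following row while leaving the left half of the vertical context untouched, which is the contrast the figure relies on.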

Results


Verification cost on Lumina-mGPT-7B (48×48 grid), normalized by one AR step. (a) Latency of verifying K tokens in parallel. As K grows to 240, the cost stays below 1.6× a single AR step, since the parameter-loading cost dominates due to the memory wall. (b) Wall-clock speedup scales near-linearly with K, approaching the ideal K× bound.
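The relationship between verification cost and speedup can be written as a back-of-the-envelope estimate. The sketch below assumes a deliberately simplified cost model: each decoding step costs one verification pass (expressed relative to a single AR step), drafting overhead is ignored, and the number of tokens accepted per step is given. The function name and the example numbers are hypothetical and are not measurements from the paper.

```python
def estimated_speedup(accepted_per_step: float, relative_verify_cost: float) -> float:
    """Estimate speedup over one-token-per-step AR decoding.

    accepted_per_step:    average tokens committed per verification pass.
    relative_verify_cost: latency of one verification pass divided by
                          the latency of one AR step (1.0 means equal).
    """
    if accepted_per_step < 1:
        raise ValueError("at least one token is committed per step")
    if relative_verify_cost <= 0:
        raise ValueError("relative cost must be positive")
    return accepted_per_step / relative_verify_cost


# Hypothetical inputs: 16 tokens accepted per pass, and a pass that
# costs 1.6x one AR step (the upper bound quoted for K = 240).
print(round(estimated_speedup(16, 1.6), 2))
```

Because the verification cost grows so slowly with K in the memory-bound regime, the estimate is driven almost entirely by how many drafted tokens are accepted.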

Qualitative results. Side-by-side comparison of the AR baseline and SSD across three models. Our method yields up to 13.3× speedup while preserving high-resolution visual fidelity.

BibTeX

@article{xiang2026ssd,
  title={SSD: Spatially Speculative Decoding Accelerates Autoregressive Image Generation},
  author={Xiang, Shilong and Zhang, Zirui and Yu, Lijun and Mao, Chengzhi},
  journal={arXiv preprint arXiv:2606.20543},
  year={2026}
}