SSD: Spatially Speculative Decoding for Autoregressive Image Generation

SSD: Spatially Speculative Decoding
Accelerates Autoregressive Image Generation

Rutgers University

*Equal advising

Abstract

Autoregressive models excel in visual generation by treating images as 1D sequences of discrete tokens, mirroring language modeling. However, this flattening discards the intrinsic 2D spatial locality of visual signals, creating severe computational bottlenecks during inference.

We introduce Spatially Speculative Decoding (SSD), a framework that aligns the predictive objective with the natural geometry of images. Rather than predicting only the immediate next token in a 1D sequence, our model simultaneously predicts the adjacent horizontal token and the token directly below it. By capitalizing on this 2D spatial correlation, spatially speculative decoding overcomes the memory wall in visual inference.
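The geometry behind the two prediction targets can be made concrete with a minimal sketch. It assumes a raster-scan (row-major) flattening of a `width × height` token grid; the function name `spatial_targets` and the choice to return `None` at grid borders are illustrative assumptions, not details taken from the paper.

```python
def spatial_targets(i, width, height):
    """Return the flat indices of the tokens to the right of and directly
    below token i in a row-major (raster-scan) grid.

    Hypothetical helper: a border position yields None for the missing
    neighbour. How SSD itself treats row ends is not stated in the source.
    """
    row, col = divmod(i, width)
    right = i + 1 if col + 1 < width else None       # adjacent horizontal token
    below = i + width if row + 1 < height else None  # token directly below
    return right, below


# On a 48x48 grid, the token below position 0 sits 48 steps later in the 1D sequence.
print(spatial_targets(0, 48, 48))   # (1, 48)
print(spatial_targets(47, 48, 48))  # (None, 95): end of the first row
```

The vertical neighbour is a full row width away in the flattened sequence, which is why a purely 1D next-token objective treats it as a distant position even though it is spatially adjacent.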

Our approach accelerates autoregressive image generation by up to 13.3× while maintaining high fidelity on DPG-Bench and GenEval. Our results suggest that respecting the underlying geometry of vision unlocks massive computational efficiencies, paving the way for real-time, high-resolution autoregressive generative models.

13.3× Inference Speedup

O(n) Decoding Complexity

Multi-Token Prediction

Motivation: Predictive Dependency is 2D

Figure 2. The Two-Dimensional Nature of Predictive Dependency. (Left) To demonstrate that spatial correlations are inherently 2D, we corrupt the sequential context during Janus-Pro-7B generation by replacing the second half of each row with random tokens (red outlines). Despite this severe disruption to the 1D sequence, visual coherence is preserved wherever the token directly above was accurately generated (blue outlines). This confirms that vertical prediction depends on spatial adjacency rather than on position in the flattened raster-scan order. (Right) Acceptance rates of horizontal vs. vertical drafting heads at matching spatial offsets confirm that predictability is governed by 2D spatial locality.
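The corruption pattern described in the caption can be sketched as follows. This is a simplified illustration on an already complete token grid, whereas the experiment corrupts the context during generation; the function name `corrupt_second_half`, the `vocab_size` parameter, and the use of uniform random token ids are assumptions made for the example.

```python
import random


def corrupt_second_half(grid, vocab_size, rng):
    """Replace the second half of every row with random token ids.

    grid is a list of rows of integer token ids. Returns a corrupted copy
    plus a boolean mask marking the replaced positions. Hypothetical helper,
    not the authors' code.
    """
    width = len(grid[0])
    start = width // 2  # first corrupted column in each row
    corrupted = [row[:] for row in grid]  # copy so the input stays intact
    mask = [[col >= start for col in range(width)] for _ in grid]
    for row in corrupted:
        for col in range(start, width):
            row[col] = rng.randrange(vocab_size)
    return corrupted, mask


rng = random.Random(0)  # fixed seed for reproducibility
clean = [[7] * 6 for _ in range(4)]
corrupted, mask = corrupt_second_half(clean, vocab_size=1000, rng=rng)
print(mask[0])  # [False, False, False, True, True, True]
```

In the flattened sequence, every uncorrupted token in the left half of a row is immediately preceded by a run of random tokens from the previous row, while its vertical neighbour above remains clean.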

Results

Verification cost on Lumina-mGPT-7B (48×48 grid), normalized by one AR step. (a) Latency of verifying K tokens in parallel. As K grows to 240, the cost stays below 1.6× a single AR step, since the parameter-loading cost dominates due to the memory wall. (b) Wall-clock speedup scales near-linearly with K, approaching the ideal K× bound.
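The relationship between verification cost and speedup can be illustrated with a back-of-the-envelope model. It assumes that each decoding step costs `relative_cost` AR steps and commits `accepted_per_step` tokens on average, and it ignores any drafting overhead; the function name and the input values are hypothetical and are not measurements from the paper.

```python
def estimated_speedup(accepted_per_step, relative_cost):
    """Rough speedup over one-token-per-step AR decoding.

    accepted_per_step: average number of tokens committed per step.
    relative_cost: cost of one verification step, in units of one AR step.
    Simplified model (assumption): drafting overhead is ignored.
    """
    if accepted_per_step <= 0 or relative_cost <= 0:
        raise ValueError("both arguments must be positive")
    return accepted_per_step / relative_cost


# Hypothetical inputs: 8 tokens accepted per step at the 1.6x cost ceiling
# quoted in the caption above.
print(estimated_speedup(8, 1.6))
```

Because the cost of a verification step grows far more slowly than K, the number of accepted tokens per step, rather than the verification cost, largely determines the achievable speedup under this model.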

Qualitative results. Side-by-side comparison of the AR baseline and SSD across three models. Our method yields up to 13.3× speedup while preserving high-resolution visual fidelity.

BibTeX

@article{xiang2026ssd,
  title={SSD: Spatially Speculative Decoding Accelerates Autoregressive Image Generation},
  author={Xiang, Shilong and Zhang, Zirui and Yu, Lijun and Mao, Chengzhi},
  journal={arXiv preprint arXiv:2606.20543},
  year={2026}
}
