In this paper, we investigate the underexplored challenge of sample diversity in autoregressive (AR) generative models with bitwise visual tokenizers. We first analyze the factors limiting diversity in bitwise AR models and identify two key issues: (1) the binary classification nature of bitwise modeling, which restricts the prediction space, and (2) the overly sharp logits distribution, which causes sampling collapse and reduces diversity. Building on these insights, we propose DiverseAR, a principled and effective method that enhances image diversity without sacrificing visual quality. Specifically, we introduce an adaptive logits distribution scaling mechanism that dynamically adjusts the sharpness of the binary output distribution across different sampling steps, resulting in a smoother prediction distribution and improved diversity. To mitigate the potential fidelity loss caused by distribution smoothing, we further develop an energy-based generation path search algorithm that avoids sampling low-confidence tokens, thereby preserving high visual quality. Extensive experiments show that DiverseAR unlocks greater diversity in bitwise autoregressive image generation.
In bitwise autoregressive models, the predicted probabilities often become overly peaked, giving one of the two classes near-certain confidence. When this happens, top-p sampling collapses into a deterministic choice and the randomness of that bit is lost. As shown in the two subplots above, this collapse of bit-level randomness restricts feature variation. Diversity degradation therefore arises mainly from (1) the binary classification nature of bitwise modeling and (2) the overconfident output distributions.
* The diversity issue mainly originates from bitwise tokenization, rather than from VAR (Visual Autoregressive Modeling) itself.
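The collapse described above can be illustrated with a minimal sketch of top-p (nucleus) filtering applied to a single bit. This is not the authors' code: the helper names `sigmoid` and `top_p_binary`, the logit values, and the threshold of 0.95 are assumptions chosen only for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def top_p_binary(p_one: float, top_p: float) -> dict:
    """Apply top-p filtering to one bit and return the renormalized nucleus.

    p_one is the predicted probability that the bit equals 1.
    The result maps each surviving class (0 or 1) to its sampling probability.
    """
    probs = {1: p_one, 0: 1.0 - p_one}
    # Rank the two classes from most to least likely.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = {}, 0.0
    for cls, p in ranked:
        kept[cls] = p
        mass += p
        if mass >= top_p:  # nucleus is full, drop the remaining class
            break
    return {cls: p / mass for cls, p in kept.items()}


# A sharp logit: the top class alone already exceeds top_p,
# so the other class is discarded and sampling becomes deterministic.
sharp = top_p_binary(sigmoid(6.0), top_p=0.95)

# A softer logit: both classes stay in the nucleus, so the bit remains random.
soft = top_p_binary(sigmoid(1.0), top_p=0.95)
```

With only two classes per bit, the nucleus can shrink to a single class as soon as the larger probability reaches the threshold. Here `sigmoid(6.0)` is roughly 0.9975, so `sharp` keeps only class 1, while `soft` keeps both classes.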
DiverseAR is an effective approach that enhances image and video diversity in bitwise autoregressive modeling without sacrificing visual quality. It introduces adaptive logits smoothing and an energy-based generation path selection strategy to achieve richer, more diverse sampling.
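To make the two ingredients more concrete, here is a rough sketch of how step-dependent logit smoothing and an energy-based choice among candidates could fit together. It is a sketch under stated assumptions, not the DiverseAR implementation: the linear temperature schedule (including its direction and its `t_start` and `t_end` values), the use of negative log-likelihood under the unscaled logits as the "energy", and the best-of-N selection are all hypothetical stand-ins, since this page does not specify those details.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def step_temperature(step: int, num_steps: int,
                     t_start: float = 2.0, t_end: float = 1.0) -> float:
    """Hypothetical schedule: interpolate linearly from t_start to t_end."""
    if num_steps <= 1:
        return t_end
    frac = step / (num_steps - 1)
    return t_start + frac * (t_end - t_start)


def sample_bits(logits, temperature: float, rng: random.Random):
    """Sample each bit from its temperature-smoothed Bernoulli distribution."""
    bits = []
    for z in logits:
        p_one = sigmoid(z / temperature)  # temperature > 1 flattens the distribution
        bits.append(1 if rng.random() < p_one else 0)
    return bits


def energy(bits, logits) -> float:
    """Assumed energy: negative log-likelihood under the unscaled logits.

    Low energy means the sampled bits agree with what the model is confident about.
    """
    total = 0.0
    for b, z in zip(bits, logits):
        p_one = sigmoid(z)
        p = p_one if b == 1 else 1.0 - p_one
        total -= math.log(max(p, 1e-12))
    return total


def select_lowest_energy(logits, temperature: float,
                         num_candidates: int, rng: random.Random):
    """Draw several smoothed candidates and keep the one with the lowest energy."""
    candidates = [sample_bits(logits, temperature, rng)
                  for _ in range(num_candidates)]
    return min(candidates, key=lambda bits: energy(bits, logits))


rng = random.Random(0)
logits = [4.0, -4.0, 0.5]  # illustrative per-bit logits
temperature = step_temperature(step=0, num_steps=5)
chosen = select_lowest_energy(logits, temperature, num_candidates=8, rng=rng)
```

The intent of the sketch is the trade-off named above: smoothing restores randomness in the bits, while the energy-based selection filters out candidates dominated by low-confidence bits.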
Qualitative comparisons between Infinity with and without DiverseAR show that integrating DiverseAR clearly improves diversity while preserving high visual quality.
Qualitative comparisons in video generation show that DiverseAR consistently improves motion and content diversity while maintaining the visual fidelity of InfinityStar.
@article{yang2025diversear,
title={DiverseAR: Boosting Diversity in Bitwise Autoregressive Image Generation},
author={Yang, Ying and Lv, Zhengyao and Pan, Tianlin and Wang, Haofan and Yang, Binxin and Yin, Hubery and Li, Chen and Si, Chenyang},
journal={arXiv preprint arXiv:2512.02931},
year={2025}
}