Apple Unveils AIM: A New Autoregressive Pretrained Visual Model

2024-01-18

Apple recently introduced Autoregressive Image Models (AIM), a family of visual models pre-trained with an autoregressive objective.

These models represent a new frontier in training large-scale visual models: inspired by their text counterparts, large language models (LLMs), they exhibit similar scaling properties.

Researchers say this provides a scalable approach to unsupervised pre-training of visual models. The authors pre-trained with a generative autoregressive objective and proposed technical adaptations to improve transfer to downstream tasks.
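
In plain terms, the objective mirrors next-token prediction in LLMs: an image is split into a sequence of patches, and the model learns to predict each patch from the ones before it. The sketch below illustrates that idea, assuming raster-order patches, a causal Transformer, and a pixel-regression (MSE) loss; all class and function names are illustrative, not Apple's code, and details such as the attention pattern and prediction head differ in the actual work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy autoregressive image model: a causal Transformer over image patches.
# Illustrative sketch only; this is not Apple's implementation.
class ToyAIM(nn.Module):
    def __init__(self, patch_dim=768, dim=512, depth=4, heads=8, max_patches=256):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        self.pos = nn.Parameter(torch.zeros(1, max_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, patch_dim)  # regress pixels of the next patch

    def forward(self, patches):
        # patches: (batch, seq, patch_dim), in raster-scan order
        n = patches.size(1)
        # Causal mask: True marks positions a patch may NOT attend to,
        # i.e. every patch that comes after it in the sequence.
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        x = self.embed(patches) + self.pos[:, :n]
        return self.head(self.trunk(x, mask=mask))

def ar_loss(model, patches):
    # Predict patch k+1 from patches 1..k and regress its pixel values.
    preds = model(patches[:, :-1])
    return F.mse_loss(preds, patches[:, 1:])

model = ToyAIM()
patches = torch.randn(2, 196, 768)  # e.g. a 14x14 grid of 16x16x3 patches
loss = ar_loss(model, patches)
loss.backward()
```

The key property, shared with LLM pre-training, is that a single scalar loss drives training, which is what lets the authors relate the objective's value to downstream performance.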

Researchers report that the quality of the learned visual features improves with both model capacity and data volume, and that the value of the pre-training objective correlates with the model's performance on downstream tasks.

The team put these findings into practice by pre-training a 7-billion-parameter AIM on 2 billion images, reaching 84.0% accuracy on ImageNet-1k with a frozen backbone.
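
"Frozen backbone" here means the pre-trained trunk is never fine-tuned: its weights stay fixed, and only a lightweight classification head is trained on top of its features. Below is a minimal sketch of that protocol using a plain linear probe, assuming a PyTorch backbone that returns pooled features; the helper names are hypothetical, and the paper's actual probing setup may differ.

```python
import torch
import torch.nn as nn

def frozen_probe(backbone: nn.Module, feature_dim: int, num_classes: int = 1000):
    """Train only a linear classifier on top of a frozen, pre-trained trunk."""
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad = False                  # backbone weights stay fixed

    probe = nn.Linear(feature_dim, num_classes)  # the only trainable module
    opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    def step(images, labels):
        with torch.no_grad():                    # no gradients into the trunk
            feats = backbone(images)             # (batch, feature_dim) features
        loss = loss_fn(probe(feats), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

    return probe, step

# Example with a dummy backbone standing in for the pre-trained model.
dummy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1024))
probe, step = frozen_probe(dummy, feature_dim=1024)
images, labels = torch.randn(4, 3, 224, 224), torch.randint(0, 1000, (4,))
print(step(images, labels))
```

Because the trunk is untouched, a score like 84.0% measures the quality of the pre-trained features themselves rather than what fine-tuning can recover.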

Interestingly, even at this scale, they observed no signs of performance saturation. AIM's pre-training resembles that of LLMs and requires no image-specific strategies to keep large-scale training stable.

About AIM

Apple believes AIM has desirable properties, including the ability to scale to 7 billion parameters using vanilla Transformers, without stability-enhancing techniques or extensive hyperparameter tuning.

In addition, AIM's performance on the pre-training task correlates strongly with downstream performance; it surpasses state-of-the-art methods such as MAE and narrows the gap between generative and joint-embedding pre-training approaches.

Researchers also found no signs of performance saturation as models scale, suggesting that larger models trained for longer could improve performance further.