Apple continues to push the boundaries of on-device AI. Its latest research contribution is MobileCLIP, a family of efficient image-text models optimized for mobile hardware, accompanied by a new multi-modal reinforced training method that significantly improves learning efficiency through knowledge transfer.
Large-scale image-text foundation models such as CLIP have demonstrated impressive zero-shot performance and improved robustness across a wide range of tasks. Deploying them on mobile devices remains challenging, however, because of their large size and high latency.
MobileCLIP adopts a hybrid CNN-transformer architecture and applies structural reparameterization in both the image and text encoders, greatly reducing model size and latency. The family ships in several variants (S0, S1, S2, and B) that cover different size and latency trade-offs for mobile applications.
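To illustrate the general idea of structural reparameterization, here is a minimal RepVGG-style sketch in PyTorch: a multi-branch block used during training is algebraically fused into a single convolution for inference. The class and helper names are illustrative, not taken from Apple's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReparamConvBlock(nn.Module):
    """Multi-branch block at training time, a single 3x3 conv at inference."""

    def __init__(self, channels: int):
        super().__init__()
        # Training-time branches: 3x3 conv, 1x1 conv, and identity, each followed by BN.
        self.conv3x3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn3x3 = nn.BatchNorm2d(channels)
        self.conv1x1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn1x1 = nn.BatchNorm2d(channels)
        self.bn_id = nn.BatchNorm2d(channels)
        self.fused = None  # single conv created by reparameterize()

    def forward(self, x):
        if self.fused is not None:
            return self.fused(x)
        return self.bn3x3(self.conv3x3(x)) + self.bn1x1(self.conv1x1(x)) + self.bn_id(x)

    @staticmethod
    def _fuse_conv_bn(weight, bn):
        # Fold BatchNorm statistics into the preceding convolution's weights and bias.
        std = (bn.running_var + bn.eps).sqrt()
        w = weight * (bn.weight / std).reshape(-1, 1, 1, 1)
        b = bn.bias - bn.running_mean * bn.weight / std
        return w, b

    def reparameterize(self):
        c = self.conv3x3.out_channels
        w3, b3 = self._fuse_conv_bn(self.conv3x3.weight, self.bn3x3)
        # Pad the 1x1 kernel to 3x3 so it can be summed with the 3x3 branch.
        w1, b1 = self._fuse_conv_bn(F.pad(self.conv1x1.weight, [1, 1, 1, 1]), self.bn1x1)
        # The identity branch is equivalent to a 3x3 conv with a centered delta kernel.
        w_id = torch.zeros(c, c, 3, 3)
        for i in range(c):
            w_id[i, i, 1, 1] = 1.0
        w_id, b_id = self._fuse_conv_bn(w_id, self.bn_id)
        self.fused = nn.Conv2d(c, c, 3, padding=1)
        self.fused.weight.data = w3 + w1 + w_id
        self.fused.bias.data = b3 + b1 + b_id
```

After calling `reparameterize()` on a trained block (in eval mode), the single fused convolution produces the same outputs as the three branches, which is how the inference-time model ends up smaller and faster without losing training-time expressiveness.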
The smallest variant, MobileCLIP-S0, matches the zero-shot performance of OpenAI's ViT-B/16 CLIP model while being 4.8× faster and 2.8× smaller. MobileCLIP-S2 surpasses SigLIP's ViT-B/16 model in average zero-shot performance while being 2.3× faster, 2.1× smaller, and trained on 3× fewer seen samples. MobileCLIP-B (LT) reaches 77.2% zero-shot accuracy on ImageNet, significantly better than recent works with similar architectures such as DFN and SigLIP, and even surpasses OpenAI's ViT-L/14@336.
To further improve the learning efficiency of the MobileCLIP models, Apple introduces a new training strategy called multi-modal reinforced training. It transfers knowledge from a pre-trained image captioning model and an ensemble of strong CLIP models. By storing this additional knowledge in a reinforced dataset, the method avoids the computational overhead of running the teacher models during training.
The reinforced dataset, DataCompDR, comes in two variants: DataCompDR-12M and DataCompDR-1B. Compared with standard CLIP training, training on DataCompDR improves learning efficiency by 10× to 1000×. For example, with DataCompDR-12M, a ViT-B/16-based CLIP model reaches 61.7% zero-shot accuracy on ImageNet-val in roughly one day on a single node with 8×A100 GPUs.
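As a hypothetical illustration, one record in a reinforced dataset like DataCompDR can be thought of as bundling the image and its web caption with synthetic captions and pre-computed teacher embeddings, so neither the captioner nor the teacher CLIP models have to run during training. The field names below are assumptions for illustration, not the actual DataCompDR schema.

```python
from dataclasses import dataclass
from typing import List
import torch

@dataclass
class ReinforcedSample:
    image: torch.Tensor               # image tensor (C, H, W)
    web_caption: str                  # original alt-text caption from the web
    synthetic_captions: List[str]     # captions generated offline by a captioning model
    teacher_image_embs: torch.Tensor  # stored teacher image embeddings, e.g. (num_teachers, dim)
    teacher_text_embs: torch.Tensor   # stored teacher text embeddings, e.g. (num_teachers, dim)

def collate(batch: List[ReinforcedSample]):
    """Stack stored teacher embeddings; the student never queries the teachers online."""
    images = torch.stack([s.image for s in batch])
    captions = [s.web_caption for s in batch]
    img_embs = torch.stack([s.teacher_image_embs for s in batch])
    txt_embs = torch.stack([s.teacher_text_embs for s in batch])
    return images, captions, img_embs, txt_embs
```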
The core of MobileCLIP lies in the hybrid CNN-transformer architectures of its image and text encoders. For the image encoder, Apple introduces MCi, an improved hybrid vision transformer based on the recent FastViT model. Key optimizations include reducing the MLP expansion ratio in the feed-forward blocks and reinvesting the saved parameters in additional depth, improving parameter efficiency.
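The sketch below shows the kind of knob this describes: a transformer feed-forward block with a configurable MLP expansion ratio, where lowering the ratio (for example from the usual 4×) frees parameters that can be reinvested in extra depth. The names and defaults are illustrative, not the MCi implementation.

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Transformer feed-forward block with a tunable expansion ratio."""

    def __init__(self, dim: int, expansion_ratio: float = 3.0, dropout: float = 0.0):
        super().__init__()
        hidden = int(dim * expansion_ratio)  # e.g. 768 -> 2304 instead of the usual 3072
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
```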
For text encoding, Apple develops Text-RepMixer, a convolutional token mixer. It decouples the training-time and inference-time architectures by strategically replacing self-attention layers with reparameterizable convolutional blocks. Text-RepMixer is smaller and faster while maintaining accuracy on par with a pure transformer text encoder.
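The following is a hedged sketch of the reparameterization idea behind such a convolutional token mixer: a depthwise 1D convolution over the token sequence is trained alongside a skip connection, and at inference the skip is folded into the convolution kernel so only a single conv remains. This illustrates the train/inference decoupling, not Apple's actual Text-RepMixer code.

```python
import torch
import torch.nn as nn

class RepMixer1D(nn.Module):
    """Depthwise 1D token mixer with a skip connection that is folded away at inference."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.kernel_size = kernel_size
        # Depthwise conv mixes information across neighboring tokens, per channel.
        self.mixer = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.reparameterized = False

    def forward(self, x):  # x: (batch, dim, seq_len)
        if self.reparameterized:
            return self.mixer(x)   # single conv at inference
        return x + self.mixer(x)   # conv + skip during training

    @torch.no_grad()
    def reparameterize(self):
        # Fold the identity skip into the kernel: add 1 to the center tap of each channel.
        center = self.kernel_size // 2
        self.mixer.weight[:, 0, center] += 1.0
        self.reparameterized = True
```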
The key innovation of multi-modal reinforced training is how it transfers knowledge from an image captioning model and an ensemble of pre-trained CLIP models. Inspired by offline knowledge distillation, the approach avoids the computational overhead of running large teacher models during training: the additional knowledge is computed once and stored in the reinforced dataset, which substantially improves training efficiency.
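A minimal sketch of how such a training objective could look, assuming L2-normalized embeddings: the standard CLIP contrastive loss is combined with a distillation term that matches the student's image-text similarity matrix to one computed from the stored teacher embeddings. The loss weighting and temperatures are illustrative, not the exact formulation from the paper.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE loss on L2-normalized student embeddings."""
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def distillation_loss(s_img, s_txt, t_img, t_txt, temperature=0.07):
    """KL divergence between teacher and student image-text similarity distributions.
    The teacher similarities come from embeddings stored in the reinforced dataset,
    so the teacher models themselves are never executed during training."""
    s_logits = s_img @ s_txt.t() / temperature
    t_logits = t_img @ t_txt.t() / temperature
    return F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.softmax(t_logits, dim=-1),
                    reduction="batchmean")

def reinforced_training_loss(s_img, s_txt, t_img, t_txt, lam=0.7):
    """Blend of ground-truth contrastive loss and teacher distillation (lam is a guess)."""
    return ((1 - lam) * clip_contrastive_loss(s_img, s_txt)
            + lam * distillation_loss(s_img, s_txt, t_img, t_txt))
```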
This research has far-reaching implications, especially for upcoming consumer devices such as the iPhone. On-device models like MobileCLIP mean that smarter image search, real-time visual look-up, more accurate object detection, and novel augmented-reality experiences can run seamlessly without a network connection. Imagine pointing an iPhone camera or Vision Pro at an unfamiliar plant or landmark and immediately getting relevant information, all processed locally on the device for maximum speed and privacy.
Given Apple's ongoing research into on-device AI models, we can expect a new generation of intelligent, responsive, and privacy-focused experiences on its consumer devices later this year. Advanced computer vision and language understanding may become standard features, working in concert with dedicated AI hardware. It is a glimpse of a future where powerful AI is no longer confined to the cloud but built into the devices we carry every day.