Hugging Face has released a new vision-language model called SmolVLM-256M, which it describes as having the lowest parameter count of any model in its category.
Thanks to its compact size, SmolVLM-256M can run on devices with relatively modest processing power, such as consumer laptops. It also supports WebGPU, a browser technology that gives AI-powered web applications access to the graphics processing unit in a user's computer, which makes it possible to run the model directly in web browsers. The model can handle a range of tasks involving visual data, including answering questions about scanned documents, describing video content, and interpreting charts. Hugging Face has also released an instruction-tuned variant of the model that tailors its output to user prompts.
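For readers who want to try the model locally, a minimal sketch along the following lines should work with Hugging Face's transformers library. The checkpoint id HuggingFaceTB/SmolVLM-256M-Instruct, the placeholder image path, and the sample question are illustrative assumptions, not details from this article.

```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

# Checkpoint id assumed from Hugging Face's published release.
MODEL_ID = "HuggingFaceTB/SmolVLM-256M-Instruct"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID)

# Any local path or URL works here; this file name is a placeholder.
image = load_image("chart.png")

# Build a chat-style prompt that pairs the image with a question.
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "What does this chart show?"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=prompt, images=[image], return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=200)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

For in-browser deployment, the same weights can instead be served through a WebGPU-capable runtime such as Hugging Face's Transformers.js, though the setup differs from the Python sketch above.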
Technically speaking, SmolVLM-256M comprises 256 million parameters, far fewer than the billions found in state-of-the-art foundation models. Fewer parameters mean lower hardware requirements, which is why SmolVLM-256M can run on devices like laptops.
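A back-of-envelope calculation, offered here as an illustration rather than a published figure, shows why the parameter count matters for hardware:

```python
# Rough estimate of weight memory; an illustration, not an official figure.
params = 256_000_000      # SmolVLM-256M
bytes_per_param = 2       # 16-bit (fp16/bf16) weights
print(f"~{params * bytes_per_param / 2**20:.0f} MiB of weights")  # ~488 MiB
```

At the same precision, an 8-billion-parameter model would need roughly 15 GiB for its weights alone, which already exceeds the memory of many consumer GPUs.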
SmolVLM-256M is the latest addition to Hugging Face's series of open-source vision-language models. One of the key improvements over the company's earlier models is a new encoder, the component that converts input images into numerical representations that the rest of the neural network can process more easily.
The encoder in SmolVLM-256M is based on an open-source algorithm called SigLIP base patch-16/512, a Google-developed descendant of the CLIP image processing architecture that OpenAI released in 2021. At 93 million parameters, the encoder is less than a quarter the size of the one in Hugging Face's previous generation of models, which accounts for much of SmolVLM-256M's reduced hardware footprint. Notably, according to research from Apple and Google, pairing a smaller encoder with higher-resolution input images often improves visual understanding without increasing the parameter count.
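The encoder's size and output shape can be checked directly. The sketch below assumes the public google/siglip-base-patch16-512 checkpoint, which appears to match the "SigLIP base patch-16/512" name used here; treat the hub id as an assumption.

```python
import torch
from transformers import SiglipVisionModel

# Hub id assumed to correspond to "SigLIP base patch-16/512".
encoder = SiglipVisionModel.from_pretrained("google/siglip-base-patch16-512")

# Count parameters: the base vision tower is roughly 93M.
n_params = sum(p.numel() for p in encoder.parameters())
print(f"{n_params / 1e6:.0f}M parameters")

# A 512x512 image cut into 16x16 patches yields (512/16)^2 = 1024 tokens.
pixels = torch.randn(1, 3, 512, 512)
features = encoder(pixel_values=pixels).last_hidden_state
print(features.shape)  # expected: torch.Size([1, 1024, 768])
```

The 512-pixel input resolution is the relevant design choice: cutting a larger image into the same 16-pixel patches gives the model more visual tokens, and therefore more detail, without adding any encoder parameters.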
To train the model, Hugging Face used an improved version of the dataset it had employed for its earlier vision-language models. Among other changes aimed at strengthening document understanding and image description skills, the company added handwritten mathematical expressions to the dataset to boost the model's reasoning capabilities.
In an internal evaluation, Hugging Face compared SmolVLM-256M with a multimodal model it had released 18 months earlier that features 80 billion parameters. SmolVLM-256M outperformed the older model on more than a half-dozen benchmarks. On MathVista, a benchmark that includes geometry problems, its score was more than 10% higher.
Alongside SmolVLM-256M, Hugging Face introduced a more capable model called SmolVLM-500M, which comprises 500 million parameters. It trades some hardware efficiency for higher output quality and, according to Hugging Face, is also better at following user instructions.
The source code for both models is available on Hugging Face's eponymous AI project hosting platform.