Cohere For AI, the non-profit research lab run by the AI startup Cohere, has unveiled Aya Vision, a multimodal "open-source" AI model that the lab claims is best in class.
Aya Vision can perform tasks such as writing image captions, answering questions about photos, translating text, and generating summaries in 23 major languages. Cohere is also offering Aya Vision for free through WhatsApp, calling it "a significant step toward making technological breakthroughs accessible to researchers worldwide."
In a blog post, Cohere noted, "Despite remarkable progress in AI, there remains a noticeable gap in how models perform across different languages, especially in multimodal tasks involving both text and images. Aya Vision aims to explicitly help bridge this gap."
Aya Vision comes in two versions: Aya Vision 32B and Aya Vision 8B. Cohere claims that the more advanced Aya Vision 32B surpasses models twice its size in certain visual understanding benchmarks, including Meta's Llama-3.2 90B Vision. Meanwhile, Aya Vision 8B performs better than models ten times its size in some evaluations.
Both models are available on the AI development platform Hugging Face under a Creative Commons 4.0 license with Cohere's additional terms of use; commercial applications are prohibited.
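For readers who want to try the weights directly, loading the smaller model looks roughly like the sketch below. The checkpoint ID and chat-template usage are assumptions based on common Hugging Face conventions; verify both against the model card, and note that a recent transformers release (one that includes AutoModelForImageTextToText) is required.

```python
# Minimal sketch: querying Aya Vision 8B with an image and a multilingual
# prompt via Hugging Face transformers. The repo ID and message format are
# assumptions; check the model card on Hugging Face for exact usage.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "CohereForAI/aya-vision-8b"  # assumed repo ID
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

# A chat-style message pairing an image with a question in one of the
# 23 supported languages (the image URL below is a placeholder).
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/street-scene.jpg"},
        {"type": "text", "text": "Décris cette image en une phrase."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=300)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
))
```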
Cohere revealed that Aya Vision was trained on a "diverse" pool of English-language datasets, which the lab translated and used to create synthetic annotations. Annotations, also known as labels, help a model understand and interpret data during training. For instance, annotations for training an image recognition model might include markings around objects or textual descriptions of the people, places, and objects in an image.
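To make the idea concrete, an annotation record often looks something like the following COCO-style example. This is purely illustrative; the announcement does not describe the actual schema Cohere used.

```python
# Illustrative example of an image annotation (label), following the
# common COCO convention. This is NOT Cohere's actual training schema.
annotation = {
    "image_id": 4271,
    "objects": [
        # A "marking around an object": bounding box as [x, y, width, height]
        {"bbox": [112, 64, 230, 180], "category": "bicycle"},
    ],
    # A textual description of what the image depicts
    "caption": "A red bicycle leaning against a brick wall.",
}
```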
Cohere's use of synthetic annotations, meaning annotations generated by AI, aligns with a broader industry trend. Despite the potential drawbacks, competitors such as OpenAI are increasingly relying on synthetic data to train their models as real-world data grows scarce. Research firm Gartner estimates that 60% of the data used in AI and analytics projects last year was synthetically created.
Cohere stated that by training on synthetic annotations, Aya Vision achieves competitive performance while utilizing fewer resources.
In addition, Cohere introduced a new benchmark suite called AyaVisionBench, designed to probe a model's skills in "vision-language" tasks, such as identifying differences between two images and converting screenshots into code.
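If the suite is published as a Hugging Face dataset, running a model against it would start roughly like the sketch below. The repo ID, split name, and field names are all assumptions; confirm them against the dataset card on Cohere's Hugging Face page.

```python
# Sketch: iterating over AyaVisionBench with the Hugging Face datasets
# library. Dataset ID, split, and field names are assumptions.
from datasets import load_dataset

bench = load_dataset("CohereForAI/AyaVisionBench", split="test")  # assumed ID

for example in bench.select(range(3)):
    # Each task pairs one or more images with a prompt in one of the
    # covered languages (the field names here are hypothetical).
    print(example.get("language"), example.get("prompt"))
```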
The AI industry is grappling with what is known as an "evaluation crisis," caused by the poor correlation between overall scores from popular benchmarks and proficiency in tasks that matter most to AI users. Cohere believes that AyaVisionBench represents a step toward addressing this issue by providing a "broad and challenging" framework for evaluating cross-lingual and multimodal understanding capabilities.
"This dataset offers a robust benchmark for assessing vision-language models in multilingual and real-world contexts," wrote Cohere researchers in a post on Hugging Face. "We are sharing this evaluation set with the research community to advance the development of multilingual multimodal assessments."