Cohere recently added multimodal embedding capabilities to its search models, enabling users to incorporate images into RAG (Retrieval-Augmented Generation) style enterprise search. The new feature marks a significant advance in enterprise search technology.
Embed 3, the embedding model Cohere introduced last year, converts data into numerical representations. In RAG applications, businesses transform documents into embeddings, and the model retrieves relevant information by comparing those embeddings. The updated Embed 3 now supports embeddings for both images and text.
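The retrieval step described above can be sketched with a toy example. The vectors and document names below are illustrative placeholders, not real model output; in practice the embeddings would come from a model such as Embed 3, and similarity search would run over a vector index rather than a Python loop.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical pre-computed document embeddings (placeholder values).
doc_embeddings = {
    "q3_report.pdf": [0.9, 0.1, 0.0],
    "design_spec.docx": [0.1, 0.8, 0.3],
}

# A query is embedded the same way, then compared against the documents.
query_embedding = [0.85, 0.15, 0.05]

# Retrieve the document whose embedding is closest to the query.
best_doc = max(
    doc_embeddings,
    key=lambda name: cosine_similarity(query_embedding, doc_embeddings[name]),
)
print(best_doc)  # q3_report.pdf
```

The retrieved document (or a relevant passage from it) is then passed to a generative model as context, which is what puts the "retrieval-augmented" in RAG.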
As detailed by Cohere, the new multimodal version can generate embeddings from both images and text, enabling businesses to fully leverage the extensive data contained within images. Companies can now develop systems that accurately and swiftly search for essential multimodal assets such as detailed reports, product catalogs, and design documents, thereby boosting employee productivity.
Notably, Embed 3's encoders share a unified latent space, allowing users to store both images and text in the same database. This approach removes the traditional need to maintain separate databases per modality and makes mixed-modality searches more effective.
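A minimal sketch of what a single mixed-modality index could look like, assuming a shared latent space makes text and image vectors directly comparable. The entries and vectors are made-up placeholders, not output from any real model:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One index holds embeddings for both modalities; because the encoders
# share a latent space, no separate image/text databases are needed.
# Vectors below are illustrative placeholders.
index = [
    {"id": "annual_report.pdf", "modality": "text", "vector": [0.7, 0.2, 0.1]},
    {"id": "product_photo.png", "modality": "image", "vector": [0.5, 0.4, 0.1]},
    {"id": "design_mockup.png", "modality": "image", "vector": [0.1, 0.1, 0.9]},
]

def search(query_vector, top_k=2):
    # Rank every entry by similarity to the query, regardless of modality.
    ranked = sorted(
        index,
        key=lambda e: cosine_similarity(query_vector, e["vector"]),
        reverse=True,
    )
    return [(e["id"], e["modality"]) for e in ranked[:top_k]]

# A single text query can surface both documents and images.
print(search([0.65, 0.25, 0.15]))
```

The point of the shared space is visible in the `search` function: it never branches on modality, so one query can return reports and images side by side, ranked purely by semantic closeness.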
Cohere notes that other models often cluster text and image data into distinct regions of the embedding space, which skews search results toward textual data. In contrast, Embed 3 captures the underlying meaning of the data without favoring any particular modality. Embed 3 also supports more than 100 languages, further expanding its usability.
With platforms like Google launching image-based search functionalities and chat interfaces like ChatGPT gaining widespread adoption, consumers are becoming increasingly familiar with multimodal search. Enterprises are also recognizing the potential of multimodal search and are seeking models that offer multimodal embedding options.
Currently, other companies and research institutions, including Google and OpenAI, offer multimodal embedding models. However, the competitive edge will depend on which provider can deliver models that meet enterprise demands for speed, accuracy, and security.