Multi-modal AI: The New Frontier of Artificial Intelligence

2024-05-09

Multi-modality is a relatively new term for an extremely old phenomenon: the way humans have understood the world for as long as they have existed. Individuals take in information from countless sources through their senses, including vision, hearing, and touch, and the human brain combines these different streams of data into a detailed, comprehensive picture of reality.


"Communication between people is multi-modal," said Han Xiao, CEO of Jina AI. "They use text, voice, emotions, expressions, and sometimes even photos." These are just a few obvious ways of sharing information. Therefore, he added, "it can be very certain that communication between humans and machines in the future will also be multi-modal."


A technology that looks at the world from multiple perspectives


We have not yet reached that point. The most advanced work toward it is happening in the emerging field of multi-modal AI. The problem is not a lack of vision. Mirella Lapata, a professor at the University of Edinburgh and director of its Institute for Language, Cognition and Computation, said that while the ability to translate between different modalities is clearly valuable, it is "much more complex to implement than single-modal AI."


In practice, generative AI tools use different strategies for different types of data when building large-scale data models (complex neural networks that organize vast amounts of information). Models that rely on textual sources, for example, split the text into individual tokens (usually words). Each token is assigned an "embedding" or "vector": a list of numbers that captures how and where the token is used relative to other tokens. Taken together, these vectors form a mathematical representation of the tokens' meaning. Image models, by contrast, may tokenize pixels (or patches of pixels), while audio models may tokenize sound frequencies.
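To make the token-and-embedding idea concrete, here is a minimal sketch in Python. The vocabulary, embedding size, and sentence are illustrative placeholders, and the vectors are random; in a real model the embedding table is learned during training rather than drawn at random.

```python
# Toy illustration of mapping text tokens to embedding vectors.
# Vocabulary, dimensions, and example sentence are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary: each token gets an integer id.
vocab = {"the": 0, "tree": 1, "rustles": 2, "in": 3, "wind": 4}
embedding_dim = 8  # real models use hundreds or thousands of dimensions
embedding_table = rng.normal(size=(len(vocab), embedding_dim))  # one vector per token

def embed(sentence: str) -> np.ndarray:
    """Split text into tokens and look up one embedding vector per token."""
    token_ids = [vocab[word] for word in sentence.lower().split() if word in vocab]
    return embedding_table[token_ids]  # shape: (num_tokens, embedding_dim)

vectors = embed("the tree rustles in the wind")
print(vectors.shape)  # (6, 8): six tokens, each represented by an 8-dimensional vector
```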


A multi-modal AI model typically relies on several single-modal models. As Henry Ajder, founder of the AI consultancy Latent Space, put it, building one involves "almost stringing together" the various contributing models. Doing so requires techniques for aligning the elements of each single-modal model, a process known as fusion. For example, the word "tree," an image of an oak tree, and the sound of rustling leaves can be fused in this way, allowing the model to build a multi-faceted description of reality.
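The sketch below illustrates one common way fusion can work: each single-modal embedding is projected into a shared vector space, where aligned concepts (the word "tree," the oak photo, the rustling leaves) can be compared directly. All dimensions and the projection matrices here are hypothetical stand-ins; in practice these projections are learned during training so that related inputs from different modalities end up close together.

```python
# Illustrative sketch of fusing single-modal embeddings in a shared space.
# All encoders, dimensions, and weights below are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(1)

text_dim, image_dim, audio_dim, shared_dim = 16, 32, 24, 8

# Stand-ins for the outputs of three single-modal encoders for the concept "tree".
text_emb = rng.normal(size=text_dim)    # embedding of the word "tree"
image_emb = rng.normal(size=image_dim)  # embedding of a photo of an oak tree
audio_emb = rng.normal(size=audio_dim)  # embedding of rustling leaves

# One projection matrix per modality maps into the shared space (learned in real systems).
W_text = rng.normal(size=(shared_dim, text_dim))
W_image = rng.normal(size=(shared_dim, image_dim))
W_audio = rng.normal(size=(shared_dim, audio_dim))

def to_shared(W: np.ndarray, emb: np.ndarray) -> np.ndarray:
    """Project a single-modal embedding into the shared space and normalize it."""
    z = W @ emb
    return z / np.linalg.norm(z)

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: higher means the model treats the inputs as more related."""
    return float(a @ b)

z_text = to_shared(W_text, text_emb)
z_image = to_shared(W_image, image_emb)
z_audio = to_shared(W_audio, audio_emb)

# With trained projections, the three "tree" embeddings would score high against
# each other and low against unrelated concepts; with random weights, the numbers
# below only demonstrate the mechanics of the comparison.
print(similarity(z_text, z_image), similarity(z_text, z_audio))
```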