Multi-modal AI: The New Frontier of Artificial Intelligence

2024-05-09

Multi-modality is a relatively new term for an extremely old phenomenon: the way humans have understood the world for as long as they have existed. Individuals take in information from countless sources through their senses, including vision, hearing, and touch, and the human brain combines these different streams of data into a detailed, comprehensive picture of reality.


"Communication between people is multi-modal," said Han Xiao, CEO of Jina AI. "They use text, voice, emotions, expressions, and sometimes even photos." These are just a few obvious ways of sharing information. Therefore, he added, "it can be very certain that communication between humans and machines in the future will also be multi-modal."


A technology that looks at the world from multiple perspectives


We have not yet reached that point. The most advanced work toward it is happening in the emerging field of multi-modal AI. The problem is not a lack of vision. Mirella Lapata, a professor at the University of Edinburgh and director of its Institute for Language, Cognition and Computation, said that while the ability to translate between different modalities is clearly valuable, it is "much more complex to implement than single-modal AI."


In practice, generative AI tools use different strategies for different types of data when building large-scale data models (complex neural networks that organize vast amounts of information). Models that rely on textual sources, for example, split the text into individual tokens (usually words). Each token is assigned an "embedding" or "vector": a list of numbers that captures how and where the token is used relative to other tokens. Taken together, these vectors form a mathematical representation of the tokens' meaning. Image models, by contrast, may tokenize pixels (or patches of pixels), while audio models may tokenize sound frequencies.
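To make the token-and-embedding idea concrete, here is a minimal sketch in Python. The vocabulary, embedding size, and sentence are illustrative placeholders, and the vectors are random; in a real model the embedding table is learned during training rather than drawn at random.

```python
# Toy illustration of mapping text tokens to embedding vectors.
# Vocabulary, dimensions, and example sentence are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary: each token gets an integer id.
vocab = {"the": 0, "tree": 1, "rustles": 2, "in": 3, "wind": 4}
embedding_dim = 8  # real models use hundreds or thousands of dimensions
embedding_table = rng.normal(size=(len(vocab), embedding_dim))  # one vector per token

def embed(sentence: str) -> np.ndarray:
    """Split text into tokens and look up one embedding vector per token."""
    token_ids = [vocab[word] for word in sentence.lower().split() if word in vocab]
    return embedding_table[token_ids]  # shape: (num_tokens, embedding_dim)

vectors = embed("the tree rustles in the wind")
print(vectors.shape)  # (6, 8): six tokens, each represented by an 8-dimensional vector
```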


A multi-modal AI model typically relies on several single-modal models. As Henry Ajder, founder of the AI consultancy Latent Space, put it, building one involves "almost stringing together" the various contributing models. Doing so requires techniques for aligning the elements of each single-modal model, a process known as fusion. For example, the word "tree," an image of an oak tree, and the sound of rustling leaves can be fused in this way, allowing the model to build a multi-faceted description of reality.
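The sketch below illustrates one common way fusion can work: each single-modal embedding is projected into a shared vector space, where aligned concepts (the word "tree," the oak photo, the rustling leaves) can be compared directly. All dimensions and the projection matrices here are hypothetical stand-ins; in practice these projections are learned during training so that related inputs from different modalities end up close together.

```python
# Illustrative sketch of fusing single-modal embeddings in a shared space.
# All encoders, dimensions, and weights below are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(1)

text_dim, image_dim, audio_dim, shared_dim = 16, 32, 24, 8

# Stand-ins for the outputs of three single-modal encoders for the concept "tree".
text_emb = rng.normal(size=text_dim)    # embedding of the word "tree"
image_emb = rng.normal(size=image_dim)  # embedding of a photo of an oak tree
audio_emb = rng.normal(size=audio_dim)  # embedding of rustling leaves

# One projection matrix per modality maps into the shared space (learned in real systems).
W_text = rng.normal(size=(shared_dim, text_dim))
W_image = rng.normal(size=(shared_dim, image_dim))
W_audio = rng.normal(size=(shared_dim, audio_dim))

def to_shared(W: np.ndarray, emb: np.ndarray) -> np.ndarray:
    """Project a single-modal embedding into the shared space and normalize it."""
    z = W @ emb
    return z / np.linalg.norm(z)

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: higher means the model treats the inputs as more related."""
    return float(a @ b)

z_text = to_shared(W_text, text_emb)
z_image = to_shared(W_image, image_emb)
z_audio = to_shared(W_audio, audio_emb)

# With trained projections, the three "tree" embeddings would score high against
# each other and low against unrelated concepts; with random weights, the numbers
# below only demonstrate the mechanics of the comparison.
print(similarity(z_text, z_image), similarity(z_text, z_audio))
```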