Why "Multimodal Artificial Intelligence" is the Hottest Technology Today

2024-05-16

This week, OpenAI and Google showed off their latest and most advanced artificial intelligence technology. For the past two years, tech companies have been racing to make AI models smarter; now there is a new focus: making them multimodal. Both companies are concentrating on AI that can switch seamlessly among its machine mouth, eyes, and ears.

"Multimodal" has become the hottest term for tech companies betting on the most appealing form of AI models in everyday life. Since the launch of ChatGPT in 2022, AI chatbots have lost their shine. Therefore, companies hope to have more natural conversations and visual sharing with AI assistants instead of typing. When you see multimodal AI performing well, it feels like science fiction coming true.

On Monday, OpenAI showed off GPT-4 Omni, which evokes "Her," the dystopian film about human disconnection. OpenAI says the model can process video and audio at the same time. In a demonstration, an OpenAI employee pointed a smartphone camera at a math problem and verbally asked ChatGPT to walk through it. OpenAI said the model is now rolling out to paying users.

The next day, Google previewed Project Astra, which promises similar capabilities. Compared with GPT-4 Omni, though, Project Astra seemed a bit slower, and its voice was flatter and more robotic, closer to Siri than to "Her." Google said the project is still at an early stage and even pointed out some current challenges that OpenAI appears to have already overcome.

In a blog post, Google wrote: "While we've made incredible progress developing AI systems that can understand multimodal information, getting response time down to something conversational is a difficult engineering challenge."

You may remember the Gemini demo video Google released in December 2023, which turned out to be heavily edited. Six months later, Google is still not ready to ship what that video showed, while OpenAI is pressing ahead with GPT-4o. Multimodal AI is the next big race in AI development, and OpenAI appears to be winning it.

A key difference with GPT-4o is that a single AI model handles audio, video, and text natively. Previously, OpenAI needed separate models to translate speech and images into text so that the language-based GPT-4 could understand those media. Given Astra's slower response times, Google may still be chaining multiple AI models together to perform these tasks.
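
To make that architectural difference concrete, here is a rough, purely illustrative Python sketch. None of these function names correspond to a real OpenAI or Google API; they are placeholder stubs standing in for the kinds of components each approach would need.

```python
# Hypothetical sketch contrasting the two architectures described above.
# Every function here is a placeholder, not a real OpenAI or Google API.

def transcribe_audio(audio: bytes) -> str:
    """Placeholder speech-to-text component (a separate ASR model)."""
    return "Can you explain this math problem?"

def describe_frames(video: bytes) -> str:
    """Placeholder vision component that turns camera frames into text."""
    return "A handwritten equation on a piece of paper."

def text_llm(prompt: str) -> str:
    """Placeholder text-only language model (the GPT-4-style component)."""
    return "Sure - start by isolating the variable on one side."

def synthesize_speech(text: str) -> bytes:
    """Placeholder text-to-speech component."""
    return text.encode()

def cascaded_pipeline(audio: bytes, video: bytes) -> bytes:
    # Older approach: each modality is first converted to text, so every
    # hop adds latency and loses detail such as tone of voice.
    prompt = f"{describe_frames(video)}\nUser asked: {transcribe_audio(audio)}"
    return synthesize_speech(text_llm(prompt))

def native_multimodal_model(audio: bytes, video: bytes) -> bytes:
    """Placeholder for a single model that consumes audio and video
    directly and emits speech, with no intermediate text hand-offs."""
    return b"spoken answer"

if __name__ == "__main__":
    mic, camera = b"...", b"..."
    print(cascaded_pipeline(mic, camera))        # three chained model calls
    print(native_multimodal_model(mic, camera))  # one end-to-end model call
```

The contrast is in the number of hand-offs: each conversion to text in the cascaded version adds latency and drops information, which lines up with Google's comment above about getting response times down to conversational speed.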

As tech companies embrace multimodal AI, AI wearables are also gaining traction. The Humane AI Pin, the Rabbit R1, and Meta Ray-Bans are all devices built around these different media. They are pitched as a way to reduce our dependence on smartphones, but Siri and Google Assistant may soon adopt multimodal AI as well.