Last week, OpenAI's first video generation model, Sora, caused a sensation on the Internet. At the same time, AI experts and researchers at competing companies quickly analyzed and criticized Sora's Transformer-based model, sparking a debate over whether it actually understands physics.
AI scientist Gary Marcus is among the many critics who have questioned not only the accuracy of the videos Sora generates but also the generative approach used for video synthesis.
Researchers at competitors Meta and Google pushed back on the claim that Sora's model understands the physical world.
Meta's Chief AI Scientist Yann LeCun stated, "Generating the most realistic-looking videos from a prompt does not mean the system understands the physical world. The space of plausible-looking videos is very large, and a video generation system only needs to produce one sample to succeed."
LeCun went on to explain the differences between Sora and Meta's latest AI model, V-JEPA (Video Joint Embedding Predictive Architecture), which analyzes interactions between objects in videos. "That's the whole point behind JEPA: it's not generative, but predictive in representation space," he said, positioning V-JEPA's self-supervised approach as superior to Sora's.
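LeCun's distinction can be made concrete with a toy sketch (the encoder and frames below are hypothetical stand-ins, not Meta's actual code): a generative loss compares a prediction to its target pixel by pixel, while a JEPA-style loss compares them in a learned representation space, so pixel-level detail that the representation abstracts away no longer contributes to the error.

```python
# Toy contrast between a pixel-space (generative) loss and a
# representation-space (JEPA-style) loss. The "encoder" is a hypothetical
# stand-in: it summarizes a frame by its mean and its value range.

def encode(frame):
    """Map a frame (list of pixel values) to a small embedding."""
    return [sum(frame) / len(frame), max(frame) - min(frame)]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Two frames that differ at every pixel but share the same summary
# statistics (same mean, same range).
predicted_frame = [0.0, 1.0, 0.0, 1.0]
target_frame    = [1.0, 0.0, 1.0, 0.0]

pixel_loss = mse(predicted_frame, target_frame)                  # generative
repr_loss  = mse(encode(predicted_frame), encode(target_frame))  # JEPA-style

print(pixel_loss)  # 1.0 -- every pixel is "wrong"
print(repr_loss)   # 0.0 -- identical in representation space
```

The point of predicting in representation space is that the model is not penalized for unpredictable pixel-level detail; in a real JEPA system the encoder is learned jointly rather than hand-written, which this sketch omits.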
Researcher and entrepreneur Eric Xing echoed LeCun's view, stating that "an agent model capable of reasoning based on understanding must go beyond LLM and DM," that is, beyond large language models and diffusion models.
The release of Gemini 1.5 Pro could hardly have been better timed. When videos created by Sora were fed to Gemini 1.5 Pro, the model pointed out inconsistencies in them, stating that "this is not a real scene."
While experts were quick to dispute the capabilities of generative models, the question of what "physics" these models actually capture was largely overlooked.
Sora uses a Transformer architecture similar to that of the GPT models, which OpenAI believes will "understand and simulate the real world" and help pave the way to AGI. Although OpenAI does not describe Sora as a physics engine, it is possible that data generated with Unreal Engine 5 was used to train Sora's base model.
NVIDIA senior research scientist Jim Fan characterized Sora as a data-driven physics engine. "Sora learns the physics engine implicitly through gradient descent on a large number of videos," he said, describing Sora as a learnable simulator, or world model.
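Fan's phrase "learns the physics engine implicitly through gradient descent" can be illustrated with a deliberately tiny sketch (a hypothetical 1-D setup, nothing like Sora's actual training): a linear next-position predictor, fit by gradient descent on noise-free trajectories of a coasting object, recovers the constant-velocity law x_{t+1} = 2·x_t − x_{t−1} purely from observations, without that law ever being written into the model.

```python
# A linear next-position predictor trained by gradient descent on
# constant-velocity 1-D trajectories. The dynamics x_{t+1} = 2*x_t - x_{t-1}
# are never hard-coded; they emerge implicitly from fitting the data.

# Build (x_t, x_{t-1}, x_{t+1}) triples from trajectories with assorted
# starting positions s and velocities v: positions s, s+v, s+2v.
starts = [-1.0, 0.0, 1.0]
velocities = [-1.0, -0.5, 0.5, 1.0]
data = [(s + v, s, s + 2 * v) for s in starts for v in velocities]

w1, w2 = 0.0, 0.0  # model: x_next ~ w1 * x_t + w2 * x_prev
lr = 0.1

for _ in range(3000):  # full-batch gradient descent on mean squared error
    g1 = g2 = 0.0
    for x_t, x_prev, x_next in data:
        err = w1 * x_t + w2 * x_prev - x_next
        g1 += 2 * err * x_t / len(data)
        g2 += 2 * err * x_prev / len(data)
    w1 -= lr * g1
    w2 -= lr * g2

print(round(w1, 3), round(w2, 3))  # → 2.0 -1.0: inertia, learned from data
```

A video model, of course, operates in vastly higher dimensions on noisy data, but the mechanism Fan describes is the same: regularities of the world end up encoded in the weights because encoding them reduces prediction error.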
Fan also expressed his dissatisfaction with the reductionist view of Sora. "I see some strong objections: 'Sora did not learn physics, it just manipulates pixels in 2D.' I disagree with this reductionist view. This is similar to saying, 'GPT-4 did not learn programming, it just samples strings.' Well, all it does is manipulate a sequence of integers (token IDs). What neural networks do is manipulate floating-point numbers. This is not a valid argument," he said.
Recently, Aravind Srinivas, the founder of Perplexity, publicly supported LeCun's views. "The reality is that, although Sora is impressive, it is still not ready to accurately simulate physics," he said.
Fan also compared Sora to the "GPT-3 moment" of 2020, when the model required "a lot of prompting and hand-holding," yet was also a convincing demonstration of in-context learning as an emergent property.
Interestingly, OpenAI itself pointed out the model's limitations before others did. The company's blog notes that Sora may struggle to accurately simulate the physics of a complex scene and may not understand specific instances of cause and effect. It may also confuse spatial details of a prompt and struggle with precise descriptions of events unfolding over time, such as following a specific camera trajectory.
These limitations have not detracted from the quality of the outputs Sora does produce. When OpenAI acquired Global Illumination, the digital product studio behind the open-source game Biomes, in August last year, there was speculation about video generation and about building simulation platforms driven by autonomous agents.
Now, with Sora's release, the prospect of disruption to the video game industry will only grow. If Sora is at its "GPT-3 moment," the GPT-4 stage of this model is hard to imagine.