Research shows that even the best AI models generate a significant amount of fabricated content.

2024-08-15

From Google's Gemini and Anthropic's Claude to OpenAI's quietly released latest GPT-4o, every one of today's generative artificial intelligence (AI) models inevitably mixes fabricated content into what it produces. In short, they are unreliable as sources of information, and that unreliability can range from amusing to genuinely confusing.

However, not all models fabricate at the same rate. More importantly, the kinds of falsehoods they produce depend on the sources of information they have been exposed to.

A research team from Cornell University, the University of Washington, the University of Waterloo, and the non-profit research institute AI2 recently put this to the test. They fact-checked the output of multiple AI models, including GPT-4o, against authoritative sources in fields such as law, health, history, and geography. The results showed that no model performed well across all topics, and that the models producing the least fabricated content did so partly because they selectively decline to answer questions they would likely get wrong.
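To make the evaluation setup concrete, the sketch below shows one way such a fact-checking loop could be wired up. It is a minimal illustration under stated assumptions: the dataset format, the query_model stub, and the claim-checking heuristic are hypothetical placeholders, not the researchers' actual pipeline or scoring method.

```python
# Illustrative sketch of a factuality evaluation loop (hypothetical, simplified).
import json

def query_model(model_name: str, question: str) -> str:
    """Placeholder: in practice this would call the model's API."""
    return "Stub answer for demonstration purposes."

def extract_claims(answer: str) -> list[str]:
    """Placeholder: naively split an answer into individual claims."""
    return [s.strip() for s in answer.split(".") if s.strip()]

def is_supported(claim: str, reference: str) -> bool:
    """Placeholder: check a claim against an authoritative reference text."""
    return claim.lower() in reference.lower()

def evaluate(model_name: str, dataset_path: str) -> float:
    """Return the fraction of answers in which every extracted claim is supported."""
    with open(dataset_path) as f:
        # Assumed format: one JSON object per line, {"question": ..., "reference": ...}
        items = [json.loads(line) for line in f]

    fully_factual = 0
    for item in items:
        answer = query_model(model_name, item["question"])
        claims = extract_claims(answer)
        if claims and all(is_supported(c, item["reference"]) for c in claims):
            fully_factual += 1
    return fully_factual / len(items)

# Example usage: score = evaluate("gpt-4o", "questions.jsonl")
```

Counting an answer as factual only when every extracted claim is supported captures the spirit of checking claims against reference sources, though the study's real evaluation is considerably more elaborate.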

"The core finding of our research is that we still cannot fully trust the content generated by these models," said Wenting Zhao, a doctoral student at Cornell University and co-author of the study, in an interview with TechCrunch. "Even the best-performing models have only about 35% of their output completely non-fictional."

There have been earlier academic attempts to probe the "factuality" of AI models, including work by another AI2 team. But Zhao points out that the questions in those earlier benchmarks typically had answers that could easily be found on Wikipedia, and since most models are trained on Wikipedia data, such tests pose only limited difficulty.

To make the benchmark harder, and closer to how people actually query these models, the researchers deliberately drew questions from around the web on topics that lack Wikipedia coverage. More than half of the questions in the test set cannot be answered by consulting Wikipedia directly (a share of Wikipedia-answerable questions was kept as a control), and they span culture, geography, astronomy, pop culture, finance, medicine, computer science, and celebrities.

The study evaluated more than ten popular models, many of them released within the past year. Alongside GPT-4o, it covered open models such as Meta's Llama 3 70B, Mistral's Mixtral 8x22B, and Cohere's Command R+, as well as models available only through APIs, including Perplexity's Sonar Large (built on Llama), Google's Gemini 1.5 Pro, and Anthropic's Claude 3 Opus.