DeepMind Unveils AI Fact-Checking Tool SAFE with Superhuman Accuracy

2024-03-29

Google DeepMind researchers have unveiled an artificial intelligence system called the Search-Augmented Factuality Evaluator (SAFE), which they report surpasses human fact-checkers at evaluating the accuracy of information generated by large language models.

The paper, titled "Long-Form Factuality in Large Language Models" and posted on the preprint server arXiv, details how SAFE works: the system uses a language model to break a long-form response down into individual facts, then checks each one by issuing Google Search queries and reasoning over the results.
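In rough terms, that pipeline resembles the Python sketch below. It is an illustrative outline, not DeepMind's released code; the `call_llm` and `google_search` helpers are hypothetical stand-ins for an LLM API client and a search API client.

```python
# Illustrative sketch of a SAFE-style pipeline (not DeepMind's actual implementation).
# `call_llm` and `google_search` are hypothetical helpers representing an LLM API
# and a search API, respectively.

from dataclasses import dataclass


@dataclass
class FactVerdict:
    fact: str
    supported: bool
    evidence: list[str]


def split_into_facts(response_text: str, call_llm) -> list[str]:
    """Ask an LLM to break a long-form response into self-contained atomic facts."""
    prompt = (
        "Split the following text into a list of individual, self-contained "
        f"factual claims, one per line:\n\n{response_text}"
    )
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]


def check_fact(fact: str, call_llm, google_search) -> FactVerdict:
    """Verify one fact by issuing a search query and reasoning over the results."""
    query = call_llm(f"Write a Google search query to verify this claim: {fact}")
    results = google_search(query)  # assumed to return a list of result snippets
    judgment = call_llm(
        "Given these search results:\n"
        + "\n".join(results)
        + f"\n\nIs the claim supported? Answer SUPPORTED or NOT_SUPPORTED.\nClaim: {fact}"
    )
    return FactVerdict(fact=fact, supported="NOT" not in judgment.upper(), evidence=results)


def evaluate_response(response_text: str, call_llm, google_search) -> list[FactVerdict]:
    """Run the full decompose-then-verify loop over one model response."""
    facts = split_into_facts(response_text, call_llm)
    return [check_fact(f, call_llm, google_search) for f in facts]
```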

This "superhuman" performance has sparked widespread discussion in the industry. Researchers found that SAFE matched the evaluations of human annotators in a dataset of 16,000 facts with an accuracy rate of 72%. Furthermore, in the 100 samples where SAFE disagreed with human annotators, SAFE's judgments were proven correct in 76% of cases.

However, some experts have expressed skepticism about the "superhuman" framing. Prominent AI researcher Gary Marcus pointed out on Twitter that the claim may only mean "better than low-paid crowd workers," not "better than professional fact-checkers." He argues that the description is misleading and that SAFE should be benchmarked against expert human fact-checkers before stronger conclusions are drawn.

Nevertheless, one notable advantage of SAFE is its cost-effectiveness. The researchers found that fact-checking with the AI system costs roughly one-twentieth as much as employing human annotators. As the volume of text generated by language models continues to grow, an economical and scalable verification method becomes increasingly important.
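A simple back-of-envelope calculation shows what that ratio means at scale. The per-response prices below are illustrative placeholders rather than figures from the paper; only the roughly 20x ratio comes from the reported result.

```python
# Back-of-envelope cost comparison. The dollar amounts are hypothetical placeholders;
# only the ~20x ratio between them reflects the article's reported finding.

HUMAN_COST_PER_RESPONSE = 4.00   # hypothetical cost of a human checking one response
SAFE_COST_PER_RESPONSE = 0.20    # hypothetical API + search cost, ~20x cheaper


def annotation_budget(num_responses: int, cost_per_response: float) -> float:
    """Total cost of fact-checking a batch of model responses."""
    return num_responses * cost_per_response


if __name__ == "__main__":
    n = 10_000
    print(f"Human fact-checking for {n} responses: ${annotation_budget(n, HUMAN_COST_PER_RESPONSE):,.2f}")
    print(f"SAFE fact-checking for {n} responses:  ${annotation_budget(n, SAFE_COST_PER_RESPONSE):,.2f}")
```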

The DeepMind team also used SAFE to evaluate the factual accuracy of 13 leading language models from four model families on a new benchmark called LongFact. The results showed that larger models generally produced fewer factual errors. However, even the best-performing models still generated a significant number of unsupported claims, highlighting the risks of over-relying on language models and the role that automated fact-checking tools such as SAFE can play in mitigating those risks.
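The sketch below shows one simplified way per-fact verdicts could be rolled up into a model-level factuality score for this kind of benchmark comparison. It is an illustrative aggregate, not necessarily the exact metric used in the paper.

```python
# Simplified roll-up of per-fact SAFE verdicts into a model-level score.
# Illustrative only; the paper's scoring may differ.

from collections import Counter


def factuality_summary(verdicts: list[bool]) -> dict[str, float]:
    """verdicts: one boolean per checked fact (True = supported by search evidence)."""
    counts = Counter(verdicts)
    supported = counts[True]
    unsupported = counts[False]
    total = supported + unsupported
    return {
        "supported_facts": supported,
        "unsupported_facts": unsupported,
        "precision": supported / total if total else 0.0,
    }


def rank_models(per_model_verdicts: dict[str, list[bool]]) -> list[tuple[str, float]]:
    """Order models by the fraction of their claims that were supported."""
    scores = {m: factuality_summary(v)["precision"] for m, v in per_model_verdicts.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```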

The code for SAFE and the LongFact dataset have been open-sourced on GitHub, allowing other researchers to study and build on the work. Even so, critics argue that more transparency is needed about the human baselines used in the research: understanding who the crowd workers were and how their fact-checking was conducted is essential for accurately assessing SAFE's capabilities.

As tech giants continue to develop more powerful language models for various applications such as search and virtual assistants, the ability to automatically verify the outputs of these systems becomes increasingly important. Tools like SAFE represent a significant step towards establishing trust and accountability.