Patronus AI, a startup that builds tools for companies to assess the reliability of their artificial intelligence models, today unveiled a new "hallucination detection" model designed to identify when chatbots produce factually unsupported responses.
The company claims that its latest model, Lynx, represents a significant advance in AI reliability, enabling businesses to detect AI hallucinations without manual annotation.
In the context of AI, "hallucinations" are coherent but factually incorrect responses generated by large language models. These models tend to fabricate information when they don't know the correct answer, which can be risky for companies that rely on their AI systems to interact accurately with customers.
A notable example of AI hallucinations occurred with Google's experimental AI Overviews feature, which reportedly suggested adding glue to keep cheese from falling off homemade pizzas. Another instance reportedly involved advising users to clean washing machines with mustard gas, underscoring the risks these inaccuracies pose.
To address this issue, some AI firms use AI itself to detect hallucinations. OpenAI, for instance, has fine-tuned GPT-4 to spot inconsistencies in chatbot responses, an approach known as "LLM as a judge." However, there are ongoing concerns about the accuracy of such solutions.
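The LLM-as-a-judge pattern is simple in principle: a second model is shown the question, the source context and the chatbot's answer, and is asked whether the answer is actually supported. The sketch below illustrates the general idea against the OpenAI chat completions API; the judge model name, prompt wording and FAITHFUL/HALLUCINATED labels are illustrative assumptions, not OpenAI's or Patronus AI's actual setup.

```python
# Minimal LLM-as-a-judge sketch (illustrative; prompt and labels are assumptions).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are evaluating an AI assistant's answer.
Question: {question}
Context: {context}
Answer: {answer}

Reply with exactly one word: FAITHFUL if the answer is fully supported
by the context, or HALLUCINATED if it contains unsupported claims."""


def judge_answer(question: str, context: str, answer: str) -> bool:
    """Return True if the judge model flags the answer as a hallucination."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; any capable LLM could fill this role
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, context=context, answer=answer
            ),
        }],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("HALLUCINATED")
```

The accuracy concerns mentioned above stem from exactly this setup: the judge is itself an LLM, so its verdicts can be wrong in the same ways the model under test is.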
Patronus AI focuses on enhancing AI reliability and recently secured $17 million in funding to develop a platform that uses AI-generated adversarial prompts to test the robustness of LLMs by attempting to induce hallucinations.
Lynx is described by the startup as the "state-of-the-art" in AI hallucination detection, allowing developers to catch inappropriate responses in real time. Alongside Lynx, the company has also open-sourced HaluBench, a benchmark drawn from real-world domains that evaluates the faithfulness of LLM responses.
According to Patronus AI, extensive testing with HaluBench showed that Lynx significantly outperforms GPT-4 at detecting hallucinations. The largest version of Lynx, with 70 billion parameters, proved more accurate than every other LLM tested as a judge, which Patronus AI asserts makes it the most powerful hallucination detection model available.
HaluBench is specifically designed to test AI models in specialized fields such as medicine and finance, making it highly applicable to practical scenarios.
Sample results from Patronus AI's benchmarks indicate that Lynx (70B) surpasses GPT-4 by 8.3% at detecting medical inaccuracies, while the smaller Lynx (8B) model beats OpenAI's older GPT-3.5 by 24.5% across all HaluBench domains. It also exceeds Anthropic PBC's Claude-3-Sonnet by 8.6% and Claude-3-Haiku by 18.4%, and it outperforms open-source LLMs such as Meta Platforms Inc.'s Llama-3-8B-Instruct.
Anand Kannappan, CEO of Patronus AI, notes that hallucinations pose one of the most critical challenges in the AI industry. Recent studies suggest that between 3% and 10% of all LLM responses contain inaccuracies.
Hallucinations can manifest in various forms, including leaking training data, exhibiting biases, or stating outright falsehoods. Kannappan explains that Lynx aims to tackle these issues, although he acknowledges that it may not provide a permanent solution. Nonetheless, it serves as a valuable tool for developers to gauge the likelihood of their LLMs producing inaccurate outputs.
"Developers can utilize [Lynx and HaluBench] to measure the hallucination rate of their fine-tuned LLMs in specific domain scenarios," he elaborates.