Google Launches DataGemma Model to Reduce Factual Errors in Language Models

2024-09-13

Google recently announced DataGemma, two new variants of its open Gemma model, built on the Gemini architecture, that ground their outputs in real-world statistical data from Google Data Commons. Google describes DataGemma as the first open models of their kind aimed at reducing "hallucinations" (i.e., factual errors). Language models have long struggled with tasks involving numerical or statistical data, frequently producing factual errors. Google's Data Commons is a repository containing over 240 billion data points from trusted sources such as the United Nations and the Centers for Disease Control and Prevention.

DataGemma combines model outputs with real-world data during generation through two key techniques: Retrieval-Interleaved Generation (RIG) and Retrieval-Augmented Generation (RAG), both sketched below. RIG proactively queries trusted sources and checks the statistics it generates against Data Commons as it produces a response, while RAG retrieves relevant information from Data Commons before generation and produces comprehensive answers using the long context window of Gemini 1.5 Pro. Preliminary research results show that both techniques significantly improve the model's accuracy on numerical facts and statistical queries. However, the research also points to remaining challenges, including accuracy issues with Data Commons' natural language interface, irrelevant model generations, and insufficient data coverage.

Specifically, with the RIG method, factual accuracy improved from a baseline of 5-17% to approximately 58%, although in about 27-33% of cases the model or Data Commons still supplied incorrect information. The RAG method performs well when citing specific values, with 98-99% accuracy, but when the model draws inferences from the retrieved statistics, 6-20% of those inferences are erroneous or inaccurate.

Google emphasizes that DataGemma is currently intended primarily for academic and research use and is not yet ready for commercial or public deployment. Going forward, the team plans to expand the training dataset, improve Data Commons' natural language processing capabilities, and explore user interfaces that surface fact-checking results. Google also recognizes the ethical implications of this work: it has conducted red-team testing to examine potentially risky queries, and it promises to continuously evaluate and improve the model's behavior.

As the research progresses, DataGemma is expected to lay the groundwork for more trustworthy and reliable AI systems, with broad potential impact in fields such as healthcare, policy-making, education, and scientific research.
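
As a rough illustration of the RIG flow, the Python sketch below shows one way interleaved retrieval could work: the model wraps each statistic it emits in an annotation carrying a Data Commons query, and a post-processing step swaps the model's guess for the retrieved value. The annotation format and the functions fake_model_generate and fake_data_commons_lookup are placeholders invented for this sketch, not DataGemma's actual interface, and all numbers are illustrative.

```python
import re

def fake_model_generate(prompt: str) -> str:
    # Stand-in for the RIG-tuned DataGemma model: it wraps each statistic
    # in a marker pairing a natural-language Data Commons query with the
    # model's own guess (annotation syntax assumed for this sketch).
    return ("California's unemployment rate was "
            '[DC("unemployment rate in California 2023") -> "6.2%"] last year.')

def fake_data_commons_lookup(nl_query: str) -> str | None:
    # Stand-in for the Data Commons natural-language endpoint; a real
    # implementation would call the Data Commons API. Illustrative value only.
    table = {"unemployment rate in California 2023": "4.8%"}
    return table.get(nl_query)

ANNOTATION = re.compile(r'\[DC\("(?P<query>[^"]+)"\) -> "(?P<guess>[^"]+)"\]')

def rig_answer(prompt: str) -> str:
    """Retrieval-Interleaved Generation: replace the model's guessed numbers
    with Data Commons values wherever the lookup succeeds."""
    draft = fake_model_generate(prompt)

    def substitute(m: re.Match) -> str:
        grounded = fake_data_commons_lookup(m.group("query"))
        # Fall back to the model's own number when retrieval fails.
        return grounded if grounded is not None else m.group("guess")

    return ANNOTATION.sub(substitute, draft)

print(rig_answer("What was California's unemployment rate last year?"))
# -> California's unemployment rate was 4.8% last year.
```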
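
The RAG flow, by contrast, retrieves before generating. The minimal sketch below assumes a retrieval step that maps the question to Data Commons statistical tables and a long-context model that answers with those tables in its prompt; retrieve_tables and long_context_model are hypothetical stand-ins (DataGemma delegates the answering step to Gemini 1.5 Pro), and the table stubs are illustrative.

```python
def retrieve_tables(question: str) -> list[str]:
    # Stand-in for the Data Commons retrieval step: a real implementation
    # would translate the question into Data Commons queries and return the
    # matching statistical tables as text.
    return [
        "Table: Diabetes prevalence by US county, 2021 (CDC) ...",
        "Table: Median household income by US county, 2021 (Census) ...",
    ]

def long_context_model(prompt: str) -> str:
    # Placeholder for a long-context model call; the long context window is
    # what allows the retrieved tables to be passed in wholesale.
    return "<model answer grounded in the tables above>"

def rag_answer(question: str) -> str:
    """Retrieval-Augmented Generation: fetch Data Commons tables first, then
    let the model answer with the retrieved data in its context."""
    tables = retrieve_tables(question)
    prompt = (
        "Answer using only the statistics below and cite the tables you use.\n\n"
        + "\n\n".join(tables)
        + f"\n\nQuestion: {question}"
    )
    return long_context_model(prompt)

print(rag_answer("Is diabetes prevalence correlated with income across US counties?"))
```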