Google AI Unveils GRANOLA QA: A New Approach to Enhance the Accuracy of Large Language Models

2024-01-16

Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing and are applied in almost every field, with fact-based question answering being one of the most common use cases. Unlike many other tasks, factual questions can be answered correctly at different levels of granularity. For example, both "1961" and "August 4, 1961" are correct answers to the question "When was Barack Obama born?" This flexibility makes it hard to evaluate predicted answers accurately and leads to inconsistencies between lexical matching and human judgment. Standard question answering (QA) evaluation settings ignore this property of factual answers: predicted answers are typically scored against a set of reference answers at a single granularity level, with no notion of partial credit for an answer that is correct but coarser than the reference. This often leads to an underestimation of LLMs' knowledge, known as the knowledge evaluation gap.

To address this issue, the authors of this research paper from Google propose GRANOLA QA, a multi-granularity QA evaluation setting that scores answers not only for accuracy but also for informativeness. Accuracy is measured by matching the prediction against any of the GRANOLA answers, while informativeness is measured by matching against the fine-grained answers under an appropriate weighting scheme (a simplified sketch of this scoring appears at the end of this article).

GRANOLA answers are generated in two steps: first, an external knowledge graph (KG) is used to retrieve the answer entity and descriptions of any entities mentioned in the question; then, zero-shot prompting is used to have an LLM produce an ordered list of answers at different granularity levels. The researchers use Wikidata to verify the correctness of the answers. For informativeness, they check whether a response is a trivial answer to the question, i.e., one that could be generated from the question template alone. Finally, for granularity, they verify that each response is coarser than the answers preceding it.

The researchers also built GRANOLA-EQ, a multi-granularity version of the ENTITYQUESTIONS dataset, and evaluated models with different decoding methods, including a novel decoding strategy called DRAG, which encourages LLMs to adjust the granularity of their responses to their level of uncertainty (a toy back-off decoder illustrating this idea is also sketched at the end). The results show that LLMs tend to generate highly specific answers that are often incorrect. In contrast, DRAG, evaluated against multi-granularity answers, yields an improvement of around 20 percentage points in average accuracy, an effect that is even more pronounced for rare entities.

The authors also highlight some limitations of their work. Their approach to enriching QA benchmarks with multi-granularity answers relies on extracting entities from the original QA pairs and matching them to their knowledge graph entries, a process that may be harder for datasets with less structured data. In addition, distinguishing answers grounded in real knowledge from lucky guesses remains essential for better evaluation.

In conclusion, the authors emphasize that generating responses more detailed than a model's knowledge supports is a major source of factual errors in LLMs. They introduce GRANOLA QA, GRANOLA-EQ, and DRAG to align the granularity of model responses with their level of uncertainty. Experimental results demonstrate that accounting for granularity in both evaluation and decoding can significantly improve measured accuracy. Despite some limitations, their work provides a good starting point for future research.
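
To make the multi-granularity scoring concrete, here is a minimal Python sketch of how accuracy and informativeness could be computed against an ordered list of GRANOLA-style answers. The text normalization, the containment-based lexical match, and the geometrically decaying weights are illustrative assumptions, not the exact scheme used in the paper.

```python
# Illustrative multi-granularity scoring in the spirit of GRANOLA QA.
# The matching rule and the decaying weights are assumptions for this sketch.
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def matches(prediction: str, reference: str) -> bool:
    """Loose lexical match: the normalized reference appears in the prediction."""
    return normalize(reference) in normalize(prediction)


def granola_scores(prediction: str, granola_answers: list[str], decay: float = 0.5):
    """Score a prediction against reference answers ordered from most
    fine-grained (index 0) to coarsest.

    Returns (accuracy, informativeness):
      - accuracy: 1.0 if the prediction matches *any* reference answer.
      - informativeness: the weight of the finest-grained answer matched,
        with weights decaying geometrically as answers get coarser.
    """
    for rank, reference in enumerate(granola_answers):
        if matches(prediction, reference):
            return 1.0, decay ** rank
    return 0.0, 0.0


if __name__ == "__main__":
    answers = ["August 4, 1961", "1961", "the 20th century"]  # fine -> coarse
    print(granola_scores("He was born on August 4, 1961.", answers))  # (1.0, 1.0)
    print(granola_scores("He was born in 1961.", answers))            # (1.0, 0.5)
    print(granola_scores("He was born in 1962.", answers))            # (0.0, 0.0)
```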
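
The article describes DRAG only at a high level: the model should respond at a coarser granularity when it is uncertain. The sketch below is one plausible, deliberately simplified realization of that idea, not the paper's algorithm; `generate` is a hypothetical stand-in for an LLM sampling call, and the sample count, agreement threshold, and coarsening prompt are all assumptions.

```python
# An illustrative back-off decoder: sample several answers, and if they
# disagree, treat the disagreement as uncertainty and ask for a coarser answer.
from collections import Counter
from typing import Callable


def decode_with_backoff(
    question: str,
    generate: Callable[[str], str],   # returns one sampled answer per call
    num_samples: int = 5,
    agreement_threshold: float = 0.6,
) -> str:
    samples = [generate(question) for _ in range(num_samples)]
    answer, count = Counter(samples).most_common(1)[0]
    if count / num_samples >= agreement_threshold:
        # The samples mostly agree: keep the specific answer.
        return answer
    # The samples disagree: request a coarser, safer answer instead.
    coarse_question = (
        f"{question}\nIf you are unsure of the exact answer, "
        "give a less specific but still correct answer."
    )
    return generate(coarse_question)
```

In practice, the sampled answers would need to be normalized or clustered before checking agreement, since minor wording differences would otherwise look like disagreement.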