Google AI Unveils GRANOLA QA: A New Approach to Enhance the Accuracy of Large Language Models
Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing and are applied in almost every field, with fact-based question answering being one of the most common use cases. Unlike many other tasks, factual questions can be answered correctly at different levels of granularity. For example, both "1961" and "August 4, 1961" are correct answers to the question "When was Barack Obama born?" This flexibility in how answers can be stated makes them harder to evaluate automatically and leads to inconsistencies between lexical matching and human judgment.
Standard question answering (QA) evaluation settings do not account for this property of factual answers: predicted answers are typically scored against a set of reference answers at a single level of granularity, with no notion of credit for a prediction that is correct but coarser than the reference. This often leads to underestimation of LLMs' knowledge, known as the knowledge evaluation gap. To address this issue, the authors of this research paper from Google propose GRANOLA QA, a multi-granularity QA evaluation setting in which answers are judged not only on accuracy but also on informativeness.
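The gap is easy to see in code: a standard exact-match check rejects a coarser but still correct answer. The snippet below is only a minimal illustration of this behavior, not the paper's evaluation code.

```python
# Minimal sketch (not the paper's code): standard exact-match evaluation
# marks a correct-but-coarser answer as wrong, illustrating the
# "knowledge evaluation gap" described above.

def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace and trailing punctuation."""
    return text.lower().strip(" .")

def exact_match(prediction: str, references: list[str]) -> bool:
    """Standard single-granularity EM: prediction must equal a reference."""
    return normalize(prediction) in {normalize(r) for r in references}

references = ["August 4, 1961"]   # reference at a single granularity level
prediction = "1961"               # correct, but coarser

print(exact_match(prediction, references))  # False -> the model's knowledge is underestimated
```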
Accuracy is measured by matching the predicted answer against any of the GRANOLA answers, while informativeness is measured by matching against the fine-grained answers using an appropriate weighting scheme that rewards more specific answers. The GRANOLA answers themselves are generated in two steps: first, an external knowledge graph (KG) is used to retrieve the answer entity and descriptions of any entities mentioned in the question; then, a zero-shot prompt asks an LLM to produce an ordered list of answers, from the most specific to the most coarse.
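To make the scoring concrete, here is a small illustrative sketch. The lexical match heuristic and the exponentially decaying weights are assumptions for illustration, not necessarily the paper's exact weighting scheme.

```python
# Illustrative GRANOLA-style scoring under assumed details: match() and the
# exponentially decaying weights are my assumptions, not the paper's exact scheme.
import math

def match(prediction: str, answer: str) -> bool:
    """Loose lexical match: the reference answer appears in the prediction."""
    return answer.lower() in prediction.lower()

def granola_scores(prediction: str, granola_answers: list[str]) -> tuple[float, float]:
    """granola_answers are ordered from most specific (index 0) to most coarse."""
    matched = [i for i, ans in enumerate(granola_answers) if match(prediction, ans)]
    accuracy = 1.0 if matched else 0.0
    if not matched:
        return accuracy, 0.0
    # Reward matching a finer-grained answer more than a coarser one.
    i = min(matched)
    weights = [math.exp(-k) for k in range(len(granola_answers))]
    informativeness = weights[i] / weights[0]   # 1.0 for the finest level
    return accuracy, informativeness

answers = ["August 4, 1961", "August 1961", "1961", "the 1960s"]
print(granola_scores("He was born in 1961.", answers))  # (1.0, ~0.14): correct but coarse
print(granola_scores("August 4, 1961", answers))        # (1.0, 1.0): correct and specific
```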
The researchers use WikiData to verify the correctness of the generated answers. For informativeness, they check that a response is not a trivial answer to the question, i.e., one that could be produced from the question template alone. Finally, for granularity, they check that each answer is coarser than the answers preceding it.
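These three checks can be pictured roughly as follows. The correctness and coarseness judgments are abstracted into caller-supplied functions (for example, a WikiData lookup), and the triviality heuristic is a simplified stand-in for the paper's procedure.

```python
# Rough sketch of the three answer-quality checks described above; the exact
# verification procedure used in the paper may differ.
from typing import Callable

def is_trivial(answer: str, question_template: str) -> bool:
    """Approximation: trivial if every content word of the answer already
    appears in the question template (e.g. 'a country' for
    'In which country is [X] located?')."""
    template_words = set(question_template.lower().split())
    content = [w for w in answer.lower().split() if w not in {"a", "an", "the"}]
    return all(w in template_words for w in content)

def check_granola_answers(
    answers: list[str],                      # ordered from most specific to most coarse
    question_template: str,
    is_correct: Callable[[str], bool],       # e.g. a WikiData-backed verifier
    is_coarser: Callable[[str, str], bool],  # does the second answer generalize the first?
) -> dict:
    return {
        "all_correct": all(is_correct(a) for a in answers),
        "no_trivial": not any(is_trivial(a, question_template) for a in answers),
        "ordered_fine_to_coarse": all(
            is_coarser(answers[i - 1], answers[i]) for i in range(1, len(answers))
        ),
    }
```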
The researchers also developed GRANOLA-EQ, a multi-granularity version of the ENTITYQUESTIONS dataset, and evaluated models using different decoding methods, including a novel decoding strategy called DRAG, which encourages LLMs to adjust the granularity of their responses based on their level of uncertainty. The results show that LLMs tend to generate specific answers, which are often incorrect. In contrast, DRAG evaluated against multi-granularity answers yields an improvement of about 20 percentage points in average accuracy, a gain that is even more pronounced for rare entities.
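The general idea of such granularity-aware decoding can be sketched along the following lines, using agreement across samples as a stand-in uncertainty signal. The `generate` callable, the agreement threshold, and the back-off prompt are hypothetical; the paper's exact DRAG procedure may differ.

```python
# A hedged sketch of a DRAG-like decoding loop, not the paper's implementation.
from collections import Counter
from typing import Callable

def drag_like_decode(
    question: str,
    generate: Callable[[str, float], str],  # hypothetical LLM call: (prompt, temperature) -> answer
    num_samples: int = 5,
    agreement_threshold: float = 0.6,
    max_backoffs: int = 2,
) -> str:
    prompt = question
    for _ in range(max_backoffs + 1):
        samples = [generate(prompt, 1.0) for _ in range(num_samples)]
        answer, count = Counter(samples).most_common(1)[0]
        if count / num_samples >= agreement_threshold:
            return answer  # the model is confident enough at this granularity
        # Low agreement -> ask for a coarser, less detailed answer.
        prompt = (
            f"{question}\n"
            "If you are unsure of the exact answer, respond with a less "
            "specific but still correct answer."
        )
    return generate(prompt, 0.0)  # final greedy fallback
```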
The authors also highlight some limitations of their work. Their approach to enriching QA benchmarks with multi-granularity answers relies on extracting entities from the original QA pairs and matching them to their knowledge-graph entries, a process that may be harder for datasets with less structured data. Additionally, they note that distinguishing answers that reflect genuine knowledge from lucky guesses remains essential for better evaluation.
In conclusion, the authors emphasize that generating responses more detailed than the model's knowledge supports is a major source of factual errors in LLMs. They introduce GRANOLA QA, GRANOLA-EQ, and DRAG to align the granularity of model responses with their level of uncertainty. Experimental results demonstrate that accounting for granularity in both evaluation and decoding can significantly improve measured accuracy. Despite some limitations, the work provides a solid starting point for future research.