Google Launches ASPIRE Framework to Enhance Accuracy of Large Language Models

2024-01-22

As large language models such as GPT-4 and Gemini continue to improve, researchers are exploring how to make their predictions more reliable in practical applications. Confidence calibration remains a major obstacle: even top-performing models can generate fluent, persuasive text that is simply wrong.

To address this challenge, Google AI has introduced a new framework called "ASPIRE," which stands for Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs. The core idea is to fine-tune the model so that, on question-answering tasks, it can better judge the correctness of its own answers.

Rather than relying on hand-designed heuristics, ASPIRE lets the model learn from training data to distinguish its correct answers from its incorrect ones. This self-evaluation capability is built in three stages: task-specific tuning to improve accuracy, sampling of high-likelihood candidate answers, and self-evaluation learning that trains the model to label those candidates as correct or incorrect. A schematic of the pipeline appears below.
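The sketch below shows how these three stages could fit together in code. It is a minimal, hypothetical rendering: the `model` interface, its `train`/`generate` methods, and the objective names are placeholders chosen for illustration, not Google's actual API (the underlying work uses parameter-efficient techniques such as soft prompt tuning for the tuning stages).

```python
# Minimal sketch of ASPIRE's three stages, assuming a generic `model`
# object; every interface here is a hypothetical placeholder.

from dataclasses import dataclass

@dataclass
class Example:
    question: str
    reference: str  # gold answer from the training set

def exact_match(prediction: str, reference: str) -> bool:
    return prediction.strip().lower() == reference.strip().lower()

def aspire_pipeline(model, train_set: list[Example], num_samples: int = 4):
    # Stage 1: task-specific tuning -- adapt the model to the QA task
    # so its base accuracy improves.
    model.train(train_set, objective="answer_generation")

    # Stage 2: answer sampling -- draw several high-likelihood
    # candidate answers for each training question.
    labeled = []
    for ex in train_set:
        for candidate in model.generate(ex.question, n=num_samples):
            # Label each candidate correct/incorrect by matching it
            # against the reference answer.
            labeled.append((ex.question, candidate,
                            exact_match(candidate, ex.reference)))

    # Stage 3: self-evaluation learning -- train the model to predict
    # those correctness labels for its own sampled outputs.
    model.train(labeled, objective="self_evaluation")
    return model
```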

In addition, when the confidence score is low, ASPIRE lets the model attach an uncertainty warning to its prediction. For example, a selection score of just 0.1 signals doubt about the answer, so the LLM can respond with "I don't know!" to remind users not to trust the output and to verify it through other sources. This transparent flagging of potential errors is an extra reliability safeguard.
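As a concrete illustration, the snippet below shows how such an abstention threshold might be applied at inference time. The function name and the 0.5 cutoff are assumptions chosen for illustration; the 0.1 score mirrors the article's example.

```python
# Hypothetical sketch of selective prediction with an abstention threshold.

def selective_answer(answer: str, selection_score: float,
                     threshold: float = 0.5) -> str:
    """Return the answer, or abstain when the model's own
    self-evaluation score falls below the threshold."""
    if selection_score < threshold:
        return (f"I don't know! (score={selection_score:.2f}; "
                f"please verify through other sources)")
    return answer

print(selective_answer("Paris", 0.92))  # confident: answer passed through
print(selective_answer("Lyon", 0.10))   # doubtful: abstention warning instead
```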

Experimental results show that ASPIRE outperforms existing selective prediction methods on various question-answering datasets, including CoQA, TriviaQA, and SQuAD. Notably, on the CoQA benchmark, ASPIRE raised the area under the accuracy-coverage curve (AUACC) from 91.23% to 92.63% and the area under the receiver operating characteristic curve (AUROC) from 74.61% to 80.25%.
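For readers unfamiliar with these metrics, the snippet below sketches how both can be computed from a model's selection scores using numpy and scikit-learn. The toy data is purely illustrative, and the AUACC computation is a standard approximation rather than the paper's exact evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

correct = np.array([1, 1, 0, 1, 0, 1])             # 1 = answer was correct
scores = np.array([0.9, 0.8, 0.3, 0.7, 0.4, 0.6])  # model's selection scores

# AUROC: how well the score separates correct from incorrect answers.
auroc = roc_auc_score(correct, scores)

# AUACC: area under the accuracy-coverage curve. Answer the highest-scored
# questions first, then average the accuracy achieved at each coverage level.
order = np.argsort(-scores)
coverage_acc = np.cumsum(correct[order]) / np.arange(1, len(correct) + 1)
auacc = coverage_acc.mean()

print(f"AUROC: {auroc:.3f}, AUACC: {auacc:.3f}")
```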

Interestingly, smaller models enhanced with ASPIRE can sometimes surpass the selective prediction ability of much larger off-the-shelf models. The researchers suggest that, for judging the certainty of model-generated text in a specific application, specialized self-evaluation training may matter more than sheer scale.

ASPIRE's success in improving LLM reliability opens new avenues for applying LLMs in critical decision-making. An LLM that can discern the accuracy of its own predictions has tremendous potential in highly sensitive fields such as healthcare and law, and in other domains that demand high precision. On the path to reliable artificial intelligence, improved self-awareness can complement continued progress in the quality of base models.