New LiveBench Benchmark Aims to Simplify Language Model Evaluation

2024-06-14

A group of researchers has released LiveBench, a new open-source benchmark designed to simplify the evaluation of large language models' (LLMs') question-answering capabilities. The project is sponsored by Abacus.AI Inc., a venture-backed AI startup, and counts Turing Award winner and computer scientist Yann LeCun among its contributors.

LiveBench is intended to address two major challenges with existing LLM evaluation benchmarks. The first is the issue of test-set "contamination." The second is that software teams often use another LLM to judge a model's question-answering output, a practice that can introduce accuracy issues of its own.

An AI benchmark is a collection of tasks that tests a neural network's knowledge of specific subjects through a series of questions. Some benchmarks also include other types of tasks, such as asking LLMs to debug code files. By examining how many tasks an LLM completes correctly, researchers can gain deeper insight into the model's capabilities and limitations.

Language models are typically trained on large amounts of publicly available web content. In many cases, that content includes the answers to questions from popular AI benchmarks. If an LLM has already memorized those answers, it can effectively "cheat" during evaluation, so the benchmark results no longer reflect its true abilities. In machine learning, this phenomenon is known as test-set "contamination."
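For illustration, a minimal contamination check might look something like the following sketch, which flags a benchmark question whose wording largely overlaps with a training corpus. The corpus, question, n-gram size, and threshold here are assumptions made for the example; this is not part of LiveBench itself.

```python
# Minimal sketch of a test-set contamination check: flag a benchmark question
# whose word n-grams largely overlap with a training corpus. The corpus,
# question, n-gram size, and threshold are illustrative assumptions.

def ngrams(text, n=8):
    """Return the set of word n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question, corpus_docs, n=8, threshold=0.5):
    """Treat a question as contaminated if a large share of its n-grams
    also occur somewhere in the training corpus."""
    q_grams = ngrams(question, n)
    if not q_grams:
        return False
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    overlap = len(q_grams & corpus_grams) / len(q_grams)
    return overlap >= threshold

# Hypothetical inputs.
corpus = ["a crawled web page that happens to quote a benchmark question verbatim"]
question = "a benchmark question that may or may not appear in the crawl"
print(is_contaminated(question, corpus))
```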

According to its developers, LiveBench largely avoids contamination by giving models tasks whose answers are unlikely to appear in their training datasets. As a further precaution, the researchers will regularly refresh LiveBench's task set so that LLMs cannot eventually absorb the answers to the current questions.

The researchers explained, "LiveBench is designed to limit potential contamination by releasing new questions monthly, as well as having questions based on recently released datasets, arXiv papers, news articles, and IMDb movie synopses."

In AI accuracy evaluation, benchmark answers are typically not scored by hand. Instead, researchers use an external LLM such as GPT-4 to check them. The developers of LiveBench argue that this approach has limitations, because LLMs frequently make mistakes when judging other neural networks' benchmark responses.

The researchers further pointed out, "We demonstrate in the paper that for challenging reasoning and math problems, the correlation between GPT-4-Turbo's pass/fail judgments and true pass/fail judgments is less than 60%." Additionally, they found that LLMs sometimes incorrectly label their own correct benchmark answers as incorrect.
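For context on how such a figure is computed, the agreement between an LLM judge and ground truth can be expressed as the share of answers on which their pass/fail verdicts match. The sketch below uses made-up verdicts; it illustrates the metric, not the paper's actual code.

```python
# Sketch of measuring how often an LLM judge's pass/fail verdicts match
# ground-truth grading. The verdict lists below are invented for illustration.

def agreement_rate(judge_verdicts, true_verdicts):
    """Fraction of answers on which the judge and the ground truth agree."""
    assert len(judge_verdicts) == len(true_verdicts)
    matches = sum(j == t for j, t in zip(judge_verdicts, true_verdicts))
    return matches / len(true_verdicts)

# Hypothetical verdicts for five hard math answers.
judge = [True, False, True, True, False]   # pass/fail calls from an LLM judge
truth = [True, True, False, True, False]   # manually verified ground truth
print(f"agreement: {agreement_rate(judge, truth):.0%}")  # prints "agreement: 60%"
```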

To address these challenges, LiveBench packages a verifiable ground-truth answer with each evaluation question. Using these answers, researchers can determine whether an LLM produced the correct response without relying on an external AI system.
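As a rough sketch of that scoring style, assuming each question record carries a single canonical answer string, grading reduces to a deterministic comparison. The question records, field names, and normalization rule below are illustrative assumptions, not LiveBench's actual format.

```python
# Sketch of ground-truth scoring: each question ships with a packaged answer,
# so grading is a deterministic comparison rather than an LLM's judgment.
# The question records and normalization rule are illustrative assumptions.

import re

def normalize(text):
    """Lowercase, trim, and collapse whitespace and common punctuation."""
    return re.sub(r"[\s.,;:]+", " ", text.lower()).strip()

def score(model_answer, ground_truth):
    """Return 1 if the normalized answers match exactly, otherwise 0."""
    return int(normalize(model_answer) == normalize(ground_truth))

questions = [
    {"prompt": "What is 17 * 24?", "answer": "408"},
    {"prompt": "Sort the list [3, 1, 2] in ascending order.", "answer": "[1, 2, 3]"},
]
model_outputs = ["408.", "[1, 2, 3]"]

correct = sum(score(out, q["answer"]) for q, out in zip(questions, model_outputs))
print(f"accuracy: {correct}/{len(questions)}")
```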

The researchers also noted a limitation of LiveBench: "One drawback is that certain types of questions have no single correct answer, such as 'Write a travel guide for Hawaii.' While this restricts the kinds of questions that can be asked, it does not affect the validity of the questions that can be scored this way."

The current version of LiveBench includes 960 questions across six categories: reasoning, data analysis, mathematics, programming, language understanding, and instruction following. Some of the questions are harder variants of tasks found in existing AI benchmarks. Other tasks will be refreshed regularly with information added to frequently updated public data sources, such as arXiv, the popular research paper repository.
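For readers who want to inspect the released question sets themselves, they can typically be pulled with the Hugging Face datasets library. The repository IDs and split name in the sketch below follow the per-category naming pattern the LiveBench team appears to use, but they are assumptions rather than details confirmed by this article.

```python
# Sketch of downloading the public LiveBench question sets by category.
# The "livebench/<category>" dataset IDs and the "test" split are assumed
# naming conventions, not confirmed here; adjust them if they differ.

from datasets import load_dataset

CATEGORIES = [
    "reasoning", "data_analysis", "math",
    "coding", "language", "instruction_following",
]

counts = {}
for category in CATEGORIES:
    ds = load_dataset(f"livebench/{category}", split="test")
    counts[category] = len(ds)

print(counts)  # number of released questions per category
```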