New LiveBench Benchmark Aims to Simplify Language Model Evaluation

2024-06-14

A group of researchers has released LiveBench, a new open-source benchmark designed to simplify the evaluation of large language models' (LLMs') question-answering capabilities. The project is sponsored by Abacus.AI Inc., a venture-backed AI startup, and counts Turing Award winner and computer scientist Yann LeCun among its contributors.

LiveBench is intended to address two major challenges with existing LLM evaluation benchmarks. The first is the issue of test-set "contamination." The second is that software teams often use another LLM to judge a model's question-answering output, a practice that can introduce accuracy issues of its own.

An AI benchmark is a collection of tasks that tests a neural network's knowledge of specific subjects through a series of questions. Some benchmarks also include other types of tasks, such as asking LLMs to debug code files. By examining how many tasks an LLM completes correctly, researchers can gain deeper insight into the model's capabilities and limitations.

Language models are typically trained on large amounts of publicly available web content. In many cases, that content includes the answers to questions from popular AI benchmarks. If an LLM has already memorized those answers, it can effectively "cheat" during evaluation, so the benchmark results no longer reflect its true abilities. In machine learning, this phenomenon is known as test-set "contamination."
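For illustration, a minimal contamination check might look something like the following sketch, which flags a benchmark question whose wording largely overlaps with a training corpus. The corpus, question, n-gram size, and threshold here are assumptions made for the example; this is not part of LiveBench itself.

```python
# Minimal sketch of a test-set contamination check: flag a benchmark question
# whose word n-grams largely overlap with a training corpus. The corpus,
# question, n-gram size, and threshold are illustrative assumptions.

def ngrams(text, n=8):
    """Return the set of word n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question, corpus_docs, n=8, threshold=0.5):
    """Treat a question as contaminated if a large share of its n-grams
    also occur somewhere in the training corpus."""
    q_grams = ngrams(question, n)
    if not q_grams:
        return False
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    overlap = len(q_grams & corpus_grams) / len(q_grams)
    return overlap >= threshold

# Hypothetical inputs.
corpus = ["a crawled web page that happens to quote a benchmark question verbatim"]
question = "a benchmark question that may or may not appear in the crawl"
print(is_contaminated(question, corpus))
```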

According to its developers, LiveBench largely avoids contamination by giving models tasks whose answers are unlikely to appear in their training datasets. As a further precaution, the researchers will regularly refresh LiveBench's task set so that LLMs cannot eventually absorb the answers to the current questions.

The researchers explained, "LiveBench is designed to limit potential contamination by releasing new questions monthly, as well as having questions based on recently released datasets, arXiv papers, news articles, and IMDb movie synopses."

In AI accuracy evaluation, benchmark answers are typically not scored by hand. Instead, researchers use an external LLM such as GPT-4 to check them. The developers of LiveBench argue that this approach has limitations, because LLMs frequently make mistakes when judging other neural networks' benchmark responses.

The researchers further pointed out, "We demonstrate in the paper that for challenging reasoning and math problems, the correlation between GPT-4-Turbo's pass/fail judgments and true pass/fail judgments is less than 60%." Additionally, they found that LLMs sometimes incorrectly label their own correct benchmark answers as incorrect.
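For context on how such a figure is computed, the agreement between an LLM judge and ground truth can be expressed as the share of answers on which their pass/fail verdicts match. The sketch below uses made-up verdicts; it illustrates the metric, not the paper's actual code.

```python
# Sketch of measuring how often an LLM judge's pass/fail verdicts match
# ground-truth grading. The verdict lists below are invented for illustration.

def agreement_rate(judge_verdicts, true_verdicts):
    """Fraction of answers on which the judge and the ground truth agree."""
    assert len(judge_verdicts) == len(true_verdicts)
    matches = sum(j == t for j, t in zip(judge_verdicts, true_verdicts))
    return matches / len(true_verdicts)

# Hypothetical verdicts for five hard math answers.
judge = [True, False, True, True, False]   # pass/fail calls from an LLM judge
truth = [True, True, False, True, False]   # manually verified ground truth
print(f"agreement: {agreement_rate(judge, truth):.0%}")  # prints "agreement: 60%"
```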

To address these challenges, LiveBench packages a verifiable ground-truth answer with each evaluation question. Using these answers, researchers can determine whether an LLM produced the correct response without relying on an external AI system.
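As a rough sketch of that scoring style, assuming each question record carries a single canonical answer string, grading reduces to a deterministic comparison. The question records, field names, and normalization rule below are illustrative assumptions, not LiveBench's actual format.

```python
# Sketch of ground-truth scoring: each question ships with a packaged answer,
# so grading is a deterministic comparison rather than an LLM's judgment.
# The question records and normalization rule are illustrative assumptions.

import re

def normalize(text):
    """Lowercase, trim, and collapse whitespace and common punctuation."""
    return re.sub(r"[\s.,;:]+", " ", text.lower()).strip()

def score(model_answer, ground_truth):
    """Return 1 if the normalized answers match exactly, otherwise 0."""
    return int(normalize(model_answer) == normalize(ground_truth))

questions = [
    {"prompt": "What is 17 * 24?", "answer": "408"},
    {"prompt": "Sort the list [3, 1, 2] in ascending order.", "answer": "[1, 2, 3]"},
]
model_outputs = ["408.", "[1, 2, 3]"]

correct = sum(score(out, q["answer"]) for q, out in zip(questions, model_outputs))
print(f"accuracy: {correct}/{len(questions)}")
```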

The researchers also noted a limitation of LiveBench: "One drawback is that certain types of questions have no single correct answer, such as 'Write a travel guide for Hawaii.' While this restricts the kinds of questions that can be asked, it does not affect the validity of the questions that can be scored this way."

The current version of LiveBench includes 960 questions across six categories: reasoning, data analysis, mathematics, programming, language understanding, and instruction following. Some of the questions are harder variants of tasks found in existing AI benchmarks. Other tasks will be refreshed regularly with information added to frequently updated public data sources, such as arXiv, the popular research paper repository.
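For readers who want to inspect the released question sets themselves, they can typically be pulled with the Hugging Face datasets library. The repository IDs and split name in the sketch below follow the per-category naming pattern the LiveBench team appears to use, but they are assumptions rather than details confirmed by this article.

```python
# Sketch of downloading the public LiveBench question sets by category.
# The "livebench/<category>" dataset IDs and the "test" split are assumed
# naming conventions, not confirmed here; adjust them if they differ.

from datasets import load_dataset

CATEGORIES = [
    "reasoning", "data_analysis", "math",
    "coding", "language", "instruction_following",
]

counts = {}
for category in CATEGORIES:
    ds = load_dataset(f"livebench/{category}", split="test")
    counts[category] = len(ds)

print(counts)  # number of released questions per category
```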