GAIA: A New Benchmark Tool for General Artificial Intelligence Testing

2023-12-04

A group of researchers from Meta's FAIR and GenAI teams, HuggingFace, and AutoGPT has developed a benchmarking tool for AI assistants, particularly products built on large language models, to test whether such applications qualify as potential Artificial General Intelligence (AGI) systems. They named the tool GAIA, and a paper describing it and how to use it has been published on the arXiv preprint server.

Over the past year, AI researchers have debated, both privately and publicly, how capable current AI systems really are. Some believe these systems are very close to AGI, while others argue the opposite. Most agree, however, that such systems will eventually match or even surpass human intelligence; the only question is when. The research team points out that reaching a consensus requires a rating system for measuring the intelligence of AGI systems if and when they emerge, and that such a system must start with a benchmark, which is what they propose in their paper.

The benchmark consists of a series of questions posed to an AI system, whose answers are compared with reference answers supplied by a random group of human respondents (a minimal sketch of this kind of scoring appears below). In building it, the team deliberately avoided the typical queries on which AI systems already score highly. Instead, the questions are ones humans find easy to answer but computers find difficult: in many cases, answering them requires multiple steps of work and "reasoning." For example, one question asks whether the fat content of a specific pint of ice cream, as reported on Wikipedia, is higher or lower than the USDA standard.

The researchers tested AI products from their own organizations against the benchmark and found that none of them came close to passing it, an indication that the industry may not be as close to true AGI as some imagine.
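To make the scoring protocol concrete, here is a minimal sketch of how a GAIA-style exact-match evaluation might look. It assumes a local JSON file of question/reference-answer pairs; the file name, field names, and the answer_question stand-in are hypothetical illustrations, not the paper's actual code or the benchmark's real schema.

```python
import json
import re

def normalize(text: str) -> str:
    """Lowercase, trim, and collapse whitespace so superficial
    formatting differences don't count as wrong answers."""
    text = text.strip().lower()
    text = re.sub(r"[^\w\s.%-]", "", text)  # drop stray punctuation
    return re.sub(r"\s+", " ", text)

def score(records, answer_fn) -> float:
    """Fraction of questions where the model's answer matches the reference."""
    correct = sum(
        normalize(answer_fn(rec["question"])) == normalize(rec["reference_answer"])
        for rec in records
    )
    return correct / len(records)

if __name__ == "__main__":
    # Hypothetical file: a JSON list of {"question": ..., "reference_answer": ...} pairs.
    with open("gaia_questions.json") as f:
        records = json.load(f)

    # Stand-in for the assistant under test; a real harness would call an LLM here,
    # letting it browse, use tools, and reason over multiple steps before answering.
    def answer_question(question: str) -> str:
        return "unknown"

    print(f"accuracy: {score(records, answer_question):.1%}")
```

Exact-match scoring of this kind is only workable because the benchmark's questions are designed to have short, unambiguous answers; open-ended questions would require human or model-based grading instead.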