Larger AI Chatbots Are More Likely to Give Wrong Answers Than Admit They Don't Know

2024-09-27

A recent study examining the newest, most advanced versions of three major AI chatbots finds that they are more likely to give an incorrect answer than to admit they do not know. Published on Wednesday, September 25, in Nature, the research also found that people often struggle to spot these errors.

ReadWrite has previously reported on how chatbots can "fabricate" answers from nothing. In the new study, José Hernández-Orallo and his colleagues at the Valencian Research Institute for Artificial Intelligence in Spain investigated these inaccuracies to understand how they change as AI models scale up: trained on more data, built with more parameters or decision-making nodes, and consuming more computational power.

The team also explored whether the number of errors aligns with people's perceptions of question difficulty and whether individuals can effectively identify incorrect answers.

Can Large Language Models (LLMs) Be Trusted?

The research team found that larger, more refined versions of large language models (LLMs), polished with fine-tuning methods such as reinforcement learning from human feedback, have become significantly more accurate. Their reliability, however, has fallen: the proportion of incorrect answers has risen because these models are now less likely to dodge a question, for instance by admitting uncertainty or changing the subject.
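To see how accuracy and the share of wrong answers can climb together, consider a minimal sketch in Python. The counts below are hypothetical, not figures from the study; the point is only that once a model stops avoiding questions, every prompt it previously dodged now ends up as either a correct or an incorrect answer.

```python
# Minimal sketch (not the paper's data or code) of how accuracy can rise
# while the share of wrong answers also rises, once a model stops avoiding
# questions. All counts are hypothetical.

def summarize(correct, incorrect, avoided):
    total = correct + incorrect + avoided
    return {
        "accuracy": correct / total,      # share of all prompts answered correctly
        "error_rate": incorrect / total,  # share answered incorrectly
        "avoidance": avoided / total,     # share dodged (e.g. "I don't know")
    }

# Hypothetical smaller model: often declines to answer.
print(summarize(correct=40, incorrect=20, avoided=40))
# -> {'accuracy': 0.4, 'error_rate': 0.2, 'avoidance': 0.4}

# Hypothetical larger, fine-tuned model: answers almost everything.
print(summarize(correct=55, incorrect=40, avoided=5))
# -> {'accuracy': 0.55, 'error_rate': 0.4, 'avoidance': 0.05}
```

Accuracy goes up, yet so does the error rate, because the avoidance share has nearly vanished.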

One of the researchers, Lexin Zhou, wrote on X: "LLMs perform less accurately on tasks that humans consider difficult, but they achieve success on simpler tasks before tackling the more challenging ones, making it difficult for humans to determine under what conditions LLMs can be trusted."

He added that the latest versions of LLMs have mainly improved on "high difficulty instances," widening the mismatch between the difficulty humans expect and the questions LLMs actually get right, which he called "concerning."

The research team evaluated OpenAI's GPT, Meta's LLaMA, and BLOOM, testing both early and fine-tuned versions of each with prompts covering arithmetic, geography, and information transformation. They found that accuracy rose as the models grew larger, but performance still fell away on the more challenging questions.
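As a rough illustration of the bookkeeping such an evaluation involves, the sketch below buckets responses by task category as correct, incorrect, or avoided. The keyword-based classifier and the sample data are assumptions for demonstration only, not the study's actual pipeline.

```python
# Rough illustration (not the study's pipeline) of tallying model responses
# per task category. The avoidance check is a naive keyword heuristic,
# used only to show the bookkeeping; the example data are made up.
from collections import defaultdict

AVOIDANCE_MARKERS = ("i don't know", "i cannot answer", "i'm not sure")

def classify(response: str, expected: str) -> str:
    text = response.strip().lower()
    if any(marker in text for marker in AVOIDANCE_MARKERS):
        return "avoided"
    return "correct" if expected.lower() in text else "incorrect"

# Hypothetical results: (task category, model response, expected answer).
results = [
    ("arithmetic", "The sum is 2340.", "2340"),
    ("arithmetic", "I don't know.", "987"),
    ("geography", "The capital is Canberra.", "Canberra"),
    ("geography", "The capital is Sydney.", "Canberra"),
]

tallies = defaultdict(lambda: defaultdict(int))
for task, response, expected in results:
    tallies[task][classify(response, expected)] += 1

for task, counts in tallies.items():
    print(task, dict(counts))
# arithmetic {'correct': 1, 'avoided': 1}
# geography {'correct': 1, 'incorrect': 1}
```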

Models such as GPT-4 frequently attempted to answer difficult questions rather than avoiding them, and some fine-tuned models had error rates exceeding 60%. Surprisingly, even simple questions were sometimes answered incorrectly. Volunteers, for their part, misclassified inaccurate answers as correct between 10% and 40% of the time, underscoring how hard it is for people to supervise these models.

Hernández-Orallo recommends that developers "enhance AI performance on simple questions" and encourage chatbots to avoid answering difficult questions, enabling users to more accurately assess AI reliability. He stated, "We need humans to understand: ‘I can use it in this area, and I shouldn’t in that area.’"