OpenAI GPT-4.5 Successfully Passes Turing Test

2025-04-03

A new preprint study reports that OpenAI's GPT-4.5 large language model achieved remarkable results in a three-party version of the Turing Test, long regarded as a benchmark for whether a machine exhibits human-like intelligence. In the test, human interrogators were unable to reliably tell GPT-4.5 apart from a real person.

In the study, which is still awaiting peer review, participants held text-based conversations with both a human and an AI and were asked to identify which was the human. When GPT-4.5 was prompted to adopt a specific persona during the chat, it was judged to be the human 73% of the time, far above the 50% rate expected from random guessing. The result marks a notable milestone in the history of the Turing Test.
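To make the "far above chance" claim concrete, here is a minimal sketch (not taken from the paper) of the kind of binomial test one could run. The trial count is a purely hypothetical assumption for illustration; the preprint reports its own counts.

```python
from scipy.stats import binomtest

# Hypothetical numbers for illustration only; not the study's actual counts.
n_conversations = 100   # assumed number of GPT-4.5 (persona) interrogations
judged_human = 73       # conversations in which the AI was picked as the human

# Under pure guessing, each verdict is a 50/50 coin flip (p = 0.5).
result = binomtest(judged_human, n_conversations, p=0.5, alternative="greater")
print(f"win rate: {judged_human / n_conversations:.0%}")
print(f"p-value vs. 50% chance: {result.pvalue:.2e}")
```

With these assumed numbers, the one-sided p-value is tiny, which is why a 73% win rate is described as far exceeding chance.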

Besides GPT-4.5, the research also evaluated Meta's LLaMa 3.1-405B model, OpenAI's GPT-4o, and the early chatbot ELIZA. The findings showed that participants could not distinguish humans from either GPT-4.5 or LLaMa 3.1-405B (when the models were given persona cues) any better than chance. In some cases, GPT-4.5 was identified as human more often than the actual human participants were.

The Turing Test was first proposed by British mathematician and computer scientist Alan Turing in 1950. It assesses machine intelligence through text-based dialogues, positing that if an interrogator cannot reliably determine which entity is human and which is a machine, the machine may possess a form of human-like intelligence.

In this latest study, the researchers ran the Turing Test online. Nearly 300 participants were randomly assigned to act either as the interrogator or as one of two "witnesses"; the other witness was always a chatbot. A key variable was how the AI models were prompted: with or without persona cues. Without a persona prompt, GPT-4.5 was judged to be the human in only 36% of conversations, well below the 73% achieved with persona cues. By comparison, GPT-4o (the current ChatGPT version), which was tested only without persona cues, achieved a win rate of just 21%.
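For readers curious what "prompting with a persona" can look like in practice, here is a hypothetical sketch using the OpenAI chat-completions API. The persona text and model identifier below are illustrative assumptions, not the prompt or configuration actually used in the study.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative persona instruction; the study's actual prompt is not reproduced here.
persona_prompt = (
    "You are a somewhat shy young adult chatting casually online. "
    "Use informal language, occasional typos, and keep replies short."
)

response = client.chat.completions.create(
    model="gpt-4.5-preview",  # assumed model name for illustration
    messages=[
        {"role": "system", "content": persona_prompt},
        {"role": "user", "content": "hey, how's your day going?"},
    ],
)
print(response.choices[0].message.content)
```

The point of such a system message is simply to steer the model's tone and register; the study's comparison of persona versus no-persona conditions suggests this steering, rather than raw model capability alone, drove much of the gap between 36% and 73%.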

Despite GPT-4.5’s notable performance, this does not imply that it possesses true human-level intelligence. The Turing Test is more of a thought experiment than a definitive measure. Although large language models are trained on vast amounts of human-generated text and can generate plausible responses to unfamiliar questions, they still lack genuine human cognition.

Cameron Jones, the lead author of the study and a researcher at the Language and Cognition Lab at the University of California, San Diego, stated that his findings do not definitively prove that large language models have human-like intelligence. However, he noted that the results further demonstrate these models' ability to convincingly mimic human behavior in short interactions, potentially leading to increased automation in workplaces, heightened social engineering threats, and broader societal disruptions.

Finally, Jones emphasized that the Turing Test serves not only as a gauge for machines but also reflects humanity’s evolving perceptions of technology. As people grow more accustomed to interacting with AI, they may become better at recognizing artificial entities, meaning the outcomes of the Turing Test are not set in stone.