AI Achieves Milestone by Passing Rigorous Turing Test for the First Time

Researchers at the University of California, San Diego have published a paper that, for the first time, empirically demonstrates an AI system passing the classic Turing test: OpenAI's new model, GPT-4.5, achieved a win rate of 73%.

It may seem that the Turing test is outdated, with language models having "defeated" it long ago. An important clarification is in order: there have indeed been numerous articles measuring AI capabilities on the Turing test. But, first, even when some models scored above 50%, the results were unstable and close to random guessing. Second, many studies did not examine the original three-party version of the test but rather simplified surrogates.

**What does the three-party version of the test entail?** Each experiment involves two humans and a bot. The interrogator conducts two parallel conversations, one with the person and one with the bot, may ask any questions for five minutes, uses whatever strategies they like to identify who is who, and then casts a vote.
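Scoring in this setup reduces to a simple tally: the bot "wins" a game whenever the interrogator's vote goes to it rather than to the real person. A minimal sketch (the function and record format below are illustrative, not from the paper):

```python
def win_rate(verdicts: list[bool]) -> float:
    """Fraction of games in which the interrogator picked the AI as the human.

    Each entry is one three-party game: True if the vote went to the bot,
    False if the interrogator correctly identified the real person.
    """
    return sum(verdicts) / len(verdicts)

# Hypothetical outcomes of four games, for illustration only.
games = [True, True, False, True]
print(win_rate(games))  # 0.75
```

A win rate around 0.5 means the interrogator is effectively guessing; values well above it mean the bot is being taken for the human more often than the human is.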

This is how the researchers from San Diego tested the models. They ran experiments with GPT-4.5, LLaMa-3.1-405B, GPT-4o, and, as a baseline, ELIZA, the classic 1960s chatbot.

As a side note: despite dating back to the 1960s, ELIZA remains a fascinating system. Remarkably, in this study it (spoiler alert) outperformed the far more modern GPT-4o. If you'd like to read more about it, check out our Telegram channel Data Secrets, where we have published a series of illustrated posts on the history of AI and NLP.

Additionally, on our channel (run by a team of practicing ML engineers), we share practical ML notes, analyses of important new papers, and other interesting news every day. So don't hesitate to join us: we are 55 thousand members strong, and our community welcomes new specialists and enthusiasts! 🙂

In the experiments, each model was tested under two different starting prompts. In the first, the model was explicitly told that it needed to outsmart a human and make them believe they were talking to a real person (the NO-PERSONA variant). In the second, the model was simply instructed to behave "like a young introvert interested in internet culture who actively uses slang" (the PERSONA variant).

Surprisingly, the second, more natural variant proved more effective. In the NO-PERSONA format, GPT-4.5 succeeded only 36% of the time, while with the "young introvert" persona it reached an impressive 73%. In other words, in almost **three out of four games the model "outplayed" both humans: the interrogator concluded that the real person was the bot, while the model was judged to be the human.**

LLaMa's result was also notable at 56%. However, as noted above, this is close to random guessing and cannot be counted as a genuine pass. ELIZA achieved 23%, and GPT-4o did even worse at just 21%.

The headline figures can confidently be considered statistically significant: the authors replicated the results across two independent groups of participants, with each person serving multiple times as both judge and conversant.
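As a back-of-the-envelope illustration of why 73% is convincing while 56% is not, one can run an exact one-sided binomial test against the 50% chance level. The trial counts below are hypothetical round numbers for the sketch; the paper reports its own sample sizes.

```python
from math import comb

def p_above_chance(wins: int, trials: int) -> float:
    """Exact one-sided binomial p-value for the null 'win rate = 50%'.

    Probability of seeing at least `wins` successes in `trials` fair
    coin flips, i.e. how surprising the result is if the interrogator
    were just guessing.
    """
    return sum(comb(trials, k) for k in range(wins, trials + 1)) / 2 ** trials

# Hypothetical: 73 wins out of 100 games vs. 56 wins out of 100 games.
print(p_above_chance(73, 100))  # far below 0.001: clearly above chance
print(p_above_chance(56, 100))  # above 0.05: consistent with guessing
```

Under these assumed counts, a 73% win rate is wildly improbable under pure guessing, while 56% is well within what a coin flip could produce, which matches the article's reading of the LLaMa result.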

Does passing the Turing test prove "intelligence"? As of today, the answer is more likely no than yes. What the research does establish, however, is that LLMs have reached a level where they can mislead humans, and we may often find it hard to "unmask" them.