Sydney Taylor
2025
GPT-4 is Judged More Human than Humans in Displaced and Inverted Turing Tests
Ishika M. Rathi
|
Sydney Taylor
|
Benjamin Bergen
|
Cameron Jones
Proceedings of the 1stWorkshop on GenAI Content Detection (GenAIDetect)
Everyday AI detection requires differentiating between humans and AI in informal, online conversations. At present, human users most often do not interact directly with bots but instead read their conversations with other humans. We measured how well humans and large language models can discriminate using two modified versions of the Turing test: inverted and displaced. GPT-3.5, GPT-4, and displaced human adjudicators judged whether an agent was human or AI on the basis of a Turing test transcript. We found that both AI and displaced human judges were less accurate than interactive interrogators, with below chance accuracy overall. Moreover, all three judged the best-performing GPT-4 witness to be human more often than human witnesses. This suggests that both humans and current LLMs struggle to distinguish between the two when they are not actively interrogating the person, underscoring an urgent need for more accurate tools to detect AI in conversations.