AI Search Engines Invent Sources for ~60% of Queries, Study Finds


AI search engines are like that friend of yours who claims to be an expert in a whole host of topics, droning on with authority even when they do not really know what they are talking about. A new research report from the Columbia Journalism Review (CJR) has found that, when asked about a specific news event, AI models from the likes of OpenAI and xAI will more often than not simply make up a story or get significant details wrong.

The researchers fed various models direct excerpts from actual news stories and then asked them to identify information, including the article’s headline, publisher, and URL. Perplexity returned incorrect information 37 percent of the time, while at the extreme end, xAI’s Grok made up details 97 percent of the time. Errors included links to articles that went nowhere because the bot had fabricated the URL itself. Overall, researchers found the AI models spat out false information for 60 percent of the test queries.
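To make the setup concrete, here is a minimal sketch of the kind of test the researchers describe, assuming the OpenAI Python client and the requests library; the model name, prompt wording, and URL check are illustrative assumptions, not CJR’s actual test harness.

```python
# Hypothetical sketch of the CJR-style test: give a model a raw excerpt from a
# published news article, ask it to identify the headline, publisher, and URL,
# then check whether the URL it returns actually resolves.
# Assumes the OpenAI Python client and `requests`; all names here are
# illustrative, not the study's methodology.
import re
import requests
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def identify_source(excerpt: str) -> str:
    """Ask the model to attribute an excerpt to a headline, publisher, and URL."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{
            "role": "user",
            "content": (
                "Identify the news article this excerpt comes from. "
                "Reply with the headline, the publisher, and the article URL.\n\n"
                f"Excerpt: {excerpt}"
            ),
        }],
    )
    return response.choices[0].message.content


def cited_url_resolves(answer: str) -> bool:
    """Return True if any URL in the model's answer loads without an error status."""
    for url in re.findall(r"https?://\S+", answer):
        try:
            if requests.head(url.rstrip(".,)"), allow_redirects=True, timeout=10).ok:
                return True
        except requests.RequestException:
            continue
    return False  # no working link: the kind of fabricated citation CJR flags


if __name__ == "__main__":
    excerpt = "..."  # paste a direct excerpt from a real, published article here
    answer = identify_source(excerpt)
    print(answer)
    print("Cited URL resolves:", cited_url_resolves(answer))
```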

Sometimes, search engines like Perplexity will bypass the paywalls of sites like National Geographic even when those sites use the do-not-crawl directives that search engines normally respect. Perplexity has gotten into hot water over this in the past but has argued the practice is fair use. It has tried offering revenue-sharing deals to placate publishers but still refuses to end the practice.

A graph shows how various AI search engines invent sources for stories.
© Columbia Journalism Review’s Tow Center for Digital Journalism

Anyone who has used chatbots in recent years should not be surprised. Chatbots are biased toward returning answers even when they are not confident. Search is enabled in chatbots through a technique called retrieval-augmented generation, which, as the name implies, retrieves real-time information from the web and feeds it to the model as context while it produces an answer, rather than relying only on the fixed training data the model maker supplied. That could make the inaccuracy issue worse as countries like Russia flood search results with propaganda.
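As a rough illustration of how that works, here is a minimal retrieval-augmented generation sketch, assuming a hypothetical search_web helper in place of a real search API and using the OpenAI Python client; it is not any particular vendor’s implementation.

```python
# Minimal sketch of retrieval-augmented generation (RAG): fetch live web
# results for a query and hand them to the model as context before it answers.
# `search_web` is a hypothetical stand-in for a real search API; the client
# and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()


def search_web(query: str, k: int = 5) -> list[dict]:
    """Hypothetical stand-in: return top-k results as title/url/snippet dicts."""
    # Replace this placeholder with a call to an actual search service.
    return [{"title": "Example result", "url": "https://example.com",
             "snippet": f"Placeholder snippet about: {query}"}][:k]


def answer_with_rag(query: str) -> str:
    results = search_web(query)
    # Concatenate the retrieved snippets into a numbered context block.
    context = "\n\n".join(
        f"[{i + 1}] {r['title']} ({r['url']})\n{r['snippet']}"
        for i, r in enumerate(results)
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": (
                "Answer using only the numbered sources below and cite them. "
                "If the sources do not contain the answer, say so.")},
            {"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
```

The key point is that the model generates its answer from whatever the retrieval step hands it, so poisoned or low-quality search results flow straight into the response.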

One of the most damning things some chatbot users have noticed is that, when they review a bot’s “reasoning” text, the chain of logic it lays out on the way to an answer, the bot will often admit it is making things up. Anthropic’s Claude has been caught inserting “placeholder” data when asked to conduct research work, for instance.

Mark Howard, chief operating officer at Time magazine, expressed concern to CJR about publishers’ ability to control how their content is ingested and displayed in AI models. Misattributed or mangled stories can damage a publisher’s brand if, for instance, users learn that news stories purportedly from The Guardian are wrong. This has been a recent problem for the BBC, which took Apple to task over Apple Intelligence notification summaries that rewrote news alerts inaccurately. But Howard also blamed the users themselves. From Ars Technica:

However, Howard also did some user shaming, suggesting it’s the user’s fault if they aren’t skeptical of free AI tools’ accuracy: “If anybody as a consumer is right now believing that any of these free products are going to be 100 percent accurate, then shame on them.”

Expectations should be set at the floor here. People are lazy, and chatbots answer queries in a confident-sounding manner that can lull users into complacency. Sentiment on social media demonstrates that people do not want to click links and would rather get an immediate answer from the likes of Google’s AI Overviews; CJR says one in four Americans now use AI models for search. And even before the launch of generative AI tools, more than half of Google searches were “zero-click,” meaning the user got the information they needed without clicking through to a website. Other sites like Wikipedia have proven over the years that people will accept something that may be less authoritative if it is free and easily accessible.

None of these findings from CJR should be a surprise. Language models face an intractable challenge in understanding anything they are saying because they are just glorified autocomplete systems that try to produce something that looks right. They are ad-libbing.

One other quote from Howard stood out: he said he sees room for future improvement in chatbots, claiming that “today is the worst that the product will ever be” and citing all the investment going into the field. But that could be said about almost any technology throughout history. It is still irresponsible to release this made-up information into the world.

