A Scanning Error Created a Fake Science Term—Now AI Won’t Let It Die


AI trawling the internet’s vast repository of journal articles has reproduced an error that’s made its way into dozens of research papers—and now a team of researchers has found the source of the issue.

It’s the question on the tip of everyone’s tongue: What the hell is “vegetative electron microscopy”?

It sounds technical, maybe even credible, but it’s complete nonsense. And yet it’s turning up in scientific papers, AI responses, and even peer-reviewed journals. So… how did this phantom phrase become part of our collective knowledge?

As painstakingly reported by Retraction Watch in February, the term may have been pulled from parallel columns of text in a 1959 paper on bacterial cell walls. The AI seemed to have jumped the columns, reading two unrelated lines of text as one contiguous sentence, according to one investigator.

The farkakte text is a textbook case of what researchers call a digital fossil: an error that gets preserved in the layers of AI training data and pops up unexpectedly in future outputs. Such digital fossils are “nearly impossible to remove from our knowledge repositories,” according to a team of AI researchers who traced the curious case of “vegetative electron microscopy,” as noted in The Conversation.

The fossilization process started with a simple mistake, as the team reported. Back in the 1950s, two papers were published in Bacteriological Reviews that were later scanned and digitized.

The layout of the columns as they appeared in those articles confused the digitization software, which mashed up the word “vegetative” from one column with “electron” from another. The fusion is a so-called “tortured phrase”—one that is hidden from the naked eye, but apparent to software and language models that “read” text.
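To see how a column-jumping scan produces that kind of mash-up, here’s a minimal illustrative sketch in Python. The sample text is invented for demonstration and is not taken from the original papers; only the row-by-row stitching mimics the failure mode described above.

```python
# Purely illustrative: how reading a two-column page row by row can fuse
# unrelated text. The sample lines are invented, not from the 1959 paper.
left_column = [
    "the spores developed into vegetative",
    "forms within a few hours of plating",
]
right_column = [
    "electron micrographs of the cell wall",
    "showed a dense, layered structure",
]

# A human reads down one column, then the other.
human_reading = " ".join(left_column) + " " + " ".join(right_column)

# Naive digitization reads straight across the page, row by row,
# stitching the two columns into one stream of text.
ocr_reading = " ".join(f"{l} {r}" for l, r in zip(left_column, right_column))

print("vegetative electron" in human_reading)  # False
print("vegetative electron" in ocr_reading)    # True: the phantom phrase
```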

As chronicled by Retraction Watch, nearly 70 years after the biology papers were published, “vegetative electron microscopy” started popping up in research papers out of Iran.

There, a Farsi translation glitch may have helped reintroduce the term: the words for “vegetative” and “scanning” differ by just a dot in Persian script—and scanning electron microscopy is a very real thing. That may be all it took for the false terminology to slip back into the scientific record.

But even if the error began with a human translation, AI replicated it across the web, according to the team who described their findings in The Conversation. The researchers prompted AI models with excerpts of the original papers, and the models reliably completed the phrases with the BS term rather than with scientifically valid ones. Older models, such as OpenAI’s GPT-2 and Google’s BERT, did not produce the error, giving the researchers an indication of when the contamination of the training data occurred.
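The Conversation post doesn’t spell out the researchers’ exact setup, but a completion probe along those lines is easy to sketch. A rough, hypothetical version using the openly available GPT-2 model (via the Hugging Face transformers library) might look like this; the prompt is a made-up stand-in, not an actual excerpt from the papers:

```python
# A rough sketch of a completion probe, not the researchers' actual setup.
# Uses the open GPT-2 model via Hugging Face transformers; the prompt is an
# invented stand-in for an excerpt from the original papers.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "The samples were examined using vegetative"
completions = generator(
    prompt,
    max_new_tokens=8,
    num_return_sequences=5,
    do_sample=True,
)

# In a contaminated model, completions such as "electron microscopy" would
# turn up far more often than chance; per the researchers, GPT-2 predates
# the contamination and shouldn't produce the term.
for c in completions:
    print(c["generated_text"])
```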

“We also found the error persists in later models including GPT-4o and Anthropic’s Claude 3.5,” the group wrote in its post. “This suggests the nonsense term may now be permanently embedded in AI knowledge bases.”

The group identified the CommonCrawl dataset—a gargantuan repository of scraped internet pages—as the likely source of the unfortunate term that was ultimately picked up by AI models. But as tricky as it was to find the source of the errors, eliminating them is even harder. CommonCrawl consists of petabytes of data, which makes it tough for researchers outside of the largest tech companies to address issues at scale. And that’s on top of the fact that leading AI companies are famously resistant to sharing their training data.
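For a sense of what hunting for the phrase in crawl data involves, here’s a back-of-the-envelope sketch. The filename is a hypothetical local copy of a single CommonCrawl WET (extracted plain-text) file; a full crawl is split across many thousands of such files, which is why auditing the corpus at petabyte scale is out of reach for most outside researchers.

```python
# Hypothetical sketch: scan one locally downloaded CommonCrawl WET file
# (gzip-compressed extracted text) for the phantom phrase. A real audit
# would have to repeat this across many thousands of files.
import gzip

PHRASE = "vegetative electron microscopy"
hits = 0

with gzip.open("example-segment.warc.wet.gz", "rt",
               encoding="utf-8", errors="ignore") as f:
    for line in f:
        if PHRASE in line.lower():
            hits += 1

print(f"lines mentioning the phrase in this one file: {hits}")
```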

But AI companies are only part of the problem—journal-hungry publishers are another beast. As reported by Retraction Watch, the publishing giant Elsevier initially tried to defend “vegetative electron microscopy” as making sense before ultimately issuing a correction.

The publisher Frontiers had its own debacle last year, when it was forced to retract an article that included nonsensical AI-generated images of rat genitals and biological pathways. Earlier this year, a team of researchers writing in Harvard Kennedy School’s Misinformation Review highlighted the worsening issue of so-called “junk science” on Google Scholar, essentially unscientific bycatch that gets trawled up by the engine.

AI has genuine use cases across the sciences, but its unwieldy deployment at scale is rife with the hazards of misinformation, both for researchers and for the scientifically inclined public. Once the erroneous relics of digitization become embedded in the internet’s fossil record, recent research indicates they’re pretty darn difficult to tamp down.

