We’ve all felt the creeping suspicion that something we’re reading was written by a large language model, but it’s remarkably difficult to pin down. For several months last year, everyone became convinced that certain words like “explore” and “underscore” could give a model away, but the evidence was thin, and as the models became more sophisticated, the telltale words became harder to track down.
But as it turns out, the folks at Wikipedia have gotten pretty good at flagging prose written by AI. The group’s public guide, “Signs of AI Writing,” is the best resource I’ve found for working out whether your suspicions are justified. (Credit to poet Jameson Fitzpatrick for pointing out the document on X.)
Since 2023, Wikipedia editors have been working to get a handle on AI-written submissions, an effort they call Project AI Cleanup. With millions of edits made every day, there is a wealth of material to draw on, and in classic Wikipedia-editor style, the group has produced a detailed, evidence-based field guide.
To start, the guide confirms what we already know: automated detection tools are basically useless. Instead, the guide focuses on habits and turns of phrase that are rare on Wikipedia but common across the internet (and therefore common in model training data). According to the guide, AI submissions spend a lot of time emphasizing why their subject matter is important, usually in generic terms like “pivotal moment” or “broader movement.” AI models also spend a lot of time detailing minor media mentions to make the subject seem noteworthy. This is the kind of thing you would expect from a personal biography, but not from an independent source.
The guide points out a particularly interesting quirk: sentences that trail off into clauses making vague claims of significance. A model will say that some event or detail is “emphasizing the significance” of something, or “reflecting the continuing relevance” of some general idea. (Grammar geeks will know this form as the “present participle.”) It’s a little hard to pin down, but once you learn to recognize it, you’ll see it everywhere.
There is also a tendency toward vague marketing language, which is very common on the internet. The scenery is always beautiful, the views are always breathtaking, and everything is clean and modern. As the editors put it, it’s “similar to transcribing a TV commercial.”
The guide is worth reading in its entirety, and I came away very impressed. Previously, I would have said that LLM prose was developing too quickly to pin down. But the habits flagged here are deeply embedded in how AI models are trained and deployed. You can hide them, but it’s difficult to get rid of them completely. And as the public becomes more savvy about identifying AI prose, all sorts of interesting consequences could arise.
