
Why Did LLMs Steal Our Em-Dashes?

The em-dash, or “—”, is a writing tool that allows for clearer expression of complex thoughts, and AI seems to think so too. As articulate students, researchers, and other writers attempt to navigate this dashless existence, two questions arise. When are we getting them back? Will we ever?

Upon ChatGPT's release in 2022, I realized that I wrote like AI. My sentences were long, my writing patterns were predictable, and my use of em-dashes was frequent. Initially, I was not concerned: if models are being trained to write like me, I must be doing a good job, right?

Then, however, came the "AI detectors" used by teachers and reviewers. These detectors are designed to spot machine-generated text. They measure the predictability of a written work's wording ("perplexity"), the variability of its sentence structure ("burstiness"), and other markers. Essentially, AIs are being used to spot content created by other AIs. At this point, I began to change my writing style. No more back-to-back 20+ word sentences. No more dash-filled phrases, semicolons, or groups of threes; I was not willing to risk being flagged.
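To make "burstiness" concrete, here is a minimal sketch (in Python, with invented sample sentences) of the kind of signal a detector might compute. Real detectors rely on model-based perplexity scores rather than this toy measure, but even the spread of sentence lengths captures the core idea: writing that is too uniform looks machine-like.

```python
import re
import statistics

def burstiness(text: str) -> float:
    """Rough 'burstiness' proxy: variation in sentence lengths.

    This toy version measures the standard deviation of sentence
    word counts, since uniform lengths are one marker detectors
    look for; real tools use far richer statistics.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths)

uniform = "I like dogs a lot. I like cats a lot. I like birds a lot."
varied = "Dogs! I have always liked them, ever since I was small. Cats too."
# Uniform sentence lengths score lower than varied ones.
assert burstiness(uniform) < burstiness(varied)
```

By this proxy, my old habit of back-to-back 20+ word sentences would have scored suspiciously low.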

Returning to the point of the em-dash (and other snubbed marks), there are two key reasons why they—along with words like "delve" and "underscores"—are so frequent in AI-generated writing: the training data, and the assessment of said data.

First, let's look at the data collection process for LLMs (large language models). Over 60% of the training data used in early models like GPT-3 came from large web scrapes, which collect publicly available text off the internet. After the data is collected, it is used to train models to predict language structures and patterns. Most LLMs are trained to predict the next word, internalizing patterns in grammar and style along the way. If a particular structure (like the em-dash) appears often enough and isn't adjusted later on, it can become a characteristic aspect of the model's output.
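That prediction step can be illustrated with a toy sketch (Python, with an invented miniature corpus): a bigram counter that always emits the most frequent continuation it saw in training. It is nowhere near a real LLM, but it shows how a dash-heavy training set yields a dash-heavy "instinct".

```python
from collections import Counter, defaultdict

# Invented miniature corpus, deliberately dash-heavy.
corpus = "ideas — big ideas — deserve dashes and ideas deserve care".split()

# Count, for each token, which tokens followed it in training.
followers = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev][nxt] += 1

def predict(token: str) -> str:
    # Return the most common continuation observed after `token`.
    return followers[token].most_common(1)[0][0]

# The dash is the most frequent follower of "ideas" in this corpus,
# so the "model" reproduces the habit.
print(predict("ideas"))  # → —
```

Scale this up by billions of tokens and adjust nothing afterwards, and the habit is baked in.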

As a result of this pattern-based learning—and the fact that these patterns aren't always corrected—models can take on specific stylistic habits that become hard-wired as their "instinct". As Medium writer Brent Csutoras demonstrated in his failed attempts to prompt the habit away, the em-dash has become embedded into the output style of today's LLMs.

To be clear, you are not imagining this em-dash overuse. According to Freeburg, an independent researcher, LLMs use em-dashes much more frequently than human writers, with GPT-4.1 showing a particularly high rate in standard essays. In line with Csutoras' conclusion, they found that em-dashes were almost entirely resistant to prompt manipulations and user restrictions.

Now, how did no one realize that AIs were learning to use em-dashes so frequently? Some journalists, including The Economist's Alex Hern, believe that the human feedback stage of training, often outsourced to annotators in African countries, is a key factor. African English uses words like "delve" much more frequently than the internet at large, which may affect the annotators' choices. However, the work of these annotators mostly ties to removing sexist, racist, and other harmful content, not directly altering the linguistic choices of the models.

Initially, I hypothesized that the explanation was tied to the datasets being used to train LLMs. However, after comparing word frequencies in COCA—a text dataset of popular modern media (think Star Trek)—and OpenWebText, a set which mimics AIs' training data, I found that while OpenWebText often "won out" in terms of frequency, the gap wasn't significant.
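For readers curious how such corpus comparisons are normalized, here is a small sketch (Python; the sample text is invented) of the uses-per-million metric cited below, which lets corpora of very different sizes be compared fairly.

```python
def per_million(count: int, total_tokens: int) -> float:
    """Normalize a raw count to occurrences per million tokens,
    the standard unit for comparing corpora of different sizes."""
    return count * 1_000_000 / total_tokens

def em_dash_rate(text: str) -> float:
    # Crude whitespace tokenization; real corpus tools tokenize
    # much more carefully.
    tokens = text.split()
    return per_million(text.count("—"), len(tokens))

# Invented sample: 2 em-dashes in 11 tokens.
sample = "Writing — good writing — takes time and care and revision"
print(round(em_dash_rate(sample), 2))  # → 181818.18
```

A rate like OpenWebText's 1621.88 uses/million means roughly one em-dash every 600 tokens.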

The em-dash frequency of OpenWebText was so high (1621.88 uses/million) that I had to remove it from this chart. I have no reference for COCA and am only drawing conclusions based on words.

I then turned to another potential argument: implicit bias, or the internal perceptions and judgements of individuals. Before em-dashes rose back into fame, they were mostly used in prose and other writing spaces that encourage wide vocabularies and creative structuring. Many people didn't know what they were before LLMs began splicing them into their sentences, and given our more regular reliance on short-form content like texts and emails, they didn't need to. In contrast, LLM training involves long-form writing like books and articles, where em-dashes are more common than in the average person's consumed media. Bias explains why em-dashes feel so out of place, but not why em-dashes are actually being used more frequently than normal.

The generally accepted hypothesis to explain this overuse ties back to LLMs' training and reinforcement processes. As models learn to predict language patterns, they begin to use their learned patterns to do so. However, this isn't the only factor determining which patterns get used more often. Models like Claude and ChatGPT have an additional goal with their responses: to provide users with clarity. Em-dashes, which allow for explanatory pauses and the breaking down of complex ideas, serve that goal well. As such, LLMs are not only introduced to more em-dashes, but their training also reinforces their usage. This results in em-dashes appearing more frequently than in typical human writing.

So what does this mean in the long term? Personally, I believe that these models will soon reduce their use of em-dashes. Individuals are currently avoiding em-dashes and other AI "red flags", so their overall usage is decreasing. LLMs are trained to replicate the styles of human writers, and as LLMs get more frequent content updates, the decreased use of these writing tools should have an influence on their responses.

The only question is whether we, as writers, will ever go back. This fear of being "caught" has begun to overtake what writing once was: freedom of expression. There are now countless lists of supposed AI "tells", flagging everything from empty questions to the use of writing structures that people were once taught to use. To write "humanly", we have to write less creatively.


Lia Erisson is a second year (U2) Computer Science & Economics student minoring in Physiology. She loves exploring the intersection of technology, wellbeing, and the human experience.

Part of the OSS mandate is to foster science communication and critical thinking in our students and the public. We hope you enjoy these pieces from our Student Contributors and welcome any feedback you may have!
