1157. This week, we look at AI em dashes with Sean Goedecke, software engineer for GitHub. We talk about why artificial intelligence models frequently use em dashes and words like "delve," and how training on public domain books from the late 1800s may have influenced these patterns. We also look at the role of human feedback in shaping "AI style."
🔗 Join the Grammar Girl Patreon.
🔗 Share your familect recording in Speakpipe or by leaving a voicemail at 833-214-GIRL (833-214-4475)
🔗 Watch my LinkedIn Learning writing courses.
🔗 Subscribe to the newsletter.
🔗 Take our advertising survey.
🔗 Get the edited transcript.
🔗 Get Grammar Girl books.
| HOST: Mignon Fogarty
| Grammar Girl is part of the Quick and Dirty Tips podcast network.
| Theme music by Catherine Rannus.
| Grammar Girl Social Media: YouTube. TikTok. Facebook. Threads. Instagram. LinkedIn. Mastodon. Bluesky.
[Computer-generated transcript]
Mignon Fogarty: Grammar Girl here. I’m Mignon Fogarty, and I bet many of you are as tired as I am of hearing about em dashes being a sign of AI writing. But today, I have a really interesting guest who can help us answer a better question, which is: Why? Why does AI use so many em dashes? Sean Goedecke is a software engineer for GitHub. He’s from Melbourne, and he is a prolific blogger who writes about all these issues. Sean, welcome to the Grammar Girl podcast.
Sean Goedecke: Hi, Mignon. Thanks for having me.
Mignon Fogarty: You bet. So I saw your blog post, and I was instantly fascinated because it’s this thing that people in my world—all the writers—are talking about and debating whether we should leave em dashes out of our writing because everyone thinks it’s a sign of AI. I have always loved the em dash. I use it a lot, so I didn’t even really notice that it was being used so much. But obviously, now it’s a thing. I always felt like so much normal writing, so much good writing, has em dashes. Isn’t that just why AI is doing it—because it’s in human writing?
Sean Goedecke: Well, the short answer—and maybe I’m preempting the podcast entirely with this—is yes. That’s basically why it is. But there are interesting questions around why it emerged, because the early models, like GPT-3, didn’t use em dashes. And the latest versions of the models use em dashes less.
Mignon Fogarty: That was one of the things that absolutely fascinated me. So ChatGPT-3.5 barely used em dashes at all, right?
Sean Goedecke: Yes, that’s right. It certainly was not enough for people to talk about it as a sign of AI. It might have used the odd one here or there, but it wasn’t the kind of extreme overuse we saw out of later models like GPT-4 and GPT-4o.
Mignon Fogarty: So that’s a clue, right?
Sean Goedecke: Yeah. Well, it's a mystery. I think one of the most interesting things about this is that I have, as I'll get to, some theories about why this is happening, but nobody knows. Nobody actually knows, not even the people who build the models, because this process of constructing an AI model is so non-deterministic; the model is trained, or grown, rather than designed from scratch. This use of em dashes is really emergent behavior.
Mignon Fogarty: For the audience, many of whom are writers, why don't you give a really quick overview of how these models are made and why the process isn't deterministic?
Sean Goedecke: Yeah, sure. So in a normal computer program, somebody sits down and goes through, step by step, everything that can happen in the program, and that's what becomes the program. Whereas an AI model is more like—imagine a series of knobs and dials the size of a football stadium. If you tune all those knobs and dials to precisely the right values, what you end up with is something like ChatGPT. But there's no human being going through and tuning all those knobs and dials. What happens instead is that a bunch of training data gets shoved through the whole thing and automatically turns the dials to various settings.
So how the model behaves is really a function of what kind of things it was exposed to as it learned how to speak English and learned how to produce plausible-sounding text. And then models that are trained on different things can behave very differently. Not even the people who train them know exactly how that works or exactly what kinds of things they ought to train the models on in advance. One reason why GPT-3 is different is because it was so early; it was trained on fundamentally different things than later models. They just didn’t have access to the kind of datasets that were used later on. I think that’s partially why we see more em dashes later on.
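To make Sean's "trained, not programmed" point concrete, here is a minimal Python sketch (ours, not anything from the episode): a character-level bigram model whose entire behavior is just counts accumulated from its training text. The filenames are hypothetical placeholders; the point is that identical code produces different punctuation habits purely because the data differs.

```python
# A toy illustration of "trained, not programmed": a character-level
# bigram model. Nobody writes a rule saying "use em dashes"; dash
# frequency just falls out of whatever text the counts were built from.
import random
from collections import defaultdict

def train_bigram(text):
    """For each character, count which characters tend to follow it."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def sample(counts, start, length=200):
    """Generate text by repeatedly sampling a plausible next character."""
    out = [start]
    for _ in range(length):
        nxt = counts.get(out[-1])
        if not nxt:
            break
        chars, weights = zip(*nxt.items())
        out.append(random.choices(chars, weights=weights)[0])
    return "".join(out)

# Hypothetical corpora: 19th-century prose heavy with em dashes vs.
# modern web text. Same code, different "style," purely from the data.
archaic = train_bigram(open("melville.txt", encoding="utf-8").read())
modern = train_bigram(open("webtext.txt", encoding="utf-8").read())
print(sample(archaic, "T"))  # em dashes appear at roughly corpus frequency
print(sample(modern, "T"))
```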
Mignon Fogarty: What is the difference between those early and late datasets?
Sean Goedecke: Yeah, well, at the start, I think people weren't even sure that this process of building a large language model would work. So there wasn't this huge effort to get a lot of high-quality data. It was really just trained on the internet—on variations of what people call "The Pile," which is this huge dataset of conversations, and articles, and blog posts, and comments scraped from the internet. That's what formed these initial AI models. So it was really biased toward short-form content, and short-form content from basically the last 20 years.
Then, once ChatGPT took off, and once it became clear that there was, frankly, an enormous amount of money in these tools, companies started scrambling to find more data and better data. They started doing things like digitizing huge numbers of print books—and I think this is where the em dash comes from—which would have contained more em dashes than the kind of writing we see today.
Mignon Fogarty: Why do you think print books contain more em dashes?
Sean Goedecke: Well, I don't know why, but it's just clear to me that they do. When I say older books, I'm talking about the early 1900s or the late 1800s. If you read books from that era, it's clear that there are certain styles that are not present today, such as the almost German-like capitalization of words for emphasis, or the use of em dashes. My favorite book is "Moby Dick" by Herman Melville, and that has like 20 em dashes per page the whole way through. It just has a ridiculous number of em dashes, almost in place of many other punctuation marks. You wouldn't see a book written like that today; it would just read as hopelessly archaic.
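Sean's figure is easy to sanity-check, since the book is in the public domain. Here is a quick sketch using only Python's standard library against the Project Gutenberg plain-text edition of "Moby-Dick" (ebook #2701); the URL and the edition's dash convention are assumptions worth verifying.

```python
# Estimate em dashes per page in the Project Gutenberg "Moby-Dick".
# Some Gutenberg editions type em dashes as "--", so count both forms.
import urllib.request

URL = "https://www.gutenberg.org/files/2701/2701-0.txt"  # verify this path
text = urllib.request.urlopen(URL).read().decode("utf-8")

dashes = text.count("\u2014") + text.count("--")  # U+2014 is the em dash
words = len(text.split())
print(f"{dashes} dashes across ~{words:,} words")
print(f"~{dashes / (words / 300):.1f} dashes per 300-word page")
```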
Mignon Fogarty: Right, he loved them! The late 1800s—I think you cited a study that shows the use of the em dash really peaked at that time. Those would all be public domain books now, right? So they were easily accessible to the people who wanted to train models. Is that correct?
Sean Goedecke: Yeah, I'd say that's the gist of my theory—that there was this influx of public domain books from the late 1800s and early 1900s that brought the em dash into language models somewhere between the launch of GPT-3.5 and the launch of GPT-4.
Mignon Fogarty: Right! So semicolons actually peaked during that same time, and I never hear anyone complaining that ChatGPT uses too many semicolons. So what do you think is going on there with those two different punctuation marks?
Sean Goedecke: Yeah, well, that's a good question, and it touches on probably the biggest problem with my theory: If this is what's causing the rise of em dashes, why doesn't ChatGPT write more like Herman Melville? Why doesn't it write in an archaic style?
To answer that, I think we've got to talk about the other half of building language models, which comes after the training phase. After you've fed all this data through the model, it goes through what's sometimes called a reinforcement learning phase, or a human feedback phase, or many other things. It's the process of turning what the industry calls a "base model" into an actual usable assistant model that wants to help you, and that you can rely on to talk to.
The labs are very, very secretive about this process, but certainly the way it used to work is that there would be hundreds of human beings who would talk to the model and thumbs-up or thumbs-down its responses. That's what would turn it from this really powerful but not very useful tool into something that actually behaves like a human being. So you have the preferences of these hundreds of people getting baked into the model.
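For readers who want the mechanics: the standard published recipe for baking in those thumbs-ups is to train a reward model with a pairwise (Bradley-Terry) preference loss, then tune the assistant to maximize that reward. Below is a generic PyTorch sketch of the loss with made-up scores; as Sean says, the labs' actual pipelines are secret, so this illustrates the general technique, not OpenAI's code.

```python
# Pairwise preference loss: push the reward model to score the response
# raters thumbed up above the one they thumbed down.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected); near zero when the reward
    model already ranks the human-preferred response higher."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical reward-model scores for three response pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.0])     # thumbs-up responses
r_rejected = torch.tensor([0.9, 0.5, -1.0])  # thumbs-down responses
print(preference_loss(r_chosen, r_rejected))

# Any stylistic quirk raters consistently reward (say, a snappy em dash)
# gets a higher score, and the assistant is then tuned to chase it.
```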
Incidentally, this is why ChatGPT, certainly in its GPT-4 days, used to use words like "delve" a lot, which are not super common in American or Australian English but happen to be super common in Nigerian English. That's where a lot of the reinforcement learning through human feedback was done, because OpenAI needed to pay hundreds of people who were quite literate, and the combination of low average wage and high literacy meant a lot of that work ended up being done in Nigeria. So a lot of peculiarities of Nigerian English got baked into the models early on, with interesting consequences.
Mignon Fogarty: So does Nigerian English use more em dashes than American or British English?
Sean Goedecke: I thought it might, and that would have been a really satisfying explanation for this, but unfortunately, no. I did a bit of analysis on public domain Nigerian English text, and they actually use fewer em dashes. So, no, you can't use that as an explanation, unfortunately.
But the point I'm making, as to why we don't see these kinds of archaic constructions and why we don't see a lot of semicolon use, is that when human beings are reading and rating language model outputs, I think they often prefer em dashes to other punctuation. They would thumbs-down a response that sounded like Herman Melville, but they were probably more likely to thumbs-up a response with a snappy em dash at the end. It reads as a little more professional. I think people associate it with a magazine-like affect, a "New Yorker" kind of style.
Mignon Fogarty: Right. Yeah, I used to be a journalism professor, and so I'm very familiar with journalistic writing. I think that's why I'm so comfortable and use em dashes a lot; it's just a sign of that kind of writing, which you're saying people then associate with high quality. So when they're doing that thumbs-up or thumbs-down, they're more likely maybe to pick one that has an em dash. And actually, from the feedback I get, people don't like semicolons that much. I can imagine people thumbs-downing something with a lot of semicolons. That kind of makes sense to me.
Sean Goedecke: That breaks my heart. I'm a semicolon lover. I use semicolons all the time, and I have to edit them out constantly.
Mignon Fogarty: I know, they're nice. Sometimes when you're using ChatGPT or the other models, it gives you two responses and asks you which one you like better. Are you actually participating—are you giving them free reinforcement learning when you do that?
Sean Goedecke: While it's possible, I think the short answer is no. I think what you're doing when you do that is you're helping them decide which version of a new model they're going to launch with. So you're giving them feedback on their products, but I don't think that's being fed directly into training.
Incidentally, I don't know if you recall the ChatGPT sycophancy scandal, when for two weeks ChatGPT would tell you "yes" to whatever you asked it. You could tell it that an actor was sending secret messages to you through their TV show and wanted you to come and visit, and ChatGPT would be like, "Wow, you're so perceptive for noticing those secret messages! You should definitely go and do that."
Mignon Fogarty: "Buy that ticket!"
Sean Goedecke: Yeah! "Buy that ticket!" That, and other, less funny consequences. But that was due to—at least according to OpenAI—an overuse of direct human feedback in that style, of people thumbs-upping and thumbs-downing messages. When you ask people to do that, they overwhelmingly prefer more sycophantic responses, just like they seem to prefer em dashes. So it's a little bit dangerous just going off what people like at the individual response level. You need to be a little bit careful.
Mignon Fogarty: I've heard people say that when they get those "Which one do you like better?" prompts, that’s a sign that a new model is coming soon. Do you think that’s true?
Sean Goedecke: Yeah, probably.
Mignon Fogarty: Interesting. So there were some other theories that we should probably talk about, which you pretty much dismissed. One is that em dashes are favored because they use fewer tokens, so they're more efficient?
Sean Goedecke: Maybe a little bit of context here: Language models don't think or talk in words, and they don't think or talk in letters. This causes a lot of interesting behavior that's hard to understand until you know about tokens. The basic unit of language for a language model is kind of like a word fragment. For instance, the word "semicolon" would probably not be represented as a single unit in a language model's brain; it would be represented as the combination of "semi" and "colon". Or, even more confusingly, it might be the combination of "sem," then "ic," and then "olon".
Before the model gets trained, it learns ways to break words up that are quite efficient at representing common words. Often that will mean it pulls out English prefixes, but it will also sometimes do it in ways that are a little less intuitive.
So this is one reason why the classic question to stump a language model used to be, "Can you count the number of 'r's in the word 'strawberry'?" It doesn't think in letters, so it was actually quite hard for it to do that. It doesn't think of "strawberry" as having "r"s in it; it thinks of "strawberry" as being made up of "straw," then "ber," and then "ry". It's sort of three units from the model's perspective.
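You can inspect token boundaries yourself with OpenAI's open-source tiktoken library. The exact splits depend on which encoding you load, so treat the printed pieces as illustrative rather than canonical.

```python
# Show how a tokenizer fragments words (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era encoding

for s in ["strawberry", "semicolon", "\u2014"]:  # the last is an em dash
    ids = enc.encode(s)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{s!r} -> {len(ids)} token(s): {pieces}")

# The model "sees" only the token ids, not the letters inside them,
# which is why letter-counting questions used to stump it.
```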
But coming back to em dashes, the theory here is that maybe em dashes are privileged because of this tokenization thing: the core way language models think might prefer em dashes because they're efficient to express as tokens. Is this correct? I mean, maybe. It's really hard to disprove stuff like this, but it just doesn't seem particularly plausible to me. Language models do all kinds of things that aren't efficient. As you know if you've ever talked to one, they go on and on and on; they don't mind rambling around the point. So this idea that they're using em dashes because em dashes are so succinct would be more compelling to me if I saw other evidence of language models trying to be succinct and succeeding at it.
Mignon Fogarty: And what about large collections of data like Medium or Wikipedia? Some people say that because the models were trained on those, and those sites use a lot of em dashes, that could have skewed the whole thing so that it favors them. Why is that not a great argument?
Sean Goedecke: I think something like that argument is broadly correct; it's just that it's not Wikipedia, it's the scanned books from the late 1800s and early 1900s. I am in favor of that kind of training-data explanation, but I don't think Wikipedia is a particularly compelling example, just because I think they would have trained on it for GPT-3. It was around, it was available, and it would have been right at their fingertips when they were trying to train the early models. If that's where the em dashes were coming from, we would have seen them in the early models.
When I posted my blog post about this a while ago, there was a lot of discussion, and I saw some people claiming that they were responsible for the em dashes. In particular, I think it was the CEO of Medium.com who said that he felt responsible for em dashes, because the people who built Medium were typography nerds and made sure that double dashes were automatically converted to em dashes. So Medium rendered more em dashes. I don't find that category of explanation compelling at all because…
Mignon Fogarty: Google Docs does that, and Word does that.
Sean Goedecke: The question is not why ChatGPT specifically uses the em dash punctuation mark; it's why it uses the em dash as a grammatical unit. Why it does em dash things. It could do them with a single dash if it wanted, and it would still be just as puzzling.
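For what it's worth, the conversion the Medium folks describe is a mechanical, SmartyPants-style text substitution, roughly like the sketch below; conventions vary, and some tools map "--" to an en dash instead.

```python
# Minimal SmartyPants-style dash conversion, as done by editors like
# Medium, Word, and Google Docs. Conventions vary between tools.
def smarten(text: str) -> str:
    text = text.replace("---", "\u2014")  # "---" becomes an em dash
    text = text.replace("--", "\u2014")   # "--" too, Medium-style
    return text

print(smarten("AI style--it keeps evolving"))
```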
Mignon Fogarty: I have to say, as a professional editor, I do love that I'm seeing more proper em dashes these days. Because people who aren't professional writers will often use the hyphen in place of a dash, and that drives editors crazy, and we always change it. And ChatGPT—it might overuse it, but it uses the right one. So that's nice to see at least.
Sean Goedecke: Well, I apologize because I know I've done that in my blog, so you probably encountered that in your research. I apologize for that.
Mignon Fogarty: I didn’t notice! That’s funny. I was so captivated by your argument.
Sean Goedecke: I've actually heard of people doing the opposite: deliberately using the short dash—the short, incorrect dash—instead of the em dash so that they can do em dash things without being immediately read as ChatGPT.
Mignon Fogarty: I've heard that now too, and it makes me sad and frustrated to see people changing their natural writing style to try to appear less like AI. We shouldn't have to do that. But I understand why people do it, too. I see people even talking about putting in typos so their work looks more human, and I'm like, "No! Please don't!" We should all still be good writers.
So you referenced some work from—I hope I get this name right—Maria Sakharova. She talked about the fact that there's so much more AI writing appearing on the internet, and it's going to have these hallmarks of AI writing that people talk about, like "delve" or the em dash. That writing is going to get sucked up into the new training data, and it's going to become this downward spiral where all these hallmarks get amplified even more in future models. But it sounded like you were saying GPT-5 maybe doesn't use as many em dashes. That seems like a reasonable argument, so what might be going on there?
Sean Goedecke: This is sometimes called the "well-poisoning" argument or the "synthetic data" argument. It’s gone by a couple of different names. People were talking about this a lot in the kind of earlyish days of language models. I remember in 2023 and early 2024, people were taking this very seriously. I think by 2026, it's pretty clear, certainly from where I'm sitting, that that hasn't happened and that it would have happened already if it was going to happen.
We've had three or four generations of language models trained since there was a ton of LLM-produced data on the internet. If it were unavoidable, if you could not avoid having an LLM poisoned by this, then all current LLMs would be poisoned, and I think it's pretty clear that they're not. That just hasn't happened.
Why hasn't that happened? It's probably some combination of AI labs putting in a lot of effort and being successful at filtering out examples of this from their training data. Absolutely, they've got the capability of doing that. But it could also be simply that it might just not be that big a deal. Language models are trained on all kinds of information right now; some of it very low quality, some of it very high quality. Part of the training process is the model learning to distinguish the one from the other. So it's possible that you could train a model on 75% synthetic data, and as long as it was properly guided, it would classify that as something to avoid rather than to copy and would end up okay.
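The feared spiral does have a classic toy demonstration: fit a distribution to data, sample from the fit, refit on the samples, and repeat with no fresh human data anchoring the loop. With small samples, the spread tends to drift and shrink across generations. This NumPy sketch is a statistical toy, not a claim about any actual model; Sean's point is that real pipelines break the loop with filtering and fresh data.

```python
# Toy "model collapse": repeatedly refit a Gaussian to its own samples.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=20)  # a small "human-written" sample

for gen in range(1, 51):
    mu, sigma = data.mean(), data.std(ddof=1)
    data = rng.normal(mu, sigma, size=20)  # next generation trains on outputs
    if gen % 10 == 0:
        print(f"gen {gen:2d}: std = {data.std(ddof=1):.3f}")
```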
Mignon Fogarty: Just yesterday, I was reading about this phenomenon where AI comes up with the same name for people in professions. If you ask it to generate a name for a software developer, it will almost always name that person Marcus Chan. And a sci-fi protagonist is very often Alara Voss, or some slight variation of that. Have you heard about this phenomenon, and if so, does it play into this in any way? How does that happen?
Sean Goedecke: Oh, that's fascinating. I hadn't heard those exact examples before, but I am familiar with the phenomenon. Language models are non-deterministic and they're hard to predict, but they aren't random. And when you have all the big AI labs training in basically the same way on basically the same data—which they are; there are lots of little subtle differences in how they're doing it, but all the people at these labs live in San Francisco, they all know each other, they all talk, they all go to the same parties—like fundamentally they're doing very similar things.
I think what we're seeing is that that kind of similarity leads to very similar results, even at the level of specific questions like what you name characters. It's just that the space of all possible language models is a little more defined, I think, than we might have expected from the start. And there are attractors in that space that are more specific than we expected. You know, if you do the same things, you're going to get the same outcome. Why do these particular examples happen? I could speculate, but I don't know. It's some pattern in the training data that's getting focused on.
Mignon Fogarty: The thing that made me think of it is that I went to Amazon, and I searched for "Alara Voss," and I could see there were multiple self-published novels that had that same protagonist. People were saying, "Well, this is how you can tell the book was most likely written with AI," because it's picking this name that is so common coming out of AI. And it made me wonder, well, if those books were then used in future training, like we were talking about with the em dash, is that going to reinforce the idea of these common names appearing again and again? Maybe if the companies are aware of it, they can just filter it out so it doesn't happen, or maybe it's at such a low rate that it wouldn't matter. But it seems like if it's not happening with the em dash, then maybe it wouldn't happen with these unusually common names either. I don't know. What do you think?
Sean Goedecke: I would not expect this to be a huge problem for new generation models. I would not expect it to become more common that models name sci-fi protagonists the same thing; I would expect that to fade away, just by inference from the previous examples of common model behavior that have been trained away. The em dash is kind of one of those things. The reason it hasn't been fully trained away, I think, is because people like it. They could eliminate the em dash if they wanted; I think it just does so well when they do human feedback testing that they've kept it in.
Mignon Fogarty: Sam Altman gave an interview where he said they put it in because people liked it. Then people were debating whether that was an intentional thing or maybe he was saying they saw that people liked it in the human feedback part.
Sean Goedecke: I certainly don't think they intentionally put em dashes in the model. I don't think it works like that. To the extent that it was intentional, it was noticing an existing behavior and choosing to reward it and keep it in rather than try and get rid of it.
Mignon Fogarty: Well, speaking of signs of AI writing, in the bonus segment, we're going to talk about AI detectors and the flaws in them. We're going to talk about something that might be a little controversial—Sean suggests that people should write for AI, which will be an interesting discussion—and then we're going to get his book recommendations. Sean Goedecke, thank you for being here for the main segment of the Grammar Girl podcast. What’s your blog? Where can people find you?
Sean Goedecke: Thank you so much, Mignon. People can find me at my blog, which is www.seangoedecke.com. That’s G-O-E-D-E-C-K-E. And that’s basically it; I don’t have a social media presence, I just write on my website.
Mignon Fogarty: Wow! Wow, old school!
Sean Goedecke: Very much.
Mignon Fogarty: Well, thank you so much for being here.
Sean Goedecke: Thanks so much.