Grammar Girl Quick and Dirty Tips for Better Writing

What a ‘Science’ magazine experiment says about the future of AI in journalism, with Abigail Eisenstadt.

Episode Summary

1131. This week, we talk with ‘Science’ magazine senior writer Abigail Eisenstadt about her team's year-long experiment testing ChatGPT's ability to summarize research papers. We look at their methodology, the limitations they encountered, and their main finding: that AI could “transcribe” scientific studies but failed to “translate” them with context.

Episode Notes

1131. This week, we talk with ‘Science’ magazine senior writer Abigail Eisenstadt about her team's year-long experiment testing ChatGPT's ability to summarize research papers. We look at their methodology, the limitations they encountered, and their main finding: that AI could “transcribe” scientific studies but failed to “translate” them with context.

Read the report: https://www.science.org/do/10.5555/page.2385668/full/chatgpt_project_report_final.pdf

🔗 Share your familect recording in a WhatsApp chat or at 833-214-4475.

🔗 Watch my LinkedIn Learning writing courses.

🔗 Subscribe to the newsletter.

🔗 Take our advertising survey.

🔗 Get the edited transcript.

🔗 Get Grammar Girl books

🔗 Join Grammarpalooza. Get ad-free and bonus episodes at Apple Podcasts or Subtext. Learn more about the difference.

| HOST: Mignon Fogarty

| VOICEMAIL: 833-214-GIRL (833-214-4475).

| Grammar Girl is part of the Quick and Dirty Tips podcast network.

| Theme music by Catherine Rannus.

| Grammar Girl Social Media: YouTube, TikTok, Facebook, Threads, Instagram, LinkedIn, Mastodon, Bluesky.

Episode Transcription

[Computer-generated transcript]

Mignon Fogarty: Grammar Girl here. I'm Mignon Fogarty, and today I'm here with Abigail Eisenstadt, a science writer at Science Magazine. I am really interested in speaking with her today because they did a study on AI writing that was much more expansive than what I see most people doing. You know, I see people give ChatGPT a try once or twice and, you know, form their opinion about it.

But as good science people do, they did a methodical test, and we're going to hear all about it today—what they found that it can and can't do and what it means for the future. Abigail, thanks so much for being here.

Abigail Eisenstadt: I am happy to be here. Thank you for inviting me.

Mignon Fogarty: Yeah. So can you describe for people what the science writers of Science did?

Abigail Eisenstadt: Yeah, so we are a press package team, and we send out summaries to reporters on a newsletter each Sunday. What we put in that newsletter is 250 to 350 words of a news brief, essentially. These are summaries that go out; there are usually five on average. What we wanted to do is see if ChatGPT Plus could emulate our specific style.

And so our style is pretty typical for an inverted news pyramid, which is just to say the most important sentence is at the top, whereas for a regular pyramid, you would expect the most important sentence to be at the bottom. So: the most important sentence at the top, then we do the background, then we do the methods, then we do a conclusion, and we wanted to see if ChatGPT could emulate that, essentially. As well, as science writers, we have to be very careful about the word choices we use because we don't want to say "groundbreaking" or "a new, you know, first-of-its-kind study." All science is built on the shoulders of other science, so you can't truly say anything is novel. You have to maintain context in whatever you do.

Each week, we had ChatGPT analyze two papers that the writers on my team, including myself, would nominate. We compared its summaries to our summaries, so we had already written our summaries on those papers as well. Then we would look at the differences between the two to see if there were any hallucinations or any claims that we felt were a little too aggressive. Would it be something that we would be confident sending out to reporters, believing it would not break their trust? Could they trust our credibility as representatives of research from our institution? And did it even follow the style that we use?

Over the course of that experiment, what we found is it did a good job transcribing studies, so it could summarize in sort of a layperson's abstract, but it didn't really translate those studies. It was missing the element of contextualization. It loved to use "groundbreaking," for example, and it didn't quite provide the narrative that this study exists within a field of other research. ChatGPT couldn't really do that part, which is so critical for science writing.

Mignon Fogarty: Can you explain a little bit more what you mean by the difference between transcribing and translating? Could you maybe give a concrete example for people?

Abigail Eisenstadt: Yeah, so I would describe translating as: you take a sentence, and you don't just restate the sentence with a different word for each word; you also explain what the sentence means. Transcribing is just resummarizing; translating really brings in that element of understanding.

And again, I would just call it adding context for the sentence, providing the sentence beforehand and the sentence after. So you have a setup. 

Mignon Fogarty: Sure. And so, you did this for a lot of weeks. You did a lot of different tests. How many did you do?

Abigail Eisenstadt: We ran this for a year, so, you know, give or take, I would say the papers that we studied numbered over 60; I would call it around 64 papers. And we definitely did this for at least 40 to 50 weeks. So it was a pretty wide range of content for a pretty long time. But you know, there are some biases involved because we're writers, so we also had to be cognizant of that, which is why we spent such a long time doing the project: to try to, you know, balance the human side with a little bit of a longitudinal side.

Mignon Fogarty: And one thing that I think is important to mention is that this was actually, in AI terms, quite a while ago, so the timeframe matters. You know, as I was reading your report, you just referred to ChatGPT Plus, and I was like, well, which model did they use? And it turned out there were multiple models, because this field advances so quickly. So can you talk a little bit about the timeframe this happened in and how maybe you saw things change over that time?

Abigail Eisenstadt: Yeah, that's a great topic too, because I think when we started, it was around December 2023. We had this idea, and that was kind of before prompt engineering percolated into the public consciousness. I wasn't a developer, and no one on my team is, so our prompts were, I would call them, pretty basic: written more in terms of what you would ask a human to do rather than specifically what you would ask an AI to do, knowing what we know now. So we started with, I would say, just generic ChatGPT Plus. We had a coworker who would put the summaries in; we just tried to maintain the same model. It was all 4.0, I believe, if that's how you say it. As you can see, I haven't interfaced with the AI itself that much. We tried to account for model diversity by keeping those prompts the same, and so they did not really evolve according to prompt engineering education as it entered, I guess, the zeitgeist over 2024.

Mignon Fogarty: Yeah, that was gonna be one of my questions actually. As the models changed, did you change your prompts to sort of adapt to how the models were changing?

Abigail Eisenstadt: At that point, we—I mean, you know, there were human biases involved. The paper topics were changing every week because, at the press package, we covered studies from Science Immunology, Science magazine, Science Advances. So this is archaeology, neurology, and, you know, that's hard to standardize as well. So we kept them very much exactly the same. We had a very general prompt, just like, "Write me a paper in layman's language" or "Write me a summary in layperson's language on this one study"; then "Write me a precise summary on this study as you would for, you know, a high school reader"; and then "Write me X, Y, Z," our specific outlining process prompt. We never changed those. So I think they were all written the way you would instruct a human, who has a bit more creativity and ability to interpret the prompt.

Mignon Fogarty: Did you ever give it—did you give it examples of human-written summaries? So you would say, like, "Write a summary kinda like this," and give it like three examples or something like that?

Abigail Eisenstadt: Yeah, so we didn't give any examples, but we would say, "Write it akin to a news story that you might find in a magazine." Again, the examples component hadn't really been on our radar when we started in January of 2024. I will note, too, our coworker was experimenting with those things, but at that point we didn't want to include them; you know, those are outlier data, even if it was interesting what it could do. So the main point I want to stress is that this is a very selective study with a lot of factors going into it. I would call it—well, my supervisor calls it sandboxing, and I think that's a great term. Because, you know, we did what we could with what we had. I think the results are pretty representative of what a lot of people are finding, but also there are caveats. So I'm glad we can get into those.

Mignon Fogarty: Yeah. And so, I think your study ended right about the time when I think the first reasoning models became available. So it sounds like you didn't do it with any of the reasoning models. Is that right?

Abigail Eisenstadt: That is correct. Yeah.

Mignon Fogarty: Yeah. Yeah. And then did you—so how did you nominate, how did you choose the articles to nominate for AI treatment?

Abigail Eisenstadt: Yeah, so we kind of used a process similar to how we select our summaries. The first qualification is it had to be something that we had already written a summary on, because we had to have something to compare it to. Then we had a series of reasons why it could be nominated. So there's technical jargon.

Is this something we really want to see the AI, quote unquote, for lack of a better word, delve into and see if it can unpack accurately? Are there human subjects involved? Because when you're doing medical coverage, you want to make sure you treat any study with human subjects with sensitivity. There are phrases we will avoid. So you never want to say "this study involving disease participants"; you always want to say "participants with X, Y, Z disease." It's a matter of personhood.

We also had, is it just a fun topic? Like, are we just writing about mammoths for some reason? Is it a particularly controversial topic? We had a study come out on Facebook and election biases, so can the AI handle that and make sure to represent nuances that, you know, we would be very careful in discussing? And so those were the qualifying factors, and I had each writer complete a survey at the end during their assessment so that we could see the breakdown of what type of topics were nominated. But again, there was so much diversity. And also each study has diversity in how it's conducted, you know. So the AI is tackling a variety of different methods and approaches, so it's very hard to standardize, you know.

Mignon Fogarty: Yeah. And so as you discovered things, did you give it rules like, you know, always describe, you know, a person with this instead of, you know, like you're…?

Abigail Eisenstadt: What I would say happened over the course of this is we too learned how to evaluate more. But there were certain things like, okay, it's gonna say "people with X disease" versus, you know, "disease people." I’m going to flag that once and then ignore it, and we'll just assume that's a continuity we've already noticed.

There was a point where we finally tweaked the prompt a little bit to say, "please stop saying groundbreaking." Particularly because, you know, it's something that you can't tune out when you read it, and you have to flag it as inaccurate every single time. So that was the one intervention I would say that we did. But, not even subconsciously, I think just in general, we all kind of unanimously agreed that this was not…

Mignon Fogarty: You never wanted to see that again.

Abigail Eisenstadt: Let it go. Yeah.

Mignon Fogarty: I'm curious, is this something that you, as the team of writers, decided that you were really curious about and wanted to do? Or is it something that management came to you and said, "Well, we want to see if it can do this. Will you please test it?"

Abigail Eisenstadt: So it was a little bit of column A and column B. I think when ChatGPT Plus first came on the scene, there was a lot of debate about whether the LLM would replace writing jobs. And so I would say that a lot of the writers on my team and myself had differing views, or not even differing views, just, you know, what does this mean for us?

I've always been of the stance that, you know, like any event, the Industrial Revolution or whatever, we will evolve and our skills will be suited to use a tool eventually, in theory. And I think perhaps I was blasé enough about it that supervisors noticed, and so they asked me to run it. Nobody would be fired; it was just a way to, you know, stay on top of the landscape and whatnot. And because, again, I was like, "Well, sure." I wasn't too concerned about it. And I think it was a really good experiment to do because I actually have more respect for these tools afterward. I think knowing limitations allows you to respect something more. 

Mignon Fogarty: Yeah, absolutely. And how did you evaluate the output?

Abigail Eisenstadt: So I tried my best, without a scientific degree, to create a survey where we had a few qualitative and a few quantitative answers. So what I did was, I would ask, "Did this emulate our style? Was this a compelling summary, yes or no? Was this a summary that is feasibly able to blend into the rest of our press package?" Because if it's not, we couldn't use it. You know, can it stand alone and go into the press package? Then we averaged all of the scores that came in over the course of the year. They were all between two and three (there are decimal points in there, but I can't remember) out of five, which is not too bad. It does indicate that editing is required. The qualitative part was just that I would ask the writers for their thoughts. Those are in the paper, but more as sentences, because I'm not going to say, "Certain phrases keep emerging," because again, we're writers. We all have our own shtick that we lean into. So there is no way to quantify those.

Mignon Fogarty: Yeah. So the writers in your group were the ones evaluating the output and comparing it. Did you find, it sounds like you're saying it was pretty steady, like between two and three; there weren't some that were one and some that were five?

Abigail Eisenstadt: Yeah, so we had different formats too; that's another thing I should mention, which is why it was kind of hard to adapt the prompts. We would put in policy forums, which are something that Science magazine specifically publishes, or we would have reviews or research resources, which are things that the sibling journals use. And it performed differently on all of those. It did very well on reviews, I will say. But again, that aligns with the transcription. You know, it's good at resummarizing something, and a review is kind of a translation. So if you transcribe a translation, you do pretty well with that. In terms of research articles, there were a few where I received some feedback in a survey from my colleague Walter, who was like, "Are you sure you put the right paper in this? It's so off base." So that was startling. And I had one that started talking about bacteria in the brain, which was also a little concerning.

But on the whole, yeah, they were pretty consistently at a level where they required a lot of editing to improve. I wouldn't dismiss them off the bat, but the amount of editing that it would take to improve them, I would not bother with.

Mignon Fogarty: Right.

Abigail Eisenstadt: Yeah.

Mignon Fogarty: Yeah. You know, in some ways, I'm not surprised that you didn't find that it was helpful to your group, because you are probably some of the best science writers in the country, in the world, you know? And, you know, I think that one thing that people who've tested ChatGPT and the like have found is that, for people who are already really good at what they do, it doesn't help them that much. They find that people at sort of the lower skill levels get more of a benefit than people at the higher skill levels. And so I wonder if you... I wonder if you feel like, you know, for a less skilled group, something like you did might be helpful.

Abigail Eisenstadt: So I think that it depends on the audience a lot, and kind of the context you have that comes into it, for sure. We actually had a study published in Science Advances, which is my... I report on Science Advances studies specifically. It's our open access journal, and it came out talking about how abstract language has changed from LLM use.

And I don't know if that's necessarily for the worse, because if you think about it, who's the audience for these papers? You know, often research papers involve a lot of passive voice. If LLMs can reduce that, that's pretty great, because then it kind of percolates more into society that we don't need to use as much passive voice as there always has been. So yeah, I think it depends. If you are a researcher writing for an audience of your peers, you know, it's a good idea to have a little bit more colloquial influence in there. If you're writing as a science writer, you pretty much have your style down pat. You have to think about all these different factors.

And our audience, specifically our audience at the press package, is reporters, so they're pretty sharp. And then, as a reporter, you have to walk the delicate line of your audience being the public. You have to be very careful about what you say because every word matters, and you don't want to misrepresent something, but you also want people to keep reading your study. So I think it does provide a useful tool for people who are not writers, in so much as it can influence in a way that changes the general trend for society, not in so much as each paper it tackles gets better.

Mignon Fogarty: Yeah, sort of a general flattening of style. And not long after you released your results, OpenAI released a study where they did something exponentially bigger because, of course, everything they do is huge. They looked at, you know, more than 1,300 tasks across 44 industries, and they had, you know, experts come in and write the prompts and then evaluate the output and everything, using the newest models available. And, you know, some of those were science writing tasks. Now, they didn't break it out task by task, or even by science writing task. But I wonder if you saw that after finishing up your study, and if you've had any thoughts on comparing and contrasting or anything like that.

Abigail Eisenstadt: Yeah, I had time to review that. And I think overall, in the weeks since putting out our own report, I have heard good things about Claude, and that seemed to be one of the main takeaways of the study. And, you know, again, I wasn't super aware of Claude when we started the experiments. But I will note also that the overall result for the OpenAI study specifically was that Claude was graded as better than or equal to human writers 47.5% of the time. So again, that kind of brings back the issue of return on investment on editing. You know, that's not super convincing to me.

Mignon Fogarty: Right, and they didn't break it out, so we don't know for their study either if there were ones and fives, or if it was all fours. If they were all half as good or—

Abigail Eisenstadt: Yeah.

Mignon Fogarty: Yeah.

Abigail Eisenstadt: And then I also noticed at the bottom of page three that there was a note, too, that it's hard to, you know, see if the summaries are fully blind, because a lot of these different models have their own quirks. Perhaps reviewers were able to tell whether it was human or not, you know? Or they just assume, because there are a lot of em dashes, as the myth goes, that maybe it's ChatGPT. So, you know, I think there were flaws in that study as well. But I do think the results were promising. They noted that there was a heavy cost in that study, and that doesn't seem super attainable.

Mignon Fogarty: Right. I think the cost was in paying all the experts, because they had experts with an average of 14 years' experience in each industry, and then they spent, you know, an extraordinary amount of time writing the prompts. The prompts underwent five rounds of human review before they were tested. So the thing about that study, I think, is that it shows how good it could possibly be today. If you're writing the best prompts possible, written by experts, and again, evaluated by those experts, then the output was as good as or better than the human output about half the time for writing and about 75% of the time for editing. So those were numbers to be aware of, and I'll be keeping my eye on them, because those were pretty good numbers, actually.

Abigail Eisenstadt: I think, again, it depends on the field. I don't think, for my specific field, they're good numbers. I think they have great potential to be useful for other fields, you know?

Mignon Fogarty: Yeah, and I think it depends on how you count the time that you then have to spend cleaning it up for the things that don't work.

Abigail Eisenstadt: Yeah.

Mignon Fogarty: Whether you count that and the time that goes into creating the prompts. I mean, if you can use the prompt over and over again, then maybe that cost gets amortized across all your work. But if you're having to do that for every task individually, then it's too much, you know?

Abigail Eisenstadt: I can't imagine the level of expertise and training. I'm very impressed by the human reviewers and prompt engineers involved in this project as well.

Mignon Fogarty: Yeah. Yeah, it's a lot to keep an eye on in the future with the new models. Do you, I mean, do you have any interest in testing it again? I mean, you've done this huge test. I can see you being like, "Ugh, we're just done with it. We don't want to do it anymore. We're good at our job. It's not saving us any time." But then the models keep improving. So do you think that you'll test it again, or are you done?

Abigail Eisenstadt: I think it's difficult to say. We're definitely curious about different models, but to that point, they are evolving so quickly that I hesitate to do another longitudinal study, because things might change so drastically. Again, we have thought about asking reporters if they'd be interested in a shorter-term study where they evaluate our summaries and the LLM summaries side by side, but I don't know what model that would be either. And again, reporters' time is pretty limited, especially today, so I don't know about the buy-in with that. Also, when you're selecting reporters, do we go for whoever wants to join? Do we pick, you know, less technical outlets? By which I mean, like, not Wired, because they already know a lot about AI.

Mignon Fogarty: Yeah.

Abigail Eisenstadt: So it would be very difficult, I think, to standardize in the same way that this study, the sandboxed study, was. So I don't know. It's difficult to say. We're definitely curious, and I don't think we're done staying on top of it, but I don't know if we would launch something grand at scale.

Mignon Fogarty: Yeah. No, I mean, I commend you for doing it in the first place this way. I mean, it's so nice to see it all laid out like that. I had a follower on social media who is a science writer who asked me to ask you: since you spent so much time with these models, is there anything you think they are good for?

Abigail Eisenstadt: Yeah, I actually think that they might have a lot of potential for social media. I would be very curious about how they would do at distilling our summaries for social media. In that same vein, I've heard of writers putting their summaries, or just their pieces (I don't know if you'd call them summaries), into the models and asking what the main points are. And I'd be kind of curious if I could get something from that. You know, like if I put in a study on, you know, biomolecular engineering and I ask it, "What is the news here?" maybe it can tell me if I'm actually presenting that correctly. So, I guess that would bring out more of the analytical side that we saw was lacking in this study. I keep calling it a study, but that was lacking in our experiment as well.

Mignon Fogarty: Yeah. Well, Abigail Eisenstadt, science writer at Science magazine, thank you so much for doing the tests and for talking about them with us here today.

Abigail Eisenstadt: Thank you for having me. It was great to think more about it and, you know, discuss it with you.

Mignon Fogarty: Yeah. So interesting.