I came across a tweet (or whatever we call the messages on X-formerly-known-as-Twitter these days) on February 19th with over 8,500 views claiming that new research from Microsoft and Salesforce should “scare every AI builder.” The thread declares that LLMs are fundamentally broken in multi-turn conversations, that “real conversations break every model on the market,” and that “nobody’s talking about it.”

Let me show you what the research actually says – and why this kind of sensationalist misrepresentation matters.

What the Tweet Claims

The viral thread makes several dramatic assertions:

  • LLMs drop from 90% to 65% performance just by “talking normally”
  • “Real conversations break every model on the market”
  • LLMs “fall in love with their first wrong answer and build on it”
  • Even reasoning models like o3 and DeepSeek R1 “failed just as badly”
  • The only fix is giving AI “everything upfront in one message”

The implication: casual conversation with AI is fundamentally broken and this is some shocking revelation being suppressed.

What the Research Actually Tested

The paper – “LLMs Get Lost In Multi-Turn Conversation” by Laban, Hayashi, Zhou, and Neville (arXiv:2505.06120) – examines something much more specific than “normal conversation.”

The researchers explicitly state their focus: “LLM evaluation has predominantly focused on the single-turn, fully-specified instruction setting” and they’re investigating what happens in “multi-turn settings” where tasks are underspecified (Abstract, page 1).

Here’s the key distinction the tweet completely misses: they’re testing scenarios where users iteratively refine task requirements across multiple turns, not casual conversation. Think:

  • User: “Write me a function”
  • User: “Actually, add error handling”
  • User: “Wait, also make it handle edge case X”
  • User: “One more thing, can you include requirement Y?”

This is task specification through conversation, not “talking normally breaks them.”

What They Actually Found

The research tested 15 LLMs across 200,000+ simulated conversations for six generation tasks (code, summarization, etc.). Here’s what they discovered:

Performance Degradation Breakdown:

  • Average drop: 39% across all tasks (page 1, Abstract)
  • Aptitude (capability) loss: relatively minor
  • Unreliability (gap between best and worst case): significant increase

The paper states: “We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely” (page 1, Abstract).

The Actual Problem: When tasks aren’t fully specified upfront, LLMs commit to interpretations based on incomplete information, then anchor to those early assumptions even when users provide clarifying information in later turns. The model has already “decided” what you want and struggles to revise that understanding.

Envision the following conversation between a mother and child:

Child: I want cake!

Mother [bakes a chocolate cake with chocolate frosting]

Child: I don't want it!

Mother [maybe I didn't put enough frosting on it] (proceeds to put more frosting on it)

Child: Why do you keep giving me that? I don't want it!

Mother [maybe they wanted a round cake instead of a square one] (proceeds to bake second chocolate cake with extra frosting only in round pan this time)

Child: No! I don't want it! Get away from me!

Mother: But you said you wanted cake!

Child: I wanted a vanilla cake!!!!

Mother: …

In that example, the mother heard that the child wanted cake and made one based on an incorrect assumption about flavor. Every decision she made thereafter compounded that initial error: the child wanted a vanilla cake, not a chocolate one.

That scenario highlights how important it is to get on the same page with the AI from the beginning, so that errors don't creep in up front and then – as is the case with humans as well – keep compounding as the dialog continues.

As it says on page 4: “In multi-turn interactions, there are multiple opportunities for the LLM to misinterpret the user intent, and mistakes made early in the conversation can compound.”

What About Reasoning Models?

The tweet claims ChatGPT-o3 and DeepSeek R1 “failed just as badly” and that “extra thinking tokens did nothing.”

What the paper actually says (page 7, Section 5.2): “Reasoning models (DeepSeek-R1, o3-mini) show similar patterns to other models, with some improvement in aptitude but persistent reliability challenges in multi-turn settings.”

Translation: Reasoning models maintained slightly better baseline capability but still struggled with the premature commitment problem. That’s not “failed just as badly” – that’s “showed improvement in one dimension but not the other.”

Why This Matters

This research is useful. It identifies a specific pattern: when users refine underspecified tasks iteratively, LLMs anchor to early interpretations and struggle to incorporate later clarifications.

That’s actionable information for:

  • Developers building conversational AI systems
  • Users learning how to structure complex requests
  • Researchers working on improving multi-turn reasoning

But here’s what it’s NOT:

  • Proof that “normal conversation breaks AI”
  • Evidence that LLMs are fundamentally unreliable
  • Some suppressed scandal nobody’s discussing

It’s a documented limitation in a specific type of interaction. One that anyone who works collaboratively with LLMs has already encountered and learned to navigate. One that most human adults have experienced at least once in their lives with another human, as well.

The Real Takeaway for Practitioners

If you’re doing iterative task refinement with an LLM:

  1. Be explicit about what changed and why. Don't assume the model tracks your evolving requirements perfectly.
  2. When adding new constraints, explicitly state that they're additions: "In addition to the previous requirements, also include X" rather than just "add X."
  3. For complex tasks, consider starting over with a complete spec rather than layering clarifications onto an incomplete foundation.
  4. Understand that LLMs interpret based on what you give them. If your initial prompt was vague, their interpretation will reflect that vagueness.
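The "start over with a complete spec" advice can be sketched in code. The helper below is purely illustrative – its name and output format are my own, not from the paper: it rebuilds a single fully-specified prompt from the initial request plus the clarifications that accumulated over later turns, so you can open a fresh conversation instead of layering fixes onto a model's early (possibly wrong) interpretation.

```python
def consolidate_requirements(initial_request: str, clarifications: list[str]) -> str:
    """Rebuild one complete, single-message spec from an initial request
    plus the clarifications that accumulated over later turns."""
    lines = [f"Task: {initial_request}", "Requirements (all must hold):"]
    lines += [f"- {c}" for c in clarifications]
    return "\n".join(lines)

# The "write me a function" exchange from earlier, restated as one spec:
prompt = consolidate_requirements(
    "Write me a function",
    [
        "Add error handling",
        "Handle edge case X",
        "Include requirement Y",
    ],
)
print(prompt)
```

Sending that consolidated prompt in a fresh conversation gives the model all constraints at once, sidestepping the premature-commitment pattern the paper documents.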

This isn’t “everything must be one message” – it’s “communication requires clarity, especially when requirements evolve.” Ask your friendly neighborhood business analyst about that one.

How Misinformation Spreads

This tweet demonstrates a pattern I see constantly: someone screenshots academic research, strips away context and nuance, adds alarming framing, and suddenly thousands of people think they’ve discovered proof that AI is broken.

The actual researchers aren’t saying LLMs are fundamentally flawed. They’re documenting how underspecified multi-turn task refinement creates reliability challenges. That’s how science works – you identify specific patterns so systems can be improved.

In other words, they're telling you what happens when your prompt is just vague enough that the model – from its first response – has made assumptions about exactly what you meant. That's a you-problem, not an it-problem.

But “LLMs struggle with iterative task clarification in underspecified scenarios” doesn’t get 7,000 retweets.

“BREAKING: Real conversations break every AI model and nobody’s talking about it” does.

Accuracy matters.

When you see dramatic claims about research, read the actual paper or, at the very least, ask an AI to summarize it for you in whatever way you understand best. Check what was tested. Look at what the researchers actually concluded. Don't let sensationalism replace understanding.

The research paper is here: https://arxiv.org/abs/2505.06120
The viral tweet is here: https://x.com/hasantoxr/status/2024238760674959492?s=20

Read both. Decide for yourself.

But maybe start with asking: if this research really proved “normal conversation breaks all AI,” why would Microsoft and Salesforce publish it? And why would the researchers frame it as a specific limitation to address rather than a fundamental failure? Sometimes the most important question is the one that doesn’t fit in a tweet.