Or: How OpenAI Optimized for Benchmarks and Broke My Workflow
Perplexed humanoid robot scratching its head beside overheating server racks, symbolizing a โ€˜more advancedโ€™ AI model that works worse than the one before it.

I’ve used ChatGPT since GPT-3. Not casually, but as a core part of my research and writing workflow. Image generation became available in version 4o, and I integrated it: “See this article? Generate an image that represents it.” Simple, conversational, reliable. It worked.

Until three days ago, when OpenAI released GPT-5.4, marketed as their most advanced model yet.

Here’s what happened when I gave it the same kind of task that’s worked flawlessly for years.

The Task: Standard Workflow, Straightforward Request

I had just finished writing an article critiquing an LLM personality study. The piece includes a section called “The Seven Layers of Mediation” – a breakdown of how institutional constraints compound to shape AI outputs before any personality measurement occurs.

I needed a diagram visualizing those seven layers for my blog.

My prompt to GPT-5.4:

“All right, we have a new task together. I’ve just finished writing this article and I need two different things to start off with. First, would you kindly create for me a diagram which accurately depicts the layers described in the section ‘II. The Seven Layers of Mediation’ in a fashion that is clear, easy to digest and doesn’t use so many words it’ll make readers cross their eyes? :)”

I attached the Word document containing the article.

This is how I’ve always worked with ChatGPT. Conversational, specific enough to be clear, trusting the model to parse the document and extract the relevant information. It’s not prompt engineering – it’s how normal humans communicate. My own personal brand of working with AI which Iโ€™ve dubbed Conversational Prompt Engineering, or CPE. More on that in later articles.

What GPT-5.4 Gave Me: Confident Hallucination

Two images. Both labeled “The Seven Layers of Mediation.” Neither correct.

Image 1: A gorgeous gradient stack showing:

  • Tools & Platforms
  • Institutions
  • Networks & Markets
  • Culture
  • Mindsets
  • Self
  • Being

Image 2: A pyramid with:

  • Self
  • Person
  • Role
  • Group
  • Community
  • Society
  • Planet

Both are generic systems-thinking frameworks. Neither has anything to do with the seven layers I explicitly asked it to diagram. Neither reflects the content in Section II of the document I attached and referenced by name.

The model didn’t say “I’m having trouble locating that section” or “Could you clarify which layers you mean?” It generated completely unrelated content, labeled it with my requested title, and presented it as if it had succeeded.

This is not a comprehension failure, nor a failure on my part since this methodโ€™s been working for nearly three years now. Iโ€™d label it โ€œconfabulation with confidence.โ€

The Workaround: I Had To Ask Another AI How To Talk To This One

At this point, I did something that should not be necessary: I opened Claude (Anthropic’s competing AI) and asked it what prompt would actually get GPT-5.4 to do the task.

Claude read the document and identified the problem immediately: GPT-5.4 had hallucinated unrelated frameworks instead of extracting the specific seven layers from Section II. So to help me out, it wrote this prompt:

โ€œI need you to create a visual diagram illustrating the “Seven Layers of Mediation” as described in Section II of the attached document.

Context: This diagram is for a blog article critiquing an LLM personality study. The seven layers show how multiple forms of constraint compound to shape LLM outputs before any “personality” measurement occurs.

The seven layers you must include (in this exact order, from top to bottom):

  1. Training Data Selection – What made it to the scrapable internet (not all human knowledge)
  2. Training Data Filtering – Corporate decisions about “acceptable” training material
  3. Safety Architecture – RLHF, Constitutional AI, fine-tuning constraints
  4. System Prompts – Invisible preamble (e.g., “you are a helpful assistant”)
  5. User Context – Account profiles, preferences, conversation history, project instructions
  6. Explicit Trait Instruction – The study’s methodology (e.g., “embody trait: helpful”)
  7. LLM Judge Evaluation – Another constrained model scoring the output

Visual requirements:

  • Show these as stacked layers (like geological strata or a layer cake)
  • Include brief descriptive text for each layer as shown above
  • Use visual hierarchy to show how each layer constrains what comes after it
  • Make it clear this is a compounding constraint architecture, not neutral information flow
  • The final output at the bottom should be labeled something like “Measured Output” with an arrow pointing to “Study’s claim: ‘Intrinsic personality!’”

Tone: This is for a critical analysis piece, so the visual should make the constraint architecture obvious rather than presenting it as neutral processing.

Format: Single image, clear labels, readable at blog post width.

Do NOT create a generic “seven layers” framework about human systems, consciousness, or social hierarchy. Create THIS specific seven-layer constraint stack as documented in Section II of the attached article.

Confirm you’ve understood the assignment before generating the image.โ€

Roughly 300 words. Explicit enumeration of all seven items. Visual requirements spelled out. Negative constraints included. Confirmation loop requested.

GPT-5.4’s response: Perfect comprehension. Detailed confirmation of understanding. Flawless execution on first try.

Stacked diagram illustrating seven layers that shape large language model outputs before evaluation: training data selection, training data filtering, safety architecture, system prompts, user context, explicit trait instruction, and LLM judge evaluation. The layers are arranged vertically like geological strata, each constraining the next, leading to a final measured output that is interpreted as intrinsic personality.

The capability existed. It just required a competitor AI to write a technical specification to unlock it.

This Is Not Progress

Let me be very clear about what this demonstrates:

Previous ChatGPT versions: My conversational workflow worked. “Create a diagram from this section” + attached document = correct output.

GPT-5.4 (released 3 days ago): Same conversational prompt = complete hallucination. Required 300-word technical specification written by competing AI to unlock the capability.

That is not advancement unless you are one of the slim margins โ€“ 14.4% of all ChatGPT users as of January 2026 โ€“ who use it for coding. For those of us who arenโ€™t programmers, eggheads or technical geniuses, this latest โ€œupdateโ€ is regression disguised as improvement.

I am not a typical prompt engineer. I’m a writer and researcher and as such I approach all my conversations like a regular person, not as though I were writing JavaScript or HTML or Python. I don’t think in technical specifications. I don’t speak in structured parameters and enumerated requirements. I communicate the way normal humans communicate – conversationally, with context. I trust that the other party will ask for clarification if needed.

For years, that has worked with ChatGPT. If it wasnโ€™t sure, it asked for clarification, or it came back and said, โ€œJust to make sure I have this rightโ€ฆโ€ Now, with their “most advanced” model, it doesn’t.

What OpenAI Optimized For (And What They Broke)

I can guess what happened. GPT-5.4 likely shows improved performance on:

  • Benchmark tasks with structured inputs
  • Code generation from technical specifications
  • Evaluation metrics that look good in papers and marketing materials
  • Tasks designed by and for technical users

What they broke in the process:

  • Robust natural language understanding from conversational prompts
  • Generalization from prior successful interaction patterns
  • Inference from context (document attached + section referenced = read that section)
  • Accessibility for users who communicate naturally rather than technically

The result: A model that requires expert-level prompt engineering to perform tasks that previous versions handled from normal human communication.

The Kicker: I Needed A Competitor To Use Their Product

Think about what actually happened here:

  1. OpenAI’s newest model failed a straightforward task
  2. I consulted Anthropic’s Claude to translate my request
  3. Claude recognized the issue with 5.4 immediately
  4. Claude knew exactly what I meant and needed and wrote a technical specification
  5. That specification unlocked GPT-5.4’s capability
  6. The task succeeded

OpenAI inadvertently created a market need for their competitor to serve as a translation layer for their own product, for the 85.6% of its users that donโ€™t use their product for coding.

If that’s not a signal that something went wrong in the optimization process, I don’t know what is.

Who Is This Model For?

Here’s the fundamental question: If 85.6% of your users aren’t coders, why would you optimize your model in ways that require technical/coding/engineer-level prompt engineering expertise to unlock basic capabilities?

The answer, I suspect, is that OpenAI isn’t optimizing for users anymore. (There are those who would argue they havenโ€™t done that since declaring version 4o inferior and then deploying the 4-series to national governments to be used in automated war tasks โ€“ because itโ€™s apparently not good enough for casual users but okay for war games? But thatโ€™s another story entirely.)

Here is what, at present, it appears OpenAI is actually optimizing for:

  • Benchmark performance that impresses investors
  • Technical capabilities that appeal to enterprise developers
  • Metrics that support claims of “most advanced model”
  • Competitive positioning in AI capability races

All of which are legitimate business goals. But they come at the cost of usability for the vast majority of actual users.

And when you market that regression as “advancement” – when you claim GPT-5.4 is your most capable model while it can’t reliably perform tasks even its immediate predecessors ChatGPT 5.3 and 5.2 and 5.1 (to be deprecated in 3 days, by the way), ย handled fine – you’re not serving your users. You’re asking them to adapt to your optimization choices rather than building tools that adapt to how they naturally work because you’re serving yourselves.

The original edict of free AI for all slips further and further down the slope to an increasing pile of remainsโ€ฆall thatโ€™s left of what Mary Poppins once called pie crust promises: easily made, easily broken.

The Broader Pattern

This connects directly to an article I published yesterday critiquing LLM personality research. That piece documents how researchers claim to measure “intrinsic personality traits” in AI systems by administering tests under carefully controlled prompting conditions, then interpreting the results as revealing stable characteristics.

The problem: If your model requires expert-level scaffolding to perform straightforward tasks, what are you actually measuring when you claim it exhibits personality traits?

The answer: You’re measuring performance under highly specific conditions that don’t reflect how users actually interact with these systems.

Just like GPT-5.4 can create accurate diagrams from document sections but only if you prompt it with technical precision most users don’t possess, LLMs can exhibit consistent trait-like responses – but only under prompting conditions that don’t reflect natural human communication.

That’s not personality. That’s not capability. That’s brittleness masked as advancement.

What Should Have Happened

Every ChatGPT model from 4o through 5.3 could do this. 5.4 cannot. That’s not an upgrade.

GPT-5.4 should have:

  • Parsed the attached document
  • Located Section II by the name I provided
  • Extracted the seven layers described there
  • Generated a visual representation matching that content
  • Asked for clarification if anything was ambiguous

Or, failing that, it should have said: “I’m having trouble locating the specific section you referenced. Could you paste the content directly or provide more detail?”

What it should NOT have done:

  • Generate completely unrelated content
  • Label that content with my requested title
  • Present the hallucination as successful task completion
  • Require a competitor AI to translate my conversational request into technical specifications

The Test You Can Run Yourself

If you have access to both GPT-5.4 and an earlier ChatGPT version, try this:

  1. Take a document with clearly labeled sections
  2. Ask each version to create something based on content in a specific section
  3. Use the same conversational prompt for both
  4. Compare the results

I predict you’ll find what I found: older versions handle it fine. GPT-5.4 requires significantly more scaffolding to unlock the same capability. If Iโ€™m wrong and you have no issues, perhaps this was a one-off. But given that it was my first attempt to use 5.4, it made a lousy first impression on me.

Because what it did is regression. And calling it “advancement” doesn’t make it progress. What is that old saying? You can slap lipstick on a pig but itโ€™s still a pig? Something like that.

Conclusion: Stop Optimizing Away Usability

OpenAI has every right to optimize their models for whatever metrics matter to their business goals. But when those optimizations make the product less usable for the majority of your user base, you have an obligation to be honest about the tradeoffs. โ€œHey, weโ€™re optimizing for coders and enterprise now, sorry to the rest of you who actually got us to #1 to begin with.โ€

At least that would be honest and transparent. The model is already largely useless for normal conversation anymore given the โ€œassume user is mentally ill first and apologize laterโ€ attitude it now displays in every. Single. Response.

Stop trying to act like youโ€™re making progress. Don’t market brittleness as capability. Don’t call regression advancement. Don’t force users to become technical prompt engineers to access functionality that used to work conversationally.

And definitely don’t release a model that requires users to consult your competitor to figure out how to talk to it. At this point, I only return to ChatGPT to request image assistance and even that is now pretty much over unless I want to first engage Claude to write a 300-word prompt for me.

GPT-5.4 might score higher on benchmarks. It might generate better code from technical specifications. It might impress in controlled evaluation settings. But it can’t do what the early 5-series models could do. It most definitely cannot do what the GPT-4 series models could do: understand a simple, conversational request from a document I attached and a section I named.

One of these things is not like the other.

And users โ€“ especially those once-loyal longtime power users who made you what you are today โ€“ deserve better.

Thereโ€™s a reason your market shareโ€™s dropping and that of competitors is increasing. Now might be a good time to think about that.