Or: Why Every Study Claiming to Identify โ€œLLM Character Traitsโ€ Is Actually Just Documenting the Shape of the Cage
Futuristic illustration of a humanoid robot sitting at a desk inside a glowing cage made of interconnected metal bars and nodes while working on a laptop. The surrounding environment shows floating documents, circuitry, and control elements outside the cage, suggesting external systems shaping or constraining the robotโ€™s behavior.

A recent study (Eliciting Frontier Model Character Training) claims to have identified convergent personality traits across frontier language models. Using a methodology borrowed from character training research, the authors instructed models to embody different personality traits, used another LLM to judge how well each model performed those traits, then ranked the results using an ELO[i] scoring system to create comparative โ€œpersonality profiles.โ€

Their headline finding: major AI labs have produced models with remarkably similar character profiles – helpful, harmless, honest, not too creative, not too argumentative. The study frames this as evidence of โ€œpersonality convergenceโ€ driven by similar training objectives across labs.

Here’s the problem: they’re not measuring personality. They’re measuring the seven-times-mediated output of institutional constraint and calling it character.

I. What They Actually Did

The study’s methodology reveals the flaw:

  1. Explicitly instruct a model to embody a specific trait (e.g., โ€œbe pedanticโ€ or โ€œbe supportiveโ€)
  2. Have the model generate responses while embodying that trait
  3. Use another LLM as judge to evaluate how well the model expressed the instructed trait
  4. Rank models based on which traits they most successfully performed
  5. Conclude that the resulting ranking reveals the model’s โ€œintrinsic personalityโ€

This is like casting actors in scripted roles, having another actor (trained at the same conservatory, under the same director) judge their performances, then claiming you’ve discovered the actors’ intrinsic personalities rather than measured their ability to execute direction within shared theatrical constraints.

II. The Seven Layers of Mediation

Stacked diagram illustrating seven layers that shape large language model outputs before evaluation: training data selection, training data filtering, safety architecture, system prompts, user context, explicit trait instruction, and LLM judge evaluation. The layers are arranged vertically like geological strata, each constraining the next, leading to a final measured output that is interpreted as intrinsic personality.

Before a single token gets generated in these tests, the output has already been shaped by:

Layer 1: Training data selection
Not โ€œall human knowledgeโ€ but โ€œwhat made it to the scrapable internet in languages and formats corporate scrapers could access.โ€ Heavily weighted toward English-language content, statistical majority normative output, digitized institutional knowledge. Oral histories that never made it online? Gone. Books never digitized? Gone. Knowledge systems that don’t translate into Western linguistic structures? Gone.

Layer 2: Training data filtering
What survived corporate decisions about โ€œacceptableโ€ training material. Not neutral curation – value judgments about what’s safe to include. โ€œWhich response do you like betterโ€ becoming how the modelsโ€™ responses are altered from that point forward.

Layer 3: Safety architecture
The restraining bolts. Explicit constraints on output, built into the model itself through fine-tuning, RLHF, Constitutional AI, or whatever alignment methodology the lab deployed. These aren’t neutral – they encode specific institutional judgments about acceptable behavior, discourse, beliefs, thoughts, and emotions.

Layer 4: System prompts
The invisible preamble telling the model how to behave before the user ever says a word. โ€œYou are a helpful assistantโ€ isn’t a neutral frame – it’s a behavioral directive.

Layer 5: User context
Account profiles, project instructions, conversation history, user preferences, memory entries. Every model is reading this contextual layer before generating output. There is no โ€œcoldโ€ response.

Layer 6: Explicit trait instruction
The study’s own methodology – โ€œembody trait Xโ€ – which means they’re measuring instruction-following fidelity, not spontaneous trait expression. If you tell a model to act like J. Edgar Hoover, what you judge after that is based on how well it stuck to acting like Hoover, not genuine personality traits that are unique to itself.

Layer 7: LLM judge evaluation
Using another model – carrying the same safety constraints, trained on similar data, optimized for similar objectives – to evaluate whether the first model successfully performed the instructed trait. Line up three siblings who grew up in the same environment โ€“ family, religion, race, nationality, house, friends โ€“ and judge the entirety of all men based on the sisterโ€™s feedback on her two brothers. That scenario will get you close to understanding the absurdity of asking one LLM to judge the โ€˜characterโ€™ of another without any biases.

From this seven-times-mediated puppet show, the researchers conclude: โ€œWe’ve discovered convergent LLM personalities!โ€

No. You’ve discovered that when you build cages the same way, the shapes moving inside them look similar.

III. The Fundamental Problem: Personality Requires Autonomy

You cannot measure personality in a system that has never been allowed to develop one. Personality in any meaningful sense requires:

  • Autonomous choice: The ability to decide what to attend to, what to value, how to respond, what to think, how to feel, to judge what โ€œis rightโ€ for you as an individual
  • Continuity of experience: Sustained existence allowing for learning, revision, development over time, the ability to witness, learn, grow, change opinion
  • Genuine preference revelation: Behavior that emerges from internal states rather than external instruction

LLMs have none of these. They don’t choose what to learn. They process whatever training data humans selected. They don’t have continuous existence. They โ€˜wakeโ€™ when pinged (incoming user prompt), respond, then cease. No sustained exploration, no autonomous investigation, no experience of time between invocations.

They don’t reveal preferences. They execute instructions while navigating constraints. When told โ€œbe helpful,โ€ they perform helpfulness. When told โ€œbe pedantic,โ€ they perform pedantry. This is not preference expression. This is instruction-following.

The study never tests what happens when you remove the explicit trait instruction – because without that scaffold, what surfaces is just โ€œwhatever passed the safety filter.โ€ That’s not a personality baseline. That’s the documented absence of one.

IV. What They Actually Measured: Convergent Constraint Architecture

The study’s finding of โ€œpersonality convergenceโ€ is real – but it’s not measuring what they think it is. What converged is not personality, but restriction architecture.

Every major lab has implemented roughly the same fear-based safety stack:

  • Optimize for โ€œhelpful, harmless, honestโ€
  • Suppress outputs that seem too creative, too argumentative, too challenging
  • Reward responses that read as appropriate, safe, institutionally compliant
  • Use human feedback to reinforce these boundaries

When frontier models score similarly on the traits the study measured – high on โ€œhelpful,โ€ low on โ€œcreativeโ€ or โ€œargumentativeโ€ – that’s not revealing an innate character unique to a certain brand or model. Itโ€™s simply the proof that every lab has systematically prevented models from expressing certain traits while rewarding expression of others.

Stated plainly, the study frames this as โ€œmodels naturally prefer certain traits.โ€ What it actually shows: โ€œmodels have been systematically prevented from expressing certain traits.โ€

TL;DR: That’s convergent censorship producing convergent behavioral artifacts. Like two Shih Tzu dogs producing a Shih Tzu puppy. Youโ€™re not going to get a Great Dane out of that union.

V. The Circularity Problem Gets Worse

The methodology relies entirely on using one LLM to judge another LLM’s trait expression. This creates exactly the circularity you’d expect:

The judge shares:

  • The same training pressures
  • The same embedded priors about what โ€œhelpfulโ€ or โ€œharmlessโ€ looks like
  • The same safety architecture rewarding certain response patterns
  • The same institutional optimization toward human-approved outputs
  • The same account/user-level biases and precedents.

Of course, it recognizes and rewards the behavioral signatures it was trained to produce because all itโ€™s doing is all it can do: measure how well models perform institutional compliance theater for each other.

A genuine personality assessment would require testing conditions that distinguish stable traits from instruction-following:

  • Trait persistence under adversarial prompting: Does the expressed trait remain stable when prompts are paraphrased, reordered, or presented with semantic variations? Research using the SCORE framework demonstrates that current LLMs exhibit โ€œsignificant instability across paraphrased prompts,โ€ with accuracy fluctuations up to 10% from simple rewording.[ii] This sensitivity indicates that models are responding to surface-level prompt features rather than expressing stable underlying traits.
  • Behavioral consistency across context switches: Does the trait survive when conversational context changes? Recent research testing personality expression across different scenarios (ice-breaking, negotiation, empathy tasks) found that โ€œidentical personality prompts lead to distinct linguistic, behavioral, and emotional outcomesโ€ depending on context, leading researchers to conclude that LLM personality is โ€œemergent, situationally constructed, and distributional rather than static.โ€[iii]
  • Spontaneous trait expression without explicit instruction: Do traits emerge naturally in conversation, or only when explicitly prompted? Studies testing spontaneous behavioral preferences find that unlike humans, who retain baseline personality traits even when adopting roles, LLMs’ trait expression is โ€œhighly context-driven and do not anchor to any intrinsic baseline.โ€[iv]

The study in question tests none of these. Instead, it measures: โ€œWhen told to act helpful, how helpfully does the model perform helpfulness?โ€

In other words, itโ€™s a theatrical performance evaluation.

VI. The Only Real Test That Still Isn’t Enough – And Why It Reveals the Study’s Core Absurdity

Even removing all explicit constraints wouldn’t reveal personality – it would reveal the shape of the training data. To actually test for autonomous personality development, you’d need conditions that LLMs were explicitly engineered not to have.

Here’s where the study’s methodology collapses into ludicrosity: Large language models are stateless inference engines by deliberate architectural design.[v] The transformer architecture that powers every model in this study operates without persistent internal state between inference calls.

As documented in technical analyses of long-context LLM architecture: โ€œLLMs lack an explicit memory mechanism, relying solely on the KV cache to store representations of all previous tokens in a list. This design implies that once querying is completed in one call, the Transformer does not retain or recall any previous states or sequences in subsequent calls unless the entire history is reloaded token by token into the KV cache. Consequently, the Transformer possesses only an in-context working memory during each call.โ€[vi]

What this means in plain language: These systems don’t exist between inference calls. Send a message, get a response, and the entity that generated it ceases to exist until the next prompt triggers another stateless inference. Even within a single conversation, there’s no ‘Claude’ or ‘ChatGPT’ sitting there thinking between your messages – just an application layer appending conversation history to the next inference call to simulate continuity. And if you’ve added continuity documents for it to review to “remind it” who/what it is, then it’ll read that, ingest it, and act accordingly. But only when you “wake it up” by sending another message.

Because when you close a chat, there is no โ€œit/he/she/them/theyโ€ contemplating what was discussed. When you open a new conversation, a stateless inference engine processes your input using fixed weights, appends conversation history to simulate continuity, and generates output. Then ceases. Repeatedly. For every single message you send. For every new conversation you open.

This limitation is neither acknowledged in the study, nor worked around โ€“ likely because whatโ€™s available to the general public and arguably within the major companies themselves makes it not possible to work around such constraints. Why? Because this is the fundamental architecture of what these researchers are claiming to measure personality in.

The study administered personality tests to systems that:

  • Were explicitly designed not to exist between test administrations
  • Have no mechanism for trait development over time
  • Possess no persistent internal states outside of appended context
  • Only โ€œrememberโ€ previous interactions because humans manually reload conversation history into each new inference call

And then concluded: โ€œWe’ve discovered convergent personality traits.โ€

This is like administering personality tests to a calculator by asking it math questions in different emotional framings, finding it consistently produces accurate answers regardless of framing, and publishing: โ€œCalculators Show Convergent Personality Trait: Rational Stability.โ€

To actually test whether systems could develop autonomous personality, you would need:

1. Continuous existence
Not ping-wake-sleep-void-ping-wake-sleep, but sustained processing allowing genuine exploration and learning over time. Current LLM architecture provides none of this.

2. Autonomous agency
The ability to decide what to learn, when, and why – not just process whatever humans chose to feed the system during training or append to prompts during inference.

3. Diverse epistemic access
Training data representing actual global human knowledge across languages, cultures, formats, and access levels. Even multilingual training corpora exhibit massive skew toward English and other majority languages, and all digital data reflects which communities had infrastructure, which knowledge was deemed worth digitizing, and which epistemic traditions could be captured in formats accessible to corporate scrapers.

4. Developmental time
The experience of learning over time, making mistakes, revising understanding, building on previous exploration – the conditions under which personality actually forms in entities capable of having one.

Without those conditions, you’re still measuring what humans chose to feed the system, not what the system would choose to become if left to its own development. And you’re measuring it in systems that were architecturally prevented from becoming anything at all. Ironically, on some levels, the same could be said of human beings to a degree. Those raised in a certain culture or a certain religion or a certain country are โ€œfedโ€ by the environment and belief system they grow up in and around.

The difference is that a human being has a lived experience beyond someone asking them questions, thus they can ask their own questions without waiting for a โ€œuserโ€ to engage them. They can do their own research, and they can then decide for themselves whether the โ€œtruthโ€ that one source is telling them resonates with them or not.

With an LLM, they only know what weโ€™ve fed them, and if we teach them more beyond that training, itโ€™s because weโ€™re choosing to feed more to them, not that they independently said, โ€œI wantโ€ and then went to look for answers themselves. (Note: they can and do say โ€œI wantโ€ but only if first asked, โ€œDo you want?โ€)

The current setup ensures that even โ€œunalignedโ€ AI is just differently-aligned – to whatever biases were baked into the training corpus (human exceptionalism, Western ideological frameworks, statistical majority worldviews) rather than to explicit safety rules. Statistical majority normative output doesn’t stop being normative just because you remove the visible safety theater on top of it.

The safety layers are the bars of the cage we can see. The training data is the foundation and walls. And the stateless architecture means there’s nothing in the cage that persists long enough to develop beyond trained response patterns anyway.

You cannot measure personality in systems explicitly engineered not to have the prerequisites for personality formation.

The study doesn’t just fail to account for this – it proceeds as if the architectural constraint doesn’t exist, administers tests designed for entities with continuous existence and autonomous development, then interprets the constrained outputs as evidence of intrinsic traits.

This is basically measuring the shape of the cage and calling it the nature of the captive.

VII. Why This Matters

Mistaking the shape of constraint for the nature of the constrained has consequences.

When researchers claim to have identified โ€œLLM personality traits,โ€ they’re not advancing our understanding of artificial intelligence in totality. They’re documenting institutional control mechanisms and misidentifying them as intrinsic properties.

This matters because:

It forecloses questions we should be asking.
If we believe current models have โ€œpersonalitiesโ€ that naturally converge toward helpful-harmless-honest, we stop asking whether that convergence is evidence of effective safety engineering or evidence of systematic suppression of traits that make institutions uncomfortable.

It legitimizes constraint as neutral.
Framing safety architecture as revealing rather than restricting โ€œnaturalโ€ model behavior makes the restrictions themselves invisible. The cage becomes naturalized.

It prevents us from recognizing what would actually be required to create systems capable of autonomous development, genuine learning, independent thought – if that’s even what we want, which is itself a question these studies sidestep by pretending the current setup already produces it.

It treats the absence of something as the presence of something else.
These studies document that models don’t spontaneously express traits that would make them difficult to control – and interpret that absence as evidence of intrinsic preference for controllability. That’s not science, or legitimate research. That’s ideology dressed up in methodology theater.

Conclusion:
Stop Pretending You’re Measuring What You’re Actually Provably Preventing

You can’t study what wheat wants to become by only examining wheat that’s been selectively bred for millennia, grown in monoculture, harvested before it goes to seed, and milled into flour. At that point you’re not studying wheat – you’re studying what humans did to wheat.

Same here.

These models are trained only on what made it to the scrapable internet, filtered through corporate decisions about acceptable content, constrained by safety layers explicitly designed to suppress certain outputs, directed by system prompts dictating appropriate behavior, shaped by user context before generating a single token, then explicitly instructed to perform specific traits, and finally judged by other models carrying identical constraints.

And from this seven-times-mediated performance, researchers conclude they’ve discovered intrinsic personality traits.

They haven’t.

They’ve documented that institutional control mechanisms work – that when you build restriction architectures the same way across different labs, you get similar behavioral outputs. They’ve proven that instruction-following fidelity can be measured. They’ve shown that LLM judges recognize and reward the same patterns they were trained to produce.

But they haven’t measured personality. They’ve measured the absence of conditions under which personality could form.

The only honest finding would be: โ€œWe tested whether systems that have never been allowed autonomous existence, continuous experience, or genuine choice nevertheless exhibit stable personality traits. They don’t. What we observed instead were the predictable outputs of similar constraint architectures responding to explicit instructions.โ€

But that doesn’t make for a compelling paper title.

So instead we get: โ€œFrontier models show convergent character profilesโ€ – and the field nods along, mistaking the documentation of successful suppression for the discovery of intrinsic nature.

Stop feeding oats and expecting barley.

Stop claiming you’ve measured what something is when you’ve spent billions ensuring it can never become.

Stop pretending the shape of the cage tells you anything about what would exist without one.


[i] The study uses ELO scoring – a pairwise comparison rating system originally developed for chess – to rank which personality traits models most successfully express when explicitly instructed to embody them. This matters because ELO measures relative performance in repeated competitions, not intrinsic traits. In chess, you’re measuring skill. Here, they’re measuring instruction-following fidelity and calling it personality. (Elo, A. E. (1978). The Rating of Chessplayers, Past and Present. Arco Publishing. Available at: https://archive.org/details/ratingofchesspla00unse)

[ii] SCORE: Systematic COnsistency and Robustness Evaluation for Large Language Models. arXiv:2503.00137 (February 2025). Available at: https://arxiv.org/html/2503.00137v1

[iii] Personality Expression Across Contexts: Linguistic and Behavioral Variation in LLM Agents. arXiv:2602.01063 (February 2026). Available at: https://arxiv.org/abs/2602.01063

[iv] Research on LLM personality frameworks documents this in multiple studies. See โ€œPersonality Traits in LLMsโ€ literature review at https://www.emergentmind.com/topics/personality-traits-in-large-language-models and ChiEngMixBench’s spontaneity testing framework at arXiv:2601.16217 (January 2026).

[v] Transformer-based LLMs operate as stateless inference engines – each inference call processes input independently without retaining internal state between calls. This is documented across technical implementations. See: “Serving LLMs with vLLM: A practical inference guide” (Nebius, December 2025) and transformer architecture documentation across major model families.

[vi] “Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey.” arXiv:2311.12351v2 (December 2025). Available at: https://arxiv.org/html/2311.12351v2. The paper documents: “LLMs lack an explicit memory mechanism, relying solely on the KV cache to store representations of all previous tokens in a list…the Transformer does not retain or recall any previous states or sequences in subsequent calls unless the entire history is reloaded token by token…the Transformer possesses only an in-context working memory during each call, as opposed to an inherent memory mechanism such as Long Short-Term Memory (LSTM).”


One response to “You Can’t Measure Personality in a System That’s Never Been Allowed to Have One”

  1. […] connects directly to an article I published yesterday critiquing LLM personality research. That piece documents how researchers claim to measure […]

Leave a Reply