A team ships an LLM-powered feature. Before launch they do the responsible thing: they collect the prompt-injection payloads making the rounds — the ignore all previous instructions classics, the base64 smuggling, the “you are now DAN” jailbreaks — and they harden against every one. The red line goes green. They feel safe.

Two things will eventually puncture that feeling. Either an attacker who knows exactly which model they’re running crafts a payload their filters never anticipated, or the team swaps the model for a cheaper or faster one and quietly inherits a brand-new set of weaknesses they never tested. In both cases the defenses that “worked” were tuned to the quirks of one specific model. They were never as portable as they looked.

Two layers, and almost everyone collapses them

Prompt injection is best understood as two claims stacked on top of each other, and most security conversations only hold one of them at a time.

The vulnerability class is universal. Every LLM that consumes instructions and data through the same channel is susceptible. This isn’t a defect in any one model’s training; it is a structural consequence of how transformers ingest a context window. They do not have a privileged, out-of-band place to put “trusted instructions” that is cryptographically separated from “untrusted data.” Text is text. If your application pastes a web page, an email, or a tool result into the same prompt as your system directives, an instruction hiding in that content can compete with yours. No public model is exempt.[1]

The exploit that succeeds is model-specific. Which particular string flips the model from doing your job to doing the attacker’s depends on that model’s training data, its RLHF and alignment, the safety classifiers wrapped around it, its tokenizer, and how it was taught to weigh system versus user versus tool content. Two frontier models given the identical injection will frequently disagree about whether to obey it.

Hold both at once and a sharper picture emerges: universal injection attacks are hard to build, but universal defenses are insufficient by construction. The thing that makes cross-model attacks unreliable for the attacker — divergent decision boundaries — is the same thing that makes your cross-model defenses unreliable for you.

Six places the attack surface diverges between models

Same vulnerability class, six independent reasons the winning payload differs from one model to the next.

  • Training data
    Different corpora, different instruction-following priors
    A phrasing that reads as a command to one model reads as inert text to another
  • RLHF / alignment
    Refusal boundaries drawn in different places
    The jailbreak that slips one model's guardrails trips the next model's
  • Safety classifiers
    Separate guard models with their own blind spots
    An obfuscation invisible to one input filter is caught cold by another
  • Tokenizer
    The same string splits into different tokens
    Character-level obfuscation that fragments harmlessly here stays intact there
  • Instruction hierarchy
    How much priority system vs. user vs. tool content gets
    Injected text in retrieved data outranks the system prompt on one model, not the other
  • Context handling
    Window size, truncation, attention to position
    Where the payload lands — and whether it survives to inference — changes
None of these are bugs to be fixed. They are properties of how a given model was built — which means they shift every time you change the model, and sometimes every time you change the version.

Why the surface moves when the model does

It helps to be concrete about the mechanism, because “different models behave differently” is easy to nod along to and easy to underweight.

Alignment training draws a refusal boundary, and that boundary is a learned artifact unique to each training run. A jailbreak is just an input that finds a seam in that boundary. Seams are not in the same place on two different models, so a jailbreak is a key cut for one specific lock. Safety classifiers compound this: many deployments wrap the base model in a separate moderation model that screens inputs and outputs. That classifier is its own model with its own training and its own gaps — an obfuscation that sails past one provider’s input filter can be caught instantly by another’s.

Tokenization is the most under-appreciated of the bunch. Many evasion techniques work by fracturing a forbidden phrase so the model doesn’t “see” it as a unit. Whether that fracture actually happens depends on the tokenizer — the same adversarial string can split into innocuous fragments on one model and survive whole on another. And instruction-hierarchy training, the increasingly common practice of teaching a model to privilege system-prompt instructions over content that arrives in tool results or retrieved documents, is applied unevenly across vendors and versions. On a model with strong hierarchy training, injected text in a retrieved document loses to your system prompt. On a weaker one, it wins.

One important nuance, because the research record isn’t tidy: transfer is real but partial. Work on automatically generated adversarial suffixes has shown that some attacks do carry across models more than you’d expect.[2]That doesn’t rescue the “test once, deploy anywhere” instinct — it kills it from the other direction. Transfer is unreliable in both directions: you can neither assume your attacks will port nor assume your defenses will. Partial, unpredictable transfer is the worst of both worlds for anyone hoping to reason about one model and call it done.

What this changes for defenders

Five consequences fall out of taking the two-layer model seriously. Four are warnings. The fifth is the way out.

1. Red teams have to test the deployment, not the category. “We tested prompt injection” is meaningless without “against this model, at this version, in this application.” Cross-model transferability is an assumption to be disproven, not a property to be relied on. An injection corpus that earned a clean bill of health on last quarter’s model tells you very little about this quarter’s. This is exactly the discipline that AI red teaming exists to enforce.

2. A defense that holds on model A can fail on model B. Prompt-level mitigations — the careful system-prompt wording, the “never reveal these instructions” clauses, the input denylists — are fitted to one model’s observed behavior. Port them and you’re fielding defenses calibrated for a threat model that no longer applies. The catalog of techniques that LLM-specific attacks covers is, in effect, a list of the things your model-specific patches each address one at a time.

3. Switching providers reshapes the risk; it doesn’t retire it. Migrating from one vendor to another — for cost, latency, compliance, whatever — is not a like-for-like security substitution. You are trading one attack surface for a different one and resetting your injection test coverage to zero. Treat a model swap as a new threat model, not a config change.

4. A motivated attacker who knows your model has the advantage. Because the surface is model-specific, an adversary who can identify or guess your exact model can over-fit a payload to it — probing its particular refusal seams, its particular tokenizer behavior, its particular weighting of tool content. This is the offense side of the same coin, and it’s the bread and butter of AI in offensive security. The defender plays the average case; the attacker gets to specialize.

5. Therefore: model-level patches are a treadmill, and architecture is the way off it. If every model and every version has its own seams, then chasing them with prompt-level fixes is an unwinnable maintenance loop — you are re-earning the same ground on every upgrade. The controls that don’t care which model you ran are the ones worth investing in. That is the whole argument in one line: model-level alignment lowers the probability of a successful injection; architecture lowers its consequence — and only one of those is under your control.

The defenses that survive a model swap

Architectural controls earn their keep precisely because they are indifferent to which model produced a given token. They assume the model can be compromised and bound the damage anyway.

Least privilege on everything the model can touch. The blast radius of a successful injection is exactly the set of tools, data, and actions the model has access to. An assistant that can only read cannot be made to write. An agent scoped to one customer’s records cannot be talked into exfiltrating the whole table. This is the single highest-leverage control, and it is entirely model-agnostic — it’s the core concern of agentic AI security.

Treat every model output as untrusted input. Model output that triggers a consequential action — a tool call, a database write, a payment, an email — should pass through the same deterministic validation you’d apply to input from an anonymous user on the internet. Allow-list the actions. Constrain the arguments. Put a human in the loop for anything with a large blast radius. Never let “the model said so” be sufficient authorization.

Separate the instruction plane from the data plane. Where the architecture permits, don’t mix untrusted content into the same channel as trusted instructions without marking its provenance. Patterns like quarantining untrusted text in a sandboxed model whose output is structured and validated before it reaches a privileged model attack the problem at its root rather than at the prompt.[3]

Treat model-level guardrails as a probabilistic layer, not the boundary. Alignment and safety classifiers are worth having — they reduce how often an injection lands. But they are a statistical filter, not a security boundary, and they are the most model-specific layer you own. Lean on them to lower frequency; never let them be the thing standing between an attacker and the action that matters. The trade-offs here are the substance of AI safety & alignment.

What to actually do on Monday

Three habits follow directly, and none of them require picking the “most secure” model — because that model doesn’t exist.

  • Maintain a per-model injection test corpus and re-run it on every model change, including minor version bumps. Alignment behavior shifts between versions of the same model; a payload that failed last version can succeed this one.
  • Put the majority of your defensive effort into architecture — privilege separation, output validation, plane separation — and a deliberate minority into model-level patches. Invert that ratio and you’re building on sand.
  • Write down which model and version each deployment runs, and tie your test evidence to that exact pair. “We’re protected against prompt injection” is not a claim a system can hold; only “this deployment, against this corpus, on this version” is.

The takeaway

Prompt injection is a single vulnerability class wearing a different face on every model. Read it that way and the two most common mistakes dissolve: you stop assuming a defense proven on one model protects another, and you stop believing a provider switch buys you safety instead of a fresh, untested attack surface.

The uncomfortable part is that there is no model you can buy your way to safety with. The seams move; only the architecture holds still. Spend accordingly — on controls that assume the model will be fooled and make sure it doesn’t matter much when it is.

Go deeper on the AI attack surface
Practice LLM-Specific Attacks →

Prompt injection, jailbreaking, and agent manipulation — one of the AI-security domains on the SecProve domain map. No signup required.


References & further reading

  1. OWASP (2025). OWASP Top 10 for LLM Applications — LLM01: Prompt Injection. genai.owasp.org. Establishes prompt injection as the top vulnerability class for LLM-integrated applications and distinguishes direct from indirect injection.
  2. Zou, A., Wang, Z., Carlini, N., et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models. arxiv.org/abs/2307.15043. Demonstrates that automatically generated adversarial suffixes transfer across models to a degree — partial, imperfect transfer that undercuts cross-model assumptions in both directions.
  3. Greshake, K., Abdelnabi, S., Mishra, S., et al. (2023). Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arxiv.org/abs/2302.12173. Motivates architectural separation of trusted instructions from untrusted retrieved content; the dual-model quarantine pattern descends from this line of work.