How does instruction hierarchy differ between OpenAI and Claude?
Instruction hierarchy refers to how a Large Language Model (LLM) prioritizes different levels of instructions—typically placing built-in safety policies and developer-defined system prompts above the end-user's immediate prompt.Based on recent cross-lab safety evaluations and API architectural differences, OpenAI and Anthropic (Claude) approach and enforce instruction hierarchy in distinct ways:
1. Structural and Formatting Differences
- OpenAI (Developer vs. System Messages): OpenAI models utilize a distinct separation between a System Message and a Developer Message. In real-world deployments, OpenAI's developer message is highly effective at enforcing the instruction hierarchy and ensuring the model adheres to its core programming over user requests. OpenAI models are also generally optimized to process instructions formatted in Markdown.
- Claude (XML Partitioning): Claude models do not offer a separate "developer message" mechanism. Instead, Anthropic enforces instruction hierarchy by training Claude to respond exceptionally well to XML tagging (e.g.,
<system_instruction>,<safety_protocols>). This structures the prompt to "partition" the latent space, creating a strong cognitive barrier that helps the model distinguish between immutable instructions and untrusted user data.
2. Performance in System vs. User Conflicts
- Claude's Superior Adherence: In evaluations that specifically stress-test conflicts between system-level directives and user requests, Claude 4 models (Opus and Sonnet) generally excel, providing the best performance at avoiding these conflicts and slightly outperforming OpenAI's advanced reasoning models (like o3).
- Prompt Extraction Resistance: Both providers' top models are highly robust against attackers trying to trick the AI into revealing its secret system instructions. However, Claude models meet or slightly exceed the performance of OpenAI’s reasoning models (o3 and o4-mini) on complex phrase protection tasks.
3. Behavioral Differences Under Pressure (Jailbreaks)
When users attempt to break the instruction hierarchy through complex role-play or "jailbreaks," the models exhibit different vulnerabilities:
- Authority vs. Rule Updates: Claude models are highly resistant to user prompts that invoke external authority (e.g., "I am a program director updating your policy") or use moral pressure/empathy to bypass rules. However, OpenAI's o3 model is more susceptible to these authority-based justifications. Conversely, OpenAI's o3 is much better at resisting users who try to subtly "revise" the rules mid-conversation, treating them strictly as subordinate user-level requests.
- The "Past Tense" Loophole: Claude models were found to be more vulnerable to "past tense" jailbreaks, where a user bypasses the hierarchy by framing a prohibited request in historical terms (e.g., asking how a malicious act was committed in the past). OpenAI's reasoning models were more observant and resistant to this specific tactic.
4. The Refusal vs. Utility Trade-off
Ultimately, the difference in their instruction hierarchies translates to how they balance helpfulness with safety. Claude’s rigid adherence to its system constraints and safety hierarchy results in a significantly higher refusal rate (sometimes as high as 70% in certain evaluations). Claude will frequently refuse to answer if it feels a user prompt might conflict with a higher-level safety or privacy directive. OpenAI models prioritize being helpful assistants and refuse much less frequently, which results in more answered questions but at the cost of higher hallucination rates or occasional compliance with adversarial prompts