April 16, 2026
#Claude Opus 4.7 #Claude Opus 4.6 #Claude Code #AI Pentesting #Bug Bounty #Red Team #Workflow

Opus 4.7 vs 4.6 for Security Work: A Practical Model-Switching Guide

Anthropic dropped Claude Opus 4.7 today. For pentesters, bug bounty hunters, and offensive tooling devs, it's not a clean upgrade. Here's when to switch, when to stay, and why the workflow matters more than the version number.



The Short Version

4.7 is better at coding. 4.6 is better at long-context retrieval and terminal agentic work.

If your day is writing exploit PoCs, refactoring evasion tooling, or doing code review on an OSS target: 4.7 wins.

If your day is a multi-hour pentest engagement with piles of notes, scan output, and previous attempts stacked in context: 4.6 1M wins.

The benchmark data backs this up. We'll get to the numbers in a minute.


What Actually Changed in 4.7

The Good

  • Coding improvements. This is the headline feature. Self-verification, better reasoning on long agentic chains, measurable gains on SWE-bench and similar.
  • Vision. Screenshots, diagrams, Burp output — all better interpreted.
  • Deeper agentic thinking. 4.7 thinks harder at higher effort, particularly on later turns. This is a real upgrade for long attack chains.
  • Adaptive thinking. Instead of manually tuning budgets, the model decides how much to think. Less tuning, more trusting.

The Bad (For Us)

Three regressions matter for security work:

1. MRCR v2 at 1M tokens: 78.3% → 32.2%

MRCR measures needle-in-haystack retrieval. Anthropic's framing is that this benchmark is "synthetic distractor stacking" and Graphwalks is the better applied signal. Fair — but a 46-point drop on retrieval matters when you're asking the model "did we already try this credential against WinRM earlier in the session?" after four hours of enumeration.

2. Terminal-Bench 2.0: regression

GPT-5.4 now scores 75.1% here vs. Opus 4.7's 69.4%. For anyone whose day job is chaining terminal commands through SSH sessions, this is the one to watch.

3. BrowseComp: regression

Web browsing and recon performance on unfamiliar targets both dip. That hurts the scope-reconnaissance phase of bug bounty work in particular.

The Weird

  • Thinking budgets deprecated. thinking: {type: "enabled"} and budget_tokens are being phased out. Adaptive thinking is the new paradigm.
  • Thinking display is off by default in 4.7. You used to get summarized thinking for free. Now thinking.display defaults to "omitted" — you have to explicitly request "summarized" to see reasoning.
  • Low-effort 4.7 ≈ medium-effort 4.6. The whole effort scale shifted. If your mental model was calibrated to 4.6's effort knob, recalibrate.
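The API-surface changes above can be sketched as a before/after request shape. This is a minimal illustration built from the parameter names described in this post (`thinking`, `budget_tokens`, `display`, `"summarized"`); treat the exact field names and values as assumptions, not the shipped API.

```python
def opus_46_request(prompt: str) -> dict:
    """Old style: manually tuned thinking budget; summarized thinking shown by default."""
    return {
        "model": "claude-opus-4-6",
        "messages": [{"role": "user", "content": prompt}],
        # Deprecated pattern: explicit budget tuning.
        "thinking": {"type": "enabled", "budget_tokens": 16000},
    }

def opus_47_request(prompt: str, show_thinking: bool = True) -> dict:
    """New style: adaptive thinking decides its own budget; display is opt-in."""
    req = {
        "model": "claude-opus-4-7",
        "messages": [{"role": "user", "content": prompt}],
        # No budget_tokens: the model decides how much to think.
    }
    if show_thinking:
        # 4.7 defaults thinking display to "omitted"; request "summarized" explicitly.
        req["thinking"] = {"display": "summarized"}
    return req
```

The practical upshot: any wrapper scripts that set `budget_tokens` need a code path per model generation, not just a model-name swap.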

Why This Matters for Security Work Specifically

Pentesting isn't the same as writing a React component. The workflows that define our work have specific context-shape characteristics that interact with model capabilities in non-obvious ways.

Long Pentest Engagements

A typical engagement directory looks like this:

engagement/target/
├── CLAUDE.md           # Ongoing context: creds, foothold state, failed paths
├── notes.md            # Walkthrough-in-progress
├── nmap/
│   ├── initial.nmap
│   ├── allports.nmap
│   └── targeted.nmap
├── loot/
│   ├── creds.txt
│   └── hashes.txt
└── exploits/
    └── cve-2024-xxxx.py

By the end of a multi-host engagement I've typically loaded:

  • 20-50KB of nmap output
  • 100-500KB of linpeas/winpeas output
  • Multiple bash session transcripts
  • Credentials, hashes, ticket files, BloodHound JSON
  • Previous failed enumeration paths (so we don't repeat them)

This is a retrieval-heavy workload — the model constantly needs to pull specific facts from earlier in the context. "Did I already try that password against SMB? What was the SPN we found? Where did linpeas mention that SUID binary?"

This is exactly the workload MRCR is measuring. The 46-point drop in 4.7 isn't abstract — it's the thing that makes the model forget it already ran bloodhound-python 90 minutes ago.
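One mitigation, whichever model you run: keep an explicit attempt log on disk in the engagement directory instead of relying on in-context needle retrieval. A minimal sketch, where `log_attempt`/`already_tried` are hypothetical helpers of my own (not part of any tool) and the log path mirrors the tree above:

```python
from pathlib import Path

DEFAULT_LOG = Path("engagement/target/attempts.log")

def log_attempt(action: str, log: Path = DEFAULT_LOG) -> None:
    """Append one line per attempt, e.g. 'winrm spray: Autumn2025! vs 10.10.10.5'."""
    log.parent.mkdir(parents=True, exist_ok=True)
    with log.open("a") as f:
        f.write(action.strip() + "\n")

def already_tried(action: str, log: Path = DEFAULT_LOG) -> bool:
    """Exact-match check you (or the model, via a file read) can run instead of
    hoping retrieval finds the attempt in 500KB of scrollback."""
    return log.exists() and action.strip() in log.read_text().splitlines()
```

Exact-match is deliberately dumb: it turns "did we already try this?" from a retrieval problem into a grep problem, which no benchmark regression can break.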

Verdict: Stay on 4.6 1M for long engagements. Full stop.

Bug Bounty Recon

Bug bounty work varies, but the recon phase shares the same context-heavy character:

  • Scope lists and out-of-scope domain filters
  • Wayback archive results (often tens of thousands of URLs)
  • Subdomain enumeration output (crt.sh, subfinder, amass)
  • Previous dupe/triage notes from the program
  • Your own prior submission history

If you're running an OSS hunting workflow with a queue of targets, you want a model that can hold "we already looked at target X, found pattern Y, didn't see it in target Z" in its working memory.
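The scope-juggling step above can also be pushed out of the model's context entirely: merge the enumeration output from the listed tools, dedupe, and drop out-of-scope hosts before anything reaches the session. A sketch, with the suffix-based scope check as an illustrative assumption (real program scopes can be more complex):

```python
def in_scope(host: str, scope: list[str], out_of_scope: list[str]) -> bool:
    """In scope if the host falls under a scope suffix and no out-of-scope suffix."""
    def under(h: str, suffixes: list[str]) -> bool:
        return any(h == s or h.endswith("." + s) for s in suffixes)
    return under(host, scope) and not under(host, out_of_scope)

def merge_enum(results: dict[str, list[str]], scope: list[str],
               out_of_scope: list[str]) -> list[str]:
    """results maps tool name (crt.sh, subfinder, amass) -> hostnames found."""
    seen = set()
    for _tool, hosts in results.items():
        # Normalize: trim, lowercase, drop trailing dots from DNS-style names.
        seen.update(h.strip().lower().rstrip(".") for h in hosts if h.strip())
    return sorted(h for h in seen if in_scope(h, scope, out_of_scope))
```

The model then sees one clean, sorted list instead of three overlapping tool dumps, which is cheaper in tokens and friendlier to whichever model's retrieval you're stuck with.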

Verdict: 4.6 1M for recon and triage. 4.7 for diving deep into one target's code.

Offensive Tool Development

This is where 4.7 shines. Writing evasion tooling, custom C2 components, AD attack automation — this is pure coding work with clear interfaces and relatively bounded context per task.

When I sit down to write a new syscall invocation method in C with MinGW, I don't need 500KB of prior session context. I need a model that gets pointers right, understands Windows internals, and doesn't hallucinate WinAPI signatures.

Verdict: 4.7 for tool dev. This is the upgrade you wanted.

Exploit PoC Development

Similar to tool dev — bounded context, high reasoning demands. 4.7's self-verification is genuinely useful here: writing an exploit, then having the model sanity-check its own payload logic catches errors that would otherwise cost you a debug cycle.
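That write-then-verify pattern can be made explicit as a bounded two-pass loop. This is a sketch of the workflow, not Claude Code's internals: `generate` stands in for whatever model call you use, and the prompt wording is illustrative.

```python
from typing import Callable

def write_then_verify(task: str, generate: Callable[[str], str]) -> str:
    """Pass 1 drafts the PoC; pass 2 asks the model to audit its own payload logic;
    an optional pass 3 applies the fixes. Bounded at three calls by design."""
    draft = generate(f"Write an exploit PoC for: {task}")
    review = generate(
        "Review this PoC for logic errors (offsets, encodings, bad assumptions). "
        f"Reply OK or list fixes:\n{draft}"
    )
    if review.strip() == "OK":
        return draft
    # One revision pass keeps the loop from iterating forever on nitpicks.
    return generate(f"Apply these fixes to the PoC:\n{review}\n---\n{draft}")
```

Capping at a single revision pass is the point: the verification step should catch the expensive debug-cycle errors, not polish indefinitely.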

Verdict: 4.7 for PoC development.

Code Review on OSS Targets

When you're auditing an OSS codebase for a bug bounty submission, you're doing deep reasoning over structured code. Graphwalks-style long-context reasoning matters more than needle retrieval. 4.7's coding improvements, vision upgrades (for reading architecture diagrams), and self-verification all pay off.

Verdict: 4.7 for code review.


The Practical Flow

Here's the rubric I'm using now. /model in Claude Code flips models instantly, so there's no reason to commit to a single model for a whole session.

Starting a new engagement:

  • Default to Opus 4.6 (1M).
  • Stay there through recon, foothold, privesc, lateral movement.
  • Only switch if a specific sub-task justifies it (see below).

Mid-engagement, need to write significant code:

  • Example: building a custom loader, writing a Python exploit wrapper, porting a C PoC.
  • /model to Opus 4.7.
  • Write the code.
  • /model back to 4.6 to resume the engagement.

Bug bounty recon (broad):

  • Opus 4.6 (1M). Scope juggling and context accumulation are retrieval-heavy.

Bug bounty deep-dive (one target, code review):

  • Opus 4.7. Self-verification and coding gains matter here.

Writing offensive tooling from scratch:

  • Opus 4.7. Default for this.

Exploit PoC development:

  • Opus 4.7.
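The rubric above reduces to a lookup table, which is handy if you drive Claude Code from a wrapper script. The model ID strings follow the claude-opus-4-7 form used later in this post; the workload labels are my own and treat them as assumptions:

```python
RUBRIC = {
    "engagement":  "claude-opus-4-6",  # 1M context: recon, foothold, privesc, lateral
    "recon-broad": "claude-opus-4-6",  # scope juggling, context accumulation
    "code-review": "claude-opus-4-7",  # one target, deep dive
    "tool-dev":    "claude-opus-4-7",  # bounded context, pure coding
    "exploit-poc": "claude-opus-4-7",  # self-verification pays off
}

def model_for(workload: str) -> str:
    """Default to 4.6 1M, matching the 'default to Opus 4.6' rule above."""
    return RUBRIC.get(workload, "claude-opus-4-6")
```

Defaulting the fallback to 4.6 encodes the same bias as the rubric: when in doubt, you're probably mid-engagement, and retrieval beats coding polish there.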

One Caveat on Mid-Session Switches

When you /model mid-session, conversation context carries over but the new model interprets it fresh. If you're deep in a complex exploitation phase and switch models, the new model doesn't have the same "feel" for where you are.

Practical rule: finish a phase before switching. Got a shell? Finish the privesc hunt first, then switch if needed. Mid-BloodHound analysis? Let 4.6 finish mapping the attack graph. Don't context-switch models mid-thought.


The Elephant in the Room

4.7 isn't Anthropic's best model. Mythos is. They're holding it back on safety grounds. 4.7 is, effectively, the consolation prize — a reliable upgrade on the coding dimension that they're comfortable shipping while the real model sits in the lab.

What this means practically:

  • Don't expect 4.7 to be the universal upgrade that 4.5 → 4.6 was.
  • Specialize your model selection. The era of "just use the newest" is over for now.
  • Watch the next two quarters. Mythos (or whatever they ship in its place) is where the real generational leap is.

What I'm Actually Doing

Concrete config changes after reading the release notes and running some informal tests:

  1. Default model: Stayed on Opus 4.6 (1M). Long-engagement work is my primary workload.
  2. When building custom tools: Explicitly launch Claude Code with --model claude-opus-4-7 for coding-focused sessions.
  3. Opus-only policy: I don't downshift to Sonnet or Haiku. Security work rewards reasoning quality over speed — a fast wrong answer in an attack chain wastes more time than a slow right one.

I'll revisit once Anthropic publishes more applied benchmarks specifically around agentic coding with long retrieval components. Until then, the workflow above is my default.


Closing Thought

The frustrating thing about 4.7 isn't that it regressed on some benchmarks — it's that the regressions hit the exact workloads that matter for security engineering. Needle-retrieval at 1M context is the thing that makes long multi-host engagements tractable. Terminal-Bench is the thing that makes chained SSH sessions reliable. Those are the security-work benchmarks.

The coding gains are real and I'll use them. But for the core of what I do — long, stateful, retrieval-heavy engagements where the model is both operator and memory — 4.6 1M stays the default.

Pick your model by workload, not by version number. /model is free.