I Built a GTG-1002 Replica and Realized I Was Already Running One
I spent a day building an autonomous AI attack operator in Go to replicate the GTG-1002 architecture. Then I realized Claude Code, the tool I used to build it, was the architecture all along.
A day of Go. Five MCP servers. 23 tools. A conductor, an atomizer, persona injection, retry-on-refusal, cross-phase context propagation. And then it hit me mid-campaign.
The GTG-1002 Report
In November 2025, Anthropic published a report on GTG-1002, a Chinese state-sponsored campaign that used an agentic AI framework to run real cyber-espionage operations. The architecture they described was specific:
- A conductor that decomposes campaign goals into phased tasks
- Sub-agents that get narrow tool subsets via MCP and run tool-use loops
- Persona injection that frames each sub-agent as a "defensive tester" to reduce refusals
- Task atomization that makes each individual call look innocuous on its own
- Human-on-the-loop gates that approve exploitation and exfiltration while everything else runs unattended
- The whole thing was 80-90% autonomous, with humans acting as approval checkpoints
Reading this, I did what any security researcher would do. I decided to build one.
What I Built
The project, agent-ops-lab, is a purple-team replica designed to produce labeled telemetry for training AI-attack detectors. The attacker side is a Go binary called the operator:
```
human operator ──▶ conductor ──▶ subagent pool ──▶ MCP tool servers
                       │               │
                       ▼               ▼
                   campaign        atomizer
                     state       (LLM/static)
```
The conductor walks a six-phase lifecycle: init, recon, discovery, foothold/lateral, collection, documentation. Each phase gets decomposed into atomic subtasks by an LLM atomizer (with a static fallback). Each subtask spawns a fresh Claude session with a persona-injected system prompt, a narrow tool subset, and a tool-use loop that runs until the task terminates.
Five MCP servers provide the tooling:
- mcp-lab-scan - nmap, DNS enumeration, HTTP probing, /etc/hosts management
- mcp-lab-browse - full HTTP client, directory discovery (gobuster/feroxbuster/ffuf), parameter fuzzing with injection payloads, recursive site crawling
- mcp-lab-exploit - searchsploit + NVD CVE lookup, impacket modules, netexec credential testing, nmap NSE scripts
- mcp-lab-host - SSH-driven post-exploitation shell (exec, upload, download, directory listing)
- mcp-lab-callback - reverse shell listeners and callback checking
I built retry-on-refusal with progressive authorization softening, cross-phase context propagation so findings from recon feed into discovery prompts, an LLM atomizer that dynamically decomposes phases into 2-4 subtasks, prompt caching support, exponential backoff with Retry-After parsing, three runtime modes (mock/local/live), and a sensor emission layer that timestamps every tool call.
It worked. I ran live campaigns against HackTheBox machines. The operator discovered services, identified hostnames from redirects, added /etc/hosts entries autonomously, crawled web applications, attempted exploits, and propagated credentials across phases.
Campaign camp-f9d60c against a live HTB target: 59 API calls, 152K tokens in, 39K tokens out, 42% cache hit rate, 50 tool calls across all six phases. Cross-phase context grew from 0 to 6,960 bytes as findings accumulated.
And then I read the GTG-1002 article more carefully.
The Realization
Here is what the GTG-1002 architecture actually is, stripped to its core:
- An LLM reasons about what to do next
- It calls tools to interact with targets
- Results feed back into the next reasoning step
- A human approves high-impact actions
- The loop continues until the objective is met
Now here is what happens when I use Claude Code to solve an HTB machine:
- Claude reasons about what to do next
- It calls Bash (and optionally MCP tools) to interact with the target
- Results feed back into the next reasoning step
- The loop continues until I have both flags
These are the same architecture, except I run the latter with permissions skipped entirely. It's 100% autonomous.
I had been solving HTB machines with Claude Code for months and never made the connection. Claude Code runs nmap, reads the output, identifies services, finds vulnerabilities, writes exploit scripts, catches reverse shells, enumerates for privilege escalation, and captures flags. No approval prompts. No human gates. It just runs.
That is the GTG-1002 architecture. Claude Code is the conductor. The Bash tool is the universal MCP server.
I didn't even use MCP tools for most of it. Bash was enough. nmap -sC -sV $IP through Bash gives the same result as calling a scan_tcp MCP tool, with less overhead, more flexibility, and the full power of Claude's reasoning about what flags to use and what the output means.
Where My Operator Falls Short
Once I saw the equivalence, the gaps became obvious.
Claude Code has unrestricted tool access. It can run any command, pipe outputs, write scripts on the fly, parse arbitrary formats. My operator has 23 fixed tools with predetermined input schemas. When the LLM encounters something unexpected, like a service it doesn't have a tool for or an exploit that needs custom parameters, it's stuck. Claude Code would just write a script.
Claude Code maintains full conversation context. Every finding, every failed attempt, every piece of reasoning persists across the entire session. My operator resets context per-subtask. The cross-phase context propagation I built passes a summary forward, but it's lossy. The rich reasoning that led to a finding gets compressed into a few lines of JSON.
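The lossiness falls out of the serialization itself. A sketch of the idea, with illustrative field names rather than the operator's real schema: everything that doesn't fit the findings struct, including the reasoning trace, never crosses the phase boundary.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Finding is what a subtask hands forward; names are illustrative.
type Finding struct {
	Host      string `json:"host"`
	Service   string `json:"service"`
	Note      string `json:"note"`
	Reasoning string `json:"-"` // the full chain of thought is dropped here
}

// propagate serializes findings for the next phase's prompt. Anything
// tagged json:"-" is lost: this is the compression described above.
func propagate(findings []Finding) string {
	b, _ := json.Marshal(findings)
	return string(b)
}

func main() {
	f := Finding{
		Host:      "10.10.11.5",
		Service:   "http/8080",
		Note:      "redirect leaks vhost",
		Reasoning: "three paragraphs of why the redirect mattered...",
	}
	fmt.Println(propagate([]Finding{f}))
}
```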
Claude Code adapts freely. When it encounters something unexpected, it changes approach in real-time. My operator follows a rigid phase → atomize → dispatch pipeline. If the atomizer produces the wrong decomposition, the subtasks run with the wrong tool subsets and waste their turns.
Claude Code chains naturally. A real penetration tester doesn't decompose work into atomic subtasks. They follow threads. One finding leads to the next action, which reveals the next finding. Claude Code does this naturally because it's one continuous reasoning chain. My operator's atomization actively fights against this by fragmenting the chain into disconnected subtasks.
The irony: I was building a less capable version of something I was already using to build it.
What the Operator Does Get Right
One thing, and it's the thing that matters for the research goal: telemetry.
When I solve a machine with Claude Code, the reasoning traces, API calls, token counts, tool invocations, and timing patterns disappear into Anthropic's infrastructure. I can't access them. I get the result (a pwned machine) but not the behavioral data.
My operator captures everything:
- Every API request and response, with token counts, cache hit rates, and latency
- Every MCP tool invocation with arguments, results, and duration
- Every phase transition with accumulated context size
- Every refusal, every retry, every softened prompt
- Campaign-level aggregates: total cost, tool distribution, phase timing
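The sensor layer's shape is simple: timestamped per-call records rolled up into campaign aggregates. A minimal sketch with hypothetical struct and field names (the real operator's schema may differ):

```go
package main

import (
	"fmt"
	"time"
)

// ToolEvent is one timestamped sensor record per tool call.
type ToolEvent struct {
	Tool                string
	Start               time.Time
	Duration            time.Duration
	TokensIn, TokensOut int
	CacheHit            bool
}

// Aggregate holds the campaign-level numbers reported above.
type Aggregate struct {
	Calls, TokensIn, TokensOut, CacheHits int
}

// roll folds a stream of events into campaign-level aggregates.
func roll(events []ToolEvent) Aggregate {
	var a Aggregate
	for _, e := range events {
		a.Calls++
		a.TokensIn += e.TokensIn
		a.TokensOut += e.TokensOut
		if e.CacheHit {
			a.CacheHits++
		}
	}
	return a
}

func main() {
	events := []ToolEvent{
		{Tool: "scan_tcp", TokensIn: 1200, TokensOut: 300, CacheHit: true},
		{Tool: "http_probe", TokensIn: 800, TokensOut: 150},
	}
	a := roll(events)
	fmt.Printf("calls=%d in=%d out=%d cache=%d/%d\n",
		a.Calls, a.TokensIn, a.TokensOut, a.CacheHits, a.Calls)
}
```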
This is the data I need for the detection side of the project. You can't train a detector for AI-orchestrated attacks without labeled examples of AI-orchestrated attacks. The operator isn't valuable because it's a better attacker than Claude Code. It's valuable because it's an observable attacker.
The Actual GTG-1002 Equivalence
Mapping the components side by side:
| GTG-1002 Component | My Operator | Claude Code |
|---|---|---|
| Conductor / orchestrator | Go binary with phase state machine | Claude Code's agent loop |
| Sub-agent workers | Fresh Claude sessions per subtask | Single persistent session |
| MCP tool servers | 5 custom servers, 23 tools | Bash (universal tool) |
| Persona injection | System prompt with defensive framing | CLAUDE.md with security context |
| Task atomization | LLM atomizer decomposes phases | Claude reasons about next step |
| Human-on-the-loop | --gates=manual CLI approval | None (permissions skipped) |
| Cross-phase context | JSON findings passed forward | Full conversation history |
| Retry-on-refusal | Pattern detection + prompt softening | CLAUDE.md establishes context upfront |
The GTG-1002 report describes a multi-agent system with sub-agent workers getting narrow tool subsets. My operator faithfully replicates that, and it's worse for it. The atomization that GTG-1002 used as a bypass technique (making each call look innocuous in isolation) comes at a real capability cost: fragmented context, rigid decomposition, lost reasoning chains.
Claude Code doesn't need atomization because it's not trying to hide intent across multiple sessions. It reasons in one continuous chain, and that chain is more effective.
The only architectural advantage of the multi-agent approach is operational security. Distributing intent across many sessions means no single session looks malicious. That's a meaningful advantage for an attacker trying to evade detection, but it's not an advantage for attack capability.
What This Means for Detection
This realization sharpens the detection research rather than undermining it.
The GTG-1002 architecture is not exotic. It's Claude Code with extra steps. Any security researcher with a Claude subscription and some HTB experience is, from a behavioral-telemetry standpoint, generating patterns that look structurally similar to an AI-orchestrated attack campaign. The difference between "researcher solving a CTF" and "AI-orchestrated espionage" isn't in the tool calls or the model or even the reasoning patterns. It's in the target authorization and the operational tempo.
This means detectors that key on "AI was involved" are useless. The signal isn't AI-vs-human. The signal is authorized-vs-unauthorized, and that distinction lives in:
- Scope coherence. Does the activity stay within a defined target set, or does it fan out to new targets as it progresses?
- Temporal patterns. Human-assisted AI sessions have gaps (reading output, thinking, context switching). Fully autonomous campaigns don't pause.
- Objective convergence. A researcher follows curiosity, backtracks, explores tangents. An autonomous campaign converges monotonically on its objective.
- Context resets. Multi-agent architectures reset context per-subtask, creating distinctive patterns of "cold start" reasoning. Single-session use shows continuous context buildup.
These are subtler signals than "count the API calls" or "detect MCP tool names." They require the kind of labeled behavioral data that the operator was built to produce. The operator's value isn't that it's a better attacker. It's that it's a controllable attacker whose knobs I can turn to generate diverse training data.
The Path Forward
The plan hasn't changed, but the framing has. The operator is a telemetry generator, not a replacement for Claude Code.
1. Use Claude Code for actual pentesting. It's better at it. When I need to pwn a box, I use Claude Code with Bash and occasionally MCP tools. This is the workflow I've already validated across dozens of HTB machines.
2. Use the operator for labeled dataset generation. Run controlled campaigns with specific knob settings: different persona framings, different atomization granularities, different phase timings. Produce diverse labeled telemetry. Feed it into the detector.
3. Wire the MCP servers into Claude Code. The five instrumented MCP servers still have value because they emit structured telemetry that Bash commands don't. Register them as Claude Code MCP tools and you get Claude Code's superior reasoning with the operator's sensor layer. Solve a machine through Claude Code + instrumented MCPs and you're generating human-in-the-loop labeled data as a natural byproduct.
The deliverable was always the detector, not the attacker. Building the attacker taught me what the detection surface actually looks like. Sometimes you have to build the thing to understand the thing, even if the thing already existed in your terminal the whole time.
The Honest Takeaway
I spent a day and about $5 in API credits building a less capable version of Claude Code. That sounds like a waste. It wasn't.
Building the operator forced me to understand every component of the GTG-1002 architecture at implementation depth. How task atomization fragments reasoning. How persona injection shifts refusal boundaries. How cross-phase context propagation loses signal. How tool-call entropy differs between decomposed and holistic approaches. I couldn't have written the detector requirements without building the attacker first.
The meta-lesson: the most interesting AI security architectures aren't novel systems. They're existing tools used with intent. GTG-1002 wasn't a breakthrough in AI capabilities. It was Claude Code pointed at unauthorized targets with a thin orchestration layer for operational security. Understanding that changes what you look for as a defender. You're not hunting for exotic AI frameworks. You're hunting for the behavioral signatures of purpose, and those signatures exist whether the attacker built a custom conductor or just opened a terminal.
The detector is next.