Agent Decomposition: Tools, Skills, and Subagents

Will, from Anthropic, uses Stock Pilot, an inventory assistant, to show what happens when an agent accumulates a 400-line prompt, too many tools, and opaque subagent wrappers. The workshop refactors the system toward Claude Managed Agents, modular skills, code execution primitives, and a single native callable agent where isolation still matters.

Processed May 29, 2026

Infographic for decomposing an overgrown Claude agent into Managed Agents, skills, code execution primitives, and callable agents.

Executive Summary

Will, from Anthropic's Applied AI team, uses a live Code with Claude London workshop to show how agents degrade when each new business requirement is added directly to the core prompt. The example system, Stock Pilot, has grown to a 400-line system prompt, 12 custom tools, and 3 custom subagent wrappers. The baseline eval score in the live workshop, using Claude Code running Opus 4.7 with extra high effort, is 62%, with failures tied to context pollution, unclear instruction boundaries, and fragile handoffs.

The refactor moves the system away from a custom Messages API loop and onto Claude Managed Agents, so session state, sandboxing, security layers, and multi-user scaling become platform concerns rather than application glue. The prompt is then reduced to its global identity and baseline rules, while business procedures and domain logic move into skills that use progressive disclosure.
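Progressive disclosure, as described above, can be sketched in a few lines. This is a minimal illustration, not the Claude Managed Agents skill format: the `Skill` class, the skill names, their contents, and the helper functions are all hypothetical. The point is only that the always-loaded prompt carries a short index, while each procedure body is pulled in on demand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    """Hypothetical in-memory stand-in for a skill package."""
    name: str
    description: str  # short summary, always visible to the model
    body: str         # full procedure, loaded only when a task needs it


# Invented example skills; real skills would live in files, not in code.
SKILLS = {
    "reorder-policy": Skill(
        name="reorder-policy",
        description="How to decide when and how much stock to reorder.",
        body="1. Compare on-hand units to the reorder point.\n"
             "2. Order enough to reach the target level.",
    ),
    "holiday-returns": Skill(
        name="holiday-returns",
        description="Seasonal rules for handling returned items.",
        body="1. Check the purchase date against the seasonal window.",
    ),
}

# Global identity and baseline rules only; no tactical procedures.
CORE_PROMPT = "You are Stock Pilot, an inventory assistant."


def build_system_prompt(skills):
    """Return the core prompt plus a one-line index entry per skill."""
    index = "\n".join(f"- {s.name}: {s.description}" for s in skills.values())
    return f"{CORE_PROMPT}\n\nAvailable skills:\n{index}"


def load_skill(skills, name):
    """Return the full body of one skill when the current task calls for it."""
    return skills[name].body


prompt = build_system_prompt(SKILLS)
```

With this shape, adding a new business procedure grows the index by one line instead of growing the core prompt by the length of the procedure.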

The workshop also narrows the tool surface. Instead of many specialized extraction tools, Claude gets computer-like primitives such as bash, read, and write, then writes Python scripts to filter large files before they enter context. Subagents are not treated as a default escape hatch: two wrappers are removed, while the forecasting workflow remains as a native callable agent for context isolation and cleaner logging. The workshop result is scoped to this demo: a 15-line prompt, 3 primitives, 1 callable agent, and a reported 92% eval peak.
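The kind of script Claude might write with those primitives could look like the sketch below. It is an assumption-laden example, not code from the workshop: the column names, the sample data, and the `low_stock_rows` helper are invented, and an in-memory string stands in for a large file on disk. Only the small filtered result would need to enter the model's context.

```python
import csv
import io

# Invented sample data standing in for a large inventory export on disk.
SAMPLE_CSV = """sku,on_hand,reorder_point
A-100,4,10
B-200,50,20
C-300,0,5
"""


def low_stock_rows(csv_file):
    """Return only the rows whose on-hand count is below the reorder point."""
    reader = csv.DictReader(csv_file)
    return [
        {"sku": row["sku"], "on_hand": int(row["on_hand"])}
        for row in reader
        if int(row["on_hand"]) < int(row["reorder_point"])
    ]


# In a sandbox this would be open("inventory.csv", newline="") instead.
summary = low_stock_rows(io.StringIO(SAMPLE_CSV))
print(summary)
```

The script reads every row, but the agent sees only the two low-stock entries rather than the full dataset.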

Key Takeaways

Prompt bloat is an architecture problem, not only a writing problem: long, accumulated instructions can create rule conflicts and context pollution.
The workshop separates evals into single-turn regression tests and more complex multi-turn failure-mode tests.
Useful telemetry includes tokens, cost, latency, correctness, style, and tone, with LLM-as-judge grading used for qualities that cannot be checked deterministically.
Claude Managed Agents shifts session routing, sandboxing, security layers, and scaling away from custom application orchestration.
Skills are used as modular packages for domain information, letting the model pull in task-specific context through progressive disclosure.
The core prompt should carry global identity and strict baseline rules, not every tactical business procedure.
General code execution primitives can replace brittle specialized tools when the agent needs to inspect or transform local data.
Letting Claude write small Python scripts over files can reduce token load compared with injecting raw datasets into context.
MCP is useful for standardized tools shared across clients, but overlapping MCP servers can pollute context and consume substantial token space.
Subagents remain useful when the workflow needs parallel work or an isolated fresh context, such as separating forecasting from the main planner.
Native callable agents improve observability over opaque custom tool-wrapped subagents by keeping logs, transcripts, and metrics in one managed flow.
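The split between single-turn and multi-turn evals, together with per-case telemetry, can be sketched as a small aggregation step. Everything here is hypothetical: the `EvalResult` fields, the suite names, and the sample numbers are invented for illustration and do not reflect the workshop's actual harness or scores.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvalResult:
    """One graded eval case with the telemetry captured alongside it."""
    case_id: str
    suite: str        # "single_turn" or "multi_turn" (assumed labels)
    passed: bool
    tokens: int
    latency_s: float


def summarize(results):
    """Group results by suite and report pass rate, mean tokens, mean latency."""
    by_suite = {}
    for r in results:
        by_suite.setdefault(r.suite, []).append(r)
    return {
        suite: {
            "pass_rate": sum(r.passed for r in rs) / len(rs),
            "mean_tokens": mean(r.tokens for r in rs),
            "mean_latency_s": mean(r.latency_s for r in rs),
        }
        for suite, rs in by_suite.items()
    }


# Invented sample results.
RESULTS = [
    EvalResult("st-1", "single_turn", True, 1000, 1.0),
    EvalResult("st-2", "single_turn", False, 3000, 3.0),
    EvalResult("mt-1", "multi_turn", True, 8000, 12.0),
]

report = summarize(RESULTS)
```

Reporting the suites separately keeps a regression in simple single-turn behavior from being masked by gains on the harder multi-turn cases.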

Builder Implications

Audit system prompts for business procedures, seasonal policies, and static data that should move into skills or files.
Use Claude Managed Agents when session durability, sandboxing, and multi-user infrastructure are becoming application complexity.
Prefer general primitives for data work when Claude can safely write and run code against files instead of reading everything into context.
Keep subagents for workflows that truly need independent context windows, not as a way to hide an oversized prompt.
Evaluate latency together with correctness and token cost; the workshop does not claim that every improvement also lowers latency.
Build hill-climbing eval loops into deployment so each prompt, tool, or agent refactor is measured against previous behavior.
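A hill-climbing gate of the kind suggested in the last item could be sketched as follows. The metric names, the 10% latency tolerance, and all numbers are assumptions made for this example, not values from the workshop. A refactor is kept only if it beats the current best pass rate without an unacceptable latency regression.

```python
def accept_refactor(best, candidate, max_latency_regression=0.10):
    """Accept a candidate only if correctness improves and latency stays in budget."""
    improves = candidate["pass_rate"] > best["pass_rate"]
    latency_ok = candidate["latency_s"] <= best["latency_s"] * (1 + max_latency_regression)
    return improves and latency_ok


def hill_climb(baseline, candidates):
    """Walk candidates in order, keeping each one that passes the gate."""
    best_name, best = "baseline", baseline
    for name, metrics in candidates:
        if accept_refactor(best, metrics):
            best_name, best = name, metrics
    return best_name, best


# Invented metrics for illustration only.
BASELINE = {"pass_rate": 0.60, "latency_s": 10.0}
CANDIDATES = [
    ("skills", {"pass_rate": 0.70, "latency_s": 10.0}),      # better, same latency
    ("extra-tool", {"pass_rate": 0.65, "latency_s": 9.0}),   # worse than current best
    ("primitives", {"pass_rate": 0.80, "latency_s": 12.5}),  # better, but too slow
]

winner, winner_metrics = hill_climb(BASELINE, CANDIDATES)
print(winner, winner_metrics)
```

In this sketch the third candidate is rejected despite its higher pass rate, which illustrates why latency is evaluated alongside correctness rather than after it.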

Things to Verify

The exact Claude Managed Agents skill payloads, schemas, and deployment APIs used in the workshop repository.
The configuration required for the workshop's uv project manager and deployment commands.
The LLM-as-judge criteria, weights, and scoring rules behind the reported eval movement.
The pricing, concurrency limits, and performance behavior for scaling Managed Agents into real production traffic.
How callable agents pass state back to the primary orchestrator without overflowing the parent context.
Which tools are built into Claude Managed Agents versus standard Anthropic SDK options.