Your MCP Server Is Eating Your Context Window. There’s a Simpler Way

The Silent Killer of AI Agent Performance: MCP Context Bloat

You’ve connected your AI agent to GitHub, Slack, and Sentry. Everything seems fine until you realize your agent just burned through 55,000 tokens before reading a single user message. That’s not a bug—it’s the MCP context tax, and it’s quietly destroying agent performance across the industry.

The Token Math That Nobody Wants to Talk About

Here’s the brutal reality: every MCP tool you add costs between 550 and 1,400 tokens just for its name, description, JSON schema, field descriptions, enums, and system instructions. Connect a real API surface with 50+ endpoints, and you’re looking at 50,000+ tokens before your agent even starts thinking.

One team reported three MCP servers consuming 143,000 of 200,000 tokens. That’s 72% of the context window burned on tool definitions, leaving a measly 57,000 tokens for the actual conversation, retrieved documents, reasoning, and response. Good luck building anything useful in that space.
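The arithmetic is easy to check. A back-of-envelope sketch (the 143,000 and 200,000 figures are the ones reported above; everything else is simple division):

```go
package main

import "fmt"

// contextBudget reports how much of a context window survives after
// tool definitions are loaded upfront, and what fraction was burned.
func contextBudget(window, toolDefs int) (remaining int, burnedPct float64) {
	remaining = window - toolDefs
	burnedPct = 100 * float64(toolDefs) / float64(window)
	return remaining, burnedPct
}

func main() {
	remaining, burned := contextBudget(200_000, 143_000)
	fmt.Printf("remaining: %d tokens, burned: %.1f%%\n", remaining, burned)
	// → remaining: 57000 tokens, burned: 71.5%
}
```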

The Trilemma That Broke Duet

David Zhang, building Duet, described ripping out their MCP integrations entirely—even after getting OAuth and dynamic client registration working. The tradeoff was impossible:

  • Load everything up front → lose working memory for reasoning and history
  • Limit integrations → agent can only talk to a few services
  • Build dynamic tool loading → add latency and middleware complexity

He called it a “trilemma.” That feels about right.

The Benchmark That Proves It

A recent benchmark by Scalekit ran 75 head-to-head comparisons (same model, Claude Sonnet 4, same tasks, same prompts) and found MCP costing 4 to 32× more tokens than CLI for identical operations. Their simplest task—checking a repo’s language—consumed 1,365 tokens via CLI and 44,026 via MCP. The overhead is almost entirely schema: 43 tool definitions injected into every conversation, of which the agent uses one or two.

Three Approaches, One Problem

The industry is converging on three responses to context bloat. Each has a sweet spot.

MCP with Compression Tricks

Keep MCP but fight the bloat. Teams compress schemas, use tool search to load definitions on demand, or build middleware that slices OpenAPI specs into smaller chunks. This works for small, well-defined interactions like looking up an issue or fetching a document. But it adds infrastructure complexity and still pays per-tool token costs.
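The spec-slicing idea can be sketched in a few lines: group endpoints by tag and hand the agent only the slice it asked for. This is an illustrative sketch, not any real middleware; the `Endpoint` type and tag names are assumptions:

```go
package main

import "fmt"

// Endpoint is a hypothetical, minimal view of one OpenAPI operation.
type Endpoint struct {
	Method, Path, Tag string
}

// sliceByTag returns only the endpoints under one tag, so the agent's
// context holds a handful of definitions instead of the whole spec.
func sliceByTag(spec []Endpoint, tag string) []Endpoint {
	var out []Endpoint
	for _, e := range spec {
		if e.Tag == tag {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	spec := []Endpoint{
		{"GET", "/invoices", "accounting"},
		{"POST", "/invoices", "accounting"},
		{"GET", "/candidates", "ats"},
	}
	for _, e := range sliceByTag(spec, "accounting") {
		fmt.Println(e.Method, e.Path)
	}
}
```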

Code Execution (The Duet Approach)

Treat the agent like a developer with a persistent workspace. When the agent needs a new integration, it reads the API docs, writes code against the SDK, runs it, and saves the script for reuse. This is powerful for long-lived workspace agents that maintain state across sessions and need complex workflows. The downside? Your agent is now writing and executing arbitrary code against production APIs. The safety surface is enormous.

CLI as the Agent Interface

Instead of loading schemas into the context window or letting the agent write integration code, give it a CLI. A well-designed CLI is a progressive disclosure system by nature. When a human developer needs to use a tool they haven’t touched before, they don’t read the entire API reference. They run tool --help, find the subcommand they need, run tool subcommand --help, and get the specific flags for that operation.

Agents can do exactly the same thing. And the token economics are dramatically different.

Why CLIs Are the Pragmatic Sweet Spot

Progressive Disclosure Saves Tokens

Here’s what the Apideck CLI agent prompt looks like. This is the entire thing an AI agent needs in its system prompt:

```
Use apideck to interact with the Apideck Unified API.
Available APIs: apideck --list
List resources: apideck <api> --list
Operation help: apideck <api> <resource> <verb> --help
APIs: accounting, ats, crm, ecommerce, hris, …
Auth is pre-configured. GET auto-approved. POST/PUT/PATCH prompt (use --yes). DELETE blocked (use --force).
Use --service-id to target a specific integration.
For clean output: -q -o json
```

That’s ~80 tokens. Compare that to the alternatives:

| Approach | Tokens consumed | When |
| --- | --- | --- |
| Full OpenAPI spec in context | 30,000–100,000+ | Before first message |
| MCP tools (~3,600 per API) | 10,000–50,000+ | Before first message |
| CLI agent prompt | ~80 | Before first message |
| CLI `--help` call | ~50–200 | Only when needed |

The agent starts with 80 tokens of guidance and discovers capabilities on demand:

```bash
# Level 1: What APIs are available? (~20 tokens output)
$ apideck --list
accounting ats connector crm ecommerce hris …

# Level 2: What can I do with accounting? (~200 tokens output)
$ apideck accounting --list
Resources in accounting API:
  invoices
    list    GET    /accounting/invoices
    get     GET    /accounting/invoices/{id}
    create  POST   /accounting/invoices
    delete  DELETE /accounting/invoices/{id}
  customers
    list    GET    /accounting/customers

# Level 3: How do I create an invoice? (~150 tokens output)
$ apideck accounting invoices create --help
Usage: apideck accounting invoices create [flags]
Flags:
  --data string        JSON request body (or @file.json)
  --service-id string  Target a specific connector
  --yes                Skip write confirmation
  -o, --output string  Output format (json|table|yaml|csv)
```

Each step costs 50–200 tokens, loaded only when the agent decides it needs that information. An agent handling an accounting query might consume 400 tokens total across three --help calls. The same surface through MCP would cost 10,000+ tokens loaded upfront whether the agent uses them or not.
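The break-even is easy to model: MCP pays its whole schema cost whether or not the tools are used, while the CLI pays a small prompt plus one `--help` lookup per operation the agent hasn't seen yet. All token figures in this sketch are illustrative assumptions in the spirit of the numbers above:

```go
package main

import "fmt"

// mcpCost: the full schema is injected upfront, used or not.
func mcpCost(schemaTokens int) int { return schemaTokens }

// cliCost: a tiny fixed prompt plus one --help lookup per new operation.
func cliCost(promptTokens, helpTokens, newOps int) int {
	return promptTokens + helpTokens*newOps
}

func main() {
	fmt.Println("MCP upfront:", mcpCost(10_000))       // 10000 tokens
	fmt.Println("CLI on-demand:", cliCost(80, 150, 3)) // 80 + 3×150 = 530 tokens
}
```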

Reliability: Local Beats Remote

Scalekit’s benchmark recorded a 28% failure rate on MCP calls to GitHub’s Copilot server. Out of 25 runs, 7 failed with TCP-level connection timeouts. The remote server simply didn’t respond in time. Not a protocol error, not a bad tool call. The connection never completed.

CLI agents don’t have this failure mode. The binary runs locally. There’s no remote server to time out, no connection pool to exhaust, no intermediary to go down. When your agent runs apideck accounting invoices list, it makes a direct HTTPS call to the Apideck API. One hop, not two.

Structural Safety Beats Prompt-Based Safety

Telling an agent “never delete production data” in a system prompt is like putting a sticky note on the nuclear launch button. It might work. Probably. Until a creative prompt injection peels the note off.

The Apideck CLI takes a structural approach. Permission classification is baked into the binary based on HTTP method:

```go
// From internal/permission/engine.go
switch op.Permission {
case spec.PermissionRead:
	return ActionAllow // GET → auto-approved
case spec.PermissionWrite:
	return ActionPrompt // POST/PUT/PATCH → confirmation required
case spec.PermissionDangerous:
	return ActionBlock // DELETE → blocked by default
}
```

No prompt can override this. A DELETE operation is blocked unless the caller explicitly passes --force. A POST requires --yes or interactive confirmation. GET operations run freely because they can’t modify state.
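The pattern generalizes: classify by HTTP method, and let only explicit CLI flags relax the decision, never anything the model says. A simplified reimplementation of the idea (not the actual Apideck engine):

```go
package main

import "fmt"

// Action is what the CLI does before dispatching a request.
type Action int

const (
	Allow  Action = iota // run immediately
	Prompt               // require interactive confirmation
	Block                // refuse outright
)

// decide maps an HTTP method to an action. Only explicit flags
// (--yes, --force) can upgrade it; the prompt cannot.
func decide(method string, yes, force bool) Action {
	switch method {
	case "GET":
		return Allow // reads can't modify state
	case "POST", "PUT", "PATCH":
		if yes {
			return Allow
		}
		return Prompt
	case "DELETE":
		if force {
			return Allow // caller explicitly opted in
		}
		return Block
	}
	return Block // unknown methods fail closed
}

func main() {
	fmt.Println(decide("GET", false, false) == Allow)
	fmt.Println(decide("DELETE", false, false) == Block)
}
```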

When CLI Isn’t the Answer

CLIs aren’t universally better. Here’s where the other approaches still win.

MCP is better for tightly scoped, high-frequency tools. If your agent calls the same 5–10 tools hundreds of times per session, the upfront schema cost amortizes well.

Code execution is better for complex, stateful workflows. If your agent needs to poll an API every 30 seconds, aggregate results across paginated endpoints, or orchestrate multi-step transactions with rollback logic, writing code is more natural than chaining CLI calls.

MCP is better when your agent acts on behalf of other people’s users. When your agent automates your own workflow, ambient credentials are fine. But if you’re building a B2B product where agents act on behalf of your customers’ employees, across organizations those customers control, the identity problem becomes three-layered.

What This Means for API Providers

If you’re building developer tools in 2026, AI agents are becoming a primary consumer of your API surface. A few things are worth considering:

Your OpenAPI spec is too big for a context window. If you have 50+ endpoints, converting your spec to MCP tools will burn the budget of most agent interactions.

Progressive disclosure isn’t just a UX pattern anymore. It’s a token optimization strategy. Give agents a way to discover capabilities incrementally instead of dumping everything upfront.

Structural safety is non-negotiable. Prompt-based guardrails are the security equivalent of honor system parking. Build permission models into your tools, not your prompts.

Ship machine-friendly output formats. JSON by default in non-interactive contexts. Stable exit codes. Deterministic output.
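A CLI can detect whether it is writing to a pipe (an agent or script) or a terminal (a human) and default to JSON accordingly. A sketch using only the Go standard library; the invoice payload is illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// isPiped reports whether stdout is a pipe or file rather than a
// terminal — the usual signal that a program is consuming the output.
func isPiped() bool {
	info, err := os.Stdout.Stat()
	if err != nil {
		return false
	}
	return info.Mode()&os.ModeCharDevice == 0
}

// render emits deterministic JSON for machines, plain text for humans.
func render(result map[string]string, piped bool) string {
	if piped {
		b, _ := json.Marshal(result) // map keys marshal in sorted order
		return string(b)
	}
	return fmt.Sprintf("Invoice %s: %s", result["id"], result["status"])
}

func main() {
	result := map[string]string{"id": "inv_123", "status": "paid"}
	fmt.Println(render(result, isPiped()))
	os.Exit(0) // stable exit codes: 0 on success, nonzero per error class
}
```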


tags: #MCP #ContextBloat #AIagents #CLI #TokenEconomics #AgentPerformance #OpenAPI #ProgressiveDisclosure #StructuralSafety #AgentSecurity
