Output Shape Is Cost: Why We Hand-Roll Our MCP Servers
We ripped out our entire integration layer 48 hours before our first live demos. Not because it was broken. It worked perfectly in development. It worked perfectly in testing. It worked perfectly right up until we put it inside a budget-constrained agent loop — and then every sub-agent we spawned started dying mid-task.
This is the story of what went wrong, what we learned, and the one architectural principle we wish someone had told us sooner: in LLM-powered systems, output shape is cost.
Key Takeaway: When a tool returns 50KB of JSON instead of 500 bytes of markdown, it’s not just verbose — it’s 100x more expensive in token cost. Under fixed budgets and turn limits, that difference is the difference between an agent that completes its task and one that dies mid-sentence. If you’re building multi-agent systems with MCP, the shape of your tool outputs is the most important cost lever you’re not thinking about.
The Setup
Luminary Lane is an AI-powered brand automation platform. At its core, it runs a loop: SENSE (read signals from your marketing channels) → THINK (generate strategy) → ACT (execute campaigns) → LEARN (measure and adapt). Each phase is handled by specialized AI agents that need to talk to external services — social media APIs, analytics platforms, email tools, CMS systems.
To connect all these services, we needed an integration layer. The Model Context Protocol (MCP) is the natural choice — it’s the emerging standard for giving AI agents access to external tools. We had two options:
- Use a third-party MCP provider that offers pre-built connectors for hundreds of services, with managed auth.
- Hand-roll our own MCP servers for each integration, purpose-built for our agents’ specific needs.
We chose option one. It was faster. It covered more services. The auth was handled for us. Running it locally under a Claude Code subscription, it got us all the data we wanted. For weeks, we thought we’d found the perfect solution — auth and MCP in one package.
Some early tests failed, so we kept bumping the budget — which seemed to temporarily solve the problem. But as we added more integrations to each brand, the costs started ballooning. Then came demo week.
What “Production” Actually Means for AI Agents
Here’s what most tutorials and demos don’t tell you about running AI agents in production: the economics are completely different from development.
In development, you’re typically running agents under a subscription model or with generous API credits. Context windows are big. Turn counts are unlimited. Token budgets are “whatever it takes to get the right answer.” The feedback loop is you, sitting at a terminal, watching the agent work.
In production, everything is constrained:
- Token budget per interaction: We enforce a hard cap of ~$0.50 per agent interaction. This isn’t arbitrary — it’s the ceiling that makes our unit economics work. An agent that costs $5 per cycle to read your LinkedIn analytics and draft a post is a product that loses money on every customer.
- Turn count limits: Each sub-agent gets a finite number of tool calls. This prevents runaway loops and keeps costs predictable. If an agent can’t finish its task in N turns, it summarizes what it has and stops gracefully.
- Context window pressure: Every byte of tool output sits in the agent’s context window. Large outputs don’t just cost more in tokens — they crowd out the agent’s ability to reason about what it’s read.
These constraints aren’t bugs. They’re design decisions. A budget-constrained agent is a productizable agent. The question is whether your tool layer respects those constraints.
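To make the budget math concrete, here is a back-of-envelope sketch. The ~4-bytes-per-token ratio and the per-million-token price are illustrative assumptions, not our actual billing:

```python
# Back-of-envelope: what a single tool response costs in input tokens.
# Assumptions (illustrative): ~4 bytes per token for English/JSON text,
# and a hypothetical input price of $3 per million tokens.
BYTES_PER_TOKEN = 4
PRICE_PER_MTOK = 3.00  # USD per million input tokens (assumed)

def tool_output_cost(output_bytes: int) -> float:
    """Approximate dollar cost of putting one tool response into context."""
    tokens = output_bytes / BYTES_PER_TOKEN
    return tokens / 1_000_000 * PRICE_PER_MTOK

# A 40KB JSON blob vs. a 400-byte markdown summary.
# Note this understates the gap: the response sits in context on every
# subsequent turn, so the real cost multiplies by remaining turn count.
json_cost = tool_output_cost(40_000)  # 10,000 tokens
md_cost = tool_output_cost(400)       # 100 tokens
print(f"JSON: ${json_cost:.4f}  markdown: ${md_cost:.4f}")
```

The per-response difference looks small in isolation; multiplied across every tool call, every turn, and every agent in a pipeline, it becomes the dominant line item.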
What Broke
Our third-party MCP tools returned responses as raw JSON objects. Comprehensive, well-structured, complete JSON objects. The kind of JSON you’d want if you were building a dashboard or a data pipeline.
Here’s a representative example of what a social media analytics response looked like (simplified and anonymized):
```json
{
  "data": {
    "account": {
      "id": "acc_12345",
      "platform": "linkedin",
      "metadata": {
        "created_at": "2024-01-15T00:00:00Z",
        "updated_at": "2026-04-09T14:32:00Z",
        "permissions": ["read_posts", "read_analytics", "write_posts"],
        "rate_limits": { "remaining": 487, "reset_at": "2026-04-09T15:00:00Z" }
      }
    },
    "posts": [
      {
        "id": "post_abc",
        "type": "ARTICLE",
        "created_at": "2026-04-08T09:00:00Z",
        "content": { "text": "...", "media": [...], "hashtags": [...] },
        "metrics": {
          "impressions": 1247,
          "clicks": 89,
          "reactions": { "like": 34, "celebrate": 12, "insightful": 8 },
          "comments": 7,
          "shares": 3,
          "engagement_rate": 0.0432
        },
        "audience": {
          "demographics": { ... },
          "top_companies": [ ... ],
          "top_titles": [ ... ]
        }
      }
    ]
  },
  "pagination": { ... },
  "rate_limit_info": { ... }
}
```
Multiply this by 20 posts, with full audience demographics per post, and you’re looking at 30-50KB of JSON per tool call.
Now here’s the same information, the way our hand-rolled MCP returns it:
```markdown
## LinkedIn Performance (Last 7 Days)

Top posts by engagement:

1. "Why onboarding is make-or-break for AI products" — 1,247 impressions, 4.3% engagement, 7 comments
2. "The hidden cost of chat-based marketing tools" — 892 impressions, 3.1% engagement, 4 comments
3. "Our SENSE→THINK→ACT loop explained" — 634 impressions, 2.8% engagement, 2 comments

Trends: Article posts outperform short updates 3:1. Tuesday/Wednesday posts get 2x engagement. Comments come from CTOs and founders (your target persona).
```
Same information. Maybe 400 bytes instead of 40KB. 100x cheaper in tokens. And critically, the markdown version is pre-analyzed — the agent doesn’t need to spend additional turns parsing, filtering, and summarizing the raw data. The MCP server already did that work.
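The transformation itself is mundane: the MCP server ranks, filters, and formats before anything touches the agent's context. A minimal sketch, using the (anonymized) field names from the JSON example above — not our production code:

```python
def summarize_linkedin(raw: dict, top_n: int = 3) -> str:
    """Collapse a raw analytics payload into a short markdown summary.
    Field names follow the anonymized response shape shown above."""
    posts = raw["data"]["posts"]
    # Rank once on the server so the agent never spends a turn doing it.
    ranked = sorted(posts, key=lambda p: p["metrics"]["engagement_rate"], reverse=True)
    lines = ["## LinkedIn Performance (Last 7 Days)", "", "Top posts by engagement:"]
    for i, p in enumerate(ranked[:top_n], 1):
        m = p["metrics"]
        lines.append(
            f'{i}. "{p["content"]["text"][:60]}" — {m["impressions"]:,} impressions, '
            f'{m["engagement_rate"]:.1%} engagement, {m["comments"]} comments'
        )
    return "\n".join(lines)
```

Everything the agent would have burned a turn computing — sorting, truncating, percentage formatting — happens once, in cheap server-side code, outside the token meter.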
The Death Spiral
Here’s what happened when our agents ran with the large JSON responses:
Turn 1: Agent calls the analytics tool. Gets 40KB of JSON. Context window now 60% full (including system prompt + conversation history).
Turn 2: Agent tries to parse and summarize the JSON. Burns a full turn just understanding the data. Context window now 75% full.
Turn 3: Agent calls the content generation tool. Gets another 20KB of response with template options, formatting metadata, content policy details.
Turn 4: Agent has almost no context window left for actual reasoning. It attempts to draft content but the output is truncated or incoherent.
Turn 5: Budget exhausted. Agent dies mid-task.
Or worse — the agent hits the turn count limit at Turn 3 and never gets to the actual content generation step. It produces a partial summary: “I retrieved your LinkedIn analytics and found some interesting patterns. Your top post was…” and that’s it. Task incomplete.
This happened in production, 48 hours before our first live demos.
The Workarounds That Failed
When you’re an optimist and an engineer, your first instinct is to fix the integration, not replace it. We tried everything:
Proxy server: Put a thin server between the agent and the MCP that would intercept responses and compress them. Problem: the MCP protocol doesn’t make this trivial, and we were adding latency and another failure point. Abandoned after 4-5 hours.
Client-side preprocessing: Write a wrapper that calls the MCP tools and preprocesses the output before handing it to the agent. Problem: the preprocessing itself consumed agent turns. We were spending budget on reformatting data instead of doing the task. Net effect: fewer turns available, not more.
CLI bridge: Build a command-line tool that would call the APIs directly, format the output as markdown, and expose it as a local MCP. This was closest to working, but we were essentially rebuilding the entire integration from scratch with extra steps. At which point — why not just build the MCP directly?
Each workaround failed for the same fundamental reason: you can’t efficiently fix output shape after the fact. The cost is incurred the moment the data enters the agent’s context. Compressing it downstream is throwing good money after bad.
The Fix: Purpose-Built MCPs
We went back to what we’d been doing before we adopted the third-party solution: hand-rolling MCP servers for each integration.
Here’s what “hand-rolled” means in practice:
Limited tool surface. Our LinkedIn MCP doesn’t expose 47 tools for every possible LinkedIn API operation. It exposes 4: get_performance_summary, get_post_details, search_content, and schedule_post. Each tool does exactly one thing the agent needs.
Markdown output. Every tool returns pre-formatted text or markdown, not raw API responses. The MCP server does the parsing, filtering, aggregation, and summarization. The agent receives a document it can immediately reason about.
Budget-aware design. Each tool is designed to return responses under 2KB. If the underlying API returns more data than fits in 2KB of markdown, the MCP server summarizes it and offers a get_details follow-up tool for the agent to drill into specific items.
Task-specific shaping. Our email analytics MCP doesn’t return the same shape as our social analytics MCP, even though both query similar types of data. Each is shaped for what the consuming agent needs. The SENSE agent gets trend summaries. The ACT agent gets specific content recommendations. Same data source, different output shapes, different costs.
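As a sketch of what task-specific shaping looks like, here are two shapers over the same (illustrative, flattened) post records — one for the SENSE agent, one for the ACT agent. The field names are assumptions for illustration, not a real schema:

```python
def shape_for_sense(posts: list) -> str:
    """SENSE agent shape: trend aggregates only, no individual posts."""
    by_type = {}
    for p in posts:
        by_type.setdefault(p["type"], []).append(p["engagement_rate"])
    lines = ["## Engagement by post type"]
    for t, rates in sorted(by_type.items()):
        lines.append(f"- {t}: avg {sum(rates) / len(rates):.1%} across {len(rates)} posts")
    return "\n".join(lines)

def shape_for_act(posts: list) -> str:
    """ACT agent shape: one concrete recommendation, ready to act on."""
    best = max(posts, key=lambda p: p["engagement_rate"])
    return (f'Recommendation: more {best["type"].lower()} posts — '
            f'"{best["title"]}" leads at {best["engagement_rate"]:.1%} engagement.')
```

Same input, two outputs, each a few hundred bytes, each requiring zero parsing work from its consumer.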
The result: our agents now complete full SENSE → THINK → ACT cycles well within the $0.50 budget cap. Most interactions come in under $0.20. The primary gate is now the cost budget per agent — not the turn count, which we’ve been able to relax since the output sizes dropped.
The Principle: Output Shape Is Cost
If you’re building multi-agent systems with MCP, here’s the principle we wish we’d internalized earlier:
The shape of your tool outputs is the single biggest cost lever in your agent architecture.
Not the model you choose. Not the prompt length. Not the number of tools. The output shape — how much data each tool call puts into the agent’s context, and how much work the agent has to do to extract value from that data.
This creates a clear hierarchy for MCP design:
| Output Type | Tokens (approx.) | Agent Work Needed | Effective Cost |
|---|---|---|---|
| Pre-analyzed markdown summary | 100-500 | None — ready to use | Lowest |
| Filtered text with key metrics | 500-2,000 | Light parsing | Low |
| Structured JSON (relevant fields only) | 2,000-10,000 | Moderate parsing + summarization | Medium |
| Raw API JSON (full response) | 10,000-100,000 | Heavy parsing + filtering + summarization | Highest |
The bottom of this table is where generic, third-party MCP servers live. They return raw API responses because they have to serve every possible use case. The top is where purpose-built MCP servers live. They return exactly what your agent needs because they’re designed for exactly your agent.
When to Use Generic MCP Servers (And When Not To)
Generic third-party MCP servers are not bad tools. They’re solving a real problem: the cold-start integration problem. When you’re prototyping, exploring an API, or building a one-off automation, the ability to connect to 500+ services instantly is genuinely valuable.
Use them for:
- Prototyping and exploration — when you’re figuring out what data you need from a service
- One-shot automations — where cost per run doesn’t matter because you’re running it once
- Development and testing — under subscription models where token cost is amortized
Don’t use them for:
- Production agent loops — where every token counts against a per-interaction budget
- Multi-agent systems — where output from one tool becomes context for multiple downstream agents
- High-frequency operations — where the same tool is called hundreds of times per day
The dividing line is simple: if you’re counting tokens, roll your own.
The Auth Question
“But what about authentication? That’s the hard part.”
It’s not. It’s long — each OAuth flow takes time to implement — but it’s well-documented for every major platform. One team member handled the two most complex flows (Instagram and Facebook OAuth) in a couple of focused sessions. Everything else (LinkedIn, analytics, email, CMS) uses straightforward API key authentication.
If you’re worried about auth complexity being a reason to depend on a third-party integration layer, measure the actual time. Compare that to the ongoing cost of bloated tool outputs multiplied by every agent interaction, every day, forever. The math is clear.
What We Ship Now
Our current MCP architecture follows three rules:
- One MCP server per domain. LinkedIn, email, analytics, CMS — each gets its own server with its own output contract. No mega-server that does everything.
- Markdown or plain text output only. If an agent receives JSON from a tool, something is wrong. The MCP server’s job is to translate API responses into documents the agent can reason about immediately.
- 2KB output ceiling. Any tool that would naturally return more than 2KB of markdown offers a summary + drill-down pattern instead. The agent gets the overview first and can request details if the task requires it.
These rules keep our agents fast, cheap, and predictable. An agent running a daily brand monitoring cycle across five channels completes in under 90 seconds and costs less than $0.20 in API calls.
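The ceiling rule can be enforced mechanically rather than by convention. A sketch of it as a decorator — the 100-byte reserve and the truncation message are illustrative choices on our part, not a standard:

```python
import functools

def cap_output(limit_bytes: int = 2_000, drill_down: str = "get_details"):
    """Enforce a per-tool output ceiling: oversized responses are clipped
    at a line boundary and end with a pointer to a drill-down tool."""
    budget = limit_bytes - 100  # reserve room for the drill-down pointer
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            text = fn(*args, **kwargs)
            if len(text.encode("utf-8")) <= limit_bytes:
                return text
            clipped = text.encode("utf-8")[:budget].decode("utf-8", "ignore")
            clipped = clipped.rsplit("\n", 1)[0]  # never cut mid-line
            return clipped + f"\n\n(Truncated; call {drill_down} for specifics.)"
        return inner
    return wrap
```

Wrapping every tool this way turns the output budget from a design guideline into an invariant the rest of the system can rely on.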
For Builders: The Checklist
If you’re building multi-agent systems with MCP, audit your tool outputs:
- Measure your tool output sizes. Log the token count of every tool response for a week. You’ll be surprised.
- Identify the top 3 costliest tools. These are your optimization targets.
- Ask: does the agent use all of this data? If the agent is summarizing a 40KB JSON response down to 3 bullet points, the MCP server should be returning those 3 bullet points.
- Set a per-tool output budget. We use 2KB. Yours might be different. But have a number.
- Test under production constraints. Run your agents with real budget caps and turn limits, not unlimited dev credentials. The behavior difference will shock you.
- Separate prototyping from production. Use generic tools to explore. Ship with purpose-built tools.
A Note on Ecosystem Maturity
The MCP ecosystem is young. Third-party providers are building for the broadest possible audience, which means comprehensive JSON responses that serve dashboards, data pipelines, and AI agents alike. That’s the right call for the ecosystem’s growth phase.
But as multi-agent systems move into production, we’ll see the same pattern that played out in microservices: generic, do-everything connectors give way to purpose-built, domain-specific services optimized for their consumers. The teams that figure this out early — that treat tool output shape as a first-class architectural concern — will build agents that are cheaper, faster, and more reliable than those that don’t.
The principle is simple: every byte that enters your agent’s context window is a cost. Design accordingly.
This post is part of our engineering series on building production AI agent systems. Luminary Lane is an AI-powered brand automation platform that runs autonomous marketing for growing businesses. We write about what we learn building it.