Parallel API Pricing: What Concurrent Calls Actually Cost

Matthew Diakonov · 12 min read


Running one API call at a time is simple to budget for. Running five, ten, or fifty calls in parallel changes the math entirely. Not because providers charge more per token for concurrent requests (they don't), but because parallelism unlocks consumption patterns that make costs spike in ways sequential usage never does.

We run parallel Claude Code agents daily to build Fazm, a macOS desktop AI agent. After tracking over 100M tokens across parallel sessions, here is what we learned about the real cost structure.

How Parallel API Pricing Actually Works

Most LLM providers price on a per-token basis regardless of concurrency. You pay the same rate for a token whether it arrives in your only request or your fiftieth simultaneous one. The pricing model is straightforward:

| Provider | Input (per 1M tokens) | Output (per 1M tokens) | Concurrency surcharge |
|---|---|---|---|
| Anthropic Claude Sonnet 4.6 | $3.00 | $15.00 | None |
| Anthropic Claude Opus 4.6 | $15.00 | $75.00 | None |
| OpenAI GPT-4.1 | $2.00 | $8.00 | None |
| OpenAI o3 | $10.00 | $40.00 | None |
| Google Gemini 2.5 Pro | $1.25 | $10.00 | None |
| DeepSeek V3 | $0.27 | $1.10 | None |

No provider charges a concurrency premium at the token level. The pricing page looks the same whether you send 1 request per minute or 100. So where does the cost multiplication come from?

The Hidden Cost Multiplier: Duplicated Context

The real expense in parallel API usage is not the parallelism itself. It is the duplicated context that each parallel session carries independently.

When you run five agents in parallel, each one loads its own context window. That means five copies of your system prompt, five copies of the codebase context, five copies of the conversation history. If each session loads 80K tokens of context, that is 400K input tokens just to get started, not 80K.

(Diagram: context duplication across parallel sessions. Each of five agents loads its own 80K-token copy of the codebase context at $0.24 per load. Total: 400K input tokens = $1.20 per round, versus $0.24 for a single sequential session.)

This is not a hypothetical problem. When we tracked 100M tokens across parallel Claude Code sessions, 99.4% were input tokens. The model reads far more than it writes, and parallelism multiplies the reading.
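The duplication math is simple enough to sketch. Here is a minimal cost calculator using the Sonnet input price from the table above; the numbers mirror the five-agent example:

```python
# Back-of-envelope cost of duplicated context across parallel agents,
# at Sonnet's input price of $3.00 per 1M tokens.
INPUT_PRICE_PER_MTOK = 3.00

def context_load_cost(context_tokens: int, num_agents: int) -> float:
    """Input cost (USD) of loading the same context into each parallel agent."""
    total_input = context_tokens * num_agents
    return total_input / 1_000_000 * INPUT_PRICE_PER_MTOK

print(context_load_cost(80_000, 1))  # ~$0.24 for a single sequential session
print(context_load_cost(80_000, 5))  # ~$1.20 per parallel round
```

Every additional agent adds another full context load, so cost scales linearly with agent count before any caching is applied.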

Provider Rate Limits: The Real Constraint on Parallel Usage

While tokens cost the same regardless of concurrency, every provider enforces rate limits that cap how much parallelism you can actually use:

| Provider | Requests/min (default) | Tokens/min (default) | How to increase |
|---|---|---|---|
| Anthropic | 50 | 40,000 | Usage tier auto-upgrade, or request increase |
| OpenAI | 500 | 30,000 | Tier 1-5 auto-upgrade based on spend |
| Google AI | 360 | 4,000,000 | Pay-as-you-go tiers |
| DeepSeek | 60 | N/A | Contact sales |

At the default Anthropic tier, the binding constraint is the token ceiling, not the request ceiling: 40,000 tokens per minute is less than a single 80K-context request, so sustaining parallel agents at that context size is only practical at higher usage tiers. Those tiers unlock automatically as your cumulative spend increases.

Warning

Rate limit errors (HTTP 429) in parallel workflows are expensive even when retried. Each retry re-sends the full context, so a single 429 on an 80K-token request wastes $0.24 in input tokens when it eventually succeeds. Build backoff into your orchestration layer, not your prompts.
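A minimal sketch of that orchestration-layer backoff, using exponential delays with full jitter; `send_request` is a stand-in for your actual API call, which should raise a rate-limit exception (or your SDK's equivalent) on HTTP 429:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for your SDK's HTTP 429 exception."""
    pass

def call_with_backoff(send_request, max_retries=5, base_delay=1.0):
    """Retry a request on rate-limit errors with exponential backoff + jitter."""
    for attempt in range(max_retries + 1):
        try:
            return send_request()
        except RateLimitError:
            if attempt == max_retries:
                raise
            # Full jitter: sleep between 0 and base_delay * 2^attempt seconds,
            # so parallel agents don't all retry in lockstep.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

Jitter matters specifically in parallel workflows: without it, all agents that hit a 429 at the same time retry at the same time and hit it again.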

Batch APIs: The Parallel Pricing Discount

If your workload can tolerate latency (minutes to hours instead of seconds), batch APIs offer the most cost-effective way to run parallel operations:

| Provider | Batch discount | Turnaround time | Use case |
|---|---|---|---|
| Anthropic Message Batches | 50% off | Up to 24 hours | Evaluation runs, bulk classification |
| OpenAI Batch API | 50% off | Up to 24 hours | Data processing, content generation |
| Google Batch Prediction | ~40% off | Varies | Large-scale inference |

With Anthropic's Message Batches API, you submit up to 10,000 requests in a single batch. Each request runs at half price. For workloads like running test suites, evaluating prompt variants, or processing document backlogs, this is the most practical way to get parallel execution at reduced cost.

```python
# Anthropic Message Batches API example
import anthropic

client = anthropic.Anthropic()

my_prompts = [...]  # your list of prompt strings

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"task-{i}",
            "params": {
                "model": "claude-sonnet-4-6-20250514",
                "max_tokens": 1024,
                "messages": [
                    {"role": "user", "content": prompt}
                ],
            },
        }
        for i, prompt in enumerate(my_prompts)
    ]
)

# Results arrive within 24 hours at 50% token cost
print(f"Batch {batch.id} submitted with {len(my_prompts)} requests")
```

Prompt Caching: Cutting Parallel Costs by 90%

The single most effective optimization for parallel API pricing is prompt caching. When multiple parallel sessions share a common system prompt or context prefix, cached tokens cost a fraction of full-price tokens:

| Provider | Cache write cost | Cache read cost | Savings on cache hit |
|---|---|---|---|
| Anthropic | 1.25x base input | 0.1x base input | 90% off cached portion |
| OpenAI | 1x base input | 0.5x base input | 50% off cached portion |
| Google | 1x base input | 0.25x base input | 75% off cached portion |

For parallel agents that share a large system prompt (say 60K tokens of codebase context), only the first request pays full price. Subsequent parallel requests hit the cache and pay 10% on Anthropic. That turns a 5x cost multiplier into roughly a 2.5x multiplier, as the worked numbers below show.

```python
# Cost comparison: 5 parallel agents, 80K context each
# Without caching: 5 x 80K x $3/MTok = $1.20 per round
# With caching (60K cached, 20K unique per agent):
#   First agent: 60K x $3.75/MTok + 20K x $3/MTok = $0.285
#   Agents 2-5:  60K x $0.30/MTok + 20K x $3/MTok = $0.312 (4 agents)
#   Total: $0.285 + $0.312 = $0.597 per round (50% savings)
```
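In practice, enabling this on Anthropic means marking the shared prefix with a `cache_control` breakpoint in the request body. A minimal sketch of the request shape (the model ID matches the batch example above; the context string is a placeholder):

```python
def build_request(shared_context: str, task: str) -> dict:
    """Request body with the shared context prefix marked as cacheable."""
    return {
        "model": "claude-sonnet-4-6-20250514",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": shared_context,  # identical prefix across all agents
                # Cache breakpoint: everything up to here is cached, so
                # parallel agents pay the 0.1x cache-read rate on it.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": task}],
    }

req = build_request("...60K tokens of codebase context...", "Review the auth module.")
```

The key constraint is that the cached prefix must be byte-identical across requests; any per-agent variation belongs after the breakpoint, in the user message.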

Model Routing: Not All Parallel Tasks Need the Same Model

Another pattern that reduces parallel API costs is routing different parallel tasks to different models based on complexity. A classification subtask does not need Opus pricing; Haiku handles it fine at roughly 1/19th the cost.

(Diagram: a task router dispatches parallel tasks by complexity. Haiku handles classification and extraction ($0.80/$4.00 per MTok, 10 parallel tasks); Sonnet handles code generation and analysis ($3.00/$15.00 per MTok, 3 parallel tasks); Opus handles architecture and planning ($15.00/$75.00 per MTok, 1 parallel task).)

In our setup, roughly 70% of parallel subtasks (file reads, search, classification) run on Haiku, 25% on Sonnet (code generation, review), and only 5% on Opus (architectural decisions). This cuts the blended cost per parallel operation by about 60% compared to running everything on Sonnet.
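A minimal router matching that split might look like the following; the task categories mirror the ones described above, and the short model names are placeholders for real model IDs:

```python
# Complexity-based model routing: cheap model for mechanical subtasks,
# mid-tier for code work, top-tier only for high-stakes decisions.
MODEL_FOR_TASK = {
    "file_read": "haiku",
    "search": "haiku",
    "classification": "haiku",
    "code_generation": "sonnet",
    "code_review": "sonnet",
    "architecture": "opus",
}

# (input, output) price per 1M tokens, from the diagram above
PRICE_PER_MTOK = {
    "haiku": (0.80, 4.00),
    "sonnet": (3.00, 15.00),
    "opus": (15.00, 75.00),
}

def route(task_type: str) -> str:
    """Pick a model for a task type, defaulting to the mid-tier model."""
    return MODEL_FOR_TASK.get(task_type, "sonnet")
```

Defaulting unknown task types to the mid-tier model is a deliberate choice: misrouting a hard task to Haiku costs quality, while misrouting an easy task to Sonnet only costs a few cents.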

Common Pitfalls

  • Ignoring input token dominance. Developers focus on output pricing, but in parallel workloads, input tokens account for 95%+ of total cost. Optimizing output token usage barely moves the needle; reducing context size and enabling caching is what matters.

  • Not accounting for retries. A parallel batch of 20 requests where 3 hit rate limits does not cost 20 requests. It costs 23, because each retry re-sends the full context. With 80K contexts, three retries add $0.72 in wasted input tokens on Sonnet.

  • Running all tasks at the same priority. Most providers offer different rate limit pools for batch vs. interactive requests. If half your parallel tasks are not time-sensitive, moving them to batch processing saves 50% and frees up rate limit headroom for the interactive ones.

  • Forgetting cache TTLs. Anthropic's prompt cache has a 5-minute TTL by default (a 1-hour TTL is available at a higher cache-write price). If your parallel agents do not all start within that window, later agents pay full price instead of cache price. Stagger launches within the TTL, not outside it.

Cost Optimization Checklist

Here is a practical checklist for managing parallel API costs:

  • Enable prompt caching on shared context (system prompts, codebase files)

  • Route simple subtasks to cheaper models (Haiku for classification, Sonnet for generation)

  • Use batch APIs for non-urgent parallel work (50% savings on Anthropic and OpenAI)

  • Minimize per-session context by scoping CLAUDE.md and file reads to what each agent needs

  • Build retry budgets into cost estimates (add 10-15% for rate limit retries)

  • Monitor token usage per agent, not just total spend, to find the expensive outliers

  • Stagger parallel agent launches within cache TTL windows (5 min for Anthropic)
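The per-agent monitoring item deserves a sketch, since it is the one most teams skip. Something as simple as a per-agent token ledger surfaces the expensive outliers that a single total-spend number hides (the class and method names here are illustrative):

```python
from collections import defaultdict

class TokenTracker:
    """Per-agent input-token accounting to surface expensive outliers."""

    def __init__(self, input_price_per_mtok: float = 3.00):
        self.price = input_price_per_mtok
        self.usage = defaultdict(int)  # agent_id -> input tokens

    def record(self, agent_id: str, input_tokens: int) -> None:
        self.usage[agent_id] += input_tokens

    def cost(self, agent_id: str) -> float:
        """Input cost (USD) attributed to one agent."""
        return self.usage[agent_id] / 1_000_000 * self.price

    def outliers(self, threshold_tokens: int) -> list:
        """Agents whose cumulative input exceeds the threshold."""
        return [a for a, t in self.usage.items() if t > threshold_tokens]
```

In our experience the distribution is heavily skewed: one agent with a bloated context often accounts for a disproportionate share of the round's input cost, and it only shows up when you slice by agent.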

Real-World Monthly Cost Comparison

Here is what parallel API usage looks like at different scales, using Claude Sonnet 4.6 pricing ($3/$15 per MTok):

| Scenario | Agents | Calls/day | Context size | Monthly input cost | With caching |
|---|---|---|---|---|---|
| Solo developer | 1 | 50 | 40K tokens | ~$180 | ~$45 |
| Small team | 3 | 150 | 60K tokens | ~$810 | ~$200 |
| Heavy parallel | 5 | 300 | 80K tokens | ~$2,160 | ~$540 |
| CI/CD pipeline | 10 | 500 | 100K tokens | ~$4,500 | ~$900 |

The "with caching" column assumes 75% of context is cacheable across sessions. Real savings depend on your cache hit rate, which depends on how much shared context your parallel agents use.
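The uncached column reduces to one formula: calls per day times context size times 30 days, priced at Sonnet's $3 per 1M input tokens. A quick way to plug in your own numbers:

```python
def monthly_input_cost(calls_per_day: int, context_tokens: int,
                       price_per_mtok: float = 3.00) -> float:
    """Monthly input cost (USD): calls/day x context x 30 days at Sonnet pricing."""
    return calls_per_day * context_tokens * 30 / 1_000_000 * price_per_mtok

print(monthly_input_cost(50, 40_000))   # 180.0  (solo developer row)
print(monthly_input_cost(300, 80_000))  # 2160.0 (heavy parallel row)
```

Output tokens are deliberately omitted: as noted earlier, they are a rounding error in parallel workloads where input dominates at 95%+ of spend.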

Wrapping Up

Parallel API pricing is deceptively simple at the token level: no provider charges extra for concurrency. The real cost comes from duplicated context across parallel sessions, which makes input token optimization the single highest-leverage area. Enable prompt caching, route tasks to appropriately-sized models, and use batch APIs for anything that can wait. Those three changes alone can cut parallel API costs by 60-80%.

Fazm is an open source macOS AI agent, available on GitHub.
