New Startups Building AI Agent Infrastructure in 2025 and 2026

Matthew Diakonov · 12 min read


The AI agent infrastructure space exploded between mid-2025 and early 2026. Dozens of startups launched to solve the same core problem: AI models can reason, but they cannot act. Bridging that gap requires infrastructure that connects language models to operating systems, browsers, APIs, and desktop applications. This post maps the new startups doing that work, organized by the layer of the stack they target.

The Three Infrastructure Layers

AI agent infrastructure splits into three distinct layers, and most startups pick one to own. Understanding which layer a startup targets tells you more about its architecture than any pitch deck.

AI Agent Infrastructure Stack (diagram):

  • API layer: tool calling, auth and rate limits, schema validation

  • Desktop layer: screen reading, input simulation, app control

  • Linux/OS layer: sandboxing, process isolation, resource control

  • Orchestration layer (ties all three together): agent frameworks, memory, planning, multi-agent coordination

The API layer handles how agents call external services. The desktop layer handles how agents see and interact with GUIs. The Linux/OS layer handles sandboxing, process isolation, and system-level resource management. A few startups try to span all three through orchestration frameworks.

API Layer Startups

The API layer is where most of the startup activity concentrated in 2025. The problem: LLMs need structured ways to call tools, authenticate with services, and handle responses.

| Startup | Founded | What it does | Key technical choice |
|---|---|---|---|
| Composio | 2024 | Managed tool integrations for agents (400+ app connectors) | Pre-built auth flows, handles OAuth per user |
| Arcade AI | 2024 | Tool-use infrastructure with authorization baked in | Per-tool auth scopes, not per-agent |
| Toolhouse | 2024 | Tool execution cloud for LLM function calling | Optimized for low-latency tool dispatch |
| Mintlify | 2023 | API documentation that generates tool schemas automatically | Parses OpenAPI specs into agent-ready formats |
| Stainless | 2022 | SDK generation from API specs | Typed SDKs that agents can call without hallucinating params |

The common pattern: these startups sit between the LLM and the target API. They handle authentication, parameter validation, rate limiting, and response formatting so that agent developers do not have to build that plumbing for every integration.
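That plumbing can be sketched in a few lines. The toy gateway below (all names hypothetical, not any vendor's real API) validates parameters against a registered schema, enforces a per-minute rate limit, and wraps every response in a uniform envelope:

```python
# Hypothetical sketch of the plumbing an API-layer service handles for
# every tool call: schema validation, rate limiting, response formatting.
import time

class ToolGateway:
    def __init__(self, max_calls_per_minute=60):
        self.max_calls = max_calls_per_minute
        self.calls = []    # timestamps of recent calls
        self.schemas = {}  # tool name -> required parameter names

    def register(self, name, required_params):
        self.schemas[name] = set(required_params)

    def call(self, name, params, executor):
        # 1. Schema validation: reject hallucinated or missing parameters
        required = self.schemas[name]
        missing = required - params.keys()
        unknown = params.keys() - required
        if missing or unknown:
            return {"ok": False, "error": f"missing={missing}, unknown={unknown}"}
        # 2. Rate limiting: drop calls over the per-minute budget
        now = time.time()
        self.calls = [t for t in self.calls if now - t < 60]
        if len(self.calls) >= self.max_calls:
            return {"ok": False, "error": "rate limit exceeded"}
        self.calls.append(now)
        # 3. Dispatch and wrap the response in a uniform envelope
        return {"ok": True, "result": executor(**params)}
```

The envelope matters as much as the validation: agents recover far better from a structured `{"ok": False, "error": ...}` than from a raw exception traceback.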

Note

Anthropic's Model Context Protocol (MCP), released in late 2024, changed this layer significantly. MCP standardizes how agents discover and call tools, which means API-layer startups now compete on execution quality and connector breadth rather than protocol design.
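MCP runs over JSON-RPC 2.0, with `tools/list` for discovery and `tools/call` for invocation. A sketch of those request shapes as plain dicts (a real client sends them over stdio or HTTP per the spec; `send_email` is a hypothetical tool):

```python
# Shape of MCP's tool discovery and invocation requests (JSON-RPC 2.0).
import json

list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}
call_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {"name": "send_email", "arguments": {"to": "a@example.com"}},
}
print(json.dumps(list_request))
```

Because every MCP server answers the same two methods, connector breadth becomes a catalog problem rather than a protocol one, which is exactly the shift described above.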

What the API layer gets right

The best API-layer startups solved the authentication problem. Before 2025, every agent framework re-invented OAuth flows per integration. If your agent needed to read Gmail and write to Notion, you built two separate auth pipelines. Composio and Arcade AI collapsed that into a single SDK call.

```python
# Example: Composio's approach to multi-service auth
from composio import ComposioToolSet

toolset = ComposioToolSet()
# One line gives the agent authenticated access to Gmail
tools = toolset.get_tools(actions=["GMAIL_SEND_EMAIL"])
```

The tradeoff is vendor lock-in. Your agent's capabilities become tied to the startup's connector library, and if they do not support a service you need, you are back to building it yourself.

Desktop Layer Startups

Desktop automation for AI agents is harder than API integration because GUIs were designed for humans, not programs. The startups in this layer are building the eyes and hands that let agents interact with desktop applications.

| Startup | OS Support | Approach | Latency per action |
|---|---|---|---|
| Anthropic (computer use) | Linux (primary), macOS, Windows | Screenshot + coordinate-based clicking | ~2-5s |
| Twin Labs | macOS, Windows | Accessibility tree + vision hybrid | ~1-3s |
| Induced AI | Windows, macOS | Browser and desktop RPA with LLM planning | ~1-2s |
| Fazm | macOS | Native accessibility APIs + app-specific integrations | ~200ms |
| Screenpi.pe | macOS, Linux | Continuous screen capture + OCR pipeline | Continuous (not action-based) |

The core architectural question for desktop agents: do you use screenshots (vision-based) or accessibility APIs (structured data)?

  • Vision-based (screenshots): works with any app, no integration needed, but slow (~2-5s latency) and fragile when the UI changes

  • Accessibility API-based: structured element data, fast (~200ms), reliable element targeting, but only works on apps that expose accessibility trees

  • Hybrid: combines both approaches, using accessibility APIs when available and falling back to vision when not

Most startups that launched in 2025 started with pure vision. By early 2026, the ones that survived had added accessibility API support because vision-only agents are too slow for production workloads.
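The hybrid pattern reduces to a two-step lookup. A minimal sketch, with stub backends standing in for real AT-SPI/AXUIElement queries and vision pipelines:

```python
# Hybrid element targeting: try the accessibility tree first, fall back
# to vision when the app exposes nothing. Both backends are stubs here;
# real ones would wrap AT-SPI/AXUIElement and an OCR or vision model.
def find_element(target, accessibility_backend, vision_backend):
    # Fast path: structured lookup (~200ms in practice)
    element = accessibility_backend(target)
    if element is not None:
        return {"source": "accessibility", "coords": element}
    # Slow path: screenshot + vision model (~2-5s in practice)
    return {"source": "vision", "coords": vision_backend(target)}
```

The fallback order is the whole design: the slow path only runs for the minority of apps (Electron apps with broken trees, games, custom canvases) that the fast path cannot see.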

Linux-Specific Infrastructure

Linux agent infrastructure lags behind macOS and Windows, but a cluster of startups and open source projects emerged in late 2025 to close the gap.

The Linux challenge is fragmentation. On macOS, every app speaks the same accessibility protocol (AXUIElement). On Linux, you have AT-SPI on GNOME, a different accessibility story on KDE, X11 for legacy window management, and Wayland compositors that each handle input simulation differently.

| Component | X11 | Wayland | What agents need |
|---|---|---|---|
| Screenshot | scrot/maim | grim (compositor-specific) | Consistent frame capture |
| Input simulation | xdotool | wtype/ydotool (requires permissions) | Click and type at coordinates |
| Window enumeration | wmctrl/xprop | wlr-foreign-toplevel (if compositor supports it) | Find and focus target windows |
| Accessibility tree | AT-SPI via D-Bus | AT-SPI via D-Bus (same) | Read UI element structure |

Warning

Wayland's security model intentionally blocks the input simulation patterns that agent infrastructure depends on. On X11, any process can send synthetic keyboard and mouse events to any window. On Wayland, you need compositor-specific protocols or elevated permissions. This is a security feature, not a bug, but it makes building Linux desktop agents significantly harder.
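In practice a Linux agent has to detect which display server it is under before it can simulate input at all. A minimal sketch of that dispatch, assuming the tool names from the table above (availability varies by distro, and ydotool additionally needs its daemon and uinput permissions):

```python
# Pick an input-simulation tool based on the session type. Returns None
# if no suitable tool is installed.
import os
import shutil

def pick_input_tool():
    session = os.environ.get("XDG_SESSION_TYPE", "")
    if session == "wayland":
        # ydotool works on Wayland but needs a running ydotoold daemon
        # and access to /dev/uinput; wtype only types, it cannot click
        candidates = ["ydotool", "wtype"]
    else:
        # X11 (or unknown): xdotool can click, type, and manage windows
        candidates = ["xdotool"]
    for tool in candidates:
        if shutil.which(tool):
            return tool
    return None
```

This kind of runtime dispatch is exactly the fragmentation cost described above: the same agent binary needs different code paths, and different permission setups, per compositor.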

Sandboxed execution environments

One area where Linux infrastructure leads: sandboxed agent execution. Because Linux containers are a mature technology, several startups built agent sandboxes on top of them.

```python
# E2B's approach: spin up a sandboxed VM per agent session
# Each agent gets its own filesystem, network, and process space
from e2b import Sandbox

sandbox = Sandbox()
result = sandbox.run_code("import os; print(os.listdir('/'))")
# Agent cannot escape the sandbox, cannot access host filesystem
```

E2B (founded 2023, raised Series A in 2025) provides cloud sandboxes specifically for AI agent code execution. Modal and Fly.io serve adjacent use cases but are not agent-specific. The key insight: agents that execute arbitrary code need isolation that is stronger than a Docker container but lighter than a full VM. E2B uses Firecracker microVMs for this.

Orchestration Frameworks

The orchestration layer ties API calls, desktop actions, and system operations into coherent agent workflows. This is where the most open source activity happened in 2025.

| Framework | Language | Key differentiator | GitHub stars (Apr 2026) |
|---|---|---|---|
| LangGraph (LangChain) | Python/JS | Graph-based agent workflows with persistence | ~15k |
| CrewAI | Python | Multi-agent role-based collaboration | ~25k |
| AutoGen (Microsoft) | Python | Multi-agent conversation patterns | ~40k |
| Pydantic AI | Python | Type-safe agent framework with dependency injection | ~8k |
| Claude Agent SDK | Python | Anthropic's official agent framework with tool use | ~3k |
| Mastra | TypeScript | Agent framework with built-in memory and RAG | ~10k |

Most orchestration frameworks are open source, not venture-backed startups. The ones that did raise money (LangChain raised $25M+ in 2024, CrewAI raised in 2025) monetize through hosted platforms rather than the framework itself.

The build vs. buy decision

The practical question for anyone building agents in 2026: do you use a framework or build from scratch?

Build vs. buy decision tree: if you need multi-agent coordination, reach for CrewAI, AutoGen, or LangGraph; if not, use a raw SDK with tool calling (less overhead).

If your agent does one thing well (answer questions from docs, process invoices, write code), you probably do not need a framework. A direct API call to Claude or GPT with a few tools attached is simpler and faster. Frameworks add value when you need persistent state, multi-step planning, or coordination between multiple agents.
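The "raw SDK" path is just a loop. A framework-free sketch, with `model` standing in for any chat API that can emit tool calls (the message shapes here are illustrative, not any provider's wire format):

```python
# Minimal no-framework agent loop: one model, a dict of tools, and a
# loop that executes tool calls until the model returns a final answer.
def run_agent(model, tools, user_message, max_steps=10):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        reply = model(messages)
        if reply["type"] == "final":
            return reply["content"]
        # Execute the requested tool and feed the result back to the model
        result = tools[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "tool": reply["tool"], "content": result})
    raise RuntimeError("agent did not finish within max_steps")
```

Everything a framework adds (persistence, retries, multi-agent routing) is layered on top of this loop; if you do not need those layers, you do not need the framework.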

What Changed Between 2025 and 2026

The agent infrastructure landscape shifted in several concrete ways over the past year.

Late 2024 to mid-2025: The "wrapper" era. Most agent startups were thin wrappers around GPT-4 or Claude with a UI on top. Infrastructure was an afterthought. The running joke in YC S24 was that every other company was "ChatGPT but for X."

Mid-2025 to late 2025: Infrastructure startups emerged. The wrapper companies that survived realized they needed real infrastructure: auth management, tool execution, sandboxing, observability. Composio, E2B, and Arcade AI grew fastest during this period.

Late 2025 to early 2026: The protocol wars. Anthropic released MCP, OpenAI pushed its own tool-calling conventions, and Google launched Gemini's function calling format. Startups had to pick which protocols to support, and many chose to support all three, becoming translation layers.

Early 2026: Consolidation. Several API-layer startups merged or were acqui-hired. Desktop agent startups narrowed their OS focus. The survivors have real revenue and production deployments.

Common Pitfalls

  • Betting on one LLM provider. Startups that built exclusively for OpenAI's function calling format in 2025 scrambled when their customers wanted Claude or Gemini support. Build provider-agnostic from day one.

  • Ignoring latency. A desktop agent that takes 5 seconds per action is a demo. One that takes 200ms per action is a product. The gap between these two is entirely infrastructure: accessibility APIs vs. screenshots, local inference vs. cloud roundtrips, precomputed UI maps vs. real-time parsing.

  • Over-engineering multi-agent systems. Most production agent deployments in 2026 use a single agent with multiple tools, not a fleet of specialized agents talking to each other. Multi-agent architectures add coordination overhead that is rarely justified unless your problem genuinely requires parallel execution or conflicting perspectives.

  • Neglecting Linux. If your agent infrastructure only runs on macOS and Windows, you are cutting out the developer and DevOps audience that is most likely to adopt AI agents for automation. Linux support is table stakes for infrastructure-layer startups.
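Provider-agnostic tool definitions are mostly a translation problem. A sketch that keeps one canonical schema of our own design and emits the OpenAI-style and Anthropic-style tool formats (field names per each provider's tool-calling docs as of early 2026; verify against current versions):

```python
# Translate one canonical tool definition into provider wire formats.
def to_openai(tool):
    # OpenAI: nested under "function", schema key is "parameters"
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["schema"],
        },
    }

def to_anthropic(tool):
    # Anthropic: flat object, schema key is "input_schema"
    return {
        "name": tool["name"],
        "description": tool["description"],
        "input_schema": tool["schema"],
    }
```

Keeping the canonical form in your own code, and translating at the API boundary, is what lets you swap providers without rewriting every tool.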

A Minimal Agent Infrastructure Stack

If you are building an AI agent today and want to use the best available startup infrastructure, here is what a practical stack looks like:

```python
# 1. LLM provider (pick one, abstract the interface)
import anthropic
client = anthropic.Anthropic()

# 2. Tool execution (Composio for managed integrations)
from composio import ComposioToolSet
toolset = ComposioToolSet()

# 3. Sandboxed code execution (E2B for untrusted code)
from e2b import Sandbox

# 4. Desktop automation (platform-specific)
# macOS: Fazm or native accessibility APIs
# Linux: AT-SPI + ydotool
# Windows: UI Automation COM interface

# 5. Observability (pick one)
# LangSmith, Braintrust, or plain structured logging
```

The total cost for this stack at low volume: LLM API costs (variable, typically $0.01-0.10 per agent action) plus E2B sandbox time (~$0.001 per second) plus Composio's free tier for up to 1,000 tool calls per month. You can build a production agent for under $50/month in infrastructure costs at early-stage volumes.
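The arithmetic is easy to reproduce. A back-of-envelope estimator using the illustrative figures from this post (not vendor pricing quotes):

```python
# Rough monthly infrastructure cost for a low-volume agent:
# LLM spend per action plus metered sandbox time.
def monthly_cost(actions, cost_per_action=0.05, sandbox_seconds=0,
                 sandbox_rate=0.001):
    return actions * cost_per_action + sandbox_seconds * sandbox_rate
```

At 500 agent actions and roughly three hours of sandbox time a month, this lands around $35, comfortably under the $50 figure above, with tool calls covered by a free tier.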

Wrapping Up

The AI agent infrastructure space in 2025 and 2026 split into clear layers: API integration, desktop automation, Linux/OS sandboxing, and orchestration. The startups that survived the wrapper era are the ones solving real infrastructure problems like authentication, latency, and isolation. If you are building agents today, the infrastructure is finally mature enough that you do not have to build everything from scratch.

Fazm is an open source macOS AI agent, available on GitHub.
