Agent Frameworks Are Infrastructure Now

Outcome focus: Reframed agent-framework selection away from tier lists and toward an operating contract for which primitives the framework owns, which ones the team must own, and where bespoke orchestration is still justified.

The agent framework tier list is the wrong artifact.

It feels useful because the market is noisy. LangGraph, CrewAI, OpenAI Agents SDK, Claude Agent SDK, Google ADK, Semantic Kernel, Pydantic AI, LlamaIndex, Mastra, Agno, AutoGen descendants, hosted agent platforms, MCP servers, observability stacks, eval harnesses, and workflow engines are all competing for the same architecture diagram.

But ranking them hides the decision that matters.

The question is not "which framework is best?" The question is which infrastructure primitives you are willing to outsource, which ones you still need to own, and which failure mode you can tolerate when the agent moves from demo to production.

I learned this the uncomfortable way on an internal orchestration prototype. The first version had a hand-rolled loop, bespoke tool adapters, a JSON file for state, and logs that looked fine until a tool retried after a partial write. It was not a research problem. It was missing infrastructure: tool contracts, durable state, approval gates, trace structure, recovery semantics, and a way to replay failures.

By 2026, the ecosystem has shifted far enough that rebuilding all of those pieces from scratch is usually a bad default.

The Primitive Map

Use this map before picking a framework:

Primitive	What it owns	Questions to ask
Agent loop	model turns, tool calls, stop conditions	Can I interrupt, resume, inspect, and bound the loop?
Tool contract	schemas, adapters, tool discovery	Does it support MCP, function tools, auth, and tool budgets?
Control flow	graph, workflow, routing, retries	Is state explicit or buried in chat history?
Persistence	checkpoints, sessions, memory, artifacts	What survives process death or provider timeout?
Delegation	handoffs, subagents, crews, managers	Can I trace child work back to the parent decision?
Human gates	approvals, edits, review, escalation	Can a person inspect state before the action fires?
Observability	traces, spans, costs, token usage	Can I debug one run and monitor production drift?
Evaluation	datasets, regressions, judges, reports	Can I block release when behavior regresses?
Deployment	runtime, permissions, secrets, scaling	Can I operate this like software, not a notebook?

If the framework does not own a primitive, your application owns it.

That is not automatically bad. It is only bad when the team assumes the framework covers a boundary that it actually leaves to application code.
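One way to make that boundary visible is to write the ownership split down as data. The sketch below is a minimal illustration, not any framework's API: the primitive names mirror the map above, and `Owner`, `ownership_map`, `application_owned`, and the example claim set are all hypothetical.

```python
from enum import Enum


class Owner(Enum):
    FRAMEWORK = "framework"
    APPLICATION = "application"


# The nine primitives from the map above, as hypothetical identifiers.
PRIMITIVES = (
    "agent_loop", "tool_contract", "control_flow", "persistence",
    "delegation", "human_gates", "observability", "evaluation", "deployment",
)


def ownership_map(framework_owned: set[str]) -> dict[str, Owner]:
    """Assign every primitive an owner; anything unclaimed falls to the app."""
    unknown = framework_owned - set(PRIMITIVES)
    if unknown:
        raise ValueError(f"unknown primitives: {sorted(unknown)}")
    return {
        name: Owner.FRAMEWORK if name in framework_owned else Owner.APPLICATION
        for name in PRIMITIVES
    }


def application_owned(framework_owned: set[str]) -> list[str]:
    """List the primitives the team must build, test, and operate itself."""
    owners = ownership_map(framework_owned)
    return [name for name, owner in owners.items() if owner is Owner.APPLICATION]


# Hypothetical claim for an unnamed framework, for illustration only.
claimed = {"agent_loop", "tool_contract", "control_flow", "persistence"}
remaining = application_owned(claimed)
```

Here `remaining` is the list the team signs up for. The point of the exercise is that no primitive can end up with no owner at all.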

The Framework Philosophies

The useful taxonomy is by philosophy, not brand.

Philosophy	Examples	Best fit	Risk
Graph-centric orchestration	LangGraph, Google ADK graph workflows, Pydantic Graph	long-running, stateful workflows with explicit control flow	graph complexity becomes its own product
Role and team orchestration	CrewAI, Claude subagents, manager-style SDK patterns	work that decomposes into specialist roles	role theater without hard output contracts
Provider-native runtime	OpenAI Agents SDK, Claude Agent SDK	deep model/tool integration, fast adoption, hosted features	coupling to one provider's agent model
Enterprise middleware	Semantic Kernel, Google ADK	integrating agents with existing app platforms and enterprise APIs	middleware abstraction hides agent-specific state
Type-safe Python agent layer	Pydantic AI	typed outputs, dependency injection, evals, Python service teams	typed happy path can still miss runtime policy
Retrieval-native agent layer	LlamaIndex	agents over documents, indexes, retrieval tools, data systems	retrieval framework becomes application framework by accident
Product/app framework	Mastra, Agno	TypeScript product surfaces or own-agent-platform use cases	platform ambition before workload clarity

This is why "use LangGraph" and "use CrewAI" are incomplete answers.

LangGraph's own docs describe it as low-level orchestration for long-running, stateful agents, with durable execution, streaming, human-in-the-loop, and persistence. That is a very different promise from CrewAI's positioning around collaborative agents, crews, flows, guardrails, memory, knowledge, and observability. OpenAI's Agents SDK emphasizes a small set of primitives: agents, handoffs, guardrails, function tools, MCP integration, sessions, and tracing. Claude's Agent SDK exposes Claude Code as a library, with built-in file, shell, code-search, and web tools, plus hooks, subagents, MCP, permissions, sessions, checkpointing, and OpenTelemetry support.

Those are not interchangeable packages with different logos.

They are different answers to where orchestration should live.

MCP Is a Connector Standard, Not a Safety Boundary

Model Context Protocol matters because it standardizes how AI applications connect to external tools, data sources, and workflows. The "USB-C port for AI applications" metaphor is useful as long as it does not get stretched into a security claim.

MCP reduces bespoke wiring. It does not decide whether the agent should be allowed to call delete_customer, whether the user identity should propagate to the database, how many tool calls a run can spend, which tool result is too stale to trust, or how a failed partial action should roll back.

That gap shows up in the research literature too. A 2026 paper on deploying agents with MCP argues that production systems need mechanisms beyond the protocol: identity-scoped routing, adaptive timeouts, structured error recovery, and observability. Another study of MCP tool descriptions found that the descriptions themselves can mislead agents, and that richer descriptions improve some outcomes while increasing step count and cost.
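The decisions MCP leaves open can sit in a thin policy layer in front of every tool call. The following is a minimal sketch under stated assumptions: `ToolPolicy`, `PolicyError`, `outcome`, the role names, and the limits are hypothetical and not part of MCP or any SDK. It checks only three things: who is calling, whether a human approved, and whether the run still has budget.

```python
from dataclasses import dataclass, field


class PolicyError(Exception):
    """Raised when a tool call is rejected before it reaches the connector."""


@dataclass
class ToolPolicy:
    # All names and limits here are hypothetical illustrations.
    allowed_tools: dict[str, set[str]]  # role -> tools that role may call
    approval_required: set[str] = field(default_factory=set)
    max_calls: int = 40
    calls_used: int = 0

    def authorize(self, role: str, tool: str, approved: bool = False) -> None:
        """Check identity, approval, and budget; count the call if it passes."""
        if tool not in self.allowed_tools.get(role, set()):
            raise PolicyError(f"{role!r} may not call {tool!r}")
        if tool in self.approval_required and not approved:
            raise PolicyError(f"{tool!r} needs human approval")
        if self.calls_used >= self.max_calls:
            raise PolicyError("tool budget exhausted")
        self.calls_used += 1


def outcome(policy: ToolPolicy, role: str, tool: str, approved: bool = False) -> str:
    """Return 'allowed' or the rejection reason, for demonstration only."""
    try:
        policy.authorize(role, tool, approved)
        return "allowed"
    except PolicyError as exc:
        return str(exc)


policy = ToolPolicy(
    allowed_tools={
        "adjuster": {"policy_db.lookup", "delete_customer"},
        "intake_agent": {"policy_db.lookup"},
    },
    approval_required={"delete_customer"},
    max_calls=2,
)
results = [
    outcome(policy, "intake_agent", "delete_customer"),    # wrong identity
    outcome(policy, "adjuster", "delete_customer"),        # no approval yet
    outcome(policy, "adjuster", "delete_customer", True),  # approved
    outcome(policy, "intake_agent", "policy_db.lookup"),   # within budget
    outcome(policy, "intake_agent", "policy_db.lookup"),   # budget exhausted
]
```

Rejected calls do not consume budget in this sketch. Staleness checks and rollback are deliberately left out; they need knowledge of the downstream system that a generic gate does not have.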

Observability Is Part of the Runtime

Basic application monitoring is not enough once agents start choosing tools.

You need traces that preserve the causal chain: prompt, model, tool schema, selected tool, arguments, result, retries, cost, latency, approval, state transition, and final output. OpenTelemetry's semantic conventions now include generative AI and MCP-related areas. Agent observability products have converged on the same idea from different angles: LangSmith focuses on traces, evaluation, prompt engineering, and deployment across frameworks; Langfuse emphasizes request-level tracing with prompts, responses, token usage, latency, tools, retrieval steps, and cost; Phoenix combines tracing, evaluation, prompt iteration, datasets, and experiments on top of OpenTelemetry and OpenInference instrumentation.

The runtime decision and the observability decision are now coupled.

If your framework emits useful spans, preserves tool arguments, and supports replay or datasets, you can debug behavior. If it gives you only text logs, you will rebuild trace semantics later during the first incident.
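To make the difference between a span and a text log concrete, here is a small standard-library sketch of a structured tool-call record. It is not the OpenTelemetry API: `ToolSpan`, `hash_args`, and `traced_call` are hypothetical names, and the field names are borrowed from the `span_fields` list in the selection contract later in this post. Hashing the arguments is one possible choice; a team that needs replay would store the raw arguments in a redacted form instead.

```python
import hashlib
import json
import time
from dataclasses import asdict, dataclass
from typing import Any, Callable, Optional


@dataclass
class ToolSpan:
    run_id: str
    parent_span_id: Optional[str]  # links child work back to the parent decision
    model: str
    prompt_version: str
    tool_name: str
    tool_args_hash: str
    tool_latency_ms: float
    status: str


def hash_args(args: dict[str, Any]) -> str:
    """Stable short hash of tool arguments, independent of key order."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def traced_call(
    tool: Callable[..., Any],
    tool_name: str,
    args: dict[str, Any],
    *,
    run_id: str,
    model: str,
    prompt_version: str,
    sink: list,
    parent_span_id: Optional[str] = None,
) -> Any:
    """Call the tool and append one structured span to sink, even on failure."""
    start = time.perf_counter()
    status = "ok"
    try:
        return tool(**args)
    except Exception:
        status = "error"
        raise
    finally:
        sink.append(asdict(ToolSpan(
            run_id=run_id,
            parent_span_id=parent_span_id,
            model=model,
            prompt_version=prompt_version,
            tool_name=tool_name,
            tool_args_hash=hash_args(args),
            tool_latency_ms=(time.perf_counter() - start) * 1000.0,
            status=status,
        )))


def lookup(policy_id: str) -> dict:
    """Stand-in tool for the demonstration."""
    return {"policy_id": policy_id, "active": True}


spans: list = []
result = traced_call(
    lookup, "policy_db.lookup", {"policy_id": "P-1"},
    run_id="run-1", model="example-model", prompt_version="v3", sink=spans,
)
```

Each entry in `spans` is a plain dictionary that can be queried by `run_id` or `tool_name`, which is the property a text log lacks.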

Provider-Native SDKs Are Not Merely Lock-In

Provider-native frameworks are easy to dismiss as lock-in. Sometimes that is fair. Sometimes the integration is the product.

OpenAI's Agents SDK includes the managed loop, function tools, MCP server tool calling, sessions, human-in-the-loop, tracing, guardrails, realtime agents, and sandbox agents. Claude's Agent SDK ships with code and workspace tools, hooks, subagents, permissions, MCP, sessions, checkpointing, and OpenTelemetry. Google ADK presents itself as an open-source framework for production agents across Python, TypeScript, Go, Java, and Kotlin, with graph workflows, multi-agent workflows, MCP, A2A, deployment, observability, evaluation, and safety sections.

If your agent mostly operates inside one provider's tool and model ecosystem, native SDKs can remove a lot of scaffolding. The tradeoff is portability and policy ownership. You still need to define what can be done, by whom, under which identity, with which audit record, and under which rollback rule.

The Build-Versus-Adopt Decision

I would only build bespoke orchestration when one of these is true:

the domain has unusual state semantics that a general framework fights;
compliance requires a custom audit, identity, or approval model;
latency or cost constraints require a very narrow loop;
the agent is embedded in an existing workflow runtime that already owns durability;
the team is building an agent platform, not one agent.

Otherwise, adopt the nearest mature primitive and spend engineering effort on policy, evals, and product boundaries.

That does not mean surrendering architecture to the framework. It means refusing to rebuild infrastructure that has become commodity.

A Selection Contract

Before choosing a framework, write this down:

agent-framework-selection.yaml

workflow:
  name: "claims intake document analyst"
  user_visible_decision: "prepare claim summary for adjuster review"
  max_run_minutes: 15
  human_approval_required_for:
    - "customer-facing messages"
    - "claim status mutation"
    - "payment recommendation"
 
tooling:
  connector_standard: "MCP where available"
  required_tools:
    - "document_store.search"
    - "policy_db.lookup"
    - "claim_system.create_note"
  tool_auth_model: "user-scoped service token"
  tool_budget:
    max_calls: 40
    max_wall_seconds: 600
 
state:
  needs_checkpointing: true
  resumable_after_worker_restart: true
  state_owner: "workflow runtime"
  artifact_store: "object storage with run id"
 
orchestration:
  needs_graph_control_flow: true
  needs_multi_agent_roles: false
  needs_sandboxed_workspace: false
  needs_realtime_voice: false
 
observability:
  trace_required: true
  span_fields:
    - "model"
    - "prompt_version"
    - "tool_name"
    - "tool_args_hash"
    - "tool_latency_ms"
    - "approval_decision"
    - "cost_usd"
  eval_gate: "claims-golden-v3"
 
deployment:
  secrets_boundary: "runtime-managed"
  rollback: "disable agent route and fall back to manual review"

Once that contract exists, the framework choice becomes less mystical.

Need explicit graph control flow and checkpointed state? LangGraph or ADK moves up. Need role-based collaboration around business tasks? CrewAI might fit. Need coding/workspace agents with built-in file and shell tools? Claude Agent SDK or OpenAI sandbox agents deserve a look. Need typed Python outputs, dependency injection, and evals close to service code? Pydantic AI is attractive. Need retrieval-heavy workflows over indexes and query engines? LlamaIndex belongs in the conversation. Need TypeScript product embedding with app workflows? Mastra may be more natural than a Python-first stack. Need .NET and enterprise plugin integration? Semantic Kernel is not just another Python package.
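The mapping in that paragraph can be written as a small rule table. This is a sketch of the reasoning, not a recommendation engine. It assumes the contract's boolean flags have been flattened into one dictionary. The last four flag names (`needs_typed_python_outputs`, `needs_retrieval_over_indexes`, `needs_typescript_embedding`, `needs_dotnet_integration`) are hypothetical additions that do not appear in the contract above.

```python
# Each rule pairs the flags that must all be true with the candidates
# the paragraph above associates with them.
RULES = [
    (("needs_graph_control_flow", "needs_checkpointing"), ["LangGraph", "Google ADK"]),
    (("needs_multi_agent_roles",), ["CrewAI"]),
    (("needs_sandboxed_workspace",), ["Claude Agent SDK", "OpenAI Agents SDK"]),
    # Hypothetical flags below; not part of the contract as written.
    (("needs_typed_python_outputs",), ["Pydantic AI"]),
    (("needs_retrieval_over_indexes",), ["LlamaIndex"]),
    (("needs_typescript_embedding",), ["Mastra"]),
    (("needs_dotnet_integration",), ["Semantic Kernel"]),
]


def shortlist(needs: dict[str, bool]) -> list[str]:
    """Return candidate frameworks whose rule flags are all set, in rule order."""
    candidates: list[str] = []
    for flags, frameworks in RULES:
        if all(needs.get(flag, False) for flag in flags):
            for name in frameworks:
                if name not in candidates:
                    candidates.append(name)
    return candidates


# Flags taken from the claims-intake contract above, flattened into one dict.
claims_intake = {
    "needs_checkpointing": True,
    "needs_graph_control_flow": True,
    "needs_multi_agent_roles": False,
    "needs_sandboxed_workspace": False,
    "needs_realtime_voice": False,
}
candidates = shortlist(claims_intake)
```

For the claims-intake contract, this yields LangGraph and Google ADK as starting points. A shortlist only narrows the field; the primitives each candidate leaves to application code still have to be checked by hand.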

The Failure I Would Design Against

The production failure is not "the agent gave a weird answer." That is the easy failure.

The harder failure is an agent that:

calls the right tool under the wrong identity;
retries a mutation after a timeout and duplicates work;
loses state after a worker restart;
cannot explain why a subagent made a recommendation;
streams a tool result into a prompt without redaction;
has no dataset proving that a prompt change preserved the old behavior;
logs the final answer but not the tool path that produced it.
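The duplicate-mutation item in that list is one of the few with a well-known mitigation: derive a stable idempotency key for each step and let the downstream system deduplicate on it. The sketch below assumes the downstream service honors such keys. `FakeClaimSystem` is a stand-in written for this example, and `idempotency_key` and the run and step identifiers are hypothetical.

```python
import hashlib
import json


class FakeClaimSystem:
    """Stand-in for a downstream service that honors idempotency keys."""

    def __init__(self) -> None:
        self.notes: list[str] = []
        self._seen: dict[str, int] = {}  # idempotency key -> note id

    def create_note(self, text: str, idempotency_key: str) -> int:
        """Create a note once per key; repeat calls return the original id."""
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        self.notes.append(text)
        note_id = len(self.notes)
        self._seen[idempotency_key] = note_id
        return note_id


def idempotency_key(run_id: str, step: int, tool_name: str, args: dict) -> str:
    """Derive the same key for the same step of the same run, on every retry."""
    payload = json.dumps([run_id, step, tool_name, args], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


system = FakeClaimSystem()
args = {"text": "Summary ready for adjuster review"}
key = idempotency_key("run-42", 3, "claim_system.create_note", args)

first = system.create_note(**args, idempotency_key=key)
# Simulate a lost response: the agent times out and retries the same step.
retry = system.create_note(**args, idempotency_key=key)
```

The retry returns the original note id and writes nothing new. The key has to come from run state that survives a restart; a key regenerated from a random value on each attempt protects nothing.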

That is why agent frameworks are infrastructure now. The framework is not a wrapper around prompts. It is a decision about state, tools, permissions, traces, evaluation, and recovery.

Choose the smallest framework that owns the primitives you do not want to rebuild. Then be very explicit about the primitives it does not own.