
    Building AI Agents for Production: A Practical Guide with LangGraph, CrewAI, and Claude

    Agentic AI · Cloudmess Team · 10 min read · February 1, 2026

    Why Most AI Agent Demos Fail in Production

    Building an AI agent demo that handles the happy path takes a day. Building an agent that handles the messy reality of production takes months. The gap comes from four categories of failure that demos never encounter.

    First, reliability: agents make non-deterministic decisions. The same input can produce different tool-calling sequences, leading to inconsistent outcomes. In a demo you retry until it works. In production, you need guarantees.

    Second, error handling: external tools fail, APIs time out, rate limits trigger, and data arrives malformed. A demo agent crashes. A production agent needs graceful degradation, retries with exponential backoff, and fallback strategies.

    Third, cost control: an agent that enters a recursive reasoning cycle can burn through thousands of dollars in API calls in minutes. Demos do not have budgets. Production agents need token budgets, iteration limits, and circuit breakers.

    Fourth, observability: when an agent gives a wrong answer or takes an unexpected action, you need to trace exactly what happened. Which tools did it call? What data did it receive? What reasoning led to the decision? Without this traceability, debugging agents is nearly impossible.
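    The "retries with exponential backoff" piece is worth making concrete. Here is a minimal, framework-agnostic sketch; the helper name `with_backoff` and its defaults are illustrative, not from any particular library:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    retryable: tuple = (TimeoutError, ConnectionError),
) -> T:
    """Retry fn on transient errors with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the error to a fallback path
            # delay grows as base_delay * 2^(attempt-1), randomized to avoid
            # synchronized retry storms against a recovering service
            time.sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
    raise RuntimeError("unreachable")
```

    Non-retryable errors (authentication failures, validation errors) fall through immediately, which is the behavior you want: retrying them only wastes budget.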

    Choosing an Agent Framework: LangGraph vs CrewAI vs Claude Tool Use

    The framework choice depends on your agent's complexity and your team's needs.

    For single-agent workflows with tool use (the most common pattern), Claude's native tool use API is the simplest option. You define tools as JSON schemas, send them in the API request, and Claude returns tool_use content blocks when it wants to invoke a tool. Your application executes the tool and sends the result back. This loop continues until Claude responds with text instead of a tool call. It requires no framework, just a loop and some JSON parsing.

    For complex multi-step workflows with branching logic, human-in-the-loop approvals, and persistent state, LangGraph (from LangChain) is our recommendation. LangGraph models agents as state machines with nodes (functions that transform state) and edges (conditional transitions between nodes). You define the graph explicitly, which means you control the execution flow. A typical LangGraph agent has nodes for reasoning (the LLM decides what to do), tool execution (run the selected tool), validation (check the tool output), and human review (pause for approval on high-stakes actions). The graph is compiled and runs with checkpointing, so you can resume from any state after a failure.

    For multi-agent systems where specialized agents collaborate (for example, a research agent, a coding agent, and a review agent working together), CrewAI provides a higher-level abstraction. You define agents with roles, goals, and available tools, then define tasks and assign them to agents. CrewAI handles delegation and communication between agents. It is faster to prototype but gives you less control over the execution flow than LangGraph.
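    The tool-use loop described above fits in a short function. This sketch keeps the model call behind an injected `create_message` callable (in production it would wrap the Anthropic SDK's `client.messages.create`); the dict shapes mirror the API's tool_use and tool_result content blocks, and the `get_weather` tool is a made-up example:

```python
import json
from typing import Any, Callable

# Hypothetical tool registry: tool name -> Python callable.
TOOLS: dict[str, Callable[..., Any]] = {
    "get_weather": lambda city: {"city": city, "temp_c": 18},
}

def run_agent_loop(create_message: Callable[[list], dict], user_input: str) -> str:
    """Drive the tool-use loop: call the model, execute any requested tools,
    feed the results back, and stop when the model answers with plain text."""
    messages = [{"role": "user", "content": user_input}]
    while True:
        response = create_message(messages)
        if response["stop_reason"] != "tool_use":
            # Model answered with text instead of a tool call: we are done.
            return next(b["text"] for b in response["content"] if b["type"] == "text")
        # Echo the assistant turn, then answer each tool_use block with a
        # tool_result block carrying the matching tool_use_id.
        messages.append({"role": "assistant", "content": response["content"]})
        results = []
        for block in response["content"]:
            if block["type"] == "tool_use":
                output = TOOLS[block["name"]](**block["input"])
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block["id"],
                    "content": json.dumps(output),
                })
        messages.append({"role": "user", "content": results})
```

    Injecting `create_message` also makes the loop testable with a stubbed model, which matters once you build an evaluation suite around it.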

    Production Patterns: State Management, Error Handling, and Guardrails

    Three patterns are essential for production agents.

    Pattern 1, persistent state with checkpointing: use LangGraph's SqliteSaver or PostgresSaver to checkpoint agent state after every step. If the agent crashes mid-execution (Lambda timeout, container restart, API error), it resumes from the last checkpoint instead of starting over. For our ECS-deployed agents, we use PostgresSaver with an RDS instance, giving us both durability and the ability to query agent execution history.

    Pattern 2, structured error handling: wrap every tool call in a try/except block that catches specific exceptions (timeout, rate limit, validation error, authentication failure) and returns structured error messages to the LLM. Do not let raw stack traces reach the model. Instead, return a clear message like 'The database query timed out after 30 seconds. The table may be under heavy load. Try a simpler query or wait and retry.' This gives the model enough context to adjust its strategy. Set a max_iterations limit (we use 15 for most agents) and a max_tokens_budget (100,000 tokens per execution) to prevent infinite loops. When either limit is reached, the agent returns its best partial result with an explanation of what it could not complete.

    Pattern 3, guardrails and approval gates: for agents that take actions (not just read data), implement an approval gate for high-risk operations. In LangGraph, this is a conditional edge that routes to a human_review node when the proposed action matches a risk criterion (for example, any database write, any external API call that modifies state, any action involving PII). The human_review node pauses execution, sends a notification to Slack with the proposed action and reasoning, and waits for approval before proceeding. We also implement output guardrails using Bedrock Guardrails or custom validation functions that check agent responses for PII leakage, off-topic content, and formatting compliance before returning them to the user.
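    Pattern 2 can be sketched as a small budget tracker plus a tool wrapper that converts exceptions into messages the model can act on. The class and function names here are illustrative, not part of any framework:

```python
from dataclasses import dataclass

@dataclass
class Budget:
    """Tracks iteration and token limits for one agent execution."""
    max_iterations: int = 15
    max_tokens: int = 100_000
    iterations: int = 0
    tokens: int = 0

    def charge(self, tokens: int) -> bool:
        """Record one step; return False once either limit is exhausted,
        signaling the agent to return its best partial result."""
        self.iterations += 1
        self.tokens += tokens
        return self.iterations <= self.max_iterations and self.tokens <= self.max_tokens

def safe_tool_call(fn, *args, **kwargs) -> dict:
    """Run a tool, converting failures into structured, actionable messages
    instead of letting raw stack traces reach the model."""
    try:
        return {"ok": True, "result": fn(*args, **kwargs)}
    except TimeoutError:
        return {"ok": False, "error": "The call timed out. Try a simpler query or wait and retry."}
    except PermissionError:
        return {"ok": False, "error": "Authentication failed. Do not retry; report this to the user."}
    except ValueError as e:
        return {"ok": False, "error": f"Invalid input: {e}. Adjust the arguments and retry."}
```

    The key design point is that each error message tells the model what to do next (retry, simplify, or stop), not just what went wrong.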

    Observability and Evaluation for Agents

    Agent observability requires tracing at a higher granularity than typical LLM applications because a single agent execution may involve 5 to 20 LLM calls and 10 to 30 tool invocations. We use Langfuse as our primary tracing platform, with the following instrumentation strategy. Each agent execution gets a unique trace ID. Within the trace, each LLM call is logged as a generation span (capturing the full prompt, model response, token counts, and latency). Each tool invocation is logged as a span with the tool name, input arguments, output, and execution time. Conditional branching decisions are logged as events with the decision rationale. This produces a complete execution timeline that lets you replay any agent run step by step.

    For evaluation, we maintain a test suite of 50 to 100 scenarios per agent, each with a defined starting state, expected tool calls (in order), and expected final output. We run this suite on every code change using a GitHub Actions workflow that spins up a test environment, executes all scenarios, and compares results against baselines. Metrics we track include: task completion rate (did the agent achieve the goal?), tool call accuracy (did it call the right tools in a reasonable order?), cost per execution (total tokens consumed), and latency (end-to-end time from input to final response). We also track a 'divergence rate' metric: the percentage of executions where the agent's tool-calling sequence differs from the expected sequence by more than 2 steps. A rising divergence rate signals that a model update or prompt change is affecting agent behavior.

    For production monitoring, we alert on task completion rate dropping below 90%, average cost per execution exceeding 2x the 7-day average, and P99 latency exceeding 60 seconds.
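    One way to make the divergence rate precise is to measure the edit distance between the observed and expected tool-call sequences. This is a sketch of that interpretation (the function names and the choice of Levenshtein distance are ours, not from any tracing platform):

```python
def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two tool-call sequences: the minimum
    number of insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,        # delete x
                curr[-1] + 1,       # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

def divergence_rate(runs: list[list[str]], expected: list[str], threshold: int = 2) -> float:
    """Fraction of runs whose tool-call sequence differs from the expected
    sequence by more than `threshold` edit steps."""
    diverged = sum(1 for run in runs if edit_distance(run, expected) > threshold)
    return diverged / len(runs)
```

    Edit distance tolerates small reorderings and extra steps while still flagging runs that wandered far from the expected path.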

    Deployment Architecture for Production Agents

    We deploy production agents on ECS Fargate with the following architecture. The agent runtime runs as a long-lived ECS service (not Lambda, because agent executions regularly exceed 15 minutes and require persistent state). Each agent execution is an async task: the API Gateway receives the request, a Lambda function validates the input and enqueues a message to SQS, and the ECS agent worker picks up the message, executes the agent, and stores the result in DynamoDB. The client polls a status endpoint or receives a webhook callback when execution completes.

    For the LLM backend, we use Bedrock with Claude 3.5 Sonnet as the primary model for reasoning and tool selection, and Claude 3.5 Haiku for lower-stakes subtasks like summarization and formatting. Tools are implemented as MCP servers (see our MCP blog post) deployed as separate ECS services, which allows independent scaling and deployment. The agent connects to MCP servers via SSE transport over the internal ALB.

    Infrastructure is defined in Terraform with separate modules for the agent service, MCP servers, the SQS queue, and the DynamoDB table. We use Terraform workspaces for staging and production environments. The CI/CD pipeline in GitHub Actions runs the evaluation test suite against staging, and promotes to production only if all metrics pass.

    For a recent client, we deployed an SRE agent that handles on-call triage. It processes PagerDuty alerts, queries Prometheus and Kubernetes, and either auto-remediates known issues or provides a detailed analysis to the on-call engineer. The agent runs 24/7 on a single Fargate task (0.5 vCPU, 1GB memory, costing $32/month in compute) and handles an average of 15 incidents per day with a 78% auto-remediation rate for known issue types. The remaining 22% receive a structured analysis that reduces mean time to resolution by 40% compared to the engineer starting from scratch.