AI Observability: What to Monitor When LLMs Hit Production
Your APM Tool Is Blind to AI Failures
Datadog, New Relic, and Dynatrace will tell you if your LLM endpoint is returning 500 errors or if latency spikes. They will not tell you if the model started hallucinating, if response quality degraded after a prompt change, or if your RAG pipeline is retrieving irrelevant context. LLMs fail silently. They return HTTP 200 OK with a confident, factually wrong answer. Traditional monitoring treats that as a success because the request completed within the timeout and returned a valid JSON response. AI observability means monitoring the quality of the output, not just the health of the infrastructure. This requires a fundamentally different monitoring paradigm: one that evaluates semantic correctness, not just operational metrics.
The Four Pillars of LLM Monitoring
We monitor four dimensions for every production LLM system.

First, quality metrics: response relevance scored on a 1-to-5 scale by an LLM judge; factual accuracy measured as faithfulness to retrieved context (does the answer only use information present in the source documents?); and format compliance (does the JSON output match the expected schema? Does the response follow length constraints?).

Second, safety metrics: toxicity scores via Bedrock Guardrails or a custom classifier; PII detection rates using Amazon Comprehend or Presidio; guardrail trigger rates (what percentage of requests hit a content filter?); and prompt injection attempt detection.

Third, operational metrics: token usage broken down by input and output; latency by model and by feature (P50, P95, P99); cache hit rates for semantic caching (we use GPTCache or a custom Redis-based cache); error rates by error type (timeout, rate limit, content filter, model error); and cost per request.

Fourth, drift metrics: how input distributions and output patterns change over time, measured by tracking average input token length, topic clustering via embeddings, output token length distribution, and refusal rates. A sudden shift in average token count or topic distribution often signals a problem before users complain.
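The drift metrics reduce to simple aggregates over logged traces compared against a baseline. A minimal sketch, assuming trace records with hypothetical `input_tokens`, `output_tokens`, and `refused` fields (illustrative names, not a real Langfuse schema):

```python
from statistics import mean

# Hypothetical trace records, shaped the way a monitoring job might pull
# them from the trace store; field names are illustrative.
window = [
    {"input_tokens": 420,  "output_tokens": 130, "refused": False},
    {"input_tokens": 980,  "output_tokens": 40,  "refused": True},
    {"input_tokens": 450,  "output_tokens": 120, "refused": False},
    {"input_tokens": 1010, "output_tokens": 35,  "refused": True},
]

def drift_signals(traces):
    """Summarize a window of traces into the drift metrics we track."""
    return {
        "avg_input_tokens": mean(t["input_tokens"] for t in traces),
        "avg_output_tokens": mean(t["output_tokens"] for t in traces),
        "refusal_rate": sum(t["refused"] for t in traces) / len(traces),
    }

def drifted(current, baseline, ratio=1.3):
    """Return the metrics that moved more than `ratio`x above baseline."""
    return [k for k in baseline if current[k] > baseline[k] * ratio]

baseline = {"avg_input_tokens": 450, "avg_output_tokens": 120, "refusal_rate": 0.02}
flags = drifted(drift_signals(window), baseline)
print(flags)  # → ['avg_input_tokens', 'refusal_rate']
```

Topic clustering via embeddings works the same way conceptually: summarize a window, compare against a baseline, flag large shifts.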
The Stack We Use
Our typical LLM observability stack has four layers.

Layer 1, trace-level logging: Langfuse (self-hosted on ECS Fargate with PostgreSQL on RDS) or LangSmith for every LLM call. Each trace captures the full prompt template, variable substitutions, retrieved context chunks with their similarity scores, the raw model response, token counts, latency, and cost. Langfuse's Python SDK integrates with LangChain, LlamaIndex, and raw Bedrock/OpenAI calls via decorators.

Layer 2, automated evaluation: a scheduled ECS task runs every hour, sampling 5 to 10% of production traces. It runs three evaluators: an LLM-as-judge (Claude 3.5 Haiku scoring relevance and faithfulness on a 1-to-5 scale), a RAGAS-based evaluation for RAG-specific metrics (context precision, context recall, answer relevance), and custom regex/schema validators for structured-output compliance.

Layer 3, infrastructure metrics: Prometheus with Grafana dashboards tracking request rates, latency histograms, GPU utilization (for self-hosted models), and error rates. We use the OpenTelemetry SDK to instrument application code and export traces to both Langfuse (for LLM-specific tracing) and Grafana Tempo (for infrastructure tracing).

Layer 4, alerting: PagerDuty alerts trigger on quality scores dropping below a 3.5 average (evaluated over a 1-hour window), cost anomalies exceeding 150% of the 7-day rolling average, P99 latency exceeding 10 seconds, and error rates above 2%.
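The Layer 2 evaluation task can be sketched in a few lines. This is a hedged illustration, not our actual job: `judge_score` is a hypothetical stand-in for the Claude 3.5 Haiku judge call (here it reads a precomputed field so the sketch runs offline), while the sampling rate and alert threshold mirror the numbers above:

```python
import random

def judge_score(trace):
    # Hypothetical field; in production this is an LLM-as-judge call
    # returning a 1-to-5 relevance/faithfulness score.
    return trace["score"]

def hourly_eval(traces, sample_rate=0.10, quality_floor=3.5, seed=42):
    """Sample a fraction of traces, average the judge scores, and decide
    whether to page (average quality dropped below the floor)."""
    rng = random.Random(seed)
    n = max(1, int(len(traces) * sample_rate))
    sample = rng.sample(traces, n)
    avg = sum(judge_score(t) for t in sample) / n
    return {"avg_score": avg, "page": avg < quality_floor}

healthy = [{"score": 4} for _ in range(100)]
print(hourly_eval(healthy))   # avg 4.0, no page
degraded = [{"score": 2} for _ in range(100)]
print(hourly_eval(degraded))  # avg 2.0, pages
```

The real task also writes each score back to the trace store so the quality trend line is queryable in dashboards, not just in alerts.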
Cost Monitoring Is AI Observability
LLM costs can spike unexpectedly because they are directly proportional to token volume. A prompt change that increases average token count by 20% increases your Bedrock or OpenAI bill by 20%. A bug in your RAG pipeline that retrieves 15 chunks instead of 5 inflates context length and cost by 3x. A retry loop that resends the full conversation history doubles or triples token consumption. We set up per-feature and per-model cost tracking from day one using Langfuse's cost tracking feature. Every LLM call is tagged with the feature that triggered it (document_summarization, chat_support, code_review), the user tier (free, pro, enterprise), and the model used. This produces dashboards showing that 'document summarization' costs $1,200/month on Claude 3.5 Sonnet while 'chat support' costs $300/month on Claude 3.5 Haiku. We also track cost per user session, which helps product teams understand the unit economics of AI features. One client discovered their free-tier users were consuming 4x the tokens of paid users due to longer, less focused conversations, which informed their pricing strategy.
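The cost-anomaly alert itself is simple: compare the latest day against a rolling baseline. A minimal sketch, assuming daily per-feature costs have already been aggregated (the function name and data are illustrative):

```python
def cost_anomaly(daily_costs, window=7, threshold=1.5):
    """Flag the most recent day if it exceeds `threshold` times the
    rolling average of the preceding `window` days (oldest to newest)."""
    *history, today = daily_costs[-(window + 1):]
    baseline = sum(history) / len(history)
    return today > baseline * threshold

# Seven normal days for one feature, then a retry-loop style spike.
costs = [40.0, 42.0, 39.0, 41.0, 43.0, 40.0, 41.0, 95.0]
print(cost_anomaly(costs))  # True: 95 > 1.5 x ~40.9
```

Running this per feature and per model, rather than on the total bill, is what makes the alert actionable: a 2x spike in one small feature disappears inside an account-wide aggregate.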
What We Have Caught with This Approach
Real examples from production deployments.

Case 1: A RAG pipeline started retrieving outdated FAQ documents after a knowledge base re-indexing job changed the chunk boundaries. The RAGAS context relevance score dropped from 0.82 to 0.51 within 2 hours. Our hourly evaluation pipeline triggered a PagerDuty alert, and the team identified the root cause (a changed chunking parameter in the indexing config) within 30 minutes.

Case 2: A prompt template change in a PR accidentally removed the instruction 'respond only in valid JSON,' causing parse failures in the downstream application. The format compliance evaluator caught it in staging before it reached production.

Case 3: A token usage spike caused by a retry loop in the LangChain error handler. When the model returned an unparseable response, the retry logic resent the request with the full conversation history appended each time, quadrupling costs overnight. Our cost anomaly alert fired at 3am, and the on-call engineer disabled the feature flag within 15 minutes.

Case 4: A gradual drift in user queries toward a topic not covered by the knowledge base, causing the model to hallucinate answers with high confidence. The faithfulness score trend line (tracked weekly) showed a steady decline over 3 weeks, prompting the team to expand the knowledge base before users reported issues.

None of these would have been caught by traditional infrastructure monitoring.
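A format-compliance evaluator like the one that caught Case 2 can be as simple as parsing the response and checking required keys. A hedged sketch under that assumption (a production evaluator would typically use a full JSON Schema validator instead of a key list):

```python
import json

def check_format(raw_response, required_keys):
    """Return (passed, reason) for a model response that is expected to
    be a JSON object containing all of `required_keys`."""
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(parsed, dict):
        return False, "not a JSON object"
    missing = [k for k in required_keys if k not in parsed]
    return not missing, ("missing keys: %s" % missing if missing else "ok")

print(check_format('{"summary": "short", "sentiment": "neutral"}',
                   ["summary", "sentiment"]))  # (True, 'ok')
print(check_format('Sure! Here is your summary.',
                   ["summary", "sentiment"]))  # (False, 'not valid JSON')
```

Run in staging against a replayed sample of production prompts, a check like this turns a prompt-template regression into a failed pipeline instead of a production incident.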