AI Observability: What to Monitor When LLMs Hit Production
Your APM Tool Is Blind to AI Failures
Datadog, New Relic, and Dynatrace will tell you if your LLM endpoint is returning 500 errors or if latency spikes. They will not tell you if the model started hallucinating, if response quality degraded after a prompt change, or if your RAG pipeline is retrieving irrelevant context. LLMs fail silently. They return HTTP 200 OK with a confident, factually wrong answer. Traditional monitoring treats that as a success because the request completed within the timeout and returned a valid JSON response. AI observability means monitoring the quality of the output, not just the health of the infrastructure. This requires a fundamentally different monitoring paradigm: one that evaluates semantic correctness, not just operational metrics.
The Four Pillars of LLM Monitoring
We monitor four dimensions for every production LLM system.

First, quality metrics: response relevance scored on a 1-to-5 scale by an LLM judge; factual accuracy measured as faithfulness to retrieved context (does the answer only use information present in the source documents?); and format compliance (does the JSON output match the expected schema? Does the response follow length constraints?).

Second, safety metrics: toxicity scores via Bedrock Guardrails or a custom classifier; PII detection rates using Amazon Comprehend or Presidio; guardrail trigger rates (what percentage of requests hit a content filter?); and prompt injection attempt detection.

Third, operational metrics: token usage broken down by input and output; latency by model and by feature (P50, P95, P99); cache hit rates for semantic caching (we use GPTCache or a custom Redis-based cache); error rates by error type (timeout, rate limit, content filter, model error); and cost per request.

Fourth, drift metrics: how input distributions and output patterns change over time, measured by tracking average input token length, topic clustering via embeddings, output token length distribution, and refusal rates. A sudden shift in average token count or topic distribution often signals a problem before users complain.
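The drift metrics reduce to simple aggregates over logged traces compared against a baseline. A minimal sketch, assuming trace records with hypothetical `input_tokens`, `output_tokens`, and `refused` fields (illustrative names, not a real Langfuse schema):

```python
from statistics import mean

# Hypothetical trace records, shaped the way a monitoring job might pull
# them from the trace store; field names are illustrative.
window = [
    {"input_tokens": 420,  "output_tokens": 130, "refused": False},
    {"input_tokens": 980,  "output_tokens": 40,  "refused": True},
    {"input_tokens": 450,  "output_tokens": 120, "refused": False},
    {"input_tokens": 1010, "output_tokens": 35,  "refused": True},
]

def drift_signals(traces):
    """Summarize a window of traces into the drift metrics we track."""
    return {
        "avg_input_tokens": mean(t["input_tokens"] for t in traces),
        "avg_output_tokens": mean(t["output_tokens"] for t in traces),
        "refusal_rate": sum(t["refused"] for t in traces) / len(traces),
    }

def drifted(current, baseline, ratio=1.3):
    """Return the metrics that moved more than `ratio`x above baseline."""
    return [k for k in baseline if current[k] > baseline[k] * ratio]

baseline = {"avg_input_tokens": 450, "avg_output_tokens": 120, "refusal_rate": 0.02}
flags = drifted(drift_signals(window), baseline)
print(flags)  # → ['avg_input_tokens', 'refusal_rate']
```

Topic clustering via embeddings works the same way conceptually: summarize a window, compare against a baseline, flag large shifts.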
The Stack We Use
Our typical LLM observability stack has four layers.

Layer 1, trace-level logging: Langfuse (self-hosted on ECS Fargate with PostgreSQL on RDS) or LangSmith for every LLM call. Each trace captures the full prompt template, variable substitutions, retrieved context chunks with their similarity scores, the raw model response, token counts, latency, and cost. Langfuse's Python SDK integrates with LangChain, LlamaIndex, and raw Bedrock/OpenAI calls via decorators.

Layer 2, automated evaluation: a scheduled ECS task runs every hour, sampling 5 to 10% of production traces. It runs three evaluators: an LLM-as-judge (Claude 3.5 Haiku scoring relevance and faithfulness on a 1-to-5 scale), a RAGAS-based evaluation for RAG-specific metrics (context precision, context recall, answer relevance), and custom regex/schema validators for structured-output compliance.

Layer 3, infrastructure metrics: Prometheus with Grafana dashboards tracking request rates, latency histograms, GPU utilization (for self-hosted models), and error rates. We use the OpenTelemetry SDK to instrument application code and export traces to both Langfuse (for LLM-specific tracing) and Grafana Tempo (for infrastructure tracing).

Layer 4, alerting: PagerDuty alerts trigger on quality scores dropping below a 3.5 average (evaluated over a 1-hour window), cost anomalies exceeding 150% of the 7-day rolling average, P99 latency exceeding 10 seconds, and error rates above 2%.
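The Layer 2 evaluation task can be sketched in a few lines. This is a hedged illustration, not our actual job: `judge_score` is a hypothetical stand-in for the Claude 3.5 Haiku judge call (here it reads a precomputed field so the sketch runs offline), while the sampling rate and alert threshold mirror the numbers above:

```python
import random

def judge_score(trace):
    # Hypothetical field; in production this is an LLM-as-judge call
    # returning a 1-to-5 relevance/faithfulness score.
    return trace["score"]

def hourly_eval(traces, sample_rate=0.10, quality_floor=3.5, seed=42):
    """Sample a fraction of traces, average the judge scores, and decide
    whether to page (average quality dropped below the floor)."""
    rng = random.Random(seed)
    n = max(1, int(len(traces) * sample_rate))
    sample = rng.sample(traces, n)
    avg = sum(judge_score(t) for t in sample) / n
    return {"avg_score": avg, "page": avg < quality_floor}

healthy = [{"score": 4} for _ in range(100)]
print(hourly_eval(healthy))   # avg 4.0, no page
degraded = [{"score": 2} for _ in range(100)]
print(hourly_eval(degraded))  # avg 2.0, pages
```

The real task also writes each score back to the trace store so the quality trend line is queryable in dashboards, not just in alerts.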
Cost Monitoring Is AI Observability
LLM costs can spike unexpectedly because they are directly proportional to token volume. A prompt change that increases average token count by 20% increases your Bedrock or OpenAI bill by 20%. A bug in your RAG pipeline that retrieves 15 chunks instead of 5 inflates context length and cost by 3x. A retry loop that resends the full conversation history doubles or triples token consumption. We set up per-feature and per-model cost tracking from day one using Langfuse's cost tracking feature. Every LLM call is tagged with the feature that triggered it (document_summarization, chat_support, code_review), the user tier (free, pro, enterprise), and the model used. This produces dashboards showing that 'document summarization' costs $1,200/month on Claude 3.5 Sonnet while 'chat support' costs $300/month on Claude 3.5 Haiku. We also track cost per user session, which helps product teams understand the unit economics of AI features. One client discovered their free-tier users were consuming 4x the tokens of paid users due to longer, less focused conversations, which informed their pricing strategy.
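The cost-anomaly alert itself is simple: compare the latest day against a rolling baseline. A minimal sketch, assuming daily per-feature costs have already been aggregated (the function name and data are illustrative):

```python
def cost_anomaly(daily_costs, window=7, threshold=1.5):
    """Flag the most recent day if it exceeds `threshold` times the
    rolling average of the preceding `window` days (oldest to newest)."""
    *history, today = daily_costs[-(window + 1):]
    baseline = sum(history) / len(history)
    return today > baseline * threshold

# Seven normal days for one feature, then a retry-loop style spike.
costs = [40.0, 42.0, 39.0, 41.0, 43.0, 40.0, 41.0, 95.0]
print(cost_anomaly(costs))  # True: 95 > 1.5 x ~40.9
```

Running this per feature and per model, rather than on the total bill, is what makes the alert actionable: a 2x spike in one small feature disappears inside an account-wide aggregate.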
What We Have Caught with This Approach
Real examples from production deployments.

Case 1: A RAG pipeline started retrieving outdated FAQ documents after a knowledge base re-indexing job changed the chunk boundaries. The RAGAS context relevance score dropped from 0.82 to 0.51 within 2 hours. Our hourly evaluation pipeline triggered a PagerDuty alert, and the team identified the root cause (a changed chunking parameter in the indexing config) within 30 minutes.

Case 2: A prompt template change in a PR accidentally removed the instruction 'respond only in valid JSON,' causing parse failures in the downstream application. The format compliance evaluator caught it in staging before it reached production.

Case 3: A token usage spike caused by a retry loop in the LangChain error handler. When the model returned an unparseable response, the retry logic resent the request with the full conversation history appended each time, quadrupling costs overnight. Our cost anomaly alert fired at 3am, and the on-call engineer disabled the feature flag within 15 minutes.

Case 4: A gradual drift in user queries toward a topic not covered by the knowledge base, causing the model to hallucinate answers with high confidence. The faithfulness score trend line (tracked weekly) showed a steady decline over 3 weeks, prompting the team to expand the knowledge base before users reported issues.

None of these would have been caught by traditional infrastructure monitoring.
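A format-compliance evaluator like the one that caught Case 2 can be as simple as parsing the response and checking required keys. A hedged sketch under that assumption (a production evaluator would typically use a full JSON Schema validator instead of a key list):

```python
import json

def check_format(raw_response, required_keys):
    """Return (passed, reason) for a model response that is expected to
    be a JSON object containing all of `required_keys`."""
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(parsed, dict):
        return False, "not a JSON object"
    missing = [k for k in required_keys if k not in parsed]
    return not missing, ("missing keys: %s" % missing if missing else "ok")

print(check_format('{"summary": "short", "sentiment": "neutral"}',
                   ["summary", "sentiment"]))  # (True, 'ok')
print(check_format('Sure! Here is your summary.',
                   ["summary", "sentiment"]))  # (False, 'not valid JSON')
```

Run in staging against a replayed sample of production prompts, a check like this turns a prompt-template regression into a failed pipeline instead of a production incident.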