The Real Cost of Running AI in Production on AWS
The GPU Bill Shock
Teams building AI products on AWS consistently underestimate costs by 3 to 5x. A single ml.g5.xlarge SageMaker endpoint (1 NVIDIA A10G GPU, 4 vCPUs, 16GB RAM) costs $1.41/hour, which is approximately $1,015/month running 24/7. Need two endpoints for redundancy across availability zones? That is $2,030. Add a training job on ml.g5.2xlarge ($2.40/hour) that runs 20 hours a week (roughly $208/month) and you are at approximately $2,238 before you have paid for data storage, transfer, or any other infrastructure. The numbers get significantly worse with larger models: an ml.p4d.24xlarge instance (8 NVIDIA A100 GPUs) for training large models costs $32.77/hour, which is $23,594/month. We have seen seed-stage startups burn through $15K/month on AI infrastructure that could be optimized to $4K without sacrificing inference quality or training throughput. The key is understanding where every dollar goes and making deliberate tradeoffs.
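The arithmetic above is easy to check in a few lines, assuming a 720-hour (30-day) month and about 4.33 weeks per month, which is how the figures in this article are computed:

```python
HOURS_PER_MONTH = 720        # 30-day month, matching the figures above
WEEKS_PER_MONTH = 52 / 12    # ~4.33 weeks per month

# Two ml.g5.xlarge endpoints ($1.41/hour) running 24/7
inference = 2 * 1.41 * HOURS_PER_MONTH

# One ml.g5.2xlarge training job ($2.40/hour), 20 hours per week
training = 2.40 * 20 * WEEKS_PER_MONTH

print(round(inference), round(training), round(inference + training))
# → 2030 208 2238
```

The same two-line calculation scales to any instance type: hourly rate times hours per month, summed across endpoints and jobs.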
Where the Money Actually Goes
We have analyzed the cost breakdown across 30+ production AI workloads on AWS, and the pattern is remarkably consistent. Compute accounts for 50 to 60% of the bill, split between GPU inference endpoints (the largest single line item), training jobs (bursty but expensive per hour), and CPU compute on Lambda or Fargate for orchestration, preprocessing, and API layers. Storage accounts for 15 to 20%: S3 for training datasets and model artifacts (typically 500GB to 5TB), EBS gp3 volumes for training instance scratch space, and vector databases (OpenSearch Serverless at $0.24 per OCU-hour or PostgreSQL with pgvector on RDS). Data transfer accounts for 10 to 15%, which is often surprisingly high due to cross-AZ traffic between inference endpoints and application servers ($0.01/GB per direction), S3 transfer to SageMaker training instances, and VPC endpoint traffic for Bedrock API calls. The remaining 10 to 15% covers CloudWatch logging, X-Ray tracing, Langfuse hosting, Secrets Manager, and supporting services. Most teams focus exclusively on optimizing compute and ignore data transfer, which is a mistake. We have seen data transfer costs exceed $1,200/month for a single AI service due to cross-AZ traffic patterns.
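As a rough budgeting aid, the midpoints of the shares above can be turned into an expected breakdown for a given bill. This is a sketch using the observed pattern, not a guarantee for any particular workload:

```python
# Midpoints of the cost shares described above (assumed typical, not exact)
TYPICAL_SHARES = {
    "compute": 0.55,         # 50-60%: GPU endpoints, training jobs, CPU orchestration
    "storage": 0.175,        # 15-20%: S3, EBS, vector databases
    "data_transfer": 0.125,  # 10-15%: cross-AZ, S3-to-training, VPC endpoints
    "other": 0.15,           # 10-15%: CloudWatch, X-Ray, Secrets Manager, etc.
}

def expected_breakdown(monthly_total: float) -> dict:
    """Split a monthly bill by the typical shares, rounded to whole dollars."""
    return {item: round(monthly_total * share) for item, share in TYPICAL_SHARES.items()}

print(expected_breakdown(10_000))
# → {'compute': 5500, 'storage': 1750, 'data_transfer': 1250, 'other': 1500}
```

Comparing an actual Cost Explorer breakdown against this expectation is a quick smoke test: a data-transfer line far above ~12.5% of the bill is a signal to audit cross-AZ traffic patterns.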
Inference Cost Optimization
The biggest savings come from right-sizing inference endpoints. We see teams running ml.g5.2xlarge instances ($1.73/hour, $1,245/month) for models that fit comfortably on ml.g5.xlarge ($1.41/hour, $1,015/month) or even a smaller GPU instance such as ml.g4dn.xlarge (roughly $0.74/hour, about $530/month). The first step is profiling: run your model on the smallest GPU instance, measure inference latency at target throughput, and scale up only if P99 latency exceeds your SLA. Auto-scaling inference endpoints based on the InvocationsPerInstance CloudWatch metric instead of keeping peak capacity 24/7 typically cuts inference costs by 30 to 40%. Set the target value to 70% of your measured max throughput with a scale-down cooldown of 300 seconds. For SageMaker, multi-model endpoints (MME) let you serve multiple models from a single GPU instance by loading model artifacts from S3 on demand. This works well when individual models have low utilization, with the tradeoff of a cold-load latency of 5 to 30 seconds for the first request to a model not currently in GPU memory. For Bedrock, the optimization is about prompt engineering: shorter system prompts, concise few-shot examples, and response length limits. Reducing average prompt length from 2,000 to 1,200 tokens cuts your Bedrock input costs by 40%.
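A target-tracking policy along those lines can be attached through the Application Auto Scaling API. This is a configuration sketch: the endpoint name, variant name, capacity bounds, and target value below are illustrative, not prescriptive.

```python
import boto3

def configure_endpoint_autoscaling(endpoint_name: str,
                                   variant_name: str = "AllTraffic",
                                   target_invocations: float = 70.0,
                                   min_capacity: int = 1,
                                   max_capacity: int = 4) -> str:
    """Attach a target-tracking scaling policy to a SageMaker endpoint variant.

    target_invocations should be ~70% of the measured max invocations-per-instance
    throughput, per the guidance above. Returns the auto-scaling resource id.
    """
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    aas = boto3.client("application-autoscaling")

    # Register the variant's instance count as a scalable target.
    aas.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=min_capacity,
        MaxCapacity=max_capacity,
    )

    # Track invocations-per-instance against the target value.
    aas.put_scaling_policy(
        PolicyName=f"{endpoint_name}-invocations-target",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "TargetValue": target_invocations,
            "ScaleInCooldown": 300,   # the 300-second scale-down cooldown above
            "ScaleOutCooldown": 60,   # react faster to traffic spikes than drops
        },
    )
    return resource_id

# Example (requires AWS credentials and an existing endpoint):
# configure_endpoint_autoscaling("my-inference-endpoint", target_invocations=70.0)
```

The asymmetric cooldowns are a deliberate choice: scaling out quickly protects latency SLAs, while the slower scale-in avoids thrashing when traffic is bursty.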
Training Cost Traps
Training jobs are bursty by nature, which makes them ideal for spot instances. SageMaker Managed Spot Training can save up to 70% on training costs with automatic checkpointing to S3 so you do not lose progress if a spot instance is reclaimed. Configure your estimator with use_spot_instances=True and a max_wait above your max_run; max_wait caps the total job duration including time spent waiting for spot capacity, so max_wait=7200 (seconds) on a one-hour job allows up to an hour of waiting. In practice, we see spot interruption rates under 10% for ml.g5 instances and under 5% for ml.g4dn instances. The other common trap is keeping training instances running between jobs. SageMaker notebook instances (which run on dedicated EC2 instances) are the worst offender. An ml.p3.2xlarge notebook instance costs $3.825/hour, which is $91.80/day for an idle GPU. We have found clients with 5 to 10 notebook instances left running over weekends, wasting $500+ each weekend. The fix is SageMaker Studio with lifecycle configuration scripts that auto-stop idle kernels after 60 minutes, combined with an AWS Lambda function triggered by a nightly EventBridge rule that stops any notebook instance idle for more than 2 hours.
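With the SageMaker Python SDK, a spot training job along those lines looks like the sketch below. The image URI, role ARN, and S3 paths are placeholders you would substitute with your own values:

```python
from sagemaker.estimator import Estimator

# Illustrative configuration -- image, role, and bucket names are placeholders.
estimator = Estimator(
    image_uri="<training-image-uri>",
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.g5.2xlarge",
    use_spot_instances=True,          # request spare capacity (up to ~70% savings)
    max_run=4 * 3600,                 # cap on billable training seconds
    max_wait=6 * 3600,                # total cap, including time spent waiting
                                      # for spot capacity; must be >= max_run
    checkpoint_s3_uri="s3://<bucket>/checkpoints/",  # checkpoints survive
                                      # interruption so the job can resume
)
# estimator.fit({"train": "s3://<bucket>/datasets/train/"})
```

The checkpoint_s3_uri is what makes spot viable: your training script must write checkpoints to the local checkpoint directory (synced to that S3 prefix) and resume from the latest one on startup, otherwise a reclaimed instance means restarting from scratch.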
A Framework for AI Cost Planning
Before building, estimate costs across three scenarios: development (1 to 2 inference endpoints, low traffic, frequent experiments), launch (2 to 3 endpoints with auto-scaling, moderate traffic from initial users), and scale (multi-AZ endpoints, high traffic, weekly retraining). For each scenario, map out the following in a spreadsheet: number of inference endpoints with instance types and expected hours per month, expected request volume and average latency SLA, training frequency (weekly, monthly) with dataset size and expected training duration, storage requirements for models (typically 1 to 10GB per model), datasets (100GB to 5TB), and vector indices. We provide clients with an AWS AI cost model spreadsheet that uses current pricing from the AWS Pricing API to project monthly costs across these scenarios. The spreadsheet includes a 'what-if' tab where you can model the impact of optimizations: switching to spot for training, enabling auto-scaling, using smaller instance types, or migrating from SageMaker to Bedrock. This avoids the surprise of hitting scale and finding out your AI feature costs $20K/month instead of the $5K you budgeted. One client used this model to decide to start with Bedrock for their MVP (projected cost: $800/month) and plan a migration to SageMaker endpoints when they exceed 5 million requests per month (projected crossover at $3,200/month).
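A minimal version of such a cost model can live in code rather than a spreadsheet. In the sketch below, the instance rates, the 70% spot savings, the 35% auto-scaling reduction, and the S3 storage rate are illustrative assumptions, not live AWS pricing:

```python
HOURS_PER_MONTH = 720  # 30-day month

def monthly_cost(endpoint_rates, training_hours, training_rate,
                 storage_gb, spot=False, autoscale_factor=1.0):
    """Rough monthly projection in dollars.

    endpoint_rates:   list of $/hour rates, one per 24/7 inference endpoint
    autoscale_factor: effective fraction of peak-capacity hours after auto-scaling
    spot:             apply an assumed 70% discount to training compute
    """
    inference = sum(rate * HOURS_PER_MONTH * autoscale_factor for rate in endpoint_rates)
    training = training_hours * training_rate * (0.3 if spot else 1.0)
    storage = storage_gb * 0.023  # assumed S3 Standard $/GB-month
    return round(inference + training + storage)

# Launch scenario: two ml.g5.xlarge endpoints, ~20h/week training, 1TB storage
baseline = monthly_cost([1.41, 1.41], training_hours=87, training_rate=2.40,
                        storage_gb=1000)
# What-if: spot training plus auto-scaling at 65% of peak-capacity hours
optimized = monthly_cost([1.41, 1.41], training_hours=87, training_rate=2.40,
                         storage_gb=1000, spot=True, autoscale_factor=0.65)
print(baseline, optimized)
# → 2262 1405
```

Rerunning the model with different instance types, traffic levels, and optimization flags gives you the 'what-if' tab in a few lines, and keeping the rates in one place makes it easy to refresh them from the AWS Pricing API.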