AWS Bedrock Cost Structure: What You're Actually Paying For

AWS Bedrock looks simple from the outside — call an API, get a response, pay per token. The reality is that a production Bedrock setup has several distinct cost layers, and they behave very differently from each other. Understanding the structure is the prerequisite to optimizing it.

How Bedrock Works (Conceptually)

Bedrock is a managed API layer over foundation models. You don’t provision servers or manage inference endpoints — you send requests, Bedrock routes them to the appropriate model, and you pay for what you use.

The key architectural pieces:

  • Foundation models — the models you call directly: Claude (Anthropic), Llama (Meta), Mistral, Titan and Nova (AWS-native), and others. Each has its own pricing.
  • Knowledge Bases — managed RAG (Retrieval-Augmented Generation). You connect a data source, Bedrock chunks and embeds it, stores vectors, and retrieves relevant chunks at query time.
  • Agents — orchestration layer for multi-step workflows. An agent reasons about a goal, calls tools or APIs, and chains multiple model invocations to complete a task.
  • Guardrails — content filtering layer that sits in front of model calls, evaluating inputs and outputs against defined policies.
  • Provisioned Throughput — reserved model capacity for consistent, high-volume workloads.

Each of these generates costs. Most teams only think about inference — but depending on your architecture, the others add up.

The Cost Layers

Model Inference

The primary cost. Billed per token — input and output priced separately — with rates varying significantly by model.

A rough illustration of the spread (check current AWS pricing for exact rates):

Model tier | Relative cost
Lightweight (Haiku, Llama small) | 1x
Mid-tier (Sonnet, Llama large) | 5–10x
Frontier (Opus, large 70B+) | 20–40x

The practical implication: model choice is the biggest lever you have on inference cost. Most teams default to a frontier model for everything and leave significant savings on the table.
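To make the spread concrete, here is a minimal sketch of the per-request arithmetic. The rates are placeholders picked only to mirror the 1x / 5–10x / 20–40x ratios above, and the names `TokenRates`, `request_cost`, and `monthly_cost` are hypothetical helpers, not part of any AWS SDK. Substitute real numbers from the AWS pricing page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRates:
    """USD per 1,000 tokens. All values below are illustrative placeholders."""
    input_per_1k: float
    output_per_1k: float


def request_cost(rates: TokenRates, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call: input and output tokens are priced separately."""
    return (input_tokens / 1000) * rates.input_per_1k \
        + (output_tokens / 1000) * rates.output_per_1k


def monthly_cost(rates: TokenRates, requests: int,
                 input_tokens: int, output_tokens: int) -> float:
    """Cost of `requests` identical calls in a month."""
    return requests * request_cost(rates, input_tokens, output_tokens)


# Placeholder rates (NOT real AWS prices), chosen to match the relative tiers.
TIERS = {
    "lightweight": TokenRates(0.001, 0.005),   # 1x
    "mid-tier": TokenRates(0.008, 0.040),      # 8x
    "frontier": TokenRates(0.030, 0.150),      # 30x
}

if __name__ == "__main__":
    # Same workload on every tier: 1M requests, 2,000 tokens in, 500 out.
    for name, rates in TIERS.items():
        print(f"{name:12s} ${monthly_cost(rates, 1_000_000, 2000, 500):,.2f}/month")
```

Running the same workload through each tier is usually the quickest way to see whether a feature justifies the model it is currently calling.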

On-demand vs. Provisioned Throughput vs. Batch

  • On-demand: pay per token, no commitment, scales to zero. Right for variable or unpredictable workloads.
  • Provisioned Throughput: reserve model units by the hour. Cheaper per token at sustained volume, but you pay whether you use it or not. Only makes sense above a consistent throughput threshold.
  • Batch inference: async processing at a discount. If your use case tolerates latency — document processing, offline enrichment — batch cuts inference costs roughly in half.
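The trade-off between the three modes can be sketched as a simple comparison. This assumes a single blended price per 1,000 tokens, a flat hourly rate per model unit, and the roughly 50% batch discount mentioned above. All numbers and function names are hypothetical, and the sketch ignores the throughput ceiling of a model unit.

```python
HOURS_PER_MONTH = 730  # common approximation for a month


def on_demand_cost(tokens: float, price_per_1k: float) -> float:
    """Pay-per-token cost at a blended (assumed) price per 1,000 tokens."""
    return tokens / 1000 * price_per_1k


def batch_cost(tokens: float, price_per_1k: float, discount: float = 0.5) -> float:
    """Batch is modeled as on-demand minus a discount (about 50% per the text)."""
    return on_demand_cost(tokens, price_per_1k) * (1 - discount)


def provisioned_cost(hourly_rate: float, model_units: int = 1,
                     hours: float = HOURS_PER_MONTH) -> float:
    """Reserved capacity is billed by the hour whether or not it is used."""
    return hourly_rate * model_units * hours


def breakeven_tokens_per_hour(hourly_rate: float, price_per_1k: float) -> float:
    """Sustained tokens/hour above which one model unit beats on-demand."""
    return hourly_rate / price_per_1k * 1000


def cheapest_mode(tokens_per_month: float, price_per_1k: float,
                  hourly_rate: float, latency_tolerant: bool) -> str:
    """Pick the lowest-cost mode; batch is only eligible for async workloads."""
    options = {
        "on_demand": on_demand_cost(tokens_per_month, price_per_1k),
        "provisioned": provisioned_cost(hourly_rate),
    }
    if latency_tolerant:
        options["batch"] = batch_cost(tokens_per_month, price_per_1k)
    return min(options, key=options.get)
```

With a placeholder $20/hour model unit and $0.01 per 1,000 tokens, the break-even sits at about 2 million sustained tokens per hour, which is why Provisioned Throughput rarely wins for spiky traffic.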

Knowledge Base Costs

RAG architectures have two cost components:

  1. Vector storage: the default (quick-create) vector store for Bedrock Knowledge Bases is OpenSearch Serverless. You pay for the OCU (OpenSearch Compute Unit) allocation, not just storage. This is a fixed ongoing cost that doesn’t scale to zero.
  2. Embedding model calls: every document chunk gets embedded on ingestion, and every query gets embedded at retrieval time. These are model inference calls, billed per token.

For a small knowledge base with infrequent updates, embedding costs are negligible. The OpenSearch Serverless floor cost is not — it runs regardless of query volume.
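A rough sketch of that asymmetry, assuming a fixed OCU allocation billed hourly and an embedding model billed per 1,000 tokens. The rates, the OCU count, and the function name are placeholders for illustration only.

```python
def knowledge_base_monthly_cost(
    ocu_count: float,
    ocu_hourly_rate: float,
    ingested_tokens: int,
    query_count: int,
    avg_query_tokens: int,
    embed_price_per_1k: float,
    hours_in_month: float = 730,
) -> dict:
    """Split a month of Knowledge Base spend into its fixed and variable parts."""
    # Fixed floor: OCUs are billed for every hour, regardless of query volume.
    vector_store = ocu_count * ocu_hourly_rate * hours_in_month

    # Variable part: tokens embedded at ingestion plus tokens embedded per query.
    embedded_tokens = ingested_tokens + query_count * avg_query_tokens
    embedding = embedded_tokens / 1000 * embed_price_per_1k

    return {
        "vector_store": vector_store,
        "embedding": embedding,
        "total": vector_store + embedding,
    }


if __name__ == "__main__":
    # Placeholder inputs: 2 OCUs, 5M tokens ingested, 10k short queries.
    costs = knowledge_base_monthly_cost(
        ocu_count=2, ocu_hourly_rate=0.24,
        ingested_tokens=5_000_000, query_count=10_000,
        avg_query_tokens=20, embed_price_per_1k=0.0001,
    )
    for part, usd in costs.items():
        print(f"{part:13s} ${usd:,.2f}")
```

Under these assumed inputs the vector store dominates the total by several orders of magnitude, which is the pattern described above for small knowledge bases.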

Agents

Agents inflate your effective token count. Every agent invocation includes a system prompt with the agent’s instructions, available tools, and conversation history. That overhead compounds across multi-step reasoning chains.

A task that takes 3 model calls to complete might generate 5x the tokens of a single direct model call, because each intermediate step re-sends accumulated context. The cost profile of an agentic workflow is fundamentally different from a simple query-response pattern.
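The compounding can be sketched with a simplified model in which every step re-sends the agent's system prompt, the user message, and all prior step outputs. Real agents vary in how much history they carry, so treat the function and the token counts as illustrative assumptions.

```python
from typing import Sequence


def agent_input_tokens(system_tokens: int, user_tokens: int,
                       step_output_tokens: Sequence[int]) -> int:
    """Total input tokens billed across a multi-step agent run.

    Assumes each step re-sends the system prompt plus the full history so far
    (user message and every earlier step's output).
    """
    total = 0
    history = user_tokens
    for output in step_output_tokens:
        total += system_tokens + history  # what this step sends to the model
        history += output                 # this step's output joins the history
    return total


def agent_total_tokens(system_tokens: int, user_tokens: int,
                       step_output_tokens: Sequence[int]) -> int:
    """Input plus output tokens for the whole run."""
    return agent_input_tokens(system_tokens, user_tokens, step_output_tokens) \
        + sum(step_output_tokens)


if __name__ == "__main__":
    # Hypothetical 3-step run: 1,500-token agent prompt, 200-token user message,
    # 300 tokens produced per step.
    steps = [300, 300, 300]
    agent = agent_total_tokens(1500, 200, steps)
    # A direct call with a leaner 1,000-token prompt and one 300-token answer.
    direct = 1000 + 200 + 300
    print(f"agent: {agent} tokens, direct: {direct} tokens, ratio: {agent / direct:.1f}x")
```

The multiplier grows with both the size of the agent's system prompt and the number of steps, so trimming tool descriptions and capping step counts both reduce it.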

Guardrails

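Priced on the text evaluated, not on model tokens: AWS meters Guardrails in text units per enabled policy, and a request is typically evaluated twice, once on the input and once on the output. If Guardrails runs on every request in a high-volume system, the cost is material. Worth measuring, easy to overlook.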

Custom Model Training

If you’re fine-tuning models through Bedrock, training jobs are billed per token processed during training. This is a one-time or periodic cost, not ongoing — but training runs on large datasets can be expensive. Separate from inference costs entirely.

CUR Line Items to Know

Bedrock costs split across two billing sources depending on the model:

Third-party models (Claude, Llama, Mistral) appear as named service line items — for example, Claude Sonnet 4.6 (Amazon Bedrock Edition) — and route through AWS Marketplace under the hood. In CUR, filter by product/ProductCode = marketplace. Usage types follow the pattern <region>-MP:<region>_InputTokenCount-Units and OutputTokenCount-Units.

AWS-native models and Bedrock services (Titan, Nova, Knowledge Bases, Agents, Guardrails) appear under product/ProductCode = AmazonBedrock. Usage types are more descriptive — for example, USE1-TitanEmbeddingV2-Text-input-tokens.

The important note: if you filter Cost Explorer to “Amazon Bedrock,” you will see the native service costs but miss all Claude and third-party model spend. They’re separate line items.
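One way to get a combined view is to bucket CUR rows yourself. The sketch below works on rows already loaded as dictionaries (for example via `csv.DictReader`). It takes the product code values and usage-type patterns from the description above. The cost column name `lineItem/UnblendedCost` and the token-count narrowing of marketplace rows are assumptions to verify against your own export, and the function name is hypothetical.

```python
from typing import Dict, Iterable, Mapping


def split_bedrock_spend(
    rows: Iterable[Mapping[str, str]],
    product_col: str = "product/ProductCode",   # column name as used in this post
    usage_col: str = "lineItem/UsageType",      # assumed column name
    cost_col: str = "lineItem/UnblendedCost",   # assumed column name
) -> Dict[str, float]:
    """Sum Bedrock-related cost into 'native' and 'third_party' buckets."""
    totals = {"native": 0.0, "third_party": 0.0}
    for row in rows:
        code = row.get(product_col, "")
        usage = row.get(usage_col, "")
        cost = float(row.get(cost_col) or 0)

        if code == "AmazonBedrock":
            # Titan, Nova, Knowledge Bases, Agents, Guardrails.
            totals["native"] += cost
        elif code == "marketplace" and (
            "InputTokenCount" in usage or "OutputTokenCount" in usage
        ):
            # Marketplace rows that look like model token usage; other
            # marketplace purchases are deliberately left out.
            totals["third_party"] += cost
    return totals


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    sample = [
        {"product/ProductCode": "marketplace",
         "lineItem/UsageType": "USE1-MP:USE1_InputTokenCount-Units",
         "lineItem/UnblendedCost": "12.50"},
        {"product/ProductCode": "AmazonBedrock",
         "lineItem/UsageType": "USE1-TitanEmbeddingV2-Text-input-tokens",
         "lineItem/UnblendedCost": "1.25"},
    ]
    print(split_bedrock_spend(sample))
```

Adding the two buckets gives the number most teams actually mean when they ask what Bedrock costs them.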

How to Think About Optimization

Work through the layers in order of impact:

1. Model right-sizing — The highest-leverage change. Identify every feature calling a model and ask whether it actually needs that tier. Summarization, classification, extraction, and simple Q&A often work well on lighter models. Test quality before assuming it doesn’t.

2. Input token reduction — Shorter system prompts, tighter few-shot examples, truncated context windows. Every token you don’t send is a token you don’t pay for. This compounds across high-volume features.

3. Prompt caching — For requests with a large, repeated context (a long system prompt, a reference document), prompt caching avoids re-processing the static portion on every call. Supported on Claude models; the savings are significant for the right use case.

4. Batch inference — If any workload is async by nature — nightly enrichment, document indexing, bulk classification — move it to batch. The discount is roughly 50% with no quality trade-off.

5. Provisioned Throughput — Only evaluate this after you’ve right-sized your models and workloads. Reserved capacity only pays off at consistent, predictable volume. Most teams look at this too early.

6. Knowledge Base architecture — Tune chunk size and overlap to reduce the number of chunks retrieved per query. Fewer chunks = fewer tokens passed to the model = lower inference cost per query.
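The last item is easy to quantify. Here is a minimal sketch, assuming every retrieved chunk is passed to the model in full and that the input price per 1,000 tokens is a placeholder. The function names and the two configurations are hypothetical.

```python
def retrieval_context_tokens(chunks_retrieved: int, chunk_tokens: int) -> int:
    """Tokens added to the prompt by retrieval (each chunk sent in full)."""
    return chunks_retrieved * chunk_tokens


def query_input_cost(chunks_retrieved: int, chunk_tokens: int,
                     prompt_tokens: int, input_price_per_1k: float) -> float:
    """Input-side cost of one RAG query: base prompt plus retrieved context."""
    total_tokens = prompt_tokens \
        + retrieval_context_tokens(chunks_retrieved, chunk_tokens)
    return total_tokens / 1000 * input_price_per_1k


if __name__ == "__main__":
    price = 0.003  # placeholder USD per 1,000 input tokens
    before = query_input_cost(chunks_retrieved=5, chunk_tokens=500,
                              prompt_tokens=300, input_price_per_1k=price)
    after = query_input_cost(chunks_retrieved=3, chunk_tokens=400,
                             prompt_tokens=300, input_price_per_1k=price)
    print(f"before: ${before:.4f}/query, after: ${after:.4f}/query")
```

Any change to chunking should be checked against answer quality first; the arithmetic only shows what the savings would be if retrieval still surfaces the right passages.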

Conclusion

Bedrock’s cost structure rewards teams who look at each layer independently. Model right-sizing is the highest-leverage starting point — most teams can cut inference spend by 30–50% before touching anything else. From there, token reduction, prompt caching, and batch inference each compound on top. Working through the layers in order turns an opaque bill into a set of concrete, addressable decisions.


The cost structure isn’t complicated, but it requires looking at each layer separately. Book a call if you want to walk through where your Bedrock spend is actually going.