Cache LLM responses to serve repeated requests up to 20x faster and cheaper. Simple caching is available on all plans; semantic caching requires a Production or Enterprise plan.
Mode | How it Works | Best For | Supported Routes
--- | --- | --- | ---
Simple | Exact match on input | Repeated identical prompts | All models, including image generation
Semantic | Matches semantically similar requests | Variations in phrasing | /chat/completions, /completions

Enable Cache

Add cache to your config object:
{ "cache": { "mode": "simple" } }
Caching does not work when the request includes the x-portkey-debug: "false" header.
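
For example, a minimal Python sketch that attaches this config at client creation (assuming the portkey_ai SDK accepts a config dict here, and reusing the @openai-prod/gpt-4o model slug from the examples below):

from portkey_ai import Portkey

# Attach the cache config once; it applies to every request from this client
portkey = Portkey(
    api_key="PORTKEY_API_KEY",  # your Portkey API key
    config={"cache": {"mode": "simple"}},
)

response = portkey.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    model="@openai-prod/gpt-4o",
)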

Simple Cache

Exact match on input prompts. If the same request comes again, Portkey returns the cached response.
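
As a sketch, reusing the client configured above, sending the same request twice serves the second call from cache (it appears in Logs as Cache Hit):

prompt = [{"role": "user", "content": "What is the capital of France?"}]

first = portkey.chat.completions.create(messages=prompt, model="@openai-prod/gpt-4o")
# Identical input, so this call returns the cached response
second = portkey.chat.completions.create(messages=prompt, model="@openai-prod/gpt-4o")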

Semantic Cache

Matches requests with similar meaning using cosine similarity. Learn more →
Semantic cache is a superset of simple cache: exact matches are served too.
Semantic cache works for requests under 8,191 tokens with at most 4 messages.
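
For illustration, a sketch with semantic mode enabled. Whether the second call hits depends on the similarity score, and the two-message shape matches the requirement described in the next section:

semantic = Portkey(
    api_key="PORTKEY_API_KEY",
    config={"cache": {"mode": "semantic"}},
)

# Two differently phrased requests with the same meaning
for question in ["Who wrote Hamlet?", "Hamlet was written by whom?"]:
    semantic.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": question},
        ],
        model="@openai-prod/gpt-4o",
    )  # the second iteration can be served as a semantic cache hit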

System Message Ignored

Semantic cache requires at least two messages. The first message (typically system) is ignored for matching:
[
  { "role": "system", "content": "You are a helpful assistant" },
  { "role": "user", "content": "Who is the president of the US?" }
]
Only the user message is used for matching, so you can change the system message without affecting cache hits.
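
Continuing the sketch above, pairing a different system message with an already-cached user message still matches:

# Different system message, same user message as the cached request above
semantic.chat.completions.create(
    messages=[
        {"role": "system", "content": "Answer in one short sentence."},
        {"role": "user", "content": "Who wrote Hamlet?"},
    ],
    model="@openai-prod/gpt-4o",
)  # still served from cache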

Cache TTL

Set expiration with max_age (in seconds):
{ "cache": { "mode": "semantic", "max_age": 60 } }
Setting | Value
--- | ---
Minimum | 60 seconds
Maximum | 90 days (7,776,000 seconds)
Default | 7 days (604,800 seconds)

Organization-Level TTL

Admins can set a default TTL for all workspaces to align caching with data retention policies:
  1. Go to Admin Settings → Organization Properties → Cache Settings
  2. Enter the default TTL (in seconds)
  3. Save
Precedence:
  • No max_age in request → org default used
  • Request max_age > org default → org default wins
  • Request max_age < org default → request value honored
Max org-level TTL: 25,923,000 seconds (about 300 days).
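
The precedence rules amount to taking the smaller of the two values whenever both are set. A sketch (effective_ttl is a hypothetical helper, not part of the SDK):

def effective_ttl(request_max_age, org_default):
    """Resolve the cache TTL per the precedence rules above."""
    if request_max_age is None:
        return org_default                      # no max_age in request
    return min(request_max_age, org_default)    # the smaller value wins

assert effective_ttl(None, 604_800) == 604_800
assert effective_ttl(86_400, 604_800) == 86_400       # request below org default: honored
assert effective_ttl(7_776_000, 604_800) == 604_800   # request above org default: capped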

Force Refresh

Fetch a fresh response even when a cached response exists. This is set per-request (not in Config):
response = portkey.with_options(
    cache_force_refresh=True
).chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    model="@openai-prod/gpt-4o"
)
  • A cache-enabled config must still be passed with the request
  • For semantic hits, a force refresh updates ALL matching entries

Cache Namespace

By default, Portkey partitions the cache by all request headers. Use a custom namespace to partition only by your chosen string instead, which is useful for per-user caching or improving hit ratio:
response = portkey.with_options(
    cache_namespace="user-123"
).chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    model="@openai-prod/gpt-4o"
)

Cache with Configs

Set cache at top-level or per-target:
{
  "cache": { "mode": "semantic", "max_age": 60 },
  "strategy": { "mode": "fallback" },
  "targets": [
    { "override_params": { "model": "@openai-prod/gpt-4o" } },
    { "override_params": { "model": "@anthropic-prod/claude-3-5-sonnet-20241022" } }
  ]
}
Target-level cache settings take precedence over the top-level setting, as the sketch below shows.
For targets with override_params, a cache hit requires that exact parameter combination to have been cached before.
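
For instance, a sketch of a per-target override in the same config shape, written as a Python dict and passed at client creation (assuming, as above, that the SDK accepts a config dict). The first target uses simple caching; the second inherits the top-level semantic cache:

config = {
    "cache": {"mode": "semantic", "max_age": 60},   # top-level default
    "strategy": {"mode": "fallback"},
    "targets": [
        {
            "override_params": {"model": "@openai-prod/gpt-4o"},
            "cache": {"mode": "simple"},            # overrides the top-level cache for this target
        },
        # No cache block here, so the top-level semantic cache applies
        {"override_params": {"model": "@anthropic-prod/claude-3-5-sonnet-20241022"}},
    ],
}

portkey = Portkey(api_key="PORTKEY_API_KEY", config=config)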

Analytics & Logs

Analytics → Cache tab shows:
  • Cache hit rate
  • Latency savings
  • Cost savings
Logs → Status column shows: Cache Hit, Cache Semantic Hit, Cache Miss, Cache Refreshed, or Cache Disabled. Learn more →