Cache LLM responses to serve repeated requests up to 20x faster and cheaper. Simple caching is available on all plans; semantic caching requires a Production or Enterprise plan.
Mode | How it Works | Best For | Supported Routes
--- | --- | --- | ---
Simple | Exact match on input | Repeated identical prompts | All models, including image generation
Semantic | Matches semantically similar requests | Variations in phrasing | /chat/completions, /completions

Enable Cache

Add cache to your config object:
{ "cache": { "mode": "simple" } }
Caching does not work when the request includes the x-portkey-debug: "false" header.
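
For example, a minimal Python sketch that attaches this config at client creation (assuming the portkey_ai SDK accepts a config dict here, and reusing the @openai-prod/gpt-4o model slug from the examples below):

from portkey_ai import Portkey

# Attach the cache config once; it applies to every request from this client
portkey = Portkey(
    api_key="PORTKEY_API_KEY",  # your Portkey API key
    config={"cache": {"mode": "simple"}},
)

response = portkey.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    model="@openai-prod/gpt-4o",
)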

Simple Cache

Exact match on input prompts. If the same request comes again, Portkey returns the cached response.
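
As a sketch, reusing the client configured above, sending the same request twice serves the second call from cache (it appears in Logs as Cache Hit):

prompt = [{"role": "user", "content": "What is the capital of France?"}]

first = portkey.chat.completions.create(messages=prompt, model="@openai-prod/gpt-4o")
# Identical input, so this call returns the cached response
second = portkey.chat.completions.create(messages=prompt, model="@openai-prod/gpt-4o")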

Semantic Cache

Matches requests with similar meaning using cosine similarity. Learn more →
Semantic cache is a superset of simple cache: exact matches are served too.
Semantic cache works for requests under 8,191 tokens with at most 4 messages.
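
For illustration, a sketch with semantic mode enabled. Whether the second call hits depends on the similarity score, and the two-message shape matches the requirement described in the next section:

semantic = Portkey(
    api_key="PORTKEY_API_KEY",
    config={"cache": {"mode": "semantic"}},
)

# Two differently phrased requests with the same meaning
for question in ["Who wrote Hamlet?", "Hamlet was written by whom?"]:
    semantic.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": question},
        ],
        model="@openai-prod/gpt-4o",
    )  # the second iteration can be served as a semantic cache hit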

System Message Ignored

Semantic cache requires at least two messages. The first message (typically system) is ignored for matching:
[
  { "role": "system", "content": "You are a helpful assistant" },
  { "role": "user", "content": "Who is the president of the US?" }
]
Only the user message is used for matching, so you can change the system message without affecting cache hits.
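
Continuing the sketch above, pairing a different system message with an already-cached user message still matches:

# Different system message, same user message as the cached request above
semantic.chat.completions.create(
    messages=[
        {"role": "system", "content": "Answer in one short sentence."},
        {"role": "user", "content": "Who wrote Hamlet?"},
    ],
    model="@openai-prod/gpt-4o",
)  # still served from cache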

Cache TTL

Set expiration with max_age (in seconds):
{ "cache": { "mode": "semantic", "max_age": 60 } }
Setting | Value
--- | ---
Minimum | 60 seconds
Maximum | 90 days (7,776,000 seconds)
Default | 7 days (604,800 seconds)

Organization-Level TTL

Admins can set a default TTL for all workspaces to align caching with data retention policies:
  1. Go to Admin Settings → Organization Properties → Cache Settings
  2. Enter the default TTL (in seconds)
  3. Save
Precedence:
  • No max_age in request → org default used
  • Request max_age > org default → org default wins
  • Request max_age < org default → request value honored
Max org-level TTL: 25,923,000 seconds (about 300 days).
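
The precedence rules amount to taking the smaller of the two values whenever both are set. A sketch (effective_ttl is a hypothetical helper, not part of the SDK):

def effective_ttl(request_max_age, org_default):
    """Resolve the cache TTL per the precedence rules above."""
    if request_max_age is None:
        return org_default                      # no max_age in request
    return min(request_max_age, org_default)    # the smaller value wins

assert effective_ttl(None, 604_800) == 604_800
assert effective_ttl(86_400, 604_800) == 86_400       # request below org default: honored
assert effective_ttl(7_776_000, 604_800) == 604_800   # request above org default: capped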

Force Refresh

Fetch a fresh response even when a cached response exists. This is set per-request (not in Config):
response = portkey.with_options(
    cache_force_refresh=True
).chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    model="@openai-prod/gpt-4o"
)
  • A cache-enabled config must still be passed with the request
  • For semantic hits, a force refresh updates ALL matching entries

Cache Namespace

By default, Portkey partitions the cache by all request headers. Use a custom namespace to partition only by your chosen string instead, which is useful for per-user caching or improving hit ratio:
response = portkey.with_options(
    cache_namespace="user-123"
).chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    model="@openai-prod/gpt-4o"
)

Cache with Configs

Set cache at top-level or per-target:
{
  "cache": { "mode": "semantic", "max_age": 60 },
  "strategy": { "mode": "fallback" },
  "targets": [
    { "override_params": { "model": "@openai-prod/gpt-4o" } },
    { "override_params": { "model": "@anthropic-prod/claude-3-5-sonnet-20241022" } }
  ]
}
Target-level cache settings take precedence over the top-level setting, as the sketch below shows.
For targets with override_params, a cache hit requires that exact parameter combination to have been cached before.
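
For instance, a sketch of a per-target override in the same config shape, written as a Python dict and passed at client creation (assuming, as above, that the SDK accepts a config dict). The first target uses simple caching; the second inherits the top-level semantic cache:

config = {
    "cache": {"mode": "semantic", "max_age": 60},   # top-level default
    "strategy": {"mode": "fallback"},
    "targets": [
        {
            "override_params": {"model": "@openai-prod/gpt-4o"},
            "cache": {"mode": "simple"},            # overrides the top-level cache for this target
        },
        # No cache block here, so the top-level semantic cache applies
        {"override_params": {"model": "@anthropic-prod/claude-3-5-sonnet-20241022"}},
    ],
}

portkey = Portkey(api_key="PORTKEY_API_KEY", config=config)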

Analytics & Logs

Analytics → Cache tab shows:
  • Cache hit rate
  • Latency savings
  • Cost savings
Logs → Status column shows: Cache Hit, Cache Semantic Hit, Cache Miss, Cache Refreshed, or Cache Disabled. Learn more →