Simple caching is available for all plans.
Semantic caching requires a vector database and is only available on select Enterprise plans. Contact us to learn more about enabling this feature.
Cache LLM responses to serve requests up to 20x faster and cheaper.
| Mode | How it Works | Best For | Supported Routes |
|---|---|---|---|
| Simple | Exact match on input | Repeated identical prompts | All models, including image generation |
| Semantic | Matches semantically similar requests | Variations in phrasing | /chat/completions, /completions |

Enable Cache

Add cache to your config object:
{ "cache": { "mode": "simple" } }
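The same config object can also be serialized and attached per request; a minimal sketch, assuming the config is passed as a JSON string (for example via the x-portkey-config request header):

```python
import json

# Cache config from the snippet above; the serialized form can be attached
# to an individual request (e.g. as the value of the x-portkey-config header).
config = {"cache": {"mode": "simple"}}
header_value = json.dumps(config)
```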
Caching won't work if the x-portkey-debug: "false" header is included.

Simple Cache

Exact match on input prompts. If the same request comes again, Portkey returns the cached response.

Semantic Cache

Matches requests with similar meaning using cosine similarity. Learn more →
Semantic cache is a superset: it handles simple cache hits too.
Semantic cache works with requests under 8,191 tokens and with at most 4 messages.

Set up semantic caching (self-hosted)

To enable semantic caching on a self-hosted Portkey gateway, configure the embedding provider and a vector database.
1. Configure the embedding provider

Set the following environment variables in your gateway environment for generating vector embeddings:
SEMANTIC_CACHE_EMBEDDING_PROVIDER=openai
SEMANTIC_CACHE_EMBEDDINGS_URL=https://api.openai.com/v1/embeddings
SEMANTIC_CACHE_EMBEDDING_MODEL=text-embedding-3-small
SEMANTIC_CACHE_EMBEDDING_API_KEY=<openai-api-key>
SEMANTIC_CACHE_SIMILARITY_THRESHOLD=0.95
SEMANTIC_CACHE_EMBEDDING_DIMENSIONS=1536
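A quick startup check can catch missing settings before the gateway runs. The variable names below come from the list above; the helper itself is a hypothetical sketch, not part of Portkey:

```python
import os

# Required settings from the list above; threshold and dimensions are
# omitted here on the assumption that they have defaults.
REQUIRED = (
    "SEMANTIC_CACHE_EMBEDDING_PROVIDER",
    "SEMANTIC_CACHE_EMBEDDINGS_URL",
    "SEMANTIC_CACHE_EMBEDDING_MODEL",
    "SEMANTIC_CACHE_EMBEDDING_API_KEY",
)

def missing_vars(env=None):
    """Return the names of required settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]
```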
2. Configure the vector database

Set the following environment variables in your gateway environment to connect to your vector store (Milvus or Pinecone):
VECTOR_STORE=milvus # supported values: milvus, pinecone
VECTOR_STORE_ADDRESS=<your-vector-store-address>
VECTOR_STORE_COLLECTION_NAME=<your-collection-name>
VECTOR_STORE_API_KEY=<your-vector-db-api-key>
Milvus

Create a collection whose name matches SEMANTIC_CACHE_EMBEDDING_MODEL (for example, text-embedding-3-small when using that model). The collection must define these fields:

| Field | Type |
|---|---|
| id | Varchar |
| values | FloatVector with dimension 1536 (must match SEMANTIC_CACHE_EMBEDDING_DIMENSIONS) |
| metadata | JSON |

If you change the embedding model or dimension, update the collection schema and SEMANTIC_CACHE_EMBEDDING_DIMENSIONS so the vector field size stays aligned.

Pinecone

  • VECTOR_STORE_COLLECTION_NAME: omit this; it is not used for Pinecone.
  • VECTOR_STORE_ADDRESS: set to your Pinecone index name (not a generic host string).
  • SEMANTIC_CACHE_EMBEDDING_DIMENSIONS: must match the dimension configured on the index (same as your embedding vectors).
  • In the Pinecone console, create or use an index with cosine as the similarity metric so it matches Portkey's semantic cache behavior.
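The model-name and dimension alignment described above can be made explicit in a provisioning script. A plain-data sketch (not the vector-store client API, just the shape the steps above describe):

```python
# Values mirror SEMANTIC_CACHE_EMBEDDING_MODEL and
# SEMANTIC_CACHE_EMBEDDING_DIMENSIONS from step 1.
EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIMENSIONS = 1536

# The Milvus collection described above: the collection name matches the
# embedding model, and the vector field dimension matches the embeddings.
collection = {
    "name": EMBEDDING_MODEL,
    "fields": [
        {"name": "id", "type": "Varchar"},
        {"name": "values", "type": "FloatVector", "dim": EMBEDDING_DIMENSIONS},
        {"name": "metadata", "type": "JSON"},
    ],
}

def dimensions_aligned(coll, expected_dim):
    """Check that the vector field size matches the embedding dimensions."""
    vector = next(f for f in coll["fields"] if f["type"] == "FloatVector")
    return vector["dim"] == expected_dim
```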
3. Enable semantic caching per request

Set the cache mode to semantic in your config object for each LLM request:
{ "cache": { "mode": "semantic" } }
Limitations:
  • Only OpenAI-compatible models are supported for generating embeddings.
  • The LLM model used for generating responses must also be OpenAI-compatible.
  • Each request must include at least one user message in addition to any system messages; requests containing only system messages bypass the cache.
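The message-level constraints (at least one user message, at most 4 messages, under 8,191 tokens) can be pre-checked client-side. A rough sketch; the token count is an input you would compute with your own tokenizer:

```python
def semantic_cache_eligible(messages, token_count, max_tokens=8191, max_messages=4):
    """Rough eligibility check mirroring the documented semantic-cache limits."""
    has_user = any(m.get("role") == "user" for m in messages)
    return has_user and len(messages) <= max_messages and token_count <= max_tokens
```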

Message matching behavior

Semantic cache requires at least two messages. The first message (typically system) is ignored for matching:
[
  { "role": "system", "content": "You are a helpful assistant" },
  { "role": "user", "content": "Who is the president of the US?" }
]
Only the user message is used for matching, so you can change the system message without affecting cache hits.
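The behavior above can be illustrated with a toy key function (purely illustrative; the real matching compares embeddings, not strings):

```python
def match_input(messages):
    """Content considered for matching: everything after the first (system) message."""
    return [m["content"] for m in messages[1:]]

a = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Who is the president of the US?"},
]
b = [
    {"role": "system", "content": "Answer as tersely as possible"},
    {"role": "user", "content": "Who is the president of the US?"},
]
# Different system prompts, identical matching input -> same cache entry.
```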

Cache TTL

Set expiration with max_age (in seconds):
{ "cache": { "mode": "semantic", "max_age": 60 } }
| Setting | Value |
|---|---|
| Minimum | 60 seconds |
| Maximum | 90 days (7,776,000 seconds) |
| Default | 7 days (604,800 seconds) |
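A sketch of how these bounds might compose. Whether out-of-range values are clamped or rejected isn't specified here; clamping is an assumption:

```python
MIN_TTL = 60           # seconds
MAX_TTL = 7_776_000    # 90 days
DEFAULT_TTL = 604_800  # 7 days

def effective_ttl(max_age=None):
    """Resolve the cache TTL from an optional per-request max_age."""
    if max_age is None:
        return DEFAULT_TTL
    # Assumption: out-of-range values are clamped to the documented bounds.
    return max(MIN_TTL, min(max_age, MAX_TTL))
```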

Organization-Level TTL

Admins can set default TTL for all workspaces to align with data retention policies:
  1. Go to Admin Settings → Organization Properties → Cache Settings
  2. Enter default TTL (seconds)
  3. Save
Precedence:
  • No max_age in request → the org default is used
  • Request max_age > org default → the org default wins
  • Request max_age < org default → the request value is honored
Max org-level TTL: 25,923,000 seconds.
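The precedence rules above reduce to taking the smaller of the two values when both are set: the org default acts as a ceiling. A sketch (the function name is hypothetical):

```python
def resolved_ttl(request_max_age, org_default):
    """Org default is a ceiling: requests can shorten the TTL but not extend it."""
    if request_max_age is None:
        return org_default
    return min(request_max_age, org_default)
```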

Force Refresh

Fetch a fresh response even when a cached response exists. This is set per-request (not in Config):
response = portkey.with_options(
    cache_force_refresh=True
).chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    model="@openai-prod/gpt-4o"
)
  • Requires cache config to be passed
  • For semantic hits, refreshes ALL matching entries

Cache Namespace

By default, Portkey partitions the cache by all request headers. Use a custom namespace to partition only by your custom string, which is useful for per-user caching or improving hit ratio:
response = portkey.with_options(
    cache_namespace="user-123"
).chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    model="@openai-prod/gpt-4o"
)

Cache with Configs

Set cache at top-level or per-target:
{
  "cache": { "mode": "semantic", "max_age": 60 },
  "strategy": { "mode": "fallback" },
  "targets": [
    { "override_params": { "model": "@openai-prod/gpt-4o" } },
    { "override_params": { "model": "@anthropic-prod/claude-3-5-sonnet-20241022" } }
  ]
}
Target-level cache takes precedence over top-level.
Targets with override_params need that exact param combination cached before hits occur.
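For example, a config where the first target overrides the top-level cache while the second inherits it might look like this (illustrative values):

```json
{
  "cache": { "mode": "simple", "max_age": 300 },
  "strategy": { "mode": "fallback" },
  "targets": [
    {
      "override_params": { "model": "@openai-prod/gpt-4o" },
      "cache": { "mode": "semantic", "max_age": 60 }
    },
    { "override_params": { "model": "@anthropic-prod/claude-3-5-sonnet-20241022" } }
  ]
}
```

Here the OpenAI target uses semantic caching with a 60-second TTL, while the Anthropic fallback target inherits the top-level simple cache.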

Analytics & Logs

Analytics → Cache tab shows:
  • Cache hit rate
  • Latency savings
  • Cost savings
Logs → Status column shows: Cache Hit, Cache Semantic Hit, Cache Miss, Cache Refreshed, or Cache Disabled. Learn more →
Last modified on April 14, 2026