Simple caching is available for all plans.
Semantic caching requires a vector database and is only available on select Enterprise plans. Contact us to learn more about enabling this feature.
Cache LLM responses to serve requests up to 20x faster and cheaper.
| Mode | How it Works | Best For | Supported Routes |
|---|---|---|---|
| Simple | Exact match on input | Repeated identical prompts | All models, including image generation |
| Semantic | Matches semantically similar requests | Variations in phrasing | /chat/completions, /completions |

Enable Cache

Add cache to your config object:
{ "cache": { "mode": "simple" } }
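The same config object can also be serialized and attached per request; a minimal sketch, assuming the config is passed as a JSON string (for example via the x-portkey-config request header):

```python
import json

# Cache config from the snippet above; the serialized form can be attached
# to an individual request (e.g. as the value of the x-portkey-config header).
config = {"cache": {"mode": "simple"}}
header_value = json.dumps(config)
```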
Caching won't work if the x-portkey-debug: "false" header is included.

Simple Cache

Exact match on input prompts. If the same request comes again, Portkey returns the cached response.

Semantic Cache

Matches requests with similar meaning using cosine similarity. Learn more →
Semantic cache is a superset: it handles simple cache hits too.
Semantic cache works with requests under 8,191 tokens and with at most 4 messages.

Set up semantic caching (self-hosted)

To enable semantic caching on a self-hosted Portkey gateway, configure the embedding provider and a vector database.
1. Configure the embedding provider

Set the following environment variables in your gateway environment for generating vector embeddings:
SEMANTIC_CACHE_EMBEDDING_PROVIDER=openai
SEMANTIC_CACHE_EMBEDDINGS_URL=https://api.openai.com/v1/embeddings
SEMANTIC_CACHE_EMBEDDING_MODEL=text-embedding-3-small
SEMANTIC_CACHE_EMBEDDING_API_KEY=<openai-api-key>
SEMANTIC_CACHE_SIMILARITY_THRESHOLD=0.95
SEMANTIC_CACHE_EMBEDDING_DIMENSIONS=1536
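A quick startup check can catch missing settings before the gateway runs. The variable names below come from the list above; the helper itself is a hypothetical sketch, not part of Portkey:

```python
import os

# Required settings from the list above; threshold and dimensions are
# omitted here on the assumption that they have defaults.
REQUIRED = (
    "SEMANTIC_CACHE_EMBEDDING_PROVIDER",
    "SEMANTIC_CACHE_EMBEDDINGS_URL",
    "SEMANTIC_CACHE_EMBEDDING_MODEL",
    "SEMANTIC_CACHE_EMBEDDING_API_KEY",
)

def missing_vars(env=None):
    """Return the names of required settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]
```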
2. Configure the vector database

Set the following environment variables in your gateway environment to connect to your vector store (Milvus or Pinecone):
VECTOR_STORE=milvus # supported values: milvus, pinecone
VECTOR_STORE_ADDRESS=<your-vector-store-address>
VECTOR_STORE_COLLECTION_NAME=<your-collection-name>
VECTOR_STORE_API_KEY=<your-vector-db-api-key>
Milvus

Create a collection whose name matches SEMANTIC_CACHE_EMBEDDING_MODEL (for example, text-embedding-3-small when using that model). The collection must define these fields:

| Field | Type |
|---|---|
| id | Varchar |
| values | FloatVector with dimension 1536 (must match SEMANTIC_CACHE_EMBEDDING_DIMENSIONS) |
| metadata | JSON |

If you change the embedding model or dimension, update the collection schema and SEMANTIC_CACHE_EMBEDDING_DIMENSIONS so the vector field size stays aligned.

Pinecone

  • VECTOR_STORE_COLLECTION_NAME: omit this; it is not used for Pinecone.
  • VECTOR_STORE_ADDRESS: set to your Pinecone index name (not a generic host string).
  • SEMANTIC_CACHE_EMBEDDING_DIMENSIONS: must match the dimension configured on the index (same as your embedding vectors).
  • In the Pinecone console, create or use an index with cosine as the similarity metric so it matches Portkey's semantic cache behavior.
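The model-name and dimension alignment described above can be made explicit in a provisioning script. A plain-data sketch (not the vector-store client API, just the shape the steps above describe):

```python
# Values mirror SEMANTIC_CACHE_EMBEDDING_MODEL and
# SEMANTIC_CACHE_EMBEDDING_DIMENSIONS from step 1.
EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIMENSIONS = 1536

# The Milvus collection described above: the collection name matches the
# embedding model, and the vector field dimension matches the embeddings.
collection = {
    "name": EMBEDDING_MODEL,
    "fields": [
        {"name": "id", "type": "Varchar"},
        {"name": "values", "type": "FloatVector", "dim": EMBEDDING_DIMENSIONS},
        {"name": "metadata", "type": "JSON"},
    ],
}

def dimensions_aligned(coll, expected_dim):
    """Check that the vector field size matches the embedding dimensions."""
    vector = next(f for f in coll["fields"] if f["type"] == "FloatVector")
    return vector["dim"] == expected_dim
```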
3. Enable semantic caching per request

Set the cache mode to semantic in your config object for each LLM request:
{ "cache": { "mode": "semantic" } }
Limitations:
  • Only OpenAI-compatible models are supported for generating embeddings.
  • The LLM model used for generating responses must also be OpenAI-compatible.
  • Each request must include at least one user message in addition to any system messages; requests containing only system messages bypass the cache.
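The message-level constraints (at least one user message, at most 4 messages, under 8,191 tokens) can be pre-checked client-side. A rough sketch; the token count is an input you would compute with your own tokenizer:

```python
def semantic_cache_eligible(messages, token_count, max_tokens=8191, max_messages=4):
    """Rough eligibility check mirroring the documented semantic-cache limits."""
    has_user = any(m.get("role") == "user" for m in messages)
    return has_user and len(messages) <= max_messages and token_count <= max_tokens
```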

Message matching behavior

Semantic cache requires at least two messages. The first message (typically system) is ignored for matching:
[
  { "role": "system", "content": "You are a helpful assistant" },
  { "role": "user", "content": "Who is the president of the US?" }
]
Only the user message is used for matching, so you can change the system message without affecting cache hits.
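The behavior above can be illustrated with a toy key function (purely illustrative; the real matching compares embeddings, not strings):

```python
def match_input(messages):
    """Content considered for matching: everything after the first (system) message."""
    return [m["content"] for m in messages[1:]]

a = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Who is the president of the US?"},
]
b = [
    {"role": "system", "content": "Answer as tersely as possible"},
    {"role": "user", "content": "Who is the president of the US?"},
]
# Different system prompts, identical matching input -> same cache entry.
```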

Cache TTL

Set expiration with max_age (in seconds):
{ "cache": { "mode": "semantic", "max_age": 60 } }
| Setting | Value |
|---|---|
| Minimum | 60 seconds |
| Maximum | 90 days (7,776,000 seconds) |
| Default | 7 days (604,800 seconds) |
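A sketch of how these bounds might compose. Whether out-of-range values are clamped or rejected isn't specified here; clamping is an assumption:

```python
MIN_TTL = 60           # seconds
MAX_TTL = 7_776_000    # 90 days
DEFAULT_TTL = 604_800  # 7 days

def effective_ttl(max_age=None):
    """Resolve the cache TTL from an optional per-request max_age."""
    if max_age is None:
        return DEFAULT_TTL
    # Assumption: out-of-range values are clamped to the documented bounds.
    return max(MIN_TTL, min(max_age, MAX_TTL))
```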

Organization-Level TTL

Admins can set default TTL for all workspaces to align with data retention policies:
  1. Go to Admin Settings → Organization Properties → Cache Settings
  2. Enter default TTL (seconds)
  3. Save
Precedence:
  • No max_age in request → the org default is used
  • Request max_age > org default → the org default wins
  • Request max_age < org default → the request value is honored
Max org-level TTL: 25,923,000 seconds.
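The precedence rules above reduce to taking the smaller of the two values when both are set: the org default acts as a ceiling. A sketch (the function name is hypothetical):

```python
def resolved_ttl(request_max_age, org_default):
    """Org default is a ceiling: requests can shorten the TTL but not extend it."""
    if request_max_age is None:
        return org_default
    return min(request_max_age, org_default)
```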

Force Refresh

Fetch a fresh response even when a cached response exists. This is set per-request (not in Config):
response = portkey.with_options(
    cache_force_refresh=True
).chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    model="@openai-prod/gpt-4o"
)
  • Requires cache config to be passed
  • For semantic hits, refreshes ALL matching entries

Cache Namespace

By default, Portkey partitions the cache by all request headers. Use a custom namespace to partition only by your custom string, which is useful for per-user caching or improving hit ratio:
response = portkey.with_options(
    cache_namespace="user-123"
).chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    model="@openai-prod/gpt-4o"
)

Cache with Configs

Set cache at top-level or per-target:
{
  "cache": { "mode": "semantic", "max_age": 60 },
  "strategy": { "mode": "fallback" },
  "targets": [
    { "override_params": { "model": "@openai-prod/gpt-4o" } },
    { "override_params": { "model": "@anthropic-prod/claude-3-5-sonnet-20241022" } }
  ]
}
Target-level cache takes precedence over top-level.
Targets with override_params need that exact param combination cached before hits occur.
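For example, a config where the first target overrides the top-level cache while the second inherits it might look like this (illustrative values):

```json
{
  "cache": { "mode": "simple", "max_age": 300 },
  "strategy": { "mode": "fallback" },
  "targets": [
    {
      "override_params": { "model": "@openai-prod/gpt-4o" },
      "cache": { "mode": "semantic", "max_age": 60 }
    },
    { "override_params": { "model": "@anthropic-prod/claude-3-5-sonnet-20241022" } }
  ]
}
```

Here the OpenAI target uses semantic caching with a 60-second TTL, while the Anthropic fallback target inherits the top-level simple cache.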

Analytics & Logs

Analytics → Cache tab shows:
  • Cache hit rate
  • Latency savings
  • Cost savings
Logs → Status column shows: Cache Hit, Cache Semantic Hit, Cache Miss, Cache Refreshed, or Cache Disabled. Learn more →
Last modified on April 14, 2026