Introduction

LLM API calls are expensive, both in cost and latency. Caching previously generated responses can reduce costs by 20-80% depending on the application. Unlike traditional HTTP caching where exact URL matching suffices, LLM caching must handle semantically equivalent but textually different queries. This article covers caching strategies from simple exact match to sophisticated semantic caching.

LLM Caching: Semantic Cache, Exact Match, TTL, Invalidation Strategies

Exact Match Cache

The simplest cache: identical inputs produce identical outputs:

import hashlib

import json

from functools import lru_cache

class ExactMatchCache:

def init(self, ttl_seconds: int = 3600):

self.cache = {}

self.ttl = ttl_seconds

def _make_key(self, messages: list[dict], model: str, params: dict) -> str:

canonical = json.dumps({

"messages": messages,

"model": model,

"temperature": params.get("temperature", 0),

"max_tokens": params.get("max_tokens"),

}, sort_keys=True)

return hashlib.sha256(canonical.encode()).hexdigest()

def get(self, messages: list[dict], model: str, params: dict) -> str | None:

key = self._make_key(messages, model, params)

if key in self.cache:

entry = self.cache[key]

if time.time() - entry["timestamp"] < self.ttl:

return entry["response"]

else:

del self.cache[key]

return None

def set(self, messages: list[dict], model: str, params: dict, response: str):

key = self._make_key(messages, model, params)

self.cache[key] = {"response": response, "timestamp": time.time()}

Exact match works well when identical questions recur: FAQs, repeated classification tasks, or template-based prompts where only parameters change.

Semantic Cache

Semantic caching returns cached responses for semantically equivalent questions:

import numpy as np

from sklearn.metrics.pairwise import cosine_similarity

class SemanticCache:

def init(self, embedding_fn, similarity_threshold: float = 0.92):

self.embedding = embedding_fn

self.threshold = similarity_threshold

self.cache_entries: list[dict] = []

def get(self, query: str) -> str | None:

query_emb = self.embedding(query)

for entry in self.cache_entries:

similarity = cosine_similarity([query_emb], [entry["embedding"]])[0][0]

if similarity >= self.threshold:

entry["access_count"] += 1

entry["last_accessed"] = time.time()

return entry["response"]

return None

def set(self, query: str, response: str):

entry = {

"query": query,

"embedding": self.embedding(query),

"response": response,

"created_at": time.time(),

"access_count": 1,

"last_accessed": time.time(),

}

self.cache_entries.append(entry)

def evict_lru(self, max_entries: int = 10000):

if len(self.cache_entries) > max_entries:

self.cache_entries.sort(key=lambda e: e["last_accessed"])

self.cache_entries = self.cache_entries[-max_entries:]

The similarity threshold controls the precision-recall tradeoff. A threshold of 0.95 is safe but rarely matches. A threshold of 0.85 catches more queries but risks returning irrelevant cached responses. Test on your specific query distribution.

Two-Level Cache

Combine both strategies for maximum coverage:

class TwoLevelLLMCache:

def init(self, embedding_fn):

self.exact = ExactMatchCache(ttl_seconds=7200)

self.semantic = SemanticCache(embedding_fn, similarity_threshold=0.92)

def get(self, messages: list[dict], model: str, params: dict) -> str | None:

Try exact match first (fast, no embedding computation)

exact_result = self.exact.get(messages, model, params)

if exact_result:

return exact_result

Try semantic match for the last user message

last_user_msg = self._get_last_user_message(messages)

if last_user_msg:

semantic_result = self.semantic.get(last_user_msg)

if semantic_result:

return semantic_result

return None

def set(self, messages: list[dict], model: str, params: dict, response: str):

self.exact.set(messages, model, params, response)

last_user_msg = self._get_last_user_message(messages)

if last_user_msg:

self.semantic.set(last_user_msg, response)

TTL and Invalidation

Different cache entries need different expiration policies:

class TTLManager:

def init(self):

Different TTLs for different content types

self.ttl_config = {

"classification": 86400 * 30, # 30 days (stable task)

"extraction": 86400 * 7, # 7 days

"summarization": 3600, # 1 hour (content changes)

"generation": 300, # 5 minutes (creative, no cache)

"factual_qa": 86400, # 1 day (may become stale)

}

def get_ttl(self, task_type: str) -> int:

return self.ttl_config.get(task_type, 3600)

def invalidate_by_prefix(self, prefix: str):

"""Invalidate cache entries matching a prefix pattern."""

Used when source documents are updated

pass

def invalidate_all(self):

"""Emergency invalidation."""

pass

Cache-Aware Application Design

Structure your application to maximize cache hits:

class CacheAwareLLMClient:

def init(self, cache: TwoLevelLLMCache, llm_fn):

self.cache = cache

self.llm = llm_fn

async def generate(self, task_type: str, messages: list[dict], model: str, params: dict) -> str:

Check cache

cached = self.cache.get(messages, model, params)

if cached:

return cached

Generate fresh response

response = await self.llm(messages, model, params)

Cache with appropriate TTL

self.cache.set(messages, model, params, response)

return response

Cache Hit Rate Optimization

class CacheAnalytics:

def track_hit_rate(self):

"""Monitor and report cache effectiveness."""

metrics = {

"exact_hits": 0,

"semantic_hits": 0,

"misses": 0,

}

Adjust semantic threshold based on hit quality

if metrics["semantic_hits"] > 0 and user_feedback_score < 0.8:

self.semantic.threshold += 0.02 # Be more conservative

Conclusion

LLM caching significantly reduces costs and latency. Use exact-match caching for deterministic tasks with identical inputs. Add semantic caching with embedding similarity for paraphrased questions. Implement two-level caching with exact match first (fast) then semantic match (broader). Configure TTLs based on content stability: long TTLs for classification tasks, short TTLs for generation. Monitor cache hit rates and user satisfaction to tune similarity thresholds. A well-tuned cache typically achieves 30-60% hit rates in production.