
It doesn’t make sense to pay more for cache warming. Your session is, for the most part, already persisted. Why would it be reasonable to pay again to continue where you left off at any point in the future?


Because it significantly increases actual costs for Anthropic.

If they ignored this, then all the users who don’t do this as much would have to subsidize the people who do.


I’m coming at this as a complete Claude amateur, but caching for any other service is an optimisation for the company and transparent for the user. I don’t think I’ve ever used a service and thought “oh there’s a cache miss. Gotta be careful”.

I completely agree that it’s infeasible for them to cache for long periods of time, but they need to surface that information in the tools so that we can make informed decisions.


That is because LLM KV caching is not like the caches you’re used to (see my other comments: it’s tens of GB per request, it’s internal LLM state that must live on, or be moved onto, a GPU, and much of the cost is in moving all that data around). It cannot be made transparent to the user because the bandwidth costs are too large a fraction of Anthropic’s unit economics to absorb, so they have to be surfaced to the user in pricing and usage limits. The alternative is a situation where users whose clients use the cache efficiently end up dramatically subsidizing users whose clients don’t, and I don’t think that’s a good solution at all. I’d much rather this be surfaced to users, as it is with all commercial LLM APIs.
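
To put rough numbers on "tens of GB": KV cache size scales with layer count, KV heads, head dimension, and context length. Here's a back-of-envelope sketch where every architecture number is an illustrative guess, since Claude's internals aren't public:

  # Rough KV-cache size for a transformer decoder. All hyperparameters
  # below are illustrative guesses, NOT Claude's (non-public) architecture.
  n_layers = 80          # assumed decoder layers
  n_kv_heads = 8         # assumed KV heads (grouped-query attention)
  head_dim = 128         # assumed dimension per head
  bytes_per_value = 2    # bf16
  context_tokens = 200_000

  # Each token stores one key and one value vector per layer.
  bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
  print(f"{context_tokens * bytes_per_token / 1e9:.1f} GB")  # ~65.5 GB

Even with much more conservative assumptions you land in the tens of GB per long-context request.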


Think of it like this: Anthropic has to keep the equivalent of a full virtual machine running just for you. How long should it sit there idle, consuming resources, when you only pay a flat monthly fee rather than by the hour?

They have a limited number of resources and can’t keep everyone’s VM running forever.


I pay $5/mo to Vultr for a VM that runs continuously and maintains 25GB of state.


That price at Vultr gets you 1GB of RAM, and 25GB of relatively slow SSD.

The KV cache of your Claude context is:

- Potentially much larger than 25GB. (The KV cache sizes you see people quoting for local models are for smaller models.)

- While it's being used, it's all in RAM.

- Actually, it's held in special high-performance GPU RAM, precision-bonded directly to the silicon of ludicrously expensive, state-of-the-art GPUs.

- The KV state memory has to be many thousands of times faster than the storage holding your 25GB of state.

- It's much more expensive per GB than the CPU memory used by a VM. And that in turn is much more expensive than the SSD storage of your 25GB.

- Because Claude is used by far more people (and their agents) than rent VMs, far more people are competing to use that expensive memory at the same time.

There is a lot of machinery for moving KV cache state between GPU memory and cheaper dedicated storage on demand, as different users need different state. But the KV cache data is so large, and is used in its entirety whenever the context is active, that moving it around is expensive too.
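
To get a feel for the movement cost, here's a toy estimate; the cache size and the bandwidth figures are ballpark assumptions, not Anthropic's real numbers:

  # How long does it take just to move a large KV cache around?
  cache_gb = 40            # assumed cache size for one long context
  pcie_gbps = 64           # ~PCIe 5.0 x16, one direction, GB/s
  nvme_gbps = 7            # ~fast NVMe SSD, GB/s

  print(f"CPU RAM -> GPU: {cache_gb / pcie_gbps:.2f} s")  # ~0.6 s
  print(f"SSD -> CPU RAM: {cache_gb / nvme_gbps:.1f} s")  # ~5.7 s

That's half a second to several seconds of pure data movement per cache restore, multiplied across every user whose context gets swapped out.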


Now check out the cost difference in 25GB of computer RAM vs GPU RAM.

And yes, this is also why the cost of computer RAM has gone through the roof.

The total data transferred per hour by your server and by an LLM workload aren't even within five orders of magnitude of each other. And this is why the compute and power markets are totally screwed.
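
A rough sanity check on that, with purely illustrative figures:

  import math

  # Hourly data movement: a light VM workload vs. one GPU's HBM
  # running flat out. Both numbers are illustrative assumptions.
  vm_gb_per_hour = 10.0
  hbm_tb_per_s = 3.0       # ballpark for a modern datacenter GPU
  gpu_gb_per_hour = hbm_tb_per_s * 1000 * 3600

  ratio = gpu_gb_per_hour / vm_gb_per_hour
  print(f"~10^{math.log10(ratio):.0f}x")   # roughly 10^6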


It does not. It just has a fast way to give you the illusion it "runs continuously" with 25GB of warm memory.

Tbh, I'm not sure paged VRAM could solve this problem for a system with (presumably) huge cache-miss rates, such as a major LLM server.


Genuine question: is the cost to keep a persistent warmed cache for sessions idling for hours/days not significant when done for hundreds of thousands of users? Wouldn’t it pose a resource constraint on Anthropic at some point?


Related question: is it at all feasible to store the cache locally, to offload memory costs, and then send it over the wire when needed?


No; even at typical context sizes the cache is several GB, so shipping it back and forth isn't practical. It depends on the model architecture, but if you take Gemma 4 31B at a 256K context length, the cache comes to 11.6GB.

note: I picked the values from a blog and they may be inaccurate, but in pretty much all models the KV cache is very large, and it's probably even larger in Claude.
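
Since the KV cache scales linearly with context length, you can extrapolate from that (possibly inaccurate) 11.6GB-at-256K figure; the 1M-token row is hypothetical:

  # Linear scaling from the quoted figure: 11.6 GB at 256K tokens.
  bytes_per_token = 11.6e9 / 256_000      # ~45 KB per token

  for ctx in (32_000, 128_000, 256_000, 1_000_000):
      print(f"{ctx:>9,} tokens -> {ctx * bytes_per_token / 1e9:5.1f} GB")
  # A 1M-token context would need ~45 GB of cache for this architecture.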


To extend your point: it's not really the cost of storing a cache of that size that's the issue (a few GB of server-side SSD storage isn't expensive); it's that all of that data must be moved quickly onto a GPU, in a system whose main constraint is precisely GPU memory bandwidth. That is ultimately the main cost of the cache. If the only cost were keeping a few tens of GB sitting around on their servers, Anthropic wouldn't need to charge nearly as much as they do for it.


The cost you're talking about doesn't change based on how long the session is idle. No matter what happens, they're storing that state and bringing it back at some point; the only difference is how long it's stored outside the GPU between requests.


Are you sure about that? They charge $6.25/MTok for 5m TTL cache writes and $10/MTok for 1hr TTL writes for Opus. Unless you believe Anthropic is dramatically inflating the price of the 1hr TTL, that implies there is some meaningful cost to longer-lived caches, and the numbers are such that it's not just the cost of SSD storage or something. Obviously the details are secret, but if I were to guess, I'd say the 5m cache is stored closer to the GPU or even on a GPU, whereas the 1hr cache is further away and costs more to move onto the GPU. Or some other plausible story - you can invent your own!
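
As a worked example of what those rates mean per request (the context size is just an assumption):

  # Cache-write cost at the quoted Opus rates, per million tokens.
  write_5m = 6.25            # $ / MTok, 5-minute TTL
  write_1h = 10.00           # $ / MTok, 1-hour TTL
  context_tokens = 150_000   # assumed long agent context

  print(f"5m TTL: ${context_tokens / 1e6 * write_5m:.2f}")  # $0.94
  print(f"1h TTL: ${context_tokens / 1e6 * write_1h:.2f}")  # $1.50

The 60% premium for twelve times the lifetime is the price signal for whatever it costs them to keep that state somewhere retrievable.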


Storing on GPU would be the absolute dumbest thing they could do. Locking up the GPU memory for a full hour while waiting for someone else to make a request would result in essentially no GPU memory being available pretty rapidly. This type of caching is available from the cloud providers as well, and it isn't tied to a single session or GPU.


> Storing on GPU would be the absolute dumbest thing they could do

No. It’s not dumb. There will be multiple cache tiers in use, with the fastest and most expensive being on-GPU VRAM with cache-aware routing to specific GPUs, and then progressive eviction to CPU RAM and perhaps SSD after that. That is how vLLM works, as you can see if you look it up, and you can find plenty of information on the multi-tier approach from inference providers, e.g. the new Inference Engineering book by Philip Kiely.

You are likely correct that the 1hr cached data probably mostly doesn’t live on GPU (although it will depend on capacity, they will keep it there as long as they can and then evict with an LRU policy). But I already said that in my last post.
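
For intuition, here's a toy sketch of the multi-tier idea; it illustrates the general promote/demote pattern, not vLLM's actual implementation:

  from collections import OrderedDict

  # Toy tiered KV cache: hot entries in "VRAM", overflow demoted to
  # "CPU RAM", then "SSD". NOT vLLM's real code; just the shape of it.
  class TieredCache:
      def __init__(self, capacities):
          # One LRU store per tier, fastest first.
          self.tiers = [(cap, OrderedDict()) for cap in capacities]

      def put(self, key, value, tier=0):
          cap, store = self.tiers[tier]
          store[key] = value
          store.move_to_end(key)                      # mark most recently used
          if len(store) > cap:
              victim, v = store.popitem(last=False)   # evict LRU entry
              if tier + 1 < len(self.tiers):
                  self.put(victim, v, tier + 1)       # demote one tier down
              # else: dropped entirely; a future hit becomes a re-prefill

      def get(self, key):
          for tier, (_, store) in enumerate(self.tiers):
              if key in store:
                  value = store.pop(key)
                  self.put(key, value)                # promote to fastest tier
                  return tier, value
          return None, None                           # full miss

  cache = TieredCache(capacities=[2, 4, 8])  # "VRAM", "RAM", "SSD" slots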


Yesterday I was playing around with Gemma4 26B A4B with a 3-bit quant, sizing it for my 16GB 9070XT:

  Total VRAM: 16GB
  Model: ~12GB
  128k context size: ~3.9GB
At least I'm pretty sure I landed on 128k... might have been 64k. Regardless, you can see the massive weight (ha) of the meager context size (at least compared to frontier models).
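
Plugging those reported numbers back in, taking them at face value:

  # VRAM budget from the figures quoted above.
  total_vram_gb = 16.0
  model_gb = 12.0       # ~3-bit quant, as reported
  kv_cache_gb = 3.9     # reported for the 128K (maybe 64K) context

  print(f"headroom: {total_vram_gb - model_gb - kv_cache_gb:.1f} GB")
  # ~0.1 GB left over: the context cache alone eats a quarter of the card.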


Sure, it wouldn’t make sense if they only had one customer to serve :)


Exactly, even in the throes of today's wacky economic tides, storage is still cheap. Write the model state to disk immediately after the N cached context messages, and reload it without extra inference on the context tokens themselves. If every customer did this for ~3 conversations, you would still only need a small fraction of a typical datacenter to house the necessary drives. The bottleneck becomes architecture/topology and the speed of your buses, which are problems that have been contended with for decades now, not inference time on GPUs.
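
Back-of-envelope on the storage side of that, with the user count and the per-conversation state size both assumed for illustration:

  # Persist-everything storage estimate (all inputs assumed).
  users = 300_000
  convos_per_user = 3
  state_gb_per_convo = 20    # assumed persisted KV state per conversation

  total_pb = users * convos_per_user * state_gb_per_convo / 1e6
  print(f"{total_pb:.0f} PB")   # 18 PB: on the order of a thousand
  # 20TB drives, i.e. a small fraction of a datacenter, as claimed.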


This has nothing to do with the cost of storage. Surprisingly, you are not better informed than Anthropic on the subject of serving AI inference models.

A sibling comment explains:

https://news.ycombinator.com/item?id=47886200


They don't cache model state to disk. I am proposing they do.

I’m proposing that you should educate yourself on the subject of LLM KV context caching.

It may be persisted but it is not live in the inference engine.



