Without batching, I was actually thinking that's kind of modest.

ExllamaV2 will get 48 tokens/s on a 4090, which is much slower/cheaper than an H100:

I didn't test codellama, but the 3090 TI figures for other sizes are in the ballpark of my generation speed on a 3090.

100 tokens/s batched throughput (for each individual user) is much harder.
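To put rough numbers on that, here's a back-of-the-envelope sketch. It assumes every user in the batch decodes at the same rate, so the aggregate rate is just per-user rate times batch size. The batch sizes are made up for illustration, and `aggregate_tokens_per_s` is a hypothetical helper name, not anything from ExllamaV2.

```python
def aggregate_tokens_per_s(per_user_tokens_per_s: float, concurrent_users: int) -> float:
    """Total decode throughput a server must sustain so that every
    user in the batch sees `per_user_tokens_per_s` (uniform-rate assumption)."""
    return per_user_tokens_per_s * concurrent_users


single_stream = 48.0     # tokens/s, the unbatched 4090 figure quoted above
target_per_user = 100.0  # tokens/s each user should see when batched

for users in (1, 8, 32):  # hypothetical batch sizes, for illustration only
    total = aggregate_tokens_per_s(target_per_user, users)
    ratio = total / single_stream
    print(f"{users:>2} users: {total:.0f} tokens/s aggregate "
          f"({ratio:.1f}x the single-stream figure)")
```

This is only the arithmetic for the target. It says nothing about what a given GPU can actually deliver at each batch size.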