ExllamaV2 will get 48 tokens/s on a 4090, which is a much slower (and much cheaper) card than an H100:
https://github.com/turboderp/exllamav2#performance
I didn't test codellama, but the 3090 Ti figures for the other model sizes are in the ballpark of my generation speed on a 3090.
Getting 100 tokens/s for each individual user under batched serving (rather than 100 tokens/s in aggregate) is much harder.
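To put rough numbers on the per-user vs. aggregate distinction, here is a back-of-the-envelope sketch. It assumes the decode throughput is split evenly across concurrent users and ignores prompt processing; the function names and the user count are made up for illustration and are not from the ExllamaV2 benchmarks.

```python
def required_aggregate_tokens_per_s(per_user_tokens_per_s: float, concurrent_users: int) -> float:
    """Aggregate decode rate the server must sustain so that every
    concurrent user sees the target per-user rate (assumes an even split)."""
    return per_user_tokens_per_s * concurrent_users


def per_user_tokens_per_s(aggregate_tokens_per_s: float, concurrent_users: int) -> float:
    """Per-user rate when an aggregate decode rate is shared evenly."""
    return aggregate_tokens_per_s / concurrent_users


if __name__ == "__main__":
    users = 8  # hypothetical number of concurrent users
    print(required_aggregate_tokens_per_s(100, users))  # aggregate rate needed
    print(per_user_tokens_per_s(100, users))  # per-user share of a 100 tokens/s aggregate
```

So a server that only manages 100 tokens/s in aggregate gives each user a small fraction of that once requests are batched, which is why the per-user target is the hard one.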