
Without batching, I was actually thinking that's kind of modest.

ExLlamaV2 gets 48 tokens/s on a 4090, which is a much slower and cheaper card than an H100:

https://github.com/turboderp/exllamav2#performance

I didn't test CodeLlama, but the 3090 Ti figures for other model sizes are in the ballpark of my generation speed on a 3090.

100 tokens/s batched throughput (for each individual user) is much harder.
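
To make the per-user vs. aggregate distinction concrete, here's a back-of-the-envelope sketch in Python. The batch sizes are made-up illustrative values, not measurements, and the function name is just something I picked; the only number taken from above is the 100 tokens/s per-user target.

```python
def aggregate_tokens_per_s(per_user_tokens_per_s: float, concurrent_users: int) -> float:
    """Total tokens/s the server must generate so that every user
    in the batch sees per_user_tokens_per_s."""
    return per_user_tokens_per_s * concurrent_users


# Hypothetical batch sizes, for illustration only.
for users in (1, 8, 32):
    total = aggregate_tokens_per_s(100, users)
    print(f"{users:>2} users at 100 tokens/s each -> {total:.0f} tokens/s aggregate")
```

The required aggregate rate grows linearly with the number of concurrent users, which is why holding 100 tokens/s for each user in a batch is a much taller order than hitting it for a single stream.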


