Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.

My guess is they used Chinchilla scaling rules and the parameter count for GPT-4 is either barely larger or maybe even smaller than GPT-3. Look as what Meta was able to accomplish with llama using much less parameters.



The larger context length makes me think they have a more memory-efficient attention mechanism.




Consider applying for YC's Summer 2026 batch! Applications are open till May 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: