
The dimensions should actually be closer to 12000 * (number of tokens * number of layers / x)

(where x is a number that depends on architectural features like MLA, GQA...)

There is this thing called the KV cache, which holds an enormous latent state.
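
To put a rough number on that, here is a minimal sketch of the estimate 12000 * (tokens * layers / x). The function name, the context length, the layer count, and the value of x below are all made up for illustration; only the 12000 and the shape of the formula come from the comment. Treat x as a fudge factor that absorbs whatever the attention variant saves.

```python
def latent_state_size(width: int, n_tokens: int, n_layers: int, x: float) -> float:
    """Back-of-envelope count of values held in the KV cache.

    width    -- per-token hidden width (12000 in the comment above)
    n_tokens -- tokens currently in context (assumed value below)
    n_layers -- transformer layers (assumed value below)
    x        -- reduction factor from the attention variant (assumed value below)
    """
    return width * (n_tokens * n_layers / x)


# Hypothetical numbers, not taken from any specific model.
size = latent_state_size(width=12000, n_tokens=1000, n_layers=80, x=8)
print(f"{size:,.0f} values")
```

Even with a generous x, the state grows linearly with context length and depth, so it ends up several orders of magnitude larger than the single 12000-wide vector.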


