
For reference (according to Google):

> The English Wikipedia, as of June 26, 2025, contains over 7 million articles and 63 million pages. The text content alone is approximately 156 GB, according to Wikipedia's statistics page. When including all revisions, the total size of the database is roughly 26 terabytes (26,455 GB)



A better point of reference might be pages-articles-multistream.xml.bz2 (current pages only; no edit/revision history, talk pages, or user pages), which is 20GB. A rough way to poke at it is sketched after the link.

https://en.wikipedia.org/wiki/Wikipedia:Database_download#Wh...?
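For anyone curious, a minimal Python sketch for streaming that dump without decompressing it to disk. The local file name is an assumption; point it at wherever you saved the download:

    import bz2

    # Assumed local path to the multistream dump mentioned above.
    DUMP_PATH = "enwiki-latest-pages-articles-multistream.xml.bz2"

    # bz2.open handles the concatenated streams; read line by line so the
    # 20GB+ of decompressed XML never has to fit in memory at once.
    pages = 0
    with bz2.open(DUMP_PATH, "rt", encoding="utf-8") as f:
        for line in f:
            if "<page>" in line:
                pages += 1
    print(f"pages seen: {pages}")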


This is a much more deserving and reliable candidate for any label regarding the breadth of human knowledge.


It barely scratches the surface.


Regarding depth, not breadth, certainly.


Wikipedia itself describes its size as ~25GB without media [0]. And it's probably more accurate, with broader coverage across multiple languages, than the LLM downloaded by the GP.

https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia


Really? I'd assume that an LLM would deduplicate Wikipedia into something much smaller than 25GB. That's its only job.


> That's its only job.

The vast, vast majority of LLM knowledge is not found in Wikipedia. It is definitely not its only job.


When trained on next-word prediction with the standard loss function, that is by definition its only job.
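For the curious, a toy sketch of what "the standard loss function" means here: cross-entropy on the next token. This assumes PyTorch; the shapes and tiny vocabulary are made up for illustration.

    import torch
    import torch.nn.functional as F

    vocab_size, seq_len = 1000, 8
    logits = torch.randn(1, seq_len, vocab_size)             # model output per position
    tokens = torch.randint(0, vocab_size, (1, seq_len + 1))  # a sequence of token ids

    # Predict token t+1 from position t: shift the targets left by one.
    targets = tokens[:, 1:]                                   # (1, seq_len)
    loss = F.cross_entropy(
        logits.reshape(-1, vocab_size),                       # (seq_len, vocab_size)
        targets.reshape(-1),                                  # (seq_len,)
    )
    print(loss.item())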


What happens if you ask this 8gb model "Compose a realistic Wikipedia-style page on the Pokemon named Charizard"?

How close does it come?




