
Huh? How are you proposing loading a 6TB CSV into memory multiple times? And then processing it with awk, which generally streams one line at a time?

Obviously we can get boxes with multiple terabytes of RAM for $50-200/hr on-demand, but nobody is doing that and then also using awk. They're loading the data into ClickHouse or DuckDB (at which point the RAM requirement is probably 64-128GB).
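E.g. with DuckDB you don't even need to "load" anything first; it scans the CSV in place. A rough sketch (the file path and column names here are made up):

    # hypothetical path/columns; read_csv_auto scans the file rather than pulling it all into RAM
    duckdb -c "SELECT user_id, count(*) AS n
               FROM read_csv_auto('/data/big.csv')
               GROUP BY user_id
               ORDER BY n DESC
               LIMIT 10"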

I feel like this is an anecdotal story that has mixed up sizes and tools for dramatic effect.



Awk doesn't load things into memory; it processes one line at a time, so memory usage is basically zero. That said, awk isn't that fast. You're looking at "query" times of at least 30 minutes, if not more.
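Something like this sums a column in constant memory (hypothetical file and column, assuming a plain comma-separated file with no quoted fields):

    # reads one line, updates the running total, discards the line
    awk -F',' '{ sum += $3 } END { printf "%.2f\n", sum }' /data/big.csv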

Awk is, imo, a poor solution. I use awk all the time and I would never use it for something like this. Why not just use Postgres? It's a lot more capable, easy to set up, and you get SQL, which is extremely powerful. Normally I might even go with SQLite, but for me 6TB is too much for SQLite.
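Roughly this, with a made-up schema and path (\copy streams the CSV through the client row by row, so you never hold the whole file in memory):

    psql -c "CREATE TABLE events (user_id bigint, ts timestamptz, amount numeric)"
    psql -c "\copy events FROM '/data/big.csv' WITH (FORMAT csv, HEADER true)"
    psql -c "SELECT user_id, sum(amount) FROM events GROUP BY user_id ORDER BY 2 DESC LIMIT 10"

Then you can index, re-query, and join without re-reading 6TB every time.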


> How are you proposing loading a 6TB CSV into memory multiple times? And then processing it with awk, which generally streams one line at a time?

A ramdisk would work.
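On a box with more RAM than the file, something like this (sizes and paths are made up):

    # tmpfs keeps the copied file entirely in RAM; awk then streams it from there
    sudo mkdir -p /mnt/ramdisk
    sudo mount -t tmpfs -o size=6500G tmpfs /mnt/ramdisk
    cp /data/big.csv /mnt/ramdisk/
    awk -F',' '{ sum += $3 } END { print sum }' /mnt/ramdisk/big.csv

You're paying for the RAM either way, though, which is the parent's point.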



