Huh? How are you proposing loading a 6TB CSV into memory multiple times? And then processing it with awk, which streams one line at a time?
Obviously we can get boxes with multiple terabytes of RAM for $50-200/hr on-demand, but nobody is doing that and then also using awk. They're loading the data into ClickHouse or DuckDB (at which point the RAM requirement is probably 64-128GB).
I feel like this is an anecdotal story that has mixed up sizes and tools for dramatic effect.
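To be concrete about the DuckDB route, this is roughly what I mean (file and column names here are made up for illustration): DuckDB scans the CSV in place and spills to disk when it has to, so RAM stays well below the data size.

    # rough sketch, hypothetical file/column names; DuckDB scans the CSV
    # in a streaming fashion and spills to disk, so RAM needs stay modest
    echo "SELECT region, sum(amount) FROM read_csv_auto('data.csv') GROUP BY region;" | duckdb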
Awk doesn't load things into memory. It processes one line at a time, so memory usage is basically zero. That said, awk isn't that fast. I mean you're looking at "query" times of at least 30 minutes, if not more.
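For the record, this is what a typical awk pass over a big CSV looks like (the filename and column number are just placeholders): it reads the file once, line by line, so memory stays flat no matter how large the file is.

    # rough sketch, hypothetical file and column; awk reads line by line,
    # so memory stays flat even for a multi-TB file (it just takes a while)
    awk -F',' 'NR > 1 { total += $3 } END { printf "%.2f\n", total }' data.csv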
Awk is imo a poor solution. I use awk all the time and I would never use it for something like this. Why not just use Postgres? It's a lot more capable, easy to set up, and you get SQL, which is extremely powerful. Normally I might even go with SQLite, but for me 6TB is too much for SQLite.
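Roughly what I have in mind (table, column, and file names are placeholders): \copy streams the CSV from the client, so nothing has to fit in RAM, and once it's loaded you have full SQL.

    # rough sketch, placeholder names; \copy streams the file client-side,
    # then you query with ordinary SQL
    psql -c "CREATE TABLE sales (id bigint, region text, amount numeric)"
    psql -c "\copy sales FROM 'data.csv' CSV HEADER"
    psql -c "SELECT region, sum(amount) FROM sales GROUP BY region"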