
I agree with this. BigQuery or AWS S3/Athena.

You shouldn't have to set up a cluster for data jobs these days.

And it kind of points to the reason for going with a data scientist and the toolset he has in mind, instead of optimizing for a command-line/embedded programmer.

The tools will evolve in the direction of the data scientist, while the embedded approach is a dead end in lots of ways.

You may have outsmarted some of your candidates, but you would have hired a person not suited for the job long term.



It is actually pretty easy to do the same type of processing you would do on a cluster by using AWS Batch.


Possibly, but it seems like overkill for the type of analysis that the OP expected the interviewee to do with awk.

SQL should be fine for that.
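
For instance, a minimal sketch with the BigQuery Python client (the project/dataset/table and column names here are made up for illustration):

  # Serverless top-10 query -- nothing to provision beyond the table.
  from google.cloud import bigquery

  client = bigquery.Client()  # picks up default credentials
  query = """
      SELECT customer_id, SUM(amount) AS total
      FROM `my_project.sales.orders`
      GROUP BY customer_id
      ORDER BY total DESC
      LIMIT 10
  """
  for row in client.query(query).result():
      print(row.customer_id, row.total)

Much the same SQL works on Athena over files sitting in S3.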

Actually, I have a feeling that the awk solution will struggle if there are many unique keys.

For example, if that dataset has a million customers and they want to extract the top 10, then there is an intermediate map stage that will be storage- or memory-consuming.

It is like matrix multiplication. Calculating a single dot product is trivial, but when the matrix has n×m dimensions and n and m start to grow, it becomes more and more resource-heavy, and at some point the laptop will not be able to handle it.

(In the example, m is the number of rows and n is the number of unique customers. The dot product is just a sum over one dimension, while the group-by on customer id is the tricky part.)
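
To make that concrete, here is roughly what the awk one-liner does internally, sketched in Python (the file name and column layout are hypothetical): the intermediate table holds one entry per unique customer, so it grows with the number of unique keys, not the number of rows.

  import csv
  import heapq

  # The "map" stage: one dict entry per unique customer_id, which is
  # essentially what awk keeps in its associative arrays for the whole run.
  totals = {}
  with open("orders.csv") as f:
      for customer_id, amount in csv.reader(f):
          totals[customer_id] = totals.get(customer_id, 0.0) + float(amount)

  # The final reduce is cheap by comparison.
  for customer_id, total in heapq.nlargest(10, totals.items(), key=lambda kv: kv[1]):
      print(customer_id, total)

With a million unique customers that dict holds a million entries; scale the number of keys up further and the laptop's memory becomes the bottleneck.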


I agree completely for this scale. I did want to point out that it's fairly easy these days to do the kinds of things one would do on a cluster, which I learned just a few months ago myself :)


Quick addition: there are Python modules (e.g. cloudknot) that can take a Python callable and launch an AWS Batch environment and job for it with a single method call, from anywhere that runs Python.
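
A minimal sketch, assuming cloudknot's documented Knot interface (the function and inputs here are toy placeholders):

  import cloudknot as ck

  def square(x):
      # Runs inside an AWS Batch job; cloudknot's docs suggest keeping
      # imports inside the function so they resolve in the container.
      return x * x

  # Builds the Docker image, IAM roles, and Batch environment, then
  # submits one job per input.
  knot = ck.Knot(name="square-demo", func=square)
  futures = knot.map(range(10))

  # ...collect results from the futures, then tear down the AWS resources:
  knot.clobber()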



