
I agree with this. BigQuery or AWS S3/Athena.

You shouldn't have to set up a cluster for data jobs these days.

And it kind of points to the reason for going with a data scientist and the toolset he has in mind, instead of optimizing for a command-line/embedded programmer.

The tools will evolve in the direction of the data scientist, while the embedded approach is a dead end in lots of ways.

You may have outsmarted some of your candidates, but you would have hired a person not suited for the job long term.



It is actually pretty easy to do the same type of processing you would do on a cluster by using AWS Batch.


Possibly, but it seems like overkill for the type of analysis that the OP expected the interviewee to do with awk.

SQL should be fine for that.
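
For instance, a minimal sketch with the BigQuery Python client (the project/dataset/table and column names here are made up for illustration):

  # Serverless top-10 query -- nothing to provision beyond the table.
  from google.cloud import bigquery

  client = bigquery.Client()  # picks up default credentials
  query = """
      SELECT customer_id, SUM(amount) AS total
      FROM `my_project.sales.orders`
      GROUP BY customer_id
      ORDER BY total DESC
      LIMIT 10
  """
  for row in client.query(query).result():
      print(row.customer_id, row.total)

Much the same SQL works on Athena over files sitting in S3.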

Actually, I have a feeling that the awk solution will struggle if there are many unique keys.

For example, if that dataset has a million customers and they want to extract the top 10, then there is an intermediate map stage that will be storage- or memory-consuming.

It is like matrix multiplication. Calculating a single dot product is trivial, but when the matrix has n×m dimensions and n and m start to grow, it becomes more and more resource-heavy, and at some point the laptop will not be able to handle it.

(In the example, m is the number of rows and n is the number of unique customers. The dot product is just a sum over one dimension, while the group-by on customer id is the tricky part.)
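
To make that concrete, here is roughly what the awk one-liner does internally, sketched in Python (the file name and column layout are hypothetical): the intermediate table holds one entry per unique customer, so it grows with the number of unique keys, not the number of rows.

  import csv
  import heapq

  # The "map" stage: one dict entry per unique customer_id, which is
  # essentially what awk keeps in its associative arrays for the whole run.
  totals = {}
  with open("orders.csv") as f:
      for customer_id, amount in csv.reader(f):
          totals[customer_id] = totals.get(customer_id, 0.0) + float(amount)

  # The final reduce is cheap by comparison.
  for customer_id, total in heapq.nlargest(10, totals.items(), key=lambda kv: kv[1]):
      print(customer_id, total)

With a million unique customers that dict holds a million entries; scale the number of keys up further and the laptop's memory becomes the bottleneck.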


I agree completely for this scale. I did want to point out that it's fairly easy these days to do the kinds of things one would do on a cluster, which I learned just a few months ago myself :)


Quick addition: there are Python modules (e.g. cloudknot) that can take a Python callable and launch an AWS Batch environment and job for it with a single method call, from anywhere that runs Python.
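
A minimal sketch, assuming cloudknot's documented Knot interface (the function and inputs here are toy placeholders):

  import cloudknot as ck

  def square(x):
      # Runs inside an AWS Batch job; cloudknot's docs suggest keeping
      # imports inside the function so they resolve in the container.
      return x * x

  # Builds the Docker image, IAM roles, and Batch environment, then
  # submits one job per input.
  knot = ck.Knot(name="square-demo", func=square)
  futures = knot.map(range(10))

  # ...collect results from the futures, then tear down the AWS resources:
  knot.clobber()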



