
Now you have to consider the cost of your whole team learning how to use AWK instead of SQL. Then you do the TCO calculation and revert to the BigQuery solution.


Not necessarily. I always try to write to disk first, usually in a rotating compressed format if possible. Then, based on something like a queue, cron, or inotify, other tasks occur, such as processing and database logging. You still end up at the same place, and this approach works really well with tools like jq when the raw data is in jsonl format.

The only time this becomes an issue is when the data needs to be processed as close to real-time as possible. In those instances, I still tend to log the raw data to disk in another thread.
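
A minimal sketch of that pattern, assuming newline-delimited JSON and hypothetical file/field names:

    # append raw events to disk first (rotate/compress separately)
    echo '{"user":"alice","ms":42}' >> events.jsonl

    # later, a cron- or inotify-triggered task aggregates with jq:
    # slurp the file into an array and sum the "ms" field
    jq -s 'map(.ms) | add' events.jsonl

The downstream task can just as easily load the same file into a database.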


For someone who is comfortable with SQL, we are talking minutes to hours to figure out awk well enough to read it or use it.


I have been using SQL for decades and I am not comfortable with awk, nor do I intend to become so. There are better tools.


It is not only about whether people can figure out awk. It is also about how supportable the solution is. SQL provides many features built specifically for complex querying, and it is far more accessible to most people - you can't reasonably expect your business analysts to do complex analysis in awk.

Not only that, it provides a useful separation from the storage format, so the same query can run against a flat file exposed as a table by Apache Drill, a file on S3 exposed by Athena, data in an actual database table, and so on. The flexibility is terrific.
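
As a sketch of that separation (assuming a stock Apache Drill install with its default dfs storage plugin, and a hypothetical headerless /tmp/events.csv), the flat file is queryable with no load step at all:

    -- Drill exposes headerless CSV fields as the `columns` array
    SELECT columns[0] AS user_id, COUNT(*) AS hits
    FROM dfs.`/tmp/events.csv`
    GROUP BY columns[0];

Point the FROM clause at S3 via Athena, or at a real table, and the query barely changes.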


With the exception of regexes (which any programmer or data analyst ought to develop some familiarity with anyway), you can describe the entirety of AWK on a few sheets of paper. It's a versatile, performant, and enduring data-handling tool that is already installed on all your servers. You would be hard-pressed to find a better investment in technical training.
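
The whole language is one pattern-action loop over lines and fields. A sketch against a hypothetical space-delimited access log:

    # for each line: if field 3 (the status code) is 500, run the action;
    # the END block runs once after the last line
    awk '$3 == 500 { n++ } END { print n " server errors" }' access.log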


No, if you want SQL, you install PostgreSQL on the single machine.

Why would you use BigQuery just to get SQL?
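
It's a one-time setup; a sketch, assuming a local PostgreSQL install and a hypothetical events.csv:

    createdb analytics
    psql analytics -c "CREATE TABLE events (user_id text, ms int);"
    # \copy streams the file through the client, no server file access needed
    psql analytics -c "\copy events FROM 'events.csv' CSV HEADER"
    psql analytics -c "SELECT user_id, avg(ms) FROM events GROUP BY user_id;"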


sqlite cli
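
Even less setup, if the data fits in one file; a sketch assuming sqlite3 3.32+ (for the --csv flag) and a hypothetical events.csv with a header row:

    # .import into a nonexistent table uses the header row as column names
    sqlite3 :memory: <<'EOF'
    .import --csv events.csv events
    SELECT user_id, COUNT(*) FROM events GROUP BY user_id;
    EOF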


About $20/month for ChatGPT or a similar copilot, which they really should be reaching for independently anyhow.


And since the data scientist cannot verify that the very complex AWK output is 100% equivalent to his SQL query, he ends up relying on the GPT output for business-critical analysis.


Only if your testing frameworks are inadequate. But I believe you may be missing or mistaken about how code generation successfully integrates into a developer's and data scientist's workflow.

Why not take a few days to get familiar with AWK, a skill which will last a lifetime? Like SQL, it really isn't so bad.


It is easier to write complex queries in SQL than in AWK. I know both AWK and SQL, and I find SQL much easier for complex data analysis: JOINs, subqueries, window functions, and so on. Of course, your mileage may vary, but I think most data scientists will be much more comfortable with SQL.
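
To make that concrete, here is the same per-user average both ways, assuming a hypothetical two-column input of user and latency:

    -- SQL: declare the grouping, let the engine do the work
    SELECT user_id, AVG(ms) FROM events GROUP BY user_id;

    # awk: build the same aggregation by hand with associative arrays
    awk '{ sum[$1] += $2; n[$1]++ }
         END { for (u in sum) print u, sum[u] / n[u] }' events.txt

Add a join or a window function and the gap widens considerably.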


Many people have noted that when using LLMs for things like this, the person’s ultimate knowledge of the topic ends up less than it otherwise would have been.

This effect then leaves the person reliant on the LLM to answer every question, and less capable of working through the topic’s more complex issues.

$20/month is a siren’s call to introduce such a dependency into critical systems.



