
I think I've written about it here before, but I imported ≈1 TB of logs into DuckDB (which compressed them enough to fit in my laptop's RAM) and was done with my analysis before the data science team had even ingested everything into their Spark cluster.
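For reference, here's a minimal sketch of that kind of workflow with DuckDB's Python API (the paths, file format, and column names are invented, not the parent's actual setup):

  import duckdb

  con = duckdb.connect()  # in-memory database; DuckDB's columnar storage compresses well
  # read_csv_auto infers the schema; the glob pulls in every log shard
  con.execute("""
      CREATE TABLE logs AS
      SELECT * FROM read_csv_auto('logs/*.csv.gz')
  """)
  # typical ad-hoc query once everything is in RAM
  print(con.execute("""
      SELECT status, count(*) AS n
      FROM logs
      GROUP BY status
      ORDER BY n DESC
  """).fetchall())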

(On the other hand, I wouldn't really want the average business analyst walking around with all our customer data on their laptop all the time. And by the time you have a proper ACL system with audit logs and some nice way to share analyses that updates in real time as new data is ingested, the Big Data Solution™ probably has a lower TCO...)



> And by the time you have ... the Big Data Solution™ probably has a lower TCO...

I doubt it. The common Big Data Solutions manage to have a very high TCO, and the smallest share of it goes to hardware and software. Most of the cost comes from reliability engineering and UI issues (because managing that "proper ACL" that doesn't fit your business is a hell of a problem that nobody will get right).


> ...managing that "proper ACL" that doesn't fit your business is a hell of a problem that nobody will get right...

I'm not sure there is a way to get this right unless there is programmatic integration with the org chart, plus the ability to describe (and parse) in a declarative language the organizational rules of who has access to what, when, and under what authorization. Otherwise it has been, in my experience, an exercise in watching massive amounts of toil: manually translating between the source of truth (SOT) of the org chart and all the other applications, mediated by many manual approval policies and procedures. And at every client where I've proposed this, I've been denied that programmatic access for integration.
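To make that concrete, here's a toy sketch of what declarative rules over a programmatic org-chart feed could look like. Every name, team, and dataset below is invented:

  from dataclasses import dataclass

  @dataclass(frozen=True)
  class Person:
      id: str
      team: str
      manager: str | None

  def chain_of_command(p, org):
      # walk the reporting line up toward the root of the org chart
      while p.manager is not None:
          p = org[p.manager]
          yield p

  # declarative rules: (predicate over the org chart, dataset granted)
  RULES = [
      (lambda p, org: p.team == "fraud-analytics", "payments.transactions"),
      (lambda p, org: any(m.team == "data-platform" for m in chain_of_command(p, org)),
       "warehouse.raw_logs"),
  ]

  def grants(p, org):
      return {dataset for rule, dataset in RULES if rule(p, org)}

  org = {x.id: x for x in [
      Person("cto", "exec", None),
      Person("dp-lead", "data-platform", "cto"),
      Person("ana", "fraud-analytics", "dp-lead"),
  ]}
  print(grants(org["ana"], org))  # granted via her team and via her reporting line

The point is that when the org chart changes, the grants change with it, with no ticket queue in between.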

A lot of sites try to avoid this by designing ACLs around certain activity or data domains, because those are more stable than organizations, but this breaks down at the fine-grained levels of the ACLs, so the benefits of this approach are capped.

I'd love to hear how others solve this in large (10K+ staff) organizations that frequently reshuffle their teams.


You probably didn't do joins on your dataset, for example, because DuckDB OOMs on them when they don't fit in memory.
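For what it's worth, DuckDB does expose knobs for larger-than-memory work; whether a given join actually spills to disk or OOMs depends on the version and the query. A sketch, with the limit and path as assumptions:

  import duckdb

  con = duckdb.connect()
  con.execute("SET memory_limit = '24GB'")         # cap below physical RAM
  con.execute("SET temp_directory = '/tmp/duck'")  # where operators may spill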



