Substrait seems like the biggest/most-used competitor-ish project out there. I'd love some compare and contrast; my sense is that Substrait has a smaller ambition, wanting more to be a language for talking about execution than a full-on optimization/execution engine. https://github.com/substrait-io/substrait .
There's a solid Influx post that talks through how they're composing these assorted technologies to build their next-gen 3.0, which I find helpful for getting a sense of how all the pieces of a modern high-performance data engine slot together. https://www.influxdata.com/blog/flight-datafusion-arrow-parq...
I think you're right - Substrait wants to sit above something like Velox. The closest comparison is probably Databricks Photon[1], but that's proprietary.
A lot of it is just trying to build a more modular Spark.
Which is nice and all, but most companies want something integrated, so a lot of these projects, despite their promises, haven't really seen much traction.
Especially when every cloud provider has a fully supported Spark platform available.
My general take is that while the idea of composability is good, the implementations of these things are just frankly not of high quality. Velox/Acero in particular are all plagued by what I've come to call "Java syndrome", where everything is written as idiomatic Java but with C++ syntax. Virtual methods, std::shared_ptr galore (in lieu of garbage collection), random heap allocations, etc. As a result these systems tend to be bloated and significantly slower than they need to be.
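To make the critique concrete, here's a hypothetical sketch of the pattern (none of this is actual Velox/Acero code): every row parked behind a `std::shared_ptr` versus plain values in contiguous storage.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical sketch of "Java syndrome": every element behind a
// std::shared_ptr, paying for a heap allocation, atomic refcounting,
// and a pointer chase per row. Illustrative only.
struct Row {
    int a;
    int b;
};

int sum_java_style(const std::vector<std::shared_ptr<Row>>& rows) {
    int s = 0;
    for (const auto& r : rows) s += r->a + r->b;  // indirection per row
    return s;
}

// Value-oriented alternative: contiguous storage, no refcounts,
// cache-friendly and easy for the compiler to vectorize.
int sum_value_style(const std::vector<Row>& rows) {
    int s = 0;
    for (const auto& r : rows) s += r.a + r.b;
    return s;
}
```

Both functions compute the same sum; the difference is that the first pays one allocation and one pointer dereference per row, which is exactly the overhead being complained about.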
DuckDB is good though, and I predict its quality of implementation will keep "monolithic databases" relevant for a while longer.
Acero is indeed plagued by "Java syndrome". Honestly, though, Velox looks better: it's not as good as ClickHouse, but working on it doesn't leave a "bruh Apache Arrow bruh shared_ptr everywhere bruh" bad taste in my mouth.
And it doesn't use Apache Arrow C++; it implements its own vectors with an Arrow-compatible memory layout.
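For the curious, "Arrow-compatible memory layout" boils down to things like this minimal sketch of an int32 column: a contiguous values buffer plus a validity bitmap, one bit per slot, least-significant bit first, per the Arrow columnar spec. This is an illustration of the layout, not Velox's actual vector classes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal illustration of an Arrow-compatible int32 column:
// contiguous values plus a validity bitmap (bit i set => slot i non-null).
// Mirrors the Arrow columnar format's layout, not any real library API.
struct Int32Column {
    std::vector<int32_t> values;
    std::vector<uint8_t> validity;  // ceil(n / 8) bytes, LSB-first

    bool is_valid(std::size_t i) const {
        return (validity[i / 8] >> (i % 8)) & 1;
    }
};
```

Because the buffers match Arrow's layout bit for bit, an engine can hand them to Arrow-consuming code zero-copy without actually depending on the Arrow C++ library.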
Yes this has been an up-and-coming theme in the data science world. Arrow for the data format, Ibis for the API, Acero/Velox/DataFusion/DuckDB/Polars for execution, Substrait for the query plan representation, etc.
By the way, we're looking for people with Velox, DataFusion, DuckDB, or ClickHouse backgrounds to help build the world's fastest vector DB, Milvus (mainly on the search side). Contact me if you have this background!
I wonder how many of these FAANG projects really get used where they're built. I went for an interview at a FAANG years ago to work on a very big consumer product (when it was in relative infancy) and expected to find a hyper-tech data backend in use... they told me they were using MySQL.
I didn't get the job so maybe they were just joking around with me - but the general despair that they evinced about their data situation makes me wonder!
At Meta they probably don't get built unless they're impactful, and they're not impactful if they're not used in production to solve a real pain point.
That's a pretty cynical take. Meta deployed Facebook at enormous scale, as in many thousands of MySQL servers. The engineering team included a number of the best engineers in the MySQL community, who adapted MySQL extensively to meet the needs of Facebook applications. They used MySQL because it worked.
That's just an urban myth about promotion in big corps.
Yes
There are a lot of vanity projects that get someone promoted for the wrong reasons
Those only get broadcast because that's what's newsworthy. You won't get upvoted for sharing a small story about someone solving a hard problem and getting promoted.
Overwhelmingly, people get promoted because they solve challenging problems with meaningful impact. That's how capitalism and modern corporations work.
But on top of that baseline there are a lot of errors, exceptions, and manipulations, because that's what people do every day: they game the system for their own gain. Human nature. There are just so many of them because big corps are big, and that's why big corps eventually lose their vigor.
The best way to combat promotion bullshit and other corporate bullshit is to recognize it, call it out with the right technique (being diplomatic and protecting yourself), and not practice it yourself.
Yes, don't practice the bullshit. That's extraordinarily difficult.
I think it’s true both that most promotions are legit and not based on vanity projects, and yet the vanity projects are still common and causing major problems. Say you have 10k engineers at your megacorp. Maybe the ideal number of execution platform workflow framework engines your business needs to add this year is 30, but instead 300 are created by the 3% of your engineers who want a promotion. Eventually you have thousands of these frameworks, maintaining them is a drag, and everyone is suffering, even though the vast majority are good actors.
Speaking from personal experience, the inverse of this is not necessarily great either: the desire for ever-growing scope leads to convincing everyone to switch to the "one true system" where previously multiple custom solutions were a better fit for each individual problem.
> Overwhelmingly, people get promoted because they solve challenging problems with meaningful impact. That's how capitalism and modern corporation work.
It's key to ask, does the promotion (or strong performance rating) happen before the impact or after?
You can deliver Project X that will save $YY million. Everyone agrees the impact is "there", the complexity is there. Launch a PoC to a handful of use cases, realize most of that impact, then move on to something else. The PoC works for those use cases, never becomes a complete solution, and slowly develops issues. Once it has enough issues, someone else can solve the problem again for even more impact, assuming the problem space has grown since the initial launch.
Capitalism works when there's competition and cost for (long-term) failure. Neither are guaranteed to exist if you're at a Big Corp that's printing money.
> Capitalism works when there's competition and cost for (long-term) failure. Neither are guaranteed to exist if you're at a Big Corp that's printing money.
Disagree.
Big corps that print money still squeeze their employees.
See the record profits and revenue alongside the 10k+ layoffs.
And why ever not? It's a perfectly good solution, no?
What the GP alludes to is interesting though: the mythologising of organisations, brands and names.

Spend enough time with "famous" people, "big names", centres of power and prominence, and you quickly see everyone is just ordinary dudes doing ordinary things with ordinary gear. But for some reason there's fuck loads of money and attention, and sometimes cloying paranoia and adulation, floating around.

Sure, right out on the periphery are a noble few who play with particle accelerators, spaceships and bunker supercomputers. But then, that's just a day job too.

True genius/exceptionalism is rare and found in unexpected places. The rest is conjured out of thin air by marketing and PR people, the press, and commentators. They are the ones who need the big legend.
tbh my general response to all data questions is "use postgres". It does happen that someone comes back with a good reason why that would be a bad idea, but it's not frequent!
MySQL == Oracle now... so it's bad on theological grounds.
To the best of my knowledge, Meta has significantly reduced its investment in the Velox project. Apart from Meta, I'm not aware of any other major company that really uses Velox in a production environment. Frankly speaking, Velox may have already missed the window of opportunity for rapid development. If you're looking for a vectorized execution engine, you could consider ClickHouse (www.clickhouse.com) or StarRocks (www.starrocks.io). If your data analysis scenarios require more multi-table join operations, StarRocks is clearly a better choice.
Many ideas look like they were influenced by ClickHouse, and some are direct copies. I'm surprised they didn't provide references to ClickHouse, where these implementations were proven in production first.
Vectorization wasn't invented by ClickHouse; these ideas all trace back to the work done at CWI on MonetDB, then VectorWise, and more recently DuckDB (and Velox). Velox also doesn't claim to have invented any of these techniques; the novelty the project claims is doing them in a modular way, so they can be reused while building any other engine (engine-agnostic), following any SQL dialect (dialect-agnostic).
Could you list the "copied" ideas you are referring to?
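For readers unfamiliar with the term, the vectorized-execution lineage being discussed is roughly the contrast below: one virtual call per value (Volcano-style iteration) versus one call per batch with a tight inner loop. This is a toy sketch of the idea, not code from ClickHouse, Velox, or any of the engines named above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Tuple-at-a-time: virtual dispatch for every single value.
struct Expr {
    virtual ~Expr() = default;
    virtual int64_t eval(int64_t x) const = 0;
};
struct AddOne final : Expr {
    int64_t eval(int64_t x) const override { return x + 1; }
};

int64_t sum_tuple_at_a_time(const Expr& e, const std::vector<int64_t>& col) {
    int64_t s = 0;
    for (int64_t x : col) s += e.eval(x);  // one virtual call per value
    return s;
}

// Vectorized: one call per batch; the inner loop is a plain,
// branch-free loop the compiler can unroll and SIMD-vectorize.
int64_t sum_add_one_vectorized(const std::vector<int64_t>& col) {
    int64_t s = 0;
    for (int64_t x : col) s += x + 1;
    return s;
}
```

Amortizing interpretation overhead over a whole batch like this is the core trick from the MonetDB/X100 line of work that all of these engines build on.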
(Edit: ah, there's a recent talk discussing PyVelox trying to get Substrait integration. https://www.youtube.com/watch?v=l_kHxkGkNRg#t=18m22s . However, there's also discussion here about some of the current Substrait work being unmaintained; status unclear. https://github.com/facebookincubator/velox/issues/8895)
We can also see from the Apache Arrow DataFusion discussion that they too see themselves as a bit of a Velox competitor. https://github.com/apache/arrow-datafusion/discussions/6441
It's cool to see this space mature. I like that even Velox sees that Apache Arrow (which underlies Apache Arrow DataFusion too) is industry-standard tech that they ought to work with. https://engineering.fb.com/2024/02/20/developer-tools/velox-...