Velox: Meta's Unified Execution Engine [pdf] (umich.edu)
99 points by luu on March 25, 2024 | 42 comments


Substrait seems like the biggest/most-used competitor-ish out there. I'd love some compare & contrast; my sense is that Substrait has a smaller ambition: it wants to be a language for talking about execution rather than a full-on optimization/execution engine. https://github.com/substrait-io/substrait .

(Edit: ah, there's a recent talk discussing PyVelox trying to get Substrait integration. https://www.youtube.com/watch?v=l_kHxkGkNRg#t=18m22s . However there's also discussion about the un-maintainedness of some of the current Substrait work here; unclear status. https://github.com/facebookincubator/velox/issues/8895)

We can also see from the Apache Arrow DataFusion discussion that they too see themselves as a bit of a Velox competitor. https://github.com/apache/arrow-datafusion/discussions/6441

It's cool to see this space mature. I like that even Velox sees that Apache Arrow (which underlies Apache Arrow DataFusion too) is industry-standard tech that they ought to work with. https://engineering.fb.com/2024/02/20/developer-tools/velox-...

There's a solid Influx post that talks about how they are composing the assorted technologies to build their next-gen 3.0, which I find helpful for getting a sense of how all the pieces of a modern high-performance data engine slot together. https://www.influxdata.com/blog/flight-datafusion-arrow-parq...


I think you're right - Substrait wants to sit above something like Velox. The closest comparison is probably Databricks Photon[1], but that's proprietary.

[1]: https://www.databricks.com/product/photon


A lot of it is just trying to build a more modular Spark.

Which is nice and all, but most companies want something integrated, so a lot of these projects, despite their promise, haven't really seen much traction.

Especially when every cloud provider has a fully supported Spark platform available.


My general take is that while the idea of composability is good, the implementations of these things are just frankly not of high quality. Velox/Acero in particular are all plagued by what I've come to call "Java syndrome", where everything is written as idiomatic Java but with C++ syntax. Virtual methods, std::shared_ptr galore (in lieu of garbage collection), random heap allocations, etc. As a result these systems tend to be bloated and significantly slower than they need to be.

DuckDB is good though, and I predict its quality of implementation will keep "monolithic databases" relevant for a while longer.


Basically 1990s C++ before Java was invented, as proven in any C++ GUI framework that has survived to our days.

I really take issue with people applying the label "Java code" to what used to be quite common C++ code from CFront 2.0 until C++11 came to be.

Java is the outcome of C++'s programming practices before 1996, with a flavour of Objective-C semantics on top.

To pretend those communities aren't responsible for those practices in the first place is not being honest about where they came from.


> DuckDB is good though, and I predict its quality of implementation

DuckDB has segfaulted a lot for me. That simply bodes poorly for the future of a C++ codebase.

Datafusion has been a pretty pleasant experience.


Acero is indeed plagued by "Java syndrome". Honestly, though, Velox looks better. It's not as good as ClickHouse, but working on it does not leave a "bruh Apache Arrow bruh shared_ptr everywhere bruh" bad taste in my mouth.

And it does not use Apache Arrow C++; it has its own implementation with an Arrow-compatible memory layout.
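
For anyone wondering what an "Arrow-compatible memory layout" amounts to, here is a minimal sketch of the Arrow-style layout for a nullable fixed-width column: a validity bitmap (one bit per slot, least-significant bit first, 1 meaning "not null") plus a contiguous values buffer. The function names are hypothetical and this is not Velox's or Arrow's actual API; it also assumes `array("i")` is 4 bytes wide, which is typical but not guaranteed.

```python
from array import array


def build_int32_column(values):
    """Pack a list of ints/None into (validity_bitmap, data_buffer, null_count).

    Illustrative helper only; not part of any real library.
    """
    n = len(values)
    validity = bytearray((n + 7) // 8)  # one bit per slot, LSB-first within each byte
    data = array("i", [0] * n)          # fixed-width slots; null slots hold a placeholder
    null_count = 0
    for i, v in enumerate(values):
        if v is None:
            null_count += 1
        else:
            validity[i // 8] |= 1 << (i % 8)  # mark slot i as valid
            data[i] = v
    return bytes(validity), data, null_count


def is_valid(validity, i):
    """Return True if slot i is non-null according to the bitmap."""
    return (validity[i // 8] >> (i % 8)) & 1 == 1


def column_sum(validity, data):
    """Sum only the valid slots, skipping nulls via the bitmap."""
    return sum(data[i] for i in range(len(data)) if is_valid(validity, i))


validity, data, null_count = build_int32_column([1, None, 3, 4, None, 6, 7, 8, 9])
total = column_sum(validity, data)
```

Two engines that agree on this kind of layout can hand buffers to each other without converting row by row, which is the point of being "compatible" without sharing the C++ library.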


Velox could be a competitor to DataFusion. It is more focused on being an execution engine and could be great to integrate into other high-performance databases.

Databases will be split into pieces and rebuilt!


Yes this has been an up-and-coming theme in the data science world. Arrow for the data format, Ibis for the API, Acero/Velox/DataFusion/DuckDB/Polars for execution, Substrait for the query plan representation, etc.


Isn't Spark already providing a bundled and somewhat functional version of all of this?


Right, so I think the idea is to "unbundle" this so that you can compose your own data analytics engine.


By the way, we are looking for people with Velox, DataFusion, DuckDB, or ClickHouse experience to build the world's fastest vector DB, Milvus (mainly on the search side). Contact me if you have this background!


I wonder how many of these sorts of FAANG projects really get used where they are built. I went for an interview at a FAANG years ago to work on a very big consumer product (when it was in relative infancy) and expected to find a hyper-tech data backend to use... they told me that they were using MySQL.

I didn't get the job so maybe they were just joking around with me - but the general despair that they evinced about their data situation makes me wonder!


Facebook/Meta uses MySQL, but with a completely different engine (MyRocks) and sharding techniques.

YouTube uses mySQL but they've also rewritten major portions for scalability. (Vitess)

Just because a company is using a technology you've heard of doesn't mean it's what you expect.


YouTube uses Spanner now. They migrated off Vitess a while ago.


Can you share why they migrated?

Vitess seemed to be working for them for a long time.


When Vitess was donated to CNCF, there was an internal push from within Google to migrate everything to Spanner.

I do not know any other reason.

Disclaimer: Vitess Maintainer


> YouTube uses mySQL but they've also rewritten major portions for scalability. (Vitess)

I imagine this is very old info (like 10 years old) and could have changed since then?


At Meta they probably don't get built unless they're impactful, and they're not impactful if they're not used in production to solve a real pain point.


More like, everything gets built because someone wants to get promoted.


That's a pretty cynical take. Meta deployed MySQL for Facebook at enormous scale, as in many thousands of MySQL servers. The engineering team included a number of the best engineers in the MySQL community, who adapted MySQL extensively to meet the needs of Facebook applications. They used MySQL because it worked.


That's just some urban myth about promotion in big corp

Yes

There are a lot of vanity projects that get someone promoted for the wrong reasons

Those only get broadcast because that's what is newsworthy. You won't get upvoted for sharing a small story about someone who solved a hard problem and got promoted.

Overwhelmingly, people get promoted because they solve challenging problems with meaningful impact. That's how capitalism and modern corporation work.

But above the baseline there are a lot of errors, exceptions, and manipulations. Because that's what people do every day: they want to game the system for their own gain. Human nature. There are just so many of them because big corps are big. And that's why big corps eventually lose their vigor.

The best way to combat promotion bullshit and other corporate bullshit is to recognize it, call it out in the right way (being diplomatic and protecting yourself), and not practice it yourself.

Yes, don't practice the bullshit. That's extraordinarily difficult.


I think it’s true both that most promotions are legit and not based on vanity projects, and yet still the vanity projects are common and causing major problems. Let’s say you have 10k engineers at your megacorp. Maybe the ideal number of execution platform workflow framework engines your business needs to add this year is 30, but instead 300 are created by the 3% of your engineers who want a promotion. Eventually you have thousands of these frameworks, maintaining them is a drag, and everyone is suffering, even though the vast majority are good actors.


Speaking from personal experience, the inverse of this is not necessarily great either: the desire for ever-growing scope leads to convincing everyone to switch to the "one true system" where previously multiple custom solutions were a better fit for each individual problem.


These projects don't just appear out of thin air and get funded.

It's because no existing solutions meet their unique requirements, which you often don't get visibility into unless you're on the team.

But of course that will never stop HN commenters assuming they know better about the situation than the engineers and managers that work there.


> Overwhelmingly, people get promoted because they solve challenging problems with meaningful impact. That's how capitalism and modern corporation work.

It's key to ask, does the promotion (or strong performance rating) happen before the impact or after?

You can deliver Project X that will save $YY million. Everyone agrees the impact is "there", the complexity is there. Launch a PoC to a handful of use cases, realize most of that impact, then move on to something else. The PoC works for those use cases, never becomes a complete solution, and slowly develops issues. Once it has enough issues, someone else can solve the problem again for even more impact, assuming the problem space has grown since the initial launch.

Capitalism works when there's competition and cost for (long-term) failure. Neither are guaranteed to exist if you're at a Big Corp that's printing money.


> Capitalism works when there's competition and cost for (long-term) failure. Neither are guaranteed to exist if you're at a Big Corp that's printing money.

Disagree

Big corps that print money still squeeze employees. See the record profits and revenue alongside 10k+ layoffs.


This is being actively used at Meta in production across several engines; the paper makes explicit references to this.


I can neither confirm nor deny that S3’s global bucket database is actually just MySQL (with a lil bit of special sauce)


And why ever not? It's a perfectly good solution, no?

What the GP alludes to is interesting though - mythologising of organisations, brands and names.

Spend enough time with "famous" people, "big names", centres of power and prominence and you quickly see everyone is just ordinary dudes doing ordinary things with ordinary gear. But for some reason there's fuck loads of money and attention, and sometimes cloying paranoia and adulation floating around.

Sure, right out on the periphery are a noble few who play with particle accelerators, spaceships and bunker supercomputers. But then, that's just a day job too.

True genius/exceptionalism is rare and found in unexpected places. The rest is conjured out of thin air by marketing and PR people, the press, and commentators. They are the ones who need the big legend.


Yeah, but I bet you the S3 Keymap isn't MySQL....


tbh my general response to all data questions is "use postgres". It does happen that someone comes back with a good reason why that would be a bad idea, but it's not frequent!

MySQL == Oracle now... so bad on theological grounds.


You can use MariaDB.



To the best of my knowledge, Meta has significantly reduced its investment in the Velox project. Apart from Meta, I'm not aware of any other major company that really uses Velox in a production environment. Frankly speaking, Velox may have already missed the window of opportunity for rapid development. If you're looking for a vectorized execution engine, you could consider ClickHouse (www.clickhouse.com) or StarRocks (www.starrocks.io). If your data analysis scenarios require more multi-table join operations, StarRocks is clearly a better choice.


This isn't really true; if anything, Meta has doubled down on Velox.


Curious if you’re both right. Could it be because of the orgs involved? Which teams are using it? What products?


Presto is actively switching to use Velox as the backend ( https://github.com/prestodb/presto/tree/master/presto-native... ). It is also being used extensively internally; again, the paper describes these uses, and they have grown, not shrunk.


This thread clearly has folks from ClickHouse trying to talk down the project. Could you provide any data or evidence to support your claim?


Many ideas look like they were influenced by ClickHouse, and some are direct copies. I'm surprised they didn't provide references to ClickHouse, where the implementations are proven in production in the first place.


Could you be specific about which ideas you think were influenced by ClickHouse specifically and not Presto or DuckDB or Spark?


Vectorization was not invented by ClickHouse; these techniques all trace back to the work done at CWI on MonetDB, then VectorWise, then more recently DuckDB (and Velox). Velox also does not claim to have invented any of these techniques; the novelty claimed by the project is doing so in a modular way, so that it can be reused when building any other engine (engine-agnostic) that follows any SQL dialect (dialect-agnostic).

Could you list the "copied" ideas you are referring to?
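
For readers new to the term, here is a minimal sketch of what "vectorization" means in this lineage, assuming a toy expression `a + b * 2`. The tuple format of the expression tree and the function names are made up for illustration; this is not the API of Velox, ClickHouse, or DuckDB. The row-at-a-time version walks the expression tree once per row, while the vectorized version walks it once per batch and runs a tight loop over whole columns at each node. Pure Python shows only the structure, not the speedup.

```python
import operator

# Hypothetical mini expression tree:
#   ("col", name) | ("const", value) | (op_name, left, right)
EXPR = ("add", ("col", "a"), ("mul", ("col", "b"), ("const", 2)))
OPS = {"add": operator.add, "mul": operator.mul}


def eval_row(expr, row):
    """Tuple-at-a-time: the whole tree is interpreted again for every row."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "const":
        return expr[1]
    return OPS[kind](eval_row(expr[1], row), eval_row(expr[2], row))


def eval_batch(expr, batch, n):
    """Vectorized: the tree is interpreted once; each node handles n values."""
    kind = expr[0]
    if kind == "col":
        return batch[expr[1]]
    if kind == "const":
        return [expr[1]] * n
    left = eval_batch(expr[1], batch, n)
    right = eval_batch(expr[2], batch, n)
    op = OPS[kind]
    # One simple loop per primitive; interpretation cost is paid per batch.
    return [op(x, y) for x, y in zip(left, right)]


columns = {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]}
rows = [{"a": a, "b": b} for a, b in zip(columns["a"], columns["b"])]

row_result = [eval_row(EXPR, r) for r in rows]
batch_result = eval_batch(EXPR, columns, 4)
```

Both paths compute the same values; the difference is where the interpretation overhead lands, which is the part that matters in a compiled engine.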



