While bloom filters are nice for understanding, as far as I know they've been surpassed by Cuckoo filters, and now (maybe?) Vacuum Filters.
I built a torrent-like streaming application for video a few years ago on WebRTC, and bloom filters made it easy to know whether a peer would have a specific block. The false positives of the bloom filter were fine too, as the ~1% of cases where it requested a block and got a failure back were not a bottleneck for the application.
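For anyone who hasn't seen one, here's a minimal sketch in Java of the kind of lookup described above (class and method names like PeerBlockFilter and mightHave are made up for illustration, not the application's actual code):

    import java.util.BitSet;

    // Minimal sketch: a peer advertises the blocks it has as a Bloom filter, and
    // we test a block id against that filter before sending a request.
    // A "false" answer is definite; a "true" answer is only probable.
    class PeerBlockFilter {
        private final BitSet bits;
        private final int m;   // number of bits
        private final int k;   // number of hash functions

        PeerBlockFilter(int m, int k) {
            this.bits = new BitSet(m);
            this.m = m;
            this.k = k;
        }

        void add(long blockId) {
            long h = mix(blockId);
            for (int i = 0; i < k; i++) {
                bits.set(index(h, i));
            }
        }

        boolean mightHave(long blockId) {
            long h = mix(blockId);
            for (int i = 0; i < k; i++) {
                if (!bits.get(index(h, i))) {
                    return false;   // definitely not present
                }
            }
            return true;            // probably present (small false-positive rate)
        }

        // Derive the i-th bit position from one 64-bit hash (double hashing).
        private int index(long hash, int i) {
            long h = hash + (long) i * ((hash >>> 32) | 1);
            return (int) Math.floorMod(h, (long) m);
        }

        // SplitMix64-style finalizer standing in for a real hash function.
        private static long mix(long x) {
            x += 0x9E3779B97F4A7C15L;
            x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
            x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
            return x ^ (x >>> 31);
        }
    }

With roughly 10 bits per block id and 7 hash functions, the false-positive rate lands around 1%, which matches the "occasional wasted request" trade-off described above.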
For static sets (where you construct the filter once and then only use it for lookups), blocked Bloom filters have the fastest lookups. They do need a bit more space (maybe 10% more than regular Bloom filters). Also very fast are binary fuse filters (which are new) and xor filters, and they also save a lot of space compared to the others. Cuckoo filters, ribbon filters, and Bloom filters are a bit slower. It's really a trade-off between space and lookup speed.
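To make the blocked variant concrete, here is a minimal sketch in Java, assuming 512-bit blocks (one cache line); the names are illustrative, not from any particular library:

    // Minimal blocked Bloom filter sketch: 512 bits per block (eight 64-bit
    // words). All k probes for a key land in the same block, so a lookup
    // touches a single cache line instead of k random ones; that is where
    // the lookup speed comes from.
    class BlockedBloomFilter {
        private static final int WORDS_PER_BLOCK = 8;   // 8 * 64 = 512 bits
        private final long[] data;
        private final int numBlocks;
        private final int k;

        BlockedBloomFilter(int numBlocks, int k) {
            this.data = new long[numBlocks * WORDS_PER_BLOCK];
            this.numBlocks = numBlocks;
            this.k = k;
        }

        void add(long key) {
            long h1 = mix(key);
            long h2 = mix(h1);
            int base = (int) Math.floorMod(h1, (long) numBlocks) * WORDS_PER_BLOCK;
            for (int i = 0; i < k; i++) {
                int bit = bitInBlock(h2, i);
                data[base + (bit >>> 6)] |= 1L << (bit & 63);
            }
        }

        boolean mightContain(long key) {
            long h1 = mix(key);
            long h2 = mix(h1);
            int base = (int) Math.floorMod(h1, (long) numBlocks) * WORDS_PER_BLOCK;
            for (int i = 0; i < k; i++) {
                int bit = bitInBlock(h2, i);
                if ((data[base + (bit >>> 6)] & (1L << (bit & 63))) == 0) {
                    return false;
                }
            }
            return true;
        }

        // Pick the i-th bit position inside the 512-bit block (double hashing).
        private static int bitInBlock(long h2, int i) {
            return (int) Math.floorMod(h2 + (long) i * ((h2 >>> 32) | 1), 512L);
        }

        private static long mix(long x) {
            x += 0x9E3779B97F4A7C15L;
            x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
            x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
            return x ^ (x >>> 31);
        }
    }

Confining every key to a single block is also where the roughly 10% space overhead comes from: the bits are spread less evenly than in a regular Bloom filter, so a slightly larger array is needed for the same false-positive rate.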
For dynamic sets (where you can add and remove entries later), the fastest (again for lookups) is the "Succinct counting blocked Bloom filter" (no paper for it yet): it is a combination of a blocked Bloom filter and a counting Bloom filter, so lookup is identical to the blocked Bloom filter. Then come cuckoo filters and counting Bloom filters.
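The counting Bloom filter is the easiest of the dynamic structures to picture; a minimal sketch in Java (my own illustration, not code from any of the projects mentioned here):

    // Minimal counting Bloom filter sketch: each bit of a plain Bloom filter
    // becomes a small counter, so entries can be removed again. Real
    // implementations usually pack 4-bit counters; a byte[] keeps the sketch short.
    class CountingBloomFilter {
        private final byte[] counters;
        private final int m;   // number of counters
        private final int k;   // number of hash functions

        CountingBloomFilter(int m, int k) {
            this.counters = new byte[m];
            this.m = m;
            this.k = k;
        }

        void add(long key) {
            long h = mix(key);
            for (int i = 0; i < k; i++) {
                int idx = index(h, i);
                // Saturate instead of wrapping around; once a counter saturates it
                // can no longer be decremented safely, which is the "overflow risk"
                // that small fixed-width counters have.
                if (counters[idx] != Byte.MAX_VALUE) {
                    counters[idx]++;
                }
            }
        }

        void remove(long key) {
            long h = mix(key);
            for (int i = 0; i < k; i++) {
                int idx = index(h, i);
                if (counters[idx] > 0) {
                    counters[idx]--;
                }
            }
        }

        boolean mightContain(long key) {
            long h = mix(key);
            for (int i = 0; i < k; i++) {
                if (counters[index(h, i)] == 0) {
                    return false;
                }
            }
            return true;
        }

        private int index(long hash, int i) {
            return (int) Math.floorMod(hash + (long) i * ((hash >>> 32) | 1), (long) m);
        }

        private static long mix(long x) {
            x += 0x9E3779B97F4A7C15L;
            x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
            x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
            return x ^ (x >>> 31);
        }
    }

With the usual 4-bit counters a counting Bloom filter needs about four times the space of a plain Bloom filter, which is why the blocked/succinct combination and cuckoo filters compare favourably on space.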
Incidentally, for people who don't know this already: if you want to just generally Google these data structures and learn about them, the search term is "succinct data structure". There's been a lot of interesting work in this area. And they are not lossy in general; obviously lossy structures are interesting, but there are some mind-blowing non-lossy succinct data structures as well.
It should be relatively easy to port it to other programming languages.
Compared to regular counting Bloom filters, it has some advantages (e.g. it uses half the space, lookup is much faster, and there is no risk of counter overflow). It has one disadvantage: add and remove are slower (currently). Cuckoo filters need less space, but otherwise the advantages and disadvantages are the same.
For more questions, just open an issue on that project (I'm one of the authors).
This paper seems to use an un-optimized Bloom filter implementation. It also lists cuckoo filters as being 10 times faster than Bloom filters. In other papers, e.g. the cuckoo filter paper and the xor filter paper, the difference is _much_ smaller; maybe a factor of 2.
Yes. Some variants of ribbon filters are very fast, and other variants (which are a bit slower, though) need very little space: only a few percent more than the theoretical lower bound.
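To put "a few percent more than the theoretical lower bound" into concrete numbers (the 1.44 factor for classic Bloom filters is the standard textbook figure):

    // Back-of-the-envelope numbers: a filter with false-positive rate eps needs
    // at least log2(1/eps) bits per key; a classic Bloom filter needs about 1.44x that.
    public class FilterSpace {
        public static void main(String[] args) {
            double eps = 0.01;                                     // 1% false-positive rate
            double lowerBound = Math.log(1.0 / eps) / Math.log(2); // ~6.64 bits per key
            double bloom = 1.44 * lowerBound;                      // ~9.57 bits per key
            System.out.printf("lower bound:  %.2f bits/key%n", lowerBound);
            System.out.printf("Bloom filter: %.2f bits/key (~44%% overhead)%n", bloom);
        }
    }

So at a 1% false-positive rate the floor is about 6.64 bits per key, a Bloom filter needs roughly 9.6, and the space-optimized ribbon variants sit just above the floor.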
Is anyone curating the latest developments in this space? I feel like I only stumbled upon Vacuum filters and now Ribbon filters because of the top comment on an HN thread. Is anyone doing a "latest developments in the data structure/algo space" roundup?
That's a good question! This paper also mentions "Dele. BF [43]" - "The Deletable Bloom Filter: A New Member of the Bloom Family" https://www.researchgate.net/publication/260668624_The_Delet... . As far as I can see, lookup is identical to the regular Bloom filter. Yet the vacuum filter paper lists the throughput of the Deletable Bloom Filter as almost twice that of the Bloom filter (1.95x). How can that be? Maybe I misunderstand something here.
Yes, peer review is supposed to catch completely incorrect numbers. But in many cases a reviewer doesn't have access to the source code and has no way to re-run the tests.
There's the "Replicated Computational Results initiative": https://jea.acm.org/rcr_initiative.cfm . Papers that pass this test will get a stamp certifying that its results are reproducible. The xor filter paper (where I'm a co-author) took part in this initiative.
ugh -- reading related papers on filters like the above exposed another pre-req I am missing: enough math background to follow the math in the articles :/
Depending on how you look at it, bloom filters have been surpassed. In practice, though, I find there are implementation-driven factors that can push bloom filters back to the top choice.
Starting with language: you might not find what you need (e.g. vacuum is C only so far, afaik). The complexity of the algorithm may then force you to decide that porting is problematic (understandability, effort, cost of verifying the port).
Then you get to the practical things: do you need to handle threads, merging, serialization, etc.? After all of that, you might find the only unicorn is bloom, due to its simplicity and wide implementation availability.
In my case, after lots of testing I ended up extending fastfilter Java's bloom filter by adding merge (trivial in bloom; see the sketch below), serialization, a move to the XX3 hash, and simplified constructors. Doing all of this in, say, Cuckoo or vacuum would have been a lot of work; I'm not certain offhand if or how merging might even work in those.
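For what it's worth, the reason merge is trivial in a Bloom filter is that the union of two filters is just the bitwise OR of their bit arrays, provided they were built with the same size and hash functions. A small sketch (illustrative only, not the fastfilter_java API):

    import java.util.BitSet;

    // Merging Bloom filters: OR-ing the bit arrays of two compatible filters
    // yields exactly the filter you would get by inserting both key sets into one.
    class MergeableBloom {
        final BitSet bits;
        final int m;   // number of bits
        final int k;   // number of hash functions

        MergeableBloom(int m, int k) {
            this.bits = new BitSet(m);
            this.m = m;
            this.k = k;
        }

        // In-place union; only valid if both filters share m, k and the hash function.
        void mergeFrom(MergeableBloom other) {
            if (other.m != m || other.k != k) {
                throw new IllegalArgumentException("filters are not compatible");
            }
            bits.or(other.bits);
        }
    }

Cuckoo and xor/ribbon-style filters don't have that property: entries are fingerprints placed relative to the whole table, so merging two of them generally means re-inserting or rebuilding from the original keys.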