
Table 1 here shows vacuum filters being 11x faster than Bloom filters: https://www.vldb.org/pvldb/vol13/p197-wang.pdf


This paper seems to use an un-optimized Bloom filter implementation. It also lists cuckoo filters as being 10 times faster than Bloom filters. In other papers, e.g. the cuckoo filter paper and the xor filter paper, the difference is _much_ smaller; maybe a factor of 2. (A sketch of the kind of optimization I mean follows the links below.)

https://dl.acm.org/doi/pdf/10.1145/2674005.2674994 - figures 5 and 6

https://arxiv.org/abs/1912.08258 - figures 3b and 4b
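
A big part of why optimized Bloom filters are fast is that they don't compute k independent hash functions. Here's a minimal C sketch of the Kirsch-Mitzenmacher double-hashing trick that fast implementations typically use; the names (bloom_t, bloom_contains) are made up for illustration, not taken from any of the papers above:

    #include <stdbool.h>
    #include <stdint.h>

    /* Derive all k probe positions from one 64-bit hash instead of
     * computing k independent hashes (Kirsch-Mitzenmacher). nbits is
     * assumed to be a power of two so masking replaces modulo. */
    typedef struct {
        uint64_t *bits;  /* bit array, nbits bits long */
        uint64_t nbits;  /* number of bits (power of two) */
        int k;           /* number of probes per key */
    } bloom_t;

    static bool bloom_contains(const bloom_t *f, uint64_t hash) {
        uint64_t h1 = hash;
        uint64_t h2 = (hash >> 32) | (hash << 32); /* cheap second hash */
        for (int i = 0; i < f->k; i++) {
            uint64_t idx = (h1 + (uint64_t)i * h2) & (f->nbits - 1);
            if (!(f->bits[idx >> 6] & (1ULL << (idx & 63))))
                return false; /* a zero bit: definitely not in the set */
        }
        return true; /* all probed bits set: possibly in the set */
    }

Benchmarking a naive version that re-hashes the key k times against something like this can easily account for a large constant-factor difference.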



Yes. Some variants of ribbon filters are very fast, and other variants (which are a bit slower, though) need very little space: only a few percent more than the theoretical lower bound.

For static sets, ribbon filters and binary fuse filters (e.g. here: https://github.com/FastFilter/xor_singleheader) are very competitive. Both are based on recent (2019 and newer) theoretical work by Stefan Walzer, e.g. this one: https://arxiv.org/pdf/1907.04749.pdf
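
To make the lookup side concrete, here's a hedged C sketch of an xor-filter-style membership test, the structure behind that library. Construction (solving for the fingerprint array) is omitted, and the hashing here is a stand-in for illustration, not the library's actual API or code:

    #include <stdbool.h>
    #include <stdint.h>

    /* splitmix64-style mixer, used here only for illustration. */
    static uint64_t mix(uint64_t x) {
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    typedef struct {
        uint8_t *fingerprints; /* filled at build time (not shown) */
        uint32_t seg;          /* segment length; table has 3 * seg slots */
    } xor8_t;

    static bool xor8_contain(const xor8_t *f, uint64_t key) {
        uint64_t h = mix(key);
        /* One probe position per segment, derived from the key's hash. */
        uint32_t i0 = (uint32_t)(mix(h + 1) % f->seg);
        uint32_t i1 = f->seg + (uint32_t)(mix(h + 2) % f->seg);
        uint32_t i2 = 2 * f->seg + (uint32_t)(mix(h + 3) % f->seg);
        uint8_t fp = (uint8_t)h; /* 8-bit fingerprint of the key */
        /* Construction picks the array so that, for every inserted key,
         * the xor of its three slots equals its fingerprint. */
        return fp == (f->fingerprints[i0] ^ f->fingerprints[i1]
                      ^ f->fingerprints[i2]);
    }

The whole query is three array reads and two xors, which is a big part of why these filters are so competitive once the set is static.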


Is anyone curating the latest developments in this space? I feel like I only stumbled upon vacuum filters and now ribbon filters because of the top comment on an HN thread. Is anyone doing a "latest developments in the data structure/algo space" roundup?


Bonus points if someone starts this and breaks it down for the more pedestrian-IQ'd folks like myself.


Thanks for this context!

I wonder how the peer review process missed the use of an inefficient Bloom filter implementation.


That's a good question! This paper also mentions "Dele. BF [43]", i.e. "The Deletable Bloom Filter: A New Member of the Bloom Family": https://www.researchgate.net/publication/260668624_The_Delet... . As far as I can see, lookup is identical to that of the regular Bloom filter. Yet the vacuum filter paper lists the throughput of the deletable Bloom filter as almost twice that of the Bloom filter (1.95x). How can that be? Maybe I misunderstand something here.

Yes, peer review is supposed to catch completely incorrect numbers. But in many cases, a reviewer doesn't have access to the source code and has no way to re-run the tests.

There's the "Replicated Computational Results" initiative: https://jea.acm.org/rcr_initiative.cfm . Papers that pass this review get a stamp certifying that their results are reproducible. The xor filter paper (where I'm a co-author) took part in this initiative.


ugh -- reading related papers on filters like the ones above exposed another prerequisite I'm missing: enough math background to follow the math in the articles :/



