While bloom filters are nice for understanding, as far as I know they've been surpassed by Cuckoo filters, and now (maybe?) Vacuum Filters.
I built a torrent-like streaming application for video a few years ago on WebRTC, and bloom filters made it easy to know whether a peer would have a specific block. The false positives of the bloom filter were fine too, as the ~1% of cases where it requested a block and got a failure back were not a bottleneck for the application.
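For anyone who hasn't seen one, here's a minimal sketch in Java of the kind of lookup described above (class and method names like PeerBlockFilter and mightHave are made up for illustration, not the application's actual code):

    import java.util.BitSet;

    // Minimal sketch: a peer advertises the blocks it has as a Bloom filter, and
    // we test a block id against that filter before sending a request.
    // A "false" answer is definite; a "true" answer is only probable.
    class PeerBlockFilter {
        private final BitSet bits;
        private final int m;   // number of bits
        private final int k;   // number of hash functions

        PeerBlockFilter(int m, int k) {
            this.bits = new BitSet(m);
            this.m = m;
            this.k = k;
        }

        void add(long blockId) {
            long h = mix(blockId);
            for (int i = 0; i < k; i++) {
                bits.set(index(h, i));
            }
        }

        boolean mightHave(long blockId) {
            long h = mix(blockId);
            for (int i = 0; i < k; i++) {
                if (!bits.get(index(h, i))) {
                    return false;   // definitely not present
                }
            }
            return true;            // probably present (small false-positive rate)
        }

        // Derive the i-th bit position from one 64-bit hash (double hashing).
        private int index(long hash, int i) {
            long h = hash + (long) i * ((hash >>> 32) | 1);
            return (int) Math.floorMod(h, (long) m);
        }

        // SplitMix64-style finalizer standing in for a real hash function.
        private static long mix(long x) {
            x += 0x9E3779B97F4A7C15L;
            x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
            x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
            return x ^ (x >>> 31);
        }
    }

With roughly 10 bits per block id and 7 hash functions, the false-positive rate lands around 1%, which matches the "occasional wasted request" trade-off described above.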
For static sets (where you construct the filter once and then only use it for lookups), blocked Bloom filters have the fastest lookups. They do need a bit more space (maybe 10% more than regular Bloom filters). Also very fast are binary fuse filters (which are new) and xor filters, and they also save a lot of space compared to the others. Cuckoo filters, ribbon filters, and Bloom filters are a bit slower. It's really a trade-off between space and lookup speed.
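To make the blocked variant concrete, here is a minimal sketch in Java, assuming 512-bit blocks (one cache line); the names are illustrative, not from any particular library:

    // Minimal blocked Bloom filter sketch: 512 bits per block (eight 64-bit
    // words). All k probes for a key land in the same block, so a lookup
    // touches a single cache line instead of k random ones; that is where
    // the lookup speed comes from.
    class BlockedBloomFilter {
        private static final int WORDS_PER_BLOCK = 8;   // 8 * 64 = 512 bits
        private final long[] data;
        private final int numBlocks;
        private final int k;

        BlockedBloomFilter(int numBlocks, int k) {
            this.data = new long[numBlocks * WORDS_PER_BLOCK];
            this.numBlocks = numBlocks;
            this.k = k;
        }

        void add(long key) {
            long h1 = mix(key);
            long h2 = mix(h1);
            int base = (int) Math.floorMod(h1, (long) numBlocks) * WORDS_PER_BLOCK;
            for (int i = 0; i < k; i++) {
                int bit = bitInBlock(h2, i);
                data[base + (bit >>> 6)] |= 1L << (bit & 63);
            }
        }

        boolean mightContain(long key) {
            long h1 = mix(key);
            long h2 = mix(h1);
            int base = (int) Math.floorMod(h1, (long) numBlocks) * WORDS_PER_BLOCK;
            for (int i = 0; i < k; i++) {
                int bit = bitInBlock(h2, i);
                if ((data[base + (bit >>> 6)] & (1L << (bit & 63))) == 0) {
                    return false;
                }
            }
            return true;
        }

        // Pick the i-th bit position inside the 512-bit block (double hashing).
        private static int bitInBlock(long h2, int i) {
            return (int) Math.floorMod(h2 + (long) i * ((h2 >>> 32) | 1), 512L);
        }

        private static long mix(long x) {
            x += 0x9E3779B97F4A7C15L;
            x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
            x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
            return x ^ (x >>> 31);
        }
    }

Confining every key to a single block is also where the roughly 10% space overhead comes from: the bits are spread less evenly than in a regular Bloom filter, so a slightly larger array is needed for the same false-positive rate.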
For dynamic sets (where you can add and remove entries later), the fastest (again for lookups) is the "Succinct counting blocked Bloom filter" (no paper for it yet): it is a combination of a blocked Bloom filter and a counting Bloom filter, so lookup is identical to the blocked Bloom filter. Then come cuckoo filters and counting Bloom filters.
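The counting Bloom filter is the easiest of the dynamic structures to picture; a minimal sketch in Java (my own illustration, not code from any of the projects mentioned here):

    // Minimal counting Bloom filter sketch: each bit of a plain Bloom filter
    // becomes a small counter, so entries can be removed again. Real
    // implementations usually pack 4-bit counters; a byte[] keeps the sketch short.
    class CountingBloomFilter {
        private final byte[] counters;
        private final int m;   // number of counters
        private final int k;   // number of hash functions

        CountingBloomFilter(int m, int k) {
            this.counters = new byte[m];
            this.m = m;
            this.k = k;
        }

        void add(long key) {
            long h = mix(key);
            for (int i = 0; i < k; i++) {
                int idx = index(h, i);
                // Saturate instead of wrapping around; once a counter saturates it
                // can no longer be decremented safely, which is the "overflow risk"
                // that small fixed-width counters have.
                if (counters[idx] != Byte.MAX_VALUE) {
                    counters[idx]++;
                }
            }
        }

        void remove(long key) {
            long h = mix(key);
            for (int i = 0; i < k; i++) {
                int idx = index(h, i);
                if (counters[idx] > 0) {
                    counters[idx]--;
                }
            }
        }

        boolean mightContain(long key) {
            long h = mix(key);
            for (int i = 0; i < k; i++) {
                if (counters[index(h, i)] == 0) {
                    return false;
                }
            }
            return true;
        }

        private int index(long hash, int i) {
            return (int) Math.floorMod(hash + (long) i * ((hash >>> 32) | 1), (long) m);
        }

        private static long mix(long x) {
            x += 0x9E3779B97F4A7C15L;
            x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
            x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
            return x ^ (x >>> 31);
        }
    }

With the usual 4-bit counters a counting Bloom filter needs about four times the space of a plain Bloom filter, which is why the blocked/succinct combination and cuckoo filters compare favourably on space.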
Incidentally, for people who don't know this already: if you want to just generally Google these data structures and learn about them, the search term is "succinct data structure". There's been a lot of interesting work in this area. And they are not lossy in general; obviously lossy structures are interesting, but there are some mind-blowing non-lossy succinct data structures as well.
It should be relatively easy to port it to other programming languages.
Compared to regular counting Bloom filters, it has some advantages (e.g. it uses half the space, lookup is much faster, and there is no risk of counter overflow). It has one disadvantage: add and remove are slower (currently). Cuckoo filters need less space, but otherwise the advantages and disadvantages are the same.
For more questions, just open an issue on that project (I'm one of the authors).
This paper seems to use an un-optimized Bloom filter implementation. It also lists cuckoo filters as being 10 times faster than Bloom filters. In other papers, e.g. the cuckoo filter paper and the xor filter paper, the difference is _much_ smaller; maybe a factor of 2.
Yes. Some variants of ribbon filters are very fast, and other variants (which are a bit slower, though) need very little space: only a few percent more than the theoretical lower bound.
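To put "a few percent more than the theoretical lower bound" into concrete numbers (the 1.44 factor for classic Bloom filters is the standard textbook figure):

    // Back-of-the-envelope numbers: a filter with false-positive rate eps needs
    // at least log2(1/eps) bits per key; a classic Bloom filter needs about 1.44x that.
    public class FilterSpace {
        public static void main(String[] args) {
            double eps = 0.01;                                     // 1% false-positive rate
            double lowerBound = Math.log(1.0 / eps) / Math.log(2); // ~6.64 bits per key
            double bloom = 1.44 * lowerBound;                      // ~9.57 bits per key
            System.out.printf("lower bound:  %.2f bits/key%n", lowerBound);
            System.out.printf("Bloom filter: %.2f bits/key (~44%% overhead)%n", bloom);
        }
    }

So at a 1% false-positive rate the floor is about 6.64 bits per key, a Bloom filter needs roughly 9.6, and the space-optimized ribbon variants sit just above the floor.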
Is anyone curating the latest developments in this space? I feel like I only stumbled upon Vacuum filters and now Ribbon filters because of the top comment on an HN thread. Is anyone doing a "latest developments in the data structure/algo space" roundup?
That's a good question! This paper also mentions "Dele. BF [43]" - "The Deletable Bloom Filter: A New Member of the Bloom Family" https://www.researchgate.net/publication/260668624_The_Delet... . As far as I can see, lookup is identical to the regular Bloom filter. Yet the vacuum filter paper lists the throughput of the Deletable Bloom Filter as almost twice that of the Bloom filter (1.95x). How can that be? Maybe I misunderstand something here.
Yes, peer review is supposed to catch completely incorrect numbers. But in many cases a reviewer doesn't have access to the source code and has no way to re-run the tests.
There's the "Replicated Computational Results initiative": https://jea.acm.org/rcr_initiative.cfm . Papers that pass this test will get a stamp certifying that its results are reproducible. The xor filter paper (where I'm a co-author) took part in this initiative.
ugh -- reading related papers on filters like the above exposed another pre-req I am missing: enough math background to follow the math in the articles :/
Depending on how you look at it, bloom filters have been surpassed. In practice, though, I find there are implementation-driven factors that can push bloom filters back to the top choice.
Starting with language: you might not find what you need (e.g. vacuum is C only so far, afaik). The complexity of the algorithm may then force you to decide that porting is problematic (understandability, effort, cost of verifying the port).
Then you get to the practical things: do you need to handle threads, merging, serialization, etc.? After all of that, you might find the only unicorn is bloom, due to its simplicity and wide implementation availability.
In my case, after lots of testing I ended up extending fastfilter Java's bloom filter by adding merge (trivial in bloom; see the sketch below), serialization, a move to the XX3 hash, and simplified constructors. Doing all of this in, say, Cuckoo or vacuum would have been a lot of work; I'm not certain offhand if or how merging might even work in those.
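For what it's worth, the reason merge is trivial in a Bloom filter is that the union of two filters is just the bitwise OR of their bit arrays, provided they were built with the same size and hash functions. A small sketch (illustrative only, not the fastfilter_java API):

    import java.util.BitSet;

    // Merging Bloom filters: OR-ing the bit arrays of two compatible filters
    // yields exactly the filter you would get by inserting both key sets into one.
    class MergeableBloom {
        final BitSet bits;
        final int m;   // number of bits
        final int k;   // number of hash functions

        MergeableBloom(int m, int k) {
            this.bits = new BitSet(m);
            this.m = m;
            this.k = k;
        }

        // In-place union; only valid if both filters share m, k and the hash function.
        void mergeFrom(MergeableBloom other) {
            if (other.m != m || other.k != k) {
                throw new IllegalArgumentException("filters are not compatible");
            }
            bits.or(other.bits);
        }
    }

Cuckoo and xor/ribbon-style filters don't have that property: entries are fingerprints placed relative to the whole table, so merging two of them generally means re-inserting or rebuilding from the original keys.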