I've given a talk at Papers We Love about Bloom filters (http://paperswelove.org/2015/video/armon-dadgar-on-bloom-fil...) for those wanting to learn more. There have been quite a few follow-on extensions to the standard Bloom filter that address some of the shortcomings mentioned here.
For example, the cost of computing K hash functions can be avoided with some simple math. This is discussed in "Less Hashing, Same Performance" (http://www.eecs.harvard.edu/%7Ekirsch/pubs/bbbf/esa06.pdf). You can also keep the false-positive probability (FPP) from saturating at 100% using a technique called "Scalable Bloom Filters" (http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf). The basic idea is that you start with a small filter and layer on progressively larger filters as it fills. This also means you don't have to allocate all the memory up front to keep the FPP from saturating. As others have mentioned, Counting Bloom Filters allow for counts and deletion as well. The basic point is that there has been a _lot_ of additional work on Bloom filters. One of my favorites is "Adaptive Range Filters" (http://www.vldb.org/pvldb/vol6/p1714-kossmann.pdf).
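For anyone curious, the double-hashing trick is simple enough to sketch in a few lines. This is my own illustration (names and parameters are made up, not from the paper): all k probe indices are derived from two base hashes, so each item is really hashed only once.

```python
import hashlib

def bloom_indices(item: bytes, k: int, m: int) -> list[int]:
    """Simulate k hash functions from two base hashes, per Kirsch &
    Mitzenmacher: g_i(x) = (h1(x) + i * h2(x)) mod m."""
    digest = hashlib.sha256(item).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the stride is nonzero
    return [(h1 + i * h2) % m for i in range(k)]

print(bloom_indices(b"cat", k=7, m=140))  # 7 indices from a single SHA-256 call
```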
At an advertising company I previously worked for, we relied on Bloom filters and HyperLogLogs quite heavily for our real-time analytics pipeline. We open sourced our C daemons (https://github.com/armon/bloomd and https://github.com/armon/hlld), which have been running in production for several years.
Neat visualization. One of the nicest things about a Bloom filter is that you'll never get a false negative (it will never say something is absent when it is actually present). I'm not super familiar with Cuckoo filters, but since they allow removal, it seems possible to get false negatives.
You can replicate that by entering "cat" 5 times and then removing it 4 times. The Cuckoo filter will indicate it's not in the filter despite one copy still being there. I guess that's the "limited counting" aspect of it. Really cool.
Based on the explanation, cuckoo filters need to reject inserts that would break the invariants, which the demo doesn't seem to do.
You are right that deletion from a probabilistic filter is sort of weird, because it is possible to delete something that was never inserted if it happens to match as a false positive. If you can guarantee that such false deletions don't happen, then it should work as expected. There are variants of Bloom filters that can do this as well.
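A counting Bloom filter is the simplest such variant. Here is a toy sketch (a hypothetical structure of my own, just to make the deletion contract concrete):

```python
class CountingBloom:
    """Toy counting Bloom filter: each slot is a counter rather than a bit."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _indices(self, item: str) -> list[int]:
        # Stand-in for k independent hash functions.
        return [hash((i, item)) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for i in self._indices(item):
            self.counters[i] += 1

    def remove(self, item: str) -> None:
        # Contract: the caller must know the item is really in the filter.
        # Removing a value that merely *looks* present (a false positive)
        # decrements counters shared with real members.
        for i in self._indices(item):
            self.counters[i] -= 1

    def __contains__(self, item: str) -> bool:
        return all(self.counters[i] > 0 for i in self._indices(item))
```

Note how remove() blindly decrements: if the item was never added but all k of its counters happen to be nonzero, the decrements can zero out counters owned by real members, producing false negatives.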
I'm not up to arguing about this, but is that not related to the cuckoo filter (this implementation?) seeming to have a limit of four repeats? From the cuckoo description: "When a matching fingerprint is not found, the entry is definitely not in the filter."
It isn't clear in the demo visualization, but insertions into the Cuckoo filter do get rejected after exhausting a maximum number of kick/relocation attempts. If I find some time I'll make that more apparent.
As for false negatives, this can happen in both Cuckoo and Counting Bloom filters if a value that was never added is removed, and the value would have resulted in a false positive on query. The contract for removal is that you must know that the value is currently in the filter. I'll clarify that in the text.
I'm a little confused by the Big-Oh notation O(2). In my CS classes we were taught that O(c) = O(1). So seeing something like 'O(1) to O(2)' is a bit odd.
But I love the visualization - it's very clear how things are working conceptually.
It would seem that the author of the article is abusing Big-O notation. When saying some operation is O(f(n)), they probably mean that the operation takes exactly f(n) steps, rather than the usual meaning of asymptotic complexity.
While you're right, sometimes people (mis)use it to avoid hiding the constants.
O(2) is a weird one, but say O(5n) can be useful, if not technically correct. I guess O(2) is vaguely useful wrt Cuckoo hashing and filters to denote the 2 lookups.
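Concretely, the 2 comes from partial-key cuckoo hashing: every item has exactly two candidate buckets, so a lookup never touches more than two. A sketch under my own assumptions (helper names are made up; buckets are small lists of fingerprints):

```python
import hashlib

def _h(data: bytes) -> int:
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def candidate_buckets(item: bytes, num_buckets: int):
    """Partial-key cuckoo hashing: i2 = i1 XOR hash(fingerprint), so either
    bucket index is recoverable from the other plus the stored fingerprint.
    num_buckets must be a power of two for the XOR mapping to stay in range."""
    fp = _h(b"fp:" + item) & 0xFF                    # 8-bit fingerprint
    i1 = _h(item) & (num_buckets - 1)
    i2 = (i1 ^ _h(bytes([fp]))) & (num_buckets - 1)
    return fp, i1, i2

def lookup(buckets: list[list[int]], item: bytes) -> bool:
    fp, i1, i2 = candidate_buckets(item, len(buckets))
    return fp in buckets[i1] or fp in buckets[i2]    # at most 2 bucket reads
```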
Guilty. I wanted to be up-front about the constant factor of 1 or 2. I was tired when I wrote that and was actually looking at cleaning that up a couple days ago, but got distracted. Soon.
To further clarify: if a Bloom filter query is O(k), where k is the number of hash functions, then even when k=7 the filter could be described as O(1).
The constant factors here are important for practical use. Hashing 7 times is significantly different than hashing 1 or 2 times.
For the original commenter, give me a suggestion if you don't mind. How would you expect to see that info depicted in a comparison chart so that it's honest and obvious?
This is a cool visualization, but the percentages for 140 bits aren't really telling you much. If you need to use this in production, check the original papers (there was an improved analysis recently [0]) and test your own production loads. (Another point: with cuckoo filters you can greatly reduce false positives, if your application doesn't tolerate them well, by increasing the key size.)
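To illustrate that knob, a back-of-the-envelope bound of my own (following the analysis in the original cuckoo filter paper): a negative query compares against at most 2b stored fingerprints, where b is the bucket size, and each comparison matches an f-bit fingerprint with probability 1/2^f.

```python
def cuckoo_fpp_bound(bucket_size: int, fingerprint_bits: int) -> float:
    # P(at least one of the 2b stored fingerprints matches by accident)
    return 1 - (1 - 2 ** -fingerprint_bits) ** (2 * bucket_size)

print(cuckoo_fpp_bound(4, 8))   # ~3.1%   with  8-bit fingerprints
print(cuckoo_fpp_bound(4, 16))  # ~0.012% with 16-bit fingerprints
```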
Also, I notice this comparison isn't using easily comparable implementations -- e.g. this Bloom filter does not support delete, but Bloom filter implementations which support delete operations do exist.
Also the "Insertions may be rejected" thing would be a deal-breaker for a lot of applications.
If you can't guarantee that an insert will be successful, then you have to be willing to either tolerate false positives and false negatives, or throw away the cuckoo filter and re-create it from scratch.
Yeah, absolutely. You get the same "feature" from Counting Blooms - i.e. if an insertion would overflow any of the counters, the filter can knowingly reject the insertion.
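Sketched against the toy CountingBloom from earlier in the thread (assuming the common choice of 4-bit counters), rejection might look like:

```python
MAX_COUNT = 15  # 4-bit counters: a typical space/overflow trade-off

def add_checked(cbf: "CountingBloom", item: str) -> bool:
    """Insert only if no touched counter is already saturated; otherwise
    knowingly reject, much like a full cuckoo filter rejects inserts."""
    idx = cbf._indices(item)
    if any(cbf.counters[i] >= MAX_COUNT for i in idx):
        return False
    for i in idx:
        cbf.counters[i] += 1
    return True
```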
This is both a feature and a limitation. In our use case, our filter grows predictably, and every few months a rejected insertion triggers a resize/rebuild from source data. Because we could tolerate a full rebuild, we chose for now not to implement the technique of growing the filter by adding successively larger filter segments with lower fpp guarantees, though that is an option.
Regarding the second point on deletions, the cool thing about Bin Fan's work on the Cuckoo filter is that you get the added benefit of deletion in almost exactly the same memory footprint as a standard Bloom filter. This was the critical feature that led us to implement the structure for our use case.
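Deletion is just the mirror image of lookup. A sketch reusing candidate_buckets from the lookup sketch further up the thread:

```python
def delete(buckets: list[list[int]], item: bytes) -> bool:
    """Remove one stored copy of the item's fingerprint from either of its
    two candidate buckets. Per the deletion contract discussed above, the
    caller must know the item was actually inserted."""
    fp, i1, i2 = candidate_buckets(item, len(buckets))
    for i in (i1, i2):
        if fp in buckets[i]:
            buckets[i].remove(fp)  # drops a single copy; duplicates remain
            return True
    return False
```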
Counting Bloom filters exist (but use more space). But your link estimates the number of elements inserted into the filter, which is something different. "Counting" is usually used to refer to the count per key, as in "how often did I insert key x?".
> Why are these data structures called "filters" rather than hashes?
I would guess because they are probabilistic data structures which are information-lossy. Bloom filters and Cuckoo filters aren't guaranteed to give you a correct result, and they cannot retrieve the original input. Cuckoo filters of course take their name from Cuckoo hashing.
The application to streaming data is very similar to that of signal-processing filters in electrical engineering. For example, there you have high-pass/low-pass filters and more complex components that clean up a signal for downstream consumption.