In this four-part series on counting at scale, we talk about sketching algorithms and how we apply sketching at Kiip to keep accurate counts at high scale.
Yes or No?
In the first part of this series, we discussed the concepts, motivations and a few applications of sketching. In particular, sketching algorithms and data structures are handy for solving a variety of counting problems in sub-linear (sometimes constant) space complexity.
In this post, Part 2 of our Sketching & Scaling series, we talk about conditionally counting keys of arbitrary size using just a small amount of memory by asking yes-or-no questions to a bloom filter.
The Answer Is All You Need
Many questions in counting can be answered by a simple yes or no. Have we seen user X before? Can we trust that IP? Does Melville use the word “banana” in Moby Dick? Each of these questions can be answered by building a set, then querying whether or not a given input value is a member of the set. We could hash each word in Moby Dick into a hash table, for example, then hash “banana” to see if we get any collisions.
Often, only the yes or no is important, and the value about which the question is asked is irrelevant. In that case, we can use a sketching data structure called a bloom filter to get the answer without storing anything extra.
Enter Bloom Filters
The bloom filter data structure is effectively a hash table where collisions are ignored and each element added to the table is hashed by some number k hash functions. There is one major difference, however: a bloom filter does not store the hashed keys. Instead, it has a bit array as its underlying data structure; each key is remembered by flipping on all of the bits the k functions map it to. Querying the filter will report that a given key is either “probably in” (yes) or “definitely not in” (no) the set. For many applications, that’s all the information that we need.
Bloom filters have two primary operations: insertion and membership querying. In general, removal is not supported since it would introduce the possibility of false negatives. Additional common set operations can be performed, but we won’t go into them here.
Insertion is performed by hashing a key by the bloom filter’s k functions to get k different indices in the underlying array. The bits at all positions below are set to 1.
Shown above, the key “banana” is inserted into a bloom filter with three hash functions.
Once keys have been inserted into the filter, we can start asking it questions.
Querying is performed by hashing a key over the filter’s k functions and reading the bit at each of the resulting indices. All 1‘s indicates that the filter probably contains the key.
Above, we query for the key “banana“. We have already inserted it, so we get a “yes” answer from the filter.
If any 0‘s are found, the key is definitely not in the set, since the bits would have all been set to 1 upon insertion.
Here, we query for the key “grape“, which has never been inserted into the filter. A 0 is found, so the filter gives us a definitive “no” response.
One Caveat: False Positives
Even though the filter tells us “yes” in the event that all 1‘s are found, it is possible that the key we are querying might never have been inserted. All of its bits may have been set to 1 by prior insertion of different keys.
Here we query for a “new key” that we have never seen before. Prior insertions have flipped all of its bits on. Even though the filter gives us a “yes” answer, the truth is that the key is not in the filter.
Advantages and Error Rates
Since the bloom filter does not store any of the keys themselves, it enjoys a sizable spatial advantage over normal hash tables with keys of arbitrary size. Bloom filter membership is bounded only by the rate of false positives. This rate will increase with the size of the input, reaching 100% false positives when the filter contains no more 0 bits.
The rate of false positive queries is proportional to the size of the underlying array and the number of elements being hashed. Smaller filters will enjoy a smaller footprint but will see more false positives and vice-versa for larger filters.
Bloomd and Dynamic Bloom Filters
Ideally, over time we could maintain a predictable rate of false positives. As more values are hashed into a static filter, the false positive rate will soon rise to a point where the filter becomes meaningless. The ideal bloom filter for counting large sets is one that can hold its error rate constant and change its size based on the size of the set it represents.
At Kiip we use a tool called bloomd that does just that. Bloomd is a server built in C that exposes set operations over scalable bloom filters. We use a dedicated bloom filter service because of the large metrics fan out that we see with each request to our API. Bloomd handles our needs without flinching, even over a large number of connections. We configure it at a 0.1% false positive rate and have three primary use cases:
- Tracking uniqueness of users and devices
- Tracking user retention over a range of intervals
- Measuring reward engagement rates
More on Bloom Filters
- Fiddle with an interactive bloom filter widget
- Read Burton Howard Bloom’s original paper (PDF)
- Read about scalable bloom filters
In Part 3 of this series we will count 1 trillion unique items within just 3KB of memory, using wizardry, basketball and the HyperLogLog.