In this four-part series on counting at scale, we talk about sketching algorithms and how we apply sketching at Kiip to keep accurate counts at high scale.
Previously in this series we gave an overview of why sketching makes counting problems easier at scale. We discussed one type of sketching data structure called a bloom filter and how it can be used for conditional counting of large sets. At Kiip we use tool called bloomd to do bloom filter operations.
In this post, we talk about how it is possible to estimate the cardinality of streaming sets within a small and constant space complexity.
All the Flavor, None of the Guilt
Often when counting, the only thing that matters is the count itself. For a large set, the count is the cardinality of the set. Values counted, history of values and retrieval of values are all unnecessary expenses. In such a case, we can formulate our counting problem as the question, “How many X are in Y?”
- How many people have iPhones in China?
- How many ants are in an Argentine Ant supercolony?
- How many different words are in James Joyce’s Ulysses?
At first glance it may seem simple to answer these questions: we could just count. Note that each question implies uniqueness among elements in the count. If we were just counting, how would we determine whether a given ant in a colony had already been counted? We would need to keep some history of previously-seen ants in order to check if we had seen one before, otherwise our final count would be bloated with duplicates.
In the case of extra large sets and streams, storing unique elements to avoid counting duplicates quickly becomes a memory problem. We could buy more memory or we could use a magic trick to estimate the counts without remembering anything we’ve already seen.
David Copperfield Has Nothing on HyperLogLog
This is as close to real magic as it gets.
Like other sketching algorithms, HyperLogLog works by asserting that “close” is good enough. It can accurately estimate the cardinality of a streaming set by making one extremely clever observation: the probability of doing the same thing many times in a row is predictably low. To put it another way, the occurrence of any result with a known probability contains information about how many attempts were made to get it.
Take basketball, for example. Let’s say you shoot 50% from the free throw line. We have an innate qualitative understanding that at such a rate it is far easier to hit two free throws in a row than to hit 100. If you were to tell any sane person that you sunk 100 free throws in a row at 50% probability, that person would be quick to point out that your pants were on fire. In fact, the probability of sinking 100 shots in a row would be 1 in 2100, or impossible (since we’re sketching).
In HyperLogLog, instead of counting free throws, we count the number of consecutive zeroes in the binary representation of the hash for each element in the set we want to count:
We store only the maximum number of consecutive zeroes, and use that number to predict the cardinality of the entire stream. We assume a “good” hashing function is used, to ensure that the probability of seeing any single pattern is equal to all others. For example, given four bits there exist only 16 possible representations, with patterns of max consecutive zeroes shown below.
So, if in our stream the highest number of consecutive zeroes were three (000), we could use the fact that the probability of seeing that pattern is 2 in 16 (or 1 in 8) to conclude that we saw 8 elements to reach that pattern. In other words, the cardinality of our streaming set is 8.
Using this technique, we can count an entire set by remembering one single number. If that isn’t amazing, we don’t know what is.
That’s the meat of the algorithm. Astute readers will note though that the estimate of 8 is likely skewed. The 000 pattern had a good chance of being seen at any point within the first 8 elements of the stream. The actual HyperLogLog algorithm uses two tricks to dramatically increase the estimation accuracy: averaging many estimates to reduce bias and harmonic mean to reduce the impact of outliers.
Reducing Bias by Averaging Estimates
As mentioned above, there is a good chance that one single estimate is skewed. To improve the accuracy of our count, we can use the simple trick of taking the average of many estimates. We can use a few of the bits of each element to route it to a different “bucket”, in order to partition our estimates consistently. Each bucket holds only one value: the maximum number of consecutive zeroes that it has seen. The remaining bits become the space in which we count consecutive zeroes.
For example, take the following arbitrary bit string as an element of the set we are counting:
In this example, bucket 8 would be assigned the value 4. We can represent our buckets using indices of a simple array, which is our HyperLogLog:
The consecutive zeroes of the hash for each element in the stream are counted and using this system the result is routed to its appropriate bucket. Over time, the array will fill up with values corresponding to the highest number of consecutive zeroes seen in the stream. The cardinality of the set is calculated using the average among the values in each bucket:
The resulting value is plugged into the HyperLogLog Distinct Value formula to acquire the cardinality estimate.
We’ve been saying “average” in the interest of a more qualitative description, but the HyperLogLog algorithm actually uses harmonic mean instead of arithmetic mean in order to eliminate large outliers. It’s a trick that was so effective in practice it merited changing the name of this algorithm from LogLog to HyperLogLog.
HLLD, HyperLogLog in Practice
At Kiip, we use a tool called HLLD to estimate cardinality of large sets. It implements a variation of the HyperLogLog algorithm published by Google Research in 2013. The variation uses a 6-bit register and improves accuracy and memory efficiency for sets into the trillions. HLLD itself is a network daemon built in C . It is extremely fast and memory-efficient. We use it primarily for gathering demographics information about users on our network.
We use a dedicated service for HyperLogLog operations because of the large metrics fan out that we see with each request and because handling that load with PostgreSQL turned out to be both impossible and ill-advised. We were thrilled to read a few weeks ago that Redis now supports HyperLogLogs as well, but we have yet to play with the feature in our Redis cluster.
More on HyperLogLogs:
- The HLLD source code
- The original HyperLogLog paper by Flajolet et. al.
- The canonical HLL article at Aggregate Knowledge (which this post and probably all recent HLL content borrows from)
- The Google Research HLL variation implemented in HLLD.
In the final part of this series, we’ll drink from a firehose without getting our shirts wet.