In this four-part series on counting at scale, we talk about sketching algorithms and how we apply sketching at Kiip to keep accurate counts at high scale.
By this point, we’ve talked about why sketching is useful and a couple of ways sketching algorithms can help us count smarter. We answered yes-or-no problems with a bloom filter and counted the distinct values of huge sets in just a few tweets’ worth of space with HyperLogLogs.
That’s all fine and good, but up until now all of our counts have existed only in memory. Even though we are saving tons space by sketching, we’ll still have to move our counts to a persistent store for safe keeping. The problem is that at high volume, storing metrics is like drinking from a firehose.
In this post, we talk about how to chug like an Irish fish by aggregating counts and using sketching to compress a normally distributed stream.
A Penny Saved is a Penny Earned
Arguably the most important factor in counting is the ability to play with our counts once we’re done counting. That means we’ll have to move our counts out of memory and into a persistent store. But how to do so? The reason we’re sketching in the first place is because we have so much data coming at us, we can’t feasibly capture it in memory, let alone write it to disk. If we want to avoid overloading our database, we need to write to it far less frequently than the rate at which we’re counting.
We can make one more simple observation about our counts that we haven’t made so far: counting is associative. We can combine like counts by adding them together in memory, in order to “save up” for a while and limit the number of times we have to write to the database.
The most straightforward way to aggregate counts is to assign them unique keys and hash them to running totals in memory. Periodically, we flush the total counts of each key to its corresponding row in the database.
There are a number of tools available that do metrics aggregation using this technique. Perhaps the most famous is Etsy’s StatsD. At Kiip, we use a metrics aggregator called Statsite because of its otherworldly speed and one extremely useful feature that sets it apart: it sketches time.
Number of Craps We Give as Function of Time
Some important metrics should not be combined.
A well-placed timer provides one of the most crucial insights into an app’s health and performance. In general, the times gathered from a certain block of code will be normally distributed, like the grades on your high school algebra test.
Aggregating all of the samples on the curve, however, would only give us insight into the mean value. To be sure, the mean time is an important metric, but it will fluctuate very little while absorbing outlier behavior such as network delays, cache-misses, or clashes with the OS scheduler. The long-tail times, on the other hand, are far more sensitive to outside factors and fluctuations. Their boosted sensitivity makes them extremely useful in tracking down bottlenecks both locally in the code and within the infrastructure at large.
If we want to see the long-tail, we can’t aggregate. Obviously to get the long tail we could build a histogram or a sorted list of timer values. The problem with such an approach is that most of the data we just don’t care about.
Additionally, we’ll have so much timer data that storing all of the don’t-cares won’t be cheap.
Note that in order to query a value at a given percentile, only the ranks that don’t-care regions represent are absolutely necessary and the values themselves can be thrown away. It follows from this observation that we can summarize the information in the don’t care regions, only storing the values that we care about.
Crumpling Time with Targeted Quantiles
As with Bloom Filters and HyperLogLogs, our primary goal in gathering a distribution of times is to save memory. We can’t explicitly combine our times, but it is possible to periodically compress their distribution in order to maintain a tiny memory footprint.
We can picture the technique like an accordion that we repeatedly expand and compress. The accordion has two phases:
- Insert new data into a sorted list (expansion)
- Crumple the list around targeted points of interest (compression)
The sorted list represents the timer distribution. By maintaining the list’s order, we can easily perform queries for the percentile values we are most interested in. These values are specified upfront as “targets”.
“A car crash is a good analogy
for targeted compression.”
There are a number of ways to buffer insertions to improve throughput (Statsite uses heaps, for example), but we won’t go into them here. The compression phase is the interesting part.
A car crash is a good analogy for targeted compression. Probably we’ve all seen video of car crash tests. Car bodies are designed to absorb energy in crumple zones in order to keep the more rigid passenger compartment intact during a collision. During compression of our timer distribution, the goal is to keep our target percentiles intact, and we can obliterate the rest. We can configure varying degrees of accuracy for our chosen targets, useful for the highly desirable long-tail percentiles where we have fewer samples.
Compression is performed by walking down the list in reverse, and evaluating a threshold condition* at each node. If the condition is true, we merge the current node into the node to its right, else we leave it intact and keep going. A merge is performed by deleting the current node, and recording that the node to its right now represents a range of list ranks that includes both nodes:
Over time, the ranges of list ranks will become increasingly more probabilistic. As mentioned above, the degree of accuracy that our targeted percentiles maintain can be configured and is upheld by the merge threshold condition.
We can query the compressed distribution by walking down the list in sorted order and evaluating a similar threshold condition at each node. When we find the node that breaks it, we conclude that the previous node must contain the value we want.
* Effectively, the merge threshold condition is, “Are the ranges of list ranks represented by the current node and the node to its right mutually exclusive?” If so, we can merge the nodes. Both the merge threshold condition and the query threshold condition rely on a probabilistic invariant function introduced in the paper (pdf) presenting the targeted quantile algorithm. We omit discussing it here.
Statsite and Practical Metrics Aggregation
We use a tool called Statsite to aggregate metrics at Kiip. As its backbone, Statsite is a network daemon that implements a screamingly fast C version of Etsy’s StatD aggregator. It extends StatsD by making use of HyperLogLogs for aggregating unique counts and the memory-efficient targeted quantile algorithm described above for gathering timer data. We configure it to give us timer information for mean, 95th, and 99th percentile values at increasingly precise error rates.
More on Aggregation and Streams:
- The Statsite source code
- The StatsD source code
- The paper by Cormode and Muthukrishnan (pdf) describing the targeted quantile algorithm implemented by Statsite
- Normal Distributions
A Final Note on the Tools: