In this four-part series on counting at scale, we talk about sketching algorithms and how we apply sketching at Kiip to keep accurate counts at high scale.
Sketching and Scaling – Part I
It’s important to be able to keep accurate counts within an application. Metrics, DAU, demographics, operation times – counting is probably the single most important factor in gauging and promoting the health of a system as well as analyzing data. Conceptually it’s easy to keep exact counts of all your important numbers, but eventually the system will reach a point in its scale where the rate of the count makes exactness impossible. At that point, “sketching” the numbers becomes useful.
In this four-part series on sketching, we’ll talk about probabilistic counting using sketching algorithms and data structures, and how we use them at Kiip to tackle hard counting problems. We’ll start at a high level and work our way in.
What is Sketching?
The more you understand sketching, the more it seems like magic.
The basic idea behind sketching is that estimation can save time and especially space. The extra space allows the system to overcome limitations imposed by its hardware.
To wrap our brains around what this might mean in practice, take the following exchange:
Anna: “What time is it?”
Barb: “3:58, 47 seconds, 223 milliseconds, …”
Right away we can note a few things. First, Barb’s degree of accuracy here is probably irrelevant to Anna. Moreover, within the time it took Barb to pronounce the more accurate time, her watch would have moved one or two seconds. This means that by the time Anna received the full report, the information was no longer accurate. Barb’s attempts at exactness are unmanageable. It’s a hardware problem: Barb’s mouth can’t move fast enough to report accurate times.
But that doesn’t really matter. Actually, as high-level thinking machines, people sketch every day. Take this second exchange as an example:
Anna: “What time is it?”
Even if it really is 3:58 and however many seconds, sketching the time as 4:00 allows both Anna and Barb to go on living their lives on time.
Likewise, after a certain scale, the time and especially the space it takes to maintain exact counts become problems. Available memory, read/write time and network call time inevitably become obstacles. No amount of code optimization, language switching or Mongo phalanxes will continue to support exact numbers.
Luckily, sketching makes it possible to process large streams of data using sub-linear space. Effectively, this means that you don’t need to double your available memory to support double the load.
Applications of Sketching
The most general use case for sketching is having too much data that needs to be processed in one pass. In the case of an application or service, this often amounts to the stream of requests. Sketching can be used to store a comparatively small amount of data about the stream that can be processed quickly, later to provide approximate insights.
Some real-world examples of where sketching is applied:
- Database query planning
- Accurately gauging network round-trip times
- Counting unique elements in a stream
- Blacklisting malicious URLs
- Counting recurring patterns in genomic sequences
Google and Wikipedia both turn up hundreds of others.
In the remainder of this series, we’ll discuss sketching and how we use it at Kiip. We’ll talk about the algorithms we use and how counting problems can be formulated as simple questions that particular algorithms and data structures answer.
- Did a given email bounce the last time we saw it?
- Does the library have a copy of The Odyssey in stock?
- How many unique devices hit the API from Indonesia last night?
- How many stars are in the Andromeda Galaxy?
- How many different cars cross the Bay Bridge every year?
In Part 2, we’ll differentiate what to count from what not to count by asking yes-or-no questions to a bloom filter.