Counting at Scale
Author: Priyavrat Misra    Published: 2025-07-04 11:42:26
Motivation

Last weekend, I decided to delve deeper into the world of databases and stumbled upon the fantastic CMU 15-445/645 course. After the introduction lecture, I jumped straight into the assignments. The first one, aptly named C++ Primer, is a straightforward programming task designed to assess your grasp of basic C++ features.

The Problem

The problem? Track the number of unique users accessing a website in a single day. If this were a LeetCode problem, it would be a breeze—just toss all the users into an unordered_set and return its size. Easy, right? But, as with most things in life, real-world problems are rarely so simple—or so memory-friendly.

Let’s break down the “LeetCode way.” Imagine we’re working at Facebook: about one billion unique users log in daily, and each user ID is around 8 bytes. Using an unordered_set, that’s 8 billion bytes of memory—roughly 7.5 GB—just to count users! All we really want is a single number. It’s like renting an entire stadium just to host a two-person meeting.
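For reference, here’s a minimal sketch of that naive exact approach (assuming user IDs fit in 64 bits; the function name is ours):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Naive exact counting: store every ID ever seen in a hash set.
// Memory grows linearly with the number of unique users.
std::size_t CountUniqueUsers(const std::vector<uint64_t> &user_ids) {
  std::unordered_set<uint64_t> seen(user_ids.begin(), user_ids.end());
  return seen.size();
}
```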

Data Streams vs. Accumulated Data

But there’s another subtlety here: in real-world systems, we’re often dealing with a data stream, not a static, accumulated dataset. Users are constantly logging in, and we want to process each event as it happens—potentially across distributed servers—without storing the entire history of who has logged in so far.

The unordered_set solution works if we have all the data in memory and can afford to keep it there, but that’s rarely feasible at scale. In practice, we don’t want to keep every user ID we’ve ever seen—especially when the list can be massive and ever-growing. Instead, we need a way to estimate the number of unique users “on the fly,” using minimal memory, and without revisiting or accumulating all previous data.

Practical Approaches and Their Pitfalls

Let’s get practical. Suppose there’s a “last seen” metadata field associated with each user. We could increment a counter whenever “last seen” is updated from a timestamp that isn’t from the current day. Sounds simple, right? But in reality, every time a user opens the app, we’re hitting the database for a lookup—hardly efficient.
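As a sketch of that idea (every type and name here is hypothetical, purely for illustration):

```cpp
#include <cstdint>

// Hypothetical types for illustration only.
struct Date {
  int year, month, day;
  bool operator==(const Date &) const = default;  // C++20
};
struct UserRecord {
  uint64_t id;
  Date last_seen;
};

// Bump the daily unique counter only when the stored "last seen"
// predates today. Note the cost: every login still pays for a
// database lookup plus a write-back.
void OnUserLogin(UserRecord &user, const Date &today, int64_t &daily_uniques) {
  if (!(user.last_seen == today)) {
    ++daily_uniques;
  }
  user.last_seen = today;
}
```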

You might argue, “But we’re fetching user data anyway to show in the chat window, so why not cache and reuse it?” That’s true—if the user visits the chat window. If not, the lookup is only for the counter, which feels wasteful. Plus, we’re assuming the existence of a “last seen” field, which isn’t always available.

Now, consider a different question: “How many unique users have liked posts in the last month?” Here, we’re dealing with a different set of actions and data. Often, we have no choice but to perform full table scans, which can put serious strain on the database—much like asking your laptop to run Crysis on integrated graphics.

Enter HyperLogLog: A Probabilistic Solution

The problem statement then brings us to HyperLogLog—a probabilistic algorithm designed to estimate the number of unique elements (the “cardinality”) in a data stream without explicitly storing every item. The key idea? The more unique elements we see, the more likely we are to encounter a hash with many contiguous zeroes.

According to the paper “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm” (Flajolet et al., 2007), the algorithm is able to estimate cardinalities of \(>10^{9}\) with a typical accuracy (standard error) of \(2\%\), using \(1.5\) kB of memory.

How HyperLogLog Works

HyperLogLog uses a hash function—ideally, one that produces uniformly distributed values. We take the user ID (or any suitable identifier), hash it to a numerical value, and use its binary representation. Some implementations consider the maximum number of trailing zeroes, others look at the maximum leftmost position of \(1\) (i.e., the maximum number of leading zeroes plus one). For our discussion, we’ll stick with the latter, as in the problem statement.
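With 64-bit hashes, the leftmost position of \(1\) can be computed with C++20’s <bit> header (a sketch; the function name is ours):

```cpp
#include <bit>      // std::countl_zero (C++20)
#include <cstdint>

// Position of the leftmost 1-bit, counted from 1.
// E.g., 0b001... (two leading zeroes) -> position 3.
// A hash of 0 has no 1-bit; callers should handle it separately.
int LeftmostOnePosition(uint64_t hash) {
  return std::countl_zero(hash) + 1;
}
```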

Assuming no hash collisions, our problem reduces to finding the number of unique binary strings. If you remember your high school math, for a random binary string, the probability of getting \(k\) leading zeroes followed by a one is \(1 / 2^{k+1}\).

Alternatively, think of the string as a series of coin tosses: zero means heads, one means tails. If, across many individual runs, the longest opening run of heads we encountered is \(l\), the probability of that pattern (\(l\) heads followed by a tail) is \(1 / 2^{l+1}\). In other words, we can estimate that, on average, we’ve tried at least \(2^{l+1}\) times.

To put it simply: if we see a maximum run of 2 heads, \(HHT\), we probably tossed the coin at least \(2^{2+1} = 8\) times, \(\{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}\). The probability ends up being \(1 / 8\).

Note that \(HHH\) isn’t strictly valid here, because we said the maximum run of heads is 2; technically the count should be one less, but to keep things simple we round it off.

Try to solve for a maximum run of 10 heads—notice how quickly the probability drops!

The chance of seeing a hash with many leading zeroes is very low, and the more unique elements we see, the higher the chance we’ll encounter such a hash. However, it’s still possible—thanks to uniform hashing—to get unlucky and see a binary string with many leading zeroes after just a few tries. These are outliers, and HyperLogLog has a clever way of handling them.

Bucketing and Mergeability

To manage outliers, HyperLogLog distributes the hash values into different buckets based on the initial \(b\) bits of the hash. This gives us a total of \(2^b\) buckets, each tracking its own maximum. Even though we’re distributing values, the maximum among them remains the same, so our problem is mergeable—we don’t have to worry about double-counting.
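Here’s a minimal sketch of that register layout, continuing with 64-bit hashes (the class and member names are ours, and we assume \(1 \le b \le 63\)):

```cpp
#include <algorithm>
#include <bit>
#include <cstdint>
#include <vector>

// 2^b buckets; each stores the maximum leftmost-1 position observed
// among the hashes routed to it.
class HyperLogLog {
 public:
  explicit HyperLogLog(int b) : b_(b), buckets_(1ULL << b, 0) {}

  void Add(uint64_t hash) {
    uint64_t index = hash >> (64 - b_);  // top b bits pick the bucket
    uint64_t rest = hash << b_;          // remaining bits feed the count
    uint8_t position = static_cast<uint8_t>(std::countl_zero(rest) + 1);
    buckets_[index] = std::max(buckets_[index], position);
  }

  const std::vector<uint8_t> &buckets() const { return buckets_; }

 private:
  int b_;
  std::vector<uint8_t> buckets_;
};
```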

This property is incredibly useful for questions like, “How many unique users have liked posts in the last month?” We can easily combine the buckets for each day and calculate the result. To do this, we persist each day’s serialized HyperLogLog structure (typically just a few kilobytes—no big deal). The mergeability of HyperLogLog is also a boon in distributed systems: each server can maintain its own buckets, which can later be aggregated.
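Continuing the sketch above (and reusing its includes), merging two structures built with the same \(b\) is just an element-wise maximum:

```cpp
// Merge two bucket arrays of equal size by taking the element-wise
// maximum; the order in which days or servers are merged doesn't matter.
std::vector<uint8_t> Merge(const std::vector<uint8_t> &a,
                           const std::vector<uint8_t> &b) {
  std::vector<uint8_t> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    out[i] = std::max(a[i], b[i]);
  }
  return out;
}
```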

Why not just use the mean?

So, what if we take the average of all the maximums across the buckets? Does it solve the outlier problem? Unfortunately, no. The average is still affected by outliers. Instead, HyperLogLog uses the harmonic mean, which is less sensitive to large outliers.

Let \(b\) be the number of bits chosen to identify a bucket, giving us \(m = 2^b\) buckets. Let \(p_i\) be the maximum leftmost position of \(1\) seen so far for bucket \(i\). The cardinality estimate is calculated as:
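\[
E = \alpha_m \cdot m^2 \cdot \left( \sum_{i=1}^{m} 2^{-p_i} \right)^{-1}
\]

Here \(\alpha_m\) is a bias-correction constant from the paper, roughly \(0.7213 / (1 + 1.079/m)\) for large \(m\).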

Notice that each bucket contributes \(2^{-p_i}\) to the denominator, so an outlier with a huge \(p_i\) adds only a vanishingly small term and can’t dominate the estimate. Also, the more buckets we have, the more outliers we can absorb, and the more accurate the cardinality estimate becomes (although this does require more memory).
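Rounding out the sketch, the estimate is a direct transcription of the formula (the small- and large-range corrections from the paper are omitted here):

```cpp
#include <cmath>

// Harmonic-mean estimate over the buckets: alpha_m * m^2 / sum(2^-p_i).
// Empty buckets have p_i = 0 and contribute 2^0 = 1 to the sum.
double Estimate(const std::vector<uint8_t> &buckets) {
  const double m = static_cast<double>(buckets.size());
  double sum = 0.0;
  for (uint8_t p : buckets) {
    sum += std::exp2(-static_cast<double>(p));
  }
  const double alpha_m = 0.7213 / (1.0 + 1.079 / m);  // for m >= 128
  return alpha_m * m * m / sum;
}
```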

Real-World Impact

The efficiency and scalability of HyperLogLog have cemented its place as an industry standard for large-scale unique counting. For instance, Redis offers built-in HyperLogLog support, enabling developers to track unique website visitors or event occurrences with just a few kilobytes of memory. Likewise, Google Analytics relies on similar probabilistic algorithms to deliver fast, accurate unique user counts across billions of events, powering real-time analytics on a global scale. Beyond these high-profile examples, HyperLogLog is widely used in streaming analytics dashboards, traffic heatmaps, and any scenario where massive data volumes demand both accuracy and minimal memory usage.

Conclusion

I hope you found this exploration as engaging to read as it was to put together! Thanks for joining me on this deep dive.

Further Reading

Flajolet, P., Fusy, É., Gandouet, O., and Meunier, F. (2007). “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm.” Analysis of Algorithms (AofA) 2007, DMTCS Proceedings.