Can we extend the XOR trick for finding one or two missing numbers in a list to finding thousands of missing IDs in a billion-row table?
我们可以将XOR的技巧扩展到列表中的一个或两个丢失的数字中以在十亿行的表中找到数千个缺失的ID吗?
Yes, we can! This is possible using a data structure called an Invertible Bloom Filter (IBF) that compares two sets with space complexity based only on the size of the difference. Using a generalization of the XOR trick [1], all the values that are identical cancel out, so the size of this data structure depends only on the size of the difference.
是的,我们可以!使用称为可逆布鲁过滤器(IBF)的数据结构,该数据结构仅基于差异的大小比较两个集合与空间复杂性。使用Xor Trick [1]的概括,所有相同取消的值,因此此数据结构的大小仅取决于差异的大小。
Most explanations of Invertible Bloom Filters start with standard Bloom filters, which support two operations: insert and maybeContains . Next, they extend to counting Bloom filters, which enables a delete operation. Finally, they introduce Invertible Bloom Filters, which add an exact get operation and a probabilistic listing operation. In this article, I will take a different approach and build up to an IBF from the XOR trick.
可演变布鲁姆过滤器的大多数解释都始于标准Bloom过滤器,该过滤器支持两个操作:插入和Maybecontains。接下来,它们扩展到计数Bloom过滤器,这可以删除操作。最后,他们引入了可逆的布鲁姆过滤器,该过滤器增加了精确的获取操作和概率上市操作。在本文中,我将采用不同的方法,并从XOR技巧中构成IBF。
IBFs have remained relatively obscure in the software development community while the XOR trick is a well-known technique thanks to leetcode. My goal with this article is to connect IBFs to the XOR trick so that more developers understand this fascinating and powerful data structure.
在软件开发社区中,IBF仍然相对晦涩,而XOR技巧是众所周知的技术。我的本文目标是将IBFS连接到XOR技巧,以便更多的开发人员了解这种有趣而强大的数据结构。
Finding 3 Missing Numbers
查找3个丢失的数字
Let's start with a concrete example:
让我们从一个具体示例开始:
A = [ 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 ] B = [ 1 , 2 , 4 , 5 , 7 , 8 , 10 ]
a = [1,2,3,4,5,6,7,8,9,10] b = [1,2,4,5,7,8,10]
Set B is missing 3 numbers: {3, 6, 9}.
集B缺少3个数字:{3,6,9}。
The classic XOR trick works for finding 1 or 2 missing numbers, but what about 3 or more? If we don't know how many numbers are missing, we can use a count field to detect when the usual XOR trick will fail:
经典的XOR技巧可用于查找1或2个丢失的数字,但是3或更多呢?如果我们不知道缺少多少个数字,我们可以使用计数字段来检测通常的XOR技巧何时失败:
XOR ( A ) = 1 ^ 2 ^ ... ^ 10 = 11 COUNT ( A ) = 10 COUNT ( B ) = 7
xor(a)= 1 ^ 2 ^ ... ^ 10 = 11 count(a)= 10 count(b)= 7
Since B has 7 elements and A has 10, we know 3 numbers are missing—too many for the basic XOR trick.
由于B有7个元素,并且A有10个元素,因此我们知道3个数字缺少,这是基本XOR技巧的许多。
To work around this limitation, we can partition our sets using a hash function. Let's try with 3 partitions:
为了解决此限制,我们可以使用哈希函数对设置进行分区。让我们尝试三个分区:
H ( 1 ) = 2 , H ( 2 ) = 1 , H ( 3 ) = 1 , H ( 4 ) = 0 , H ( 5 ) = 0 H ( 6 ) = 0 , H ( 7 ) = 2 , H ( 8 ) = 1 , H ( 9 ) = 2 , H ( 10 ) = 1 A_0 = [ 4 , 5 , 6 ] B_0 = [ 4 , 5 ] A_1 = [ 2 , 3 , 8 , 10 ] B_1 = [ 2 , 8 , 10 ] A_2 = [ 1 , 7 , 9 ] B_2 = [ 1 , 7 ]
H(1)= 2,H(2)= 1,H(3)= 1,H(4)= 0,H(5)= 0 H(6)= 0,H(7)= 2,H(8)= 1,H(9)= 2,H(10)= 1 A_0 = 1 A_0 = [4,5,5,5,5,5,6] B_0,7,9] b_2 = [1,7]
Now each partition has at most 1 missing element, so we can apply the XOR trick to each:
现在每个分区最多都有1个缺少的元素,因此我们可以将XOR技巧应用于每个元素:
XOR ( B_0 ) ^ XOR ( A_0 ) = ( 4 ^ 5 ) ^ ( 4 ^ 5 ^ 6 ) = 6 XOR ( B_1 ) ^ XOR ( A_1 ) = ( 2 ^ 8 ^ 10 ) ^ ( 2 ^ 3 ^ 8 ^ 10 ) = 3 XOR ( B_2 ) ^ XOR ( A_2 ) = ( 1 ^ 7 ) ^ ( 1 ^ 7 ^ 9 ) = 9
xor(b_0) ^ xor(a_0)=(4 ^ 5) ^(4 ^ 5 ^ 6)= 6 xor(b_1) ^ xor(a_1)=(2 ^ 8 ^ 10) ^(2 ^ 3 ^ 3 ^ 8 ^ 10)= 3 xor(b_2)
Success! We recovered all 3 missing elements: {3, 6, 9}.
成功!我们恢复了所有3个缺失的元素:{3、6、9}。
Unfortunately, this approach alone doesn't work in practice. Here, we got lucky with our hash function. Usually, elements will not distribute evenly across partitions, and some partitions might still have multiple differences.
不幸的是,仅这种方法在实践中不起作用。在这里,我们对哈希功能感到幸运。通常,元素不会在各个分区之间均匀分布,并且某些分区可能仍然存在多个差异。
Detecting when the XOR trick fails
检测XOR技巧何时失败
To fix the problems with naive partitioning, we will need to use a more sophisticated approach.
为了解决幼稚分区的问题,我们将需要使用更复杂的方法。
Let's generalize beyond"missing numbers" to the concept of symmetric difference. The symmetric difference of two sets A and B is the set of elements that are in either set but not in both: (A \ B) ∪ (B \ A) , often written as A Δ B .
让我们将“缺失数字”概括为对称差异的概念。两组A和B的对称差是两个集合中的元素集:(a \ b)∪(b \ a),通常以δb的形式写成。
Figure 1: The symmetric difference A Δ B includes elements that are in either set A or set B, but not in both sets.
图1:对称差AΔB包括在设置A或集B中的元素,但在两组中都不包含。
Consider this example:
考虑此示例:
A = [ 1 , 2 , 3 , 4 , 6 ] B = [ 1 , 2 , 4 , 5 ]
a = [1,2,3,4,6] b = [1,2,4,5]
The symmetric difference is {3, 6, 5}—elements 3 and 6 are only in A, while element 5 is only in B.
对称差为{3,6,5} - 元素3和6仅在A中,而元素5仅在B中。
The naive approach of checking |COUNT(A) - COUNT(B)| = 1 fails here because the count difference is 1, but the symmetric difference has 3 elements. Applying the basic XOR trick gives a meaningless result.
检查的幼稚方法|计数(a) - 计数(b)|= 1失败,因为计数差为1,但是对称差有3个元素。应用基本的XOR技巧可带来毫无意义的结果。
To detect this case, we can use an additional accumulator with a hash function:
为了检测这种情况,我们可以使用带有哈希功能的其他蓄能器:
COUNT ( A ) = 5 , XOR ( A ) = 2 COUNT ( B ) = 4 , XOR ( B ) = 2 XOR_HASH ( A ) = H ( 1 ) ^ H ( 2 ) ^ H ( 3 ) ^ H ( 4 ) ^ H ( 6 ) XOR_HASH ( B ) = H ( 1 ) ^ H ( 2 ) ^ H ( 4 ) ^ H ( 5 ) XOR ( B ) ^ XOR ( A ) = 0 XOR_HASH ( B ) ^ XOR_HASH ( A ) = H ( 3 ) ^ H ( 5 ) ^ H ( 6 ) H ( 3 ) ^ H ( 5 ) ^ H ( 6 ) ≠ H ( 0 )
计数(a)= 5,xor(a)= 2 count(b)= 4,xor(b)= 2 xor_hash(a)= h(1) ^ h(2))= h(3) ^ h(5) ^ h(6)h(3) ^ h(5) ^ h(6)≠h(0)
While we can't yet recover the full symmetric difference, we can detect when the XOR result is unreliable by checking if the hash accumulators behave consistently.
虽然我们还不能恢复完整的对称差异,但我们可以通过检查哈希蓄能器是否始终如一地检测XOR结果何时不可靠。
Invertible Bloom Filters
可逆的花过滤器
The Core Idea
核心想法
The original XOR trick relies on an accumulator that stores the XOR aggregate of a list. Building on this, we've introduced:
原始的XOR技巧依赖于存储列表的XOR聚合的累加器。在此基础上,我们介绍了:
Partitioning to divide sets into smaller parts Additional accumulators that store the element count and hash of XOR aggregate
分区以将设置分为较小的零件,以存储元素计数和XOR聚集的哈希
To fully generalize this into a robust data structure, we need:
为了将其完全推广到强大的数据结构中,我们需要:
A partitioning scheme that creates recoverable partitions with high probability An iterative process that uses recovered values to unlock additional partitions
一种分区方案,该方案创建具有高概率可回收分区的迭代过程,该过程使用恢复值解锁其他分区
This is exactly what Invertible Bloom Filters [2] provide. IBFs use a Bloom filter-style hashing scheme to assign elements to multiple partitions, then employ a graph algorithm called"peeling" to iteratively recover the symmetric difference with very high probability [3].
这正是可逆的布鲁姆过滤器[2]提供的。IBF使用Bloom Filter式散列方案将元素分配给多个分区,然后使用称为“剥离”的图算法以非常高的概率迭代恢复对称差异[3]。
Structure
结构
An IBF consists of an array of cells, where each cell contains three accumulators:
一个IBF由一系列单元组成,每个细胞都包含三个蓄能器:
idSum : XOR aggregate of element values
IDSUM:元素值的XOR聚集
: XOR aggregate of element values hashSum : XOR aggregate of element hashes
:XOR元素值的聚集物hashsum:XOR的元素聚集
: XOR aggregate of element hashes count: number of elements in the cell
:XOR元素哈希计数的聚合:单元格中的元素数量
Figure 2: Figure from [2] showing the process of encoding elements into IBF cells.
图2:[2]的图显示了编码元素到IBF单元中的过程。
Operations
运营
For computing symmetric differences, IBFs support three key operations:
对于计算对称差异,IBFS支持三个关键操作:
Encode: Build an IBF from a set of values Subtract: Subtract one IBF from another—identical values cancel out, leaving only the symmetric difference Decode: Recover stored values by finding"pure" cells (count = ±1 and H(idSum) = hashSum) and iteratively"peeling" them
编码:从一组值中构建一个IBF减去:从另一个值中减去一个IBF - 相同的值取消,仅留下对称差异解码:通过找到“纯”单元格(count =±1和h(idsum)= hashsum)和迭代地“ epeling”它们来恢复存储的值
The IBF has one parameter d —the expected size of the symmetric difference. With proper sizing (typically m > 1.22d cells), IBFs recover the full symmetric difference with very high probability.
IBF具有一个参数D - 对称差异的预期大小。使用适当的尺寸(通常为M>1.22D单元格),IBF以很高的概率恢复了完整的对称差异。
Algorithms for each IBF operation from [2].
[2]的每个IBF操作的算法。
Example Implementation
示例实现
I have not found a solid, maintained library of IBFs in any language (if you know of any please let me know!), so I created a basic implementation in Python if you would like to play with it: https://gist.github.com/hundredwatt/a1e69ff300de941041d824e49249d3d7
我还没有找到任何语言的稳定,维护的IBF库(如果您知道任何语言,请告诉我!),因此,如果您想使用它,我在Python中创建了一个基本实现:
Example usage:
示例用法:
a = [ 1 , 2 , 4 , 5 , 6 , 7 , 9 , 10 ] b = [ 1 , 3 , 4 , 5 , 6 , 7 , 9 , 10 ] set_a = set ( a ) set_b = set ( b ) ibf_size = 100 k = 4 ibf_a = IBF ( size = ibf_size , k = k ) for item in a : ibf_a . insert ( item ) ibf_b = IBF ( size = ibf_size , k = k ) for item in b : ibf_b . insert ( item ) diff_ibf = ibf_a . subtract_ibf ( ibf_b ) success , decoded_added , decoded_removed = diff_ibf . decode ( ) assert ( success ) expected_added = set_a - set_b expected_removed = set_b - set_a print ("--- Verification ---" ) print ( f"Expected added (in A, not B): { len ( expected_added ) } items" ) print ( f"Decoded added: { len ( decoded_added ) } items" ) assert ( expected_added == decoded_added ) print ( f"Expected removed (in B, not A): { len ( expected_removed ) } items" ) print ( f"Decoded removed: { len ( decoded_removed ) } items" ) assert ( expected_removed == decoded_removed )
a = [1,2,4,4,5,6,7,9,10] b = [1,3,4,5,6,6,7,7,9,9,10] set_a = set_a = set(a)set_b = set_b = set(b)ibf_size = 100 k = 100 k = 4 ibf_a = ibf_a = ibf(size = ibf_size,size = ibf_size,k = k),用于a:a:ibfff_a a:ibf_a a:ibf_a a:ibf_a a:插入(item)ibf_b = ibf(size = ibf_size,k = k),用于b:ibf_b中的项目。插入(item)diff_ibf = ibf_a。sumtract_ibf(ibf_b)成功,解码了,decoded_remaved = diff_ibf。decode()assert(success)endure_added = set_a-set_b endure_remaved = set_b-set_a print(“ ----验证---”)print(f”添加(在a,not b)中:{len(expect_added)} item {feccord_ed_added_added = decoded_Added = decoded_added exped_Added nected = decoded_Added decoded_added nected = = =dododed_added)print(f“预期删除(在b,a)中:{len(endivy_remaved)} items”)print(f”解码删除:{len(decoded_remaved)} items ostert(expect_remeved = = decodeded_remaved)
About the"Set Reconciliation" Problem
关于“设定和解”问题
The general problem of efficiently comparing two sets without transferring their entire contents is called"set reconciliation." Early research used polynomial approaches [4], while modern solutions typically employ either Invertible Bloom Filters or error-correction codes [5].
有效比较两组而不传输其全部内容的一般问题称为“集合对帐”。早期研究使用了多项式方法[4],而现代解决方案通常采用可逆的花朵过滤器或错误校正代码[5]。
Conclusion
结论
IBFs are a powerful tool for comparing sets with space complexity based only on the size of the difference. I hope this article has helped you understand how IBFs work and how they can be used to solve real-world problems.
IBFS是仅基于差异大小的空间复杂性比较集合的功能强大工具。我希望本文能帮助您了解IBF的工作方式以及如何使用它们来解决现实世界中的问题。
If you want to learn more about IBFs, I recommend reading the references below. The"Further Reading" section contains some additional, advanced papers that I found interesting.
如果您想了解有关IBFS的更多信息,建议您阅读以下参考文献。“进一步阅读”部分包含一些我发现有趣的高级论文。
If you have any questions or feedback, please feel free to reach out to me, contact info is in the footer.
如果您有任何疑问或反馈,请随时与我联系,请联系信息在页脚中。
References
参考
[1] Florian Hartmann,"That XOR Trick" (2020), https://florian.github.io/xor-trick/
[1] Florian Hartmann,“ That Xor Trick”(2020),https://florian.github.io/xor-trick/
[2] David Eppstein, Michael T. Goodrich, Frank Uyeda, and George Varghese,"What's the Difference? Efficient Set Reconciliation without Prior Context," SIGCOMM (2011), https://ics.uci.edu/~eppstein/pubs/EppGooUye-SIGCOMM-11.pdf
[2] David Eppstein,Michael T. Goodrich,Frank Uyeda和George Varghese,“有什么区别?没有事先上下文的有效设置和解,” Sigcomm(2011),https://ics.uci.uci.uci.edu/~eppstein/~eppstein/ppououye-sigcomps/eppgoouye-sigcomm-11.pdf-11.pdf。
[3] Michael Molloy,"Cores in Random Hypergraphs and Boolean Formulas," Random Structures & Algorithms (2003), https://utoronto.scholaris.ca/server/api/core/bitstreams/371257de-0a44-4702-afeb-542ae9a06986/content
[3] Michael Molloy,"Cores in Random Hypergraphs and Boolean Formulas," Random Structures & Algorithms (2003), https://utoronto.scholaris.ca/server/api/core/bitstreams/371257de-0a44-4702-afeb-542ae9a06986/content
[4] Yaron Minsky and Ari Trachtenberg,"Set Reconciliation with Nearly Optimal Communication Complexity" (2000), https://ecommons.cornell.edu/server/api/core/bitstreams/c3fff828-cfb8-416a-a28b-8afa59dd2d73/content
[4] Yaron Minsky和Ari Trachtenberg,“与几乎最佳的沟通复杂性进行对帐”(2000),https://ecommons.cornell.edu/server/server/server/core/core/core/core/bitsreams/c3fff5fffff.f82fffff.f828-cffb8-cfb8-cfb8-416a-a28b-a28b-8b-8afa573/cont d273/
[5] Pengyu Gong, Lei Yang, and Liwei Wang,"Space- and Computationally-Efficient Set Reconciliation via Parity Bitmap Sketch," VLDB (2021), https://vldb.org/pvldb/vol14/p458-gong.pdf
[5] Pengyu Gong,Lei Yang和Liwei Wang,“通过奇偶校图素描的空间和计算效率设置的对帐”,Vldb(2021),https://vldb.org/pvldb/pvldb/voldb/vol14/vol14/p458-gong.p458-gong.pdff
Further Reading
进一步阅读
Invertible Bloom Lookup Tables, Mitzenmacher (2011), https://people.cs.georgetown.edu/~clay/classes/fall2017/835/papers/IBLT.pdf
可逆的Bloom查找桌,Mitzenmacher(2011),https://people.cs.georgetown.edu/~clay/classes/fall2017/835/papers/iblt.pdf.pdf
This paper extends the original IBF paper to support additional encoded fields (not just a single idSum) and a more robust decoding algorithm that I found to be necessary for production use.
本文扩展了原始的IBF纸,以支持其他编码字段(不仅是单个IDSUM)和一种更强大的解码算法,因此我发现这对于生产使用是必需的。
Simple Multi-Party Set Reconciliation, Mitzenmacher (2013), https://arxiv.org/abs/1311.2037
简单的多党集对帐,Mitzenmacher(2013),https://arxiv.org/abs/1311.2037
This paper shows how to extend IBF-based reconciliation to efficiently handle three or more parties. The implementation is neat, it encodes the IBFs on finite fields of a dimension equal to the number of parties.
本文展示了如何扩展基于IBF的对帐,以有效地处理三个或更多各方。实现很整洁,它在数量的有限字段上编码IBF,等于当事方的数量。
Simple Set Sketching, Bæk (2022), https://arxiv.org/abs/2211.03683
简单集素描,bæk(2022),https://arxiv.org/abs/2211.03683
This paper reduces the size of the IBF by eliminating the hashSum field. Instead, it uses a hash function that calculates the cell index as a checksum and has some additional checks in the decoding process to deal with potential"anomalies" that show up in this approach.
本文通过消除了Hashsum字段来减少IBF的大小。取而代之的是,它使用哈希函数来计算单元格作为校验和,并在解码过程中还有一些其他检查来处理在这种方法中显示的潜在“异常”。
Practical Rateless Set Reconciliation, Yang (2024), https://dspace.mit.edu/handle/1721.1/156671
实用无等级集对帐,杨(2024),https://dspace.mit.mit.edu/handle/1721.1/156671
Standard IBFs require knowing the expected size of the symmetric difference beforehand to size the filter correctly. This recent work introduces"rateless" IBFs that adapt their size dynamically, making them more practical when the difference size is unknown or highly variable.
标准IBF要求事先知道对称差的预期大小,以正确尺寸过滤器。这项最近的工作引入了“无重量”的IBF,该IBF动态地调整了它们的大小,在差异大小未知或高度可变时使其更实用。