（哈希表扩展）位图布隆过滤器

c++ • 449 • 0

tgw 7 月前

仓库地址 https://gitee.com/tgwTTT/data-structure

在计算机科学中，我们经常会遇到一个问题：如何判断一个元素是否存在于一个集合中？

当数据量较小时，使用哈希表或平衡树就能很好解决。但当数据量达到数十亿级别时，传统的数据结构会遇到内存瓶颈。这时，布隆过滤器（Bloom Filter）以其独特的概率型数据结构，为我们提供了一种优雅的解决方案。

在理解布隆过滤器之前，我们先看它的前身——位图（Bitmap）。

我们首先先看一个题目：

给40亿个不重复的⽆符号整数，没排过序。给⼀个⽆符号整数，如何快速判断⼀个数是否在这40亿个数中。

解题思路1：暴⼒遍历，时间复杂度O(N)，太慢

解题思路2：排序+⼆分查找。时间复杂度消耗 O(N*logN) + O(logN)

深⼊分析：解题思路2是否可⾏，我们先算算40亿个数据⼤概需要多少内存。 1G = 1024MB = 1024*1024KB = 1024*1024*1024Byte 约等于10亿多Byte 那么40亿个整数约等于16G，说明40亿个数是⽆法直接放到内存中的，只能放到硬盘⽂件中。⽽⼆分查找只能对内存数组中的有序数据进⾏查找。

解题思路3：数据是否在给定的整形数据中，结果是在或者不在，刚好是两种状态，那么可以使⽤⼀个⼆进制⽐特位来代表数据是否存在的信息，如果⼆进制⽐特位为1，代表存在，为0代表不存在。那么我们设计⼀个⽤位表⽰数据是否存在的数据结构，这个数据结构就叫位图。

位图的核心思想

位图用一个比特位来表示一个元素是否存在：

     索引: 0 1 2 3 4 5 6 7
     位图: 0 0 1 0 1 0 0 1

位图[2]=1 → 表示数字2存在于集合中

位图[7]=1 → 表示数字7存在于集合中

其他位为0 → 表示对应数字不存在

位图的局限性

位图只能处理整数，且当数值范围很大时（如64位整数），位图会变得极其稀疏：

要表示集合{1, 1000000}，需要1,000,001位，但只使用了2位

内存浪费严重：1MB只能表示800万个整数状态

下面是位图代码展示

#pragma once
#include
#include
#include
using namespace std;
namespace bit {
	template
	class bitmap {
	public:
		bitmap() {
			vt.resize(N / 32 + 1);
		}
		void set(int key) {
			int i = key / 32;
			int j = key % 32;
			vt[i] |= (1 << j);
		}
		void reset(int key) {
			int i = key / 32;
			int j = key % 32;
			vt[i] &= ~(1 << j);
		}
		bool test(int key) {
			int i = key / 32;
			int j = key % 32;
			return vt[i] & (1 << j);
		}
	private:
		vectorvt;
	};
}
void testbitmap() {
	int num = 1e9;
	bit::bitmap<0xffffffff>mp;
	for (int i = 0; i < num; i++) {
		mp.set(rand() + i);
	}
	mp.reset(1002);
	if (mp.test(1002)) {
		cout << "1002在数组中" << endl;
	}
	else {
		cout << "不在" << endl;
	}
}

布隆过滤器：位图的进化

布隆过滤器通过多个哈希函数和位数组的组合，解决了位图的扩展性问题。

初始化：创建一个m位的位数组，所有位初始为0
添加元素：
- 使用k个不同的哈希函数计算元素的k个哈希值
- 将每个哈希值对应的位设为1
查询元素：
- 计算元素的k个哈希值
- 检查对应位是否全为1：
- 如果有任意一位为0 → 元素一定不存在
- 如果全为1 → 元素可能存在（可能有误判）

实际应用场景：

数据库系统：如Google Bigtable、HBase用布隆过滤器避免不必要的磁盘查找
网络爬虫：检测URL是否已爬取
垃圾邮件过滤：检查发件人是否在黑名单
区块链：比特币用布隆过滤器加速钱包同步
CDN缓存：判断资源是否已在边缘节点缓存

下面是布隆过滤器代码展示：

#pragma once
#include"Hash.h"
#include
struct HashFuncBKDR
{
	// @detail 本 算法由于在Brian Kernighan与Dennis Ritchie的《The CProgramming Language》
	// 一书被展示而得 名，是一种简单快捷的hash算法，也是Java目前采用的字符串的Hash算法累乘因子为31。
	size_t operator()(const std::string& s)
	{
		size_t hash = 0;
		for (auto ch : s)
		{
			hash *= 31;
			hash += ch;
		}
		return hash;
	}
};
struct HashFuncAP
{
	// 由Arash Partow发明的一种hash算法。  
	size_t operator()(const std::string& s)
	{
		size_t hash = 0;
		for (size_t i = 0; i < s.size(); i++)
		{
			if ((i & 1) == 0) // 偶数位字符
			{
				hash ^= ((hash << 7) ^ (s[i]) ^ (hash >> 3));
			}
			else              // 奇数位字符
			{
				hash ^= (~((hash << 11) ^ (s[i]) ^ (hash >> 5)));
			}
		}

		return hash;
	}
};
struct HashFuncDJB
{
	// 由Daniel J. Bernstein教授发明的一种hash算法。 
	size_t operator()(const std::string& s)
	{
		size_t hash = 5381;
		for (auto ch : s)
		{
			hash = hash * 33 ^ ch;
		}

		return hash;
	}
};

template
class BloomFilter {
public:
	void set(K key) {
		size_t num1 = hash1()(key) % M;
		size_t num2 = hash2()(key) % M;
		size_t num3 = hash3()(key) % M;
		mp.set(num1);
		mp.set(num2);
		mp.set(num3);
	}
	bool Test(K key) {
		size_t num1 = hash1()(key) % M;
		if (!mp.test(num1))return false;
		size_t num2 = hash2()(key) % M;
		if (!mp.test(num2))return false;
		size_t num3 = hash3()(key) % M;
		if (!mp.test(num3))return false;
		return true;
	}
	double getFalseProbability()
	{
		double p = pow((1.0 - pow(2.71, -3.0 / X)), 3.0);

		return p;
	}
private:
	static const size_t M = N * X;
	bit::bitmapmp;
};
void TestBloomFilter1()
{
	BloomFilter<10> bf;
	bf.set("猪八戒");
	bf.set("孙悟空");
	bf.set("唐僧");

	cout << bf.Test("猪八戒") << endl;
	cout << bf.Test("孙悟空") << endl;
	cout << bf.Test("唐僧") << endl;
	cout << bf.Test("沙僧") << endl;
	cout << bf.Test("猪八戒1") << endl;
	cout << bf.Test("猪戒八") << endl;
}

今天的更新就到这里，如有错误欢饮评论区指出！！

1 赏

‹ 上一篇

下一篇 ›

作者尚未添加打赏二维码！

（哈希表扩展）位图布隆过滤器

位图的核心思想

位图的局限性

布隆过滤器：位图的进化

实际应用场景：

相关文章

评论

我是萌萌哒的侧边栏！