Hbase 解析（四) 布隆过滤器(BloomFilter)

2023-08-06 23:32:10

1.简介

1.1 介绍

Bloom filter是1970年引入的一种数据结构，在过去的十年中，由于它为网络主机之间的组成员信息传输提供了带宽效率，因此被网络研究界采用。发送者将信息编码成一个比特向量，即Bloom过滤器，它比传统的表示形式更紧凑。Bloom本身的计算和空间成本在元素数量上是线性的。接收器使用过滤器来测试各种元素是否是集合的成员。虽然过滤器偶尔会返回假阳性，但它永远不会返回假阴性。Hbase正是利用这一点，在数据查询时判断数据是否在当前Hfile中，可以有效减少数据的读取，减少IO操作。

1.2 Bloom filter最优的大小计算

Bloom过滤器对插入到其中的元素的数量非常敏感。对于HBase来说，条目的数量取决于存储在列中的数据的大小。当前默认区域大小为256MB，因此条目计数~=256MB/（列的平均值大小）。尽管有这个经验法则，但是由于压缩，我们并没有有效的方法来计算压缩后的条目计数。因此，通常使用动态bloom过滤器来添加额外的空间，而不是允许错误率增长。

Bloom filter最优的大小计算公示为：bloom size m = -(n * ln(err) / (ln(2)^2) ~= n * ln(err) / ln(0.6185)

m表示Bloom filter中的位数（bitSize）
n表示插入bloomfilter中的元素数（maxKeys）
k表示使用的哈希函数数（nbHash）
e表示Bloom所需的误报率（err）

但且仅当k=m/n ln（2）时，误报概率最小。

Hbase中Bloom filter的设置是在创建列族时通过setBloomFilterType方法设定，Hbase支持ROW、ROWCOL、ROWPREFIX_FIXED_LENGTH三种类型的Bloom filter，创建列族时默认设置为ROW，对所有插入数据的rowkey写入到Bloom filter中。

2.hbase中布隆过滤器的实现

2.1 Bloom filter接口实现

Hbase的布隆过滤器由父接口BloomFilterBase类定义，包含2个子继承接口BloomFilter和BloomFilterWriter。

BloomFilter负责读取、判断，BloomFilterWriter负责将数据写入布隆过滤器。

2.1.1 read类

BloomFilter负责数据的读取判断其中定义了三个方法

//检查所定义的keyCell是否包含
boolean contains(Cell keyCell, ByteBuff bloom, BloomType type);
boolean contains(byte[] buf, int offset, int length, ByteBuff bloom);//是否允许Bloom filter自动load数据，默认实现为true
boolean supportsAutoLoading();

BloomFilter最终的实现类是CompoundBloomFilter类，CompoundBloomFilter的核心方法是contains方法。

  public boolean contains(Cell keyCell, ByteBuff bloom, BloomType type) {//如果根索引不包含keyCell，返回false，根索引在Hfile创建时构建，不是对所有rowkeyint block = index.rootBlockContainingKey(keyCell);if (block < 0) {return false; // This key is not in the file.}boolean result;//获得Bloom的BlockHFileBlock bloomBlock = getBloomBlock(block);try {ByteBuff bloomBuf = bloomBlock.getBufferReadOnly();//通过BloomFilterUtil的contains方法判断result = BloomFilterUtil.contains(keyCell, bloomBuf, bloomBlock.headerSize(),bloomBlock.getUncompressedSizeWithoutHeader(), hash, hashCount, type);} finally {// After the use return back the block if it was served from a cache.reader.returnBlock(bloomBlock);}if (numPositivesPerChunk != null && result) {// Update statistics. Only used in unit tests.++numPositivesPerChunk[block];}return result;}

BloomFilterUtil.contains方法中，通过不同的 BloomType，构建不同的BloomHashKey，然后读取bloomBuf中的bitvals，计算cell

对应类型的HashKey，判断在bitvals中是否为1.

  public static boolean contains(Cell cell, ByteBuff bloomBuf, int bloomOffset, int bloomSize,Hash hash, int hashCount, BloomType type) {HashKey hashKey = type == BloomType.ROWCOL ? new RowColBloomHashKey(cell): new RowBloomHashKey(cell);//最终的判断方法，实现还是比较简单return contains(bloomBuf, bloomOffset, bloomSize, hash, hashCount, hashKey);}

2.1.2write类

BloomFilterWriter类定义了一些写入方法


//在数据写入磁盘之前，压缩Bloom filter  
void compactBloom();//获得一个meta data 的Writer，写入Bloom TYpe 、数据大小等
Writable getMetaWriter();//获取一个 Bloom bits 的Writer
Writable getDataWriter();// previous cell written
Cell getPrevCell();

BloomFilterWriter的最终实现类为CompoundBloomFilterWriter。子类CompoundBloomFilterWriter的核心方法是append方法，负责将添加的数据写入到BloomFilter

  @Overridepublic void append(Cell cell) throws IOException {if (cell == null)throw new NullPointerException();enqueueReadyChunk(false);if (chunk == null) {if (firstKeyInChunk != null) {throw new IllegalStateException("First key in chunk already set: "+ Bytes.toStringBinary(firstKeyInChunk));}//第一添加时需要allocateNewChunk，Chunk动态添加，完成hash等// This will be done only once per chunkif (bloomType == BloomType.ROWCOL) {firstKeyInChunk =PrivateCellUtil.getCellKeySerializedAsKeyValueKey(PrivateCellUtil.createFirstOnRowCol(cell));} else {firstKeyInChunk = CellUtil.copyRow(cell);}allocateNewChunk();}//Chunk 实现具体的hash计算，和bit置位chunk.add(cell);this.prevCell = cell;++totalKeyCount;}

BloomFilterChunk继承自BloomFilterBase，实现了BloomFilter的读写，其中add方法实现写入

  public void add(Cell cell) {/** For faster hashing, use combinatorial generation* http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf*/int hash1;int hash2;HashKey hashKey;// 计算2次hash 写入位if (this.bloomType == BloomType.ROWCOL) {hashKey = new RowColBloomHashKey(cell);hash1 = this.hash.hash(hashKey, 0);hash2 = this.hash.hash(hashKey, hash1);} else {hashKey = new RowBloomHashKey(cell);hash1 = this.hash.hash(hashKey, 0);hash2 = this.hash.hash(hashKey, hash1);}setHashLoc(hash1, hash2);}

get方法完成数据查询，和之前BloomFilterUtil.contains方法一致

  static boolean get(int pos, ByteBuffer bloomBuf, int bloomOffset) {//实现位查找int bytePos = pos >> 3; //pos / 8int bitPos = pos & 0x7; //pos % 8// TODO access this via Util API which can do Unsafe access if possible(?)byte curByte = bloomBuf.get(bloomOffset + bytePos);curByte &= BloomFilterUtil.bitvals[bitPos];return (curByte != 0);}

3.布隆过滤器的使用

对于之前创建的布隆过滤器的使用，hbse中体现在两个地方，一个是构建scannner时，判断scanner的是否包含所需要的数据列或者列族.

在构建StoreFileScanner时，会通过shouldUseScanner方法判断，时都用到当前Scanner，其中用到了reader.passesBloomFilter的方法。

  public boolean shouldUseScanner(Scan scan, HStore store, long oldestUnexpiredTS) {// if the file has no entries, no need to validate or create a scanner.byte[] cf = store.getColumnFamilyDescriptor().getName();TimeRange timeRange = scan.getColumnFamilyTimeRange().get(cf);if (timeRange == null) {timeRange = scan.getTimeRange();}//从时间范围、startkey和endkey范围、bloomfilter判断return reader.passesTimerangeFilter(timeRange, oldestUnexpiredTS) && reader.passesKeyRangeFilter(scan) && reader.passesBloomFilter(scan, scan.getFamilyMap().get(cf));}

StoreFileReader在创建StoreFileScanner的时候创建，主要用来读取hfile文件。 passesBloomFilter方法当前Hfile的bloomFilter的类型，构建具体的bloomFilter。bloomFilter的类型是创建表时，列族中定义的。

boolean passesBloomFilter(Scan scan, final SortedSet columns) {byte[] row = scan.getStartRow();switch (this.bloomFilterType) {case ROW:if (!scan.isGetScan()) {return true;}return passesGeneralRowBloomFilter(row, 0, row.length);case ROWCOL:if (!scan.isGetScan()) {return true;}if (columns != null && columns.size() == 1) {byte[] column = columns.first();// create the required fake keyCell kvKey = PrivateCellUtil.createFirstOnRow(row, HConstants.EMPTY_BYTE_ARRAY, column);return passesGeneralRowColBloomFilter(kvKey);}// For multi-column queries the Bloom filter is checked from the// seekExact operation.return true;case ROWPREFIX_FIXED_LENGTH:return passesGeneralRowPrefixBloomFilter(scan);default:return true;}
}

passesGeneralRowBloomFilter方法中this.generalBloomFilter是创建reader时构建的BloomFilter。

  private boolean passesGeneralRowBloomFilter(byte[] row, int rowOffset, int rowLen) {BloomFilter bloomFilter = this.generalBloomFilter;if (bloomFilter == null) {return true;}// Used in ROW bloombyte[] key = null;if (rowOffset != 0 || rowLen != row.length) {throw new AssertionError("For row-only Bloom filters the row must occupy the whole array");}key = row;//判断row是否在本hfile中return checkGeneralBloomFilter(key, null, bloomFilter);}

checkGeneralBloomFilter方法中调用contains完成最终的判断。

本文来自互联网用户投稿，文章观点仅代表作者本人，不代表本站立场，不承担相关法律责任。如若转载，请注明出处。 如若内容造成侵权/违法违规/事实不符，请点击【内容举报】进行投诉反馈！

标签：技术

上一篇 > HBR 聪明人如何做决策
下一篇 > win10电脑录屏快捷键是什么？win10电脑录屏软件下载

Duilib中list控件支持ctrl和shif多行选中的实现

[ICML2015]Batch Normalization:Accelerating Deep Network Training by Reducing Internal Covariate Shif

win10系统微软输入法于eclipse ctrl+shif+f冲突间接处理办法

Codeforces Round #259 (Div. 2) B. Little Pony and Sort by Shif

读LDD3，内存映射与DMA--PAGE_SHIF…

VMware虚拟机安装XP【要先分区，再设置BOOT 启动CD，shif+上移】

更换iBus五笔的左与右Shif

sublime ctrl+shif+f 没用解决办法

idea 对 ctrl + z 的撤销是 ctrl + shif + z

计算机最早的设计师应用于,计算机应用基础选择题doc.doc

win10自带截图神器：Win+Shift+S

Python基础之文件目录操作

python简述目录_Python基础之文件目录操作(示例代码)

tp5 如何做数据采集

任务2-7(服务器字体+阿里巴巴矢量库)

html标签（1)：h1~h6,p,br,pre,hr

TI 电量计介绍与芯片选型指南

几款TI电源芯片简介

TI DSP芯片C2000系列读取FLASH数据

德州仪器(Ti)平台嵌入式开发基础

TI三相电机智能栅极驱动芯片特点分类

省选模拟（12.08） T3 圈圈圈圈圈圈圈圈

Hadoop生态圈技术栈（上）

大数据开发基础入门与项目实战（三）Hadoop核心及生态圈技术栈之6.Impala交互式查询

小猿圈之Linux下Mysql 操作命令

大数据Hadoop生态圈常用面试题

大数据开发基础入门与项目实战（三）Hadoop核心及生态圈技术栈之4.Hive DDL、DQL和数据操作

备战Noip2018模拟赛11（B组）T3 Monogatari 物语

【智能优化算法-圆圈搜索算法】基于圆圈搜索算法Circle Search Algorithm求解单目标优化问题附matlab代码

NYOJ 78 圈水池

递归问题跑道汽车绕圈问题 Python实现

Hadoop生态圈（三）：MapReduce