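RocketMQ File Storage: MappedFile
On the broker, message data such as the CommitLog, ConsumeQueue, and IndexFile is all stored through MappedFile, which wraps the underlying file operations.
A MappedFile corresponds to one storage file on disk and also wraps a MappedByteBuffer. All interaction between message storage and the disk or memory goes through it, so it serves as the abstraction over file operations.
For background on the I/O involved, see:
RocketMQ source analysis: zero copy
Zero copy in RocketMQ, Kafka, and Netty
ReferenceResource: file reference counting
ReferenceResource keeps a reference count for each file. This works much like a lock on the file: the file cannot be released while anything still holds a reference to it.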
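public abstract class ReferenceResource {
    // Number of references to the file buffer; starts at 1
    protected final AtomicLong refCount = new AtomicLong(1);
    // Liveness flag; once false, the resource can no longer be acquired
    protected volatile boolean available = true;
    // Set once the subclass's cleanup() has run and the resource is fully released
    protected volatile boolean cleanupOver = false;
    // Time of the first shutdown() call. The first attempt may not release the resource,
    // because other code may still depend on it (refCount > 0). Later calls pass an
    // interval; once currentTime - firstShutdownTimestamp >= interval, the resource
    // is closed forcibly.
    private volatile long firstShutdownTimestamp = 0;

    // Acquire a reference
    public synchronized boolean hold() {
        if (this.isAvailable()) {
            if (this.refCount.getAndIncrement() > 0) {
                return true;
            } else {
                // The count had already dropped to zero or below; undo the increment
                this.refCount.getAndDecrement();
            }
        }
        return false;
    }

    public void shutdown(final long intervalForcibly) {
        if (this.available) {
            this.available = false;
            // Record the time of the first shutdown
            this.firstShutdownTimestamp = System.currentTimeMillis();
            // Drop the initial reference
            this.release();
        } else if (this.getRefCount() > 0) {
            if ((System.currentTimeMillis() - this.firstShutdownTimestamp) >= intervalForcibly) {
                // Force the close
                this.refCount.set(-1000 - this.getRefCount());
                this.release();
            }
        }
    }

    // Release a reference
    public void release() {
        long value = this.refCount.decrementAndGet();
        if (value > 0)
            return;
        // Nothing else depends on the resource any more, so it can be released
        synchronized (this) {
            this.cleanupOver = this.cleanup(value);
        }
    }

    // Subclasses implement the actual release logic
    public abstract boolean cleanup(final long currentRef);
}
ReferenceResource implements reference counting for file resources.
It also abstracts how a file resource is released; the actual release logic is left to the subclass (MappedFile).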
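The hold/release/shutdown life cycle can be tried in isolation. The sketch below is a simplified stand-in, not RocketMQ's class: `RefCountedDemo` and its `cleanedUp` flag are hypothetical names, and the forced-shutdown branch is left out.

```java
import java.util.concurrent.atomic.AtomicLong;

// Simplified, hypothetical stand-in for ReferenceResource (no forced shutdown).
class RefCountedDemo {
    private final AtomicLong refCount = new AtomicLong(1); // the owner holds one reference
    private volatile boolean available = true;
    private volatile boolean cleanedUp = false;

    // Take a reference; refused once shutdown() has been called.
    synchronized boolean hold() {
        if (available) {
            if (refCount.getAndIncrement() > 0) {
                return true;
            }
            refCount.getAndDecrement(); // undo: the count had already reached zero
        }
        return false;
    }

    // Drop a reference; the last one out triggers cleanup.
    void release() {
        if (refCount.decrementAndGet() > 0) {
            return;
        }
        synchronized (this) {
            cleanedUp = true; // a real subclass would unmap its buffer here
        }
    }

    // Mark the resource unavailable and drop the owner's reference.
    void shutdown() {
        if (available) {
            available = false;
            release();
        }
    }

    boolean isCleanedUp() {
        return cleanedUp;
    }

    long getRefCount() {
        return refCount.get();
    }

    public static void main(String[] args) {
        RefCountedDemo r = new RefCountedDemo();
        System.out.println("reader hold: " + r.hold());
        r.shutdown(); // the reader still holds a reference, so no cleanup yet
        System.out.println("cleaned up after shutdown: " + r.isCleanedUp());
        System.out.println("new hold after shutdown: " + r.hold());
        r.release(); // the last reference goes away
        System.out.println("cleaned up after release: " + r.isCleanedUp());
    }
}
```

The point of the design is that shutdown() never pulls a buffer out from under a reader: cleanup runs only when the last holder calls release().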
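MappedFile: the file operation abstraction layer
MappedFile is built on FileChannel and ByteBuffer. It normally works with off-heap memory and memory mapping (mmap): the file is mapped into the process's virtual address space, so reading and writing that memory operates directly on the physical pages the operating system keeps for the file. This saves one memory copy between a user-space buffer and the kernel.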
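A minimal sketch of memory-mapped writing with the standard JDK API, independent of RocketMQ. The class name `MmapDemo`, the temp-file name, and the 4096-byte mapping size are choices made for this illustration only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MmapDemo {
    // Writes text through a memory mapping, then reads it back with ordinary file I/O.
    // The text must fit in the 4096-byte mapping used here.
    static String writeThroughMapping(String text) {
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        try {
            Path path = Files.createTempFile("mmap-demo", ".bin");
            path.toFile().deleteOnExit();
            try (FileChannel ch = FileChannel.open(path,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                // Map the first 4096 bytes of the file for reading and writing.
                MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
                map.put(data); // a plain memory write, with no explicit write() call
                map.force();   // ask the OS to write the dirty page to the storage device
            }
            byte[] onDisk = Files.readAllBytes(path);
            return new String(onDisk, 0, data.length, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeThroughMapping("hello mmap"));
    }
}
```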
public class MappedFile extends ReferenceResource {
    // Memory page size: 4 KB
    public static final int OS_PAGE_SIZE = 1024 * 4;
    // Total virtual memory used by all MappedFiles in this process
    private static final AtomicLong TOTAL_MAPPED_VIRTUAL_MEMORY = new AtomicLong(0);
    // Number of MappedFiles in this process
    private static final AtomicInteger TOTAL_MAPPED_FILES = new AtomicInteger(0);
    // Write position of this MappedFile
    protected final AtomicInteger wrotePosition = new AtomicInteger(0);
    // Commit position of this MappedFile
    protected final AtomicInteger committedPosition = new AtomicInteger(0);
    // Flush position of this MappedFile. Data before flushedPosition is safely on disk;
    // data between flushedPosition and wrotePosition is still in dirty pages.
    private final AtomicInteger flushedPosition = new AtomicInteger(0);
    // File size
    protected int fileSize;
    // File channel
    protected FileChannel fileChannel;
    //Message will put to here first, and then reput to FileChannel if writeBuffer is not null.
    // Buffer borrowed from TransientStorePool; null when writes go straight to mappedByteBuffer
    protected ByteBuffer writeBuffer = null;
    protected TransientStorePool transientStorePool = null;
    // File name (for CommitLog and ConsumeQueue, the name is the physical offset of the first message)
    private String fileName;
    // Offset of the start of this file, parsed from the file name
    private long fileFromOffset;
    // File object
    private File file;
    // Memory-mapped buffer
    private MappedByteBuffer mappedByteBuffer;
    private volatile long storeTimestamp = 0;
    // true if this file is the first valid file in its directory
    private boolean firstCreateInQueue = false;

    public MappedFile() {
    }

    public MappedFile(final String fileName, final int fileSize) throws IOException {
        // File name and file size
        init(fileName, fileSize);
    }
}
Initialization
public void init(final String fileName, final int fileSize,
                 final TransientStorePool transientStorePool) throws IOException {
    init(fileName, fileSize);
    // Borrow a write buffer from the pool
    this.writeBuffer = transientStorePool.borrowBuffer();
    this.transientStorePool = transientStorePool;
}

private void init(final String fileName, final int fileSize) throws IOException {
    this.fileName = fileName;
    this.fileSize = fileSize;
    // File
    this.file = new File(fileName);
    // Starting offset of the file, taken from its name
    this.fileFromOffset = Long.parseLong(this.file.getName());
    boolean ok = false;
    // Make sure the directory that contains fileName exists
    ensureDirOK(this.file.getParent());
    try {
        // Open a readable and writable file channel
        this.fileChannel = new RandomAccessFile(this.file, "rw").getChannel();
        // Memory-mapped buffer for the file
        this.mappedByteBuffer = this.fileChannel.map(MapMode.READ_WRITE, 0, fileSize);
        // Update the process-wide counters (total mapped virtual memory, number of mapped files)
        TOTAL_MAPPED_VIRTUAL_MEMORY.addAndGet(fileSize);
        TOTAL_MAPPED_FILES.incrementAndGet();
        ok = true;
    } catch (FileNotFoundException e) {
        log.error("Failed to create file " + this.fileName, e);
        throw e;
    } catch (IOException e) {
        log.error("Failed to map file " + this.fileName, e);
        throw e;
    } finally {
        if (!ok && this.fileChannel != null) {
            this.fileChannel.close();
        }
    }
}
In short, init():
Ensures the directory exists
Creates a FileChannel for the file path
Creates the MappedByteBuffer memory mapping for the file
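Because init() derives fileFromOffset by parsing the file name, the name has to be a plain decimal number. A small sketch of that convention follows. The 20-digit zero padding, the 1 GB file size, and the helper name `fileNameFor` are assumptions made for this illustration and are not taken from the code shown here.

```java
import java.util.Locale;

class OffsetFileNameDemo {
    // Hypothetical helper: zero-padded, 20-digit name for a file that starts at `offset`.
    static String fileNameFor(long offset) {
        return String.format(Locale.ROOT, "%020d", offset);
    }

    // Mirrors the Long.parseLong(file.getName()) step in init().
    static long fileFromOffset(String fileName) {
        return Long.parseLong(fileName);
    }

    public static void main(String[] args) {
        long fileSize = 1024L * 1024 * 1024; // assumed file size of 1 GB
        for (int i = 0; i < 3; i++) {
            String name = fileNameFor(i * fileSize);
            System.out.println(name + " starts at offset " + fileFromOffset(name));
        }
    }
}
```

With names like these, the file that holds a given physical offset can be located from the offset and the fixed file size alone, and a position inside the file is simply the offset minus fileFromOffset.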
Writing messages to the file
Writing raw bytes directly
public boolean appendMessage(final byte[] data) {
    int currentPos = this.wrotePosition.get();
    if ((currentPos + data.length) <= this.fileSize) {
        try {
            ByteBuffer buf = this.mappedByteBuffer.slice();
            buf.position(currentPos);
            // Write directly
            buf.put(data);
        } catch (Throwable e) {
            log.error("Error occurred when append message to mappedFile.", e);
        }
        // Advance the write position
        this.wrotePosition.addAndGet(data.length);
        return true;
    }
    return false;
}
Appending messages
public AppendMessageResult appendMessagesInner(final MessageExt messageExt, final AppendMessageCallback cb,
                                               PutMessageContext putMessageContext) {
    assert messageExt != null;
    assert cb != null;
    // Current write position of this file
    int currentPos = this.wrotePosition.get();
    // True means the file still has room
    if (currentPos < this.fileSize) {
        // Take a slice of whichever buffer is in use
        ByteBuffer byteBuffer = writeBuffer != null ? writeBuffer.slice() : this.mappedByteBuffer.slice();
        // Move the slice to the write position
        byteBuffer.position(currentPos);
        AppendMessageResult result;
        // The actual write logic is delegated to the callback
        if (messageExt instanceof MessageExtBrokerInner) {
            // Single message
            result = cb.doAppend(this.getFileFromOffset(), byteBuffer, this.fileSize - currentPos,
                (MessageExtBrokerInner) messageExt, putMessageContext);
        } else if (messageExt instanceof MessageExtBatch) {
            // Batch of messages
            result = cb.doAppend(this.getFileFromOffset(), byteBuffer, this.fileSize - currentPos,
                (MessageExtBatch) messageExt, putMessageContext);
        } else {
            return new AppendMessageResult(AppendMessageStatus.UNKNOWN_ERROR);
        }
        // Advance the write position
        this.wrotePosition.addAndGet(result.getWroteBytes());
        // Record the store time
        this.storeTimestamp = result.getStoreTimestamp();
        return result;
    }
    log.error("MappedFile.appendMessage return null, wrotePosition: {} fileSize: {}", currentPos, this.fileSize);
    return new AppendMessageResult(AppendMessageStatus.UNKNOWN_ERROR);
}
This method hands the actual append logic to a callback. The callback reports how many bytes it wrote, and that value is used to advance the write position.
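The callback pattern can be sketched without any RocketMQ types. In the example below, `AppendDemo`, the `AppendCallback` interface, and the length-prefixed framing are all hypothetical and pared down; the real callback takes more parameters and returns an AppendMessageResult.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

class AppendDemo {
    // Hypothetical callback: write into buf and return the bytes written (0 if there is no room).
    interface AppendCallback {
        int doAppend(ByteBuffer buf, int maxBlank, byte[] body);
    }

    // Example framing: a 4-byte length followed by the body.
    static final AppendCallback LENGTH_PREFIXED = (buf, maxBlank, body) -> {
        int total = 4 + body.length;
        if (total > maxBlank) {
            return 0;
        }
        buf.putInt(body.length).put(body);
        return total;
    };

    private final ByteBuffer store; // stands in for the mapped buffer
    private final AtomicInteger wrotePosition = new AtomicInteger(0);

    AppendDemo(int fileSize) {
        this.store = ByteBuffer.allocate(fileSize);
    }

    int append(byte[] body, AppendCallback cb) {
        int currentPos = wrotePosition.get();
        if (currentPos >= store.capacity()) {
            return 0; // the file is full
        }
        ByteBuffer slice = store.slice(); // shares content, has its own position
        slice.position(currentPos);
        int wrote = cb.doAppend(slice, store.capacity() - currentPos, body);
        wrotePosition.addAndGet(wrote);   // the owner, not the callback, advances the position
        return wrote;
    }

    int getWrotePosition() {
        return wrotePosition.get();
    }

    public static void main(String[] args) {
        AppendDemo file = new AppendDemo(16);
        System.out.println("first append wrote " + file.append(new byte[5], LENGTH_PREFIXED));
        System.out.println("second append wrote " + file.append(new byte[5], LENGTH_PREFIXED));
        System.out.println("wrotePosition = " + file.getWrotePosition());
    }
}
```

Keeping the position bookkeeping in the file object and the encoding in the callback lets one file class serve different on-disk formats.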
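Flushing to disk
Appended data first lands in the system's mapped buffer behind the MappedFile. When that data actually reaches the disk is normally decided by the operating system's I/O scheduling, so to guarantee that data is persisted the operating system provides system calls that force a flush.
A flush can also specify a minimum number of pages to flush (4 KB per page by default).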
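The page threshold is computed with integer division on byte positions, which has a consequence worth seeing in numbers. The sketch below restates the rule used by isAbleToFlush() as a standalone function; the class name `FlushThresholdDemo` is hypothetical and the "file is full" shortcut is omitted.

```java
class FlushThresholdDemo {
    static final int OS_PAGE_SIZE = 4 * 1024;

    // Same rule as isAbleToFlush(), minus the "file is full" shortcut.
    static boolean isAbleToFlush(int flushed, int written, int flushLeastPages) {
        if (flushLeastPages > 0) {
            // Compare page indexes, not byte counts.
            return (written / OS_PAGE_SIZE) - (flushed / OS_PAGE_SIZE) >= flushLeastPages;
        }
        // flushLeastPages == 0: flush whenever anything is unflushed.
        return written > flushed;
    }

    public static void main(String[] args) {
        System.out.println(isAbleToFlush(0, 16383, 4));   // 3 whole pages written
        System.out.println(isAbleToFlush(0, 16384, 4));   // 4 whole pages written
        System.out.println(isAbleToFlush(4095, 4096, 1)); // 1 dirty byte that crosses a page boundary
        System.out.println(isAbleToFlush(0, 1, 0));       // forced flush
    }
}
```

Because both positions are truncated to page indexes before subtracting, a single dirty byte that crosses a page boundary already counts as one page, while almost a full page of dirty data inside one page counts as zero.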
/**
 * @param flushLeastPages minimum number of pages to flush; 0 forces a flush
 * @return The current flushed position
 */
public int flush(final int flushLeastPages) {
    if (this.isAbleToFlush(flushLeastPages)) {
        // Hold a reference so the resource is not released during the flush
        if (this.hold()) {
            // Readable position at this moment
            int value = getReadPosition();
            try {
                //We only append data to fileChannel or mappedByteBuffer, never both.
                if (writeBuffer != null || this.fileChannel.position() != 0) {
                    this.fileChannel.force(false);
                } else {
                    this.mappedByteBuffer.force();
                }
            } catch (Throwable e) {
                log.error("Error occurred when force data to disk.", e);
            }
            this.flushedPosition.set(value);
            this.release();
        } else {
            log.warn("in flush, hold failed, flush offset = " + this.flushedPosition.get());
            this.flushedPosition.set(getReadPosition());
        }
    }
    return this.getFlushedPosition();
}

/**
 * @param flushLeastPages minimum number of pages to flush (0: force a flush;
 *                        > 0: flush only once the dirty pages reach flushLeastPages)
 * @return whether a flush should be performed
 */
private boolean isAbleToFlush(final int flushLeastPages) {
    // Position already flushed
    int flush = this.flushedPosition.get();
    // Current readable (written) position
    int write = getReadPosition();
    // Is the file full?
    if (this.isFull()) {
        return true;
    }
    if (flushLeastPages > 0) {
        // Have the dirty pages reached the minimum page count?
        return ((write / OS_PAGE_SIZE) - (flush / OS_PAGE_SIZE)) >= flushLeastPages;
    }
    // flushLeastPages == 0: force a flush if anything is unflushed
    return write > flush;
}
Committing messages
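A separate commit step is only needed when writeBuffer is in use, that is, when messages are first written to a buffer borrowed from TransientStorePool and must then be copied into the FileChannel. When writes go straight to the mapped buffer there is nothing to commit: the operating system writes the pages back on its own, and whatever has been written counts as committed.
With writeBuffer as the staging buffer: committed position = committedPosition
With the mapped buffer only: committed position = wrotePosition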
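The copy step at the heart of a commit can be sketched with plain NIO. This is a simplified illustration under stated assumptions: `CommitDemo` is a hypothetical class, the 64-byte direct buffer stands in for a pooled write buffer, and the reference counting and page thresholds are left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class CommitDemo {
    // Copies bytes [committed, written) of writeBuffer into the channel and
    // returns the new committed position.
    static int commit(ByteBuffer writeBuffer, FileChannel channel, int committed, int written) {
        if (written - committed <= 0) {
            return committed; // nothing new to commit
        }
        ByteBuffer slice = writeBuffer.slice(); // writeBuffer's own position is assumed to stay at 0
        slice.position(committed);
        slice.limit(written);
        try {
            channel.position(committed);
            while (slice.hasRemaining()) {
                channel.write(slice);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return written;
    }

    // Stages "hello world" in a buffer, commits it in two steps, and returns the file content.
    static String demo() {
        byte[] data = "hello world".getBytes(StandardCharsets.UTF_8);
        ByteBuffer writeBuffer = ByteBuffer.allocateDirect(64);
        writeBuffer.slice().put(data); // write through a slice so writeBuffer's position stays 0
        try {
            Path path = Files.createTempFile("commit-demo", ".bin");
            path.toFile().deleteOnExit();
            try (FileChannel ch = FileChannel.open(path, StandardOpenOption.WRITE)) {
                int committed = commit(writeBuffer, ch, 0, 5);
                commit(writeBuffer, ch, committed, data.length);
            }
            return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Only data that has passed through this step is in the FileChannel, which is why the readable position falls back to the committed position whenever a write buffer is in use.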
public int commit(final int commitLeastPages) {
    // Only relevant when writeBuffer is in use; otherwise return the write position
    if (writeBuffer == null) {
        //no need to commit data to file channel, so just regard wrotePosition as committedPosition.
        return this.wrotePosition.get();
    }
    // Has enough data accumulated for a batch commit?
    if (this.isAbleToCommit(commitLeastPages)) {
        if (this.hold()) {
            commit0();
            this.release();
        } else {
            log.warn("in commit, hold failed, commit offset = " + this.committedPosition.get());
        }
    }
    // All dirty data has been committed to FileChannel.
    if (writeBuffer != null && this.transientStorePool != null && this.fileSize == this.committedPosition.get()) {
        this.transientStorePool.returnBuffer(writeBuffer);
        this.writeBuffer = null;
    }
    return this.committedPosition.get();
}

protected void commit0() {
    int writePos = this.wrotePosition.get();
    int lastCommittedPosition = this.committedPosition.get();
    if (writePos - lastCommittedPosition > 0) {
        try {
            ByteBuffer byteBuffer = writeBuffer.slice();
            byteBuffer.position(lastCommittedPosition);
            byteBuffer.limit(writePos);
            this.fileChannel.position(lastCommittedPosition);
            // Write the contents of writeBuffer to the file channel
            this.fileChannel.write(byteBuffer);
            this.committedPosition.set(writePos);
        } catch (Throwable e) {
            log.error("Error occurred when commit data to FileChannel.", e);
        }
    }
}

protected boolean isAbleToCommit(final int commitLeastPages) {
    int commit = this.committedPosition.get();
    int write = this.wrotePosition.get();
    // The file is full
    if (this.isFull()) {
        return true;
    }
    // Has the uncommitted data reached the minimum page count?
    if (commitLeastPages > 0) {
        return ((write / OS_PAGE_SIZE) - (commit / OS_PAGE_SIZE)) >= commitLeastPages;
    }
    return write > commit;
}
Reading content
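Reading messages from the file returns a slice of the buffer and increments the file's reference count, so the file cannot be released while the caller is still using the slice.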
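The double-slice idiom used for reads can be tried on an ordinary heap buffer. In this sketch `SliceReadDemo`, `select`, and `selectAsString` are hypothetical names, and the reference counting is left out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class SliceReadDemo {
    // Returns a view of `size` bytes starting at `pos`, or null if that range
    // lies beyond the readable position.
    static ByteBuffer select(ByteBuffer store, int readPosition, int pos, int size) {
        if (pos < 0 || size < 0 || pos + size > readPosition) {
            return null;
        }
        ByteBuffer view = store.slice();  // independent position, shared content
        view.position(pos);
        ByteBuffer result = view.slice(); // index 0 of result is index pos of store
        result.limit(size);
        return result;
    }

    // Convenience wrapper for the demo: decodes the selected range as text.
    static String selectAsString(String content, int readPosition, int pos, int size) {
        ByteBuffer store = ByteBuffer.wrap(content.getBytes(StandardCharsets.UTF_8));
        ByteBuffer selected = select(store, readPosition, pos, size);
        if (selected == null) {
            return null;
        }
        byte[] out = new byte[selected.remaining()];
        selected.get(out);
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(selectAsString("0123456789", 8, 2, 3)); // inside the readable range
        System.out.println(selectAsString("0123456789", 8, 6, 3)); // runs past the readable position
    }
}
```

No bytes are copied by select(): the caller gets a window onto the same memory, which is exactly why the owner must not unmap that memory until the caller releases its reference.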
public SelectMappedBufferResult selectMappedBuffer(int pos, int size) {
    // Readable position (wrotePosition or committedPosition)
    int readPosition = getReadPosition();
    // The requested range is fully readable
    if ((pos + size) <= readPosition) {
        // Increment the reference count
        if (this.hold()) {
            // Return a slice
            ByteBuffer byteBuffer = this.mappedByteBuffer.slice();
            byteBuffer.position(pos);
            ByteBuffer byteBufferNew = byteBuffer.slice();
            byteBufferNew.limit(size);
            return new SelectMappedBufferResult(this.fileFromOffset + pos, byteBufferNew, size, this);
        }
    }
    return null;
}

// Read everything from the given position onward
public SelectMappedBufferResult selectMappedBuffer(int pos) {
    // Readable position
    int readPosition = getReadPosition();
    // The position is valid
    if (pos < readPosition && pos >= 0) {
        // Increment the reference count
        if (this.hold()) {
            // Slice
            ByteBuffer byteBuffer = this.mappedByteBuffer.slice();
            // Set the position
            byteBuffer.position(pos);
            // Readable size
            int size = readPosition - pos;
            // Slice again so the result starts at pos
            ByteBuffer byteBufferNew = byteBuffer.slice();
            byteBufferNew.limit(size);
            return new SelectMappedBufferResult(this.fileFromOffset + pos, byteBufferNew, size, this);
        }
    }
    return null;
}

public int getReadPosition() {
    return this.writeBuffer == null ? this.wrotePosition.get() : this.committedPosition.get();
}
Destroying a MappedFile
When a file is no longer needed, or has expired and must be destroyed, it is physically deleted.
public boolean destroy(final long intervalForcibly) {
    // Shut down the MappedFile
    this.shutdown(intervalForcibly);
    if (this.isCleanupOver()) {
        try {
            // Close the channel
            this.fileChannel.close();
            // Physically delete the file
            boolean result = this.file.delete();
        } catch (Exception e) {
            log.warn("close file channel " + this.fileName + " Failed. ", e);
        }
        return true;
    } else {
        log.warn("destroy mapped file[REF:" + this.getRefCount() + "] " + this.fileName
            + " Failed. cleanupOver: " + this.cleanupOver);
    }
    return false;
}
For shutdown(), see the implementation in the parent class ReferenceResource.
Once cleanup is complete, the physical file on disk is deleted.
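Warming up a file
With file warm-up enabled, RocketMQ writes data into the file right after creating the memory mapping, which loads the file's contents into the page cache. Later message writes and reads then hit the page cache directly instead of triggering repeated page faults.
One 0 byte is written to each page:
public void warmMappedFile(FlushDiskType type, int pages) {
    long beginTime = System.currentTimeMillis();
    // 1. Create a new byte buffer whose content starts at this buffer's current position.
    //    Changes made through this buffer are visible in the new buffer.
    ByteBuffer byteBuffer = this.mappedByteBuffer.slice();
    int flush = 0;
    long time = System.currentTimeMillis();
    // OS_PAGE_SIZE is 4 KB
    for (int i = 0, j = 0; i < this.fileSize; i += MappedFile.OS_PAGE_SIZE, j++) {
        // 2. Write one 0 byte every 4 KB
        byteBuffer.put(i, (byte) 0);
        // 3. With synchronous flushing, force a flush whenever the number of unflushed
        //    pages reaches `pages`. With pages = 4096 that is 4096 * 4 KB = 16 MB,
        //    so a flush runs for every 16 MB of unflushed data.
        // force flush when flush disk type is sync
        if (type == FlushDiskType.SYNC_FLUSH) {
            if ((i / OS_PAGE_SIZE) - (flush / OS_PAGE_SIZE) >= pages) {
                flush = i;
                mappedByteBuffer.force();
            }
        }
        // 4. Every 1000 pages, call Thread.sleep(0) so the thread gives up the CPU
        //    and does not keep lower-priority threads from running.
        // prevent gc
        if (j % 1000 == 0) {
            log.info("j={}, costTime={}", j, System.currentTimeMillis() - time);
            time = System.currentTimeMillis();
            try {
                Thread.sleep(0);
            } catch (InterruptedException e) {
                log.error("Interrupted", e);
            }
        }
    }
    // 5. With synchronous flushing, flush whatever has not reached the disk yet
    // force flush when prepare load finished
    if (type == FlushDiskType.SYNC_FLUSH) {
        log.info("mapped file warm-up done, force to disk, mappedFile={}, costTime={}",
            this.getFileName(), System.currentTimeMillis() - beginTime);
        mappedByteBuffer.force();
    }
    log.info("mapped file warm-up done. mappedFile={}, costTime={}", this.getFileName(),
        System.currentTimeMillis() - beginTime);
    // 6. Lock the memory
    this.mlock();
}
Why does MappedByteBuffer write a 0 byte every 4 KB?
After mmap is called, the OS has only set up the mapping between virtual addresses and the file. It has not actually loaded any of the MappedFile's data into memory.
If nothing is loaded in advance, then whenever the program accesses the data the OS checks whether the corresponding page is already in memory, and raises a page fault if it is not. A 1 GB CommitLog would therefore need more than 260,000 page faults (1 GB / 4 KB = 262,144) before all of its data is in physical memory; on x86 Linux a standard page is 4 KB.
That is why a dummy value (byte 0) is written into every memory page. In the warmMappedFile() source above, MappedByteBuffer writes a 0 byte every 4 KB, and 4 KB is exactly one page, so the loop touches each page once and brings the whole MappedFile into memory. This is the file warm-up.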
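The page count above and the touch loop can both be checked with a few lines of JDK-only code. This is a reduced sketch, not RocketMQ's method: `WarmupDemo` is a hypothetical class, the mapped file is a small temp file, and flushing, yielding, and mlock are left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class WarmupDemo {
    static final int OS_PAGE_SIZE = 4 * 1024;

    // Number of pages (and therefore first-touch page faults) in a file of this size.
    static long pagesIn(long fileSize) {
        return (fileSize + OS_PAGE_SIZE - 1) / OS_PAGE_SIZE;
    }

    // Maps a temp file of fileSize bytes, writes one 0 byte per page,
    // and returns how many pages were touched.
    static int warm(int fileSize) {
        try {
            Path path = Files.createTempFile("warm-demo", ".bin");
            path.toFile().deleteOnExit();
            try (FileChannel ch = FileChannel.open(path,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, fileSize);
                int touched = 0;
                for (int i = 0; i < fileSize; i += OS_PAGE_SIZE) {
                    map.put(i, (byte) 0); // absolute put: one write per page
                    touched++;
                }
                return touched;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("pages in 1 GB: " + pagesIn(1L << 30));
        System.out.println("pages touched in 64 KB: " + warm(64 * 1024));
    }
}
```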
Memory locking
// Lock the memory
public void mlock() {
    final long beginTime = System.currentTimeMillis();
    // Start address of the mapped memory
    final long address = ((DirectBuffer) (this.mappedByteBuffer)).address();
    Pointer pointer = new Pointer(address);
    {
        // Lock
        int ret = LibC.INSTANCE.mlock(pointer, new NativeLong(this.fileSize));
        log.info("mlock {} {} {} ret = {} time consuming = {}", address, this.fileName, this.fileSize, ret, System.currentTimeMillis() - beginTime);
    }
    {
        // Advise the kernel on how the memory will be used: MADV_WILLNEED suggests
        // loading the data into memory because it will probably be needed soon
        int ret = LibC.INSTANCE.madvise(pointer, new NativeLong(this.fileSize), LibC.MADV_WILLNEED);
        log.info("madvise {} {} {} ret = {} time consuming = {}", address, this.fileName, this.fileSize, ret, System.currentTimeMillis() - beginTime);
    }
}

// Unlock the memory
public void munlock() {
    final long beginTime = System.currentTimeMillis();
    final long address = ((DirectBuffer) (this.mappedByteBuffer)).address();
    Pointer pointer = new Pointer(address);
    // Unlock
    int ret = LibC.INSTANCE.munlock(pointer, new NativeLong(this.fileSize));
    log.info("munlock {} {} {} ret = {} time consuming = {}", address, this.fileName, this.fileSize, ret, System.currentTimeMillis() - beginTime);
}
mlock() locks part or all of a process's address space into physical memory and prevents it from being swapped out.
A high-throughput distributed message queue like RocketMQ aims for low-latency message reads and writes, so it wants the data it uses to stay in physical memory rather than be moved to swap. That keeps reads and writes fast.
LibC.INSTANCE.mlock: locks the given memory region so that the operating system does not move it to swap.
LibC.INSTANCE.madvise: gives the kernel I/O advice about an address range. The kernel may follow the advice and perform some read-ahead. For example, MADV_WILLNEED says the range is expected to be accessed soon, advising the OS to preload as much of the mapped file's data into memory as it can. This reduces page faults and so achieves the memory warm-up effect.
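Calling mlock and madvise needs native bindings, as in the LibC calls above. For comparison only, the JDK offers a related but different facility: MappedByteBuffer.load(), which is documented as a best-effort attempt to bring a mapping's content into physical memory. It does not pin pages the way mlock does, and isLoaded() is only a hint. The class `PreloadDemo` below is a hypothetical illustration of that JDK call, not a substitute for the code above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class PreloadDemo {
    // Maps a temp file of `size` bytes, asks the JDK to preload it,
    // and returns the capacity of the mapping.
    static int preload(int size) {
        try {
            Path path = Files.createTempFile("preload-demo", ".bin");
            path.toFile().deleteOnExit();
            try (FileChannel ch = FileChannel.open(path,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
                map.load(); // best effort: nothing is pinned, pages can still be evicted later
                // isLoaded() is a hint and may return false even right after load()
                System.out.println("isLoaded hint: " + map.isLoaded());
                return map.capacity();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("mapped bytes: " + preload(64 * 1024));
    }
}
```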
