Bitcask

April 5, 2025

Bitcask是一个高性能的键值存储系统，设计之初的目的是提供高写入吞吐量和高效读取性能。采用了日志化结构哈希表(Log-Structured Hash Table)，核心是写前日志(WAL)、内存哈希表和定期merge.

具体细节均在官方的设计论文 Bitcask Design Paper 中可以查看。

官方论文中提到的bitcask应该具备的特性
low latency per item read or written
high throughput, especially when writing an incoming stream of random items
ability to handle datasets much larger than RAM w/o degradation
crash friendliness, both in terms of fast recovery and not losing data
ease of backup and restore
a relatively simple, understandable (and thus supportable) code structure and data format
predictable behavior under heavy access load or large volume
a license that allowed for easy default use in Riak

目前已经有优秀的开源实现：

Bitcask 核心

目录结构(bitcask)：作为一个目录，存储数据文件，仅允许一个进程同时写入
数据文件(datafile)：
- activefile: 当前用于写入的文件，达到阈值（1GB）时关闭，变成仅读文件
- readonlyfile: 仅读文件，存储旧数据条目
- dataentry: 写入数据文件的数据条目 => { crc | tstamp | key_sz | value_sz | key | value }
内存结构(keydir): 映射键值到数据文件中的位置，每次写入时原子更新
- index: [key] -> { file_id | value_sz | value_pos | tstamp }
合并(merge): 定期或者主动触发合并操作，遍历不可变文件，创建新的包含最新现有键的数据文件，同时生成提示文件(hint)
提示文件(hint): 包含数据文件中值的位置和大小信息，即保存keydir，加速启动流程
恢复崩溃(recover): 由于数据文件的不可变且作为提交日志，恢复简单，扫描提示文件进行恢复

在bitcask论文中提到，允许处理比内存大得多的数据集，且不显著降低性能，是因为其采用了将键和元数据存储在内存中，而将数据值存储在磁盘上，通过键值分离的设计，使其能够处理比内存大得多的数据集。但是因此bitcask将所有的键存储在内存中，这可能在键数量极多的时候成为瓶颈。