MongoDB ObjectID Analysis
- Official algorithm requirements
- python
- `_id` algorithm
- Who generates `_id`
- go
- Conclusion
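First, some background. I am preparing to restructure an old database, which involves migrating the data from n collections (on the order of millions) into m collections (1 <= m <= 100). One question has to be settled first: can ids collide? Does MongoDB's id algorithm include any collection information that guarantees ids are unique across all collections?
The conclusion up front: there will be no collisions, and the id contains no collection information. Put differently, even if collisions were possible, the collection would do almost nothing to reduce them.
Official algorithm requirements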
A 4-byte timestamp, representing the ObjectId's creation, measured in seconds since the Unix epoch.
A 5-byte random value generated once per process. This random value is unique to the machine and process.
A 3-byte incrementing counter, initialized to a random value.
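In short: a 4-byte timestamp + a 5-byte random value (required to be unique per machine and process) + a 3-byte incrementing counter.
The 4-byte timestamp makes ids roughly increasing and unique at one-second granularity; it runs out in 2106.
The 5-byte random value may look like a hidden risk at first, but an algorithm can still ensure it is unique per machine and process.
The 3-byte incrementing counter, together with its random initial value, further reduces the probability of a collision. It allows 16,777,216 values per second in total, or 16,777.216 per millisecond.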
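The layout described above can be sketched in a few lines of Python. This is only an illustration built on the standard library, not any driver's implementation: `make_object_id` and `parse_object_id` are hypothetical helpers invented for this example, and `os.urandom` is an assumed stand-in for whatever random source a real driver uses.

```python
import itertools
import os
import struct
import time
from datetime import datetime, timezone

# 5 random bytes, drawn once per process (assumption: os.urandom as the source).
_PROCESS_RANDOM = os.urandom(5)
# 3-byte counter that starts at a random value.
_counter = itertools.count(int.from_bytes(os.urandom(3), "big"))


def make_object_id(now=None):
    """Build a 12-byte value laid out like an ObjectId (illustrative only)."""
    seconds = int(time.time() if now is None else now)
    ts = struct.pack(">I", seconds & 0xFFFFFFFF)           # 4-byte big-endian timestamp
    inc = (next(_counter) & 0xFFFFFF).to_bytes(3, "big")  # 3-byte counter, wraps at 2**24
    return ts + _PROCESS_RANDOM + inc


def parse_object_id(oid):
    """Split the 12 bytes back into (timestamp, random bytes, counter)."""
    seconds = struct.unpack(">I", oid[0:4])[0]
    return seconds, oid[4:9], int.from_bytes(oid[9:12], "big")


if __name__ == "__main__":
    oid = make_object_id()
    print(oid.hex(), parse_object_id(oid))
    # Largest timestamp that fits in 4 unsigned bytes; it falls in the year 2106.
    print(datetime.fromtimestamp(2**32 - 1, tz=timezone.utc))
```

Nothing in the 12 bytes depends on a database or collection name, which is why the collection cannot help (or hurt) uniqueness.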
To get a feel for what this counter rate means, here is an example in Python:
(The session assumes `import time` and `from bson import objectid` were run in earlier cells.)
In [9]: def func():
   ...:     st = time.time()
   ...:     for i in range(16777):
   ...:         i += 1
   ...:     return (time.time() - st) * 1000
   ...:
In [12]: func()
Out[12]: 2.1560192108154297
In [14]: def func():
    ...:     st = time.time()
    ...:     for i in range(16777):
    ...:         objectid.ObjectId()
    ...:     return (time.time() - st) * 1000
    ...:
In [15]: func()
Out[15]: 49.735069274902344
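In other words, generating 16777 ids takes about 50 ms here, while the counter only allows that many per millisecond. Python's generation code would have to run about 50 times faster, doing nothing but producing ids, before a collision became possible.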
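The factor of roughly 50 follows directly from the two numbers above. The short calculation below uses only the measurement quoted in this post (49.735 ms for 16777 ids); the timing will differ on other machines, so treat the result as an estimate.

```python
COUNTER_VALUES = 2 ** 24                 # 3-byte counter: 16,777,216 values per second
ids_per_ms_limit = COUNTER_VALUES / 1000

measured_ms = 49.735069274902344         # time quoted above for 16777 ObjectId() calls
ids_generated = 16777
measured_ids_per_ms = ids_generated / measured_ms

# How much faster generation would have to be to use up the counter in one second.
speedup_needed = ids_per_ms_limit / measured_ids_per_ms

print(f"limit: {ids_per_ms_limit:.3f} ids/ms")
print(f"measured: {measured_ids_per_ms:.1f} ids/ms")
print(f"speed-up needed to exhaust the counter: {speedup_needed:.1f}x")
```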
python
_id algorithm
In everyday use we write from bson import objectid, so the relevant file is bson/objectid.py
class ObjectId(object):
    """A MongoDB ObjectId."""

    _inc = random.randint(0, 0xFFFFFF)
    _inc_lock = threading.Lock()

    _machine_bytes = _machine_bytes()

    __slots__ = ('__id')

    _type_marker = 7

    def __init__(self, oid=None):
        """Initialize a new ObjectId.

        An ObjectId is a 12-byte unique identifier consisting of:

          - a 4-byte value representing the seconds since the Unix epoch,
          - a 3-byte machine identifier,
          - a 2-byte process id, and
          - a 3-byte counter, starting with a random value.

        ...
        """
        if oid is None:
            self.__generate()
        else:
            self.__validate(oid)
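Start with the docstring. It follows the same overall layout as MongoDB's id generation rules, except that the middle 5 bytes are split into a 3-byte machine identifier and a 2-byte process id. Next, when no oid is passed, a function is called to generate one: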
    def __generate(self):
        """Generate a new value for this ObjectId."""
        oid = EMPTY

        # 4 bytes current time
        oid += struct.pack(">i", int(time.time()))

        # 3 bytes machine
        oid += ObjectId._machine_bytes

        # 2 bytes pid
        oid += struct.pack(">H", os.getpid() % 0xFFFF)

        # 3 bytes inc
        ObjectId._inc_lock.acquire()
        oid += struct.pack(">i", ObjectId._inc)[1:4]
        ObjectId._inc = (ObjectId._inc + 1) % 0xFFFFFF
        ObjectId._inc_lock.release()

        self.__id = oid
where _machine_bytes is:
def _machine_bytes():
    """Get the machine portion of an ObjectId."""
    machine_hash = _md5func()
    if PY3:
        # gethostname() returns a unicode string in python 3.x
        # while update() requires a byte string.
        machine_hash.update(socket.gethostname().encode())
    else:
        # Calling encode() here will fail with non-ascii hostnames
        machine_hash.update(socket.gethostname())
    return machine_hash.digest()[0:3]
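So can the machine and pid parts collide? There is some probability, but it can be avoided entirely by checking the machine hashes in advance. (In the latest version, though, the middle 5 bytes have been replaced by a fully random value.)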
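Checking the machine hashes in advance could look like the sketch below. It recomputes the same 3-byte MD5 prefix of the hostname shown in the snippet above and reports any hostnames that share it. The function names and the hostnames in the usage example are hypothetical; a real check would read the actual fleet inventory.

```python
import hashlib
from collections import defaultdict


def machine_bytes(hostname):
    """First 3 bytes of the MD5 of the hostname, mirroring the snippet above."""
    return hashlib.md5(hostname.encode()).digest()[0:3]


def find_machine_hash_collisions(hostnames):
    """Group hostnames by 3-byte hash; return only groups with more than one member."""
    groups = defaultdict(list)
    for name in hostnames:
        groups[machine_bytes(name)].append(name)
    return {h.hex(): names for h, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    fleet = ["app-01", "app-02", "worker-01", "worker-02"]
    print(find_machine_hash_collisions(fleet) or "no collisions")
```

Even when two hosts share a machine hash, an actual id collision still requires the same pid value, the same second, and the same counter value.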
Who generates _id
First download the pymongo source. I downloaded 2.8 to match the version we run in production.
pip download pymongo==2.8
tar -xf
Open the source directory and start from collection, since the insert operation contains the relevant logic. A direct search finds:
class Collection(common.BaseObject):
    """A Mongo collection."""

    def insert(self, doc_or_docs, manipulate=True,
               safe=None, check_keys=True, continue_on_error=False, **kwargs):
        """Insert a document(s) into this collection.

        If `manipulate` is ``True``, the document(s) are manipulated using
        any :class:`~pymongo.son_manipulator.SONManipulator` instances
        that have been added to this :class:`~pymongo.database.Database`.
        In this case an ``"_id"`` will be added if the document(s) does
        not already contain one and the ``"_id"`` (or list of ``"_id"``
        values for more than one document) will be returned.
        ...
        """
        client = self.database.connection
        # Batch inserts require us to know the connected primary's
        # max_bson_size, max_message_size, and max_write_batch_size.
        # We have to be connected to the primary to know that.
        client._ensure_connected(True)

        docs = doc_or_docs
        return_one = False
        if isinstance(docs, dict):
            return_one = True
            docs = [docs]

        ids = []

        if manipulate:
            def gen():
                db = self.__database
                for doc in docs:
                    # Apply user-configured SON manipulators. This order of
                    # operations is required for backwards compatibility,
                    # see PYTHON-709.
                    doc = db._apply_incoming_manipulators(doc, self)
                    if '_id' not in doc:
                        doc['_id'] = ObjectId()

                    doc = db._apply_incoming_copying_manipulators(doc, self)
                    ids.append(doc['_id'])
                    yield doc
        else:
            def gen():
                for doc in docs:
                    ids.append(doc.get('_id'))
                    yield doc

        safe, options = self._get_write_mode(safe, **kwargs)

        if client.max_wire_version > 1 and safe:
            ...
So by default the ObjectID is generated by the client. Only when the user sets manipulate to False and the document has no _id is the _id left to the server.
go
var objectIDCounter = readRandomUint32()
var processUnique = processUniqueBytes()

// NewObjectIDFromTimestamp generates a new ObjectID based on the given time.
func NewObjectIDFromTimestamp(timestamp time.Time) ObjectID {
	var b [12]byte

	binary.BigEndian.PutUint32(b[0:4], uint32(timestamp.Unix()))
	copy(b[4:9], processUnique[:])
	putUint24(b[9:12], atomic.AddUint32(&objectIDCounter, 1))

	return b
}

func processUniqueBytes() [5]byte {
	var b [5]byte
	_, err := io.ReadFull(rand.Reader, b[:])
	if err != nil {
		panic(fmt.Errorf("cannot initialize objectid package with crypto.rand.Reader: %v", err))
	}

	return b
}

func readRandomUint32() uint32 {
	var b [4]byte
	_, err := io.ReadFull(rand.Reader, b[:])
	if err != nil {
		panic(fmt.Errorf("cannot initialize objectid package with crypto.rand.Reader: %v", err))
	}

	return (uint32(b[0]) << 0) | (uint32(b[1]) << 8) | (uint32(b[2]) << 16) | (uint32(b[3]) << 24)
}
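In the latest official MongoDB driver, the middle 5 bytes in the Go algorithm are now purely random. So how fast is the Go version?
A for loop that generates the upper limit of 16,777,216 ids takes only 264.349586ms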
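Conclusion
ObjectIDs are unique by design, across different collections and different databases alike. Of course, if your workload really can generate tens of millions of ids per second, disregard everything I said.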
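For anyone who still wants to confirm this on real data before merging collections, a cross-collection duplicate check is easy to sketch. The function below is a hypothetical helper, and its input shape (a mapping from collection name to an iterable of `_id` values) is an assumption made for this example; with a real driver, each iterable would come from a query that projects only the `_id` field.

```python
from collections import defaultdict


def find_cross_collection_duplicates(ids_by_collection):
    """Return {_id: [collection names]} for ids seen in more than one collection."""
    seen = defaultdict(set)
    for name, ids in ids_by_collection.items():
        for _id in ids:
            seen[_id].add(name)
    return {_id: sorted(names) for _id, names in seen.items() if len(names) > 1}


if __name__ == "__main__":
    # Made-up sample data standing in for three source collections.
    sample = {
        "orders_a": ["5f1d7c2e9b1e8a0001a1b2c3", "5f1d7c2e9b1e8a0001a1b2c4"],
        "orders_b": ["5f1d7c2f9b1e8a0001a1b2c5"],
        "orders_c": ["5f1d7c2e9b1e8a0001a1b2c3"],  # deliberately duplicated
    }
    print(find_cross_collection_duplicates(sample))
```

This version keeps every id in memory, so for very large data sets it would need to be run in batches or replaced with a sort-based comparison.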
