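Analysis and Full Annotation of the Peking University Tianwang Search Engine (TSE) [5]: Building the Inverted Index and an Overview of Its Files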
Sorry to keep everyone waiting. I have been busy with exams for a while, and they are finally over. Enough small talk, let's get started!
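TSE loads all of the crawled web pages into one large file and then builds the index over the data in that single file as a whole. The process consists of several steps.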
- 1. The document index (Doc.idx) keeps information about each document.
- It is a fixed width ISAM (Index sequential access mode) index, ordered by docID.
- The information stored in each entry includes a pointer into the repository,
- a document length, and a document checksum.
- //Doc.idx columns: document number, document length, checksum hash
- 0 0 bc9ce846d7987c4534f53d423380ba70
- 1 76760 4f47a3cad91f7d35f4bb6b2a638420e5
- 2 141624 d019433008538f65329ae8e39b86026c
- 3 142350 5705b8f58110f9ad61b1321c52605795
- //Doc.idx end
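A small sketch may help when reading the Doc.idx sample. The second column grows from row to row (0, 76760, 141624, 142350), which suggests it holds each document's starting byte offset in the raw file rather than a per-document length. That reading is an assumption, not something this listing states. Under it, the length of a document is the gap to the next entry. The names `DocEntry`, `parse_doc_idx` and `doc_lengths` are made up for illustration and are not taken from the TSE source.

```python
from typing import Dict, List, NamedTuple


class DocEntry(NamedTuple):
    doc_id: int
    offset: int    # assumed: byte offset of the document inside the raw file
    checksum: str  # 32-hex-digit hash, as shown in the sample


def parse_doc_idx(text: str) -> List[DocEntry]:
    """Parse whitespace-separated 'docID offset checksum' lines."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or incomplete lines
        entries.append(DocEntry(int(fields[0]), int(fields[1]), fields[2]))
    return entries


def doc_lengths(entries: List[DocEntry]) -> Dict[int, int]:
    """Length of every document except the last: gap to the next offset."""
    return {a.doc_id: b.offset - a.offset for a, b in zip(entries, entries[1:])}


# The four sample rows from the Doc.idx listing.
SAMPLE = """\
0 0 bc9ce846d7987c4534f53d423380ba70
1 76760 4f47a3cad91f7d35f4bb6b2a638420e5
2 141624 d019433008538f65329ae8e39b86026c
3 142350 5705b8f58110f9ad61b1321c52605795
"""

entries = parse_doc_idx(SAMPLE)
lengths = doc_lengths(entries)
print(lengths)
```

The last entry has no successor in the sample, so its length cannot be derived from these four rows alone.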
- The url index (url.idx) is used to convert URLs into docIDs.
- //url.idx
- 5c36868a9c5117eadbda747cbdb0725f 0
- 3272e136dd90263ee306a835c6c70d77 1
- 6b8601bb3bb9ab80f868d549b5c5a5f3 2
- 3f9eba99fa788954b5ff7f35a5db6e1f 3
- //url.idx end
- It is a list of URL checksums with their corresponding docIDs and is sorted by
- checksum. In order to find the docID of a particular URL, the URL's checksum
- is computed and a binary search is performed on the checksums file to find its
- docID.
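The lookup described above (hash the URL, then binary-search the sorted checksum list) can be sketched roughly as follows. This is only an illustration: it assumes the checksum is the MD5 hex digest of the URL string, which matches the 32-hex-digit values in the sample but is not confirmed here, and the URLs and function names are hypothetical.

```python
import bisect
import hashlib
from typing import Dict, List, Optional, Tuple


def url_checksum(url: str) -> str:
    # Assumption: the checksum is the MD5 hex digest of the URL bytes.
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def build_url_index(urls_by_doc_id: Dict[int, str]) -> List[Tuple[str, int]]:
    """Return (checksum, docID) pairs sorted by checksum."""
    return sorted((url_checksum(u), d) for d, u in urls_by_doc_id.items())


def find_doc_id(index: List[Tuple[str, int]], url: str) -> Optional[int]:
    """Binary-search the sorted index for the URL's checksum."""
    key = url_checksum(url)
    # The 1-tuple (key,) sorts before any (key, docID), so bisect_left
    # lands on the first entry with this checksum, if there is one.
    i = bisect.bisect_left(index, (key,))
    if i < len(index) and index[i][0] == key:
        return index[i][1]
    return None


# Hypothetical URLs, used only to exercise the lookup.
urls = {
    0: "http://example.edu.cn/index.aspx",
    1: "http://example.edu.cn/0102.html",
    2: "http://example.edu.cn/0103.html",
}
index = build_url_index(urls)
print(find_doc_id(index, urls[2]))
print(find_doc_id(index, "http://example.edu.cn/missing.html"))
```

Keeping the file sorted by checksum is what makes the lookup logarithmic in the number of URLs instead of a linear scan.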
- ./DocIndex
- got Doc.idx, Url.idx, DocId2Url.idx //see Doc.idx, Url.idx and DocId2Url.idx in the Data folder
- //DocId2Url.idx
- 0 http://*.*.edu.cn/index.aspx
- 1 http://*.*.edu.cn/showcontent1.jsp?NewsID=118
- 2 http://*.*.edu.cn/0102.html
- 3 http://*.*.edu.cn/0103.html
- //DocId2Url.idx end
- 2. sort Url.idx|uniq > Url.idx.sort_uniq //see Url.idx.sort_uniq in the Data folder
- //Url.idx.sort_uniq
- //sorted by hash value
- 000bfdfd8b2dedd926b58ba00d40986b 1111
- 000c7e34b653b5135a2361c6818e48dc 1831
- 0019d12f438eec910a06a606f570fde8 366
- 0033f7c005ec776f67f496cd8bc4ae0d 2103
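For readers less used to the shell, step 2 amounts to sorting the lines of Url.idx and dropping exact duplicates. A minimal Python equivalent is sketched below, using the four sample url.idx rows plus one repeated row added purely for illustration. The function name `sort_uniq` is made up. Python compares strings by code point, which for lowercase hex digits gives the same order as `sort` in the C locale.

```python
from typing import Iterable, List


def sort_uniq(lines: Iterable[str]) -> List[str]:
    """Sort lines and drop exact duplicates, like `sort | uniq`."""
    return sorted(set(lines))


# Sample url.idx rows; the last row repeats the first to show deduplication.
url_idx = [
    "5c36868a9c5117eadbda747cbdb0725f 0",
    "3272e136dd90263ee306a835c6c70d77 1",
    "6b8601bb3bb9ab80f868d549b5c5a5f3 2",
    "3f9eba99fa788954b5ff7f35a5db6e1f 3",
    "5c36868a9c5117eadbda747cbdb0725f 0",
]

result = sort_uniq(url_idx)
for line in result:
    print(line)
```

Because each line starts with the checksum, sorting whole lines also sorts by checksum, which is the order the binary search on the URL index relies on.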
- 3. Segment documents into terms (finding each document according to its URL)
- ./DocSegment Tianwang.raw.2559638448 //Tianwang.raw.2559638448 is the crawled file; each page includes its HTTP header
- got Tianwang.raw.2559638448.seg
- //Tianwang.raw.2559638448 is the raw file of crawled pages; inside it, the individual documents should be separated from one another by version,