NLP07: Text Similarity Computation Based on Latent Semantic Indexing

2023-08-31 01:11:39

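1. Overview of Latent Semantic Indexing (LSI)

Latent Semantic Indexing (LSI) is also called Latent Semantic Analysis (LSA) in some articles. The two names refer to the same technique; this post uses LSI throughout. It is a simple and practical topic model that derives the topics of a text collection from a singular value decomposition (SVD).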

Here is a brief recap of SVD: an $m \times n$ matrix $A$ can be decomposed into the following three matrices:
$A_{m \times n} = U_{m \times m}\Sigma_{m \times n} V^T_{n \times n}$
To reduce the dimensionality to $k$, the decomposition can be written approximately as:
$A_{m \times n} \approx U_{m \times k}\Sigma_{k \times k} V^T_{k \times n}$
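As a quick check of the truncated form, the sketch below applies NumPy's `numpy.linalg.svd` to a small matrix and compares the rank-$k$ reconstruction with the original. The matrix values, the helper name `truncated_svd`, and the choice $k = 2$ are assumptions made up for illustration.

```python
import numpy as np

# Toy 4 x 5 matrix; the values are made up for illustration.
A = np.array([
    [1.0, 0.0, 2.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 3.0, 0.0],
    [2.0, 0.0, 4.0, 0.0, 2.0],
    [0.0, 2.0, 0.0, 1.0, 1.0],
])


def truncated_svd(A, k):
    """Return U_k (m x k), the top-k singular values, and V_k^T (k x n)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


k = 2  # assumed target rank
U_k, s_k, Vt_k = truncated_svd(A, k)
A_k = U_k @ np.diag(s_k) @ Vt_k  # rank-k approximation of A

# The Frobenius error equals the norm of the discarded singular values.
error = np.linalg.norm(A - A_k)
print(U_k.shape, s_k.shape, Vt_k.shape)
print("reconstruction error:", error)
```

A larger $k$ keeps more singular values and shrinks the error; when $k$ equals the rank of $A$ the reconstruction is exact up to floating-point error.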
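Applying this decomposition to topic modeling, SVD can be read as follows: the input is $m$ texts over a vocabulary of $n$ words, and $A_{ij}$ is the feature value of the $j$-th word in the $i$-th text, most commonly the normalized TF-IDF value after preprocessing. $k$ is the assumed number of topics, generally smaller than the number of texts. After the decomposition, $U_{il}$ is the relevance of the $i$-th text to the $l$-th topic, $V_{jm}$ is the relevance of the $j$-th word to the $m$-th word sense, and $\Sigma_{lm}$ is the relevance of the $l$-th topic to the $m$-th word sense.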

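The same decomposition can also be read the other way around: the input is $m$ words and $n$ texts, and $A_{ij}$ is the feature value of the $i$-th word in the $j$-th text, again most commonly the normalized TF-IDF value after preprocessing. $k$ is the assumed number of topics, generally smaller than the number of texts. After the decomposition, $U_{il}$ is the relevance of the $i$-th word to the $l$-th word sense, $V_{jm}$ is the relevance of the $j$-th text to the $m$-th topic, and $\Sigma_{lm}$ is the relevance of the $l$-th word sense to the $m$-th topic.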

In this way, a single SVD yields the relevance between documents and topics, between words and word senses, and between word senses and topics.
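To make this reading concrete, here is a minimal NumPy sketch using the documents-as-rows layout. The vocabulary, the counts, and $k = 2$ are all made up for illustration, and raw counts stand in for the TF-IDF values mentioned above.

```python
import numpy as np

# Rows are documents, columns are words; made-up counts for illustration.
vocab = ["cat", "dog", "pet", "stock", "market"]
A = np.array([
    [2.0, 1.0, 1.0, 0.0, 0.0],  # doc 0: about pets
    [1.0, 2.0, 1.0, 0.0, 0.0],  # doc 1: about pets
    [0.0, 0.0, 0.0, 2.0, 1.0],  # doc 2: about finance
    [0.0, 0.0, 0.0, 1.0, 2.0],  # doc 3: about finance
])

k = 2  # assumed number of topics
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_topic = U[:, :k] * s[:k]  # document coordinates in topic space (m x k)
word_topic = Vt[:k, :].T      # word loadings on each topic (n x k)

for i, row in enumerate(doc_topic):
    print("doc", i, np.round(row, 3))
for word, row in zip(vocab, word_topic):
    print(word, np.round(row, 3))
```

In this toy matrix the two pet documents load on one topic and the two finance documents on the other. The signs of the coordinates are arbitrary, so only the magnitudes and the grouping are meaningful.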

2. Similarity Computation

The text-topic matrix obtained from LSI can be used to compute text similarity, usually with cosine similarity.
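Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction. The following is a minimal pure-Python sketch; the function name `cosine_similarity`, the zero-vector convention, and the three topic vectors are assumptions for illustration and are not taken from gensim.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Made-up topic vectors for three documents.
doc_a = [0.9, 0.1, 0.0]
doc_b = [1.8, 0.2, 0.0]  # same direction as doc_a, twice the length
doc_c = [0.0, 0.1, 0.9]

print(cosine_similarity(doc_a, doc_b))  # close to 1: same direction
print(cosine_similarity(doc_a, doc_c))  # close to 0: nearly orthogonal
```

Because the measure ignores vector length, a long document and a short document on the same topics can still score close to 1.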

from gensim.test.utils import common_dictionary, common_corpus
from gensim.models import LsiModel
from gensim import similarities

if __name__ == '__main__':
    for k, v in common_dictionary.items():
        print(k, v)
    print(len(common_dictionary))  # 12 words
    print(len(common_corpus))  # 9 documents
    model = LsiModel(common_corpus, num_topics=3, id2word=common_dictionary)  # 3 topics
    vectorized_corpus = model[common_corpus]  # documents projected into topic space (9 documents x 3 topics)
    print(model.projection.u.shape)  # left singular vectors (word-topic), shape (12, 3)
    print(model.projection.s.shape)  # singular values, shape (3,)
    for x in vectorized_corpus:
        print(x)
    index = similarities.MatrixSimilarity(vectorized_corpus)
    print("==" * 30)
    print(vectorized_corpus[0])
    print(list(enumerate(index[vectorized_corpus[0]])))  # similarity of every text to the first text

3. Hands-On Example

import re
from collections import defaultdict
import jieba.posseg
import numpy as np
import codecs
import os
import pickle
from gensim import corpora,models,similarities

def tokenizer(filename, stop_words):
    """
    Read the file and tokenize each line.
    :param filename: file name
    :param stop_words: list, stop words
    :return: [[word1, word2]]
    """
    texts = []
    with open(filename, "r", encoding="utf-8") as f:
        for line in f.readlines():
            texts.append([token for token, _ in jieba.posseg.cut(line.rstrip()) if token not in stop_words])
    # Remove words that appear only once in the whole collection
    frequency = defaultdict(int)
    for text in texts:
        for token in text:
            frequency[token] += 1
    texts = [[token for token in text if frequency[token] > 1] for text in texts]
    return texts

stop_words_filepath = "/content/drive/My Drive/data/qa/data/stop_words.txt"
knowledge_texts_filepath = "/content/drive/My Drive/data/qa/data/knowledge.txt"
stop_words = codecs.open(stop_words_filepath, "r", encoding="utf-8").readlines()
stop_words = [w.strip() for w in stop_words]
texts = tokenizer(knowledge_texts_filepath, stop_words)

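def topk_sim_ix(texts, stops, k):
    """
    :param texts: tokenized training samples
    :param stops: stop words (not used here; the texts are already filtered)
    :param k: number of most similar texts to return for each text
    :return: list
    """
    dictionary = corpora.Dictionary(texts)  # build the dictionary
    corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors
    # Build the LSI model
    lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=10)  # latent semantic indexing with 10 topics
    index = similarities.MatrixSimilarity(lsi[corpus], num_best=k)  # similarity index that keeps the top k matches
    vec_lsi = lsi[corpus]
    return index[vec_lsi]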

index=topk_sim_ix(texts,stop_words,5)

There are 11740 documents in total, and the 5 most similar documents are selected for each one.

len(index),len(texts),len(index[0])

For the first document, apart from the document itself, the most similar ones are documents 123, 39, 3985 and 11176.

for index_text in index[0]:
    print(texts[index_text[0]], index_text[1])
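The top-5 list returned for a document normally includes the document itself, since its similarity to itself is the highest. If only the other documents are wanted, one option is to filter the self match out before taking the top k. The sketch below is pure Python with a hypothetical helper `top_k_excluding_self` and a made-up similarity row; it is not part of gensim.

```python
import heapq


def top_k_excluding_self(similarities_row, self_index, k):
    """Return the k (index, score) pairs with the highest score, skipping self_index."""
    candidates = [(i, s) for i, s in enumerate(similarities_row) if i != self_index]
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


# Made-up similarities of document 0 to every document in a 5-document corpus.
row = [1.0, 0.32, 0.87, 0.05, 0.91]
print(top_k_excluding_self(row, self_index=0, k=2))  # [(4, 0.91), (2, 0.87)]
```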

Reference: https://www.cnblogs.com/pinard/p/6805861.html

Code: https://github.com/chongzicbo/nlp-ml-dl-notes/blob/master/code/nlp_tutorial/NLP07%EF%BC%9A%E5%9F%BA%E4%BA%8ELSI%E7%9A%84%E6%96%87%E6%9C%AC%E7%9B%B8%E4%BC%BC%E5%BA%A6%E8%AE%A1%E7%AE%97.ipynb
