Extracting Word Features with GloVe
Table of Contents
- Introduction
- Usage
- Using torchtext

Introduction
After Tomas Mikolov et al. proposed word2vec in 2013, Jeffrey Pennington, Richard Socher, and Christopher D. Manning proposed the GloVe algorithm in 2014; GloVe is short for Global Vectors. Traditionally, word embeddings have been built in two main ways: matrix factorization methods (e.g., LSA) and shallow window-based methods (e.g., word2vec). Each has its own strengths and weaknesses, and GloVe combines the advantages of both. The experiments in the paper show GloVe outperforming word2vec and related methods.
GloVe is a regression model built on global word co-occurrence statistics. It is not a neural network; instead, it fits word vectors by weighted least squares.
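For reference, the weighted least-squares objective from the paper cited below is

    J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where V is the vocabulary size, X_{ij} counts how often word j appears in the context of word i, w and \tilde{w} are the word and context vectors, b and \tilde{b} are biases, and f(x) = (x / x_max)^\alpha for x < x_max (else 1) down-weights very frequent co-occurrences; the paper uses x_max = 100 and \alpha = 3/4.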
GloVe paper: Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, 1532–1543.
GloVe official website: https://nlp.stanford.edu/projects/glove/
Code: from the paper 2018-MM Cross-modal Moment Localization in Videos, https://acmmm18.wixsite.com/role
It was first used in the code of 2018-ICCV Grounding Referring Expressions in Images by Variational Context, https://github.com/yuleiniu/vc/
Usage
If you use vocabulary_72700.txt and embed_matrix.npy directly, note that words must be lowercased before encoding, otherwise the lookup may fail.
Functionality: converts multiple sentences, list(str), into 300-d float32 word embeddings; supports 72,700 English words (not counting pad, go, eos, and unk).
if T >= 0: tensor of shape [num sentences, T, 300]
else: list(tensor), each of shape [num words in that sentence, 300]
Project files:
Link: https://pan.baidu.com/s/1bmcrRCeQy7vNbxW-f2E1sg?pwd=7dl9
Extraction code: 7dl9
"""
vocabulary_72700.txt 72704个单词,开始4个为 padding start of sequence end of sequence unknown words
""" import io
import numpy as np
import reimport torchdef glove_preprocess(sentence, vocab_dict, T=-1, padding_at_first=False):"""分词 转小写 去除标点?.(,可以被编码) 编码(转int) padding至指定长度T:param sentence: 句子 str:param vocab_dict: 编码字典 str:int:param T: 指定长度,-1表示不padding:param padding_at_first: 当T>=0时,padding_at_first=False表示在后padding,True表示在前padding:return: list(int)if T < 0: [sentence单词数] else: [T]"""# if bytes then to strif isinstance(sentence, bytes):sentence = sentence.decode()# 分词 转小写 去除标点?.(,可以被编码)words = re.compile(r'(\W+)').split(sentence.strip())words = [w.lower() for w in words if len(w.strip()) > 0]if len(words) > 0 and (words[-1] == '.' or words[-1] == '?'):words = words[:-1]# 编码(str->int)vocab_indices = [(vocab_dict[w] if w in vocab_dict else vocab_dict['' ]) for w in words]if T >= 0:if len(vocab_indices) > T:vocab_indices = vocab_indices[:T]elif len(vocab_indices) < T:if padding_at_first:vocab_indices = [vocab_dict['' ]] * (T - len(vocab_indices)) + vocab_indiceselse:vocab_indices = vocab_indices + [vocab_dict['' ]] * (T - len(vocab_indices))return vocab_indicesdef glove(sentences, T=-1, padding_at_first=False):"""将 多个句子list(str) 转为 300d float32词嵌入:param sentences: str 或 list(str):param T: 指定长度,-1表示不padding:param padding_at_first: 当T>=0时,padding_at_first=False表示在后padding,True表示在前padding:return: dtype=float32if T >= 0:tensor shape[句子数,T, 300]else:list(tensor) shape[句子数,该句单词数, 300]"""if isinstance(sentences, str):sentences = [sentences]# 加载编码字典 str -> intwith io.open('vocabulary_72700.txt', encoding='utf-8') as f:words = [w.strip() for w in f.readlines()]vocab_dict = {words[n]: n for n in range(len(words))}# 加载embedding表 int-> tensorwordembed_params = 'embed_matrix.npy'embedding_mat = np.load(wordembed_params)result = []for sentence in sentences:sent_emb = []vocab_indices = glove_preprocess(sentence, vocab_dict, T, padding_at_first)for item in vocab_indices:sent_emb.append(embedding_mat[item])result.append(torch.tensor(np.array(sent_emb)))if T >= 0:return torch.stack(result)else:return resultif __name__ == '__main__':test = ['Person sets large mug on counter.', 'The person disposes of the egg shell into the wastebin.']with_T = glove(test, 10, padding_at_first=True)without_T = glove(test)print(with_T)print(without_T)
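For the two test sentences above, with_T should be a tensor of shape [2, 10, 300] (the first sentence has 6 tokens once the trailing period is dropped, so it is padded at the front to length 10), while without_T should be a list of two tensors of shapes [6, 300] and [10, 300].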
Using torchtext
import torch
import torchtext
from torch import nn

# glove.6B.300d contains 400000 words; append an all-zero 300-d vector at
# index 400000 for unknown words, giving 400001 rows in total.
# The cache argument is optional. Available pretrained vectors include:
# charngram.100d fasttext.en.300d fasttext.simple.300d glove.42B.300d glove.840B.300d
# glove.twitter.27B.25d glove.twitter.27B.50d glove.twitter.27B.100d glove.twitter.27B.200d
# glove.6B.50d glove.6B.100d glove.6B.200d glove.6B.300d
vocab = torchtext.vocab.pretrained_aliases["glove.6B.300d"](cache='../.vector_cache')
vocab.itos.extend(['<unk>'])
vocab.stoi['<unk>'] = vocab.vectors.shape[0]
vocab.vectors = torch.cat([vocab.vectors, torch.zeros(1, vocab.dim)], dim=0)
word_embedding = nn.Embedding.from_pretrained(vocab.vectors)


def fun(sentence):
    # unknown words fall back to index 400000, the appended all-zero vector
    word_idxs = torch.tensor([vocab.stoi.get(w.lower(), 400000) for w in sentence.split()],
                             dtype=torch.long)
    word_vectors = word_embedding(word_idxs)
    return word_idxs, word_vectors


sentence = 'I am a student.'
word_idxs, word_vectors = fun(sentence)
print(word_idxs)
print(word_vectors)
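One caveat: the plain sentence.split() used here keeps punctuation attached to words, so the token 'student.' is most likely not in vocab.stoi and falls back to index 400000, the all-zero unknown vector. If that matters, strip punctuation or use a proper tokenizer before the lookup, as the regex-based preprocessing in the previous section does.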
