Coggle 30 Days of ML (July 2023) - Task 6
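#### Task 6: Learn to Train Word2Vec Word Vectors
- Description: In this task, you will learn how to train FastText and Word2Vec word-vector models, which capture semantic information in text.
- Practice steps:
- Prepare a large text corpus.
- Use the Word2Vec class and set its parameters (such as vector dimensionality, window size, and number of training epochs) to build the word-vector model.
- Use the Word2Vec class to train the word-vector model.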
from gensim.models.word2vec import Word2Vec
import pandas as pd
train_data = pd.read_csv('./ChatGPT生成文本检测器公开数据-更新/train.csv')
test_data = pd.read_csv('./ChatGPT生成文本检测器公开数据-更新/test.csv')
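# Preprocess the content: drop the first and last characters, split on spaces, and discard empty tokens and newlines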
train_data['content'] = train_data['content'].apply(lambda x: x[1:-1].strip().replace('\n', ' \n '))
test_data['content'] = test_data['content'].apply(lambda x: x[1:-1].strip().replace('\n', ' \n '))
train_data['content'] = train_data['content'].apply(lambda x: x.split(' '))
test_data['content'] = test_data['content'].apply(lambda x: x.split(' '))
train_data['content'] = train_data['content'].apply(lambda x: [i for i in x if i != '' and i != '\n'])
test_data['content'] = test_data['content'].apply(lambda x: [i for i in x if i != '' and i != '\n'])
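# Note: only the first 5 documents of each split are used; drop the [:5] slices to train on the full corpus
train_data_list = train_data['content'].tolist()[:5]
test_data_list = test_data['content'].tolist()[:5]
all_data_list = train_data_list + test_data_list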
# print(train_data_list)
# print(test_data_list)
# print(len(all_data_list))
# print(list(train_data['content']))
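# Passing sentences to the constructor builds the vocabulary and runs an initial round of training
model = Word2Vec(sentences=all_data_list, vector_size=100, window=5, min_count=1, workers=4)
# Show the 10 tokens whose vectors are closest to that of the token '0'
print(model.wv.most_similar('0', topn=10))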
# Continue training the word vectors for 10 more epochs
model.train(all_data_list, total_examples=len(all_data_list), epochs=10)
# Save the model
model.save('./word2vec.model')
Output (the 10 tokens most similar to '0', with their cosine similarities):
[('123', 0.31092387437820435), ('2177', 0.3047351539134979), ('3834', 0.3037860691547394), ('139', 0.28955206274986267), ('290', 0.28250473737716675), ('263', 0.27930933237075806), ('2214', 0.2713657319545746), ('1527', 0.27030256390571594), ('1456', 0.26746299862861633), ('2031', 0.2601543366909027)]
