Email Classification with Word Embeddings and a Neural Network

1 Problem Description

Problem: email classification (spam detection)

Task: classify each message into one of two classes (spam or ham)

数据集:https://www.kaggle.com/uciml/sms-spam-collection-dataset#spam.csv

2 Data Processing

import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from textblob import Word
import re
from sklearn.model_selection import train_test_split

Reading the data

# read the data
data = pd.read_csv('spam.csv', encoding = "ISO-8859-1")
data.columns
Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')
# view the first 5 rows
data.head()
     v1                                                 v2 Unnamed: 2 Unnamed: 3 Unnamed: 4
0   ham  Go until jurong point, crazy.. Available only ...        NaN        NaN        NaN
1   ham                      Ok lar... Joking wif u oni...        NaN        NaN        NaN
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN        NaN        NaN
3   ham  U dun say so early hor... U c already then say...        NaN        NaN        NaN
4   ham  Nah I don't think he goes to usf, he lives aro...        NaN        NaN        NaN

Dropping the useless columns

# drop the last 3 columns, which contain no useful data
data = data[['v1', 'v2']]
data.head()
     v1                                                 v2
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...

Renaming the columns

# rename the columns
data = data.rename(columns={"v1":"label","v2":"text"})
data.head()
  label                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...

Removing punctuation and extra spaces

# remove punctuation
data['text'] = data['text'].apply(lambda x: re.sub('[!@#$:).;,?&]', ' ', x.lower()))
# collapse runs of two or more spaces into one
data['text'] = data['text'].apply(lambda x: re.sub(' +', ' ', x))
data['text'][0]
'go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat '

Converting words to lowercase

# convert words to lowercase
data['text'] = data['text'].apply(lambda x: " ".join(x.lower() for x in x.split()))
# or equivalently
# data['text'] = data['text'].apply(lambda x: x.lower())
data['text'][0]
'go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat'

Removing stop words

# remove stop words such as "a", "an", "the", and common prepositions, conjunctions, and pronouns
stop = stopwords.words('english')
data['text'] = data['text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
data['text'][0]
'go jurong point crazy available bugis n great world la e buffet cine got amore wat'

Stemming and lemmatization

# stem and lemmatize each token, reducing English words towards their base forms
st = PorterStemmer()
data['text'] = data['text'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
data['text'] = data['text'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
data['text'][0]
'go jurong point crazi avail bugi n great world la e buffet cine got amor wat'
data.head()
  label                                               text
0   ham  go jurong point crazi avail bugi n great world...
1   ham                              ok lar joke wif u oni
2  spam  free entri 2 wkli comp win fa cup final tkt 21...
3   ham                u dun say earli hor u c alreadi say
4   ham              nah think goe usf live around though
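
The stemmed tokens above ('crazi', 'avail', 'bugi') are not dictionary words: the Porter stemmer strips suffixes by rule, while lemmatization looks words up and returns valid base forms. A minimal sketch comparing the two on a few tokens from this dataset (expected outputs are noted in the comments and may vary slightly across NLTK/TextBlob versions):

from nltk.stem import PorterStemmer
from textblob import Word

st = PorterStemmer()
# stemming: rule-based suffix stripping, may produce non-words
print(st.stem('crazy'), st.stem('available'), st.stem('entry'))  # crazi avail entri
# lemmatization: dictionary-based, returns valid words (defaults to noun forms)
print(Word('lives').lemmatize(), Word('goes').lemmatize('v'))    # life go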

3 Feature Extraction

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
Using TensorFlow backend.

Splitting into training and test sets

# split the data into training and test sets with an 80:20 ratio
train, test = train_test_split(data, test_size=0.2)

Setting parameters

# maximum length of each sequence: longer sequences are truncated, shorter ones are zero-padded
max_sequence_length = 300
# keep only the 20000 most frequent words
num_words = 20000
# dimension of the word embeddings
embedding_dim = 100

Building the tokenizer

# build a tokenizer that keeps only the most frequent words
tokenizer = Tokenizer(num_words=num_words)
tokenizer.fit_on_texts(train.text)
train_sequences = tokenizer.texts_to_sequences(train.text)
test_sequences = tokenizer.texts_to_sequences(test.text)
# dictionary containing words and their index
word_index = tokenizer.word_index
# print(tokenizer.word_index)
# total words in the corpus
print('Found %s unique tokens.' % len(word_index))
# pad/truncate the training sequences to max_sequence_length
train_x = pad_sequences(train_sequences, maxlen=max_sequence_length)
# pad/truncate the test sequences to max_sequence_length
test_x = pad_sequences(test_sequences, maxlen=max_sequence_length)
print(train_x.shape)
print(test_x.shape)
Found 6702 unique tokens.
(4457, 300)
(1115, 300)
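
As a quick sanity check of what the tokenizer and pad_sequences produce (a minimal sketch using the objects defined above; the actual index values depend on the fitted vocabulary):

# map one preprocessed message to its sequence of word indices
sample_seq = tokenizer.texts_to_sequences(['free entri 2 wkli comp win fa cup final'])
print(sample_seq)            # nested list of integer indices, one per in-vocabulary word
# left-pad the sequence with zeros up to max_sequence_length
sample_pad = pad_sequences(sample_seq, maxlen=max_sequence_length)
print(sample_pad.shape)      # (1, 300)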

Vectorizing the labels

# vectorize the labels
# ham -> [1, 0]; spam -> [0, 1]
import numpy as np

def lable_vectorize(labels):
    label_vec = np.zeros([len(labels), 2])
    for i, label in enumerate(labels):
        if str(label) == 'ham':
            label_vec[i][0] = 1
        else:
            label_vec[i][1] = 1
    return label_vec

train_y = lable_vectorize(train['label'])
test_y = lable_vectorize(test['label'])

# or equivalently, with sklearn and keras utilities
from sklearn.preprocessing import LabelEncoder
from keras.utils import to_categorical

# convert the character labels to a numeric array; assigns an integer level to each unique label
train_labels = train['label']
test_labels = test['label']
le = LabelEncoder()
le.fit(train_labels)
train_labels = le.transform(train_labels)
test_labels = le.transform(test_labels)
# convert the integer labels to one-hot encoded arrays
labels_train = to_categorical(np.asarray(train_labels))
labels_test = to_categorical(np.asarray(test_labels))
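
Since LabelEncoder assigns integer levels in sorted order ('ham' -> 0, 'spam' -> 1), the two approaches above should produce identical one-hot matrices. A quick check (a sketch, assuming the variables defined above):

import numpy as np

# both encodings map ham to [1, 0] and spam to [0, 1]
print(np.array_equal(train_y, labels_train))   # expected: True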

4 Building and Training the Model

# import libraries
import sys, os, re, csv, codecs, numpy as np, pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D, Conv1D, SimpleRNN
from keras.models import Model
from keras.models import Sequential
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.layers import Dense, Input, Flatten, Dropout, BatchNormalization
from keras.layers import Conv1D, MaxPooling1D, Embedding

model = Sequential()
# embedding layer: maps each of the top num_words word indices to a 100-dimensional vector
model.add(Embedding(num_words, embedding_dim, input_length=max_sequence_length))
model.add(Dropout(0.5))
model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Conv1D(128, 5, activation='relu'))
model.add(MaxPooling1D(5))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(128, activation='relu'))
# 2-way softmax output: [ham, spam]
model.add(Dense(2, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
model.fit(train_x, train_y, batch_size=64, epochs=5, validation_split=0.2)
Train on 3565 samples, validate on 892 samples
Epoch 1/5
3565/3565 [==============================] - 25s 7ms/step - loss: 0.3923 - acc: 0.8480 - val_loss: 0.1514 - val_acc: 0.9451
Epoch 2/5
3565/3565 [==============================] - 23s 7ms/step - loss: 0.1729 - acc: 0.9372 - val_loss: 0.0789 - val_acc: 0.9753
Epoch 3/5
3565/3565 [==============================] - 25s 7ms/step - loss: 0.0940 - acc: 0.9731 - val_loss: 0.2079 - val_acc: 0.9787
Epoch 4/5
3565/3565 [==============================] - 23s 7ms/step - loss: 0.0590 - acc: 0.9857 - val_loss: 0.3246 - val_acc: 0.9843
Epoch 5/5
3565/3565 [==============================] - 23s 7ms/step - loss: 0.0493 - acc: 0.9882 - val_loss: 0.3150 - val_acc: 0.9877
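
Note that val_loss starts rising after the second epoch even though val_acc keeps improving, which suggests overfitting. One optional safeguard (a sketch, not part of the original pipeline) is to stop training early based on the validation loss:

from keras.callbacks import EarlyStopping

# stop when the validation loss has not improved for 2 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=2)
model.fit(train_x, train_y, batch_size=64, epochs=20,
          validation_split=0.2, callbacks=[early_stop])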

5 Model Evaluation

# [0.07058866604882806, 0.9874439467229116]
model.evaluate(test_x, test_y)
1115/1115 [==============================] - 2s 2ms/step
[0.32723046118903054, 0.97847533632287]
# prediction on test data
predicted=model.predict(test_x)
predicted
array([[0.71038646, 0.28961352],
       [0.71285075, 0.28714925],
       [0.7101978 , 0.28980213],
       ...,
       [0.7092874 , 0.29071262],
       [0.70976096, 0.290239  ],
       [0.70463425, 0.29536578]], dtype=float32)
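
Each row of predicted is a probability pair in the same [ham, spam] order used for the label vectors, so class labels can be recovered with an argmax (a minimal sketch under that assumption):

import numpy as np

# column 0 is ham, column 1 is spam
predicted_classes = np.argmax(predicted, axis=1)
predicted_labels = np.where(predicted_classes == 0, 'ham', 'spam')
print(predicted_labels[:10])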
# model evaluation
import sklearn
from sklearn.metrics import precision_recall_fscore_support as score
precision, recall, fscore, support = score(test_y,predicted.round())
print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print("############################")
print(sklearn.metrics.classification_report(test_y,predicted.round()))
precision: [0.97961264 0.97014925]
recall: [0.99585492 0.86666667]
fscore: [0.98766701 0.91549296]
support: [965 150]
############################
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       965
           1       0.97      0.87      0.92       150

avg / total       0.98      0.98      0.98      1115
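
For a more direct view of the error types, a confusion matrix can be computed from the class indices (a sketch, assuming test_y and predicted as defined above):

from sklearn.metrics import confusion_matrix
import numpy as np

# rows: true class (0 = ham, 1 = spam); columns: predicted class
cm = confusion_matrix(np.argmax(test_y, axis=1), np.argmax(predicted, axis=1))
print(cm)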


Article source: https://foochane.cn/article/2019052202.html

