Kaggle Home Depot Product Search Relevance Prediction

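# Home Depot Product Relevance Prediction

Kaggle competition: https://www.kaggle.com/c/home-depot-product-search-relevance

Home Depot is a US retailer of home-improvement and building products. Users type keywords into the search box and get back related products and services. For example, searching for floor returns flooring in various materials, floor-cleaning products, floor-installation services, and so on. The goal of the Kaggle competition is to design a model that better matches users' search terms, so that more relevant products and services are returned.

## Imports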

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
df_train = pd.read_csv('train.csv',encoding='ISO-8859-1')
df_test = pd.read_csv('test.csv',encoding='ISO-8859-1')
# Besides the train and test data, there is also a product description file
df_desc = pd.read_csv('product_descriptions.csv')
# Take a look at each dataset
df_train.head(3)
| | id | product_uid | product_title | search_term | relevance |
|---|---|---|---|---|---|
| 0 | 2 | 100001 | Simpson Strong-Tie 12-Gauge Angle | angle bracket | 3.0 |
| 1 | 3 | 100001 | Simpson Strong-Tie 12-Gauge Angle | l bracket | 2.5 |
| 2 | 9 | 100002 | BEHR Premium Textured DeckOver 1-gal. #SC-141 … | deck over | 3.0 |
df_test.head(3)
| | id | product_uid | product_title | search_term |
|---|---|---|---|---|
| 0 | 1 | 100001 | Simpson Strong-Tie 12-Gauge Angle | 90 degree bracket |
| 1 | 4 | 100001 | Simpson Strong-Tie 12-Gauge Angle | metal l brackets |
| 2 | 5 | 100001 | Simpson Strong-Tie 12-Gauge Angle | simpson sku able |
df_desc.head(3)
| | product_uid | product_description |
|---|---|---|
| 0 | 100001 | Not only do angles make joints stronger, they … |
| 1 | 100002 | BEHR Premium Textured DECKOVER is an innovativ… |
| 2 | 100003 | Classic architecture meets contemporary design… |

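In the training set, `relevance` is the target we need to predict on the test set. It ranges from 1 to 3, where 3 is the most relevant and 1 the least. `search_term` is the search query, so each row records how relevant a product is to a particular query. The product description file holds the description for each `product_uid`.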

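We concatenate the train and test data so they can be processed together. Because `product_uid` is a column shared with the description data, the descriptions can be merged in as well.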

df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True, sort=True)
# Neither table's index carries any meaning, so ignore_index=True discards it; axis=0 stacks the rows.
# The two tables do not have identical columns (test has no relevance column), so older pandas versions
# sorted the columns and emitted a FutureWarning here. Passing sort=True makes that behavior explicit,
# which is why the columns below appear in alphabetical order.
df_all.head(3)
| | id | product_title | product_uid | relevance | search_term |
|---|---|---|---|---|---|
| 0 | 2 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 3.0 | angle bracket |
| 1 | 3 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 2.5 | l bracket |
| 2 | 9 | BEHR Premium Textured DeckOver 1-gal. #SC-141 … | 100002 | 3.0 | deck over |
df_all.shape
(240760, 5)
df_all = df_all.merge(df_desc,on='product_uid',how='left')
df_all.head(3)
| | id | product_title | product_uid | relevance | search_term | product_description |
|---|---|---|---|---|---|---|
| 0 | 2 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 3.0 | angle bracket | Not only do angles make joints stronger, they … |
| 1 | 3 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 2.5 | l bracket | Not only do angles make joints stronger, they … |
| 2 | 9 | BEHR Premium Textured DeckOver 1-gal. #SC-141 … | 100002 | 3.0 | deck over | BEHR Premium Textured DECKOVER is an innovativ… |
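
As a sanity check on the concatenate-then-merge steps, here is a minimal sketch on made-up toy frames (the rows and the `toy_*` names are illustrative, not competition data). It shows that rows coming from the test side get `NaN` for `relevance` after `pd.concat`, and that a left merge on `product_uid` keeps the row count unchanged as long as each `product_uid` appears only once in the description table.

```python
import pandas as pd

# Toy stand-ins for train.csv, test.csv and product_descriptions.csv (made-up rows).
toy_train = pd.DataFrame({
    "id": [2, 3],
    "product_uid": [100001, 100001],
    "search_term": ["angle bracket", "l bracket"],
    "relevance": [3.0, 2.5],
})
toy_test = pd.DataFrame({
    "id": [1],
    "product_uid": [100002],
    "search_term": ["deck over"],
})
toy_desc = pd.DataFrame({
    "product_uid": [100001, 100002],
    "product_description": ["angle description", "deck coating description"],
})

# Stack train on top of test; test rows have no relevance, so it becomes NaN.
toy_all = pd.concat((toy_train, toy_test), axis=0, ignore_index=True)

# Left merge: every row of toy_all is kept and gains its product description.
toy_merged = toy_all.merge(toy_desc, on="product_uid", how="left")

print(toy_merged)
print("rows before/after merge:", len(toy_all), len(toy_merged))
print("rows with missing relevance:", toy_merged["relevance"].isna().sum())

# Train rows come first, so the original split can be recovered by position.
n_train = len(toy_train)
toy_train_again = toy_merged.iloc[:n_train]
toy_test_again = toy_merged.iloc[n_train:]
```

Keeping the number of training rows around, as `n_train` does here, is one simple way to separate the combined frame again once feature engineering is done.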
## Text preprocessing
from nltk.stem.snowball import SnowballStemmer
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
### Stemming

Because the Home Depot task is search matching, consistent text matters a great deal. We stem the text features so that each search term has only one surface form in the text.
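
To make the idea concrete, here is a small sketch using NLTK's `SnowballStemmer`. The `stem_text` helper, the whitespace tokenization, and the sample title are assumptions chosen for illustration. Different surface forms of a word are reduced to a shared stem, so a query and a product title can match even when only one of them uses the plural.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def stem_text(text):
    """Hypothetical helper: lowercase, split on whitespace, stem each token."""
    return " ".join(stemmer.stem(word) for word in text.lower().split())

# Singular and plural forms collapse to the same stem.
print(stemmer.stem("bracket"), stemmer.stem("brackets"))

# After stemming, a query and a (made-up) title can be compared token by token.
query = stem_text("metal l brackets")
title = stem_text("Steel Angle Bracket")
shared = set(query.split()) & set(title.split())
print(shared)
```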
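# Remove stop words
stop = stopwords.words('english')

# Remove tokens that contain digits
import re
def hasnumber(input_str):
    return bool(re.search(r'\d', input_str))

# Combine both checks: keep a token only if it is not a stop word and contains no digit
def check(string):
    if string in stop:
        return False
    elif hasnumber(string):
        return False
    else:
        return True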
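
A sketch of how `check` might be applied to a search term. To stay runnable without downloading the NLTK stop word corpus, it uses a small hand-picked stop list as a stand-in for `stopwords.words('english')`. The `clean_tokens` helper and the whitespace tokenization are assumptions for illustration, not necessarily what the original notebook does next.

```python
import re

# Stand-in for stopwords.words('english'); the real list is much longer.
stop = {"a", "an", "the", "for", "in", "of", "and", "to"}

def hasnumber(input_str):
    """Return True if the string contains at least one digit."""
    return bool(re.search(r"\d", input_str))

def check(string):
    """Keep a token only if it is not a stop word and contains no digit."""
    return string not in stop and not hasnumber(string)

def clean_tokens(text):
    """Hypothetical helper: lowercase, split on whitespace, keep tokens that pass check."""
    return [tok for tok in text.lower().split() if check(tok)]

print(clean_tokens("90 degree bracket for the deck"))
# ['degree', 'bracket', 'deck']
```

Note that this filter also discards tokens such as `12-gauge`, because they contain a digit. Whether that loss is acceptable depends on how much the numeric specifications matter for matching.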

