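# Home Depot Product Search Relevance Prediction

Kaggle competition: https://www.kaggle.com/c/home-depot-product-search-relevance

Home Depot is a US retail website for home furnishings and building materials. Users type keywords into the search box and get related products and services; for example, searching for `floor` returns flooring in various materials, floor-cleaning products, floor installation services, and so on. The goal of the Kaggle competition is to design a model that matches search keywords more accurately, so that searches return more relevant products and services.

## Required imports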
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
df_train = pd.read_csv('train.csv', encoding='ISO-8859-1')
df_test = pd.read_csv('test.csv', encoding='ISO-8859-1')
df_desc = pd.read_csv('product_descriptions.csv')
df_train.head(3)

|   | id | product_uid | product_title | search_term | relevance |
|---|---|---|---|---|---|
| 0 | 2 | 100001 | Simpson Strong-Tie 12-Gauge Angle | angle bracket | 3.0 |
| 1 | 3 | 100001 | Simpson Strong-Tie 12-Gauge Angle | l bracket | 2.5 |
| 2 | 9 | 100002 | BEHR Premium Textured DeckOver 1-gal. #SC-141 … | deck over | 3.0 |

df_test.head(3)

|   | id | product_uid | product_title | search_term |
|---|---|---|---|---|
| 0 | 1 | 100001 | Simpson Strong-Tie 12-Gauge Angle | 90 degree bracket |
| 1 | 4 | 100001 | Simpson Strong-Tie 12-Gauge Angle | metal l brackets |
| 2 | 5 | 100001 | Simpson Strong-Tie 12-Gauge Angle | simpson sku able |

df_desc.head(3)

|   | product_uid | product_description |
|---|---|---|
| 0 | 100001 | Not only do angles make joints stronger, they … |
| 1 | 100002 | BEHR Premium Textured DECKOVER is an innovativ… |
| 2 | 100003 | Classic architecture meets contemporary design… |

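In train, `relevance` is the target we need to predict on test. It ranges from 1 to 3 and indicates how relevant a product is, with 3 the highest and 1 the lowest. `search_term` is the query, so each row records how relevant a product is for a given search term. The product description data holds the description text for each `product_uid`.

We concatenate train and test so they can be processed together. Since `product_uid` is the column shared with the description data, the descriptions can be merged in as well.
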
df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True)

D:\programs\anaconda\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=True'.
To retain the current behavior and silence the warning, pass sort=False
  """Entry point for launching an IPython kernel.

df_all.head(3)

|   | id | product_title | product_uid | relevance | search_term |
|---|---|---|---|---|---|
| 0 | 2 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 3.0 | angle bracket |
| 1 | 3 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 2.5 | l bracket |
| 2 | 9 | BEHR Premium Textured DeckOver 1-gal. #SC-141 … | 100002 | 3.0 | deck over |

df_all.shape
(240760, 5)
df_all = df_all.merge(df_desc, on='product_uid', how='left')
df_all.head(3)

|   | id | product_title | product_uid | relevance | search_term | product_description |
|---|---|---|---|---|---|---|
| 0 | 2 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 3.0 | angle bracket | Not only do angles make joints stronger, they … |
| 1 | 3 | Simpson Strong-Tie 12-Gauge Angle | 100001 | 2.5 | l bracket | Not only do angles make joints stronger, they … |
| 2 | 9 | BEHR Premium Textured DeckOver 1-gal. #SC-141 … | 100002 | 3.0 | deck over | BEHR Premium Textured DECKOVER is an innovativ… |

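The concat-then-merge steps can be checked on a toy example. This is only a sketch: the three small frames (`toy_train`, `toy_test`, `toy_desc`) are made-up stand-ins for the real CSV files, and splitting the combined frame back by position is a common convention assumed here, not something the original notebook shows.

```python
import pandas as pd

# Toy stand-ins for train.csv, test.csv and product_descriptions.csv (made-up rows).
toy_train = pd.DataFrame({
    "id": [1, 2],
    "product_uid": [100, 101],
    "search_term": ["angle bracket", "deck paint"],
    "relevance": [3.0, 2.0],
})
toy_test = pd.DataFrame({
    "id": [3],
    "product_uid": [100],
    "search_term": ["metal bracket"],
})
toy_desc = pd.DataFrame({
    "product_uid": [100, 101],
    "product_description": ["steel angle for joints", "textured deck coating"],
})

# Stack train on top of test; test rows get NaN in the missing relevance column.
toy_all = pd.concat((toy_train, toy_test), axis=0, ignore_index=True, sort=False)

# A left merge keeps every row of toy_all and attaches the matching description.
toy_all = toy_all.merge(toy_desc, on="product_uid", how="left")

# Assumption: because train rows come first, the parts can be split again by position.
n_train = len(toy_train)
toy_train_part = toy_all.iloc[:n_train]
toy_test_part = toy_all.iloc[n_train:]

print(toy_all.shape)
print(toy_test_part["relevance"].isna().all())
```

Passing `sort=False` explicitly keeps the columns in their original order and avoids the `FutureWarning` that older pandas versions emit when the two frames do not share the same columns.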
## Text preprocessing
from nltk.stem.snowball import SnowballStemmer
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
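### Stemming

Because the Home Depot task is search matching, consistent text matters. We run a stemmer over the text features to reduce each word to its stem, so that a search term has only one surface form in the text.
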
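To make the idea concrete, here is a minimal sketch of stemming with the `SnowballStemmer` imported above. The helper name `stem_text` and the sample strings are assumptions for illustration; the original notebook may apply the stemmer differently.

```python
from nltk.stem.snowball import SnowballStemmer

# The English Snowball stemmer ships with NLTK and needs no extra corpus download.
stemmer = SnowballStemmer("english")

def stem_text(text):
    """Hypothetical helper: lowercase a string and stem each whitespace-separated word."""
    return " ".join(stemmer.stem(word) for word in str(text).lower().split())

# Singular and plural forms collapse to the same stem, so they can match each other.
stemmed_query = stem_text("Metal L Brackets")
stemmed_title = stem_text("Simpson Strong-Tie 12-Gauge Angle bracket")
print(stemmed_query)
print(stemmed_title)
```

On the real data, a helper like this would typically be applied to `search_term`, `product_title` and `product_description` with `Series.map`, so that query and product text are compared in the same normalized form.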
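# English stop words (requires a one-time nltk.download('stopwords'))
stop = stopwords.words('english')
import re

def hasnumber(input_str):
    # True if the string contains at least one digit
    return bool(re.search(r'\d', input_str))

def check(string):
    # Keep a token only if it is neither a stop word nor contains a digit
    if string in stop:
        return False
    elif hasnumber(string):
        return False
    else:
        return True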
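
One way a filter like `check` might be used is sketched below. To stay self-contained, the sketch replaces the NLTK stop-word list with a tiny hand-written set, and the helpers `keep_tokens` and `common_word_count` are hypothetical names introduced here; they are not taken from the original notebook.

```python
import re

# Assumption: a tiny hand-written stop-word set stands in for
# nltk.corpus.stopwords.words('english'), which needs a corpus download.
stop = {"the", "a", "an", "for", "of", "in", "with", "and"}

def hasnumber(input_str):
    # True if the string contains at least one digit.
    return bool(re.search(r"\d", input_str))

def check(string):
    # Keep a token only if it is neither a stop word nor contains a digit.
    return string not in stop and not hasnumber(string)

def keep_tokens(text):
    """Hypothetical helper: lowercase, split on whitespace, keep tokens passing check()."""
    return [tok for tok in text.lower().split() if check(tok)]

def common_word_count(search_term, product_text):
    """Hypothetical feature: number of kept query tokens that also occur in the product text."""
    product_tokens = set(keep_tokens(product_text))
    return sum(tok in product_tokens for tok in keep_tokens(search_term))

kept = keep_tokens("the 12-gauge angle bracket for decks")
overlap = common_word_count("angle bracket", "Simpson Strong-Tie 12-Gauge Angle")
print(kept)
print(overlap)
```

A count like `overlap` is one simple candidate feature for a regressor such as the `RandomForestRegressor` imported at the top; whether dropping tokens that contain digits helps is a judgment call, since sizes and gauges can matter in product search.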