Data Cleaning: Data Discretization

Data Discretization

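  • Data discretization means binning: mapping continuous values into a small number of intervals.
  • The two common binning methods are equal-frequency binning and equal-width binning.
  • In pandas this is usually done with pd.cut (equal-width or custom edges) or pd.qcut (equal-frequency).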
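
To make the difference between the two methods concrete, here is a minimal sketch on a small synthetic series. The values and the 'low'/'high' labels are made up for illustration and are not taken from the dataset used later.

```python
import pandas as pd

# Synthetic, right-skewed sample (illustrative values only)
prices = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal-width: 2 bins that each cover the same span of values
equal_width = pd.cut(prices, bins=2, labels=['low', 'high'])

# Equal-frequency: 2 bins that each hold the same number of observations
equal_freq = pd.qcut(prices, q=2, labels=['low', 'high'])

width_counts = equal_width.value_counts().to_dict()
freq_counts = equal_freq.value_counts().to_dict()
print(width_counts)  # 9 values land in 'low', 1 in 'high'
print(freq_counts)   # 5 values in each bin
```

With skewed data, equal-width bins tend to pile most observations into the first bin, while equal-frequency bins stay balanced at the cost of very uneven interval widths.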

pandas.cut(x, bins, right=True, labels=None)

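  • x: the data to bin (a one-dimensional array or Series)
  • bins: the number of equal-width bins, or a sequence of bin edges
  • labels: the labels assigned to the resulting bins
  • right: whether each interval includes its right edge (default True)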
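
The effect of right and labels is easiest to see on values that sit exactly on a bin edge. The following is a small sketch with made-up values and labels.

```python
import pandas as pd

values = pd.Series([10, 20, 30])
edges = [0, 10, 20, 30]

# right=True (default): intervals are (0, 10], (10, 20], (20, 30]
right_closed = pd.cut(values, bins=edges, right=True, labels=['a', 'b', 'c'])

# right=False: intervals are [0, 10), [10, 20), [20, 30)
left_closed = pd.cut(values, bins=edges, right=False, labels=['a', 'b', 'c'])

# labels=False returns integer bin codes instead of category labels
codes = pd.cut(values, bins=edges, labels=False)

print(right_closed.tolist())  # ['a', 'b', 'c']
print(left_closed.tolist())   # ['b', 'c', nan]: 30 is outside [20, 30)
print(codes.tolist())         # [0, 1, 2]
```
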
import pandas as pd
import numpy as np
import os
os.getcwd()
'D:\\Jupyter\\notebook\\Python数据清洗实战\\数据'
os.chdir('D:\\Jupyter\\notebook\\Python数据清洗实战\\数据')
df = pd.read_csv('MotorcycleData.csv', encoding='gbk', na_values='Na')
# Strip the '$' sign and thousands separators, then convert to float
def f(x):
    if '$' in str(x):
        x = str(x).strip('$')
        x = str(x).replace(',', '')
    else:
        x = str(x).replace(',', '')
    return float(x)
df['Price'] = df['Price'].apply(f)
df['Mileage'] = df['Mileage'].apply(f)
df.head(5)
  | Condition | Condition_Desc | Price | Location | Model_Year | Mileage | Exterior_Color | Make | Warranty | Model | ... | Vehicle_Title | OBO | Feedback_Perc | Watch_Count | N_Reviews | Seller_Status | Vehicle_Tile | Auction | Buy_Now | Bid_Count
0 | Used | mint!!! very low miles | 11412.0 | McHenry, Illinois, United States | 2013.0 | 16000.0 | Black | Harley-Davidson | Unspecified | Touring | ... | NaN | FALSE | 8.1 | NaN | 2427 | Private Seller | Clear | True | FALSE | 28.0
1 | Used | Perfect condition | 17200.0 | Fort Recovery, Ohio, United States | 2016.0 | 60.0 | Black | Harley-Davidson | Vehicle has an existing warranty | Touring | ... | NaN | FALSE | 100 | 17 | 657 | Private Seller | Clear | True | TRUE | 0.0
2 | Used | NaN | 3872.0 | Chicago, Illinois, United States | 1970.0 | 25763.0 | Silver/Blue | BMW | Vehicle does NOT have an existing warranty | R-Series | ... | NaN | FALSE | 100 | NaN | 136 | NaN | Clear | True | FALSE | 26.0
3 | Used | CLEAN TITLE READY TO RIDE HOME | 6575.0 | Green Bay, Wisconsin, United States | 2009.0 | 33142.0 | Red | Harley-Davidson | NaN | Touring | ... | NaN | FALSE | 100 | NaN | 2920 | Dealer | Clear | True | FALSE | 11.0
4 | Used | NaN | 10000.0 | West Bend, Wisconsin, United States | 2012.0 | 17800.0 | Blue | Harley-Davidson | NO WARRANTY | Touring | ... | NaN | FALSE | 100 | 13 | 271 | OWNER | Clear | True | TRUE | 0.0

5 rows × 22 columns

df['Price_bin'] = pd.cut(df['Price'], 5, labels=range(5))
# Count how many rows fall into each bin
df['Price_bin'].value_counts()
0    6762
1     659
2      50
3      20
4       2
Name: Price_bin, dtype: int64
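
The counts above are heavily lopsided because equal-width edges depend only on the minimum and maximum. A quick way to see this is to ask pd.cut for the edges with retbins=True. The sketch below uses synthetic prices, not the motorcycle data.

```python
import pandas as pd

# Synthetic skewed prices with one extreme value (illustrative only)
prices = pd.Series([500, 800, 1200, 1500, 2000, 2500, 3000, 50000])

# retbins=True also returns the computed bin edges
binned, edges = pd.cut(prices, bins=5, labels=range(5), retbins=True)

# Each bin spans (max - min) / 5; pandas nudges the lowest edge slightly
# below the minimum so that the minimum itself falls inside the first bin
width = (prices.max() - prices.min()) / 5
counts = binned.value_counts().to_dict()

print(edges)
print(width)
print(counts)
```

A single large value stretches every bin, so nearly all rows end up in bin 0. This is the usual motivation for custom edges or equal-frequency binning.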
%matplotlib inline
df['Price_bin'].value_counts().plot(kind='bar')

[Figure: bar chart of Price_bin value counts (output_12_1.png)]

df['Price_bin'].hist()

[Figure: histogram of Price_bin (output_13_1.png)]

w = [100, 1000, 5000, 10000, 20000, 100000]
df['Price_bin'] = pd.cut(df['Price'], bins=w, labels=range(5))
df[['Price', 'Price_bin']].head(5)
     Price  Price_bin
0  11412.0          3
1  17200.0          3
2   3872.0          1
3   6575.0          2
4  10000.0          2
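
One thing to keep in mind with hand-picked edges: any value outside the outermost edges, or exactly equal to the lowest edge when right=True, is assigned NaN. A minimal sketch, reusing the edge list w from above on made-up prices:

```python
import pandas as pd

# Made-up prices chosen to probe the edges (illustrative only)
prices = pd.Series([50, 100, 101, 5000, 100000, 250000])
w = [100, 1000, 5000, 10000, 20000, 100000]

binned = pd.cut(prices, bins=w, labels=range(5))

# Values that did not land in any interval
outside = prices[binned.isna()].tolist()
print(outside)  # [50, 100, 250000]
```

Checking binned.isna().sum() after a custom cut is a cheap way to confirm that the edges cover the whole column.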
df['Price_bin'].hist()

[Figure: histogram of Price_bin with custom bin edges (output_17_1.png)]


# Quantile levels
k = 5
w = [1.0 * i/k for i in range(k+1)]
w
[0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
# Equal-frequency split into 5 bins
df['Price_bin'] = pd.qcut(df['Price'], q=w, labels=range(5))
df['Price_bin'].hist()

[Figure: histogram of Price_bin with equal-frequency bins (output_21_1.png)]
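
As a sanity check on what pd.qcut does, the sketch below bins synthetic, randomly generated prices and compares the edges it returns with the quantiles from Series.quantile. The exponential distribution and the seed are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic continuous prices (arbitrary distribution and seed)
rng = np.random.default_rng(0)
prices = pd.Series(rng.exponential(scale=8000, size=1000))

q = [i / 5 for i in range(6)]  # [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
binned, edges = pd.qcut(prices, q=q, labels=range(5), retbins=True)

counts = binned.value_counts().to_dict()
quantiles = prices.quantile(q).to_numpy()

print(counts)                         # 200 rows in each of the 5 bins
print(np.allclose(edges, quantiles))  # the edges are the quantiles
```

With many tied values the quantile edges can coincide, in which case pd.qcut raises an error unless duplicates='drop' is passed.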

# Compute the quantile cut points
k = 5
w1 = df['Price'].quantile([1.0 * i/k for i in range(k+1)])
w1
0.0         0.0
0.2      3500.0
0.4      6491.0
0.6      9777.0
0.8     14999.0
1.0    100000.0
Name: Price, dtype: float64
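# The first cut point should usually sit slightly below the actual minimum,
# and the last cut point slightly above the actual maximum.
# Note: the minimum here is 0, so scaling leaves it at 0; rows priced exactly 0
# fall outside the first bin unless pd.cut is called with include_lowest=True.
w1[0.0] = w1[0.0] * 0.95
w1[1.0] = w1[1.0] * 1.1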
w1
0.0         0.0
0.2      3500.0
0.4      6491.0
0.6      9777.0
0.8     14999.0
1.0    110000.0
Name: Price, dtype: float64
# Cut using the new edges; include_lowest=True keeps rows equal to the lowest edge
df['Price_bin'] = pd.cut(df['Price'], bins=w1, labels=range(5), include_lowest=True)
df['Price_bin'].hist()

[Figure: histogram of Price_bin with padded quantile edges (output_27_1.png)]
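
Padding the outer edges by a multiplicative factor has a corner case when the minimum is 0: the scaled edge is still 0. The following sketch, on synthetic prices, compares the default behaviour of pd.cut with include_lowest=True in that situation.

```python
import pandas as pd

# Synthetic prices whose minimum is 0 (illustrative only)
prices = pd.Series([0, 100, 200, 300, 400, 500, 600, 700, 800, 1000])

edges = prices.quantile([0.0, 0.5, 1.0])
edges[0.0] = edges[0.0] * 0.95  # still 0: scaling cannot lower a zero edge
edges[1.0] = edges[1.0] * 1.1

# Default (right=True): the first interval is open on the left, so 0 is left out
default_cut = pd.cut(prices, bins=edges.to_numpy(), labels=range(2))

# include_lowest=True closes the first interval on the left
inclusive_cut = pd.cut(prices, bins=edges.to_numpy(), labels=range(2),
                       include_lowest=True)

n_missing_default = int(default_cut.isna().sum())
n_missing_inclusive = int(inclusive_cut.isna().sum())
print(n_missing_default, n_missing_inclusive)  # 1 0
```

An alternative under the same assumption is to subtract a small constant from the lowest edge instead of scaling it.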

