Analysis of an E-commerce Platform's Transaction Data with Python

2023-12-08 02:06:00

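Analysis of an E-commerce Platform's 2016 Transaction Data with Python

Analysis framework:

1. Understanding the data
1.1 Importing the data
1.2 Interpreting the data
2. Data cleaning
3. Feature analysis of the transaction data
3.1 Product dimension
3.2 User dimension
3.3 City dimension
3.4 Price dimension
3.5 Channel dimension
3.6 Time dimension

1. Understanding the data

1.1 Importing the data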

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
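# Load the order data, using the id column as the index
df=pd.read_csv(r'D:\order_info_2016.csv',index_col='id')

# Divide both amounts by 100 (the raw values appear to be stored in hundredths of the currency unit)
df['payMoney']=df['payMoney']/100
df['price']=df['price']/100
df.info()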


Int64Index: 104557 entries, 47510 to 10889
Data columns (total 10 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   orderId     104557 non-null  int64  
 1   userId      104557 non-null  int64  
 2   productId   104557 non-null  int64  
 3   cityId      104557 non-null  int64  
 4   price       104557 non-null  float64
 5   payMoney    104557 non-null  float64
 6   channelId   104549 non-null  object 
 7   deviceType  104557 non-null  int64  
 8   createTime  104557 non-null  object 
 9   payTime     104557 non-null  object 
dtypes: float64(2), int64(5), object(3)
memory usage: 8.8+ MB

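1.2 Interpreting the data

This is a two-dimensional table that uses the order ID as the unique record label; each row describes one order's product, user, channel, prices, and transaction times. The dataset contains 104,557 rows and the following 10 fields:
Data columns (total 10 columns):

#  Column      Non-Null Count   Dtype    Meaning
0  orderId     104557 non-null  int64    Order ID, nine digits
1  userId      104557 non-null  int64    User ID, six digits
2  productId   104557 non-null  int64    Product ID, a number from 0 to 1000
3  cityId      104557 non-null  int64    City ID, five or six digits
4  price       104557 non-null  float64  List price, a non-zero number
5  payMoney    104557 non-null  float64  Transaction price (may be negative)
6  channelId   104549 non-null  object   Channel ID, made up of letters and digits
7  deviceType  104557 non-null  int64    Device type, 1-6 for six different devices
8  createTime  104557 non-null  object   Order creation time, within 2015-2016
9  payTime     104557 non-null  object   Order payment time, within 2015-2016; must not be earlier than the creation time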
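
The field rules in the table can be turned into quick sanity checks. The following is only a minimal sketch, not the author's code: `count_rule_violations` is a hypothetical helper, and the three-row frame is toy data made up for illustration rather than rows from order_info_2016.csv.

```python
import pandas as pd

def count_rule_violations(orders: pd.DataFrame) -> dict:
    """Count rows that break the field rules listed in the table (hypothetical helper)."""
    return {
        "productId outside 0-1000": int((~orders["productId"].between(0, 1000)).sum()),
        "price equal to 0": int((orders["price"] == 0).sum()),
        "deviceType outside 1-6": int((~orders["deviceType"].between(1, 6)).sum()),
        "payTime before createTime": int((orders["payTime"] < orders["createTime"]).sum()),
    }

# Toy data: the second row deliberately breaks every rule
toy = pd.DataFrame({
    "productId": [10, 1001, 500],
    "price": [99.0, 0.0, 12.5],
    "deviceType": [1, 7, 3],
    "createTime": pd.to_datetime(["2016-01-02 10:00", "2016-01-03 10:00", "2016-01-04 10:00"]),
    "payTime": pd.to_datetime(["2016-01-02 10:05", "2016-01-03 09:00", "2016-01-04 10:00"]),
})

violations = count_rule_violations(toy)
print(violations)
```

Counting violations first, before dropping anything, makes it easier to decide whether a rule is worth enforcing on the real data.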

2. Data cleaning

# Preprocess orderId
# df['orderId'].value_counts(dropna=False)
df['orderId'].unique().size
# This shows 27 duplicated order IDs; leave them for now
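
The number of unique IDs only reveals duplicates once it is subtracted from the row count. As a small sketch on made-up data (the `toy` frame is an assumption, not the real dataset), `duplicated()` gives the count of repeated rows directly:

```python
import pandas as pd

# Toy data, not taken from order_info_2016.csv
toy = pd.DataFrame({"orderId": [101, 102, 102, 103, 103, 103]})

n_unique = toy["orderId"].nunique()               # number of distinct order IDs
n_extra = int(toy["orderId"].duplicated().sum())  # rows that repeat an earlier ID
print(n_unique, n_extra)
```

By default `duplicated()` marks every occurrence after the first, so `n_unique + n_extra` equals the number of rows.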

# Preprocess userId: duplicates are allowed, but values must be numeric
print(df['userId'].describe())
print(df['userId'].dtype)
# userId needs no processing

count    1.045570e+05
mean     3.270527e+06
std      4.138208e+07
min      2.930600e+04
25%      2.179538e+06
50%      2.705995e+06
75%      3.271237e+06
max      3.072939e+09
Name: userId, dtype: float64
int64

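# Preprocess productId: check for nulls; duplicates are allowed, but values must be numeric; anything from 0 to 1000 is acceptable, and 0 is left as-is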
print(df['productId'].describe())

count    104557.000000
mean        504.566275
std         288.130647
min           0.000000
25%         254.000000
50%         507.000000
75%         758.000000
max        1000.000000
Name: productId, dtype: float64

# Preprocess cityId: check for nulls; duplicates are allowed, but values must be numeric
df['cityId'].describe()

count    104557.000000
mean     154410.947225
std       72197.163762
min       30000.000000
25%      100011.000000
50%      150001.000000
75%      220002.000000
max      380001.000000
Name: cityId, dtype: float64

# Preprocess price: check for nulls and negative values
df['price'].describe()

count    104557.000000
mean        916.734987
std         915.883592
min           6.000000
25%         379.000000
50%         592.000000
75%        1080.000000
max       22956.000000
Name: price, dtype: float64

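# Preprocess payMoney: check for nulls and negative values
print(df['payMoney'].describe())
print(df[df['payMoney']<0].index)# Negative values may represent returns; with only six such rows, they cannot support an analysis of returned goods
df.drop(index=df[df['payMoney']<0].index,inplace=True)# The sample is large enough, so drop these six rows to keep them from skewing the analysis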

count    104557.000000
mean        868.668948
std         907.202848
min         -10.000000
25%         336.000000
50%         550.000000
75%        1040.000000
max       22942.000000
Name: payMoney, dtype: float64
Int64Index([66897, 87878, 81494, 72556, 25344, 55044], dtype='int64', name='id')
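
An alternative to dropping by index label is to keep the rows that pass a boolean mask. This is a sketch on toy data (the values and the `toy` frame are made up), not a replacement for the author's step:

```python
import pandas as pd

# Toy data, not taken from order_info_2016.csv
toy = pd.DataFrame(
    {"payMoney": [120.0, -10.0, 0.0, 35.5]},
    index=pd.Index([1, 2, 3, 4], name="id"),
)

negative = toy["payMoney"] < 0      # True where the paid amount is negative
n_negative = int(negative.sum())    # how many rows would be removed
cleaned = toy[~negative].copy()     # keep everything else, leaving toy untouched
print(n_negative, list(cleaned.index))
```

A mask selects rows by position, so the result does not depend on the index labels being unique.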

# Preprocess channelId: check for nulls
print(df[df['channelId'].isnull()])# 8 rows have a null channelId
df.drop(index=df[df['channelId'].isnull()].index,inplace=True)# Drop the records whose channelId is null

          orderId   userId  productId  cityId   price  payMoney channelId  \
id                                                                          
19086   284008366  3309847        698  240001  2164.0    2040.0       NaN   
38175   287706890  2799815        823   70001   760.0     749.0       NaN   
100952  283627429  4156620        269  280001   484.0     410.0       NaN   
48073   248057459  3970570        142  130001   474.0     400.0       NaN   
100954  352853915  2229389        786  240001   474.0     440.0       NaN   
75949   266847859  3761925        649  120006   257.0     257.0       NaN   
100955  379473081  4531810         18  180009   146.0      50.0       NaN   
100953  346836140  3751526        738  100013   105.0      80.0       NaN   

        deviceType        createTime           payTime  
id                                                      
19086            2    2016/3/8 22:36    2016/3/8 22:36  
38175            3   2016/6/10 22:30   2016/6/10 22:30  
100952           2  2016/12/13 13:24  2016/12/13 14:47  
48073            2   2016/3/30 12:59   2016/3/30 12:59  
100954           2  2016/12/13 16:54  2016/12/13 16:55  
75949            2    2016/8/19 8:46    2016/8/19 8:46  
100955           3  2016/12/13 20:18  2016/12/13 20:18  
100953           1  2016/12/13 13:47  2016/12/13 13:47
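
Rows with a missing channel can also be removed with `dropna` restricted to that column. A minimal sketch, assuming toy data with invented channel IDs:

```python
import numpy as np
import pandas as pd

# Toy data; the channel IDs are made up for illustration
toy = pd.DataFrame({
    "orderId": [1, 2, 3],
    "channelId": ["a1b2c3", np.nan, "d4e5f6"],
})

n_missing = int(toy["channelId"].isna().sum())  # rows with no channel
cleaned = toy.dropna(subset=["channelId"])      # drop only rows where channelId is null
print(n_missing, len(cleaned))
```

Passing `subset` keeps rows that have nulls in other columns, which matches the intent of cleaning one field at a time.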

# Preprocess deviceType: values range from 1 to 6
print(df['deviceType'].describe())
df['deviceType'].value_counts()
# df.info()

count    104543.000000
mean          2.385325
std           0.648467
min           1.000000
25%           2.000000
50%           2.000000
75%           3.000000
max           6.000000
Name: deviceType, dtype: float64

2    52440
3    42944
1     7052
4     2017
6       87
5        3
Name: deviceType, dtype: int64

# Preprocess createTime and payTime
print(df['createTime'].dtype)# Check the dtype
df['createTime']=pd.to_datetime(df['createTime'])# Convert both createTime and payTime to datetime
df['payTime']=pd.to_datetime(df['payTime'])

object
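
The timestamps printed earlier look like `2016/3/8 22:36`. Assuming every value follows that pattern, the format can be stated explicitly and unparseable strings turned into `NaT` instead of raising an error. This is a sketch on a made-up series, not the author's conversion:

```python
import pandas as pd

# Toy strings in the same shape as the timestamps shown above, plus one bad value
raw = pd.Series(["2016/3/8 22:36", "2016/12/13 13:24", "not a date"])

# errors="coerce" converts anything that does not match the format to NaT
parsed = pd.to_datetime(raw, format="%Y/%m/%d %H:%M", errors="coerce")
n_bad = int(parsed.isna().sum())
print(parsed.dtype, n_bad)
```

Counting the `NaT` values afterwards shows how many rows failed to parse before they are silently carried into later steps.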

# Delete orders created before 2016
import datetime
startime1=datetime.datetime(2016,1,1)
endtime1=datetime.datetime(2016,12,31,23,59,59)
# startime2=datetime.datetime('2016-1-1')
# endtime2=pd.to_datetime('2016-12-31')
df[df['createTime']<startime1]
df.drop(index=df[df['createTime']<startime1].index,inplace=True)# Drop the rows created before 2016

# Delete orders whose payment time is earlier than the creation time
df[df['payTime']<df['createTime']]
df.drop(index=df[df['payTime']<df['createTime']].index,inplace=True)
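
The two time rules can be combined into a single mask. The sketch below uses a hypothetical helper, `keep_valid_times`, and toy timestamps; like the code above, it applies only the lower bound on the creation time.

```python
import pandas as pd

def keep_valid_times(orders: pd.DataFrame, start: pd.Timestamp) -> pd.DataFrame:
    """Keep orders created on or after `start` and paid no earlier than created (hypothetical helper)."""
    created_in_range = orders["createTime"] >= start
    paid_after_created = orders["payTime"] >= orders["createTime"]
    return orders[created_in_range & paid_after_created].copy()

# Toy data: row 0 was created in 2015, row 2 was paid before it was created
toy = pd.DataFrame({
    "createTime": pd.to_datetime(["2015-12-31 23:50", "2016-01-01 00:10", "2016-05-01 12:00"]),
    "payTime": pd.to_datetime(["2016-01-01 00:05", "2016-01-01 00:12", "2016-05-01 11:00"]),
})

kept = keep_valid_times(toy, pd.Timestamp("2016-01-01"))
print(list(kept.index))
```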

df.info()


Int64Index: 104533 entries, 47510 to 10889
Data columns (total 10 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   orderId     104533 non-null  int64         
 1   userId      104533 non-null  int64         
 2   productId   104533 non-null  int64         
 3   cityId      104533 non-null  int64         
 4   price       104533 non-null  float64       
 5   payMoney    104533 non-null  float64       
 6   channelId   104533 non-null  object        
 7   deviceType  104533 non-null  int64         
 8   createTime  104533 non-null  datetime64[ns]
 9   payTime     104533 non-null  datetime64[ns]
dtypes: datetime64[ns](2), float64(2), int64(5), object(1)
memory usage: 8.8+ MB

# 27 rows with a duplicated orderId remain; the sample is large enough, so drop the duplicates
df['orderId'].unique().size
df.drop(index=df[df['orderId'].duplicated()].index,inplace=True)
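
pandas also offers `drop_duplicates`, which removes repeated order IDs in one call. A minimal sketch on toy data (the frame and its values are made up), keeping the first occurrence of each ID:

```python
import pandas as pd

# Toy data, not taken from order_info_2016.csv
toy = pd.DataFrame(
    {"orderId": [101, 102, 102, 103], "payMoney": [10.0, 20.0, 20.0, 30.0]},
    index=pd.Index([5, 6, 7, 8], name="id"),
)

# keep="first" retains the first row seen for each orderId and drops later repeats
deduped = toy.drop_duplicates(subset="orderId", keep="first")
print(list(deduped.index))
```

Whether the first or the last occurrence should be kept depends on what the duplicates mean in the real data, which the source does not state.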
