Pandas Study Notes (Part 1)

2023-10-06 23:11:38

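I. Preface
It is the second week of term, and I want to learn data processing properly. The first step was picking Python back up: I borrowed a copy of 《Python可以这样学》 from our team's small library and spent a week typing through it.
On top of that came the double ordeal of the CCF exam and a mathematical modeling contest on the weekend of the second week. I wrote several thousand lines of code that week, and my Python is now basically usable.
The third week went to a MOOC on Coursera, the University of Michigan's Introduction to Data Science in Python. I plan to finish all four assignments by this weekend. The companion reference book is Python for Data Analysis (《利用Python进行数据分析》), and its IPython notebooks are on GitHub as the pydata-book project. It is fairly hard at the start, because indexing and aggregation on the DataFrame data structure are so flexible. I also looked for watchable courses on Bilibili; many people on Zhihu recommend Mofan's (莫烦) course, but it is several years old and quite dated. There is also the GitHub project pandas_exercises. Studying Pandas for roughly 4 to 8 hours a day, it took me five days to become basically productive; I have a few other courses to attend, so progress was a little slow. Now that I have a feel for it, I am writing this post to record what I learned.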
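II. Overview of the data structures
The Pandas library has two basic data structures.
The first is the Series, which consists of an index and one column of data. It can be seen as a special array whose index (subscript) can be customized. It is one-dimensional.
The second is the DataFrame, a table that consists of an index and one or more columns of data. It is comparable to a database table or an Excel sheet. In effect it is several Series sharing one index. It is two-dimensional.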
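The relationship between the two structures can be seen in a small sketch. The labels and values below are made up for illustration only.

```python
import pandas as pd

# Two Series that share the same custom index (labels are made up)
age = pd.Series([23, 35, 41], index=["ann", "bob", "cid"])
city = pd.Series(["Oslo", "Lima", "Kyiv"], index=["ann", "bob", "cid"])

# A DataFrame built from Series that share one index
people = pd.DataFrame({"age": age, "city": city})

print(age.ndim)                         # a Series is one-dimensional
print(people.ndim)                      # a DataFrame is two-dimensional
print(type(people["age"]))              # each column is itself a Series
print(people.index.equals(age.index))   # the index is shared
```

Selecting a single column gives back a Series, which is why most Series methods in the next section also apply to DataFrame columns.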
III. Series
1. Creation

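import numpy as np
import pandas as pd

# 1. Pass an array directly
obj = pd.Series(np.arange(5))
print(obj)
# Output:
# 0    0
# 1    1
# 2    2
# 3    3
# 4    4
# dtype: int32   (the integer width depends on the platform and NumPy version; int64 is also common)
# By default the index behaves like array subscripts; other labels, such as characters, can be used instead.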
# 2. Array + labels
import string
obj = pd.Series(np.arange(5), index=list(string.ascii_uppercase[:5]))
print(obj)
# Output:
# A    0
# B    1
# C    2
# D    3
# E    4
# dtype: int32
# 3. Dictionary: the keys become the index
data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj = pd.Series(data)
print(obj)
# Output:
# Ohio      35000
# Texas     71000
# Oregon    16000
# Utah       5000
# dtype: int64
# 4. Take a column from a DataFrame
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
print(frame.state)
# Output:
# 0      Ohio
# 1      Ohio
# 2      Ohio
# 3    Nevada
# 4    Nevada
# 5    Nevada
# Name: state, dtype: object
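Once a Series carries custom labels, a value can be read either by label or by position. A minimal sketch, reusing the dictionary from the third creation example above:

```python
import pandas as pd

data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj = pd.Series(data)

print(obj.loc['Texas'])     # label-based lookup
print(obj.iloc[1])          # position-based lookup of the same element
print(obj.index.tolist())   # the labels come from the dictionary keys
print(obj.values)           # the underlying NumPy array of values
```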

2. Attributes and methods

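1. idxmax() / argmax()
idxmax() returns the index label of the largest value; argmax() returns its integer position.

2. concat()
pd.concat([data1, data2, data3], axis=1)
Combines the three Series into one DataFrame. With axis=1 the Series are placed side by side as columns and aligned on their index.

3. to_frame()
Converts a Series into a DataFrame.

4. reindex()
.reindex() rearranges the data to follow the new index. A label that does not exist in the current index gets a missing value (NaN), or fill_value if one is given.
s2 = s.reindex(['c','b','a','d'], fill_value = 0)

5. Series alignment
s1 = pd.Series(np.random.rand(3), index = ['Jack','Marry','Tom'])
s2 = pd.Series(np.random.rand(3), index = ['Wang','Jack','Marry'])
What does s1 + s2 give? (the numbers differ on every run because the data is random)
Jack     1.85397
Marry    1.48133
Tom          NaN
Wang         NaN
dtype: float64
Values are added label by label; a label that appears in only one of the two Series yields NaN.

6. drop()
s = pd.Series(np.random.rand(5), index = list('ngjur'))
s1 = s.drop('n')          # drop a single label
s2 = s.drop(['g','j'])    # drop several labels; s itself is unchanged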
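The alignment and reindexing behaviour described above can be checked with fixed numbers instead of random ones. This is a minimal sketch; the values are made up so that the result is reproducible.

```python
import numpy as np
import pandas as pd

# Fixed, made-up values so the result is reproducible
s1 = pd.Series([1.0, 2.0, 3.0], index=['Jack', 'Marry', 'Tom'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['Wang', 'Jack', 'Marry'])

total = s1 + s2                     # aligned by label, not by position
filled = s1.add(s2, fill_value=0)   # treat a missing partner as 0 instead of NaN

print(total)
print(filled)

# reindex() reorders by label and introduces 'Wang', which s1 lacks
reordered = s1.reindex(['Tom', 'Jack', 'Wang'], fill_value=0)
print(reordered)
```

Using add() with fill_value is one way to avoid the NaN entries when a label missing from one side should count as zero.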

IV. DataFrame
1. Creation

users = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user', sep='|', index_col='user_id')
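The same call can be tried without network access by reading from an in-memory string. This is only a sketch: the column layout is assumed from how these notes use the data (user_id as the index, plus age and occupation among four columns), and the three rows are illustrative sample data, not the real file.

```python
import io
import pandas as pd

# Illustrative rows in an assumed pipe-separated layout
raw = io.StringIO(
    "user_id|age|gender|occupation|zip_code\n"
    "1|24|M|technician|85711\n"
    "2|53|F|other|94043\n"
    "3|23|M|writer|32067\n"
)

# sep='|' sets the field separator; index_col turns user_id into the row index
users_demo = pd.read_csv(raw, sep='|', index_col='user_id')

print(users_demo.shape)              # (rows, columns)
print(users_demo.index.name)         # the column that became the index
print(users_demo.columns.tolist())
```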

2. Attributes and methods

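1. shape
users.shape
# returns a tuple
(943, 4)
# the number of observations (rows) in the dataset:
users.shape[0]
# the number of columns:
users.shape[1]

2. columns
users.columns
# returns an Index object holding all the column names; it can be converted to a list
users.columns.tolist()

3. index
users.index
# returns an Index object: the row index of this DataFrame

4. dtypes
# shows the dtype of each column
# returns a Series whose index is the column names and whose values are the corresponding dtypes
users.dtypes

5. [column_name]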
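# the available attributes depend on the DataFrame's columns: each column can be reached as an attribute
# this works only when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute or method; users['occupation'] always works
users.occupation
# setting the column names
e.g.
wine.columns = ['alcohol', 'malic_acid', 'alcalinity_of_ash', 'magnesium', 'flavanoids', 'proanthocyanins', 'hue']

6. value_counts()
# applied to a Series (a single column), it counts the occurrences of each distinct value, similar to an aggregation
# the result is sorted by count in descending order, which is handy for finding the most or least frequent value
users.occupation.value_counts()

7. describe()
# summarize the DataFrame
users.describe(include="all")
# a single column can also be selected first and described on its own

8. info() / head() / tail()
# inspect the data; info() mainly shows whether values are missing and how many
users.info()
users.head() # returns a DataFrame with the first rows of users (5 by default)

9. mean() / max() / min() / median() / std()
# summary statistics
users.age.mean()
users.age.std()

10. drop_duplicates()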
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'
chipo = pd.read_csv(url, sep = '\t')
#chipo.drop_duplicates(inplace=True)	# drop rows that are duplicated in every column
chipo.drop_duplicates(subset=['item_name','quantity'],keep='first',inplace=True)
# subset: the columns to compare when looking for duplicates
# keep: 'first' keeps the first occurrence, 'last' keeps the last one

11. query()
# query function, comparable to the WHERE clause of an SQL SELECT; the expression is passed as a string, e.g.
df.query('col1 > col2')
# equivalent to df[df.col1 > df.col2]

12. sort_values() / sort_index()
chipo.sort_values(by = ["item_name", "quantity"], ascending=False)
# descending order, first by item_name, then by quantity
df.sort_index()
# defaults: ascending=True, inplace=False

13. isin()
isin() takes a list and tests, for each element of the column, whether it is in the list.
df.col.isin([a, b])

14. set_index()
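army.set_index('origin', inplace=True)
crime = crime.set_index('Year', drop = True)
# drop=True (the default) removes the column once it has become the index

15. Arithmetic
add       df.add(other)  df + other
subtract  df.sub(other)  df - other
multiply  df.mul(other)  df * other
divide    df.div(other)  df / other
The level parameter selects which level of a MultiIndex to match on.

16. apply() / applymap() / map()
Functional-style helpers
apply()     calls a function on each column of a DataFrame (or on each row with axis=1)
applymap()  calls a function on every element of a DataFrame (deprecated since pandas 2.1 in favor of DataFrame.map())
map()       calls a function on every value of a Series

17. unstack() / stack()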
regiment.groupby(['regiment', 'company']).preTestScore.mean().unstack()
# turns a MultiIndex into a plain Index:
# unstack() quickly converts a Series with a multi-level index into a DataFrame with a single-level index by moving the innermost index level into the columns

18. to_datetime()
crime.Year = pd.to_datetime(crime.Year, format='%Y')
# Convert the type of the column Year to datetime64
# Transform Yr_Mo_Dy to the date type datetime64
data["Yr_Mo_Dy"] = pd.to_datetime(data["Yr_Mo_Dy"])

19. del
del crime['Total']
# deletes the column 'Total' in place

20. resample()
'''
Group the year by decades and sum the values
Pay attention to the Population column number, summing this column is a mistake.
'''
# resample() needs a datetime-like index; '10AS' means 10-year bins anchored at the start of the year
# (pandas 2.2 and later deprecate 'AS' in favor of 'YS', i.e. '10YS')
crimes = crime.resample('10AS').sum()
population = crime['Population'].resample('10AS').max()
# Updating the "Population" column
crimes['Population'] = population
'''
Calculate the min, max and mean windspeeds and standard deviations 
of the windspeeds across all locations for each week (assume that 
the first week starts on January 2 1961) for the first 52 weeks
'''
weekly = data.resample('W').agg(['min','max','mean','std'])
# each original column is split into four sub-columns, one per statistic

21. idxmax()
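df.idxmax(0)
Population            2010
Violent               1990
Property              1990
Murder                1990
Forcible_Rape         1990
Robbery               1990
Aggravated_assault    1990
Burglary              1980
Larceny_Theft         1990
Vehicle_Theft         1990
dtype: int64
Returns a Series that holds, for each column, the index label at which that column reaches its maximum (not the maximum value itself).

22. append()
df1.append(df2)
# DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; use pd.concat([df1, df2]) instead

23. concat()
pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)
# the join_axes parameter of older versions was removed in pandas 1.0
all_data = pd.concat([data1, data2], axis=1)
axis=0 stacks the objects vertically (more rows); axis=1 places them side by side (more columns), aligned on the index

24. merge()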
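pd.merge(data1, data2, on=col_name, how='inner')
how = 'inner' / 'outer' / 'left' / 'right'
# pd.merge() has no axis parameter; it joins on columns given by on (or on the index with left_index / right_index)
e.g.
pd.merge(data1, data2, on='subject_id', how='inner')

25. rename()
Pass a dict to rename the matching columns.
e.g.
housemkt.rename(columns = {0: 'bedrs', 1: 'bathrs', 2: 'price_sqr_meter'}, inplace=True)

26. reset_index()
> https://zhuanlan.zhihu.com/p/110819220?from_voters_page=true
Changing the index:
set_index(col, drop=True) makes a column the index.
reset_index(drop=True) restores the default integer index 0..n-1 and discards the current index; with the default drop=False the current index is kept as an ordinary column.
df.index = range(df.shape[0]) also resets the index.

27. isnull() / notnull() / notna() / isna()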
df.isnull() tests every element of df.
df.isnull().any() reports which columns contain NaN; it returns a Series.
How can the number of NaN values in each column be counted?
df.isnull().sum()
# number of rows minus the number of missing values for each location, i.e. the number of non-missing values
data.shape[0] - data.isnull().sum()
data.notnull().sum()
# mean over all non-missing values; the denominator counts them
data.sum().sum() / data.notna().sum().sum()

28. to_period()
#Downsample the record to a yearly frequency for each location.
data.groupby(data.index.to_period('A')).mean()   # 'A' is spelled 'Y' in pandas 2.2 and later
#Downsample the record to a monthly frequency for each location.
data.groupby(data.index.to_period('M')).mean()
#Downsample the record to a weekly frequency for each location.
data.groupby(data.index.to_period('W')).mean()

29. dropna() / fillna()
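iris = iris.dropna(how='any')
# Delete the rows that have NaN
wine.alcohol.fillna(10, inplace = True)
# Fill the value of NaN with the number 10 in alcohol
# the thresh parameter sets the minimum number of non-missing values a row or column must have to be kept
df.dropna(axis='rows', thresh=3) # keep rows with at least 3 non-missing values
# missing values can be filled forward from the previous valid value (forward-fill) or backward from the next valid value (back-fill)
data.fillna(method="ffill")
data.fillna(method='bfill')
# fillna(method=...) is deprecated since pandas 2.1; data.ffill() and data.bfill() are the equivalents

30. drop()
# drop() removes rows by default; to remove columns, pass axis=1
e.g.
wine = wine.drop(wine.columns[[0,3,6,8,11,12,13]], axis = 1)
# Delete the first, fourth, seventh, ninth, twelfth, thirteenth and fourteenth columns
e.g.
result_df = df.drop(frames,axis=1)

31. T (transpose)
df.T

32. select_dtypes()
df.select_dtypes(exclude=['object']).columns.values
# the names of all columns whose dtype is not object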
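The difference between apply(), map() and the element-wise variant listed in item 16 is easier to see on a tiny frame. The sketch below uses made-up data; the hasattr check is only there so that it runs on pandas versions both before and after DataFrame.map() was introduced.

```python
import pandas as pd

# Small made-up frame
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# apply(): the function receives a whole column (axis=0, the default) ...
col_range = df.apply(lambda col: col.max() - col.min())
# ... or a whole row (axis=1)
row_sum = df.apply(lambda row: row['a'] + row['b'], axis=1)

# Series.map(): the function receives one value at a time
doubled = df['a'].map(lambda x: x * 2)

# Element-wise over the whole frame: DataFrame.map() where available,
# otherwise the older applymap()
elementwise = df.map if hasattr(df, 'map') else df.applymap
squared = elementwise(lambda x: x ** 2)

print(col_range)
print(row_sum)
print(doubled)
print(squared)
```

For simple arithmetic like this, plain vectorized expressions such as df['a'] * 2 are usually preferable; the function-based helpers matter when the logic cannot be written that way.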

3. Indexing

1. [column_name]
# Select more than one column
users_ = users[[col1, col2, col3]]

2. [row_name] & slicing
users_ = users[row1:row2]
users_ = users[0:3]		# the first three rows

3. Label-based indexing
# label slices include both endpoints
users_ = users.loc[row1:row2, [col1, col2, ...]]

4. Position-based indexing
users_ = users.iloc[1:5, 0:2]
users_ = users.iloc[[1,3,5], [2,4]]

5. Boolean indexing
See IV.2.11 query()
See IV.2.13 isin()
# .loc is another way to slice, using the labels of the columns and indexes
euro12.loc[euro12.Team.isin(['England', 'Italy', 'Russia']), ['Team','Shooting Accuracy']]
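The five ways of indexing can be compared side by side on one small frame. This is a sketch with made-up data; the row labels 'r1' to 'r4' and the column names are illustrative assumptions, not taken from the euro12 dataset.

```python
import pandas as pd

# Made-up data; the labels 'r1'..'r4' are only for illustration
df = pd.DataFrame(
    {'team': ['England', 'Italy', 'Spain', 'Russia'],
     'goals': [5, 6, 12, 5],
     'accuracy': [50.0, 43.0, 42.0, 22.5]},
    index=['r1', 'r2', 'r3', 'r4'],
)

by_label = df.loc['r1':'r3', ['team', 'goals']]   # label slice includes 'r3'
by_position = df.iloc[0:3, 0:2]                   # position slice stops before row 3
mask = df[df['goals'] > 5]                        # boolean mask
queried = df.query('goals > 5')                   # the same filter as a query string
members = df.loc[df['team'].isin(['England', 'Russia']), ['team', 'accuracy']]

print(by_label)
print(mask)
print(members)
```

Here by_label and by_position select the same block, which shows the one practical difference to remember: label slices are inclusive, position slices are not.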

V. Index

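1. Index as an immutable array
In: ind = pd.Index([2, 3, 5, 7, 11])
In: ind
Out: Int64Index([2, 3, 5, 7, 11], dtype='int64')
# pandas 2.0 and later print Index([2, 3, 5, 7, 11], dtype='int64'); Int64Index no longer exists there
Slicing
In: ind[1] 
Out: 3 
In: ind[::2] 
Out: Int64Index([2, 5, 11], dtype='int64')
Basic attributes
In: print(ind.size, ind.shape, ind.ndim, ind.dtype)
Out: 5 (5,) 1 int64
Unlike a NumPy array, an Index is immutable, so it cannot be modified through ordinary assignment.
> ind[1] = 0  # not allowed; raises a TypeError

2. Index as an ordered set
In: indA = pd.Index([1, 3, 5, 7, 9]) 
In: indB = pd.Index([2, 3, 5, 7, 11]) 
In: indA.intersection(indB)          # intersection
Out: Int64Index([3, 5, 7], dtype='int64') 
In: indA.union(indB)                 # union
Out: Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64') 
In: indA.symmetric_difference(indB)  # symmetric difference
Out: Int64Index([1, 2, 9, 11], dtype='int64')
# older pandas versions also accepted indA & indB, indA | indB and indA ^ indB as set operations; since pandas 2.0 these operators act element-wise instead

3. MultiIndex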
index = [('California', 2000), ('California', 2010), ('New York', 2000), ('New York', 2010), ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561] 
pop = pd.Series(populations, index=index)
index1 = pd.MultiIndex.from_tuples(index)
index1
Out:
MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]], codes=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
pop1 = pop.reindex(index1)
pop1
Out:
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
pop1[:, 2010]  # the result is a singly indexed Series
df = pd.DataFrame(np.random.rand(4, 2), index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]], columns=['data1', 'data2'])
df
Out:
        data1     data2
a 1  0.554233  0.356072
  2  0.925244  0.219474
b 1  0.441759  0.610054
  2  0.171495  0.886688
# built from a list of simple arrays, one per level
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
# built from a list of tuples, each holding the index values of one entry
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
# built from the Cartesian product of two indexes
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
# all three constructors give the same result
Out:
MultiIndex(levels=[['a', 'b'], [1, 2]], codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
Level names can be set through the names argument of any of the MultiIndex constructors above, or afterwards through the names attribute of the index.
pop1.index.names = ['state', 'year']
Multi-level indexing
# for a DataFrame with a three-level index: slice(None) selects everything on a level, so this picks the rows whose third level equals '2'
df.loc[(slice(None), slice(None), '2'), :]
# sort the index with sort_index() first; slicing a MultiIndex expects a sorted index
sort_index()
https://www.educoder.net/tasks/yqg5xw2a8oru
Columns can have a multi-level index too!
df = df.swaplevel('project', 'name')
# swap the index levels
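The MultiIndex operations from this section can be tied together in one runnable sketch. It reuses the population figures from the example above; building the index with from_product and naming the levels 'state' and 'year' up front is a choice made here for brevity.

```python
import pandas as pd

# Population figures taken from the example above
index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'],
)
pop = pd.Series(
    [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
    index=index,
)

in_2010 = pop.loc[:, 2010]     # one value per state, indexed by state
wide = pop.unstack()           # years move from the index into the columns
back = wide.stack()            # and back to a two-level Series
swapped = pop.swaplevel('state', 'year').sort_index()   # year becomes the outer level

print(in_2010)
print(wide)
print(swapped)
```

unstack() and stack() are inverses here because no state/year combination is missing; with gaps in the data, unstack() would introduce NaN values.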
