数据的重构
数据重构
数据的合并
任务一:将data文件夹里面的所有数据都载入,与之前的原始数据相比,观察他们的之间的关系
text_left_up = pd.read_csv("data/train-left-up.csv")
text_left_down = pd.read_csv("data/train-left-down.csv")
text_right_up = pd.read_csv("data/train-right-up.csv")
text_right_down = pd.read_csv("data/train-right-down.csv")
text_left_up.head()
| PassengerId | Survived | Pclass | Name | |
|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry |
text_left_down.head()
| PassengerId | Survived | Pclass | Name | |
|---|---|---|---|---|
| 0 | 440 | 0 | 2 | Kvillner, Mr. Johan Henrik Johannesson |
| 1 | 441 | 1 | 2 | Hart, Mrs. Benjamin (Esther Ada Bloomfield) |
| 2 | 442 | 0 | 3 | Hampe, Mr. Leon |
| 3 | 443 | 0 | 3 | Petterson, Mr. Johan Emil |
| 4 | 444 | 1 | 2 | Reynaldo, Ms. Encarnacion |
text_right_down.head()
| Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|
| 0 | male | 31.0 | 0 | 0 | C.A. 18723 | 10.500 | NaN | S |
| 1 | female | 45.0 | 1 | 1 | F.C.C. 13529 | 26.250 | NaN | S |
| 2 | male | 20.0 | 0 | 0 | 345769 | 9.500 | NaN | S |
| 3 | male | 25.0 | 1 | 0 | 347076 | 7.775 | NaN | S |
| 4 | female | 28.0 | 0 | 0 | 230434 | 13.000 | NaN | S |
text_right_up.head()
| Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|
| 0 | male | 22.0 | 1.0 | 0.0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | female | 38.0 | 1.0 | 0.0 | PC 17599 | 71.2833 | C85 | C |
| 2 | female | 26.0 | 0.0 | 0.0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | female | 35.0 | 1.0 | 0.0 | 113803 | 53.1000 | C123 | S |
| 4 | male | 35.0 | 0.0 | 0.0 | 373450 | 8.0500 | NaN | S |
【提示】结合之前我们加载的train.csv数据,大致预测一下上面的数据是什么
任务二:使用concat方法:将数据train-left-up.csv和train-right-up.csv横向合并为一张表,并保存这张表为result_up
list_up = [text_left_up,text_right_up]
result_up = pd.concat(list_up,axis=1)
result_up.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 3.0 | Braund, Mr. Owen Harris | male | 22.0 | 1.0 | 0.0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2.0 | 1.0 | 1.0 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1.0 | 0.0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3.0 | 1.0 | 3.0 | Heikkinen, Miss. Laina | female | 26.0 | 0.0 | 0.0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4.0 | 1.0 | 1.0 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1.0 | 0.0 | 113803 | 53.1000 | C123 | S |
| 4 | 5.0 | 0.0 | 3.0 | Allen, Mr. William Henry | male | 35.0 | 0.0 | 0.0 | 373450 | 8.0500 | NaN | S |
任务三:使用concat方法:将train-left-down和train-right-down横向合并为一张表,并保存这张表为result_down。然后将上边的result_up和result_down纵向合并为result。
list_down=[text_left_down,text_right_down]
result_down = pd.concat(list_down,axis=1)
result = pd.concat([result_up,result_down])
result.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 3.0 | Braund, Mr. Owen Harris | male | 22.0 | 1.0 | 0.0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2.0 | 1.0 | 1.0 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1.0 | 0.0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3.0 | 1.0 | 3.0 | Heikkinen, Miss. Laina | female | 26.0 | 0.0 | 0.0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4.0 | 1.0 | 1.0 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1.0 | 0.0 | 113803 | 53.1000 | C123 | S |
| 4 | 5.0 | 0.0 | 3.0 | Allen, Mr. William Henry | male | 35.0 | 0.0 | 0.0 | 373450 | 8.0500 | NaN | S |
任务四:使用DataFrame自带的方法join方法和append:完成任务二和任务三的任务
resul_up = text_left_up.join(text_right_up)
result_down = text_left_down.join(text_right_down)
result = result_up.append(result_down)
result.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 3.0 | Braund, Mr. Owen Harris | male | 22.0 | 1.0 | 0.0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2.0 | 1.0 | 1.0 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1.0 | 0.0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3.0 | 1.0 | 3.0 | Heikkinen, Miss. Laina | female | 26.0 | 0.0 | 0.0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4.0 | 1.0 | 1.0 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1.0 | 0.0 | 113803 | 53.1000 | C123 | S |
| 4 | 5.0 | 0.0 | 3.0 | Allen, Mr. William Henry | male | 35.0 | 0.0 | 0.0 | 373450 | 8.0500 | NaN | S |
任务五:使用Panads的merge方法和DataFrame的append方法:完成任务二和任务三的任务
result_up = pd.merge(text_left_up,text_right_up,left_index=True,right_index=True)
result_down = pd.merge(text_left_down,text_right_down,left_index=True,right_index=True)
result = resul_up.append(result_down)
result.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1.0 | 0.0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1.0 | 0.0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0.0 | 0.0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1.0 | 0.0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0.0 | 0.0 | 373450 | 8.0500 | NaN | S |
【思考】对比merge、join以及concat的方法的不同以及相同。思考一下在任务四和任务五的情况下,为什么都要求使用DataFrame的append方法,如何只要求使用merge或者join可不可以完成任务四和任务五呢?
任务六:完成的数据保存为result.csv
result.to_csv('result.csv')
换一种角度看数据
任务一:将我们的数据变为Series类型的数据
这个stack函数是干什么的?
# 将完整的数据加载出来
text = pd.read_csv('result.csv')
text.head()
# 代码写在这里
unit_result=text.stack().head(20)
unit_result.head()
0 Unnamed: 0 0PassengerId 1Survived 0Pclass 3Name Braund, Mr. Owen Harris dtype: object
#将代码保存为unit_result,csv
unit_result.to_csv('unit_result.csv')
test = pd.read_csv('unit_result.csv')
test.head()
| 0 | Unnamed: 0 | 0.1 | |
|---|---|---|---|
| 0 | 0 | PassengerId | 1 |
| 1 | 0 | Survived | 0 |
| 2 | 0 | Pclass | 3 |
| 3 | 0 | Name | Braund, Mr. Owen Harris |
| 4 | 0 | Sex | male |
数据重构
# 导入基本库
import numpy as np
import pandas as pd
# 载入data文件中的:result.csv
text = pd.read_csv('result.csv')
text.head()
| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1.0 | 0.0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1.0 | 0.0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0.0 | 0.0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1.0 | 0.0 | 113803 | 53.1000 | C123 | S |
| 4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0.0 | 0.0 | 373450 | 8.0500 | NaN | S |
数据聚合与运算
数据运用
任务一:通过《Python for Data Analysis》P303、Google or Baidu来学习了解GroupBy机制
#写入心得
在了解GroupBy机制之后,运用这个机制完成一系列的操作,来达到我们的目的。
下面通过几个任务来熟悉GroupBy机制。
任务二:计算泰坦尼克号男性与女性的平均票价
df = text['Fare'].groupby(text['Sex'])
means = df.mean()
means
Sex female 44.479818 male 25.523893 Name: Fare, dtype: float64
任务三:统计泰坦尼克号中男女的存活人数
survived_sex = text['Survived'].groupby(text['Sex']).sum()
survived_sex.head()
Sex female 233 male 109 Name: Survived, dtype: int64
任务四:计算客舱不同等级的存活人数
survived_pclass = text['Survived'].groupby(text['Pclass'])
survived_pclass.sum()
Pclass 1 136 2 87 3 119 Name: Survived, dtype: int64
【提示:】表中的存活那一栏,可以发现如果还活着记为1,死亡记为0
【思考:】从数据分析的角度,上面的统计结果可以得出那些结论
#思考心得
【思考】从任务二到任务三中,这些运算可以通过agg()函数来同时计算。并且可以使用rename函数修改列名。你可以按照提示写出这个过程吗?
#例子:
text.groupby('Sex').agg({'Fare': 'mean', 'Pclass': 'count'}).rename(columns={'Fare': 'mean_fare', 'Pclass': 'count_pclass'})
| mean_fare | count_pclass | |
|---|---|---|
| Sex | ||
| female | 44.479818 | 314 |
| male | 25.523893 | 577 |
任务五:统计在不同等级的票中的不同年龄的船票花费的平均值
text.groupby(['Pclass','Age'])['Fare'].mean().head()
Pclass Age 1 0.92 151.55002.00 151.55004.00 81.858311.00 120.000014.00 120.0000 Name: Fare, dtype: float64
任务六:将任务二和任务三的数据合并,并保存到sex_fare_survived.csv
result = pd.merge(means,survived_sex,on='Sex')
result
| Fare | Survived | |
|---|---|---|
| Sex | ||
| female | 44.479818 | 233 |
| male | 25.523893 | 109 |
result.to_csv('sex_fare_survived.csv')
任务七:得出不同年龄的总的存活人数,然后找出存活人数最多的年龄段,最后计算存活人数最高的存活率(存活人数/总人数)
#不同年龄的存活人数
survived_age = text['Survived'].groupby(text['Age']).sum()
survived_age.head()
Age 0.42 1 0.67 1 0.75 2 0.83 2 0.92 1 Name: Survived, dtype: int64
#找出最大值的年龄段
survived_age[survived_age.values==survived_age.max()]
Age 24.0 15 Name: Survived, dtype: int64
_sum = text['Survived'].sum()
print(_sum)
342
#首先计算总人数
_sum = text['Survived'].sum()print("sum of person:"+str(_sum))precetn =survived_age.max()/_sumprint("最大存活率:"+str(precetn))
sum of person:342 最大存活率:0.043859649122807015
本文来自互联网用户投稿,文章观点仅代表作者本人,不代表本站立场,不承担相关法律责任。如若转载,请注明出处。 如若内容造成侵权/违法违规/事实不符,请点击【内容举报】进行投诉反馈!
