Python Data Analysis Case Study 3: U.S. Baby Names, 1880-2010

2023-08-28 15:30:59

**I: Introduction**

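This case study comes from Python for Data Analysis, 2nd Edition, and analyzes the names given to babies in the United States from 1880 to 2010.

**II: Analysis workflow**

1: Read the data. Each year is stored in its own file, so all the files are combined into a single table for the later analysis.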

import pandas as pd

years = range(1880, 2011)
piece = []
# 'birth' matches the column name used by the rest of the code below
columns = ['name', 'sex', 'birth']

Read each year's file into a DataFrame:

for year in years:
    path = 'C:/Users/17322/Desktop/datasets/babynames/yob%d.txt' % year
    frame = pd.read_csv(path, names=columns)
    frame['year'] = year  # the files carry no year column, so add one
    piece.append(frame)

Combine them into one table:

names = pd.concat(piece, ignore_index=True)
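The read-and-concatenate step can be tried without the real files. The sketch below is only an illustration: `fake_files` and its values are made-up stand-ins for two yearly files, assuming the same headerless, comma-separated layout (name, sex, count).

```python
import io
import pandas as pd

# Hypothetical stand-ins for two yearly files; the counts are toy values,
# not real data. The layout (no header, comma-separated) is an assumption.
fake_files = {
    1880: "Mary,F,10\nAnna,F,5\nJohn,M,8\n",
    1881: "Mary,F,12\nAnna,F,4\nJohn,M,9\n",
}

columns = ['name', 'sex', 'birth']
pieces = []
for year, text in fake_files.items():
    # io.StringIO lets read_csv parse a string as if it were a file
    frame = pd.read_csv(io.StringIO(text), names=columns)
    frame['year'] = year
    pieces.append(frame)

# ignore_index=True discards each file's own 0..n-1 index
# and numbers the combined rows consecutively
demo_names = pd.concat(pieces, ignore_index=True)
print(demo_names)
```

Without `ignore_index=True`, every yearly frame would keep its own row numbers, so the combined index would contain duplicates.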

2: Aggregate with a pivot table and plot

total_births = names.pivot_table('birth',index = 'year', columns = 'sex',aggfunc='sum')
total_births.plot(title = 'Total births per year')
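To see what the pivot table does, here is a small sketch on made-up data (the `toy` frame and its numbers are illustrative only). It also shows that the same totals can be obtained with `groupby` followed by `unstack`.

```python
import pandas as pd

# Toy data, not real counts
toy = pd.DataFrame({
    'name':  ['Mary', 'Anna', 'John', 'Mary', 'John'],
    'sex':   ['F', 'F', 'M', 'F', 'M'],
    'birth': [10, 5, 8, 12, 9],
    'year':  [1880, 1880, 1880, 1881, 1881],
})

# One row per year, one column per sex, each cell the sum of births
toy_totals = toy.pivot_table('birth', index='year', columns='sex', aggfunc='sum')

# Equivalent formulation: group, sum, then move 'sex' into the columns
same_totals = toy.groupby(['year', 'sex'])['birth'].sum().unstack('sex')

print(toy_totals)
```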

3: Compute the share of babies given each name relative to total births

Add a prop (proportion) column:

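def add_prop(group):
    # share of each name within its (year, sex) group
    group['prop'] = group.birth / group.birth.sum()
    return group

# group_keys=False keeps the original row index, so on pandas 2.x
# 'year' and 'sex' remain ordinary columns for the later groupby calls
names = names.groupby(['year', 'sex'], group_keys=False).apply(add_prop)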
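As an alternative sketch (not the approach used in this article), the same proportion can be computed with `transform('sum')`, which avoids a custom function and returns a result aligned with the original rows. The `toy` frame is made-up data. The final check confirms that the proportions sum to 1 within every group.

```python
import pandas as pd

# Toy data, not real counts
toy = pd.DataFrame({
    'name':  ['Mary', 'Anna', 'John', 'Mary', 'John'],
    'sex':   ['F', 'F', 'M', 'F', 'M'],
    'birth': [10, 5, 8, 12, 9],
    'year':  [1880, 1880, 1880, 1881, 1881],
})

# transform('sum') broadcasts each group's total back onto its rows
group_totals = toy.groupby(['year', 'sex'])['birth'].transform('sum')
toy['prop'] = toy['birth'] / group_totals

# Sanity check: within each (year, sex) group the shares add up to 1
check = toy.groupby(['year', 'sex'])['prop'].sum()
print(check)
```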

4: Take a subset: the top n rows of each group

def get_n(group, n):
    # sort by births, largest first, and keep the first n rows
    return group.sort_values(by='birth', ascending=False)[:n]

Group the data by ['year', 'sex'] and take the top 500 and top 1000 names of each group:

groups = names.groupby(['year','sex'])
top_500 = groups.apply(get_n,500)
top_500.reset_index(inplace=True,drop=True)
top1000 = groups.apply(get_n,1000)
top1000.reset_index(inplace=True,drop=True)
top1000

Another way to do it:

piece=[]
for key, group in names.groupby(['year', 'sex']):
    # key is the (year, sex) tuple; it is not needed here
    piece.append(group.sort_values(by='birth', ascending=False)[:1000])
top_1000=pd.concat(piece,ignore_index=True)
top_1000
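A third option, shown here only as a sketch on made-up data, is to sort once and let `groupby(...).head(n)` keep the first n rows of every group. The `toy` frame and the choice n = 2 are illustrative.

```python
import pandas as pd

# Toy data, not real counts: three names per sex in a single year
toy = pd.DataFrame({
    'name':  ['Mary', 'Anna', 'Emma', 'John', 'Tom', 'Will'],
    'sex':   ['F', 'F', 'F', 'M', 'M', 'M'],
    'birth': [10, 5, 3, 8, 2, 6],
    'year':  [1880] * 6,
})

top2 = (
    toy.sort_values('birth', ascending=False)   # largest counts first
       .groupby(['year', 'sex'])
       .head(2)                                 # first 2 rows of each group
       .sort_values(['year', 'sex', 'birth'],
                    ascending=[True, True, False])
       .reset_index(drop=True)
)
print(top2)
```

Because `head` keeps rows in their current order, the sort has to happen before the grouping.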

5: Analyze naming trends with the top1000 subset

boys = top1000[top1000.sex == 'M']
girls = top1000[top1000.sex == 'F']

Use pivot_table to count the babies given each name in each year:

total_births = top1000.pivot_table('birth',index = 'year',columns = 'name', aggfunc='sum')
subset = total_births[['Tom','Emma', 'Mary', 'Anna']]
subset

Plot:

subset.plot(subplots=True,figsize=(14,12),grid=False,title='number of births per year of chosen names')

6: Measure the diversity of names
(1) The approach again uses pivot_table():

import numpy as np

table = top1000.pivot_table('prop',index='year',columns='sex',aggfunc=sum)
table.plot(title = 'the proportion of the sum of names top 1000',yticks=np.linspace(0.5,1.0,10),xticks=range(1880,2020,10),grid=True)

The plot shows the share of all babies accounted for by the top 1000 names.

table = top_500.pivot_table('prop',index='year',columns='sex',aggfunc=sum)
table.plot(title = 'the proportion of the sum of names top 500',yticks=np.linspace(0.5,1,10),xticks=range(1880,2020,10),grid=True)

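The plot shows the share of all babies accounted for by the top 500 names.
(2) A second way to measure diversity: count how many of the most popular names it takes to reach 50% of all births, using cumsum().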

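def get_midian_counts(group):
    group = group.sort_values(by='prop', ascending=False)
    # position where the running total first reaches 0.5, plus 1 to turn
    # the zero-based position into a count of names
    return group.prop.cumsum().values.searchsorted(0.5) + 1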

Apply it to every group to get a Series of these counts indexed by ['year', 'sex']:

all_midian = top1000.groupby(['year','sex']).apply(get_midian_counts)

Convert it to a DataFrame:

Diversity_names = all_midian.unstack('sex')

Plot:

Diversity_names.plot(title = 'diversity of names in top 50%')
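The cumulative-sum logic inside `get_midian_counts` can be checked by hand on a small array. The proportions below are made-up values, already sorted from largest to smallest.

```python
import numpy as np

# Toy proportions in descending order (illustrative, not real data)
props = np.array([0.30, 0.15, 0.10, 0.05, 0.05])

# Running total: roughly 0.30, 0.45, 0.55, 0.60, 0.65
cum = props.cumsum()

# searchsorted returns the first position whose running total is >= 0.5;
# here that is position 2 (the third name)
pos = cum.searchsorted(0.5)

# Positions are zero-based, so add 1 to get the number of names needed
n_names = pos + 1
print(n_names)  # 3
```

A larger count means that more distinct names are needed to cover half of all births, which is read as greater diversity.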

7: How the share of each final letter in names has changed

get_last_letter= lambda x:x[-1]
last_letters = names.name.map(get_last_letter)
last_letters.name = 'last_letter'
last_letters

Aggregate births by last letter, then select three years from the table:

table = names.pivot_table('birth',index=last_letters,columns=['sex','year'],aggfunc=sum)
subtable = table.reindex(columns=[1880,1930,2010],level='year')

Compute the share of births for each letter:

letter_prop = subtable/subtable.sum()
letter_prop
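The last-letter table and its normalization can be reproduced on a tiny example. This is a sketch on made-up data; `str[-1]` is used here in place of the lambda above and extracts the same final character.

```python
import pandas as pd

# Toy data, not real counts
toy = pd.DataFrame({
    'name':  ['Mary', 'Anna', 'Emma', 'John', 'Adam'],
    'sex':   ['F', 'F', 'F', 'M', 'M'],
    'birth': [10, 5, 5, 8, 2],
    'year':  [1880] * 5,
})

# Final letter of each name, as a Series aligned with toy's rows
last = toy['name'].str[-1].rename('last_letter')

# Rows: last letter; columns: (sex, year); cells: total births
tab = toy.pivot_table('birth', index=last, columns=['sex', 'year'], aggfunc='sum')

# Dividing by the column totals turns counts into per-column shares;
# letters that never occur for a given sex stay NaN
prop = tab / tab.sum()
print(prop)
```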

Visualization:

import matplotlib.pyplot as plt
fig,axes = plt.subplots(2,1,figsize=(12,12))
letter_prop['M'].plot(kind='bar',rot=0,ax=axes[0],title='Male')
letter_prop['F'].plot(kind='bar',rot=0,ax=axes[1],title='Female',legend=False)

Specific letters can also be selected to follow their trend over time:

letter_prop = table/table.sum()
T = letter_prop.loc[['d','m','h'],'F'].T
T.head()

8: How names containing 'lesl' are split between boys and girls

all_names = pd.Series(top1000.name.unique())
girl_likes = all_names[all_names.str.lower().str.contains('lesl')]

Filter the rows whose name contains 'lesl':

filtered = top1000[top1000.name.isin(girl_likes)]

Aggregate by year and sex:

table = filtered.pivot_table('birth',index='year',columns='sex',aggfunc=sum)

Convert the values to relative proportions within each year:

table = table.div(table.sum(axis=1), axis=0)
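The row-wise normalization is easy to verify on a two-row example. The sketch below uses made-up counts; `axis=1` sums across the columns of each row, and `axis=0` in `div` aligns those row totals with the row index.

```python
import pandas as pd

# Toy counts per year and sex (illustrative only)
tab = pd.DataFrame({'F': [1, 3], 'M': [3, 1]}, index=[1900, 2000])

row_totals = tab.sum(axis=1)          # 4 for each year
shares = tab.div(row_totals, axis=0)  # divide every row by its own total

print(shares)
```

A plain `tab / row_totals` would try to align the row totals with the column labels instead, which is why `div` with `axis=0` is needed.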

Visualization:

table.plot(style = {'M':'k-','F':'k--'})
