Deep Learning Practice from Scratch (Multilayer Neural Network): Pima Indians Diabetes Analysis

2023-10-26 01:46:45

1. Dataset source

You need to register a Kaggle account to download it: Pima Indians Diabetes Database | Kaggle

Alternatively, download it from my Baidu Netdisk:

Link: https://pan.baidu.com/s/11HAgMGGHXIUZPZJTPUAKkA
Extraction code: wjjd

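2. Data analysis

The CSV file shows that the last column (Outcome) records whether the patient has diabetes, and the preceding columns are the factors that influence it. Before building the deep learning model, the data must be preprocessed; this step is essential.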
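A minimal sketch of the kind of preprocessing meant here, run on a made-up toy table rather than the real dataset: it counts the zeros in a column and replaces each zero with the mean of the non-zero values in the same Outcome class. The values in `toy` and the names `zero_counts`, `class_means` and `is_zero` are assumptions for illustration only.

```python
import pandas as pd

# Toy data invented for illustration; not taken from the real dataset.
toy = pd.DataFrame({
    "Insulin": [0.0, 100.0, 200.0, 0.0, 300.0, 500.0],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

# Count the zeros per column before deciding how to treat them.
zero_counts = (toy[["Insulin"]] == 0).sum()

# Mean of the non-zero values within each Outcome class.
class_means = toy[toy["Insulin"] != 0].groupby("Outcome")["Insulin"].mean()

# Replace each zero with the mean of its own class.
is_zero = toy["Insulin"] == 0
toy.loc[is_zero, "Insulin"] = toy.loc[is_zero, "Outcome"].map(class_means)

print(zero_counts["Insulin"])
print(toy["Insulin"].tolist())
```

Note that filling by class uses the label itself, so this only works for data whose Outcome is already known.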

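3. Code

This article uses the PyTorch framework. I have commented the parts of the code that I consider difficult, for readers' reference. I am a beginner myself, so feel free to get in touch so that we can exchange ideas and learn together.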

import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch.nn.functional as F

doc = pd.read_csv('G:/diabetes.csv')        # Load the dataset; change this to the location of your own CSV file (the path must not contain Chinese characters)
# print(doc.head())                         # Show the first 5 rows; print() is required here, otherwise nothing is displayed and no error is raised
print(doc.shape[0], doc.shape[1])           # Number of rows and columns: 768 x 9
# Check for missing values (two equivalent ways); there are none
# print(doc.isnull().sum())
# print(doc.isna().sum())

# Count the zeros in each column (Pregnancies does not need checking):
print("Glucose=0: ", doc[doc.Glucose == 0].shape[0])  # Glucose is 0 in 5 rows, which is not realistic
print("BloodPressure=0:", doc[doc.BloodPressure == 0].shape[0])     # BloodPressure is 0 in 35 rows
print("SkinThickness=0:", doc[doc.SkinThickness == 0].shape[0])     # SkinThickness is 0 in 227 rows
print("Insulin=0:", doc[doc.Insulin == 0].shape[0])                 # Insulin is 0 in 374 rows
print("BMI=0:", doc[doc.BMI == 0].shape[0])                         # BMI is 0 in 11 rows
print("DiabetesPedigreeFunction=0:", doc[doc.DiabetesPedigreeFunction == 0].shape[0])   # No anomalies in DiabetesPedigreeFunction
print("Age=0:", doc[doc.Age == 0].shape[0])                         # No anomalies in Age

# Handle invalid values
# Drop the rows where Glucose or BMI is invalid (zero)
doc_next = doc[(doc.Glucose != 0) & (doc.BMI != 0)].copy()  # .copy() avoids pandas' SettingWithCopyWarning when values are filled in below
print(doc_next.shape)       # The data is now 752 x 9
# Cast the columns to be filled to float so that a (non-integer) mean can be assigned to them
doc_next = doc_next.astype({'BloodPressure': float, 'SkinThickness': float, 'Insulin': float})


# For columns with many missing values, fill the zeros with the per-class mean
def mean_column(feature):
    temp = doc_next[doc_next[feature] != 0]
    temp = temp[[feature, 'Outcome']].groupby(['Outcome'])[[feature]].mean().reset_index()  # Reset the index
    return temp


print(mean_column('BloodPressure'))
# Fill in blood pressure
doc_next.loc[(doc_next['Outcome'] == 0) & (doc_next['BloodPressure'] == 0), 'BloodPressure'] = \
    mean_column('BloodPressure')['BloodPressure'][0]
doc_next.loc[(doc_next['Outcome'] == 1) & (doc_next['BloodPressure'] == 0), 'BloodPressure'] = \
    mean_column('BloodPressure')['BloodPressure'][1]
# Fill in skin thickness
doc_next.loc[(doc_next['Outcome'] == 0) & (doc_next['SkinThickness'] == 0), 'SkinThickness'] = \
    mean_column('SkinThickness')['SkinThickness'][0]
doc_next.loc[(doc_next['Outcome'] == 1) & (doc_next['SkinThickness'] == 0), 'SkinThickness'] = \
    mean_column('SkinThickness')['SkinThickness'][1]
# Fill in insulin
doc_next.loc[(doc_next['Outcome'] == 0) & (doc_next['Insulin'] == 0), 'Insulin'] = \
    mean_column('Insulin')['Insulin'][0]
doc_next.loc[(doc_next['Outcome'] == 1) & (doc_next['Insulin'] == 0), 'Insulin'] = \
    mean_column('Insulin')['Insulin'][1]

# Check whether any zeros remain (every count should now be 0)
"""
print("Glucose=0: ", doc_next[doc_next.Glucose == 0].shape[0])
print("BloodPressure=0:", doc_next[doc_next.BloodPressure == 0].shape[0])
print("SkinThickness=0:", doc_next[doc_next.SkinThickness == 0].shape[0])
print("Insulin=0:", doc_next[doc_next.Insulin == 0].shape[0])
print("BMI=0:", doc_next[doc_next.BMI == 0].shape[0])
print("DiabetesPedigreeFunction=0:", doc_next[doc_next.DiabetesPedigreeFunction == 0].shape[0])
print("Age=0:", doc_next[doc_next.Age == 0].shape[0])
"""
# Show all columns when printing
# pd.set_option('display.max_columns', 10)
# print(doc_next.head(10))

# Feature selection
inputs, outputs = doc_next.iloc[:, 0:8], doc_next.iloc[:, 8]
# Normalize the inputs (z-score: subtract the mean, divide by the standard deviation)
inputs_mean = inputs.mean()
inputs_std = inputs.std()
inputs = (inputs - inputs_mean) / inputs_std
# print(inputs.shape, outputs.shape)

# training set and testing set
in_train = inputs.iloc[0:602, :]
in_train = in_train.values                           # Extract the raw values before converting to a tensor
in_train = torch.tensor(in_train).to(torch.float32)  # Convert the DataFrame values to a Tensor, then to float32; otherwise they cannot be fed into the network

out_train = outputs.iloc[0:602]
out_train = out_train.values
out_train = torch.tensor(out_train).to(torch.float32).reshape(602, 1)

in_test = inputs.iloc[602:, :]
in_test = in_test.values
in_test = torch.tensor(in_test).to(torch.float32)

out_test = outputs.iloc[602:]
out_test = out_test.values
out_test = torch.tensor(out_test).to(torch.float32)
# print(in_test.shape, out_test.shape)

# Define the network
class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.linear1 = torch.nn.Linear(8, 72)
        self.linear2 = torch.nn.Linear(72, 64)
        self.linear3 = torch.nn.Linear(64, 1)

    def forward(self, x):
        x = F.relu(self.linear1(x))
        x = F.relu(self.linear2(x))
        x = F.relu(self.linear3(x))
        return x


model = Net()

# Define the loss function and optimizer
criterion = torch.nn.MSELoss()  # The deprecated size_average=None argument is dropped; the default (mean reduction) is unchanged
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss_store = []                 # Stores the loss at every step, for plotting later
epoch_store = []                # One entry is appended per epoch; using epoch or len(str(loss_store)) directly in plt.plot() gives a dimension mismatch


def train():
    acc = 0
    for epoch in range(1001):
        epoch_store.append(epoch)       # Record this epoch
        out_pred = model(in_train)      # Feed the training set into the network to get the prediction
        # print(out_pred)
        loss = criterion(out_pred, out_train)   # Compute the loss between the true targets and the predictions
        # print(epoch, loss.item())
        optimizer.zero_grad()       # Zero the gradients
        loss.backward()             # Backpropagation
        optimizer.step()            # Update the parameters
        loss_store.append(loss.item())  # Store the loss
        if epoch % 10 == 0:
            # torch.where(condition, x, y): returns x where the boolean condition is true, otherwise y
            out_pred_label = torch.where(out_pred >= 0.5, torch.tensor([1.0]), torch.tensor([0.0]))
            # print(out_train)
            acc = torch.eq(out_pred_label, out_train.reshape(-1, 1)).sum().item()
            print(epoch, loss.item(), 'accuracy: ', 100 * acc / len(in_train))
    plt.plot(epoch_store, loss_store)
    plt.xlabel('epoch')
    plt.ylabel('Loss')
    plt.grid()          # Show the grid
    plt.show()          # Required; without it no figure is displayed


def test():
    acc = 0
    # No gradients are needed for testing
    with torch.no_grad():
        out_test_pre = model(in_test)
        print(out_test_pre.shape)
        # predicted = out_test_pre.argmax(dim=0, keepdim=True)
        # print(predicted)
        out_pred_test_label = torch.where(out_test_pre >= 0.5, torch.tensor([1.0]), torch.tensor([0.0]))
        acc = torch.eq(out_pred_test_label, out_test.reshape(-1, 1)).sum().item()
        print('accuracy: ', 100 * acc / len(in_test))


if __name__ == '__main__':
    train()
    # test()
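The accuracy calculation used in train() and test() can be isolated as a small helper. The sketch below is only an illustration: the function name `accuracy_percent`, its `threshold` parameter, and the sample tensors are assumptions, not part of the original code. It also shows why the targets are reshaped to a column: comparing an (N, 1) tensor with an (N,) tensor would broadcast to (N, N) and inflate the count.

```python
import torch


def accuracy_percent(pred, target, threshold=0.5):
    """Percentage of thresholded predictions that match the 0/1 targets."""
    # Turn the raw network outputs into 0/1 labels.
    pred_label = (pred >= threshold).to(torch.float32)
    # Reshape the targets to (N, 1) so the comparison is element-wise.
    correct = torch.eq(pred_label, target.reshape(-1, 1)).sum().item()
    return 100 * correct / len(target)


# Made-up predictions of shape (4, 1) and targets of shape (4,).
pred = torch.tensor([[0.9], [0.2], [0.6], [0.4]])
target = torch.tensor([1.0, 0.0, 0.0, 0.0])
print(accuracy_percent(pred, target))
```

Three of the four thresholded predictions match their targets here, so the helper reports 75.0.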

4. Results

(1) Loss curve

(2) Loss and accuracy output
