A First Look at Probability and Statistics, Part 3

Sampling and Distributions

$\mu_{\hat{p}} = \hat{p}$
$\sigma_{\hat{p}} = \sqrt{\hat{p}(1-\hat{p})}$
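
As a quick worked check against the search data below (3 successes out of 16 observations, so $\hat{p} = 3/16 = 0.1875$):

$\mu_{\hat{p}} = 0.1875, \qquad \sigma_{\hat{p}} = \sqrt{0.1875 \times 0.8125} \approx 0.390$

which matches the NumPy output that follows.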

import numpy as np
import matplotlib.pyplot as plt

searches = np.array([0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0])
plt.xlabel('Search Results')
plt.ylabel('Frequency')
plt.hist(searches)
plt.show()
print('Mean: ' + str(np.mean(searches)))
print('StDev: ' + str(np.std(searches)))

[Figure output_88_0.png: histogram of the search results]

Mean: 0.1875
StDev: 0.3903123748998999

Repeating the sampling process on the data gives the following results:

Sample    Result
1         0.1875
2         0.2500
3         0.3125
4         0.1875
5         0.1250
6         0.3750
7         0.2500
8         0.1875
9         0.3125
10        0.2500
11        0.2500
12        0.3125

searches = np.array([0.1875,0.25,0.3125,0.1875,0.125,0.375,0.25,0.1875,0.3125,0.25,0.25,0.3125])
plt.xlabel('Search Results')
plt.ylabel('Frequency')
plt.hist(searches)
plt.show()

[Figure output_90_0.png: histogram of the sample proportions]

The Central Limit Theorem

When the sample size is large, the binomial distribution approaches a normal distribution.
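
A minimal sketch of this convergence (assuming numpy and matplotlib are imported as np and plt as elsewhere in this post; the sample sizes 5 and 100 are arbitrary illustrative choices):

p, s = 0.25, 10000
for n in (5, 100):                              # small vs. large sample size
    phat = np.random.binomial(n, p, s) / n      # simulated sample proportions
    plt.hist(phat, bins=30, alpha=0.5, label='n = ' + str(n))
plt.xlabel('p-hat')
plt.ylabel('Frequency')
plt.legend()
plt.show()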

Here is another example:

import pandas as pd

n, p, s = 100, 0.25, 10000
df = pd.DataFrame(np.random.binomial(n,p,s)/n, columns=['p-hat'])
means = df['p-hat']
means.plot.hist(title='Simulated Sampling Distribution')
plt.show()
print ('Mean: ' + str(means.mean()))
print ('Std: ' + str(means.std()))

[Figure output_92_0.png: simulated sampling distribution]

Mean: 0.24929500000000002
Std: 0.042734945917366686

Mean and Variance of the Sampling Distribution

$\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$

Increasing the sample size from 16 to 100:

$\sigma_{\hat{p}} = \sqrt{\frac{0.25 \times 0.75}{16}} \approx 0.11$

$\sigma_{\hat{p}} = \sqrt{\frac{0.25 \times 0.75}{100}} \approx 0.043$

With the mean and standard deviation we can apply the 68-95-99.7 rule.

For example, 95.4% of samples fall within the range 0.25 ± 0.086; this range is called a confidence interval.
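
These numbers can be reproduced directly from the formula (a minimal sketch; p = 0.25 and the sample sizes 16 and 100 come from the example above):

import math

p = 0.25
for n in (16, 100):
    se = math.sqrt(p * (1 - p) / n)     # standard error of p-hat
    print('n = ' + str(n) + ': SE = ' + str(round(se, 3)) +
          ', mean +/- 2 SE = ' + str(round(p - 2 * se, 3)) + ' to ' + str(round(p + 2 * se, 3)))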

n, p, s = 100, 0.25, 10000
df = pd.DataFrame(np.random.binomial(n,p,s)/n, columns=['p-hat'])
means = df['p-hat']
m = means.mean()
sd = means.std()
moe1 = m - (sd * 2)
moe2 = m + (sd * 2)
means.plot.hist(title='Simulated Sampling Distribution')
plt.axvline(m, color='red', linestyle='dashed', linewidth=2)
plt.axvline(moe1, color='magenta', linestyle='dashed', linewidth=2)
plt.axvline(moe2, color='magenta', linestyle='dashed', linewidth=2)
plt.show()

[Figure output_94_0.png: simulated sampling distribution with mean and ±2 SD lines]

Generating a Sampling Distribution from Sample Means

The following example shows the effect of combining multiple random samples:

Sample    Weights
1         [4.020992, 2.143457, 2.260409, 2.339641, 4.699211]
2         [3.38532, 4.438345, 3.170228, 3.499913, 4.489557]
3         [3.338228, 1.825221, 3.53633, 3.507952, 2.698669]
4         [2.992756, 3.292431, 3.38148, 3.479455, 3.051273]
5         [2.969977, 3.869029, 4.149342, 2.785682, 3.03557]
6         [3.138055, 2.535442, 3.530052, 3.029846, 2.881217]
7         [1.596558, 1.486385, 3.122378, 3.684084, 3.501813]
8         [2.997384, 3.818661, 3.118434, 3.455269, 3.026508]
9         [4.078268, 2.283018, 3.606384, 4.555053, 3.344701]
10        [2.532509, 3.064274, 3.32908, 2.981303, 3.915995]
11        [4.078268, 2.283018, 3.606384, 4.555053, 3.344701]
12        [2.532509, 3.064274, 3.32908, 2.981303, 3.915995]

Compute the mean of each sample:

Sample    Mean Weight
1         3.092742
2         3.7966726
3         2.98128
4         3.239479
5         3.36192
6         3.0229224
7         2.6782436
8         3.2832512
9         3.5734848
10        3.1646322
11        3.5734848
12        3.1646322

meanweights = np.array([3.092742,3.7966726,2.98128,3.239479,3.36192,3.0229224,2.6782436,3.2832512,3.5734848,3.1646322,3.5734848,3.1646322])
plt.xlabel('Mean Weights')
plt.ylabel('Frequency')
plt.hist(meanweights, bins=6)
plt.show()
print('Mean: ' + str(meanweights.mean()))
print('Std: ' + str(meanweights.std()))

[Figure output_96_0.png: histogram of the mean weights]

Mean: 3.2443954
Std: 0.2903283632058937
mu, sigma, n = 3.2, 1.2, 500
samples = list(range(0, 10000))   # this takes a few minutes to run
data = np.array([])
sampling = np.array([])
for s in samples:
    sample = np.random.normal(mu, sigma, n)
    data = np.append(data, sample)
    sampling = np.append(sampling, sample.mean())
df = pd.DataFrame(sampling, columns=['mean'])
means = df['mean']
means.plot.hist(title='Simulated Sampling Distribution', bins=100)
plt.show()
print('Sample Mean: ' + str(data.mean()))
print('Sample StdDev: ' + str(data.std()))
print ('Sampling Mean: ' + str(means.mean()))
print ('Sampling StdErr: ' + str(means.std()))

[Figure output_97_0.png: simulated sampling distribution of the means]

Sample Mean: 3.200047484836999
Sample StdDev: 1.1998703000813948
Sampling Mean: 3.200047484836996
Sampling StdErr: 0.052796653060840436
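
As a sanity check, the simulated standard error is close to the theoretical value $\sigma / \sqrt{n}$ with $\sigma = 1.2$ and $n = 500$:

$\frac{\sigma}{\sqrt{n}} = \frac{1.2}{\sqrt{500}} \approx 0.0537$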

Confidence Intervals

Confidence Level    Z-Score
90%                 1.645
95%                 1.96
99%                 2.576
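
These critical values are the two-tailed quantiles of the standard normal distribution; a minimal check with scipy.stats:

from scipy import stats

# two-tailed critical value: z such that P(-z < Z < z) equals the confidence level
for conf in (0.90, 0.95, 0.99):
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    print(str(conf) + ' -> ' + str(round(z, 3)))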

Using the standard error from the simulation above (about 0.053) at 95% confidence, the margin of error is:

$MoE = 0.053 \times 1.96 = 0.10388$

As the sample size approaches infinity, the standard error approaches zero:

$\lim_{n \to \infty} \frac{\sigma}{\sqrt{n}} = 0$
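
A quick numerical illustration (a minimal sketch; $\sigma = 1.2$ is taken from the example above, and the sample sizes are arbitrary):

import math

sigma = 1.2
for n in (100, 10000, 1000000):
    print('n = ' + str(n) + ': standard error = ' + str(sigma / math.sqrt(n)))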

from scipy import stats

mu, sigma, n = 3.2, 1.2, 500
samples = list(range(0, 10000))
data = np.array([])
sampling = np.array([])
for s in samples:
    sample = np.random.normal(mu, sigma, n)
    data = np.append(data, sample)
    sampling = np.append(sampling, sample.mean())
df = pd.DataFrame(sampling, columns=['mean'])
means = df['mean']
m = means.mean()
sd = means.std()
ci = stats.norm.interval(0.95, m, sd)
means.plot.hist(title='Simulated Sampling Distribution', bins=100)
plt.axvline(m, color='red', linestyle='dashed', linewidth=2)
plt.axvline(ci[0], color='magenta', linestyle='dashed', linewidth=2)
plt.axvline(ci[1], color='magenta', linestyle='dashed', linewidth=2)
plt.show()
print ('Sampling Mean: ' + str(m))
print ('Sampling StdErr: ' + str(sd))
print ('95% Confidence Interval: ' + str(ci))

[Figure output_99_0.png: simulated sampling distribution with mean and 95% CI lines]

Sampling Mean: 3.2004775949381727
Sampling StdErr: 0.05400806134243375
95% Confidence Interval: (3.0946237398321728, 3.3063314500441727)

Hypothesis Testing

One-Sample, One-Tailed Test

Students rate the instructor's teaching on a scale from -5 to 5.

Generate 50 random samples:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(123)
lo = np.random.randint(-5, -1, 6)
mid = np.random.randint(0, 3, 38)
hi = np.random.randint(4, 6, 6)
sample = np.append(lo,np.append(mid, hi))
print("Min:" + str(sample.min()))
print("Max:" + str(sample.max()))
print("Mean:" + str(sample.mean()))
plt.hist(sample)
plt.show()
Min:-5
Max:5
Mean:0.84

[Figure output_102_1.png: histogram of the rating sample]

The ratings look positive, but how do we know they represent the opinion of all students?

We state two hypotheses, a negative (null) one and a positive (alternative) one:

$H_{0}: \mu \le 0$
$H_{1}: \mu > 0$

Assume $H_{0}$ is true, and check whether the t-test statistic falls inside the confidence interval.

pop = np.random.normal(0, 1.15, 100000)
plt.hist(pop, bins=100)
plt.axvline(pop.mean(), color='yellow', linestyle='dashed', linewidth=2)
plt.show()

[Figure output_104_0.png: simulated null population with mean line]

$t = \frac{\bar{x} - \mu}{\frac{s}{\sqrt{n}}}$

Here $\bar{x}$ is the sample mean, $\mu$ the population mean, $s$ the sample standard deviation, and $n$ the sample size. When the sample size is greater than 30, the t-score is close to the z-score.
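
A minimal sketch of computing this statistic by hand and comparing it with scipy's one-sample t-test (assuming the sample array from the cell above and a hypothesized mean of 0):

import math
from scipy import stats

t_manual = (sample.mean() - 0) / (sample.std(ddof=1) / math.sqrt(len(sample)))
t_scipy, p = stats.ttest_1samp(sample, 0)
print('manual t: ' + str(t_manual))
print('scipy t:  ' + str(t_scipy))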

# T-Test
t,p = stats.ttest_1samp(sample, 0)
p1 = '%f' % (p/2)   # one-tailed p-value (ttest_1samp returns a two-tailed p)
print ("t-statistic:" + str(t))
print("p-value:" + str(p1))
ci = stats.norm.interval(0.90, 0, 1.15)   # 90% confidence interval of the N(0, 1.15) distribution
plt.hist(pop, bins=100)
plt.axvline(pop.mean(), color='yellow', linestyle='dashed', linewidth=2)
plt.axvline(ci[1], color='red', linestyle='dashed', linewidth=2)
plt.axvline(pop.mean() + t*pop.std(), color='magenta', linestyle='dashed', linewidth=2)
plt.show()
t-statistic:2.773584905660377
p-value:0.003911

[Figure output_106_1.png: null distribution with mean, critical value, and test statistic lines]

The yellow line is the mean, the red line is the critical value, and the magenta line marks the observed test statistic (whose tail area is the p-value).

The magenta line falls outside the confidence interval, so the hypothesis $H_{0}$ is rejected.

Two-Tailed Test

For the hypotheses:

$H_{0}: \mu = 0$
$H_{1}: \mu \neq 0$

t,p = stats.ttest_1samp(sample, 0)
print ("t-statistic:" + str(t))
print("p-value:" + '%f' % p)
ci = stats.norm.interval(0.95, 0, 1.15)
plt.hist(pop, bins=100)
plt.axvline(pop.mean(), color='yellow', linestyle='dashed', linewidth=2)
plt.axvline(ci[0], color='red', linestyle='dashed', linewidth=2)
plt.axvline(ci[1], color='red', linestyle='dashed', linewidth=2)
plt.axvline(pop.mean() - t*pop.std(), color='magenta', linestyle='dashed', linewidth=2)
plt.axvline(pop.mean() + t*pop.std(), color='magenta', linestyle='dashed', linewidth=2)
plt.show()
t-statistic:2.773584905660377
p-value:0.007822

[Figure output_109_1.png: two-tailed test, null distribution with critical value and test statistic lines]

Again, the hypothesis $H_{0}$ is rejected.

Two-Sample Test

Suppose there are two groups of students, one that has taken the course and one that has not, with the following hypotheses:

$H_{0}: \mu_{1} \le \mu_{2}$
$H_{1}: \mu_{1} > \mu_{2}$
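
For reference, with its default equal-variance setting, stats.ttest_ind (used below) computes the pooled two-sample t-statistic:

$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$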

np.random.seed(123)
nonMath = np.random.normal(66.0, 1.5, 100)
math = np.random.normal(66.55, 1.5, 100)
print("non-math sample mean:" + str(nonMath.mean()))
print("math sample mean:" + str(math.mean()))
t,p = stats.ttest_ind(math, nonMath)
p1 = '%f' % (p/2)
print("t-statistic:" + str(t))
print("p-value:" + str(p1))
pop = np.random.normal(nonMath.mean(), nonMath.std(), 100000)
ci = stats.norm.interval(0.90, nonMath.mean(), nonMath.std())
plt.hist(pop, bins=100)
plt.axvline(pop.mean(), color='yellow', linestyle='dashed', linewidth=2)
plt.axvline(ci[1], color='red', linestyle='dashed', linewidth=2)
plt.axvline(pop.mean() + t*pop.std(), color='magenta', linestyle='dashed', linewidth=2)
plt.show()
non-math sample mean:66.04066361023553
math sample mean:66.52069665713476
t-statistic:2.140008413392296
p-value:0.016789

[Figure output_112_1.png: two-sample test, null distribution with critical value and test statistic lines]

The test statistic lies beyond the critical value (one-tailed p ≈ 0.017 < 0.05), so $H_{0}$ is rejected: students who took the course scored higher than those who did not.

Paired Test

The two groups in the previous example were independent; what if they depend on each other?

For example, instead of two different groups of students, consider the same students' midterm and final exams.
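
Under the hood, stats.ttest_rel (used below) performs a one-sample t-test on the per-student differences $d_i$ between the two exams:

$t = \frac{\bar{d}}{s_d / \sqrt{n}}$

where $\bar{d}$ is the mean difference and $s_d$ its standard deviation.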

np.random.seed(123)
midTerm = np.random.normal(59.45, 1.5, 100)
endTerm = np.random.normal(60.05, 1.5, 100)
t,p = stats.ttest_rel(endTerm, midTerm)
p1 = '%f' % (p/2)
print("t-statistic:" + str(t))
print("p-value:" + str(p1))
pop = np.random.normal(midTerm.mean(), midTerm.std(), 100000)
ci = stats.norm.interval(0.90, midTerm.mean(), midTerm.std())
plt.hist(pop, bins=100)
plt.axvline(pop.mean(), color='yellow', linestyle='dashed', linewidth=2)
plt.axvline(ci[1], color='red', linestyle='dashed', linewidth=2)
plt.axvline(pop.mean() + t*pop.std(), color='magenta', linestyle='dashed', linewidth=2)
plt.show()
t-statistic:2.3406857739212583
p-value:0.010627

[Figure output_115_1.png: paired test, null distribution with critical value and test statistic lines]

The final exam scores are significantly better than the midterm scores.

