A First Look at Probability and Statistics, Part 3

Sampling and Distributions

$\mu_{\hat{p}} = \hat{p}$
$\sigma_{\hat{p}} = \sqrt{\hat{p}(1-\hat{p})}$
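
As a quick worked check against the search data below (3 successes out of 16 observations, so $\hat{p} = 3/16 = 0.1875$):

$\mu_{\hat{p}} = 0.1875, \qquad \sigma_{\hat{p}} = \sqrt{0.1875 \times 0.8125} \approx 0.390$

which matches the NumPy output that follows.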

import numpy as np
import matplotlib.pyplot as plt

searches = np.array([0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0])
plt.xlabel('Search Results')
plt.ylabel('Frequency')
plt.hist(searches)
plt.show()
print('Mean: ' + str(np.mean(searches)))
print('StDev: ' + str(np.std(searches)))

[Figure output_88_0.png: histogram of the search results]

Mean: 0.1875
StDev: 0.3903123748998999

Repeating the sampling process on the data gives the following results:

Sample    Result
1         0.1875
2         0.2500
3         0.3125
4         0.1875
5         0.1250
6         0.3750
7         0.2500
8         0.1875
9         0.3125
10        0.2500
11        0.2500
12        0.3125

searches = np.array([0.1875,0.25,0.3125,0.1875,0.125,0.375,0.25,0.1875,0.3125,0.25,0.25,0.3125])
plt.xlabel('Search Results')
plt.ylabel('Frequency')
plt.hist(searches)
plt.show()

[Figure output_90_0.png: histogram of the sample proportions]

The Central Limit Theorem

When the sample size is large, the binomial distribution approaches a normal distribution.
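
A minimal sketch of this convergence (assuming numpy and matplotlib are imported as np and plt as elsewhere in this post; the sample sizes 5 and 100 are arbitrary illustrative choices):

p, s = 0.25, 10000
for n in (5, 100):                              # small vs. large sample size
    phat = np.random.binomial(n, p, s) / n      # simulated sample proportions
    plt.hist(phat, bins=30, alpha=0.5, label='n = ' + str(n))
plt.xlabel('p-hat')
plt.ylabel('Frequency')
plt.legend()
plt.show()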

Here is another example:

import pandas as pd

n, p, s = 100, 0.25, 10000
df = pd.DataFrame(np.random.binomial(n,p,s)/n, columns=['p-hat'])
means = df['p-hat']
means.plot.hist(title='Simulated Sampling Distribution')
plt.show()
print ('Mean: ' + str(means.mean()))
print ('Std: ' + str(means.std()))

[Figure output_92_0.png: simulated sampling distribution]

Mean: 0.24929500000000002
Std: 0.042734945917366686

Mean and Variance of the Sampling Distribution

$\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$

Increasing the sample size from 16 to 100:

$\sigma_{\hat{p}} = \sqrt{\frac{0.25 \times 0.75}{16}} \approx 0.11$

$\sigma_{\hat{p}} = \sqrt{\frac{0.25 \times 0.75}{100}} \approx 0.043$

With the mean and standard deviation we can apply the 68-95-99.7 rule.

For example, 95.4% of samples fall within the range 0.25 ± 0.086; this range is called a confidence interval.
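
These numbers can be reproduced directly from the formula (a minimal sketch; p = 0.25 and the sample sizes 16 and 100 come from the example above):

import math

p = 0.25
for n in (16, 100):
    se = math.sqrt(p * (1 - p) / n)     # standard error of p-hat
    print('n = ' + str(n) + ': SE = ' + str(round(se, 3)) +
          ', mean +/- 2 SE = ' + str(round(p - 2 * se, 3)) + ' to ' + str(round(p + 2 * se, 3)))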

n, p, s = 100, 0.25, 10000
df = pd.DataFrame(np.random.binomial(n,p,s)/n, columns=['p-hat'])
means = df['p-hat']
m = means.mean()
sd = means.std()
moe1 = m - (sd * 2)
moe2 = m + (sd * 2)
means.plot.hist(title='Simulated Sampling Distribution')
plt.axvline(m, color='red', linestyle='dashed', linewidth=2)
plt.axvline(moe1, color='magenta', linestyle='dashed', linewidth=2)
plt.axvline(moe2, color='magenta', linestyle='dashed', linewidth=2)
plt.show()

[Figure output_94_0.png: simulated sampling distribution with mean and ±2 SD lines]

Generating a Sampling Distribution from Sample Means

The following example shows the effect of combining multiple random samples:

Sample    Weights
1         [4.020992, 2.143457, 2.260409, 2.339641, 4.699211]
2         [3.38532, 4.438345, 3.170228, 3.499913, 4.489557]
3         [3.338228, 1.825221, 3.53633, 3.507952, 2.698669]
4         [2.992756, 3.292431, 3.38148, 3.479455, 3.051273]
5         [2.969977, 3.869029, 4.149342, 2.785682, 3.03557]
6         [3.138055, 2.535442, 3.530052, 3.029846, 2.881217]
7         [1.596558, 1.486385, 3.122378, 3.684084, 3.501813]
8         [2.997384, 3.818661, 3.118434, 3.455269, 3.026508]
9         [4.078268, 2.283018, 3.606384, 4.555053, 3.344701]
10        [2.532509, 3.064274, 3.32908, 2.981303, 3.915995]
11        [4.078268, 2.283018, 3.606384, 4.555053, 3.344701]
12        [2.532509, 3.064274, 3.32908, 2.981303, 3.915995]

Compute the mean of each sample:

Sample    Mean Weight
1         3.092742
2         3.7966726
3         2.98128
4         3.239479
5         3.36192
6         3.0229224
7         2.6782436
8         3.2832512
9         3.5734848
10        3.1646322
11        3.5734848
12        3.1646322

meanweights = np.array([3.092742,3.7966726,2.98128,3.239479,3.36192,3.0229224,2.6782436,3.2832512,3.5734848,3.1646322,3.5734848,3.1646322])
plt.xlabel('Mean Weights')
plt.ylabel('Frequency')
plt.hist(meanweights, bins=6)
plt.show()
print('Mean: ' + str(meanweights.mean()))
print('Std: ' + str(meanweights.std()))

[Figure output_96_0.png: histogram of the mean weights]

Mean: 3.2443954
Std: 0.2903283632058937
mu, sigma, n = 3.2, 1.2, 500
samples = list(range(0, 10000))   # this takes a few minutes to run
data = np.array([])
sampling = np.array([])
for s in samples:
    sample = np.random.normal(mu, sigma, n)
    data = np.append(data, sample)
    sampling = np.append(sampling, sample.mean())
df = pd.DataFrame(sampling, columns=['mean'])
means = df['mean']
means.plot.hist(title='Simulated Sampling Distribution', bins=100)
plt.show()
print('Sample Mean: ' + str(data.mean()))
print('Sample StdDev: ' + str(data.std()))
print ('Sampling Mean: ' + str(means.mean()))
print ('Sampling StdErr: ' + str(means.std()))

[Figure output_97_0.png: simulated sampling distribution of the means]

Sample Mean: 3.200047484836999
Sample StdDev: 1.1998703000813948
Sampling Mean: 3.200047484836996
Sampling StdErr: 0.052796653060840436
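
As a sanity check, the simulated standard error is close to the theoretical value $\sigma / \sqrt{n}$ with $\sigma = 1.2$ and $n = 500$:

$\frac{\sigma}{\sqrt{n}} = \frac{1.2}{\sqrt{500}} \approx 0.0537$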

Confidence Intervals

Confidence Level    Z-Score
90%                 1.645
95%                 1.96
99%                 2.576
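
These critical values are the two-tailed quantiles of the standard normal distribution; a minimal check with scipy.stats:

from scipy import stats

# two-tailed critical value: z such that P(-z < Z < z) equals the confidence level
for conf in (0.90, 0.95, 0.99):
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    print(str(conf) + ' -> ' + str(round(z, 3)))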

Using the standard error from the simulation above (about 0.053) at 95% confidence, the margin of error is:

$MoE = 0.053 \times 1.96 = 0.10388$

As the sample size approaches infinity, the standard error approaches zero:

$\lim_{n \to \infty} \frac{\sigma}{\sqrt{n}} = 0$
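
A quick numerical illustration (a minimal sketch; $\sigma = 1.2$ is taken from the example above, and the sample sizes are arbitrary):

import math

sigma = 1.2
for n in (100, 10000, 1000000):
    print('n = ' + str(n) + ': standard error = ' + str(sigma / math.sqrt(n)))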

from scipy import stats

mu, sigma, n = 3.2, 1.2, 500
samples = list(range(0, 10000))
data = np.array([])
sampling = np.array([])
for s in samples:
    sample = np.random.normal(mu, sigma, n)
    data = np.append(data, sample)
    sampling = np.append(sampling, sample.mean())
df = pd.DataFrame(sampling, columns=['mean'])
means = df['mean']
m = means.mean()
sd = means.std()
ci = stats.norm.interval(0.95, m, sd)
means.plot.hist(title='Simulated Sampling Distribution', bins=100)
plt.axvline(m, color='red', linestyle='dashed', linewidth=2)
plt.axvline(ci[0], color='magenta', linestyle='dashed', linewidth=2)
plt.axvline(ci[1], color='magenta', linestyle='dashed', linewidth=2)
plt.show()
print ('Sampling Mean: ' + str(m))
print ('Sampling StdErr: ' + str(sd))
print ('95% Confidence Interval: ' + str(ci))

[Figure output_99_0.png: simulated sampling distribution with mean and 95% CI lines]

Sampling Mean: 3.2004775949381727
Sampling StdErr: 0.05400806134243375
95% Confidence Interval: (3.0946237398321728, 3.3063314500441727)

Hypothesis Testing

One-Sample, One-Tailed Test

Students rate the instructor's teaching on a scale from -5 to 5.

Generate 50 random samples:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

np.random.seed(123)
lo = np.random.randint(-5, -1, 6)
mid = np.random.randint(0, 3, 38)
hi = np.random.randint(4, 6, 6)
sample = np.append(lo,np.append(mid, hi))
print("Min:" + str(sample.min()))
print("Max:" + str(sample.max()))
print("Mean:" + str(sample.mean()))
plt.hist(sample)
plt.show()
Min:-5
Max:5
Mean:0.84

[Figure output_102_1.png: histogram of the rating sample]

The ratings look positive, but how do we know they represent the opinion of all students?

We state two hypotheses, a negative (null) one and a positive (alternative) one:

$H_{0}: \mu \le 0$
$H_{1}: \mu > 0$

Assume $H_{0}$ is true, and check whether the t-test statistic falls inside the confidence interval.

pop = np.random.normal(0, 1.15, 100000)
plt.hist(pop, bins=100)
plt.axvline(pop.mean(), color='yellow', linestyle='dashed', linewidth=2)
plt.show()

[Figure output_104_0.png: simulated null population with mean line]

$t = \frac{\bar{x} - \mu}{\frac{s}{\sqrt{n}}}$

Here $\bar{x}$ is the sample mean, $\mu$ the population mean, $s$ the sample standard deviation, and $n$ the sample size. When the sample size is greater than 30, the t-score is close to the z-score.
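
A minimal sketch of computing this statistic by hand and comparing it with scipy's one-sample t-test (assuming the sample array from the cell above and a hypothesized mean of 0):

import math
from scipy import stats

t_manual = (sample.mean() - 0) / (sample.std(ddof=1) / math.sqrt(len(sample)))
t_scipy, p = stats.ttest_1samp(sample, 0)
print('manual t: ' + str(t_manual))
print('scipy t:  ' + str(t_scipy))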

# T-Test
t,p = stats.ttest_1samp(sample, 0)
p1 = '%f' % (p/2)   # one-tailed p-value (ttest_1samp returns a two-tailed p)
print ("t-statistic:" + str(t))
print("p-value:" + str(p1))
ci = stats.norm.interval(0.90, 0, 1.15)   # 90% confidence interval of the N(0, 1.15) distribution
plt.hist(pop, bins=100)
plt.axvline(pop.mean(), color='yellow', linestyle='dashed', linewidth=2)
plt.axvline(ci[1], color='red', linestyle='dashed', linewidth=2)
plt.axvline(pop.mean() + t*pop.std(), color='magenta', linestyle='dashed', linewidth=2)
plt.show()
t-statistic:2.773584905660377
p-value:0.003911

[Figure output_106_1.png: null distribution with mean, critical value, and test statistic lines]

The yellow line is the mean, the red line is the critical value, and the magenta line marks the observed test statistic (whose tail area is the p-value).

The magenta line falls outside the confidence interval, so the hypothesis $H_{0}$ is rejected.

Two-Tailed Test

For the hypotheses:

$H_{0}: \mu = 0$
$H_{1}: \mu \neq 0$

t,p = stats.ttest_1samp(sample, 0)
print ("t-statistic:" + str(t))
print("p-value:" + '%f' % p)
ci = stats.norm.interval(0.95, 0, 1.15)
plt.hist(pop, bins=100)
plt.axvline(pop.mean(), color='yellow', linestyle='dashed', linewidth=2)
plt.axvline(ci[0], color='red', linestyle='dashed', linewidth=2)
plt.axvline(ci[1], color='red', linestyle='dashed', linewidth=2)
plt.axvline(pop.mean() - t*pop.std(), color='magenta', linestyle='dashed', linewidth=2)
plt.axvline(pop.mean() + t*pop.std(), color='magenta', linestyle='dashed', linewidth=2)
plt.show()
t-statistic:2.773584905660377
p-value:0.007822

[Figure output_109_1.png: two-tailed test, null distribution with critical value and test statistic lines]

Again, the hypothesis $H_{0}$ is rejected.

Two-Sample Test

Suppose there are two groups of students, one that has taken the course and one that has not, with the following hypotheses:

$H_{0}: \mu_{1} \le \mu_{2}$
$H_{1}: \mu_{1} > \mu_{2}$
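
For reference, with its default equal-variance setting, stats.ttest_ind (used below) computes the pooled two-sample t-statistic:

$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$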

np.random.seed(123)
nonMath = np.random.normal(66.0, 1.5, 100)
math = np.random.normal(66.55, 1.5, 100)
print("non-math sample mean:" + str(nonMath.mean()))
print("math sample mean:" + str(math.mean()))
t,p = stats.ttest_ind(math, nonMath)
p1 = '%f' % (p/2)
print("t-statistic:" + str(t))
print("p-value:" + str(p1))
pop = np.random.normal(nonMath.mean(), nonMath.std(), 100000)
ci = stats.norm.interval(0.90, nonMath.mean(), nonMath.std())
plt.hist(pop, bins=100)
plt.axvline(pop.mean(), color='yellow', linestyle='dashed', linewidth=2)
plt.axvline(ci[1], color='red', linestyle='dashed', linewidth=2)
plt.axvline(pop.mean() + t*pop.std(), color='magenta', linestyle='dashed', linewidth=2)
plt.show()
non-math sample mean:66.04066361023553
math sample mean:66.52069665713476
t-statistic:2.140008413392296
p-value:0.016789

[Figure output_112_1.png: two-sample test, null distribution with critical value and test statistic lines]

The test statistic lies beyond the critical value (one-tailed p ≈ 0.017 < 0.05), so $H_{0}$ is rejected: students who took the course scored higher than those who did not.

Paired Test

The two groups in the previous example were independent; what if they depend on each other?

For example, instead of two different groups of students, consider the same students' midterm and final exams.
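
Under the hood, stats.ttest_rel (used below) performs a one-sample t-test on the per-student differences $d_i$ between the two exams:

$t = \frac{\bar{d}}{s_d / \sqrt{n}}$

where $\bar{d}$ is the mean difference and $s_d$ its standard deviation.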

np.random.seed(123)
midTerm = np.random.normal(59.45, 1.5, 100)
endTerm = np.random.normal(60.05, 1.5, 100)
t,p = stats.ttest_rel(endTerm, midTerm)
p1 = '%f' % (p/2)
print("t-statistic:" + str(t))
print("p-value:" + str(p1))
pop = np.random.normal(midTerm.mean(), midTerm.std(), 100000)
ci = stats.norm.interval(0.90, midTerm.mean(), midTerm.std())
plt.hist(pop, bins=100)
plt.axvline(pop.mean(), color='yellow', linestyle='dashed', linewidth=2)
plt.axvline(ci[1], color='red', linestyle='dashed', linewidth=2)
plt.axvline(pop.mean() + t*pop.std(), color='magenta', linestyle='dashed', linewidth=2)
plt.show()
t-statistic:2.3406857739212583
p-value:0.010627

[Figure output_115_1.png: paired test, null distribution with critical value and test statistic lines]

The final exam scores are significantly better than the midterm scores.

