A First Look at Probability and Statistics (Part 3)
Sampling and Distributions
$$\mu_{\hat{p}} = \hat{p}$$
$$\sigma_{\hat{p}} = \sqrt{\hat{p}(1-\hat{p})}$$
import numpy as np
import matplotlib.pyplot as plt
# One sample of 16 binary search results
searches = np.array([0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0])
plt.xlabel('Search Results')
plt.ylabel('Frequency')
plt.hist(searches)
plt.show()
print('Mean: ' + str(np.mean(searches)))
print('StDev: ' + str(np.std(searches)))
![Histogram of the 16 search results](https://img-blog.csdnimg.cn/20210110114744617.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzE1Mzc4Mzg1,size_16,color_FFFFFF,t_70)
Mean: 0.1875
StDev: 0.3903123748998999
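The printed standard deviation can be checked against the formula for 0/1 data. The following is a minimal sketch that reuses the 16 search results above; the names `p_hat`, `sd_formula` and `sd_numpy` are chosen here for illustration only.

```python
import numpy as np

# The 16 binary search results from the example above
searches = np.array([0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0])

p_hat = searches.mean()                     # sample proportion
sd_formula = np.sqrt(p_hat * (1 - p_hat))   # sqrt(p_hat * (1 - p_hat))
sd_numpy = searches.std()                   # population std (ddof=0), as printed above

print('p-hat:', p_hat)
print('Formula:', sd_formula)
print('np.std:', sd_numpy)
```

Both routes should give the same number, because for binary data the population variance reduces to p̂(1 − p̂).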
For the following sample proportions:
| Sample | Result |
|---|---|
| p̂1 | 0.1875 |
| p̂2 | 0.2500 |
| p̂3 | 0.3125 |
| p̂4 | 0.1875 |
| p̂5 | 0.1250 |
| p̂6 | 0.3750 |
| p̂7 | 0.2500 |
| p̂8 | 0.1875 |
| p̂9 | 0.3125 |
| p̂10 | 0.2500 |
| p̂11 | 0.2500 |
| p̂12 | 0.3125 |
searches = np.array([0.1875,0.25,0.3125,0.1875,0.125,0.375,0.25,0.1875,0.3125,0.25,0.25,0.3125])
plt.xlabel('Search Results')
plt.ylabel('Frequency')
plt.hist(searches)
plt.show()
![Histogram of the 12 sample proportions](https://img-blog.csdnimg.cn/2021011011475364.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzE1Mzc4Mzg1,size_16,color_FFFFFF,t_70)
Central Limit Theorem
When the sample size is large, the binomial distribution approaches a normal distribution.
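One way to see this convergence numerically is to compare the binomial CDF with a normal CDF that has the same mean and standard deviation. The sketch below is an illustration, not part of the original notebook: the helper `max_cdf_gap`, the sample sizes 5, 50 and 500, and the 0.5 continuity correction are all choices made here.

```python
import numpy as np
from scipy import stats

p = 0.25  # success probability, the same value used later in this section

def max_cdf_gap(n, p):
    """Largest gap between the Binomial(n, p) CDF and its normal approximation."""
    k = np.arange(0, n + 1)
    binom_cdf = stats.binom.cdf(k, n, p)
    # Normal with matching mean and std, shifted by 0.5 (continuity correction)
    normal_cdf = stats.norm.cdf(k + 0.5, loc=n * p, scale=np.sqrt(n * p * (1 - p)))
    return np.max(np.abs(binom_cdf - normal_cdf))

sizes = (5, 50, 500)
gaps = [max_cdf_gap(n, p) for n in sizes]
for n, gap in zip(sizes, gaps):
    print('n = %d, max CDF gap = %.4f' % (n, gap))
```

The gap should shrink as n grows, which is the behavior the theorem describes.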
Here is another example:
import pandas as pd
n, p, s = 100, 0.25, 10000
# s simulated samples of size n; dividing the success counts by n gives p-hat
df = pd.DataFrame(np.random.binomial(n,p,s)/n, columns=['p-hat'])
means = df['p-hat']
means.plot.hist(title='Simulated Sampling Distribution')
plt.show()
print ('Mean: ' + str(means.mean()))
print ('Std: ' + str(means.std()))
![Simulated sampling distribution of p-hat](https://img-blog.csdnimg.cn/20210110114801871.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzE1Mzc4Mzg1,size_16,color_FFFFFF,t_70)
Mean: 0.24929500000000002
Std: 0.042734945917366686
Mean and Standard Deviation of the Sample Proportion
$$\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$$
Increasing the sample size from 16 to 100:
$$\sigma_{\hat{p}} = \sqrt{\frac{0.25 \times 0.75}{16}} \approx 0.11$$
$$\sigma_{\hat{p}} = \sqrt{\frac{0.25 \times 0.75}{100}} \approx 0.043$$
With the mean and standard deviation in hand, we can apply the 68-95-99.7 rule.
For example, about 95.4% of sample proportions fall within 0.25 ± 0.086 (two standard deviations, 2 × 0.043); this range is called a confidence interval.
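The interval 0.25 ± 0.086 can be recomputed and checked by simulation. This is a rough sketch: the seed, the 100,000 repetitions and the variable names are arbitrary choices made here. Because p̂ only takes values in steps of 1/n, the simulated share will be close to, but not exactly, 95.4%.

```python
import numpy as np

n, p = 100, 0.25
se = np.sqrt(p * (1 - p) / n)          # standard deviation of p-hat
lower, upper = p - 2 * se, p + 2 * se  # two standard deviations either side

# Simulate sample proportions and count how many land inside the interval
rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
p_hats = rng.binomial(n, p, 100_000) / n
inside = np.mean((p_hats >= lower) & (p_hats <= upper))

print('Interval: %.4f to %.4f' % (lower, upper))
print('Share of samples inside: %.3f' % inside)
```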
n, p, s = 100, 0.25, 10000
df = pd.DataFrame(np.random.binomial(n,p,s)/n, columns=['p-hat'])
means = df['p-hat']
m = means.mean()
sd = means.std()
moe1 = m - (sd * 2)
moe2 = m + (sd * 2)
means.plot.hist(title='Simulated Sampling Distribution')
plt.axvline(m, color='red', linestyle='dashed', linewidth=2)
plt.axvline(moe1, color='magenta', linestyle='dashed', linewidth=2)
plt.axvline(moe2, color='magenta', linestyle='dashed', linewidth=2)
plt.show()
![Simulated sampling distribution with the mean and the two-standard-deviation bounds marked](https://img-blog.csdnimg.cn/20210110114812421.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzE1Mzc4Mzg1,size_16,color_FFFFFF,t_70)
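Building a Sampling Distribution from Sample Means
The following example shows the effect of combining several random variables: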
| Sample | Weights |
|---|---|
| 1 | [4.020992,2.143457,2.260409,2.339641,4.699211] |
| 2 | [3.38532,4.438345,3.170228,3.499913,4.489557] |
| 3 | [3.338228,1.825221,3.53633,3.507952,2.698669] |
| 4 | [2.992756,3.292431,3.38148,3.479455,3.051273] |
| 5 | [2.969977,3.869029,4.149342,2.785682,3.03557] |
| 6 | [3.138055,2.535442,3.530052,3.029846,2.881217] |
| 7 | [1.596558,1.486385,3.122378,3.684084,3.501813] |
| 8 | [2.997384,3.818661,3.118434,3.455269,3.026508] |
| 9 | [4.078268,2.283018,3.606384,4.555053,3.344701] |
| 10 | [2.532509,3.064274,3.32908,2.981303,3.915995] |
| 11 | [4.078268,2.283018,3.606384,4.555053,3.344701] |
| 12 | [2.532509,3.064274,3.32908,2.981303,3.915995] |
Compute the mean of each sample:
| Sample | Mean Weight |
|---|---|
| x̄1 | 3.092742 |
| x̄2 | 3.7966726 |
| x̄3 | 2.98128 |
| x̄4 | 3.239479 |
| x̄5 | 3.36192 |
| x̄6 | 3.0229224 |
| x̄7 | 2.6782436 |
| x̄8 | 3.2832512 |
| x̄9 | 3.5734848 |
| x̄10 | 3.1646322 |
| x̄11 | 3.5734848 |
| x̄12 | 3.1646322 |
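The means in this table can be reproduced directly from the Weights table. A minimal sketch using only the first three rows; the array name `weights` is chosen here for illustration.

```python
import numpy as np

# First three rows of the Weights table above
weights = np.array([
    [4.020992, 2.143457, 2.260409, 2.339641, 4.699211],
    [3.38532, 4.438345, 3.170228, 3.499913, 4.489557],
    [3.338228, 1.825221, 3.53633, 3.507952, 2.698669],
])

# axis=1 averages across each row, giving one mean per sample
sample_means = weights.mean(axis=1)
print(sample_means)
```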
meanweights = np.array([3.092742,3.7966726,2.98128,3.239479,3.36192,3.0229224,2.6782436,3.2832512,3.5734848,3.1646322,3.5734848,3.1646322])
plt.xlabel('Mean Weights')
plt.ylabel('Frequency')
plt.hist(meanweights, bins=6)
plt.show()
print('Mean: ' + str(meanweights.mean()))
print('Std: ' + str(meanweights.std()))
![Histogram of the 12 mean weights](https://img-blog.csdnimg.cn/20210110114821971.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzE1Mzc4Mzg1,size_16,color_FFFFFF,t_70)
Mean: 3.2443954
Std: 0.2903283632058937
Next, simulate 10,000 samples of size 500 from a normal distribution with mean 3.2 and standard deviation 1.2:
mu, sigma, n = 3.2, 1.2, 500
samples = list(range(0, 10000))
# This loop can take a few minutes: np.append copies the whole array on every call
data = np.array([])
sampling = np.array([])
for s in samples:
    sample = np.random.normal(mu, sigma, n)
    data = np.append(data,sample)
    sampling = np.append(sampling,sample.mean())
df = pd.DataFrame(sampling, columns=['mean'])
means = df['mean']
means.plot.hist(title='Simulated Sampling Distribution', bins=100)
plt.show()
print('Sample Mean: ' + str(data.mean()))
print('Sample StdDev: ' + str(data.std()))
print ('Sampling Mean: ' + str(means.mean()))
print ('Sampling StdErr: ' + str(means.std()))
![Simulated sampling distribution of the sample mean](https://img-blog.csdnimg.cn/20210110114829745.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzE1Mzc4Mzg1,size_16,color_FFFFFF,t_70)
Sample Mean: 3.200047484836999
Sample StdDev: 1.1998703000813948
Sampling Mean: 3.200047484836996
Sampling StdErr: 0.052796653060840436
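The simulated standard error can be compared with the theoretical value σ/√n. The sketch below assumes the same `mu`, `sigma` and `n` as the simulation, but uses fewer repetitions and a vectorized draw so it runs quickly; the seed and the 2,000 repetitions are arbitrary choices made here.

```python
import numpy as np

mu, sigma, n = 3.2, 1.2, 500  # same parameters as the simulation above

# Standard error of the mean: population std divided by sqrt(sample size)
std_err = sigma / np.sqrt(n)
print('Theoretical StdErr: %.4f' % std_err)

# Quick check: draw 2,000 samples of size n at once, one sample per row
rng = np.random.default_rng(1)  # arbitrary seed
sample_means = rng.normal(mu, sigma, size=(2000, n)).mean(axis=1)
print('Simulated StdErr: %.4f' % sample_means.std())
```

The two values should agree to roughly two significant digits, in line with the 0.0528 reported above.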
Confidence Intervals
| Confidence level | z-score |
|---|---|
| 90% | 1.645 |
| 95% | 1.96 |
| 99% | 2.576 |
$$MoE = 0.053 \times 1.96 = 0.10388$$
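The z-scores in the table and the margin of error can be recovered from the standard normal distribution. A minimal sketch using `scipy.stats.norm.ppf`; the rounded standard error 0.053 is taken from the simulation above, and the variable names are illustrative.

```python
from scipy import stats

std_err = 0.053  # standard error from the simulation above, rounded

for level in (0.90, 0.95, 0.99):
    # Two-sided interval: split the leftover probability between both tails
    z = stats.norm.ppf(1 - (1 - level) / 2)
    moe = z * std_err
    print('%.0f%%: z = %.3f, MoE = %.5f' % (level * 100, z, moe))

z95 = stats.norm.ppf(0.975)  # critical value for the 95% level
```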
As the sample size approaches infinity, the standard error approaches 0:
$$\lim_{n \to \infty} \frac{\sigma}{\sqrt{n}} = 0$$
from scipy import stats
mu, sigma, n = 3.2, 1.2, 500
samples = list(range(0, 10000))
data = np.array([])
sampling = np.array([])
for s in samples:
    sample = np.random.normal(mu, sigma, n)
    data = np.append(data,sample)
    sampling = np.append(sampling,sample.mean())
df = pd.DataFrame(sampling, columns=['mean'])
means = df['mean']
m = means.mean()
sd = means.std()
# 95% interval of a normal distribution centered at m with standard deviation sd
ci = stats.norm.interval(0.95, m, sd)
means.plot.hist(title='Simulated Sampling Distribution', bins=100)
plt.axvline(m, color='red', linestyle='dashed', linewidth=2)
plt.axvline(ci[0], color='magenta', linestyle='dashed', linewidth=2)
plt.axvline(ci[1], color='magenta', linestyle='dashed', linewidth=2)
plt.show()
print ('Sampling Mean: ' + str(m))
print ('Sampling StdErr: ' + str(sd))
print ('95% Confidence Interval: ' + str(ci))
![Simulated sampling distribution with the 95% confidence interval marked](https://img-blog.csdnimg.cn/20210110114837940.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzE1Mzc4Mzg1,size_16,color_FFFFFF,t_70)
Sampling Mean: 3.2004775949381727
Sampling StdErr: 0.05400806134243375
95% Confidence Interval: (3.0946237398321728, 3.3063314500441727)
Hypothesis Testing
One-Sample, One-Tailed Test
Students rate their teacher's lectures on a scale from -5 to 5.
Generate 50 random ratings:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
%matplotlib inline
np.random.seed(123)
# 6 low, 38 middling and 6 high ratings (the upper bound of randint is exclusive)
lo = np.random.randint(-5, -1, 6)
mid = np.random.randint(0, 3, 38)
hi = np.random.randint(4, 6, 6)
sample = np.append(lo,np.append(mid, hi))
print("Min:" + str(sample.min()))
print("Max:" + str(sample.max()))
print("Mean:" + str(sample.mean()))
plt.hist(sample)
plt.show()
Min:-5
Max:5
Mean:0.84
![Histogram of the 50 ratings](https://img-blog.csdnimg.cn/2021011011484753.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzE1Mzc4Mzg1,size_16,color_FFFFFF,t_70)
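The ratings look positive on average, but how do we know this reflects the opinion of all students?
We state two competing hypotheses, a null hypothesis and an alternative:
$$H_{0}: \mu \le 0 \\ H_{1}: \mu > 0$$
Assume $H_{0}$ is true, then check whether the t-statistic falls inside the confidence interval.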
pop = np.random.normal(0, 1.15, 100000)
plt.hist(pop, bins=100)
plt.axvline(pop.mean(), color='yellow', linestyle='dashed', linewidth=2)
plt.show()
![Histogram of the simulated population under the null hypothesis](https://img-blog.csdnimg.cn/20210110114854902.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzE1Mzc4Mzg1,size_16,color_FFFFFF,t_70)
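$$t = \frac{\bar{x} - \mu}{\frac{s}{\sqrt{n}}}$$
Here x̄ is the sample mean, μ is the population mean under the null hypothesis, s is the sample standard deviation, and n is the sample size. When the sample size is larger than 30, the t-score is close to the z-score.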
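The formula can be checked against `scipy.stats.ttest_1samp`. In the sketch below, the `ratings` array is hypothetical data made up for this illustration, and the variable names are likewise chosen here.

```python
import numpy as np
from scipy import stats

# Hypothetical ratings, made up for this illustration only
ratings = np.array([2, 0, 1, -1, 3, 1, 0, 2, 1, -2, 4, 1])
mu0 = 0  # mean under the null hypothesis

x_bar = ratings.mean()
s = ratings.std(ddof=1)   # sample standard deviation (n - 1 in the denominator)
n = len(ratings)

t_manual = (x_bar - mu0) / (s / np.sqrt(n))
t_scipy, p_two_sided = stats.ttest_1samp(ratings, mu0)

print('Manual t:', t_manual)
print('SciPy t: ', t_scipy)
```

Note the `ddof=1`: NumPy's default `std()` divides by n, whereas the t-statistic uses the sample standard deviation.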
# T-Test
t,p = stats.ttest_1samp(sample, 0)
# ttest_1samp returns a two-tailed p-value; halve it for the one-tailed test (valid here because t > 0)
p1 = '%f' % (p/2)
print ("t-statistic:" + str(t))
print("p-value:" + str(p1))
# 90% interval of N(0, 1.15); its upper bound is the 5% one-tailed critical value
ci = stats.norm.interval(0.90, 0, 1.15)
plt.hist(pop, bins=100)
plt.axvline(pop.mean(), color='yellow', linestyle='dashed', linewidth=2)
plt.axvline(ci[1], color='red', linestyle='dashed', linewidth=2)
plt.axvline(pop.mean() + t*pop.std(), color='magenta', linestyle='dashed', linewidth=2)
plt.show()
t-statistic:2.773584905660377
p-value:0.003911
![Population histogram with the mean, the critical value and the t-statistic marked](https://img-blog.csdnimg.cn/20210110114904962.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzE1Mzc4Mzg1,size_16,color_FFFFFF,t_70)
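The yellow line is the mean, the red line is the critical value, and the magenta line marks where the t-statistic falls (t standard deviations above the mean).
The magenta line lies beyond the critical value, so we reject $H_{0}$.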
Two-Tailed Test
For the hypotheses:
$$H_{0}: \mu = 0 \\ H_{1}: \mu \neq 0$$
t,p = stats.ttest_1samp(sample, 0)
print ("t-statistic:" + str(t))
print("p-value:" + '%f' % p)
ci = stats.norm.interval(0.95, 0, 1.15)
plt.hist(pop, bins=100)
plt.axvline(pop.mean(), color='yellow', linestyle='dashed', linewidth=2)
plt.axvline(ci[0], color='red', linestyle='dashed', linewidth=2)
plt.axvline(ci[1], color='red', linestyle='dashed', linewidth=2)
plt.axvline(pop.mean() - t*pop.std(), color='magenta', linestyle='dashed', linewidth=2)
plt.axvline(pop.mean() + t*pop.std(), color='magenta', linestyle='dashed', linewidth=2)
plt.show()
t-statistic:2.773584905660377
p-value:0.007822
![Population histogram with both critical values and the t-statistic marked on each side](https://img-blog.csdnimg.cn/2021011011491619.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzE1Mzc4Mzg1,size_16,color_FFFFFF,t_70)
Again, we reject $H_{0}$.
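The one-tailed and two-tailed p-values are related through the tails of the t distribution. As a sketch, assuming the t-statistic reported above and 49 degrees of freedom (50 ratings minus one), both can be computed with the survival function `scipy.stats.t.sf`; the variable names are chosen here.

```python
from scipy import stats

t_stat = 2.773584905660377  # t-statistic reported above
df = 50 - 1                 # 50 ratings, so 49 degrees of freedom

p_one_tailed = stats.t.sf(t_stat, df)           # area in the upper tail only
p_two_tailed = 2 * stats.t.sf(abs(t_stat), df)  # area in both tails

print('one-tailed p: %f' % p_one_tailed)
print('two-tailed p: %f' % p_two_tailed)
```

This is why the one-tailed example halves the p-value returned by `ttest_1samp`: the function reports the two-tailed value, which is exactly twice the upper-tail area when t is positive.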
Two-Sample Test
Suppose there are two groups of students: one group has taken the course and the other has not. The hypotheses are:
$$H_{0}: \mu_{1} \le \mu_{2} \\ H_{1}: \mu_{1} > \mu_{2}$$
np.random.seed(123)
nonMath = np.random.normal(66.0, 1.5, 100)
math = np.random.normal(66.55, 1.5, 100)
print("non-math sample mean:" + str(nonMath.mean()))
print("math sample mean:" + str(math.mean()))
t,p = stats.ttest_ind(math, nonMath)
p1 = '%f' % (p/2)
print("t-statistic:" + str(t))
print("p-value:" + str(p1))
pop = np.random.normal(nonMath.mean(), nonMath.std(), 100000)
ci = stats.norm.interval(0.90, nonMath.mean(), nonMath.std())
plt.hist(pop, bins=100)
plt.axvline(pop.mean(), color='yellow', linestyle='dashed', linewidth=2)
plt.axvline(ci[1], color='red', linestyle='dashed', linewidth=2)
plt.axvline(pop.mean() + t*pop.std(), color='magenta', linestyle='dashed', linewidth=2)
plt.show()
non-math sample mean:66.04066361023553
math sample mean:66.52069665713476
t-statistic:2.140008413392296
p-value:0.016789
![Two-sample test: population histogram with the mean, the critical value and the t-statistic marked](https://img-blog.csdnimg.cn/2021011011492433.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzE1Mzc4Mzg1,size_16,color_FFFFFF,t_70)
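Since the one-tailed p-value (0.016789) is below 0.05, we reject $H_{0}$: students who took the course score higher on average than those who did not.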
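For reference, the statistic that `stats.ttest_ind` reports by default (equal variances assumed) is the pooled two-sample t. The sketch below recomputes it by hand; the data are freshly generated with an arbitrary seed, so the numbers differ from the example above, and `group_a` and `group_b` are illustrative names.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)  # arbitrary seed; data differ from the example above
group_a = rng.normal(66.55, 1.5, 100)
group_b = rng.normal(66.0, 1.5, 100)

n_a, n_b = len(group_a), len(group_b)
# Pooled variance: weighted average of the two sample variances
pooled_var = ((n_a - 1) * group_a.var(ddof=1)
              + (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

t_scipy, p_two_sided = stats.ttest_ind(group_a, group_b)  # equal_var=True by default
print('Manual t:', t_manual)
print('SciPy t: ', t_scipy)
```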
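Paired Test
The two groups in the previous example are independent. What if the samples depend on each other?
For example, instead of two different groups of students, the same students take a midterm and a final exam.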
np.random.seed(123)
midTerm = np.random.normal(59.45, 1.5, 100)
endTerm = np.random.normal(60.05, 1.5, 100)
t,p = stats.ttest_rel(endTerm, midTerm)
p1 = '%f' % (p/2)
print("t-statistic:" + str(t))
print("p-value:" + str(p1))
pop = np.random.normal(midTerm.mean(), midTerm.std(), 100000)
ci = stats.norm.interval(0.90, midTerm.mean(), midTerm.std())
plt.hist(pop, bins=100)
plt.axvline(pop.mean(), color='yellow', linestyle='dashed', linewidth=2)
plt.axvline(ci[1], color='red', linestyle='dashed', linewidth=2)
plt.axvline(pop.mean() + t*pop.std(), color='magenta', linestyle='dashed', linewidth=2)
plt.show()
t-statistic:2.3406857739212583
p-value:0.010627
![Paired test: population histogram with the mean, the critical value and the t-statistic marked](https://img-blog.csdnimg.cn/20210110114932706.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzE1Mzc4Mzg1,size_16,color_FFFFFF,t_70)
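The one-tailed p-value (0.010627) is below 0.05, so the final exam scores are significantly higher than the midterm scores.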
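A paired test is equivalent to a one-sample t-test on the per-student differences. The sketch below illustrates this with made-up data: unlike the example above, which draws the two exams independently, it assumes each final score is the same student's midterm score plus a random improvement. The seed and the improvement parameters are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)  # arbitrary seed; illustrative data
mid_term = rng.normal(59.45, 1.5, 100)
# Assumption for this sketch: each final score is the same student's midterm plus an improvement
end_term = mid_term + rng.normal(0.6, 1.0, 100)

t_rel, p_rel = stats.ttest_rel(end_term, mid_term)
t_diff, p_diff = stats.ttest_1samp(end_term - mid_term, 0)

print('ttest_rel:            t = %f, p = %f' % (t_rel, p_rel))
print('ttest_1samp on diffs: t = %f, p = %f' % (t_diff, p_diff))
```

Both calls should report the same statistic and p-value, which shows that pairing removes the between-student variation that the independent two-sample test has to absorb.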
