Hypothesis testing is a critical tool in inferential statistics for determining what the value of a population parameter could be. We usually draw this conclusion from an analysis of sample data.
Check this document for a quick and comprehensive link.
The basis of hypothesis testing has two attributes:
Null Hypothesis: $H_0$
Alternative Hypothesis: $H_a$
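The tests we will discuss in this notebook are:
One population proportion
Difference in population proportions
One population mean
Difference in population means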
In this notebook, we will also introduce some functions (from the statsmodels Python package) that are extremely useful when calculating a t-statistic or a z-statistic, and the corresponding p-values, for a hypothesis test.
Let's quickly review the following ways to calculate a test statistic for the tests listed above.
The equation is:
$$\frac{Best\ Estimate - Hypothesized\ Estimate}{Standard\ Error\ of\ Estimate}$$
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
In previous years, 52% of parents believed that electronics and social media were the cause of their teenager’s lack of sleep. Do more parents today believe that their teenager’s lack of sleep is caused by electronics and social media?
Population: Parents with a teenager (age 13-18)
Parameter of Interest: p
Null Hypothesis: p = 0.52
Alternative Hypothesis: p > 0.52 (note that this is a one-sided test)
Data: 1018 people were surveyed. 56% of those surveyed believe that their teenager’s lack of sleep is caused by electronics and social media.
proportions_ztest() from statsmodels
Note the argument alternative="larger", indicating a one-sided test. The function returns two values: the z-statistic and the corresponding p-value.
n = 1018
pnull = .52
phat = .56
sm.stats.proportions_ztest(phat * n, n, pnull, alternative='larger')
Since the calculated p-value of the z-test is small (below 0.05), we can reject the Null hypothesis that the percentage of parents who believe that their teenager’s lack of sleep is caused by electronics and social media is the same as the previous years' estimate, i.e. 52%.
Although we do not formally "accept" the alternative hypothesis, this result suggests that the proportion today is likely higher than 52%.
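As a cross-check, the z-statistic can be computed by hand from the general formula (best estimate minus hypothesized value, divided by the standard error). The sketch below assumes the textbook convention of using the null proportion 0.52 in the standard error. statsmodels documents a prop_var argument for choosing the proportion used in the variance, so the output of proportions_ztest() may differ slightly from this hand calculation. The variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

n = 1018      # survey size
pnull = 0.52  # hypothesized proportion under the Null hypothesis
phat = 0.56   # observed sample proportion

# Standard error of the sample proportion, evaluated at the null value (assumption)
se_null = sqrt(pnull * (1 - pnull) / n)

# (best estimate - hypothesized estimate) / standard error
z_manual = (phat - pnull) / se_null

# One-sided p-value: probability of a z at least this large under the Null
p_value_manual = 1 - NormalDist().cdf(z_manual)

print(round(z_manual, 3), round(p_value_manual, 4))
```

The resulting z-statistic is a little above 2.5, which leads to the same conclusion as the statsmodels call.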
Is there a significant difference between the population proportions of parents of black children and parents of Hispanic children who report that their child has had some swimming lessons?
Populations: All parents of black children age 6-18 and all parents of Hispanic children age 6-18
Parameter of Interest: p1 - p2, where p1 is the proportion for parents of black children and p2 is the proportion for parents of Hispanic children
Null Hypothesis: p1 - p2 = 0
Alternative Hypothesis: p1 - p2 $\neq$ 0
Data: 247 Parents of Black Children. 36.8% of parents report that their child has had some swimming lessons.
308 Parents of Hispanic Children. 38.9% of parents report that their child has had some swimming lessons.
ttest_ind() from statsmodels
Here we test the difference in population proportions with a t-test. Each response is a yes/no outcome, so each sample can be modeled as binomial (0/1) draws. We simulate the two samples with the appropriate binomial parameters (proportion and sample size) and pass them to the t-test function.
The function returns three values: (a) the test statistic, (b) the p-value of the t-test, and (c) the degrees of freedom used in the t-test.
n1 = 247
p1 = .37
n2 = 308
p2 = .39
population1 = np.random.binomial(1, p1, n1)
population2 = np.random.binomial(1, p2, n2)
sm.stats.ttest_ind(population1, population2)
Since the p-value is quite high (about 0.768 in the run shown; it varies between runs because the samples are randomly generated), we cannot reject the Null hypothesis in this case, i.e. the difference in the population proportions is not statistically significant.
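Because the simulated samples change on every run, a deterministic cross-check can be useful. The following is a minimal sketch of a pooled two-proportion z-test computed directly from the reported survey percentages (36.8% of 247 and 38.9% of 308). It is a different technique from the simulated t-test above, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n1, p1_hat = 247, 0.368  # parents of black children
n2, p2_hat = 308, 0.389  # parents of Hispanic children

# Pooled proportion under the Null hypothesis p1 - p2 = 0
p_pooled = (p1_hat * n1 + p2_hat * n2) / (n1 + n2)

# Standard error of the difference in sample proportions
se_pooled = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))

z_prop = (p1_hat - p2_hat) / se_pooled

# Two-sided p-value
p_value_prop = 2 * NormalDist().cdf(-abs(z_prop))

print(round(z_prop, 3), round(p_value_prop, 3))
```

The p-value from this calculation is also far above 0.05, consistent with not rejecting the Null hypothesis.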
Now suppose we do not change the proportions, only the number of survey participants in the two populations. With larger samples, the same slight difference in proportions could become statistically significant. There is no guarantee that you will get a p-value < 0.05 every time you run the code, because the samples are randomly generated each time. But if you run it several times, you should see p-values < 0.05 in a good share of the runs.
n1 = 5000
p1 = .37
n2 = 5000
p2 = .39
population1 = np.random.binomial(1, p1, n1)
population2 = np.random.binomial(1, p2, n2)
sm.stats.ttest_ind(population1, population2)
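To see the effect of sample size without simulation noise, the sketch below evaluates a pooled two-proportion z-test at several group sizes. It assumes, hypothetically, that the observed sample proportions are exactly 0.37 and 0.39 at every size; the helper name pooled_two_proportion_z is illustrative and not part of statsmodels.

```python
from math import sqrt
from statistics import NormalDist


def pooled_two_proportion_z(p1_hat, n1, p2_hat, n2):
    """Return (z, two-sided p-value) for H0: p1 - p2 = 0 with a pooled standard error."""
    pooled = (p1_hat * n1 + p2_hat * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    return z, 2 * NormalDist().cdf(-abs(z))


# Same proportions, increasing group sizes (sizes chosen for illustration)
for size in (300, 1000, 5000, 20000):
    z, p = pooled_two_proportion_z(0.37, size, 0.39, size)
    print(f"n per group = {size:>5}: z = {z:6.2f}, p-value = {p:.4f}")
```

The standard error shrinks roughly with the square root of the sample size, so a fixed two-point difference yields a larger z-statistic as the groups grow.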
Let's say a cartwheeling competition was organized for some adults. The data look like the following:
(80.57, 98.96, 85.28, 83.83, 69.94, 89.59, 91.09, 66.25, 91.21, 82.7 , 73.54, 81.99, 54.01, 82.89, 75.88, 98.32, 107.2 , 85.53, 79.08, 84.3 , 89.32, 86.35, 78.98, 92.26, 87.01)
Is the average cartwheel distance (in inches) for adults more than 80 inches?
Population: All adults
Parameter of Interest: $\mu$, population mean cartwheel distance.
Null Hypothesis: $\mu$ = 80
Alternative Hypothesis: $\mu$ > 80
Data:
25 adult participants.
$\bar{x} = 83.84$ (sample mean)
$s = 10.72$ (sample standard deviation, as computed by NumPy's default std())
cwdata = np.array([80.57, 98.96, 85.28, 83.83, 69.94, 89.59, 91.09, 66.25, 91.21, 82.7 , 73.54, 81.99, 54.01,
82.89, 75.88, 98.32, 107.2 , 85.53, 79.08, 84.3 , 89.32, 86.35, 78.98, 92.26, 87.01])
n = len(cwdata)
mean = cwdata.mean()
sd = cwdata.std()
(n, mean, sd)
sm.stats.ztest(cwdata, value = 80, alternative = "larger")
Since the p-value (0.0394) is lower than the standard significance level of 0.05, we can reject the Null hypothesis that the mean cartwheel distance for adults (a population quantity) is equal to 80 inches. There is evidence in support of the alternative hypothesis that the mean cartwheel distance is, in fact, higher than 80 inches. Note that we used alternative="larger" in the z-test.
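The reported p-value can be reproduced by hand from the general test-statistic formula. This sketch assumes that the z-test estimates the standard error from the sample standard deviation with an n - 1 denominator (which is what the statistics.stdev function computes); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

cw = [80.57, 98.96, 85.28, 83.83, 69.94, 89.59, 91.09, 66.25, 91.21, 82.7,
      73.54, 81.99, 54.01, 82.89, 75.88, 98.32, 107.2, 85.53, 79.08, 84.3,
      89.32, 86.35, 78.98, 92.26, 87.01]

n_cw = len(cw)
xbar = mean(cw)     # best estimate of the population mean
s_cw = stdev(cw)    # sample standard deviation (n - 1 denominator)

# (best estimate - hypothesized estimate) / standard error
z_cw = (xbar - 80) / (s_cw / sqrt(n_cw))

# One-sided p-value for the alternative "mean > 80"
p_cw = 1 - NormalDist().cdf(z_cw)

print(round(z_cw, 3), round(p_cw, 4))
```

The p-value comes out at about 0.039, in line with the value quoted above.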
We can also plot the histogram of the data to check if it approximately follows a Normal distribution.
plt.hist(cwdata,bins=5,edgecolor='k')
plt.show()
Considering adults in the NHANES data, do males have a significantly higher mean Body Mass Index than females?
Population: Adults in the NHANES data.
Parameter of Interest: $\mu_1 - \mu_2$, the difference in mean Body Mass Index, where $\mu_1$ is the mean for females and $\mu_2$ is the mean for males.
Null Hypothesis: $\mu_1 = \mu_2$
Alternative Hypothesis: $\mu_1 \neq \mu_2$
Data:
2976 Female Adults
$\bar{x}_1 = 29.94$
$s_1 = 7.75$
2759 Male Adults
$\bar{x}_2 = 28.78$
$s_2 = 6.25$
$\bar{x}_1 - \bar{x}_2 = 1.16$
url = "https://raw.githubusercontent.com/kshedden/statswpy/master/NHANES/merged/nhanes_2015_2016.csv"
da = pd.read_csv(url)
da.head()
females = da[da["RIAGENDR"] == 2]
male = da[da["RIAGENDR"] == 1]
n1 = len(females)
mu1 = females["BMXBMI"].mean()
sd1 = females["BMXBMI"].std()
(n1, mu1, sd1)
n2 = len(male)
mu2 = male["BMXBMI"].mean()
sd2 = male["BMXBMI"].std()
(n2, mu2, sd2)
sm.stats.ztest(females["BMXBMI"].dropna(), male["BMXBMI"].dropna(),alternative='two-sided')
Since the p-value (6.59e-10) is extremely small, we can reject the Null hypothesis that the mean BMI of males is the same as that of females. Note that we used alternative="two-sided" in the z-test because here we are checking for inequality.
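A rough cross-check is possible from the summary statistics alone. The sketch below assumes an unpooled standard error and uses the group sizes 2976 and 2759 as listed, which may include rows with a missing BMI, so the result is only expected to be of the same order as the statsmodels output and not identical to it. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics quoted in the Data section
n_f, mean_f, sd_f = 2976, 29.94, 7.75  # females
n_m, mean_m, sd_m = 2759, 28.78, 6.25  # males

# Unpooled standard error of the difference in sample means (assumption)
se_diff = sqrt(sd_f**2 / n_f + sd_m**2 / n_m)

# (best estimate - hypothesized estimate) / standard error, with hypothesized difference 0
z_bmi = (mean_f - mean_m) / se_diff

# Two-sided p-value
p_bmi = 2 * NormalDist().cdf(-abs(z_bmi))

print(round(z_bmi, 2), p_bmi)
```

A z-statistic above 6 corresponds to a vanishingly small two-sided p-value, which agrees with the conclusion drawn from the statsmodels call.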
We can also plot the histogram of the data to check if it approximately follows a Normal distribution.
plt.figure(figsize=(7,4))
plt.title("Female BMI histogram",fontsize=16)
plt.hist(females["BMXBMI"].dropna(),edgecolor='k',color='pink',bins=25)
plt.show()
plt.figure(figsize=(7,4))
plt.title("Male BMI histogram",fontsize=16)
plt.hist(male["BMXBMI"].dropna(),edgecolor='k',color='blue',bins=25)
plt.show()