1.4 - Hypothesis testing (general)

by Mark Greenwood and Katharine Banner

A hypothesis test is formulated to answer a specific question about a population or true parameter(s) using a statistic based on a data set. In your previous statistics course, you (hopefully) considered one-sample hypotheses about population means and proportions and the two-sample mean situation we are focused on here. Our hypotheses relate to the question of whether the population mean sentences differ between the two groups, with an initial assumption of no difference.

Hypothesis testing is much like a criminal trial where you are in the role of a jury member (or judge if no jury is present). Initially, the defendant is assumed innocent. In our situation, the true means are assumed to be equal between the groups. Then evidence is presented and, as a juror, you analyze it. In statistical hypothesis testing, data are collected and analyzed. Then you have to decide if we had "enough" evidence to reject the initial assumption ("innocence" is initially assumed). To make this decision, you want to have previously decided on the standard of evidence required to reject the initial assumption. In criminal cases, "beyond a reasonable doubt" is used. Wikipedia's definition suggests that this standard is that "there can still be a doubt, but only to the extent that it would not affect a reasonable person's belief regarding whether or not the defendant is guilty". In civil trials, a lower standard called a "preponderance of evidence" is used. Based on that defined and pre-decided (a priori) measure, you decide that the defendant is guilty or not guilty. In statistics, we compare our p-value to a significance level, α, which is most often 5%. If our p-value is less than α, we reject the null hypothesis. The choice of the significance level is like the variation in standards of evidence between criminal and civil trials - and in all situations everyone should know the standards required for rejecting the initial assumption before any information is "analyzed". Once someone is found guilty, then there is the matter of sentencing which is related to the impacts ("size") of the crime. In statistics, this is similar to the estimated size of differences and the related judgements about whether the differences are practically important or not. If the crime is proven beyond a reasonable doubt but it is a minor crime, then the sentence will be small. With the same level of evidence and a more serious crime, the sentence will be more dramatic.

There are some important aspects of the testing process to note that inform how we interpret statistical hypothesis test results. When someone is found "not guilty", it does not mean "innocent", it just means that there was not enough evidence to find the person guilty "beyond a reasonable doubt". Not finding enough evidence to reject the null hypothesis does not imply that the true means are equal, just that there was not enough evidence to conclude that they were different. There are many potential reasons why we might fail to reject the null, but the most common one is that our sample size was too small (which is related to having too little evidence).

Throughout the semester, we will continue to re-iterate the distinctions between parameters and statistics and want you to be clear about the distinctions between estimates based on the sample and inferences for the population or true values of the parameters of interest. Remember that statistics are summaries of the sample information and parameters are characteristics of populations (which we rarely know). In the two-sample mean situation, the sample means are always at least a little different - that is not an interesting conclusion. What is interesting is whether we have enough evidence to prove that the population means differ "beyond a reasonable doubt".

The scope of any inferences is constrained based on whether there is a random sample (RS) and/or random assignment (RA). Table 1-1 contains the four possible combinations of these two characteristics of a given study. Random assignment allows for causal inferences for differences that are observed - the difference in treatment levels causes differences in the mean responses. Random sampling (or at least some sort of representative sample) allows inferences to be made to the population of interest. If we do not have RA, then causal inferences cannot be made. If we do not have a representative sample, then our inferences are limited to the sampled subjects.

A simple example helps to clarify how the scope of inference can change. Suppose we are interested in studying the GPA of students and have a sample mean GPA and a confidence interval for the population mean GPA available. If we had taken a random sample from, say, the STAT 217 students in a given semester, our scope of inference would be the population of STAT 217 students in that semester. If we had taken a random sample from the entire MSU population, then the inferences would be to the entire MSU population in that semester. These are similar types of problems, but the two populations are very different, and the group you are trying to make conclusions about should be noted carefully in your results - it does matter! If we did not have a representative sample, say the students could choose to provide this information or not, then we can only make inferences to volunteers. These volunteers might differ in systematic ways from the entire population of STAT 217 students, so we cannot safely extend our inferences beyond the group that volunteered.

*Table 1-1: Scope of inference summary.*
| Random Sampling/Random Assignment | Random Assignment (RA) - Yes (controlled experiment) | Random Assignment (RA) - No (observational study) |
| --- | --- | --- |
| Random Sampling (RS) - Yes (or some method that results in a representative sample of population of interest) | Because we have RS, we can generalize inferences to the population the RS was taken from. Because we have RA we can assume the groups were equivalent on all aspects except for the treatment and can establish causal inference. | Can generalize inference to population RS was taken from but cannot establish causal inference (no RA - cannot isolate treatment variable as only difference among groups, could be confounding variables). |
| Random Sampling (RS) - No (usually a convenience sample) | Cannot generalize inference to the population of interest because the sample was not random and could be biased - may not be "representative" of the population of interest. Can establish causal inference due to RA → the inference from this type of study applies only to the sample. | Cannot generalize inference to the population of interest because the sample was not random and could be biased - may not be "representative" of the population of interest. Cannot establish causal inference due to lack of RA of the treatment. |
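
The four cells of Table 1-1 reduce to two independent yes/no questions, which can be sketched as a small lookup. This is only an illustration of the table's logic; the function name `scope_of_inference` and the dictionary keys are hypothetical and not part of the original material.

```python
def scope_of_inference(random_sample: bool, random_assignment: bool) -> dict:
    """Summarize Table 1-1 (hypothetical helper, for illustration only).

    random_sample:     True if the subjects are a random (or otherwise
                       representative) sample from the population of interest.
    random_assignment: True if treatments were randomly assigned to subjects.
    """
    return {
        # RS controls who the conclusions apply to.
        "generalize_to_population": random_sample,
        # RA controls whether differences can be attributed to the treatment.
        "causal_inference": random_assignment,
    }


# A convenience sample with randomly assigned treatments:
# causal conclusions, but only for the subjects in the sample.
print(scope_of_inference(random_sample=False, random_assignment=True))
```

Keeping the two questions separate mirrors the table: RS and RA answer different questions, and having one tells you nothing about the other.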

To consider the impacts of RA versus observational studies, we need to be comparing groups. Suppose that we are interested in differences in the mean GPAs for different sections of STAT 217 and that we take a random sample of students from each section, compare the results, and find evidence of some difference. In this scenario, we can conclude that there is some difference in the population of STAT 217 students, but we can't say that being in different sections caused the differences in the mean GPAs. Now suppose that we randomly assigned every STAT 217 student to get extra training in one of three different study techniques and found evidence of differences among the training methods. We could conclude that the training methods caused the differences in these students. These conclusions would only apply to STAT 217 students and could not be generalized to a larger population of students. Finally, suppose we took a random sample of STAT 217 students (say only 10 from each section) and then randomly assigned them to one of three training programs. If evidence of differences is found, then we can say that the training programs caused the differences and we can say that we have evidence that those differences pertain to the population of STAT 217 students. This seems similar to the scenario where all STAT 217 students participated in the training programs except that by using random sampling, only a fraction of the population needs to actually be studied to make inferences to the entire population of interest - saving time and money.

A quick summary of the terminology of hypothesis testing is useful at this point. The null hypothesis (H₀) states that there is no difference or no relationship in the population. This is the statement of no effect or no difference and the claim that we are trying to find evidence against. In this chapter, it is always H₀: μ₁ = μ₂. When doing two-group problems, you always need to specify which group is 1 and which is 2. The alternative hypothesis (H₁ or H_A) states a specific difference between parameters. This is the research hypothesis and the claim about the population that we hope to demonstrate is more reasonable to conclude than the null hypothesis. In the two-group situation, we can have one-sided alternatives of H_A: μ₁ > μ₂ (greater than) or H_A: μ₁ < μ₂ (less than) or, the more common, two-sided alternative of H_A: μ₁ ≠ μ₂ (not equal to). We usually default to using two-sided tests because we often do not know enough to know the direction of a difference in advance, especially in more complicated situations. The sampling distribution is the distribution of a statistic under the assumption that H₀ is true and is used to calculate the p-value, the probability of obtaining a result as extreme or more extreme than what we observed given that the null hypothesis is true. We will find sampling distributions using nonparametric approaches (like the permutation approach used above) and parametric methods (using "named" distributions like the t, F, and χ²).
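
As a rough illustration of how a permutation approach turns the definition of the p-value into a calculation, the sketch below shuffles group labels many times and counts how often the shuffled difference in sample means is at least as extreme as the observed one. The data values, the function name `permutation_p_value`, and the choice of 10,000 shuffles are assumptions made for this example only; this is not necessarily the procedure used elsewhere in this chapter.

```python
import random
from statistics import fmean


def permutation_p_value(group1, group2, n_permutations=10_000, seed=0):
    """Approximate two-sided permutation p-value for H0: mu1 = mu2.

    The test statistic is the difference in sample means. Under H0 the
    group labels are exchangeable, so we reshuffle them repeatedly.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = fmean(group1) - fmean(group2)
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled_diff = fmean(pooled[:n1]) - fmean(pooled[n1:])
        # "As extreme or more extreme" in either direction (two-sided).
        if abs(shuffled_diff) >= abs(observed):
            as_extreme += 1
    return as_extreme / n_permutations


# Hypothetical responses for two groups (not data from the text).
group1 = [5.1, 6.3, 7.0, 5.8, 6.6, 7.2]
group2 = [3.9, 4.4, 5.0, 4.1, 5.3, 4.7]
p_value = permutation_p_value(group1, group2)
print(f"Approximate two-sided p-value: {p_value:.4f}")
```

Because the shuffles are random, the result is an approximation of the permutation p-value; more shuffles give a more stable estimate.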

Small p-values are evidence against the null hypothesis because the observed result is unlikely due to chance if H₀ is true. Large p-values provide no evidence against H₀ but do not allow us to conclude that there is no difference. The level of significance is an a priori definition of how small the p-value needs to be to provide "enough" (sufficient) evidence against H₀. This is most useful to prevent sliding the standards after the results are found. We compare the p-value to the level of significance to decide if the p-value is small enough to constitute sufficient evidence to reject the null hypothesis. We use α to denote the level of significance and most typically use 0.05, which we refer to as the 5% significance level. We compare the p-value to this level and make a decision. The two options for decisions are to either reject the null hypothesis if the p-value ≤ α or fail to reject the null hypothesis if the p-value > α. When interpreting hypothesis testing results, remember that the p-value is a measure of how unlikely the observed outcome was, assuming that the null hypothesis is true. It is NOT the probability of the data or the probability of either hypothesis being true. The p-value is a measure of evidence against the null hypothesis.

The specific definition of α is that it is the probability of rejecting H₀ when H₀ is true, the probability of what is called a Type I error. Type I errors are also called false rejections. In the two-group mean situation, a Type I error would be concluding that there is a difference in the true means between the groups when none really exists in the population. In the courtroom setting, this is like falsely finding someone guilty. We don't want to do this very often, so we use small values of the significance level, allowing us to control the rate of Type I errors at α. We also have to worry about Type II errors, which are failing to reject the null hypothesis when it's false. In a courtroom, this is the same as failing to convict a guilty person. This most often occurs due to a lack of evidence. You can use Table 1-2 to help you remember all the possibilities.

*Table 1-2: Table of decisions and truth scenarios in a hypothesis testing situation. We never know the truth in a real situation.*
| Decision | H₀ True | H₀ False |
| --- | --- | --- |
| Fail to reject (FTR) H₀ | Correct decision | Type II error |
| Reject H₀ | Type I error | Correct decision |
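
One way to see that α controls the Type I error rate is to simulate many studies in which H₀ really is true and record how often the test rejects. The sketch below does this for a deliberately simple case: a two-sample z-test with a known common standard deviation, which is an assumption chosen so the example needs only the Python standard library. The function names, sample size, and number of simulations are all hypothetical choices for illustration.

```python
import random
from statistics import NormalDist, fmean


def z_test_p_value(sample1, sample2, sigma):
    """Two-sided p-value for H0: mu1 = mu2, assuming a known common SD sigma."""
    se = sigma * (1 / len(sample1) + 1 / len(sample2)) ** 0.5
    z = (fmean(sample1) - fmean(sample2)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(mu1, mu2, sigma=1.0, n=20, alpha=0.05, n_sims=5000, seed=1):
    """Fraction of simulated studies in which H0 is rejected at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        sample1 = [rng.gauss(mu1, sigma) for _ in range(n)]
        sample2 = [rng.gauss(mu2, sigma) for _ in range(n)]
        if z_test_p_value(sample1, sample2, sigma) <= alpha:
            rejections += 1
    return rejections / n_sims


# H0 is true here (both true means are 0), so every rejection is a Type I error.
type1_rate = rejection_rate(mu1=0.0, mu2=0.0)
print(f"Simulated Type I error rate: {type1_rate:.3f}")
```

With H₀ true, the simulated rejection rate should land close to the chosen α, up to simulation noise.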

In comparing different procedures, there is an interest in studying the rate or probability of Type I and II errors. The probability of a Type I error was defined previously as α, the significance level. The power of a procedure is the probability of rejecting the null hypothesis when it is false. Power is defined as power = 1 - Probability(Type II error) = Probability(Reject H₀ | H₀ is false), or, in words, the probability of detecting a difference when it actually exists. We want to use a statistical procedure that controls the Type I error rate at the pre-specified level and has high power to detect false null hypotheses. Increasing the sample size is one of the most commonly used methods for increasing the power in a given situation, but sometimes we can choose among different procedures and use the power of the procedures to help us make that selection. Note that there are many ways to make H₀ false and the power changes based on how false the null hypothesis actually is. To make this concrete, suppose that the true mean sentences differed by either 1 or 20 years in the previous example. The chances of rejecting the null hypothesis are much larger when the groups actually differ by 20 years than if they differ by just 1 year.
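
To put rough numbers on the 1-year versus 20-year comparison, the sketch below computes power for a two-sided, two-sample z-test with a known common standard deviation. The z-test itself, the standard deviation of 10 years, and the 20 observations per group are assumptions invented for this illustration; they are not taken from the sentencing example.

```python
from statistics import NormalDist


def z_test_power(true_diff, sigma, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample z-test with known common SD sigma.

    Power = P(reject H0 | true difference in means is true_diff).
    """
    std_normal = NormalDist()
    se = sigma * (2 / n_per_group) ** 0.5       # SE of the difference in means
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # rejection cutoff for |z|
    shift = true_diff / se                      # how far the truth is from H0, in SEs
    # Probability that z lands in either rejection tail.
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)


# Hypothetical setting: SD of 10 years, 20 observations per group.
for diff in (1, 20):
    power = z_test_power(true_diff=diff, sigma=10, n_per_group=20)
    print(f"True difference of {diff} years: power = {power:.3f}")

# Larger samples raise the power for the same true difference.
print(f"Difference of 1 year, n = 200 per group: "
      f"power = {z_test_power(1, 10, 200):.3f}")
```

When the true difference is 0 the function returns α, which matches the definition of the Type I error rate: rejecting a true null is the only kind of rejection possible in that case.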

After making a decision (was there enough evidence to reject the null or not), we want to make the conclusions specific to the problem of interest. If we reject H₀, then we can conclude that there was sufficient evidence at the α-level that the null hypothesis is wrong (and the results point in the direction of the alternative). If we fail to reject H₀ (FTR H₀), then we can conclude that there was insufficient evidence at the α-level to say that the null hypothesis is wrong. We are NOT saying that the null is correct and we NEVER accept the null hypothesis. We just failed to find enough evidence to say it's wrong. If we find sufficient evidence to reject the null, then we need to revisit the method of data collection and design of the study. This allows us to consider the scope of the inferences we can make. Can we discuss causality (due to RA) and/or make inferences to a larger group than those in the sample (due to RS)?

To perform a hypothesis test, there are some steps to complete to make sure you have thought through all the aspects of the results.

Outline of 6+ steps to perform a Hypothesis Test

0) Isolate the claim to be proved, method to use (define a test statistic T), and significance level

1) Write the null and alternative hypotheses

2) Assess the "Things To Check" for the procedure being used (discussed below)

3) Find the value of the appropriate test statistic

4) Find the p-value

5) Make a decision

6) Write a conclusion specific to the problem, including scope of inference discussion
