Statology

How to Use the linearHypothesis() Function in R

You can use the linearHypothesis() function from the car package in R to test linear hypotheses in a specific regression model.

This function uses the following basic syntax:
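In general form, where fit is a fitted regression model and var1 and var2 are placeholder predictor names:

library(car)

linearHypothesis(fit, c("var1=0", "var2=0"))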

This particular example tests if the regression coefficients var1 and var2 in the model called fit are jointly equal to zero.

The following example shows how to use this function in practice.

Example: How to Use the linearHypothesis() Function in R

Suppose we have the following data frame in R that shows the number of hours spent studying, number of practice exams taken, and final exam score for 10 students in some class:
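For illustration, a data frame of this form can be created as follows (the values below are hypothetical stand-ins, so they will not reproduce the exact test statistics quoted later in this example):

df <- data.frame(hours      = c(1, 1, 2, 2, 2, 3, 3, 4, 5, 5),
                 prac_exams = c(1, 2, 2, 2, 3, 2, 3, 4, 4, 5),
                 score      = c(68, 72, 75, 70, 78, 80, 85, 88, 90, 94))

df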

Now suppose we would like to fit the following multiple linear regression model in R:

Exam score = β0 + β1(hours) + β2(practice exams)

We can use the lm() function to fit this model:
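Assuming the data frame above is named df, the call looks like this:

fit <- lm(score ~ hours + prac_exams, data = df)

summary(fit)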

Now suppose we would like to test whether the coefficients for hours and prac_exams are both equal to zero.

We can use the linearHypothesis() function to do so:
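A sketch of the call (with the stand-in data above, the resulting F statistic and p-value will not match the values quoted below):

library(car)

linearHypothesis(fit, c("hours=0", "prac_exams=0"))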

The hypothesis test returns the following values:

  • F test statistic : 14.035
  • p-value : .003553

This particular hypothesis test uses the following null and alternative hypotheses:

  • H 0 : Both regression coefficients are equal to zero.
  • H A : At least one regression coefficient is not equal to zero.

Since the p-value of the test (.003553) is less than .05, we reject the null hypothesis.

In other words, we have sufficient evidence to conclude that the regression coefficients for hours and prac_exams are not both equal to zero.

Additional Resources

The following tutorials provide additional information about linear regression in R:

  • How to Interpret Regression Output in R
  • How to Perform Simple Linear Regression in R
  • How to Perform Multiple Linear Regression in R
  • How to Perform Logistic Regression in R


Published by Zach


Linear Hypothesis Tests

Most regression output will include the results of frequentist hypothesis tests comparing each coefficient to 0. However, in many cases you may be interested in whether a linear combination of the coefficients is 0. For example, consider the regression \(Y = \beta_0 + \beta_1 GoodThing + \beta_2 BadThing + \varepsilon\).

You may be interested to see if \(GoodThing\) and \(BadThing\) (both binary variables) cancel each other out. So you would want to do a test of \(\beta_1 - \beta_2 = 0\).

Alternately, you may want to do a joint significance test of multiple linear hypotheses. For example, you may be interested in whether either \(\beta_1\) or \(\beta_2\) is nonzero, and so would jointly test the hypotheses \(\beta_1 = 0\) and \(\beta_2 = 0\) rather than testing them one at a time. Note the and here: the joint null requires both restrictions to hold, so evidence against either one is enough to reject the joint null.

Keep in Mind

  • Be sure to carefully interpret the result. If you are doing a joint test, rejection means that at least one of your hypotheses can be rejected, not each of them. And you don’t necessarily know which ones can be rejected!
  • Generally, linear hypothesis tests are performed using F-statistics. However, there are alternate approaches such as likelihood-ratio tests or chi-squared tests. Be sure you know which one you’re getting.
  • Conceptually, what is going on with linear hypothesis tests is that they compare the model you’ve estimated against a more restrictive one that requires your restrictions (hypotheses) to be true. If the test you have in mind is too complex for the software to figure out on its own, you can often do it by hand: take the sum of squared residuals from your original unrestricted model (\(SSR_{UR}\)), estimate the alternate model with the restriction in place to get \(SSR_R\), and then calculate the F-statistic for the joint test as \(F_{q,n-k-1} = ((SSR_R - SSR_{UR})/q)/(SSR_{UR}/(n-k-1))\), where \(q\) is the number of restrictions, \(n\) the number of observations, and \(k\) the number of regressors. A minimal R sketch of this calculation follows this list.
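As a rough sketch in R (the data frame df and the variables y, x1, and x2 are illustrative, and the restricted model here simply drops both regressors):

unrestricted <- lm(y ~ x1 + x2, data = df)
restricted   <- lm(y ~ 1, data = df)       # model with the restriction beta1 = beta2 = 0 imposed

SSR_ur <- sum(resid(unrestricted)^2)
SSR_r  <- sum(resid(restricted)^2)

q <- 2                                     # number of restrictions
n <- nrow(df)
k <- 2                                     # number of regressors in the unrestricted model

F_stat <- ((SSR_r - SSR_ur) / q) / (SSR_ur / (n - k - 1))
pf(F_stat, df1 = q, df2 = n - k - 1, lower.tail = FALSE)   # p-value for the joint test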

Also Consider

  • The process for testing a nonlinear combination of your coefficients, for example testing if \(\beta_1\times\beta_2 = 1\) or \(\sqrt{\beta_1} = .5\), is generally different. See Nonlinear hypothesis tests .

Implementations

Linear hypothesis tests in R can be performed for most regression models using the linearHypothesis() function in the car package. See this guide for more information.
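For example, using the built-in mtcars data (the choice of variables here is purely illustrative):

library(car)

fit <- lm(mpg ~ hp + wt, data = mtcars)

linearHypothesis(fit, c("hp = 0", "wt = 0"))   # joint test that both coefficients are zero
linearHypothesis(fit, "hp = wt")               # test that the two coefficients are equal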

Tests of coefficients in Stata can generally be performed using the built-in test command.

Life With Data

  • by bprasad26

How to Use the linearHypothesis() Function in R


The linearHypothesis() function is a valuable statistical tool in R programming. It’s provided in the car package and is used to perform hypothesis testing for a linear model’s coefficients.

To fully grasp the utility of linearHypothesis() , we must understand the basic principles of linear regression and hypothesis testing in the context of model fitting.

Understanding Hypothesis Testing in Regression Analysis

In regression analysis, it’s common to perform hypothesis tests on the model’s coefficients to determine whether the predictors are statistically significant. The null hypothesis asserts that the predictor has no effect on the outcome variable, i.e., its coefficient equals zero. Rejecting the null hypothesis (based on a small p-value, usually less than 0.05) suggests that there’s a statistically significant relationship between the predictor and the outcome variable.

The linearHypothesis() Function

linearHypothesis() is a function in R that tests the general linear hypothesis for a model object for which a formula method exists, using a specified test statistic. It allows the user to define a broader set of null hypotheses than just assuming individual coefficients equal to zero.

The linearHypothesis() function can be especially useful for comparing nested models or testing whether a group of variables significantly contributes to the model.

Here’s the basic usage of linearHypothesis() :
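In outline (argument defaults are abbreviated here; see the car documentation for the full signature):

linearHypothesis(model, hypothesis.matrix, rhs = NULL, test = c("F", "Chisq"), ...)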

In this function:

  • model is the model object for which the linear hypothesis is to be tested.
  • hypothesis.matrix specifies the null hypotheses.
  • rhs is the right-hand side of the linear hypotheses; typically set to 0.
  • ... are additional arguments, such as the test argument to specify the type of test statistic to be used (“F” for F-test, “Chisq” for chi-squared test, etc.).

Installing and Loading the Required Package

linearHypothesis() is part of the car package. If you haven’t installed this package yet, you can do so using the following command:
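install.packages("car")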

Once installed, load it into your R environment with the library() function:
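library(car)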

Using linearHypothesis() in Practice

Let’s demonstrate the use of linearHypothesis() with a practical example. We’ll use the mtcars dataset that’s built into R. This dataset comprises various car attributes, and we’ll model miles per gallon (mpg) based on horsepower (hp), weight (wt), and the number of cylinders (cyl).

We first fit a linear model using the lm() function:
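A sketch of that fit (the object name model is assumed and carried through the examples below):

model <- lm(mpg ~ hp + wt + cyl, data = mtcars)

summary(model)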

Let’s say we want to test the hypothesis that the coefficients for hp and wt are equal to zero. We can set up this hypothesis test using linearHypothesis() :
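One way to write this joint hypothesis (the equation-style strings are one of several formats linearHypothesis() accepts):

linearHypothesis(model, c("hp = 0", "wt = 0"))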

This command will output the Residual Sum of Squares (RSS) for the model under the null hypothesis, the RSS for the full model, the test statistic, and the p-value for the test. A low p-value suggests that we should reject the null hypothesis.

Using linearHypothesis() for Testing Nested Models

linearHypothesis() can also be useful for testing nested models, i.e., comparing a simpler model to a more complex one where the simpler model is a special case of the complex one.

For instance, suppose we want to test if both hp and wt can be dropped from our model without a significant loss of fit. We can formulate this as the null hypothesis that the coefficients for hp and wt are simultaneously zero:
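A sketch of this comparison; fitting the reduced model explicitly and comparing it with anova() should give the same F-test as the linearHypothesis() call:

reduced <- lm(mpg ~ cyl, data = mtcars)

anova(reduced, model)                            # F-test comparing the nested models

linearHypothesis(model, c("hp = 0", "wt = 0"))   # the equivalent joint restriction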

This gives a p-value for the F-test of the hypothesis that these coefficients are zero. If the p-value is small, we reject the null hypothesis and conclude that dropping these predictors from the model would significantly degrade the model fit.

Limitations and Considerations

The linearHypothesis() function is a powerful tool for hypothesis testing in the context of model fitting. However, it’s important to consider the limitations and assumptions of this function. The linearHypothesis() function assumes that the errors of the model are normally distributed and have equal variance. Violations of these assumptions can lead to incorrect results.

As with any statistical function, it’s crucial to have a good understanding of your data and the theory behind the statistical methods you’re using.

The linearHypothesis() function in R is a powerful tool for testing linear hypotheses about a model’s coefficients. This function is very flexible and can be used in various scenarios, including testing the significance of individual predictors and comparing nested models.

Understanding and properly using linearHypothesis() can enhance your data analysis capabilities and help you extract meaningful insights from your data.


Summary and Analysis of Extension Program Evaluation in R

Salvatore S. Mangiafico


Hypothesis Testing and p-values


Initial comments

Traditionally when students first learn about the analysis of experiments, there is a strong focus on hypothesis testing and making decisions based on p-values.  Hypothesis testing is important for determining if there are statistically significant effects.  However, readers of this book should not place undue emphasis on p-values.  Instead, they should realize that p-values are affected by sample size, and that a low p-value does not necessarily suggest a large effect or a practically meaningful effect.  Summary statistics, plots, effect size statistics, and practical considerations should be used.  The goal is to determine: a) statistical significance, b) effect size, and c) practical importance.  These are all different concepts, and they will be explored below.

Statistical inference

Most of what we’ve covered in this book so far is about producing descriptive statistics: calculating means and medians, plotting data in various ways, and producing confidence intervals.  The bulk of the rest of this book will cover statistical inference:  using statistical tests to draw some conclusion about the data.  We’ve already done this a little bit in earlier chapters by using confidence intervals to conclude if means are different or not among groups.

As Dr. Nic mentions in her article in the “References and further reading” section, this is the part where people sometimes get stumped.  It is natural for most of us to use summary statistics or plots, but jumping to statistical inference needs a little change in perspective.  The idea of using some statistical test to answer a question isn’t a difficult concept, but some of the following discussion gets a little theoretical.  The video from the Statistics Learning Center in the “References and further reading” section does a good job of explaining the basis of statistical inference.

One important thing to gain from this chapter is an understanding of how to use the p -value, alpha , and decision rule to test the null hypothesis.  But once you are comfortable with that, you will want to return to this chapter to have a better understanding of the theory behind this process.

Another important thing is to understand the limitations of relying on p -values, and why it is important to assess the size of effects and weigh practical considerations.

Packages used in this chapter

The packages used in this chapter include the lsr package, which provides the cohensD() function used later in the chapter.

The following commands will install these packages if they are not already installed:

if(!require(lsr)){install.packages("lsr")}

Hypothesis testing

The null and alternative hypotheses.

The statistical tests in this book rely on testing a null hypothesis, which has a specific formulation for each test.  The null hypothesis always describes the case where e.g. two groups are not different or there is no correlation between two variables, etc.

The alternative hypothesis is the contrary of the null hypothesis, and so describes the cases where there is a difference among groups or a correlation between two variables, etc.

Notice that the definitions of null hypothesis and alternative hypothesis have nothing to do with what you want to find or don't want to find, or what is interesting or not interesting, or what you expect to find or what you don’t expect to find.  If you were comparing the height of men and women, the null hypothesis would be that the height of men and the height of women were not different.  Yet, you might find it surprising if you found this hypothesis to be true for some population you were studying.  Likewise, if you were studying the income of men and women, the null hypothesis would be that the income of men and women are not different, in the population you are studying.  In this case you might be hoping the null hypothesis is true, though you might be unsurprised if the alternative hypothesis were true.  In any case, the null hypothesis will take the form that there is no difference between groups, there is no correlation between two variables, or there is no effect of this variable in our model.

p -value definition

Most of the tests in this book rely on using a statistic called the p -value to evaluate if we should reject, or fail to reject, the null hypothesis.

Given the assumption that the null hypothesis is true , the p -value is defined as the probability of obtaining a result equal to or more extreme than what was actually observed in the data.

We’ll unpack this definition in a little bit.

Decision rule

The p -value for the given data will be determined by conducting the statistical test.

This p -value is then compared to a pre-determined value alpha .  Most commonly, an alpha value of 0.05 is used, but there is nothing magic about this value.

If the p -value for the test is less than alpha , we reject the null hypothesis.

If the p -value is greater than or equal to alpha , we fail to reject the null hypothesis.

Coin flipping example

For an example of using the p -value for hypothesis testing, imagine you have a coin you will toss 100 times.  The null hypothesis is that the coin is fair—that is, that it is equally likely that the coin will land on heads as land on tails.  The alternative hypothesis is that the coin is not fair.  Let’s say for this experiment you throw the coin 100 times and it lands on heads 95 times out of those hundred.  The p -value in this case would be the probability of getting 95, 96, 97, 98, 99, or 100 heads, or 0, 1, 2, 3, 4, or 5 heads, assuming that the null hypothesis is true . 

This is what we call a two-sided test, since we are testing both extremes suggested by our data:  getting 95 or greater heads or getting 95 or greater tails.  In most cases we will use two sided tests.

You can imagine that the p -value for this data will be quite small.  If the null hypothesis is true, and the coin is fair, there would be a low probability of getting 95 or more heads or 95 or more tails.

Using a binomial test, the p -value is < 0.0001.

(Actually, R reports it as < 2.2e-16, which is shorthand for the number in scientific notation, 2.2 x 10^-16, which is 0.00000000000000022, with 15 zeros after the decimal point.)

Assuming an alpha of 0.05, since the p -value is less than alpha , we reject the null hypothesis.  That is, we conclude that the coin is not fair.

binom.test(5, 100, 0.5)

Exact binomial test

number of successes = 5, number of trials = 100, p-value < 2.2e-16

alternative hypothesis: true probability of success is not equal to 0.5

Passing and failing example

As another example, imagine we are considering two classrooms, and we have counts of students who passed a certain exam.  We want to know if one classroom had statistically more passes or failures than the other.

In our example each classroom will have 10 students.  The data is arranged into a contingency table.

Classroom   Passed   Failed
A              8        2
B              3        7

We will use Fisher’s exact test to test if there is an association between Classroom and the counts of passed and failed students.  The null hypothesis is that there is no association between Classroom and Passed/Failed , based on the relative counts in each cell of the contingency table.

Input =("  Classroom  Passed  Failed  A          8       2  B          3       7 ") Matrix = as.matrix(read.table(textConnection(Input),                    header=TRUE,                    row.names=1)) Matrix 

  Passed Failed
A      8      2
B      3      7

fisher.test(Matrix)

Fisher's Exact Test for Count Data

p-value = 0.06978

The reported p -value is 0.070.  If we use an alpha of 0.05, then the p -value is greater than alpha , so we fail to reject the null hypothesis.  That is, we did not have sufficient evidence to say that there is an association between Classroom and Passed/Failed .

More extreme data in this case would be if the counts in the upper left or lower right (or both!) were greater. 

Classroom   Passed   Failed
A              9        1
B              3        7

Classroom   Passed   Failed
A             10        0
B              3        7

and so on, with Classroom B...

In most cases we would want to consider as "extreme" not only the results when Classroom A has a high frequency of passing students, but also results when Classroom B has a high frequency of passing students.  This is called a two-sided or two-tailed test.  If we were only concerned with one classroom having a high frequency of passing students, relatively, we would instead perform a one-sided test.  The default for the fisher.test function is two-sided, and usually you will want to use two-sided tests.

Classroom   Passed   Failed
A              2        8
B              7        3

Classroom   Passed   Failed
A              1        9
B              7        3

Classroom   Passed   Failed
A              0       10
B              7        3

and so on, with Classroom B...

In both cases, "extreme" means there is a stronger association between Classroom and Passed/Failed .

Theory and practice of using p -values

Wait, does this make any sense?

Recall that the definition of the p-value is: given the assumption that the null hypothesis is true, the probability of obtaining a result equal to or more extreme than what was actually observed in the data.

The astute reader might be asking herself, “If I’m trying to determine if the null hypothesis is true or not, why would I start with the assumption that the null hypothesis is true?  And why am I using a probability of getting certain data given that a hypothesis is true?  Don’t I want to instead determine the probability of the hypothesis given my data?”

The answer is yes , we would like a method to determine the likelihood of our hypothesis being true given our data, but we use the Null Hypothesis Significance Test approach since it is relatively straightforward, and has wide acceptance historically and across disciplines.

In practice we do use the results of the statistical tests to reach conclusions about the null hypothesis.

Technically, the p -value says nothing about the alternative hypothesis.  But logically, if the null hypothesis is rejected, then its logical complement, the alternative hypothesis, is supported.  Practically, this is how we handle significant p -values, though this practical approach generates disapproval in some theoretical circles.

Statistics is like a jury?

Note the language used when testing the null hypothesis.  Based on the results of our statistical tests, we either reject the null hypothesis, or fail to reject the null hypothesis.

This is somewhat similar to the approach of a jury in a trial.  The jury either finds sufficient evidence to declare someone guilty, or fails to find sufficient evidence to declare someone guilty. 

Failing to convict someone isn’t necessarily the same as declaring someone innocent.  Likewise, if we fail to reject the null hypothesis, we shouldn’t assume that the null hypothesis is true.  It may be that we didn’t have sufficient samples to get a result that would have allowed us to reject the null hypothesis, or maybe there are some other factors affecting the results that we didn’t account for.  This is similar to an “innocent until proven guilty” stance.

Errors in inference

For the most part, the statistical tests we use are based on probability, and our data could always be the result of chance.  Considering the coin flipping example above, if we did flip a coin 100 times and came up with 95 heads, we would be compelled to conclude that the coin was not fair.  But 95 heads could happen with a fair coin strictly by chance.

We can, therefore, make two kinds of errors in testing the null hypothesis:

•  A Type I error occurs when the null hypothesis really is true, but based on our decision rule we reject the null hypothesis.  In this case, our result is a false positive; we think there is an effect (unfair coin, association between variables, difference among groups) when really there isn’t.  The probability of making this kind of error is alpha, the same alpha we used in our decision rule.

•  A Type II error occurs when the null hypothesis is really false, but based on our decision rule we fail to reject the null hypothesis.  In this case, our result is a false negative ; we have failed to find an effect that really does exist.  The probability of making this kind of error is called beta .

The following table summarizes these errors.

                            Reality
Decision of test            Null is true               Null is false
Reject null hypothesis      Type I error               Correctly reject null
                            (prob. = alpha)            (prob. = 1 – beta)
Retain null hypothesis      Correctly retain null      Type II error
                            (prob. = 1 – alpha)        (prob. = beta)

Statistical power

The statistical power of a test is a measure of the ability of the test to detect a real effect.  It is related to the effect size, the sample size, and our chosen alpha level. 

The effect size is a measure of how unfair a coin is, how strong the association is between two variables, or how large the difference is among groups.  As the effect size increases or as the number of observations we collect increases, or as the alpha level increases, the power of the test increases.

Statistical power in the table above is indicated by 1 – beta , and power is the probability of correctly rejecting the null hypothesis.

An example should make these relationships clear.  Imagine we are sampling a large group of 7th-grade students for their height.  That is, the group is the population, and we are sampling a subset of these students.  In reality, for students in the population, the girls are taller than the boys, but the difference is small (that is, the effect size is small), and there is a lot of variability in students’ heights.  You can imagine that in order to detect the difference between girls and boys, we would have to measure many students.  If we fail to sample enough students, we might make a Type II error.  That is, we might fail to detect the actual difference in heights between sexes.

If we had a different experiment with a larger effect size—for example the weight difference between mature hamsters and mature hedgehogs—we might need fewer samples to detect the difference.

Note also that our chosen alpha plays a role in the power of our test, too.  All things being equal, across many tests, if we decrease our alpha, that is, insist on a lower rate of Type I errors, we are more likely to commit a Type II error, and so have lower power.  This is analogous to a meticulous jury that has a very high standard of proof to convict someone.  In this case, the likelihood of a false conviction is low, but the likelihood of letting a guilty person go free is relatively high.
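As an illustration, base R’s power.t.test() shows how power changes with sample size and alpha when the effect size (delta) and variability (sd) are held fixed; the numbers below are made up:

power.t.test(n = 20,  delta = 2, sd = 10, sig.level = 0.05)$power   # small sample
power.t.test(n = 200, delta = 2, sd = 10, sig.level = 0.05)$power   # larger sample: higher power
power.t.test(n = 20,  delta = 2, sd = 10, sig.level = 0.01)$power   # smaller alpha: lower power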

The 0.05 alpha value is not dogma

The level of alpha is traditionally set at 0.05 in some disciplines, though there is sometimes reason to choose a different value.

One situation in which the alpha level is increased is in preliminary studies in which it is better to include potentially significant effects even if there is not strong evidence for keeping them.  In this case, the researcher is accepting an inflated chance of Type I errors in order to decrease the chance of Type II errors.

Imagine an experiment in which you wanted to see if various environmental treatments would improve student learning.  In a preliminary study, you might have many treatments, with few observations each, and you want to retain any potentially successful treatments for future study.  For example, you might try playing classical music, improved lighting, complimenting students, and so on, and see if there is any effect on student learning.  You might relax your alpha value to 0.10 or 0.15 in the preliminary study to see what treatments to include in future studies.

On the other hand, in situations where a Type I, false positive, error might be costly in terms of money or people’s health, a lower alpha can be used, perhaps, 0.01 or 0.001.  You can imagine a case in which there is an established treatment for cancer, and a new treatment is being tested.  Because the new treatment is likely to be expensive and to hold people’s lives in the balance, a researcher would want to be very sure that the new treatment is more effective than the established treatment.  In reality, the researchers would not just lower the alpha level, but also look at the effect size, submit the research for peer review, replicate the study, be sure there were no problems with the design of the study or the data collection, and weigh the practical implications.

The 0.05 alpha value is almost dogma

In theory, as a researcher, you would determine the alpha level you feel is appropriate.  That is, the probability of making a Type I error when the null hypothesis is in fact true. 

In reality, though, 0.05 is almost always used in most fields for readers of this book.  Choosing a different alpha value will rarely go without question.  It is best to keep with the 0.05 level unless you have good justification for another value, or are in a discipline where other values are routinely used.

Practical advice

One good practice is to report actual p-values from analyses.  It is fine to also simply say, e.g. “The dependent variable was significantly correlated with variable A (p < 0.05).”  But I prefer when possible to say, “The dependent variable was significantly correlated with variable A (p = 0.026).”

It is probably best to avoid using terms like “marginally significant” or “borderline significant” for p -values less than 0.10 but greater than 0.05, though you might encounter similar phrases.  It is better to simply report the p -values of tests or effects in straight-forward manner.  If you had cause to include certain model effects or results from other tests, they can be reported as e.g., “Variables correlated with the dependent variable with p < 0.15 were A , B , and C .”

Is the null hypothesis ever really true?

Considering some of the examples presented, it may have occurred to the reader to ask if the null hypothesis is ever really true.  For example, in some population of 7th graders, if we could measure everyone in the population to a high degree of precision, then there must be some difference in height between girls and boys.  This is an important limitation of null hypothesis significance testing.  Often, if we have many observations, even small effects will be reported as significant.  This is one reason why it is important to not rely too heavily on p-values, but to also look at the size of the effect and practical considerations.  In this example, if we sampled many students and the difference in heights was 0.5 cm, even if significant, we might decide that this effect is too small to be of practical importance, especially relative to an average height of 150 cm.  (Here, the difference would be 0.3% of the average height.)

Effect sizes and practical importance

Practical importance and statistical significance.

It is important to remember to not let p -values be the only guide for drawing conclusions.  It is equally important to look at the size of the effects you are measuring, as well as take into account other practical considerations like the costs of choosing a certain path of action.

For example, imagine we want to compare the SAT scores of two SAT preparation classes with a t -test.

Class.A = c(1500, 1505, 1505, 1510, 1510, 1510, 1515, 1515, 1520, 1520)
Class.B = c(1510, 1515, 1515, 1520, 1520, 1520, 1525, 1525, 1530, 1530)

t.test(Class.A, Class.B)

Welch Two Sample t-test

t = -3.3968, df = 18, p-value = 0.003214

mean of x mean of y
     1511      1521

The p -value is reported as 0.003, so we would consider there to be a significant difference between the two classes ( p < 0.05).

But we have to ask ourselves the practical question, is a difference of 10 points on the SAT large enough for us to care about?  What if enrolling in one class costs significantly more than the other class?  Is it worth the extra money for a difference of 10 points on average?

Sizes of effects

It should be remembered that p -values do not indicate the size of the effect being studied.  It shouldn’t be assumed that a small p -value indicates a large difference between groups, or vice-versa. 

For example, in the SAT example above, the p -value is fairly small, but the size of the effect (difference between classes) in this case is relatively small (10 points, especially small relative to the range of scores students receive on the SAT).

In converse, there could be a relatively large size of the effects, but if there is a lot of variability in the data or the sample size is not large enough, the p -value could be relatively large. 

In this example, the SAT scores differ by 100 points between classes, but because the variability is greater than in the previous example, the p -value is not significant.

Class.C = c(1000, 1100, 1200, 1250, 1300, 1300, 1400, 1400, 1450, 1500)
Class.D = c(1100, 1200, 1300, 1350, 1400, 1400, 1500, 1500, 1550, 1600)

t.test(Class.C, Class.D)

Welch Two Sample t-test

t = -1.4174, df = 18, p-value = 0.1735

mean of x mean of y
     1290      1390

boxplot(cbind(Class.C, Class.D))


p -values and sample sizes

It should also be remembered that p -values are affected by sample size.   For a given effect size and variability in the data, as the sample size increases, the p -value is likely to decrease.  For large data sets, small effects can result in significant p -values.

As an example, let’s take the data from Class.C and Class.D and double the number of observations for each without changing the distribution of the values in each, and rename them Class.E and Class.F .

Class.E = c(1000, 1100, 1200, 1250, 1300, 1300, 1400, 1400, 1450, 1500,
            1000, 1100, 1200, 1250, 1300, 1300, 1400, 1400, 1450, 1500)
Class.F = c(1100, 1200, 1300, 1350, 1400, 1400, 1500, 1500, 1550, 1600,
            1100, 1200, 1300, 1350, 1400, 1400, 1500, 1500, 1550, 1600)

t.test(Class.E, Class.F)

Welch Two Sample t-test

t = -2.0594, df = 38, p-value = 0.04636

mean of x mean of y
     1290      1390

boxplot(cbind(Class.E, Class.F))

Notice that the p -value is lower for the t -test for Class.E and Class.F than it was for Class.C and Class.D .  Also notice that the means reported in the output are the same, and the box plots would look the same.

Effect size statistics

One way to account for the effect of sample size on our statistical tests is to consider effect size statistics.  These statistics reflect the size of the effect in a standardized way, and are unaffected by sample size.

An appropriate effect size statistic for a t -test is Cohen’s d .  It takes the difference in means between the two groups and divides by the pooled standard deviation of the groups.  Cohen’s d equals zero if the means are the same, and increases to infinity as the difference in means increases relative to the standard deviation.

In the following, note that Cohen’s d is not affected by the sample size difference in the Class.C / Class.D and the Class.E /  Class.F examples.

library(lsr)

cohensD(Class.C, Class.D,
        method = "raw")

cohensD(Class.E, Class.F,
        method = "raw")

Effect size statistics are standardized so that they are not affected by the units of measurements of the data.  This makes them interpretable across different situations, or if the reader is not familiar with the units of measurement in the original data.  A Cohen’s d of 1 suggests that the two means differ by one pooled standard deviation.  A Cohen’s d of 0.5 suggests that the two means differ by one-half the pooled standard deviation.

For example, if we create new variables— Class.G and Class.H —that are the SAT scores from the previous example expressed as a proportion of a 1600 score, Cohen’s d will be the same as in the previous example.

Class.G = Class.E / 1600
Class.H = Class.F / 1600

Class.G
Class.H

cohensD(Class.G, Class.H,
        method = "raw")

Good practices for statistical analyses

Statistics is not like a trial.

When analyzing data, the analyst should not approach the task as would a lawyer for the prosecution.  That is, the analyst should not be searching for significant effects and tests, but should instead be like an independent investigator using lines of evidence to find out what is most likely to be true given the data, graphical analysis, and statistical analysis available.

The problem of multiple p -values

One concept that will be important in the following discussion is that when there are multiple tests producing multiple p-values, there is an inflation of the Type I error rate.  That is, there is a higher chance of making false-positive errors.

This simply follows mathematically from the definition of alpha.  If we allow a probability of 0.05, or a 5% chance, of making a Type I error on any one test, then as we do more and more tests, the chance that at least one of them is a false positive becomes greater and greater.
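For independent tests this can be calculated directly: the chance of at least one false positive among m tests, each conducted at alpha = 0.05, is 1 – (1 – alpha)^m.

alpha <- 0.05
m     <- c(1, 5, 10, 20)

1 - (1 - alpha)^m    # approximately 0.05, 0.23, 0.40, 0.64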

p -value adjustment

One way we deal with the problem of multiple p -values in statistical analyses is to adjust p -values when we do a series of tests together (for example, if we are comparing the means of multiple groups).

Don’t use Bonferroni adjustments

There are various p -value adjustments available in R.  In some cases, we will use FDR, which stands for false discovery rate , and in R is an alias for the Benjamini and Hochberg method.  There are also cases in which we’ll use Tukey range adjustment to correct for the family-wise error rate. 

Unfortunately, students in analysis of experiments courses often learn to use Bonferroni adjustment for p -values.  This method is simple to do with hand calculations, but is excessively conservative in most situations, and, in my opinion, antiquated.

There are other p -value adjustment methods, and the choice of which one to use is dictated either by which are common in your field of study, or by doing enough reading to understand which are statistically most appropriate for your application.
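As a brief illustration (the p-values below are made up), the built-in p.adjust() function implements several adjustment methods:

p <- c(0.005, 0.01, 0.03, 0.04, 0.20)

p.adjust(p, method = "fdr")          # Benjamini–Hochberg false discovery rate
p.adjust(p, method = "holm")         # Holm's method, less conservative than Bonferroni
p.adjust(p, method = "bonferroni")   # Bonferroni, shown for comparison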

Preplanned tests

The statistical tests covered in this book assume that tests are preplanned for their p-values to be accurate.  That is, in theory, you set out an experiment, collect the data as planned, and then say “I’m going to analyze it with this kind of model and do these post-hoc tests afterwards”, report these results, and that’s all you would do.

Some authors emphasize this idea of preplanned tests.  In contrast is an exploratory data analysis approach that relies upon examining the data with plots and using simple tests like correlation tests to suggest what statistical analysis makes sense.

If an experiment is set out in a specific design, then usually it is appropriate to use the analysis suggested by this design.

p -value hacking

It is important when approaching data from an exploratory approach, to avoid committing p -value hacking.  Imagine the case in which the researcher collects many different measurements across a range of subjects.  The researcher might be tempted to simply try different tests and models to relate one variable to another, for all the variables.  He might continue to do this until he found a test with a significant p -value.

But this would be a form of p -value hacking.

Because an alpha value of 0.05 allows us to make a false-positive error five percent of the time, finding one p -value below 0.05 after several successive tests may simply be due to chance.

Some forms of p -value hacking are more egregious.  For example, if one were to collect some data, run a test, and then continue to collect data and run tests iteratively until a significant p -value is found.

Publication bias

A related issue in science is that there is a bias to publish, or to report, only significant results.  This can also lead to an inflation of the false-positive rate.  As a hypothetical example, imagine if there are currently 20 similar studies being conducted testing a similar effect—let’s say the effect of glucosamine supplements on joint pain.  If 19 of those studies found no effect and so were discarded, but one study found an effect using an alpha of 0.05, and was published, is this really any support that glucosamine supplements decrease joint pain?

Clarification of terms and reporting on assignments

"statistically significant".

In the context of this book, the term "significant" means "statistically significant". 

Whenever the decision rule finds that p < alpha , the difference in groups, the association, or the correlation under consideration is then considered "statistically significant" or "significant". 

No effect size or practical considerations enter into determining whether an effect is “significant” or not.  The only exception is that test assumptions and requirements for appropriate data must also be met in order for the p -value to be valid.

What you need to consider :

 •  The null hypothesis

 •  p , alpha , and the decision rule,

 •  Your result.  That is, whether the difference in groups, the association, or the correlation is significant or not.

What you should report on your assignments:

•  The p -value

•  The conclusion, e.g. "There was a significant difference in the mean heights of boys and girls in the class." It is best to preface this with the "reject" or "fail to reject" language concerning your decision about the null hypothesis.

“Size of the effect” / “effect size”

In the context of this book, I use the term "size of the effect" to suggest the use of summary statistics to indicate how large an effect is.  This may be, for example, the difference in two medians.  I try to reserve the term “effect size” to refer to the use of effect size statistics.  This distinction isn’t necessarily common.

Usually you will consider an effect in relation to the magnitude of measurements.  That is, you might look at the difference in medians as a percent of the median of one group or of the global median.  Or, you might look at the difference in medians in relation to the range of answers.  For example, a one-point difference on a 5-point Likert item.  Counts might be expressed as proportions of totals or subsets.

What you should report on assignments :

 •  The size of the effect.  That is, the difference in medians or means, the difference in counts, or the  proportions of counts among groups.

 •  Where appropriate, the size of the effect expressed as a percentage or proportion.

•  If there is an effect size statistic—such as r, epsilon-squared, phi, Cramér's V, or Cohen's d—report this and its interpretation (small, medium, large), and incorporate it into your conclusion.

"Practical" / "Practical importance"

If there is a significant result, the question of practical importance asks if the difference or association is large enough to matter in the real world.

If there is no significant result, the question of practical importance asks if the difference or association is large enough to warrant another look, for example by running another test with a larger sample size or one that controls variability in observations better.

What you should report on your assignments:

•  Your conclusion as to whether this effect is large enough to be important in the real world.

•  The context, explanation, or support to justify your conclusion.

•  In some cases you might include considerations that aren't included in the data presented.  Examples might include the cost of one treatment over another, including time investment, or whether there is a large risk in selecting one treatment over another (e.g., if people's lives are on the line).

A few xkcd comics

Significant.

xkcd.com/882/

Null hypothesis

xkcd.com/892/

xkcd.com/1478/

Experiments, sampling, and causation

Types of experimental designs

Experimental designs

A true experimental design assigns treatments in a systematic manner.  The experimenter must be able to manipulate the experimental treatments and assign them to subjects.  Since treatments are randomly assigned to subjects, a causal inference can be made for significant results.  That is, we can say that the variation in the dependent variable is caused by the variation in the independent variable.

For interval/ratio data, traditional experimental designs can be analyzed with specific parametric models, assuming other model assumptions are met.  These traditional experimental designs include:

•  Completely random design

•  Randomized complete block design

•  Factorial

•  Split-plot

•  Latin square

Quasi-experiment designs

Often a researcher cannot assign treatments to individual experimental units, but can assign treatments to groups.  For example, if students are in a specific grade or class, it would not be practical to randomly assign students to grades or classes.  But different classes could receive different treatments (such as different curricula).  Causality can be inferred cautiously if treatments are randomly assigned and there is some understanding of the factors that affect the outcome.

Observational studies

In observational studies, the independent variables are not manipulated, and no treatments are assigned.  Surveys are often like this, as are studies of natural systems without experimental manipulation.  Statistical analysis can reveal the relationships among variables, but causality cannot be inferred.  This is because there may be other unstudied variables that affect the measured variables in the study.

Good sampling practices are critical for producing good data.  In general, samples need to be collected in a random fashion so that bias is avoided.

In survey data, bias is often introduced by a self-selection bias.  For example, internet or telephone surveys include only those who respond to these requests.  Might there be some relevant difference in the variables of interest between those who respond to such requests and the general population being surveyed?  Or bias could be introduced by the researcher selecting some subset of potential subjects, for example only surveying a 4-H program with particularly cooperative students and ignoring other clubs.  This is sometimes called “convenience sampling”.

In election forecasting, good pollsters need to account for selection bias and other biases in the survey process.  For example, if a survey is done by landline telephone, those being surveyed are more likely to be older than the general population of voters, and so likely to have a bias in their voting patterns.

Plan ahead and be consistent

It is sometimes necessary to change experimental conditions during the course of an experiment.  Equipment might fail, or unusual weather may prevent making meaningful measurements.

But in general, it is much better to plan ahead and be consistent with measurements. 

Consistency

People sometimes have the tendency to change measurement frequency or experimental treatments during the course of a study.  This inevitably causes headaches in trying to analyze data, and makes writing up the results messy.  Try to avoid this.

Controls and checks

If you are testing an experimental treatment, include a check treatment that almost certainly will have an effect and a control treatment that almost certainly won’t.  A control treatment will receive no treatment and a check treatment will receive a treatment known to be successful.  In an educational setting, perhaps a control group receives no instruction on the topic but on another topic, and the check group will receive standard instruction.

Including checks and controls helps with the analysis in a practical sense, since they serve as standard treatments against which to compare the experimental treatments.  In the case where the experimental treatments have similar effects, controls and checks allow you to say, for example, “Means for all the experimental treatments were similar, but were higher than the mean for the control, and lower than the mean for the check treatment.”

Include alternate measurements

It often happens that measuring equipment fails or that a certain measurement doesn’t produce the expected results.  It is therefore helpful to include measurements of several variables that can capture the potential effects.  Perhaps test scores of students won’t show an effect, but a self-assessment question on how much students learned will.

Include covariates

Including additional independent variables that might affect the dependent variable is often helpful in an analysis.  In an educational setting, you might assess student age, grade, school, town, background level in the subject, or how well they are feeling that day.

The effects of covariates on the dependent variable may be of interest in themselves.  But also, including covariates in an analysis can better model the data, sometimes making treatment effects clearer or making a model better meet model assumptions.

Optional discussion: Alternative methods to the Null Hypothesis Significance Test

The NHST controversy.

Particularly in the fields of psychology and education, there has been much criticism of the null hypothesis significance test approach.  From my reading, the main complaints against NHST tend to be:

•  Students and researchers don’t really understand the meaning of p -values.

•  p -values don’t include important information like confidence intervals or parameter estimates.

•  p -values have properties that may be misleading, for example that they do not represent effect size, and that they change with sample size.

•  We often treat an alpha of 0.05 as a magical cutoff value.

Personally, I don’t find these to be very convincing arguments against the NHST approach. 

The first complaint is in some sense pedantic:  Like so many things, students and researchers learn the definition of p -values at some point and then eventually forget.  This doesn’t seem to impact the usefulness of the approach.

The second point has weight only if researchers use only p-values to draw conclusions from statistical tests.  As this book points out, one should always consider the size of the effects and practical considerations of the effects, as well as present findings in table or graphical form, including confidence intervals or measures of dispersion.  There is no reason why parameter estimates, goodness-of-fit statistics, and confidence intervals can’t be included when an NHST approach is followed.

The properties in the third point also don’t count much as criticism if one is using p -values correctly.  One should understand that it is possible to have a small effect size and a small p -value, and vice-versa.  This is not a problem, because p -values and effect sizes are two different concepts.  We shouldn’t expect them to be the same.  The fact that p -values change with sample size is also in no way problematic to me.  It makes sense that when there is a small effect size or a lot of variability in the data that we need many samples to conclude the effect is likely to be real.

(One case where I think the considerations in the preceding point are commonly problematic is when people use statistical tests to check for the normality or homogeneity of data or model residuals.  As sample size increases, these tests are better able to detect small deviations from normality or homoscedasticity.  Too many people use them and think their model is inappropriate because the test can detect a small effect size, that is, a small deviation from normality or homoscedasticity).

The fourth point is a good one.  It doesn’t make much sense to come to one conclusion if our p -value is 0.049 and the opposite conclusion if our p -value is 0.051.  But I think this can be ameliorated by reporting the actual p -values from analyses, and relying less on p -values to evaluate results.

Overall it seems to me that these complaints condemn poor practices that the authors observe: not reporting the size of effects in some manner; not including confidence intervals or measures of dispersion; basing conclusions solely on p -values; and not including important results like parameter estimates and goodness-of-fit statistics.

Alternatives to the NHST approach

Estimates and confidence intervals.

One approach to determining statistical significance is to use estimates and confidence intervals.  Estimates could be statistics like means, medians, proportions, or other calculated statistics.  This approach can be very straightforward, easy for readers to understand, and easy to present clearly.

Bayesian approach

The most popular competitor to the NHST approach is Bayesian inference.  Bayesian inference has the advantage of calculating the probability of the hypothesis given the data , which is what we thought we should be doing in the “Wait, does this make any sense?” section above.  Essentially it takes prior knowledge about the distribution of the parameters of interest for a population and adds the information from the measured data to reassess some hypothesis related to the parameters of interest.  If the reader will excuse the vagueness of this description, it makes intuitive sense.  We start with what we suspect to be the case, and then use new data to assess our hypothesis.

One disadvantage of the Bayesian approach is that it is not obvious in most cases what could be used for legitimate prior information.  A second disadvantage is that conducting Bayesian analysis is not as straightforward as the tests presented in this book.

References and further reading

[Video]  “Understanding statistical inference” from Statistics Learning Center (Dr. Nic). 2015. www.youtube.com/watch?v=tFRXsngz4UQ .

[Video]  “Hypothesis tests, p-value” from Statistics Learning Center (Dr. Nic). 2011. www.youtube.com/watch?v=0zZYBALbZgg .

[Video]   “Understanding the p-value” from Statistics Learning Center (Dr. Nic). 2011.

www.youtube.com/watch?v=eyknGvncKLw .

[Video]  “Important statistical concepts: significance, strength, association, causation” from Statistics Learning Center (Dr. Nic). 2012. www.youtube.com/watch?v=FG7xnWmZlPE .

“Understanding statistical inference” from Dr. Nic. 2015. Learn and Teach Statistics & Operations Research. creativemaths.net/blog/understanding-statistical-inference/ .

“Basic concepts of hypothesis testing” in McDonald, J.H. 2014. Handbook of Biological Statistics . www.biostathandbook.com/hypothesistesting.html .

“Hypothesis testing” , section 4.3, in Diez, D.M., C.D. Barr , and M. Çetinkaya-Rundel. 2012. OpenIntro Statistics , 2nd ed. www.openintro.org/ .

“Hypothesis Testing with One Sample”, sections 9.1–9.2 in Openstax. 2013. Introductory Statistics . openstax.org/textbooks/introductory-statistics .

"Proving causation" from Dr. Nic. 2013. Learn and Teach Statistics & Operations Research. creativemaths.net/blog/proving-causation/ .

[Video]   “Variation and Sampling Error” from Statistics Learning Center (Dr. Nic). 2014. www.youtube.com/watch?v=y3A0lUkpAko .

[Video]   “Sampling: Simple Random, Convenience, systematic, cluster, stratified” from Statistics Learning Center (Dr. Nic). 2012. www.youtube.com/watch?v=be9e-Q-jC-0 .

“Confounding variables” in McDonald, J.H. 2014. Handbook of Biological Statistics . www.biostathandbook.com/confounding.html .

“Overview of data collection principles” , section 1.3, in Diez, D.M., C.D. Barr , and M. Çetinkaya-Rundel. 2012. OpenIntro Statistics , 2nd ed. www.openintro.org/ .

“Observational studies and sampling strategies” , section 1.4, in Diez, D.M., C.D. Barr , and M. Çetinkaya-Rundel. 2012. OpenIntro Statistics , 2nd ed. www.openintro.org/ .

“Experiments” , section 1.5, in Diez, D.M., C.D. Barr , and M. Çetinkaya-Rundel. 2012. OpenIntro Statistics , 2nd ed. www.openintro.org/ .

Exercises F

1.  Which of the following pair is the null hypothesis?

A) The number of heads from the coin is not different from the number of tails.

B) The number of heads from the coin is different from the number of tails.

2.  Which of the following pair is the null hypothesis?

A) The height of boys is different than the height of girls.

B) The height of boys is not different than the height of girls.

3.  Which of the following pair is the null hypothesis?

A) There is an association between classroom and sex.  That is, there is a difference in counts of girls and boys between the classes.

B) There is no association between classroom and sex.  That is, there is no difference in counts of girls and boys between the classes.

4.  We flip a coin 10 times and it lands on heads 7 times.  We want to know if the coin is fair.

a.  What is the null hypothesis?

b.  Looking at the code below, and assuming an alpha of 0.05,

What do you decide (use the reject or fail to reject language)?

c.  In practical terms, what do you conclude?

binom.test(7, 10, 0.5)

Exact binomial test

number of successes = 7, number of trials = 10, p-value = 0.3438

5.  We measure the height of 9 boys and 9 girls in a class, in centimeters.  We want to know if one group is taller than the other.

a.  What is the null hypothesis?

b.  Looking at the code below, and assuming an alpha of 0.05, what do you decide (use the reject or fail to reject language)?

c.  In practical terms, what do you conclude?  Address the practical importance of the results.

Girls = c(152, 150, 140, 160, 145, 155, 150, 152, 147)
Boys  = c(144, 142, 132, 152, 137, 147, 142, 144, 139)

t.test(Girls, Boys)

Welch Two Sample t-test

t = 2.9382, df = 16, p-value = 0.009645

mean of x mean of y 
 150.1111  142.1111

mean(Boys)
sd(Boys)
quantile(Boys)

mean(Girls)
sd(Girls)
quantile(Girls)

boxplot(cbind(Girls, Boys))

6. We count the number of boys and girls in two classrooms.  We are interested to know if there is an association between the classrooms and the number of girls and boys.  That is, does the proportion of boys and girls differ statistically across the two classrooms?

Classroom   Girls   Boys
A              13      7
B               5     15

Input =("
 Classroom  Girls  Boys
 A          13       7
 B           5      15
")

Matrix = as.matrix(read.table(textConnection(Input),
                   header=TRUE,
                   row.names=1))

fisher.test(Matrix)

Fisher's Exact Test for Count Data

p-value = 0.02484

Matrix

rowSums(Matrix)
colSums(Matrix)

prop.table(Matrix,
           margin=1)    ### Proportions for each row

barplot(t(Matrix),
        beside = TRUE,
        legend = TRUE,
        ylim   = c(0, 25),
        xlab   = "Class",
        ylab   = "Count")

7. Why should you not rely solely on p -values to make a decision in the real world?  (You should have at least two reasons.)

8. Create your own example to show the importance of considering the size of the effect . Describe the scenario: what the research question is, and what kind of data were collected.  You may make up data and provide real results, or report hypothetical results.

9. Create your own example to show the importance of weighing other practical considerations . Describe the scenario: what the research question is, what kind of data were collected, what statistical results were reached, and what other practical considerations were brought to bear.

10. What is 5e-4 in common decimal notation?

©2016 by Salvatore S. Mangiafico. Rutgers Cooperative Extension, New Brunswick, NJ.

Non-commercial reproduction of this content, with attribution, is permitted. For-profit reproduction without permission is prohibited.

If you use the code or information in this site in a published work, please cite it as a source.  Also, if you are an instructor and use this book in your course, please let me know.   My contact information is on the About the Author of this Book page.

Mangiafico, S.S. 2016. Summary and Analysis of Extension Program Evaluation in R, version 1.20.05, revised 2023. rcompanion.org/handbook/ . (Pdf version: rcompanion.org/documents/RHandbookProgramEvaluation.pdf .)

Hypothesis Tests in R

This tutorial covers basic hypothesis testing in R.

  • Normality tests
  • Shapiro-Wilk normality test
  • Kolmogorov-Smirnov test
  • Comparing central tendencies: Tests with continuous / discrete data
  • One-sample t-test : Normally-distributed sample vs. expected mean
  • Two-sample t-test : Two normally-distributed samples
  • Wilcoxon rank sum : Two non-normally-distributed samples
  • Weighted two-sample t-test : Two continuous samples with weights
  • Comparing proportions: Tests with categorical data
  • Chi-squared goodness of fit test : Sampled frequencies of categorical values vs. expected frequencies
  • Chi-squared independence test : Two sampled frequencies of categorical values
  • Weighted chi-squared independence test : Two weighted sampled frequencies of categorical values
  • Comparing multiple groups: Tests with categorical and continuous / discrete data
  • Analysis of Variation (ANOVA) : Normally-distributed samples in groups defined by categorical variable(s)
  • Kruskal-Wallis One-Way Analysis of Variance : Nonparametric test of the significance of differences between two or more groups

Hypothesis Testing

Science is "knowledge or a system of knowledge covering general truths or the operation of general laws especially as obtained and tested through scientific method" (Merriam-Webster 2022) .

The idealized world of the scientific method is question-driven , with the collection and analysis of data determined by the formulation of research questions and the testing of hypotheses. Hypotheses are tentative assumptions about what the answers to your research questions may be.

  • Formulate questions: How can I understand some phenomenon?
  • Literature review: What does existing research say about my questions?
  • Formulate hypotheses: What do I think the answers to my questions will be?
  • Collect data: What data can I gather to test my hypothesis?
  • Test hypotheses: Does the data support my hypothesis?
  • Communicate results: Who else needs to know about this?
  • Formulate questions: Frame missing knowledge about a phenomenon as research question(s).
  • Literature review: A literature review is an investigation of what existing research says about the phenomenon you are studying. A thorough literature review is essential to identify gaps in existing knowledge you can fill, and to avoid unnecessarily duplicating existing research.
  • Formulate hypotheses: Develop possible answers to your research questions.
  • Collect data: Acquire data that supports or refutes the hypothesis.
  • Test hypotheses: Run tools to determine if the data corroborates the hypothesis.
  • Communicate results: Share your findings with the broader community that might find them useful.

While the process of knowledge production is, in practice, often more iterative than this waterfall model, the testing of hypotheses is usually a fundamental element of scientific endeavors involving quantitative data.


The Problem of Induction

The scientific method looks to the past or present to build a model that can be used to infer what will happen in the future. General knowledge asserts that given a particular set of conditions, a particular outcome will or is likely to occur.

The problem of induction is that we cannot be 100% certain that what we are assuming is a general principle is not, in fact, specific to the particular set of conditions when we made our empirical observations. We cannot prove that such principles will hold true under future conditions or different locations that we have not yet experienced (Vickers 2014) .

The problem of induction is often associated with the 18th-century British philosopher David Hume . This problem is especially vexing in the study of human beings, where behaviors are a function of complex social interactions that vary over both space and time.


Falsification

One way of addressing the problem of induction was proposed by the 20th-century Viennese philosopher Karl Popper .

Rather than try to prove a hypothesis is true, which we cannot do because we cannot know all possible situations that will arise in the future, we should instead concentrate on falsification , where we try to find situations where a hypothesis is false. While you cannot prove your hypothesis will always be true, you only need to find one situation where the hypothesis is false to demonstrate that the hypothesis can be false (Popper 1962) .

If a hypothesis is not demonstrated to be false by a particular test, we have corroborated that hypothesis. While corroboration does not "prove" anything with 100% certainty, by subjecting a hypothesis to multiple tests that fail to demonstrate that it is false, we can have increasing confidence that our hypothesis reflects reality.


Null and Alternative Hypotheses

In scientific inquiry, we are often concerned with whether a factor we are considering (such as taking a specific drug) results in a specific effect (such as reduced recovery time).

To evaluate whether a factor results in an effect, we will perform an experiment and / or gather data. For example, in a clinical drug trial, half of the test subjects will be given the drug, and half will be given a placebo (something that appears to be the drug but is actually a neutral substance).


Because the data we gather will usually only be a portion (sample) of total possible people or places that could be affected (population), there is a possibility that the sample is unrepresentative of the population. We use a statistical test that considers that uncertainty when assessing whether an effect is associated with a factor.

  • Statistical testing begins with an alternative hypothesis (H 1 ) that states that the factor we are considering results in a particular effect. The alternative hypothesis is based on the research question and the type of statistical test being used.
  • Because of the problem of induction , we cannot prove our alternative hypothesis. However, under the concept of falsification , we can evaluate the data to see if there is a significant probability that our data falsifies our alternative hypothesis (Wilkinson 2012) .
  • The null hypothesis (H 0 ) states that the factor has no effect. The null hypothesis is the opposite of the alternative hypothesis. The null hypothesis is what we are testing when we perform a hypothesis test.


The output of a statistical test like the t-test is a p -value. A p -value is the probability of seeing an effect at least as large as the one in the sampled data if the effect were actually nothing more than random sampling error (chance).

  • If a p -value is greater than the significance level (0.05 for 5% significance) we fail to reject the null hypothesis since there is a significant possibility that our results falsify our alternative hypothesis.
  • If a p -value is lower than the significance level (0.05 for 5% significance) we reject the null hypothesis and have corroborated (provided evidence for) our alternative hypothesis.

The calculation and interpretation of the p -value goes back to the central limit theorem , which states that the distribution of sample means (and, therefore, of random sampling error) approaches a normal distribution as sample sizes grow.


Using our example of a clinical drug trial, if the mean recovery times for the two groups are close enough together that there is a significant possibility ( p > 0.05) that the recovery times are the same (falsification), we fail to reject the null hypothesis.


However, if the mean recovery times for the two groups are far enough apart that the probability they are the same is under the level of significance ( p < 0.05), we reject the null hypothesis and have corroborated our alternative hypothesis.


Significance means that an effect is "probably caused by something other than mere chance" (Merriam-Webster 2022) .

  • The significance level (α) is the threshold for significance and, by convention, is usually 5%, 10%, or 1%, which corresponds to 95% confidence, 90% confidence, or 99% confidence, respectively.
  • A factor is considered statistically significant if the probability that the effect we see in the data is a result of random sampling error (the p -value) is below the chosen significance level.
  • A statistical test is used to evaluate whether a factor being considered is statistically significant (Gallo 2016) .

Type I vs. Type II Errors

Although we are making a binary choice between rejecting and failing to reject the null hypothesis, because we are using sampled data, there is always the possibility that the choice we have made is an error.

There are two types of errors that can occur in hypothesis testing.

  • Type I error (false positive) occurs when a low p -value causes us to reject the null hypothesis, but the factor does not actually result in the effect.
  • Type II error (false negative) occurs when a high p -value causes us to fail to reject the null hypothesis, but the factor does actually result in the effect.

The numbering of the errors reflects the predisposition of the scientific method to be fundamentally skeptical . Accepting a fact about the world as true when it is not true is considered worse than rejecting a fact about the world that actually is true.


Statistical Significance vs. Importance

When we reject the null hypothesis, we have found information that is commonly called statistically significant . But there are multiple challenges with this terminology.

First, statistical significance is distinct from importance (NIST 2012) . For example, if sampled data reveals a statistically significant difference in cancer rates, that does not mean that the increased risk is important enough to justify expensive mitigation measures. All statistical results require critical interpretation within the context of the phenomenon being observed. People with different values and incentives can have different interpretations of whether statistically significant results are important.

Second, the use of 95% probability for defining confidence intervals is an arbitrary convention. This creates a good vs. bad binary that suggests a "finality and certitude that are rarely justified." Alternative approaches like Bayesian statistics that express results as probabilities can offer more nuanced ways of dealing with complexity and uncertainty (Clayton 2022) .

Science vs. Non-science

Not all ideas can be falsified, and Popper uses the distinction between falsifiable and non-falsifiable ideas to make a distinction between science and non-science. In order for an idea to be science it must be an idea that can be demonstrated to be false.

While Popper asserts there is still value in ideas that are not falsifiable, such ideas are not science in his conception of what science is. Such non-science ideas often involve questions of subjective values or unseen forces that are complex, amorphous, or difficult to objectively observe.

Example Data

As example data, this tutorial will use a table of anonymized individual responses from the CDC's Behavioral Risk Factor Surveillance System . The BRFSS is a "system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services" (CDC 2019) .

A CSV file with the selected variables used in this tutorial is available here and can be imported into R with read.csv() .

Guidance on how to download and process this data directly from the CDC website is available here...

Variable Types

The publicly-available BRFSS data contains a wide variety of discrete, ordinal, and categorical variables. Variables often contain special codes for non-responsiveness or missing (NA) values. Examples of how to clean these variables are given here...

The BRFSS has a codebook that gives the survey questions associated with each variable, and the way that responses are encoded in the variable values.


Normality Tests

Tests are commonly divided into two groups depending on whether they are built on the assumption that the continuous variable has a normal distribution.

  • Parametric tests presume a normal distribution.
  • Non-parametric tests can work with normal and non-normal distributions.

The distinction between parametric and non-parametric techniques is especially important when working with small numbers of samples (less than 40 or so) from a larger population.

The normality tests given below do not work with large numbers of values, but with many statistical techniques, violations of normality assumptions do not cause major problems when large sample sizes are used (Ghasemi and Zahediasl 2012).

The Shapiro-Wilk Normality Test

  • Data: A continuous or discrete sampled variable
  • R Function: shapiro.test()
  • Null hypothesis (H 0 ): The population distribution from which the sample is drawn is normal
  • History: Samuel Sanford Shapiro and Martin Wilk (1965)

This is an example with random values from a normal distribution.

This is an example with random values from a uniform (non-normal) distribution.
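The code for these two examples is not reproduced above. A minimal sketch of what such calls might look like (the sample sizes and distribution parameters are assumptions, not values from the original tutorial):

# Sample from a normal distribution: a high p-value means we fail to reject normality
x.normal = rnorm(1000, mean = 100, sd = 15)
shapiro.test(x.normal)

# Sample from a uniform (non-normal) distribution: a low p-value means we reject normality
x.uniform = runif(1000, min = 0, max = 100)
shapiro.test(x.uniform)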

The Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov test is more general than the Shapiro-Wilk test and can be used to test whether a sample is drawn from any type of distribution.

  • Data: A continuous or discrete sampled variable and a reference probability distribution
  • R Function: ks.test()
  • Null hypothesis (H 0 ): The population distribution from which the sample is drawn matches the reference distribution
  • History: Andrey Kolmogorov (1933) and Nikolai Smirnov (1948)
  • pearson.test() : The Pearson chi-square normality test from the nortest library. Lower p-values (closer to 0) mean rejecting the null hypothesis that the distribution IS normal.
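As a hedged illustration (the sample and the reference parameters are assumptions rather than the tutorial's original code), a one-sample Kolmogorov-Smirnov test against a normal reference distribution might look like this:

x = rnorm(1000, mean = 100, sd = 15)

# Compare the sample against a normal distribution with the sample's own mean and sd;
# a high p-value means we fail to reject the null that the sample matches that distribution
ks.test(x, "pnorm", mean(x), sd(x))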

Modality Tests of Samples

Comparing Two Central Tendencies: Tests with Continuous / Discrete Data

One-Sample T-Test (Two-Sided)

The one-sample t-test tests the significance of the difference between the mean of a sample and an expected mean.

  • Data: A continuous or discrete sampled variable and a single expected mean (μ)
  • Parametric (normal distributions)
  • R Function: t.test()
  • Null hypothesis (H 0 ): The mean of the sampled distribution matches the expected mean.
  • History: William Sealy Gosset (1908)

t = (x̄ − μ) / (σ̂ / √n)

  • t : The value of t used to find the p-value
  • x̄ : The sample mean
  • μ : The population mean
  • σ̂ : The estimate of the standard deviation of the population (usually the standard deviation of the sample)
  • n : The sample size

T-tests should only be used when the population is at least 20 times larger than its respective sample. If the sample size is too large, even trivially small differences can produce low p-values that make unimportant effects look significant.

For example, we test a hypothesis that the mean weight in IL in 2020 is different than the 2005 continental mean weight.

Walpole et al. (2012) estimated that the average adult weight in North America in 2005 was 178 pounds. We could presume that Illinois is a comparatively normal North American state that would follow the trend of both increased age and increased weight (CDC 2021) .
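The tutorial's own code is not shown above. A minimal sketch, assuming the cleaned Illinois responses are in a data frame called illinois with a numeric WEIGHT2 column (both names are assumptions):

# Two-sided one-sample t-test: is the mean 2020 Illinois weight different from 178 pounds?
t.test(illinois$WEIGHT2, mu = 178)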


The low p-value leads us to reject the null hypothesis and corroborate our alternative hypothesis that mean weight changed between 2005 and 2020 in Illinois.

One Sample T-Test (One-Sided)

Because we were expecting an increase, we can modify our hypothesis that the mean weight in 2020 is higher than the continental weight in 2005. We can perform a one-sided t-test using the alternative="greater" parameter.
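Continuing with the same assumed illinois data frame, the one-sided version might look like:

# One-sided test: is the mean 2020 Illinois weight greater than 178 pounds?
t.test(illinois$WEIGHT2, mu = 178, alternative = "greater")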

The low p-value leads us to again reject the null hypothesis and corroborate our alternative hypothesis that mean weight in 2020 is higher than the continental weight in 2005.

Note that this does not clearly evaluate whether weight increased specifically in Illinois, or, if it did, whether that was caused by an aging population or decreasingly healthy diets. Hypotheses based on such questions would require more detailed analysis of individual data.

Although we can see that the mean cancer incidence rate is higher for counties near nuclear plants, there is the possibility that the difference in means happened by accident and the nuclear plants have nothing to do with those higher rates.

The t-test allows us to test a hypothesis. Note that a t-test does not "prove" or "disprove" anything. It only gives the probability that the differences we see between two areas happened by chance. It also does not evaluate whether there are other problems with the data, such as a third variable, or inaccurate cancer incidence rate estimates.


Note that this does not prove that nuclear power plants present a higher cancer risk to their neighbors. It simply says that the slightly higher risk is probably not due to chance alone. But there are a wide variety of other related or unrelated social, environmental, or economic factors that could contribute to this difference.

Box-and-Whisker Chart

One visualization commonly used when comparing distributions (collections of numbers) is a box-and-whisker chart. The boxes show the middle 50% of the distribution (from the 25th percentile to the 75th percentile, with a line at the median) and the whiskers show the extreme high and low values.


Although Google Sheets does not provide the capability to create box-and-whisker charts, Google Sheets does have candlestick charts , which are similar to box-and-whisker charts, and which are normally used to display the range of stock price changes over a period of time.

This video shows how to create a candlestick chart comparing the distributions of cancer incidence rates. The QUARTILE() function gets the values that divide the distribution into four equally-sized parts. This shows that while the range of incidence rates in the non-nuclear counties are wider, the bulk of the rates are below the rates in nuclear counties, giving a visual demonstration of the numeric output of our t-test.

While categorical data can often be reduced to dichotomous data and used with proportions tests or t-tests, there are situations where you are sampling data that falls into more than two categories and you would like to make hypothesis tests about those categories. This tutorial describes a group of tests that can be used with that type of data.

Two-Sample T-Test

When comparing means of values from two different groups in your sample, a two-sample t-test is in order.

The two-sample t-test tests the significance of the difference between the means of two different samples.

  • Two normally-distributed, continuous or discrete sampled variables, OR
  • A normally-distributed continuous or discrete sampled variable and a parallel dichotomous variable indicating what group each of the values in the first variable belong to
  • Null hypothesis (H 0 ): The means of the two sampled distributions are equal.

For example, given the low incomes and delicious foods prevalent in Mississippi, we might presume that average weight in Mississippi would be higher than in Illinois.


We test a hypothesis that the mean weight in IL in 2020 is less than the 2020 mean weight in Mississippi.
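A sketch of the two-sample call, assuming separate illinois and mississippi data frames with cleaned WEIGHT2 columns (the names are assumptions, not the tutorial's original code):

# One-sided two-sample t-test: is mean Illinois weight less than mean Mississippi weight?
t.test(illinois$WEIGHT2, mississippi$WEIGHT2, alternative = "less")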

The low p-value leads us to reject the null hypothesis and corroborate our alternative hypothesis that mean weight in Illinois is less than in Mississippi.

While the difference in means is statistically significant, it is small (182 vs. 187), which should lead to caution in interpretation so that you avoid using your analysis simply to reinforce unhelpful stigmatization.

Wilcoxon Rank Sum Test (Mann-Whitney U-Test)

The Wilcoxon rank sum test tests the significance of the difference between the means of two different samples. This is a non-parametric alternative to the t-test.

  • Data: Two continuous sampled variables
  • Non-parametric (normal or non-normal distributions)
  • R Function: wilcox.test()
  • Null hypothesis (H 0 ): For randomly selected values X and Y from two populations, the probability of X being greater than Y is equal to the probability of Y being greater than X.
  • History: Frank Wilcoxon (1945) and Henry Mann and Donald Whitney (1947)

The test is implemented with the wilcox.test() function.

  • When the test is performed on one sample in comparison to an expected value around which the distribution is symmetrical (μ), the test is known as a Wilcoxon signed rank test .
  • When the test is performed to compare two samples, the test is known as a Wilcoxon rank sum test (equivalent to the Mann-Whitney U test).

For this example, we will use AVEDRNK3: During the past 30 days, on the days when you drank, about how many drinks did you drink on the average?

  • 1 - 76: Number of drinks
  • 77: Don’t know/Not sure
  • 99: Refused
  • NA: Not asked or Missing

The histogram clearly shows this to be a non-normal distribution.


Continuing the comparison of Illinois and Mississippi from above, we might presume that with all that warm weather and excellent food in Mississippi, they might be inclined to drink more. The means of average number of drinks per month seem to suggest that Mississippians do drink more than Illinoians.

We can use wilcox.test() to test a hypothesis that the average amount of drinking in Illinois is different than in Mississippi. Like the t-test, the alternative can be specified as two-sided or one-sided, and for this example we will test whether the sampled Illinois value is indeed less than the Mississippi value.
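A sketch under the same assumptions about the illinois and mississippi data frames used above:

# One-sided Wilcoxon rank sum test: is average drinking lower in Illinois than in Mississippi?
wilcox.test(illinois$AVEDRNK3, mississippi$AVEDRNK3, alternative = "less")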

The low p-value leads us to reject the null hypothesis and corroborates our hypothesis that average drinking is lower in Illinois than in Mississippi. As before, this tells us nothing about why this is the case.

Weighted Two-Sample T-Test

The downloadable BRFSS data is raw, anonymized survey data that is biased by uneven geographic coverage of survey administration (noncoverage) and lack of responsiveness from some segments of the population (nonresponse). The X_LLCPWT field (landline, cellphone weighting) is a weighting factor added by the CDC that can be assigned to each response to compensate for these biases.

The wtd.t.test() function from the weights library has a weights parameter that can be used to include a weighting factor as part of the t-test.
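A sketch of how the weighting factor might be passed, again assuming the two state data frames above; the argument names should be checked against the weights package documentation, since this is an assumption rather than code from the tutorial:

library(weights)

# Weighted two-sample t-test using the CDC's X_LLCPWT weighting factor
wtd.t.test(illinois$WEIGHT2, mississippi$WEIGHT2,
           weight = illinois$X_LLCPWT, weighty = mississippi$X_LLCPWT,
           samedata = FALSE)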

Comparing Proportions: Tests with Categorical Data

Chi-Squared Goodness of Fit Test

  • Tests the significance of the difference between sampled frequencies of different values and expected frequencies of those values
  • Data: A categorical sampled variable and a table of expected frequencies for each of the categories
  • R Function: chisq.test()
  • Null hypothesis (H 0 ): The relative proportions of categories in the sampled variable match the expected proportions
  • History: Karl Pearson (1900)
  • Example Question: Are the voting preferences of voters in my district significantly different from the current national polls?

For example, we test a hypothesis that smoking rates changed between 2000 and 2020.

In 2000, the estimated rate of adult smoking in Illinois was 22.3% (Illinois Department of Public Health 2004) .

The variable we will use is SMOKDAY2: Do you now smoke cigarettes every day, some days, or not at all?

  • 1: Current smoker - now smokes every day
  • 2: Current smoker - now smokes some days
  • 3: Not at all
  • 7: Don't know
  • NA: Not asked or missing - NA is used for people who have never smoked

We subset only yes/no responses in Illinois and convert into a dummy variable (yes = 1, no = 0).

The listing of the table as percentages indicates that smoking rates were halved between 2000 and 2020, but since this is sampled data, we need to run a chi-squared test to make sure the difference can't be explained by the randomness of sampling.
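A minimal sketch of the test, assuming the yes/no dummy variable described above is a vector called smoker (1 = current smoker, 0 = not); the 2000 rate of 22.3% supplies the expected proportions:

observed = table(smoker)    # counts of 0 (non-smokers) and 1 (smokers) in the 2020 sample

# Expected proportions from the 2000 estimate: 77.7% non-smokers, 22.3% smokers
chisq.test(observed, p = c(0.777, 0.223))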

In this case, the very low p-value leads us to reject the null hypothesis and corroborates the alternative hypothesis that smoking rates changed between 2000 and 2020.

Chi-Squared Contingency Analysis / Test of Independence

  • Tests the significance of the difference between frequencies between two different groups
  • Data: Two categorical sampled variables
  • Null hypothesis (H 0 ): The relative proportions of one variable are independent of the second variable.

We can also compare categorical proportions between two sets of sampled categorical variables.

The chi-squared test can be used to determine if two categorical variables are independent. What is passed as the parameter is a contingency table created with the table() function that cross-classifies the number of rows that are in the categories specified by the two categorical variables.

The null hypothesis with this test is that the two categories are independent. The alternative hypothesis is that there is some dependency between the two categories.

For this example, we can compare the three categories of smokers (daily = 1, occasionally = 2, never = 3) across the two categories of states (Illinois and Mississippi).
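A sketch assuming a combined data frame called both with a STATE column and the SMOKDAY2 responses (the names are assumptions):

# Contingency table of smoking category (1, 2, 3) by state
smoke.table = table(both$STATE, both$SMOKDAY2)
smoke.table

# Test whether smoking category is independent of state
chisq.test(smoke.table)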


The low p-value leads us to reject the null hypotheses that the categories are independent and corroborates our hypotheses that smoking behaviors in the two states are indeed different.

p-value = 1.516e-09

Weighted Chi-Squared Contingency Analysis

As with the weighted t-test above, the weights library contains the wtd.chi.sq() function for incorporating weighting into chi-squared contingency analysis.
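A sketch of the weighted version, with the same caveat that the variable names are assumptions and the weights package documentation should be checked for the exact arguments:

library(weights)

# Weighted chi-squared test of independence using the X_LLCPWT weighting factor
wtd.chi.sq(both$STATE, both$SMOKDAY2, weight = both$X_LLCPWT)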

As above, the even lower p-value leads us to again reject the null hypothesis that smoking behaviors are independent in the two states.

Suppose that the Macrander campaign would like to know how partisan this election is. If people are largely choosing to vote along party lines, the campaign will seek to get their base voters out to the polls. If people are splitting their ticket, the campaign may focus their efforts more broadly.

In the example below, the Macrander campaign took a small poll of 30 people asking who they wished to vote for AND what party they most strongly affiliate with.
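The poll data itself is not reproduced above. A small made-up sketch of what the analysis might look like (the counts are illustrative only, chosen to match the pattern described below):

# Hypothetical poll of 30 people: party affiliation and preferred candidate
party     = c(rep("Democrat", 12), rep("Republican", 12), rep("Independent", 6))
candidate = c(rep("Macrander", 10), rep("Stewart", 2),
              rep("Stewart", 10), rep("Macrander", 2),
              rep("Miller", 6))

poll = table(party, candidate)
poll

# A low p-value indicates candidate choice is strongly associated with party affiliation
chisq.test(poll)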

The output of table() shows a fairly strong relationship between party affiliation and candidates. Democrats tend to vote for Macrander, Republicans tend to vote for Stewart, and independents all vote for Miller.

This is reflected in the very low p-value from the chi-squared test. This indicates that there is a very low probability that the two categories are independent. Therefore we reject the null hypothesis.

In contrast, suppose that the poll results had showed there were a number of people crossing party lines to vote for candidates outside their party. The simulated data below uses the runif() function to randomly choose 50 party names.
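A sketch of one way such a simulation might be written with runif() (the category boundaries and seed are assumptions):

set.seed(42)

# Randomly assign 50 poll respondents to parties and candidates
parties    = c("Democrat", "Republican", "Independent")
candidates = c("Macrander", "Stewart", "Miller")

party     = parties[ceiling(runif(50, 0, 3))]
candidate = candidates[ceiling(runif(50, 0, 3))]

poll = table(party, candidate)
chisq.test(poll)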

The contingency table() shows no clear relationship between party affiliation and candidate. This is validated quantitatively by the chi-squared test. The fairly high p-value of 0.4018 indicates that a table at least this uneven could easily arise by chance (about 40% of the time) if the two categories were independent. Therefore, we fail to reject the null hypothesis and the campaign should focus their efforts on the broader electorate.

The warning message given by the chisq.test() function indicates that the sample size is too small to make an accurate analysis. The simulate.p.value = T parameter adds Monte Carlo simulation to the test to improve the estimation and get rid of the warning message. However, the best way to get rid of this message is to get a larger sample.

Comparing Categorical and Continuous Variables

Analysis of Variance (ANOVA)

Analysis of Variance (ANOVA) is a test that you can use when you have a categorical variable and a continuous variable. It is a test that considers variability between means for different categories as well as the variability of observations within groups.

There are a wide variety of different extensions of ANOVA that deal with covariance (ANCOVA), multiple variables (MANOVA), and both of those together (MANCOVA). These techniques can become quite complicated and also assume that the values in the continuous variables have a normal distribution.

  • Data: One or more categorical (independent) variables and one continuous (dependent) sampled variable
  • R Function: aov()
  • Null hypothesis (H 0 ): There is no difference in means of the groups defined by each level of the categorical (independent) variable
  • History: Ronald Fisher (1921)
  • Example Question: Do low-, middle- and high-income people vary in the amount of time they spend watching TV?

As an example, we look at the continuous weight variable (WEIGHT2) split into groups by the eight income categories in INCOME2: Is your annual household income from all sources?

  • 1: Less than $10,000
  • 2: $10,000 to less than $15,000
  • 3: $15,000 to less than $20,000
  • 4: $20,000 to less than $25,000
  • 5: $25,000 to less than $35,000
  • 6: $35,000 to less than $50,000
  • 7: $50,000 to less than $75,000
  • 8: $75,000 or more

The barplot() of means does show variation among groups, although there is no clear linear relationship between income and weight.


To test whether this variation could be explained by randomness in the sample, we run the ANOVA test.
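A sketch of the ANOVA call, assuming the Illinois responses are in a data frame illinois with WEIGHT2 and INCOME2 columns (the names are assumptions):

# Treat the income code as a categorical grouping variable
model.aov = aov(WEIGHT2 ~ factor(INCOME2), data = illinois)

# The summary reports the F statistic and p-value for differences between group means
summary(model.aov)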

The low p-value leads us to reject the null hypothesis that there is no difference in the means of the different groups, and corroborates the alternative hypothesis that mean weights differ based on income group.

However, it gives us no clear model for describing that relationship and offers no insights into why income would affect weight, especially in such a nonlinear manner.

Suppose you are performing research into obesity in your city. You take a sample of 30 people in three different neighborhoods (90 people total), collecting information on health and lifestyle. Two variables you collect are height and weight so you can calculate body mass index . Although this index can be misleading for some populations (notably very athletic people), ordinary sedentary people can be classified according to BMI:

Average BMI in the US from 2007-2010 was around 28.6 and rising, standard deviation of around 5 .

You would like to know if there is a difference in BMI between different neighborhoods so you can know whether to target specific neighborhoods or make broader city-wide efforts. Since you have more than two groups, you cannot use a t-test.

Kruskal-Wallis One-Way Analysis of Variance

A somewhat simpler test is the Kruskal-Wallis test, which is a nonparametric analogue to ANOVA for testing the significance of differences between two or more groups.

  • R Function: kruskal.test()
  • Null hypothesis (H 0 ): The samples come from the same distribution.
  • History: William Kruskal and W. Allen Wallis (1952)

For this example, we will investigate whether mean weight varies between the three major US urban states: New York, Illinois, and California.


To test whether this variation could be explained by randomness in the sample, we run the Kruskal-Wallis test.
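A sketch assuming a data frame called states containing the responses for the three states, with a WEIGHT2 column and a STATE grouping column (the names are assumptions):

# Kruskal-Wallis test: do the weight distributions differ between the three states?
kruskal.test(WEIGHT2 ~ STATE, data = states)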

The low p-value leads us to reject the null hypothesis that the samples come from the same distribution. This corroborates the alternative hypothesis that mean weights differ based on state.

A convenient way of visualizing a comparison between continuous and categorical data is with a box plot , which shows the distribution of a continuous variable across different groups:


A percentile is the value below which a given percentage of the values in the distribution fall: the 5th percentile is the value below which five percent of the numbers fall.

The quartiles divide the distribution into four parts. 25% of the numbers are below the first quartile. 75% are below the third quartile. 50% are below the second quartile, making it the median.

Box plots can be used with both sampled data and population data.

The first parameter to the box plot is a formula: the continuous variable as a function of (the tilde) the second variable. A data= parameter can be added if you are using variables in a data frame.
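Using the same assumed states data frame from the Kruskal-Wallis example, the box plot call might look like:

# Distribution of weight for each state, one box per group
boxplot(WEIGHT2 ~ STATE, data = states,
        xlab = "State", ylab = "Weight (pounds)")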

The chi-squared test can be used to determine if two categorical variables are independent of each other.

linear.hypothesis: Test Linear Hypothesis

Description.

  • For a univariate model, an object of class "anova" which contains the residual degrees of freedom in the model, the difference in degrees of freedom, Wald statistic (either "F" or "Chisq" ) and corresponding p value. For a multivariate linear model, an object of class "linear.hypothesis.mlm" , which contains sums-of-squares-and-product matrices for the hypothesis and for error, degrees of freedom for the hypothesis and error, and some other information. The returned object normally would be printed.
  • The transformation matrix can be specified directly via the P argument.
  • A data frame can be provided defining the repeated-measures factor or factors via idata , with default contrasts given by the icontrasts argument. An intra-subject model-matrix is generated from the one-sided formula specified by the idesign argument; columns of the model matrix corresponding to different terms in the intra-subject model must be orthogonal (as is insured by the default contrasts). Note that the contrasts given in icontrasts can be overridden by assigning specific contrasts to the factors in idata . The repeated-measures transformation matrix consists of the columns of the intra-subject model matrix corresponding to the term or terms in iterms . In most instances, this will be the simpler approach, and indeed, most tests of interests can be generated automatically via the Anova function.


Test Linear Hypothesis

Description.

Generic function for testing a linear hypothesis, and methods for linear models, generalized linear models, and other models that have methods for coef and vcov .

Computes either a finite sample F statistic or asymptotic Chi-squared statistic for carrying out a Wald-test-based comparison between a model and a linearly restricted model. The default method will work with any model object for which the coefficient vector can be retrieved by coef and the coefficient-covariance matrix by vcov (otherwise the argument vcov. has to be set explicitly). For computing the F statistic (but not the Chi-squared statistic) a df.residual method needs to be available. If a formula method exists, it is used for pretty printing.

The method for "lm" objects calls the default method, but it changes the default test to "F" , supports the convenience argument white.adjust (for backwards compatibility), and enhances the output by residual sums of squares. For "glm" objects just the default method is called (bypassing the "lm" method).

The function lht also dispatches to linear.hypothesis .

The hypothesis matrix can be supplied as a numeric matrix (or vector), the rows of which specify linear combinations of the model coefficients, which are tested equal to the corresponding entries in the right-hand-side vector, which defaults to a vector of zeroes.

Alternatively, the hypothesis can be specified symbolically as a character vector with one or more elements, each of which gives either a linear combination of coefficients, or a linear equation in the coefficients (i.e., with both a left and right side separated by an equals sign). Components of a linear expression or linear equation can consist of numeric constants, or numeric constants multiplying coefficient names (in which case the number precedes the coefficient, and may be separated from it by spaces or an asterisk); constants of 1 or -1 may be omitted. Spaces are always optional. Components are separated by positive or negative signs. See the examples below.
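The examples referred to above are not reproduced here. As a hedged sketch of both ways of specifying a hypothesis (the model and variable names below are made up for illustration, and the current function name in the car package is linearHypothesis, to which lht and linear.hypothesis dispatch):

library(car)

fit = lm(y ~ x1 + x2, data = mydata)   # assumed example model

# Symbolic form: jointly test that both coefficients are zero
linearHypothesis(fit, c("x1 = 0", "x2 = 0"))

# Symbolic form: test that the two coefficients are equal (x1 - x2 = 0)
linearHypothesis(fit, "x1 = x2")

# Matrix form: rows give linear combinations of (Intercept, x1, x2); rhs defaults to zeroes
linearHypothesis(fit, rbind(c(0, 1, 0), c(0, 0, 1)))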

An object of class "anova" which contains the residual degrees of freedom in the model, the difference in degrees of freedom, Wald statistic (either "F" or "Chisq" ) and corresponding p value.

Achim Zeileis and John Fox [email protected]

Fox, J. (1997) Applied Regression, Linear Models, and Related Methods. Sage.

anova , Anova , waldtest , hccm , vcovHC , vcovHAC , coef , vcov


Statistics LibreTexts

15.5: Hypothesis Tests for Regression Models


  • Danielle Navarro
  • University of New South Wales

So far we’ve talked about what a regression model is, how the coefficients of a regression model are estimated, and how we quantify the performance of the model (the last of these, incidentally, is basically our measure of effect size). The next thing we need to talk about is hypothesis tests. There are two different (but related) kinds of hypothesis tests that we need to talk about: those in which we test whether the regression model as a whole is performing significantly better than a null model; and those in which we test whether a particular regression coefficient is significantly different from zero.

At this point, you’re probably groaning internally, thinking that I’m going to introduce a whole new collection of tests. You’re probably sick of hypothesis tests by now, and don’t want to learn any new ones. Me too. I’m so sick of hypothesis tests that I’m going to shamelessly reuse the F-test from Chapter 14 and the t-test from Chapter 13. In fact, all I’m going to do in this section is show you how those tests are imported wholesale into the regression framework.

Testing the model as a whole

Okay, suppose you’ve estimated your regression model. The first hypothesis test you might want to try is one in which the null hypothesis that there is no relationship between the predictors and the outcome, and the alternative hypothesis is that the data are distributed in exactly the way that the regression model predicts . Formally, our “null model” corresponds to the fairly trivial “regression” model in which we include 0 predictors, and only include the intercept term b 0

\(H_{0}: Y_{i}=b_{0}+\epsilon_{i}\)

If our regression model has K predictors, the “alternative model” is described using the usual formula for a multiple regression model:

\(H_{1}: Y_{i}=\left(\sum_{k=1}^{K} b_{k} X_{i k}\right)+b_{0}+\epsilon_{i}\)

How can we test these two hypotheses against each other? The trick is to understand that just like we did with ANOVA, it’s possible to divide up the total variance SS tot into the sum of the residual variance SS res and the regression model variance SS mod . I’ll skip over the technicalities, since we covered most of them in the ANOVA chapter, and just note that:

\(\mathrm{SS}_{mod}=\mathrm{SS}_{tot}-\mathrm{SS}_{res}\)

And, just like we did with the ANOVA, we can convert the sums of squares into mean squares by dividing by the degrees of freedom.

\(\mathrm{MS}_{m o d}=\dfrac{\mathrm{SS}_{m o d}}{d f_{m o d}}\) \(\mathrm{MS}_{r e s}=\dfrac{\mathrm{SS}_{r e s}}{d f_{r e s}}\)

So, how many degrees of freedom do we have? As you might expect, the df associated with the model is closely tied to the number of predictors that we’ve included. In fact, it turns out that \(df_{mod}=K\). For the residuals, the total degrees of freedom is \(df_{res}=N-K-1\).

This gives us our F-statistic:

\(F=\dfrac{\mathrm{MS}_{mod}}{\mathrm{MS}_{res}}\)

and the degrees of freedom associated with this are K and N−K−1. This F statistic has exactly the same interpretation as the one we introduced in Chapter 14. Large F values indicate that the null hypothesis is performing poorly in comparison to the alternative hypothesis. And since we already did some tedious “do it the long way” calculations back then, I won’t waste your time repeating them. In a moment I’ll show you how to do the test in R the easy way, but first, let’s have a look at the tests for the individual regression coefficients.

Tests for individual coefficients

The F-test that we’ve just introduced is useful for checking that the model as a whole is performing better than chance. This is important: if your regression model doesn’t produce a significant result for the F-test then you probably don’t have a very good regression model (or, quite possibly, you don’t have very good data). However, while failing this test is a pretty strong indicator that the model has problems, passing the test (i.e., rejecting the null) doesn’t imply that the model is good! Why is that, you might be wondering? The answer to that can be found by looking at the coefficients for the regression.2 model:

I can’t help but notice that the estimated regression coefficient for the baby.sleep variable is tiny (0.01), relative to the value that we get for dan.sleep (-8.95). Given that these two variables are absolutely on the same scale (they’re both measured in “hours slept”), I find this suspicious. In fact, I’m beginning to suspect that it’s really only the amount of sleep that I get that matters in order to predict my grumpiness.

Once again, we can reuse a hypothesis test that we discussed earlier, this time the t-test. The test that we’re interested in has a null hypothesis that the true regression coefficient is zero (b=0), which is to be tested against the alternative hypothesis that it isn’t (b≠0). That is:

\(H_{0}: b=0\)

\(H_{1}: b \neq 0\)

How can we test this? Well, if the central limit theorem is kind to us, we might be able to guess that the sampling distribution of \(\ \hat{b}\), the estimated regression coefficient, is a normal distribution with mean centred on b. What that would mean is that if the null hypothesis were true, then the sampling distribution of \(\ \hat{b}\) has mean zero and unknown standard deviation. Assuming that we can come up with a good estimate for the standard error of the regression coefficient, SE (\(\ \hat{b}\)), then we’re in luck. That’s exactly the situation for which we introduced the one-sample t way back in Chapter 13. So let’s define a t-statistic like this,

\(\ t = { \hat{b} \over SE(\hat{b})}\)

I’ll skip over the reasons why, but our degrees of freedom in this case are df=N−K−1. Irritatingly, the estimate of the standard error of the regression coefficient, SE(\(\ \hat{b}\)), is not as easy to calculate as the standard error of the mean that we used for the simpler t-tests in Chapter 13. In fact, the formula is somewhat ugly, and not terribly helpful to look at. For our purposes it’s sufficient to point out that the standard error of the estimated regression coefficient depends on both the predictor and outcome variables, and is somewhat sensitive to violations of the homogeneity of variance assumption (discussed shortly).

In any case, this t-statistic can be interpreted in the same way as the t-statistics that we discussed in Chapter 13. Assuming that you have a two-sided alternative (i.e., you don’t really care if b>0 or b<0), then it’s the extreme values of t (i.e., a lot less than zero or a lot greater than zero) that suggest that you should reject the null hypothesis.

Running the hypothesis tests in R

To compute all of the quantities that we have talked about so far, all you need to do is ask for a summary() of your regression model. Since I’ve been using regression.2 as my example, let’s do that:
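The book's actual code is not reproduced here. A sketch of what it presumably looks like, assuming the book's parenthood data frame with dan.grump as the outcome and dan.sleep and baby.sleep as the predictors:

# Fit the two-predictor regression model and print the full summary
regression.2 = lm(dan.grump ~ dan.sleep + baby.sleep, data = parenthood)
summary(regression.2)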

The output that this command produces is pretty dense, but we’ve already discussed everything of interest in it, so what I’ll do is go through it line by line. The first line reminds us of what the actual regression model is:

You can see why this is handy, since it was a little while back when we actually created the regression.2 model, and so it’s nice to be reminded of what it was we were doing. The next part provides a quick summary of the residuals (i.e., the ϵi values),

which can be convenient as a quick and dirty check that the model is okay. Remember, we did assume that these residuals were normally distributed, with mean 0. In particular it’s worth quickly checking to see if the median is close to zero, and to see if the first quartile is about the same size as the third quartile. If they look badly off, there’s a good chance that the assumptions of regression are violated. These ones look pretty nice to me, so let’s move on to the interesting stuff. The next part of the R output looks at the coefficients of the regression model:

Each row in this table refers to one of the coefficients in the regression model. The first row is the intercept term, and the later ones look at each of the predictors. The columns give you all of the relevant information. The first column is the actual estimate of b (e.g., 125.96 for the intercept, and -8.9 for the dan.sleep predictor). The second column is the standard error estimate \(\ \hat{\sigma_b}\). The third column gives you the t-statistic, and it’s worth noticing that in this table t= \(\ \hat{b}\) /SE(\(\ \hat{b}\)) every time. Finally, the fourth column gives you the actual p value for each of these tests. The only thing that the table itself doesn’t list is the degrees of freedom used in the t-test, which is always N−K−1 and is listed immediately below, in this line:

The value of df=97 is equal to N−K−1, so that’s what we use for our t-tests. In the final part of the output we have the F-test and the \(R^2\) values which assess the performance of the model as a whole.

So in this case, the model performs significantly better than you’d expect by chance (F(2,97)=215.2, p<.001), which isn’t all that surprising: the \(R^2=.812\) value indicates that the regression model accounts for 81.2% of the variability in the outcome measure. However, when we look back up at the t-tests for each of the individual coefficients, we have pretty strong evidence that the baby.sleep variable has no significant effect; all the work is being done by the dan.sleep variable. Taken together, these results suggest that regression.2 is actually the wrong model for the data: you’d probably be better off dropping the baby.sleep predictor entirely. In other words, the regression.1 model that we started with is the better model.
