
Linear Regression


r-statistics.co by Selva Prabhakaran


Linear regression is used to predict the value of an outcome variable $Y$ based on one or more input predictor variables $X$. The aim is to establish a linear relationship (a mathematical formula) between the predictor variable(s) and the response variable, so that we can use this formula to estimate the value of the response $Y$ when only the values of the predictors ($X$s) are known.

Introduction

The aim of linear regression is to model a continuous variable $Y$ as a mathematical function of one or more $X$ variable(s), so that we can use this regression model to predict $Y$ when only $X$ is known. This mathematical equation can be generalized as follows:

$$Y = \beta_{1} + \beta_{2} X + \epsilon$$

where $\beta_{1}$ is the intercept and $\beta_{2}$ is the slope. Collectively, they are called regression coefficients. $\epsilon$ is the error term, the part of $Y$ the regression model is unable to explain.

Example Problem

For this analysis, we will use the cars dataset that comes with R by default. cars is a standard built-in dataset, which makes it convenient to demonstrate linear regression in a simple and easy-to-understand fashion. You can access this dataset simply by typing cars in your R console. You will find that it consists of 50 observations (rows) and 2 variables (columns), dist and speed. Let's print out the first six observations here.
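For example:

    head(cars)  # prints the first six of the 50 observations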

Before we begin building the regression model, it is a good practice to analyze and understand the variables. The graphical analysis and correlation study below will help with this.

Graphical Analysis

The aim of this exercise is to build a simple regression model that we can use to predict Distance (dist) by establishing a statistically significant linear relationship with Speed (speed). But before jumping in to the syntax, let's try to understand these variables graphically. Typically, for each of the independent variables (predictors), the following plots are drawn to visualize their behavior:

  • Scatter plot : Visualize the linear relationship between the predictor and response
  • Box plot : To spot any outlier observations in the variable. Having outliers in your predictor can drastically affect the predictions as they can easily affect the direction/slope of the line of best fit.
  • Density plot : To see the distribution of the predictor variable. Ideally, a close to normal distribution (a bell shaped curve), without being skewed to the left or right is preferred. Let us see how to make each one of them.

Scatter Plot

Scatter plots can help visualize any linear relationships between the dependent (response) variable and independent (predictor) variables. Ideally, if you have multiple predictor variables, a scatter plot is drawn for each one of them against the response, along with the line of best fit, as seen below.
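The plotting code is not shown on this copy of the page; one way to draw it in base R is:

    # scatter plot of dist against speed, with a loess smoothing line
    scatter.smooth(x = cars$speed, y = cars$dist, main = "dist ~ speed")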

The scatter plot along with the smoothing line above suggests a linearly increasing relationship between the ‘dist’ and ‘speed’ variables. This is a good thing, because, one of the underlying assumptions in linear regression is that the relationship between the response and predictor variables is linear and additive.

BoxPlot – Check for outliers

Generally, any datapoint that lies outside the 1.5 × interquartile range ($1.5 \times IQR$) is considered an outlier, where the IQR is calculated as the distance between the 25th percentile and 75th percentile values for that variable.
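A base-R sketch; boxplot.stats()$out lists the points beyond 1.5 × IQR:

    par(mfrow = c(1, 2))   # two plots side by side
    boxplot(cars$speed, main = "Speed")
    boxplot(cars$dist, main = "Distance")
    boxplot.stats(cars$dist)$out   # any outlier values in dist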

Density plot – Check if the response variable is close to normality
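The original code is not shown here; a base-R sketch of the density plots:

    par(mfrow = c(1, 2))
    plot(density(cars$speed), main = "Density: speed")
    plot(density(cars$dist), main = "Density: dist")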

Correlation

Correlation is a statistical measure that suggests the level of linear dependence between two variables that occur in a pair, just like speed and dist here. Correlation can take values between -1 and +1. If for every instance where speed increases the distance also increases along with it, then there is a high positive correlation between them, and therefore the correlation between them will be closer to 1. The opposite is true for an inverse relationship, in which case the correlation between the variables will be close to -1.

A value closer to 0 suggests a weak relationship between the variables. A low correlation (-0.2 < x < 0.2) probably suggests that much of the variation in the response variable ($Y$) is unexplained by the predictor ($X$), in which case we should probably look for better explanatory variables.
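For example:

    cor(cars$speed, cars$dist)   # about 0.81: a strong positive correlation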

Build Linear Model

Now that we have seen the linear relationship pictorially in the scatter plot and by computing the correlation, let's see the syntax for building the linear model. The function used for building linear models is lm(). The lm() function takes in two main arguments: 1. Formula 2. Data. The data is typically a data.frame and the formula is an object of class formula. But the most common convention is to write out the formula directly in place of the argument, as written below.
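Based on the description above, the call is:

    # build the linear model: dist as a function of speed
    linearMod <- lm(dist ~ speed, data = cars)
    print(linearMod)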

Now that we have built the linear model, we also have established the relationship between the predictor and response in the form of a mathematical formula for Distance (dist) as a function of speed. In the above output, you can notice the 'Coefficients' part having two components: Intercept: -17.579, speed: 3.932. These are also called the beta coefficients. In other words, $$dist = Intercept + (\beta \times speed)$$ that is, dist = -17.579 + 3.932 × speed.

Linear Regression Diagnostics

Now the linear model is built and we have a formula that we can use to predict the dist value if a corresponding speed is known. Is this enough to actually use this model? NO! Before using a regression model, you have to ensure that it is statistically significant. How do you ensure this? Let's begin by printing the summary statistics for linearMod.
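For example:

    summary(linearMod)   # coefficients, std. errors, t values, p-values, R-squared, F-statistic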

The p Value: Checking for statistical significance

The summary statistics above tell us a number of things. One of them is the model p-value (bottom last line) and the p-value of individual predictor variables (extreme right column under 'Coefficients'). The p-values are very important because we can consider a linear model to be statistically significant only when both these p-values are less than the pre-determined statistical significance level, which is ideally 0.05. This is visually interpreted by the significance stars at the end of the row. The more stars beside the variable's p-value, the more significant the variable.

Null and alternate hypothesis

Whenever there is a p-value, there is a null and an alternative hypothesis associated with it. In linear regression, the null hypothesis is that the coefficients associated with the variables are equal to zero. The alternative hypothesis is that the coefficients are not equal to zero (i.e. there exists a relationship between the independent variable in question and the dependent variable).

We can interpret the t-value something like this: a larger t-value indicates that it is less likely that the observed coefficient could have arisen purely by chance when the true coefficient is zero. So, the higher the t-value, the better.

Pr(>|t|) or p-value is the probability of obtaining a t-value as extreme as or more extreme than the observed value when the null hypothesis (that the β coefficient is equal to zero, i.e. there is no relationship) is true. So if the Pr(>|t|) is low, the coefficient is significant (significantly different from zero). If the Pr(>|t|) is high, the coefficient is not significant.

What does this mean for us? When the p-value is less than the significance level (< 0.05), we can safely reject the null hypothesis that the coefficient β of the predictor is zero. In our case, linearMod, both these p-values are well below the 0.05 threshold, so we can conclude our model is indeed statistically significant.

It is important for the model to be statistically significant before we can go ahead and use it to predict (or estimate) the dependent variable; otherwise, the confidence in the predicted values from that model is reduced and they may be construed as a chance occurrence.

How to calculate the t Statistic and p-Values?

When the model coefficients and standard errors are known, the formula for calculating the t statistic is as follows: $$t\text{-statistic} = \frac{\beta\text{-coefficient}}{Std.\ Error}$$ The p-value can then be computed from the $t$ distribution with $(n-q)$ degrees of freedom, where $n$ is the number of observations and $q$ is the number of model coefficients.
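A sketch of computing these by hand from the fitted model (the column names are those produced by summary.lm):

    modelSummary <- summary(linearMod)
    modelCoeffs  <- modelSummary$coefficients              # coefficient table
    beta.estimate <- modelCoeffs["speed", "Estimate"]      # slope estimate
    std.error     <- modelCoeffs["speed", "Std. Error"]    # its standard error
    t_value <- beta.estimate / std.error                   # t statistic
    p_value <- 2 * pt(-abs(t_value), df = nrow(cars) - 2)  # two-sided p-value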

R-Squared and Adj R-Squared

The actual information in a dataset is the total variation it contains, remember? What R-Squared tells us is the proportion of variation in the dependent (response) variable that has been explained by this model.

$$ R^{2} = 1 - \frac{SSE}{SST}$$

where $SSE$ is the sum of squared errors given by $SSE = \sum_{i}^{n} \left( y_{i} - \hat{y_{i}} \right)^{2}$ and $SST = \sum_{i}^{n} \left( y_{i} - \bar{y} \right)^{2}$ is the total sum of squares. Here, $\hat{y_{i}}$ is the fitted value for observation $i$ and $\bar{y}$ is the mean of $Y$.

We don't necessarily discard a model based on a low R-Squared value. It's a better practice to look at the AIC and the prediction accuracy on a validation sample when deciding on the efficacy of a model.

Now that's about R-Squared. What about adjusted R-Squared? As you add more X variables to your model, the R-Squared value of the new bigger model will always be greater than that of the smaller subset. This is because, since all the variables in the original model are also present, their contribution to explaining the dependent variable will be present in the super-set as well. Therefore, whatever new variable we add can only add (if not significantly) to the variation that was already explained. It is here that the adjusted R-Squared value comes to help. Adj R-Squared penalizes the R-Squared value for the number of terms (read: predictors) in your model. Therefore, when comparing nested models, it is a good practice to look at the adj-R-squared value over R-squared.

$$ R^{2}_{adj} = 1 - \frac{MSE}{MST}$$

where $MSE$ is the mean squared error given by $MSE = \frac{SSE}{\left( n-q \right)}$ and $MST = \frac{SST}{\left( n-1 \right)}$ is the mean squared total, where $n$ is the number of observations and $q$ is the number of coefficients in the model.

Therefore, by moving around the numerators and denominators, the relationship between $R^{2}$ and $R^{2}_{adj}$ becomes:

$$R^{2}_{adj} = 1 - \left( \frac{\left( 1 - R^{2}\right) \left(n-1\right)}{n-q}\right)$$

Standard Error and F-Statistic

Both the standard error and the F-statistic are measures of goodness of fit.

$$Std. Error = \sqrt{MSE} = \sqrt{\frac{SSE}{n-q}}$$

$$F\text{-statistic} = \frac{MSR}{MSE}$$

where $n$ is the number of observations, $q$ is the number of coefficients and $MSR$ is the mean square regression, calculated as:

$$MSR = \frac{\sum_{i}^{n}\left( \hat{y_{i}} - \bar{y}\right)^{2}}{q-1} = \frac{SST - SSE}{q - 1}$$

AIC and BIC

The Akaike’s information criterion - AIC (Akaike, 1974) and the Bayesian information criterion - BIC (Schwarz, 1978) are measures of the goodness of fit of an estimated statistical model and can also be used for model selection. Both criteria depend on the maximized value of the likelihood function L for the estimated model.

The AIC is defined as:

$$AIC = -2 \ln(L) + 2k$$

where $k$ is the number of model parameters, and the BIC is defined as:

$$BIC = -2 \ln(L) + k \ln(n)$$

where $n$ is the sample size.

For model comparison, the model with the lowest AIC and BIC score is preferred.
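In R, for example:

    AIC(linearMod)   # lower is better
    BIC(linearMod)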

How to know if the model is best fit for your data?

The most common metrics to look at while selecting the model are R-Squared and Adj R-Squared (higher the better), the F-statistic and its p-value, the standard error, AIC and BIC (lower the better), and prediction-error measures such as MSE, MAPE and Min-Max accuracy, all discussed in this article.

Predicting Linear Models

So far we have seen how to build a linear regression model using the whole dataset. If we build it that way, there is no way to tell how the model will perform with new data. So the preferred practice is to split your dataset into an 80:20 sample (training:test), build the model on the 80% sample and then use the model thus built to predict the dependent variable on the test data.

Doing it this way, we will have the model-predicted values for the 20% data (test) as well as the actuals (from the original dataset). By calculating accuracy measures (like min-max accuracy) and error rates (MAPE or MSE), we can find out the prediction accuracy of the model. Now, let's see how to actually do this.

Step 1: Create the training (development) and test (validation) data samples from original data.
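A sketch (the original code box is not shown here; set.seed makes the split reproducible):

    set.seed(100)
    trainingRowIndex <- sample(1:nrow(cars), 0.8 * nrow(cars))  # 80% row indices
    trainingData <- cars[trainingRowIndex, ]   # training data
    testData     <- cars[-trainingRowIndex, ]  # test data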

Step 2: Develop the model on the training data and use it to predict the distance on the test data.

Step 3: Review diagnostic measures.
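A sketch of these two steps, reusing the split above:

    lmMod    <- lm(dist ~ speed, data = trainingData)  # Step 2: fit on training data
    distPred <- predict(lmMod, testData)               # ... and predict on test data
    summary(lmMod)                                     # Step 3: review diagnostics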

From the model summary, the model p-value and the predictor's p-value are less than the significance level, so we know we have a statistically significant model. Also, the R-Sq and Adj R-Sq are comparable to the original model built on the full data.

Step 4: Calculate prediction accuracy and error rates

A simple correlation between the actuals and predicted values can be used as a form of accuracy measure. A higher correlation accuracy implies that the actuals and predicted values have similar directional movement, i.e. when the actual values increase the predicted values also increase, and vice-versa.

Now let's calculate the Min Max accuracy and MAPE: $$MinMaxAccuracy = mean \left( \frac{min\left(actuals, predicteds\right)}{max\left(actuals, predicteds \right)} \right)$$

$$MeanAbsolutePercentageError \ (MAPE) = mean\left( \frac{abs\left(predicteds - actuals\right)}{actuals}\right)$$
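A sketch of these calculations on the test-set predictions:

    actuals_preds <- data.frame(actuals = testData$dist, predicteds = distPred)
    cor(actuals_preds$actuals, actuals_preds$predicteds)   # correlation accuracy
    # min-max accuracy: closer to 1 is better
    mean(apply(actuals_preds, 1, min) / apply(actuals_preds, 1, max))
    # MAPE
    mean(abs(actuals_preds$predicteds - actuals_preds$actuals) / actuals_preds$actuals)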

k-Fold Cross Validation

Suppose the model predicts satisfactorily on the 20% split (test data). Is that enough to believe that your model will perform equally well all the time? It is important to rigorously test the model's performance as much as possible. One way is to ensure that the model equation you have will perform well when it is 'built' on a different subset of training data and predicted on the remaining data.

How do we do this? Split your data into 'k' mutually exclusive random sample portions. Keeping each portion as test data, we build the model on the remaining (k-1 portions) data and calculate the mean squared error of the predictions. This is done for each of the 'k' random sample portions. Then finally, the average of these mean squared errors (for the 'k' portions) is computed. We can use this metric to compare different linear models.
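The original uses a cross-validation helper package; here is a dependency-free sketch of 5-fold CV in base R:

    k <- 5
    set.seed(100)
    folds <- sample(rep(1:k, length.out = nrow(cars)))    # random fold assignment
    cv.mse <- sapply(1:k, function(i) {
      fit  <- lm(dist ~ speed, data = cars[folds != i, ]) # train on k-1 folds
      pred <- predict(fit, cars[folds == i, ])            # predict the held-out fold
      mean((cars$dist[folds == i] - pred)^2)              # fold MSE
    })
    mean(cv.mse)   # cross-validated mean squared error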

In doing this, we need to check two things:

  • If the model’s prediction accuracy isn’t varying too much for any one particular sample, and
  • If the lines of best fit don't vary too much with respect to the slope and level.

In other words, they should be parallel and as close to each other as possible. You can find a more detailed explanation for interpreting the cross validation charts when you learn about advanced linear model building.

In the plot below: are the dashed lines parallel? And are the small and big symbols not over-dispersed for any one particular color?

Where to go from here?

We have covered the basic concepts of linear regression. Besides these, you need to understand that linear regression is based on certain underlying assumptions that must be taken care of, especially when working with multiple $X$s. Once you are familiar with that, the advanced regression models will show you around the various special cases where a different form of regression would be more suitable.



15.5: Hypothesis Tests for Regression Models


  • Danielle Navarro
  • University of New South Wales

So far we’ve talked about what a regression model is, how the coefficients of a regression model are estimated, and how we quantify the performance of the model (the last of these, incidentally, is basically our measure of effect size). The next thing we need to talk about is hypothesis tests. There are two different (but related) kinds of hypothesis tests that we need to talk about: those in which we test whether the regression model as a whole is performing significantly better than a null model; and those in which we test whether a particular regression coefficient is significantly different from zero.

At this point, you’re probably groaning internally, thinking that I’m going to introduce a whole new collection of tests. You’re probably sick of hypothesis tests by now, and don’t want to learn any new ones. Me too. I’m so sick of hypothesis tests that I’m going to shamelessly reuse the F-test from Chapter 14 and the t-test from Chapter 13. In fact, all I’m going to do in this section is show you how those tests are imported wholesale into the regression framework.

Testing the model as a whole

Okay, suppose you've estimated your regression model. The first hypothesis test you might want to try is one in which the null hypothesis is that there is no relationship between the predictors and the outcome, and the alternative hypothesis is that the data are distributed in exactly the way that the regression model predicts. Formally, our "null model" corresponds to the fairly trivial "regression" model in which we include 0 predictors and only include the intercept term \(b_0\):

\(H_0: Y_i = b_0 + \epsilon_i\)

If our regression model has K predictors, the “alternative model” is described using the usual formula for a multiple regression model:

\(H_{1}: Y_{i}=\left(\sum_{k=1}^{K} b_{k} X_{i k}\right)+b_{0}+\epsilon_{i}\)

How can we test these two hypotheses against each other? The trick is to understand that, just like we did with ANOVA, it's possible to divide up the total variance \(\mathrm{SS}_{tot}\) into the sum of the residual variance \(\mathrm{SS}_{res}\) and the regression model variance \(\mathrm{SS}_{mod}\). I'll skip over the technicalities, since we covered most of them in the ANOVA chapter, and just note that:

\(\mathrm{SS}_{mod} = \mathrm{SS}_{tot} - \mathrm{SS}_{res}\)

And, just like we did with the ANOVA, we can convert the sums of squares into mean squares by dividing by the degrees of freedom.

\(\mathrm{MS}_{mod}=\dfrac{\mathrm{SS}_{mod}}{df_{mod}}\) and \(\mathrm{MS}_{res}=\dfrac{\mathrm{SS}_{res}}{df_{res}}\)

So, how many degrees of freedom do we have? As you might expect, the df associated with the model is closely tied to the number of predictors that we've included. In fact, it turns out that \(df_{mod} = K\). For the residuals, the total degrees of freedom is \(df_{res} = N - K - 1\). Now that we have the mean square values, we can calculate an F-statistic like this:

\(F = \dfrac{\mathrm{MS}_{mod}}{\mathrm{MS}_{res}}\)

and the degrees of freedom associated with this are K and N−K−1. This F statistic has exactly the same interpretation as the one we introduced in Chapter 14. Large F values indicate that the null hypothesis is performing poorly in comparison to the alternative hypothesis. And since we already did some tedious “do it the long way” calculations back then, I won’t waste your time repeating them. In a moment I’ll show you how to do the test in R the easy way, but first, let’s have a look at the tests for the individual regression coefficients.

Tests for individual coefficients

The F-test that we’ve just introduced is useful for checking that the model as a whole is performing better than chance. This is important: if your regression model doesn’t produce a significant result for the F-test then you probably don’t have a very good regression model (or, quite possibly, you don’t have very good data). However, while failing this test is a pretty strong indicator that the model has problems, passing the test (i.e., rejecting the null) doesn’t imply that the model is good! Why is that, you might be wondering? The answer to that can be found by looking at the coefficients for the regression.2 model:
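(The regression.2 model is the book's running parenthood example; the call below is our assumption of how it was fit.)

    # assumed from the book's example:
    # regression.2 <- lm(dan.grump ~ dan.sleep + baby.sleep, data = parenthood)
    print(regression.2)   # shows the estimated coefficients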

I can’t help but notice that the estimated regression coefficient for the baby.sleep variable is tiny (0.01), relative to the value that we get for dan.sleep (-8.95). Given that these two variables are absolutely on the same scale (they’re both measured in “hours slept”), I find this suspicious. In fact, I’m beginning to suspect that it’s really only the amount of sleep that I get that matters in order to predict my grumpiness.

Once again, we can reuse a hypothesis test that we discussed earlier, this time the t-test. The test that we're interested in has a null hypothesis that the true regression coefficient is zero (b=0), which is to be tested against the alternative hypothesis that it isn't (b≠0). That is:

\(H_0: b = 0\)

\(H_1: b \neq 0\)

How can we test this? Well, if the central limit theorem is kind to us, we might be able to guess that the sampling distribution of \(\ \hat{b}\), the estimated regression coefficient, is a normal distribution with mean centred on b. What that would mean is that if the null hypothesis were true, then the sampling distribution of \(\ \hat{b}\) has mean zero and unknown standard deviation. Assuming that we can come up with a good estimate for the standard error of the regression coefficient, SE (\(\ \hat{b}\)), then we’re in luck. That’s exactly the situation for which we introduced the one-sample t way back in Chapter 13. So let’s define a t-statistic like this,

\(t = \dfrac{\hat{b}}{SE(\hat{b})}\)

I’ll skip over the reasons why, but our degrees of freedom in this case are df=N−K−1. Irritatingly, the estimate of the standard error of the regression coefficient, SE(\(\ \hat{b}\)), is not as easy to calculate as the standard error of the mean that we used for the simpler t-tests in Chapter 13. In fact, the formula is somewhat ugly, and not terribly helpful to look at. For our purposes it’s sufficient to point out that the standard error of the estimated regression coefficient depends on both the predictor and outcome variables, and is somewhat sensitive to violations of the homogeneity of variance assumption (discussed shortly).

In any case, this t-statistic can be interpreted in the same way as the t-statistics that we discussed in Chapter 13. Assuming that you have a two-sided alternative (i.e., you don’t really care if b>0 or b<0), then it’s the extreme values of t (i.e., a lot less than zero or a lot greater than zero) that suggest that you should reject the null hypothesis.

Running the hypothesis tests in R

To compute all of the quantities that we have talked about so far, all you need to do is ask for a summary() of your regression model. Since I’ve been using regression.2 as my example, let’s do that:
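For example:

    summary(regression.2)   # residual summary, coefficient table, F-test, R-squared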

The output that this command produces is pretty dense, but we’ve already discussed everything of interest in it, so what I’ll do is go through it line by line. The first line reminds us of what the actual regression model is:

You can see why this is handy, since it was a little while back when we actually created the regression.2 model, and so it's nice to be reminded of what it was we were doing. The next part provides a quick summary of the residuals (i.e., the \(\epsilon_i\) values),

which can be convenient as a quick and dirty check that the model is okay. Remember, we did assume that these residuals were normally distributed, with mean 0. In particular it’s worth quickly checking to see if the median is close to zero, and to see if the first quartile is about the same size as the third quartile. If they look badly off, there’s a good chance that the assumptions of regression are violated. These ones look pretty nice to me, so let’s move on to the interesting stuff. The next part of the R output looks at the coefficients of the regression model:

Each row in this table refers to one of the coefficients in the regression model. The first row is the intercept term, and the later ones look at each of the predictors. The columns give you all of the relevant information. The first column is the actual estimate of b (e.g., 125.96 for the intercept, and -8.9 for the dan.sleep predictor). The second column is the standard error estimate \(\hat{\sigma}_b\). The third column gives you the t-statistic, and it's worth noticing that in this table \(t = \hat{b}/SE(\hat{b})\) every time. Finally, the fourth column gives you the actual p value for each of these tests. The only thing that the table itself doesn't list is the degrees of freedom used in the t-test, which is always N−K−1 and is listed immediately below, in this line:

The value of df=97 is equal to N−K−1, so that's what we use for our t-tests. In the final part of the output we have the F-test and the \(R^2\) values which assess the performance of the model as a whole.

So in this case, the model performs significantly better than you'd expect by chance (F(2,97)=215.2, p<.001), which isn't all that surprising: the \(R^2 = .812\) value indicates that the regression model accounts for 81.2% of the variability in the outcome measure. However, when we look back up at the t-tests for each of the individual coefficients, we have pretty strong evidence that the baby.sleep variable has no significant effect; all the work is being done by the dan.sleep variable. Taken together, these results suggest that regression.2 is actually the wrong model for the data: you'd probably be better off dropping the baby.sleep predictor entirely. In other words, the regression.1 model that we started with is the better model.

linearHypothesis: Test Linear Hypothesis

Description

Generic function for testing a linear hypothesis, and methods for linear models, generalized linear models, multivariate linear models, linear and generalized linear mixed-effects models, generalized linear models fit with svyglm in the survey package, robust linear models fit with rlm in the MASS package, and other models that have methods for coef and vcov . For mixed-effects models, the tests are Wald chi-square tests for the fixed effects.

Usage

    lht(model, ...)

    # S3 method for default
    linearHypothesis(model, hypothesis.matrix, rhs=NULL,
        test=c("Chisq", "F"), vcov.=NULL, singular.ok=FALSE, verbose=FALSE,
        coef. = coef(model), suppress.vcov.msg=FALSE, error.df, ...)

    # S3 method for lm
    linearHypothesis(model, hypothesis.matrix, rhs=NULL, test=c("F", "Chisq"),
        vcov.=NULL, white.adjust=c(FALSE, TRUE, "hc3", "hc0", "hc1", "hc2", "hc4"),
        singular.ok=FALSE, ...)

    # S3 method for glm
    linearHypothesis(model, ...)

    # S3 method for lmList
    linearHypothesis(model, ..., vcov.=vcov, coef.=coef)

    # S3 method for nlsList
    linearHypothesis(model, ..., vcov.=vcov, coef.=coef)

    # S3 method for mlm
    linearHypothesis(model, hypothesis.matrix, rhs=NULL, SSPE, V, test, idata,
        icontrasts=c("contr.sum", "contr.poly"), idesign, iterms,
        check.imatrix=TRUE, P=NULL, title="", singular.ok=FALSE, verbose=FALSE, ...)

    # S3 method for polr
    linearHypothesis(model, hypothesis.matrix, rhs=NULL, vcov., verbose=FALSE, ...)

    # S3 method for linearHypothesis.mlm
    print(x, SSP=TRUE, SSPE=SSP, digits=getOption("digits"), ...)

    # S3 method for lme
    linearHypothesis(model, hypothesis.matrix, rhs=NULL, vcov.=NULL,
        singular.ok=FALSE, verbose=FALSE, ...)

    # S3 method for mer
    linearHypothesis(model, hypothesis.matrix, rhs=NULL, vcov.=NULL,
        test=c("Chisq", "F"), singular.ok=FALSE, verbose=FALSE, ...)

    # S3 method for merMod
    linearHypothesis(model, hypothesis.matrix, rhs=NULL, vcov.=NULL,
        test=c("Chisq", "F"), singular.ok=FALSE, verbose=FALSE, ...)

    # S3 method for svyglm
    linearHypothesis(model, ...)

    # S3 method for rlm
    linearHypothesis(model, ...)

    # S3 method for survreg
    linearHypothesis(model, hypothesis.matrix, rhs=NULL,
        test=c("Chisq", "F"), vcov., verbose=FALSE, ...)

    matchCoefs(model, pattern, ...)

    # S3 method for default
    matchCoefs(model, pattern, coef.=coef, ...)

    # S3 method for lme
    matchCoefs(model, pattern, ...)

    # S3 method for mer
    matchCoefs(model, pattern, ...)

    # S3 method for merMod
    matchCoefs(model, pattern, ...)

    # S3 method for mlm
    matchCoefs(model, pattern, ...)

    # S3 method for lmList
    matchCoefs(model, pattern, ...)

Value

For a univariate model, an object of class "anova" which contains the residual degrees of freedom in the model, the difference in degrees of freedom, the Wald statistic (either "F" or "Chisq"), and the corresponding p value. The value of the linear hypothesis and its covariance matrix are returned respectively as "value" and "vcov" attributes of the object (but not printed).

For a multivariate linear model, an object of class "linearHypothesis.mlm", which contains sums-of-squares-and-product matrices for the hypothesis and for error, degrees of freedom for the hypothesis and error, and some other information.

The returned object normally would be printed.

Arguments

model: fitted model object. The default method of linearHypothesis works for models for which the estimated parameters can be retrieved by coef and the corresponding estimated covariance matrix by vcov. See the Details for more information.

hypothesis.matrix: matrix (or vector) giving linear combinations of coefficients by rows, or a character vector giving the hypothesis in symbolic form (see Details).

rhs: right-hand-side vector for hypothesis, with as many entries as rows in the hypothesis matrix; can be omitted, in which case it defaults to a vector of zeroes. For a multivariate linear model, rhs is a matrix, defaulting to 0. This argument isn't available for F-tests for linear mixed models.

singular.ok: if FALSE (the default), a model with aliased coefficients produces an error; if TRUE, the aliased coefficients are ignored, and the hypothesis matrix should not have columns for them. For a multivariate linear model: will return the hypothesis and error SSP matrices even if the latter is singular; useful for computing univariate repeated-measures ANOVAs where there are fewer subjects than df for within-subject effects.

error.df: for the default linearHypothesis method, if an F-test is requested and if error.df is missing, the error degrees of freedom will be computed by applying the df.residual function to the model; if df.residual returns NULL or NA, then a chi-square test will be substituted for the F-test (with a message to that effect).

idata: an optional data frame giving a factor or factors defining the intra-subject model for multivariate repeated-measures data. See Details for an explanation of the intra-subject design and for further explanation of the other arguments relating to intra-subject factors.

icontrasts: names of contrast-generating functions to be applied by default to factors and ordered factors, respectively, in the within-subject "data"; the contrasts must produce an intra-subject model matrix in which different terms are orthogonal.

idesign: a one-sided model formula using the "data" in idata and specifying the intra-subject design.

iterms: the quoted name of a term, or a vector of quoted names of terms, in the intra-subject design to be tested.

check.imatrix: check that columns of the intra-subject model matrix for different terms are mutually orthogonal (default, TRUE). Set to FALSE only if you have already checked that the intra-subject model matrix is block-orthogonal.

P: transformation matrix to be applied to the repeated measures in multivariate repeated-measures data; if NULL and no intra-subject model is specified, no response-transformation is applied; if an intra-subject model is specified via the idata, idesign, and (optionally) icontrasts arguments, then P is generated automatically from the iterms argument.

SSPE: in the linearHypothesis method for mlm objects: optional error sum-of-squares-and-products matrix; if missing, it is computed from the model. In the print method for linearHypothesis.mlm objects: if TRUE, print the sum-of-squares and cross-products matrix for error.

test: character string, "F" or "Chisq", specifying whether to compute the finite-sample F statistic (with approximate F distribution) or the large-sample Chi-squared statistic (with asymptotic Chi-squared distribution). For a multivariate linear model, the multivariate test statistic to report: one or more of "Pillai", "Wilks", "Hotelling-Lawley", or "Roy", with "Pillai" as the default.

title: an optional character string to label the output.

V: inverse of sum of squares and products of the model matrix; if missing it is computed from the model.

vcov.: a function for estimating the covariance matrix of the regression coefficients, e.g., hccm, or an estimated covariance matrix for model. See also white.adjust. For the "lmList" and "nlsList" methods, vcov. must be a function (defaulting to vcov) to be applied to each model in the list.

coef.: a vector of coefficient estimates. The default is to get the coefficient estimates from the model argument, but the user can input any vector of the correct length. For the "lmList" and "nlsList" methods, coef. must be a function (defaulting to coef) to be applied to each model in the list.

white.adjust: logical or character. Convenience interface to hccm (instead of using the argument vcov.). Can be set either to a character value specifying the type argument of hccm or TRUE, in which case "hc3" is used implicitly. The default is FALSE.

verbose: if TRUE, the hypothesis matrix, right-hand-side vector (or matrix), and estimated value of the hypothesis are printed to standard output; if FALSE (the default), the hypothesis is only printed in symbolic form and the value of the hypothesis is not printed.

x: an object produced by linearHypothesis.mlm.

SSP: if TRUE (the default), print the sum-of-squares and cross-products matrix for the hypothesis and the response-transformation matrix.

digits: minimum number of significant digits to print.

pattern: a regular expression to be matched against coefficient names.

suppress.vcov.msg: for internal use by methods that call the default method.

...: arguments to pass down.

Author

Achim Zeileis and John Fox

Details

linearHypothesis computes either a finite-sample F statistic or asymptotic Chi-squared statistic for carrying out a Wald-test-based comparison between a model and a linearly restricted model. The default method will work with any model object for which the coefficient vector can be retrieved by coef and the coefficient-covariance matrix by vcov (otherwise the argument vcov. has to be set explicitly). For computing the F statistic (but not the Chi-squared statistic) a df.residual method needs to be available. If a formula method exists, it is used for pretty printing.

The method for "lm" objects calls the default method, but it changes the default test to "F" , supports the convenience argument white.adjust (for backwards compatibility), and enhances the output by the residual sums of squares. For "glm" objects just the default method is called (bypassing the "lm" method). The "svyglm" method also calls the default method.

Multinomial logit models fit by the multinom function in the nnet package invoke the default method, and the coefficient names are composed from the response-level names and conventional coefficient names, separated by a period ( "." ): see one of the examples below.

The function lht also dispatches to linearHypothesis .

The hypothesis matrix can be supplied as a numeric matrix (or vector), the rows of which specify linear combinations of the model coefficients, which are tested equal to the corresponding entries in the right-hand-side vector, which defaults to a vector of zeroes.

Alternatively, the hypothesis can be specified symbolically as a character vector with one or more elements, each of which gives either a linear combination of coefficients, or a linear equation in the coefficients (i.e., with both a left and right side separated by an equals sign). Components of a linear expression or linear equation can consist of numeric constants, or numeric constants multiplying coefficient names (in which case the number precedes the coefficient, and may be separated from it by spaces or an asterisk); constants of 1 or -1 may be omitted. Spaces are always optional. Components are separated by plus or minus signs. Newlines or tabs in hypotheses will be treated as spaces. See the examples below.

If the user sets the arguments coef. and vcov., then the computations are done without reference to the model argument. This is like assuming that coef. is normally distributed with estimated variance vcov., and linearHypothesis will compute tests on the mean vector for coef. without actually using the model argument.

A linear hypothesis for a multivariate linear model (i.e., an object of class "mlm" ) can optionally include an intra-subject transformation matrix for a repeated-measures design. If the intra-subject transformation is absent (the default), the multivariate test concerns all of the corresponding coefficients for the response variables. There are two ways to specify the transformation matrix for the repeated measures:

The transformation matrix can be specified directly via the P argument.

A data frame can be provided defining the repeated-measures factor or factors via idata, with default contrasts given by the icontrasts argument. An intra-subject model-matrix is generated from the one-sided formula specified by the idesign argument; columns of the model matrix corresponding to different terms in the intra-subject model must be orthogonal (as is insured by the default contrasts). Note that the contrasts given in icontrasts can be overridden by assigning specific contrasts to the factors in idata. The repeated-measures transformation matrix consists of the columns of the intra-subject model matrix corresponding to the term or terms in iterms. In most instances, this will be the simpler approach, and indeed, most tests of interest can be generated automatically via the Anova function.

matchCoefs is a convenience function that can sometimes help in formulating hypotheses; for example matchCoefs(mod, ":") will return the names of all interaction coefficients in the model mod .
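A small illustration with a built-in dataset (the model and variable names here are ours, not from this documentation):

    library(car)
    mod <- lm(mpg ~ hp + wt, data = mtcars)
    # symbolic form: jointly test that both slopes are zero
    linearHypothesis(mod, c("hp = 0", "wt = 0"))
    # a linear equation in the coefficients: are the two slopes equal?
    linearHypothesis(mod, "hp = wt")
    matchCoefs(mod, "hp")   # coefficient names matching the pattern "hp"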

References

Fox, J. (2016) Applied Regression Analysis and Generalized Linear Models, Third Edition. Sage.

Fox, J. and Weisberg, S. (2019) An R Companion to Applied Regression , Third Edition, Sage.

Hand, D. J., and Taylor, C. C. (1987) Multivariate Analysis of Variance and Repeated Measures: A Practical Approach for Behavioural Scientists. Chapman and Hall.

O'Brien, R. G., and Kaiser, M. K. (1985) MANOVA method for analyzing repeated measures designs: An extensive primer. Psychological Bulletin 97 , 316--333.

See Also

anova, Anova, waldtest, hccm, vcovHC, vcovHAC, coef, vcov



Linear Regression in R | A Step-by-Step Guide & Examples

Published on February 25, 2020 by Rebecca Bevans . Revised on June 22, 2023.

Linear regression is a regression model that uses a straight line to describe the relationship between variables . It finds the line of best fit through your data by searching for the value of the regression coefficient(s) that minimizes the total error of the model.

There are two main types of linear regression:

  • Simple linear regression uses only one independent variable
  • Multiple linear regression uses two or more independent variables

In this step-by-step guide, we will walk you through linear regression in R using two sample datasets.

Download the sample datasets to try it yourself.


Table of contents

  • Getting started in R
  • Step 1: Load the data into R
  • Step 2: Make sure your data meet the assumptions
  • Step 3: Perform the linear regression analysis
  • Step 4: Check for homoscedasticity
  • Step 5: Visualize the results with a graph
  • Step 6: Report your results

Start by downloading R and RStudio . Then open RStudio and click on File > New File > R Script .

As we go through each step , you can copy and paste the code from the text boxes directly into your script. To run the code, highlight the lines you want to run and click on the Run button on the top right of the text editor (or press ctrl + enter on the keyboard).

To install the packages you need for the analysis, run this code (you only need to do this once):
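The code box itself is missing from this copy of the page; the guide's analysis appears to use these packages (treat the exact list as an assumption):

    # install the packages used in this guide (assumed list)
    install.packages(c("ggplot2", "dplyr", "broom", "ggpubr"))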

Next, load the packages into your R environment by running this code (you need to do this every time you restart R):
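And then, for example:

    library(ggplot2)   # plotting
    library(dplyr)     # data manipulation
    library(broom)     # tidy model output
    library(ggpubr)    # publication-ready plot helpers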


Follow these four steps for each dataset:

  • In RStudio, go to File > Import dataset  > From Text (base) .
  • Choose the data file you have downloaded ( income.data or heart.data ), and an Import Dataset window pops up.
  • In the Data Frame window, you should see an X (index) column and columns listing the data for each of the variables ( income and happiness or biking , smoking , and heart.disease ).
  • Click on the Import button and the file should appear in your Environment tab on the upper right side of the RStudio screen.

After you’ve loaded the data, check that it has been read in correctly using summary() .
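For example:

    summary(income.data)   # likewise summary(heart.data) for the second dataset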

Simple regression

Because both our variables are quantitative , when we run this function we see a table in our console with a numeric summary of the data. This tells us the minimum, median , mean , and maximum values of the independent variable (income) and dependent variable (happiness):

[Image: Simple linear regression summary output in R]

Multiple regression

Again, because the variables are quantitative, running the code produces a numeric summary of the data for the independent variables (smoking and biking) and the dependent variable (heart disease):

[Image: Multiple regression summary output in R]

We can use R to check that our data meet the four main assumptions for linear regression .

  • Independence of observations (aka no autocorrelation)

Because we only have one independent variable and one dependent variable, we don’t need to test for any hidden relationships among variables.

If you know that you have autocorrelation within variables (i.e. multiple observations of the same test subject), then do not proceed with a simple linear regression! Use a structured model, like a linear mixed-effects model, instead.

To check whether the dependent variable follows a normal distribution , use the hist() function.
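For example:

    hist(income.data$happiness)   # look for a roughly bell-shaped distribution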

[Image: Simple regression histogram]

The observations are roughly bell-shaped (more observations in the middle of the distribution, fewer on the tails), so we can proceed with the linear regression.

The relationship between the independent and dependent variable must be linear. We can test this visually with a scatter plot to see if the distribution of data points could be described with a straight line.

[Image: Simple regression scatter plot]

The relationship looks roughly linear, so we can proceed with the linear model.

  • Homoscedasticity  (aka homogeneity of variance )

This means that the prediction error doesn’t change significantly over the range of prediction of the model. We can test this assumption later, after fitting the linear model.

Use the cor() function to test the relationship between your independent variables and make sure they aren’t too highly correlated.
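For example:

    cor(heart.data$biking, heart.data$smoking)   # returns about 0.015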

When we run this code, the output is 0.015. The correlation between biking and smoking is small (0.015 is only a 1.5% correlation), so we can include both parameters in our model.

Use the hist() function to test whether your dependent variable follows a normal distribution .
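For example:

    hist(heart.data$heart.disease)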

[Image: Multiple regression histogram]

The distribution of observations is roughly bell-shaped, so we can proceed with the linear regression.

We can check this using two scatterplots: one for biking and heart disease, and one for smoking and heart disease.

[Image: Multiple regression scatter plots]

Although the relationship between smoking and heart disease is a bit less clear, it still appears linear. We can proceed with linear regression.

  • Homoscedasticity

We will check this after we make the model.

Now that you’ve determined your data meet the assumptions, you can perform a linear regression analysis to evaluate the relationship between the independent and dependent variables.

Simple regression: income and happiness

Let’s see if there’s a linear relationship between income and happiness in our survey of 500 people with incomes ranging from $15k to $75k, where happiness is measured on a scale of 1 to 10.

To perform a simple linear regression analysis and check the results, you need to run two lines of code. The first line of code makes the linear model, and the second line prints out the summary of the model:
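The two lines look like this (the model name income.happiness.lm reappears later in this guide):

    income.happiness.lm <- lm(happiness ~ income, data = income.data)
    summary(income.happiness.lm)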

The output looks like this:

[Image: Simple regression results]

This output table first presents the model equation, then summarizes the model residuals (see step 4).

The Coefficients section shows:

  • The estimates ( Estimate ) for the model parameters – the value of the y-intercept (in this case 0.204) and the estimated effect of income on happiness (0.713).
  • The standard error of the estimated values ( Std. Error ).
  • The test statistic ( t value , in this case the t statistic ).
  • The p value ( Pr(>| t | ) ), aka the probability of finding the given t statistic if the null hypothesis of no relationship were true.

The final three lines are model diagnostics – the most important thing to note is the p value (here it is 2.2e-16, or almost zero), which will indicate whether the model fits the data well.

From these results, we can say that there is a significant positive relationship between income and happiness ( p value < 0.001), with a 0.713-unit (+/- 0.01) increase in happiness for every unit increase in income.

Multiple regression: biking, smoking, and heart disease

Let’s see if there’s a linear relationship between biking to work, smoking, and heart disease in our imaginary survey of 500 towns. The rates of biking to work range between 1 and 75%, rates of smoking between 0.5 and 30%, and rates of heart disease between 0.5% and 20.5%.

To test the relationship, we first fit a linear model with heart disease as the dependent variable and biking and smoking as the independent variables. Run these two lines of code:
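A sketch of those two lines (the model name heart.disease.lm is our choice, mirroring the simple-regression example):

    heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = heart.data)
    summary(heart.disease.lm)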

[Image: Multiple regression results]

The estimated effect of biking on heart disease is -0.2, while the estimated effect of smoking is 0.178.

This means that for every 1% increase in biking to work, there is a correlated 0.2% decrease in the incidence of heart disease. Meanwhile, for every 1% increase in smoking, there is a 0.178% increase in the rate of heart disease.

The standard errors for these regression coefficients are very small, and the t statistics are very large (-147 and 50.4, respectively). The p values reflect these small errors and large t statistics. For both parameters, there is almost zero probability that this effect is due to chance.

Remember that these data are made up for this example, so in real life these relationships would not be nearly so clear!


Before proceeding with data visualization, we should make sure that our models fit the homoscedasticity assumption of the linear model.

We can run plot(income.happiness.lm) to check whether the observed data meets our model assumptions:

Note that the par(mfrow()) command will divide the Plots window into the number of rows and columns specified in the brackets. So par(mfrow=c(2,2)) divides it up into two rows and two columns. To go back to plotting one graph in the entire window, set the parameters again and replace the (2,2) with (1,1).
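A sketch of the plotting code described above:

    par(mfrow = c(2, 2))        # 2x2 grid for the four diagnostic plots
    plot(income.happiness.lm)
    par(mfrow = c(1, 1))        # back to one plot per window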

These are the residual plots produced by the code:

[Image: Simple regression diagnostic plots from plot(lm)]

Residuals are the unexplained variance . They are not exactly the same as model error, but they are calculated from it, so seeing a bias in the residuals would also indicate a bias in the error.

The most important thing to look for is that the red lines representing the mean of the residuals are all basically horizontal and centered around zero. This means there are no outliers or biases in the data that would make a linear regression invalid.

In the Normal Q-Q plot in the top right, we can see that the real residuals from our model form an almost perfectly one-to-one line with the theoretical residuals from a perfect model.

Based on these residuals, we can say that our model meets the assumption of homoscedasticity.

Again, we should check that our model is actually a good fit for the data, and that we don’t have large variation in the model error, by running this code:
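For the multiple regression model, assuming the heart.disease.lm fit from above:

    par(mfrow = c(2, 2))
    plot(heart.disease.lm)
    par(mfrow = c(1, 1))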

[Image: Multiple regression diagnostic plots from plot(lm)]

As with our simple regression, the residuals show no bias, so we can say our model fits the assumption of homoscedasticity.

Next, we can plot the data and the regression line from our linear regression model so that the results can be shared.

Follow 4 steps to visualize the results of your simple linear regression.

  • Plot the data points on a graph

[Image: Simple regression scatter plot]

  • Add the linear regression line to the plotted data

Add the regression line using geom_smooth() and typing in lm as your method for creating the line. This will add the line of the linear regression as well as the standard error of the estimate (in this case +/- 0.01) as a light grey stripe surrounding the line:
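A ggplot2 sketch covering the plotting steps so far (aesthetic choices are ours):

    income.graph <- ggplot(income.data, aes(x = income, y = happiness)) +
      geom_point() +                          # plot the data points
      geom_smooth(method = "lm", se = TRUE)   # fitted line with standard-error band
    income.graph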

[Image: Simple regression line]

  • Add the equation for the regression line.

[Image: Simple regression equation]

  • Make the graph ready for publication

We can add some style parameters using theme_bw() and making custom labels using labs() .

This produces the finished graph that you can include in your papers:

[Image: Simple linear regression in R graph example]

The visualization step for multiple regression is more difficult than for simple regression, because we now have two predictors. One option is to plot a plane, but these are difficult to read and not often published.

We will try a different method: plotting the relationship between biking and heart disease at different levels of smoking. In this example, smoking will be treated as a factor with three levels, just for the purposes of displaying the relationships in our data.

There are 7 steps to follow.

  • Create a new dataframe with the information needed to plot the model

Use the function expand.grid() to create a dataframe with the parameters you supply; a sketch follows the list below. Within this function we will:

  • Create a sequence from the lowest to the highest value of your observed biking data;
  • Choose the minimum, mean, and maximum values of smoking, in order to make 3 levels of smoking over which to predict rates of heart disease.
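A sketch under those choices (the 30 grid points and the name plotting.data are ours):

    plotting.data <- expand.grid(
      biking = seq(min(heart.data$biking), max(heart.data$biking), length.out = 30),
      smoking = c(min(heart.data$smoking), mean(heart.data$smoking), max(heart.data$smoking)))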

This will not create anything new in your console, but you should see a new data frame appear in the Environment tab. Click on it to view it.

  • Predict the values of heart disease based on your linear model

Next we will save our ‘predicted y’ values as a new column in the dataset we just created.
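For example:

    # predicted heart disease rates at each grid point, from the model above
    plotting.data$predicted.y <- predict(heart.disease.lm, newdata = plotting.data)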

  • Round the smoking numbers to two decimals

This will make the legend easier to read later on.

  • Change the ‘smoking’ variable into a factor

This allows us to plot the interaction between biking and heart disease at each of the three levels of smoking we chose.

  • Plot the original data

[Image: Multiple linear regression scatter plot]

  • Add the regression lines

[Image: Multiple regression lines]

Because this graph has two regression coefficients, the stat_regline_equation() function won’t work here. But if we want to add our regression model to the graph, we can do so like this:
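A sketch of that final plot (text placement, and the b0 placeholder standing in for the fitted intercept, are ours):

    ggplot(heart.data, aes(x = biking, y = heart.disease)) +
      geom_point() +
      geom_line(data = plotting.data,
                aes(x = biking, y = predicted.y,
                    color = as.factor(round(smoking, 2)))) +   # one line per smoking level
      annotate(geom = "text", x = 30, y = 14,
               label = "heart.disease = b0 - 0.2*biking + 0.178*smoking")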

This is the finished graph that you can include in your papers!

In addition to the graph, include a brief statement explaining the results of the regression model.



Advanced Statistics using R


Null hypothesis testing

Null hypothesis testing is a procedure to evaluate the strength of evidence against a null hypothesis. Given/assuming the null hypothesis is true, we evaluate the likelihood of obtaining the observed evidence or more extreme, when the study is on a randomly-selected representative sample. The null hypothesis assumes no difference/relationship/effect in the population from which the sample is selected. The likelihood is measured by a $p$ value. If the $p$ value is small enough, we reject the null. In the significance testing approach of Ronald Fisher , a null hypothesis is rejected on the basis of data that are significantly unlikely if the null is true. However, the null hypothesis is never accepted or proved. This is analogous to a criminal trial: The defendant is assumed to be innocent (null is not rejected) until proven guilty (null is rejected) beyond a reasonable doubt (to a statistically significant degree).

To conduct a typical null hypothesis testing, the following 7 steps can be followed:

  • State the research question
  • State the null and alternative hypotheses based on the research question
  • Select a value for significance level \(\alpha\)
  • Collect or locate data
  • Calculate the test statistic and the p value
  • Make a decision on rejecting or failing to reject the hypothesis
  • Answer the research question

Step 1. State the research question

A hypothesis testing is used to answer a question. Therefore, the first step is to state a research question. For example, a research question could be "Does memory training improve participants' performance on a memory test?" in the ACTIVE study .

Step 2. State the null and alternative hypotheses

Based on the research question, one then forms the null and the alternative hypotheses. For example, to answer the research question in Step 1, we would need to compare the memory test score for two groups of participants, those who receive training and those who do not. Let \(\mu_1\) and \(\mu_2\) be the population means of the two groups.

The null hypothesis \(H_0\) should be a statement about parameter(s), typically, of "no effect" or "no difference":

\[ H_{0}:\;\mu_{1}=\mu_{2}\mbox{ or }\mu_{1}-\mu_{2}=0.\]

The alternative hypothesis \(H_1\) or \(H_a\) is the statement we hope or suspect is true. In this example, we hope the training group has a higher score than the control group, and, therefore, our alternative hypothesis would be

\[ H_{a}:\:\mu_{1}>\mu_{2}\mbox{ or }\mu_{1}-\mu_{2}>0. \]

But note that it is cheating to first look at the data and then frame \(H_a\) to fit what the data show. If we do not have direction firmly in mind in advance, we must use a two-sided alternative (default) hypothesis such that

\[H_{a}:\:\mu_{1} \neq \mu_{2}\mbox{ or }\mu_{1}-\mu_{2} \neq 0.\]

Step 3. Set the significance level \(\alpha\)

Hypothesis testing is a procedure to evaluate the strength of evidence against a null hypothesis. Given that the null hypothesis is true, we calculate the probability of obtaining the observed evidence or more extreme, which is called the \(p\) value. If the \(p\) value is small enough, we reject the null. In practice, a value of 0.05 is considered small, but other values can be used. For example, a group of researchers recently recommended using 0.005 instead (Benjamin et al., 2017). This threshold is called the significance level, often denoted by \(\alpha\), and it should be decided before data analysis. If \(p\leq\alpha\), we reject the null hypothesis; if \(p>\alpha\), we fail to reject the null and the evidence is insufficient to support a conclusion.

Step 4. Collect or locate data

In this step, we can conduct an experiment to collect data or we can use existing data. Note that even when the data already exist, we should not form our hypothesis by peeking at the data.

The ACTIVE study has data on memory training. Therefore, we use the data as an example. The following code gets the data for the training group and the control group. hvltt2 has information on all 4 training groups (memory=1, reasoning=2, speed=3, control=4). Note that we use hvltt2[group==1] to select a subset of data from hvltt2 . This means we want to get the data from hvltt2 when the group value is equal to 1. Similarly, we select the data for the control group.
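The code itself did not survive in this excerpt; the following is a minimal sketch of that step, in which the file name active.csv is an assumption.

```r
# Sketch only: assumes the ACTIVE data are in "active.csv" with variables
# hvltt2 (memory test score) and group (1 = memory, 2 = reasoning,
# 3 = speed, 4 = control)
active <- read.csv("active.csv")
hvltt2 <- active$hvltt2
group  <- active$group

training <- hvltt2[group == 1]   # memory training group
control  <- hvltt2[group == 4]   # control group
```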

Step 5. Calculate the test statistic and the $p$ value

When the null hypothesis is true, the population mean difference is zero (\(\mu_{1}-\mu_{2}=0\)). Based on our data, the observed mean difference for the two groups is \(\bar{x}_{1}-\bar{x}_{2} = 1.54\). To conduct a test, we need to calculate the probability of drawing a random sample with a difference of 1.54 or more extreme when \(H_{0}\) is true. That is,

\[\Pr(\bar{x}_{1}-\bar{x}_{2}\geq1.54|\:\mu_{1}-\mu_{2}=0)=?\]

In obtaining the above probability, we need to know the sampling distribution of \(\bar{x}_{1}-\bar{x}_{2}\), which leads to the $t$ distribution in a $t$ test. We calculate a test statistic

\[t=\frac{\bar{x}_{1}-\bar{x}_{2}}{s}\]

where \(s\) and the distribution of \(t\) need to be decided.

Welch's t test (unpooled two independent sample t test)

Welch's test applies when the two population variances are not assumed to be equal (the two sample sizes may or may not be equal). The \(t\) statistic to test whether the population means are different is calculated as:

\[t=\frac{\bar{x}_{1}-\bar{x}_{2}}{s_{\overline{\Delta}}}\]

\[s_{\overline{\Delta}}=\sqrt{\frac{s_{1}^{2}}{n_{1}}+\frac{s_{2}^{2}}{n_{2}}}.\]

Here, \(s_{1}^{2}\) and \(s_{2}^{2}\) are the unbiased estimators of the variances of the two samples with \(n_{k}\) = number of participants in group \(k\) = 1 or 2. For use in significance testing, the distribution of the test statistic is approximated as an ordinary Student's \(t\) distribution with the degrees of freedom calculated as

\[\mathrm{d.f.}=\frac{(s_{1}^{2}/n_{1}+s_{2}^{2}/n_{2})^{2}}{(s_{1}^{2}/n_{1})^{2}/(n_{1}-1)+(s_{2}^{2}/n_{2})^{2}/(n_{2}-1)}.\]

This is known as the Welch-Satterthwaite equation. The true distribution of the test statistic actually depends (slightly) on the two unknown population variances.

In R, the function t.test() can be used to conduct a $t$ test. The following code conducts the Welch's $t$ test. Note that alternative = "greater" sets the alternative hypothesis. The other options include two.sided and less .
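A sketch of that call, using the training and control vectors created in Step 4:

```r
# Welch's t test is the default for t.test() (var.equal = FALSE)
t.test(training, control, alternative = "greater")
```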

Pooled two independent sample t test

The pooled test applies when the two groups are assumed to have the same population variance. The \(t\) statistic can be calculated as follows:

\[t=\frac{\bar{x}_{1}-\bar{x}_{2}}{s_{p}\cdot\sqrt{\frac{1}{n_{1}}+\frac{1}{n_{2}}}}\]

\[s_{p}=\sqrt{\frac{(n_{1}-1)s_{1}^{2}+(n_{2}-1)s_{2}^{2}}{n_{1}+n_{2}-2}}\]

is an estimator of the pooled standard deviation of the two samples. \(n_{k}-1\) is the degrees of freedom for each group, and the total sample size minus two (\(n_{1}+n_{2}-2\)) is the total number of degrees of freedom, which is used in significance testing.

The pooled two independent sample $t$ test can also be conducted using the t.test() function by setting the option var.equal=T or TRUE .
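For example:

```r
# Pooled (equal-variance) two independent sample t test
t.test(training, control, alternative = "greater", var.equal = TRUE)
```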

Step 6. Make a decision

Based on the \(t\) test, we obtained a \(p\) value of about 2e-06. Since this is smaller than the chosen significance level \(\alpha=0.05\), the null hypothesis is rejected.

Step 7. Answer the research question

Using the ACTIVE data, we tested whether the memory training can improve participants' performance on a memory test. Because we rejected the null hypothesis, we may conclude that the memory training statistically significantly increased the memory test performance.

Remarks on hypothesis testing

  • Hypothesis testing is a confirmatory rather than an exploratory data analysis method. One starts with a hypothesis and then tests whether the collected data support it.
  • The logic of hypothesis testing is - Assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the value that was actually observed?
  • If the null hypothesis is true but one rejects it, one makes a Type I error. If the alternative hypothesis is true but one fails to reject the null hypothesis, one makes a Type II error. Statistical power is the probability of rejecting the null hypothesis when the alternative hypothesis is true.
  • Statistical significance means that the results are unlikely to have occurred by chance, given that the null is true.
  • Statistical significance does not imply practical importance. For example, in comparing two groups, the difference can still be statistically significant even if the difference is tiny.

Effect size

To measure practical importance, an effect size is often reported. For a mean difference, the most commonly used effect size measure is Cohen's \(d\) (Cohen, 1988). Cohen's \(d\) is defined as the difference between two means divided by a standard deviation.

\[d=\frac{\bar{x}_{1}-\bar{x}_{2}}{s_{p}}.\]

Cohen defined \(s_{p}\), the pooled standard deviation, as

\[s_{p}=\sqrt{\frac{(n_{1}-1)s_{1}^{2}+(n_{2}-1)s_{2}^{2}}{n_{1}+n_{2}-2}}.\]

A Cohen's \(d\) around 0.2 is considered small, around 0.5 medium, and 0.8 or above large.

For example, the Cohen's d for the memory training example is 0.25, representing a small effect even though the p-value is small and indicates a statistical significance.
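A hedged sketch of the computation, reusing the training and control vectors from above:

```r
# Cohen's d with the pooled standard deviation
n1 <- length(training); n2 <- length(control)
sp <- sqrt(((n1 - 1) * var(training) + (n2 - 1) * var(control)) /
           (n1 + n2 - 2))
(mean(training) - mean(control)) / sp   # about 0.25 in this example
```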

To cite the book, use: Zhang, Z. & Wang, L. (2017-2022). Advanced statistics using R. Granger, IN: ISDSA Press. https://doi.org/10.35566/advstats. ISBN: 978-1-946728-01-2.

Regression Analysis

by David Gerbing

The Regression() function performs multiple aspects of a complete regression analysis. Abbreviate with reg() . To illustrate, first read the Employee data included as part of lessR . Read into the default lessR data frame d .
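A minimal sketch of that setup, following the lessR naming conventions described above:

```r
library("lessR")
d <- Read("Employee")   # built-in Employee data into the default data frame d
```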

As an option, also read the table of variable labels. Create the table formatted as two columns. The first column is the variable name and the second column is the corresponding variable label. Not all variables need to be entered into the table. The table can be a csv file or an Excel file.

Read the label file into the l data frame, currently the only permitted name. The labels are displayed on both the text and visualization output. Each displayed label consists of the variable name juxtaposed with the corresponding label, as shown in the display of the label file.
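A hedged sketch; the label file name "Employee_lbl" is assumed here to be the one that accompanies the Employee data:

```r
# the labels data frame must be named l
l <- Read("Employee_lbl")
```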

The brief version provides just the basic analysis, what Excel provides, plus a scatterplot with the regression line, which becomes a scatterplot matrix with multiple regression. Because d is the default name of the data frame that contains the variables for analysis, the data parameter that names the input data frame need not be specified. Here, specify Salary as the target or response variable with features, or predictor variables, Years and Pre .
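```r
reg_brief(Salary ~ Years + Pre)
```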

The full output is extensive: Summary of the analysis, estimated model, fit indices, ANOVA, correlation matrix, collinearity analysis, best subset regression, residuals and influence statistics, and prediction intervals. The motivation is to provide virtually all of the information needed for a proper regression analysis.
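```r
reg(Salary ~ Years + Pre)   # full output; reg() abbreviates Regression()
```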

Request a briefer output with the reg_brief() version of the function. Standardize the predictor variables in the model by setting the new_scale parameter to "z" . Plot the residuals as a line connecting each data point to the corresponding point on the regression line as specified with the plot_errors parameter. To also standardize the response variable, set parameter scale_response to TRUE .
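A sketch of those options, assuming the parameter names as described:

```r
reg_brief(Salary ~ Years + Pre, new_scale = "z", plot_errors = TRUE)

# to also standardize the response variable
reg_brief(Salary ~ Years + Pre, new_scale = "z", scale_response = TRUE)
```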

Specify a cross-validation with the kfold parameter. Here, specify three folds. The function automatically creates the training and testing data sets.
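```r
reg(Salary ~ Years + Pre, kfold = 3)   # three-fold cross-validation
```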

The standard output also includes \(R^2_{press}\) , the value of \(R^2\) when applied to new, previously unseen data, a value comparable to the average \(R^2\) on test data.

The output of Regression() can be stored into an R object, here named r . The output object consists of various components that together define the output of a comprehensive regression analysis. R refers to the resulting output structure as a list object.

Entering the name of the object displays the full output, the default output when the output is directed to the R console instead of saving into an R object.

Or, work with the components individually. Use the base R names() function to identify all of the output components. Component names that begin with out_ are part of the standard output. Other components include just data and statistics designed to be input in additional procedures, including R markdown documents.

Here, only display the estimates and their inferential analysis as part of the standard text output.

Here, display the numeric values of the coefficients.
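A hedged sketch of this workflow; the component names follow the out_ naming pattern described above:

```r
r <- reg(Salary ~ Years + Pre)   # save the output into r
names(r)                         # list all output components
r$out_estimates                  # estimates with their inferential analysis
r$coefficients                   # numeric values of the coefficients
```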

An analysis of hundreds or thousands of rows of data can make it difficult to locate a specific prediction interval of interest. To initiate a search for a specific row, first do the regression and request all prediction intervals with parameter pred_rows . Then convert that output to a data frame named dp with base R read.table() . As a data frame, do a standard search for an individual row for a specific prediction interval (see the Subset a Data Frame vignette for directions to subset).

This particular conversion to a data frame requires one more step. One or more spaces in the out_predict output delimit adjacent columns, but the names in this data set are formatted with a comma followed by a space. Use base R sub() to remove the space after the comma before converting to a data frame.
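A rough sketch of that conversion, under the assumption that out_predict is a character vector of output lines:

```r
r <- reg(Salary ~ Years + Pre, pred_rows = "all")
# remove the space after the comma in "Last, First" style names, then parse
txt <- sub(", ", ",", r$out_predict, fixed = TRUE)
dp  <- read.table(text = txt)
```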

Because reg() accomplishes its computations with base R function lm() , lm() parameters can be passed to reg() , which then passes the values to lm() to define the corresponding indicator variables. Here, first use base R function contr.sum() to calculate an effect coding contrast matrix for a categorical variable with three levels, such as the variable Plan in the Employee data set.

Now use the lm() parameter contrasts to define the effect coding for JobSat , passed to reg_brief() . Contrasts only apply to factors, so convert JobSat to an R factor before the regression analysis, a task that should generally be done for all categorical variables in an analysis. Here, designate the order of the levels on output displays such as a bar graph.
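A hedged sketch of those two steps, assuming JobSat has the levels shown:

```r
cnt <- contr.sum(n = 3)                          # effect coding, 3 levels
d$JobSat <- factor(d$JobSat,
                   levels = c("low", "med", "high"))
reg_brief(Salary ~ JobSat,
          contrasts = list(JobSat = cnt))        # passed through to lm()
```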

The \(R^2\) fit statistic compares the sum of the squared errors of the model with the X predictor variables to the sum of squared errors of the null model. The baseline of comparison, the null model, is a model with no X variables, such that the fitted value for each set of X values is the mean of the response variable \(y\). The corresponding intercept is the mean of \(y\), and the standard deviation of the residuals is the standard deviation of \(y\).

The following submits the null model for Salary , and plots the errors. Compare the variability of the residuals to a regression model of Salary with one or more predictor variables. To the extent that the inclusion of one or more predictor variables in the model reduces the variability of the data about the regression line compared to the null model, the model fits the data.

Can also get the null model plot from the lessR function Plot() with the fit parameter set to "null" .
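A sketch of both versions, assuming the fit parameter accepts "null" as described:

```r
reg(Salary ~ 1, plot_errors = TRUE)   # intercept-only (null) model
Plot(Years, Salary, fit = "null")     # same null-model plot via Plot()
```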

The scatterplot is displayed as a bubble plot when both variables consist of fewer than 10 unique integer values. With the bubble plot there is no overprinting of the same point, so the number of values represented by each point is displayed.

Obtain an ANCOVA by entering categorical and continuous variables as predictor variables. For a single categorical variable and a single continuous variable, Regression() displays the regression line for each level of the categorical variable.

The ANCOVA assumes that the slopes for the different levels of the categorical variable are the same for the pairing of the continuous predictor variable and continuous response variable. Visually evaluate this assumption by plotting each separate slope and scatterplot.

Then, if the slopes are not too dissimilar, run the ANCOVA. The categorical variable must be interpretable as a categorical variable, either as an R variable type factor or as a non-numerical type character string. If the categorical variable is coded numerically, convert to a factor , such as d$CatVar <- factor(d$CatVar) which retains the original numerical values as the value labels.
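A hedged sketch using the Employee data, with Gender as the categorical predictor and Years as the covariate:

```r
# visually compare the slope and scatterplot for each group
Plot(Years, Salary, by = Gender, fit = "lm")

# then run the ANCOVA
d$Gender <- factor(d$Gender)
reg(Salary ~ Gender + Years)
```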

The ANCOVA displays the appropriate Type II Sum of Squares in its ANOVA table for properly evaluating the group effect that corresponds to the entered categorical variable. Note that this SS is only displayed for an ANOVA with a single categorical variable and a single covariate.

To do a moderation analysis, specify one of (only) two predictor variables with the parameter mod .
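A sketch of the call as described, with Pre as the candidate moderator:

```r
reg(Salary ~ Years + Pre, mod = Pre)
```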

In this analysis, Pre is not a moderator of the impact of Years on Salary. There is a tendency expressed by the non-parallel lines in the visualization, and an almost significant interaction, but the interaction was not detected at the \(\alpha=0.05\) level.

Logistic Regression

For a model with a binary response variable, \(y\) , specify multiple logistic regression with the usual R formula syntax applied to the lessR function Logit() . The output includes the confusion matrix and various classification fit indices.

Specify additional probability thresholds for classification beyond just the default 0.5 with the prob_cut parameter.
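A generic sketch; the variable names y, x1, and x2 are placeholders for a binary response and its predictors:

```r
Logit(y ~ x1 + x2)                               # default 0.5 threshold
Logit(y ~ x1 + x2, prob_cut = c(0.3, 0.5, 0.7))  # additional thresholds
```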

Categorize Hand size into six bins. Compute the conditional mean of Gender, scored as 0 and 1, at each level of Hand size. Both variables must be numeric. The visualization approximates the form of the sigmoid function from logistic regression. The point (bubble) size depends on the sample size for the corresponding bin.

The parameter Rmd creates an R markdown file, from which the corresponding html document is automatically generated by knitting the various output components together with full interpretation: a new, much more complete form of computer output.
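A hedged sketch of the call; the file name is hypothetical:

```r
reg(Salary ~ Years + Pre, Rmd = "salary_results")
```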

Not run here.

Use the base R help() function to view the full manual for Regression() . Simply enter a question mark followed by the name of the function, or its abbreviation.
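```r
help(Regression)   # full manual
?reg               # equivalent shortcut, works for the abbreviation too
```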

An Introduction to R

6.3 Simple linear modelling

Linear models are one of the most widely used models in statistics and data science. They are often thought of as simple models but they’re very flexible and able to model a wide variety of experimental and survey designs. Many of the statistical approaches you may have used previously (such as linear regression, t -test, ANOVA, ANCOVA etc) can be expressed as a linear model so the good news is that you’re probably already familiar with linear models (albeit indirectly). They also form the foundation of more complicated modelling approaches and are relatively easy to extend to incorporate additional complexity. During this section we’ll learn how to fit some simple linear models using R and cover some of the more common applications. We won’t go into any detail of the underlying linear modelling theory but rather focus on the practicalities of model fitting and R code.

The main function for fitting linear models in R is the lm() function (short for linear model!). The lm() function has many arguments but the most important is the first argument which specifies the model you want to fit using a model formula which typically takes the general form:

response variable ~ explanatory variable(s)

This model formula is simply read as

‘variation in the response variable modelled as a function (~) of the explanatory variable(s)’.

The response variable is also commonly known as the ‘dependent variable’ and the explanatory variables are sometimes referred to as ‘independent variables’ (or less frequently as ‘predictor variables’). There is also an additional term in our model formula which represents the variation in our response variable not explained by our explanatory variables but you don’t need to specify this when using the lm() function.

As mentioned above, many of the statistical ‘tests’ you might have previously used can be expressed as a linear model. For example, if we wanted to perform a bivariate linear regression between a response variable ( y ) and a single continuous explanatory variable ( x ) our model formula would simply be
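```r
y ~ x
```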

On the other hand, if we wanted to use an ANOVA to test whether the group means of a response variable ( y ) were different between a three level factor ( x ) our model formula would look like
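```r
y ~ x
```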

OK, hang on, they both look identical, what gives? In addition to the model formula, the type of linear model you fit is also determined by the type of data in your explanatory variable(s) (i.e. what class of data). If your explanatory variable is continuous then you will fit a bivariate linear regression. If your explanatory variable is a factor (i.e. categorical data) you will fit an ANOVA type model.

You can also increase the complexity of your linear model by including additional explanatory variables in your model formula. For example, if we wanted to fit a two-way ANOVA both of our explanatory variables x and z would need to be factors and separated by a + symbol
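```r
y ~ x + z
```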

If we wanted to perform a factorial ANOVA to identify an interaction between both explanatory variables we would separate our explanatory variables with a : symbol whilst also including our main effects in our model formula
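```r
y ~ x + z + x:z
```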

or by using the equivalent shortcut notation
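```r
y ~ x * z
```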

It’s important that you get comfortable with using model formulae (and we’ve only given the briefest of explanations above) when using the lm() function (and other functions) as it’s remarkably easy to specify a model which is either nonsense or isn’t the model you really wanted to fit. A summary table of various linear model formulae and equivalent R code is given below.

OK, time for an example. The data file smoking.txt summarises the results of a study investigating the possible relationship between mortality rate and smoking across 25 occupational groups in the UK. The variable occupational.group specifies the different occupational groups studied, the risk.group variable indicates the relative risk to lung disease for the various occupational groups and smoking is an index of the average number of cigarettes smoked each day (relative to the number smoked across all occupations). The variable mortality is an index of the death rate from lung cancer in each group (relative to the death rate across all occupational groups). In this data set, the response variable is mortality and the potential explanatory variables are smoking which is numeric and risk.group which is a three level factor. The first thing to do is import our data file using the read.table() function as usual and assign the data to an object called smoke . You can find a link to download these data here .
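A sketch of the import step; the file path and tab delimiter are assumptions, so adjust them to wherever you saved the data:

```r
smoke <- read.table("data/smoking.txt", header = TRUE,
                    sep = "\t", stringsAsFactors = TRUE)
str(smoke)
```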

Next, let’s investigate the relationship between the mortality and smoking variables by plotting a scatter plot. We can use either the ggplot2 package or base R graphics to do this. We’ll use ggplot2 this time and our old friend the ggplot() function.
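```r
library(ggplot2)
ggplot(smoke, aes(x = smoking, y = mortality)) +
  geom_point()
```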

[Figure: scatterplot of mortality index against smoking index]

The plot does suggest that there is a positive relationship between the smoking index and mortality index.

To fit a simple linear model to these data we will use the lm() function and include our model formula mortality ~ smoking and assign the results to an object called smoke_lm .
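```r
smoke_lm <- lm(mortality ~ smoking, data = smoke)
```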

Notice that we have not used the $ notation to specify the variables in our model formula, instead we’ve used the data = smoke argument. Although the $ notation will work (i.e. smoke$mortality ~ smoke$smoking) it will more than likely cause you problems later on and should be avoided. In fact, we would go as far as to suggest that if any function has a data = argument you should always use it. How do you know if a function has a data = argument? Just look in the associated help file.

Perhaps somewhat confusingly (at least at first) it appears that nothing much has happened, you don’t automatically get the voluminous output that you normally get with other statistical packages. In fact, what R does, is store the output of the analysis in what is known as a lm class object (which we have called smoke_lm ) from which you are able to extract exactly what you want using other functions. If you’re brave, you can examine the structure of the smoke_lm model object using the str() function.

To obtain a summary of our analysis we can use the summary() function on our smoke_lm model object.
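```r
summary(smoke_lm)
```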

This shows you everything you need to know about the parameter estimates (intercept and slope), their standard errors and associated t statistics and p values. The estimate for the Intercept suggests that when the relative smoking index is 0 the relative mortality rate is -2.885 ! The p value associated with the intercept tests the null hypothesis that the intercept is equal to zero. As the p value is large we fail to reject this null hypothesis. The smoking parameter estimate ( 1.0875 ) is the estimate of the slope and suggests that for every unit increase in the average number of cigarettes smoked each day the mortality risk index increases by 1.0875. The p value associated with the smoking parameter tests whether the slope of this relationship is equal to zero (i.e. no relationship). As our p value is small we reject this null hypothesis and therefore the slope is different from zero and therefore there is a significant relationship. The summary table also includes other important information such as the coefficient of determination ( R 2 ), adjusted R 2 , F statistic, associated degrees of freedom and p value. This information is a condensed form of an ANOVA table which you can see by using the anova() function.
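```r
anova(smoke_lm)
```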

Now let’s fit another linear model, but this time we will use the risk.group variable as our explanatory variable. Remember the risk.group variable is a factor and so our linear model will be equivalent to an ANOVA type analysis. We will be testing the null hypothesis that there is no difference in the mean mortality rate between the low , medium and high groups. We fit the model in exactly the same way as before.
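```r
smoke_risk_lm <- lm(mortality ~ risk.group, data = smoke)
```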

Again, we can produce an ANOVA table using the anova() function:
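```r
anova(smoke_risk_lm)
```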

The results presented in the ANOVA table suggest that we can reject the null hypothesis (very small p value) and therefore the mean mortality rate index is different between low , medium and high risk groups.

As we did with our first linear model we can also produce a summary of the estimated parameters using the summary() function.
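```r
summary(smoke_risk_lm)
```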

In the summary table the Intercept is set to the first level of risk.group ( high ) as this occurs first alphabetically. Therefore, the estimated mean mortality index for high risk individuals is 135 . The estimates for risk.grouplow and risk.groupmedium are mean differences from the intercept ( high group). So the mortality index for the low group is 135 - 57.83 = 77.17 and for the medium group is 135 - 27.55 = 107.45 . The t values and p values in the summary table are associated with testing specific hypotheses. The p value associated with the intercept tests the null hypothesis that the mean mortality index for the high group is equal to zero. To be honest this is not a particularly meaningful hypothesis to test but we can reject it anyway as we have a very small p value. The p value for the risk.grouplow parameter tests the null hypothesis that the mean difference between high and low risk groups is equal to zero (i.e. there is no difference). Again we reject this null hypothesis and conclude that the means are different between these two groups. Similarly, the p value for risk.groupmedium tests the null hypothesis that the mean difference between high and medium groups is equal to zero which we also reject.

Don’t worry too much if you find the output from the summary() function a little confusing. It takes a bit of practice and experience to be able to make sense of all the numbers. Remember though, the more complicated your model is, the more complicated your interpretation will be. And always remember, a model that you can’t interpret is not worth fitting (most of the time!).

Another approach to interpreting your model output is to plot a graph of your data and then add the fitted model to this plot. Let’s go back to the first linear model we fitted ( smoke_lm ). We can add the fitted line to our previous plot using the ggplot2 package and the geom_smooth geom. We can easily include the standard errors by specifying the se = TRUE argument.
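```r
ggplot(smoke, aes(x = smoking, y = mortality)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)
```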

[Figure: scatterplot with fitted regression line and standard error band]

You can also do this with R’s base graphics. Note though that the fitted line extends beyond the data which is not great practice. If you want to prevent this you can generate predicted values from the model using the predict() function within the range of your data and then add these values to the plot using the lines() function (not shown).
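```r
plot(smoke$smoking, smoke$mortality,
     xlab = "smoking index", ylab = "mortality index")
abline(smoke_lm)   # note: the line extends beyond the range of the data
```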

[Figure: base R scatterplot with fitted regression line]

Before we sit back and relax and admire our model (or go write that high impact paper your supervisor/boss has been harassing you about) our work is not finished. It’s vitally important to check the underlying assumptions of your linear model. Two of the most important assumptions are equal variances (homogeneity of variance) and normality of residuals. To check for equal variances we can construct a graph of residuals versus fitted values. We can do this by first extracting the residuals and fitted values from our model object using the resid() and fitted() functions.
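```r
smoke_res <- resid(smoke_lm)    # residuals
smoke_fit <- fitted(smoke_lm)   # fitted values
```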

And then plot them using ggplot or base R graphics.
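```r
ggplot(mapping = aes(x = smoke_fit, y = smoke_res)) +
  geom_point() +
  geom_hline(yintercept = 0, colour = "red", linetype = "dashed")
```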

[Figure: residuals versus fitted values]

It takes a little practice to interpret these types of graph, but what you are looking for is no pattern or structure in your residuals. What you definitely don’t want to see is the scatter increasing around the zero line (red dashed line) as the fitted values get bigger (this has been described as looking like a trumpet, a wedge of cheese or even a slice of pizza) which would indicate unequal variances (heteroscedasticity).

To check for normality of residuals we can use our old friend the Q-Q plot using the residuals stored in the smoke_res object we created earlier.
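```r
ggplot(mapping = aes(sample = smoke_res)) +
  stat_qq() +
  stat_qq_line()
```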

[Figure: Q-Q plot of the residuals (ggplot2)]

Or the same plot with base graphics.
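```r
qqnorm(smoke_res)
qqline(smoke_res)
```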

[Figure: Q-Q plot of the residuals (base graphics)]

Alternatively, you can get R to do most of the hard work by using the plot() function on the model object smoke_lm. Before we do this we should tell R that we want to plot four graphs in the same plotting window in RStudio using par(mfrow = c(2, 2)). This command splits the plotting window into 2 rows and 2 columns.
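```r
par(mfrow = c(2, 2))
plot(smoke_lm)
```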

[Figure: four diagnostic plots from plot(smoke_lm)]

The first two graphs (top left and top right) are the same residual versus fitted and Q-Q plots we produced before. The third graph (bottom left) is the same as the first but plotted on a different scale (the absolute value of the square root of the standardised residuals) and again you are looking for no pattern or structure in the data points. The fourth graph (bottom right) gives you an indication whether any of your observations are having a large influence (Cook’s distance) on your regression coefficient estimates. Leverage identifies observations which have unusually large values in their explanatory variables.

You can also produce these diagnostic plots using ggplot by installing the package ggfortify and using the autoplot() function.
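```r
# install.packages("ggfortify")   # if not already installed
library(ggfortify)
autoplot(smoke_lm)
```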

[Figure: diagnostic plots produced with autoplot()]

What you do about influential data points or data points with high leverage is up to you. If you would like to examine the effect of removing one of these points on the parameter estimates you can use the update() function. Let’s remove data point 2 (miners, mortality = 116 and smoking = 137) and store the results in a new object called smoke_lm2 . Note, we do this to demonstrate the use of the update() function. You should think long and hard about removing any data point(s) and if you do you should always report this and justify your reasoning.
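```r
smoke_lm2 <- update(smoke_lm, subset = -2)   # refit without data point 2
summary(smoke_lm2)
```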

There are numerous other functions which are useful for producing diagnostic plots. For example, rstandard() and rstudent() return the standardised and studentised residuals. The function dffits() expresses how much an observation influences the associated fitted value, and the function dfbetas() gives the change in the estimated parameters if an observation is excluded, relative to its standard error (intercept is the solid line and slope is the dashed line in the example below). The solid bold line in the same graph represents the Cook’s distance. Examples of how to use these functions are given below.
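```r
rstandard(smoke_lm)   # standardised residuals
rstudent(smoke_lm)    # studentised residuals
dffits(smoke_lm)      # influence of each observation on its fitted value
dfbetas(smoke_lm)     # change in coefficients if an observation is dropped
```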

[Figure: diagnostic plots based on dffits() and dfbetas()]

Life With Data

  • by bprasad26

How to Use the linearHypothesis() Function in R


The linearHypothesis() function is a valuable statistical tool in R programming. It’s provided in the car package and is used to perform hypothesis testing for a linear model’s coefficients.

To fully grasp the utility of linearHypothesis() , we must understand the basic principles of linear regression and hypothesis testing in the context of model fitting.

Understanding Hypothesis Testing in Regression Analysis

In regression analysis, it’s common to perform hypothesis tests on the model’s coefficients to determine whether the predictors are statistically significant. The null hypothesis asserts that the predictor has no effect on the outcome variable, i.e., its coefficient equals zero. Rejecting the null hypothesis (based on a small p-value, usually less than 0.05) suggests that there’s a statistically significant relationship between the predictor and the outcome variable.

The linearHypothesis() Function

linearHypothesis() is a function in R that tests the general linear hypothesis for a model object for which a formula method exists, using a specified test statistic. It allows the user to define a broader set of null hypotheses than just assuming individual coefficients equal to zero.

The linearHypothesis() function can be especially useful for comparing nested models or testing whether a group of variables significantly contributes to the model.

Here’s the basic usage of linearHypothesis() :
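```r
linearHypothesis(model, hypothesis.matrix, rhs = NULL, ...)
```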

In this function:

  • model is the model object for which the linear hypothesis is to be tested.
  • hypothesis.matrix specifies the null hypotheses.
  • rhs is the right-hand side of the linear hypotheses; typically set to 0.
  • ... are additional arguments, such as the test argument to specify the type of test statistic to be used (“F” for F-test, “Chisq” for chi-squared test, etc.).

Installing and Loading the Required Package

linearHypothesis() is part of the car package. If you haven’t installed this package yet, you can do so using the following command:
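```r
install.packages("car")
```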

Once installed, load it into your R environment with the library() function:
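```r
library(car)
```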

Using linearHypothesis() in Practice

Let’s demonstrate the use of linearHypothesis() with a practical example. We’ll use the mtcars dataset that’s built into R. This dataset comprises various car attributes, and we’ll model miles per gallon (mpg) based on horsepower (hp), weight (wt), and the number of cylinders (cyl).

We first fit a linear model using the lm() function:
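```r
model <- lm(mpg ~ hp + wt + cyl, data = mtcars)
summary(model)
```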

Let’s say we want to test the hypothesis that the coefficients for hp and wt are equal to zero. We can set up this hypothesis test using linearHypothesis() :
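```r
linearHypothesis(model, c("hp = 0", "wt = 0"))
```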

This command will output the Residual Sum of Squares (RSS) for the model under the null hypothesis, the RSS for the full model, the test statistic, and the p-value for the test. A low p-value suggests that we should reject the null hypothesis.

Using linearHypothesis() for Testing Nested Models

linearHypothesis() can also be useful for testing nested models, i.e., comparing a simpler model to a more complex one where the simpler model is a special case of the complex one.

For instance, suppose we want to test if both hp and wt can be dropped from our model without a significant loss of fit. We can formulate this as the null hypothesis that the coefficients for hp and wt are simultaneously zero:
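```r
linearHypothesis(model, c("hp = 0", "wt = 0"), test = "F")
```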

This gives a p-value for the F-test of the hypothesis that these coefficients are zero. If the p-value is small, we reject the null hypothesis and conclude that dropping these predictors from the model would significantly degrade the model fit.

Limitations and Considerations

The linearHypothesis() function is a powerful tool for hypothesis testing in the context of model fitting. However, it’s important to consider the limitations and assumptions of this function. The linearHypothesis() function assumes that the errors of the model are normally distributed and have equal variance. Violations of these assumptions can lead to incorrect results.

As with any statistical function, it’s crucial to have a good understanding of your data and the theory behind the statistical methods you’re using.

The linearHypothesis() function in R is a powerful tool for testing linear hypotheses about a model’s coefficients. This function is very flexible and can be used in various scenarios, including testing the significance of individual predictors and comparing nested models.

Understanding and properly using linearHypothesis() can enhance your data analysis capabilities and help you extract meaningful insights from your data.


Analysing Data using Linear Models

Chapter 11 Post-hoc comparisons

11.1 Introduction

Analysis of variance, as we have seen, can be used to test null-hypotheses about overall effects of certain factors (categorical variables) or combinations of factors (moderation). This is done with \(F\) -test statistics, with degrees of freedom that depend on the number of (combinations of) categories. The regression table, with \(t\) -tests in the output, can be used to compute specific contrasts, either the standard contrasts based on dummy coding, or contrasts based on alternative coding schemes.

In the previous chapter, all alternatives for specifying contrasts have been discussed in the context of specific research questions. It is important to make a distinction here between research questions that are posed before the data gathering and analysis, and research questions that pop up during the data analysis. Research questions of the first kind we call a priori (“at the outset”) questions, and questions of the second kind we call post hoc (“after the fact”) questions.

We’ve seen that the whole data analysis approach for inference is based on sampling distributions, for instance the sampling distribution of the \(t\) -statistic given that a population value equals 0. We then look at what \(t\) -value can be deemed large enough to reject the null-hypothesis (or to construct a confidence interval). Such a critical \(t\) -value is chosen in a way that if the null-hypothesis is true, it only happens in a low percentage of cases ( \(\alpha\) ) that we find a \(t\) -value more extreme than this critical value. This helps us to reject the null-hypothesis: we see something extreme that can’t be explained very well by the null-hypothesis.

However, if we look at a linear regression output table, we often see many \(t\) -values: one for the intercept, several for slope coefficients, and if the model includes moderation, we also may see several \(t\) -values for interaction effects. Every single \(t\) -value is computed under the assumption that the null-hypothesis is true, that is, that the actual parameter value (or contrast) is 0 in the population. For every single \(t\) -test, we therefore know that if we were to draw many, many samples, in only \(\alpha\) % of the samples would we find a \(t\) -value more extreme than the critical value (given that the null-hypothesis is true). But if we have, for instance, 6 different \(t\) -values in our output, how large is the probability that any of these 6 \(t\) -values is more extreme than the critical value?

Let’s use a very simple example. Let’s assume we have a relatively large data set, so that the \(t\) -distribution is very close to the normal distribution. When we assume we use two-sided testing with an \(\alpha\) of 5%, we know that the critical values for the null-hypothesis are \(-1.96\) and \(1.96\) . Imagine we have a dependent variable \(Y\) and a categorical independent variable \(X\) that consists of two levels, A and B. If we run a standard linear model on those variables \(Y\) and \(X\) , using dummy coding, we will see two parameters in the regression table: one for reference level A (labelled “(Intercept)”), and one coefficient for the difference between level B and A (labelled “XB”). Suppose that in reality, the population values for these parameters are both 0. That would mean that the two group means are equal to 0. If we were to repeat the study 100 times, each time drawing a new large sample, how many times would the intercept have a \(t\) -value more extreme than \(\pm 1.96\) ? Well, by definition, that frequency would be about 5, because we know that for the \(t\) -distribution, 5% of this distribution has values more extreme than \(\pm 1.96\) . Thus, if the intercept is truly 0 in the population, we will see a significant \(t\) -value in 5% of the samples.

The same is true for the second parameter “XB”: if this value is 0 in the population, then we will see a significant \(t\) -value for this parameter in the output in 5% of the samples.

Both of these events would be Type I errors: the kind of error that you make when you reject the null-hypothesis while it is actually true (see Chapter 2 ).

For any given sample, there can be no Type I error, one Type I error, or two Type I errors. Now, if the probability of a Type I error is 5% for a significant value for “(Intercept)”, and the probability is 5% for a Type I error for “XB”, what then is the probability of at least one Type I error ?

This is a question for probability theory. If we assume that the Type I errors for the intercept and the slope are independent, we can use the binomial distribution (Ch. 3 ) and know that the probability of finding no Type I errors equals

\[P(errors = 0 | \alpha = 0.05) = {2 \choose 0} \times 0.05^0 \times (1-0.05)^2 = 0.95^2 = 0.9025\]

[Note: In case you skipped Chapter 3 , \(4 \choose 2\) is pronounced as “4 choose 2” and it stands for the number of combinations of two elements that you can have when you have four elements. For instance, if you have 4 letters A, B, C and D, then there are 6 possible pairs: AB, AC, AD, BC, BD, and CD. The general case of \(a \choose b\) can be computed in R using choose(a, b) . \(2 \choose 0\) is defined as 1 (there is only one way in which neither of the 2 tests results in a Type I error).]

Therefore, since probabilities sum to 1, we know that the probability of at least one Type I error equals \(1- 0.9025= 0.0975\) .

We see that when we look at two \(t\) -tests in one analysis, the probability of a Type I error is no longer 5%, but almost twice that: 9.75%. This is under the assumption that the \(t\) -tests are independent of each other, which is often not the case. We will discuss what we mean by independent later. For now it suffices to know that the more null-hypotheses you test, the higher the risk of a Type I error.

For instance, if you have a categorical variable \(X\) with not two, but ten different groups, your regression output table will contain ten null-hypothesis tests: one for the intercept (reference category) and nine tests for the difference between the remaining groups and the reference group. In that case, the probability of at least one Type I error, if you perform each test with an \(\alpha\) of 5%, will be

\[1 - P(errors = 0| \alpha = 0.05) = 1 - { 10 \choose 0} \times 0.05^0 \times 0.95^{10} = 1 - 0.95^{10} = 0.4013\]

And all this is only in the situation that you stick to the default output of a regression. Imagine that you not only test the difference between each group and the reference group, but that you also make many other contrasts: the difference between group 2 and group 9, etcetera. If we looked at each possible pair among these ten groups, there would be \({10 \choose 2} = 45\) contrasts and consequently 45 \(t\) -tests. The probability of a Type I error will then be very close to 1: almost certainty!

The problem is further complicated by the fact that tests for all these possible pairs cannot be independent of each other. This is easy to see: if Jim is 5 centimetres taller than Sophie, and Sophie is 4 centimetres taller than Wendy, we already know the difference in height between Jim and Wendy: \(5 + 4 = 9\) centimetres. In other words, if you want to know the contrast between Jim and Wendy, you don’t need a new analysis: the answer is already there in the other two contrasts. Such dependence in the contrasts that you estimate can lead to an even higher probability of a Type I error.

In order to get some grip on this dependence problem, we can use the so-called Bonferroni inequality . In the context of null-hypothesis testing, this states that the probability of at least one Type I error is less than or equal to \(\alpha\) times the number of contrasts \(J\) . This is called the upper bound.

\[P (errors > 0) = 1 - P (errors = 0) \leq J\alpha\]

This inequality is true whether two contrasts are heavily dependent (as in the height example above), only slightly dependent, or not dependent at all. For instance, if you have two contrasts in your output (the intercept and the slope), the probability of at least one Type I error equals 0.0975, but only if we assume these two contrasts are independent. In contrast, if the two contrasts are dependent , we can use the Bonferroni inequality to know that the probability of at least one Type I error is less than or equal to \(0.05 \times 2 = 0.10\) . Thus, if there is dependency, we know that the probability of at least one Type I error is at most 0.10 (it could be less bad).

Note that if \(J\alpha > 1\) , then the upper bound is set equal to 1.

This Bonferroni upper bound can help us take control of the overall probability of making Type I errors. Here we make a distinction between the test-wise Type I error rate, \(\alpha_{TW}\) , and the family-wise Type I error rate, \(\alpha_{FW}\) . Here, \(\alpha_{TW}\) is the probability of a Type I error used for one individual hypothesis test, and \(\alpha_{FW}\) is the probability of at least one Type I error among all tests performed. If we have a series of null-hypothesis tests, and if we want to have an overall probability of at most 5% (i.e., \(\alpha_{FW}=0.05\) ), then we should set the level for any individual test \(\alpha_{TW}\) at \(\alpha_{FW}/J\) . Then we know that the probability of at least one error is 5% or less.

Note that what is true here for null-hypothesis testing is also true for the calculation of confidence intervals. Also note that we should only look at output for which we have research questions. Below we see an example of how to apply these principles.

We use the ChickWeight data available in R. It is a data set on the weight of chicks over time, where the chicks are categorised into four different groups, each on a different feeding regime. Suppose we do a study on diet in chicks with one control group and three experimental groups. For each of these three experimental groups, we want to estimate the difference with the control condition (hence there are three research questions). We perform a regression analysis with dummy coding with the control condition (Diet 3) as the reference group to obtain these three contrasts. For the calculation of the confidence intervals, we want to have a family-wise Type I error rate of 0.05. That means that we need to have a test-wise Type I error rate of 0.05/3 = 0.0167. We therefore need to compute \(100-1.67 = 98.33\) % confidence intervals and we do null-hypothesis testing where we reject the null-hypothesis if \(p < 0.0167\) . The R code would look something like the following:
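```r
# ChickWeight is built into R; Diet is a factor with levels 1-4
chick <- ChickWeight
chick$Diet <- relevel(chick$Diet, ref = "3")   # control diet as reference
chick_lm <- lm(weight ~ Diet, data = chick)
summary(chick_lm)
confint(chick_lm, level = 1 - 0.05/3)          # 98.33% confidence intervals
```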

In the output we see the three contrasts that we need to answer our research questions. We can then report:

"We estimated the difference between each of the three experimental diet with the control diet. In order to control the family-wise Type I error rate and keep it below 5%, we used Bonferroni correction and chose a test-wise significance level of 0.0167 and computed 98.3% confidence intervals. The chicks on Diet 1 had a significantly lower weight than the chicks in the control conditions (Diet 3, \(b = -40.3\) , \(SE = 7.87\) , \(t(574) = -5.12\) , \(p < .001\) , 98.33% CI: -59.2, -21.4). The chicks on Diet 2 also had a lower weight than chicks on Diet 3, although the null-hypothesis could not be rejected ( \(b = -20.3\) , \(SE = 8.95\) , \(t(574) = -2.27\) , \(p = .024\) , 98.33% CI: -41.8, 1.15). The chicks on Diet 4 also had a weight not significantly different from chicks on Diet 3, ( \(b = -7.69\) , \(SE = 8.99\) , \(t(574) = -0.855\) , \(p = .039\) , 98.33% CI: -29.3, 13.9). "

Note that we do not report on the Intercept. Since we had no research question about the average weight of chicks on Diet 3, we ignore those results in the regression table, and divide the desired family-wise error rate by 3 (and not 4).

As we said, there are two kinds of research questions: a priori questions and post hoc questions. A priori questions are questions posed before the data collection. Often they are the whole reason why data were collected in the first place. Post hoc questions are questions posed during data analysis. When analysing data, some findings may strike you and they inspire you to do some more analyses. In order to explain the difference, let’s think of two different scenarios for analysing the ChickWeight data.

In Scenario 1, researchers are interested in the relationship between the diet and the weight of chicks. They see that in different farms, chicks show different mean sizes, and they are also on different diets. The researchers suspect that the weight differences are induced by the different diets, but they are not sure, because there are also many other differences between the farms (differences in climate, chicken breed and daily regime). In order to control for these other differences, the researchers pick one specific farm, they use one particular breed of chicken, and assign the chicks randomly to one of four diets. They reason that if they find differences between the four groups regarding weight, then diet is the factor responsible for those differences. Thus, their research question is: “Are there any differences in mean weight as a function of diet?” They run an ANOVA and find the following results:
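A sketch of that one-way ANOVA:

```r
anova(lm(weight ~ Diet, data = ChickWeight))
```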

They answer their (a priori) research question in the following way.

"We tested the null-hypothesis of equal mean weights in the four diet groups at an alpha of 5%, and found that it could be rejected, \(F(3, 574), p < .001\) . We conclude that diet does have an influence on the mean weight in chicks."

When analysing the data more closely, they also look at the mean weights per group.
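```r
aggregate(weight ~ Diet, data = ChickWeight, FUN = mean)
```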

They are struck by the relatively small difference in mean weight between Diet 4 and Diet 3. They are surprised because they know that one of them contains a lot more protein than the other. They are therefore curious to see whether the difference between Diet 4 and Diet 3 is actually significant. Moreover, they are very keen on finding out which Diets are different from each other, and which Diets are not different from each other. They decide to perform 6 additional \(t\) -tests: one for every possible pair of diets.
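A sketch of those six pairwise tests (uncorrected here; a correction method is the subject of this chapter):

```r
pairwise.t.test(ChickWeight$weight, ChickWeight$Diet,
                p.adjust.method = "none")
```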

In this Scenario I, we see two kinds of research questions: the initial a priori question was whether Diet affects weight, and they answer this question with one F -test. The second question only arose during the analysis of the data. They look at the means, see some kind of pattern and want to know whether all means are different from each other. This follow-up question that arises during data analysis is a post hoc question.

Let’s look at Scenario II. A group of researchers wonder whether they can get chicks to grow more quickly using alternative diets. There is one diet, Diet 3, that is used in most farms across the world. They browse through the scientific literature and find three alternative diets. These alternative diets each have a special ingredient that makes the researchers suspect that they might lead to weight gain in chicks. The objective of their research is to estimate the differences between these three alternative diets and the standard diet, Diet 3. They use one farm and one breed of chicken and assign chicks randomly to one of the four diets. They perform the regression analysis with the dummy coding and reference group Diet 3 as shown above, and find that the differences between the three experimental diets and the standard diet are all negative: they show slower growth rather than faster growth. They report the estimates, the standard errors and the confidence intervals.

In general we can say that a priori questions can be answered with regular alphas and confidence intervals. For instance, if you state that your Type I error rate \(\alpha\) is set at 0.05, then you can use this \(\alpha = 0.05\) for all the individual tests that you perform and all the confidence intervals that you calculate. However, for post hoc questions, where your questions are guided by the results that you see, you should correct the test-wise error rate \(\alpha_{TW}\) in such a way that you control the family-wise error rate \(\alpha_{FW}\) .

Returning to the two scenarios, let’s look at the question whether Diet 4 differs from Diet 3. In Scenario I, this is a post hoc question, where in total you have 6 post hoc questions. You should therefore do the hypothesis test with an alpha of 0.05/6 = 0.0083, and/or compute a 99.17% confidence interval. In contrast, in Scenario II, the same question about Diets 4 and 3 is an a priori question, and can therefore be answered with an \(\alpha=5\) % and/or a 95% confidence interval.

Summarising, for post hoc questions you adjust your test-wise type I error rate, whereas you do not for a priori questions. The reason for this different treatment has to do with the dependency in contrasts that we talked about earlier. It also has to do with the fact that you only have a limited number of model degrees of freedom. In the example of the ChickWeight data, we have four groups, hence we can estimate only four contrasts. In the regression analysis with dummy coding, we see one contrast for the intercept and then three contrasts between the experimental groups and the reference group. Also if we use Helmert contrasts, we will only obtain four estimates in the output. This has to do with the dependency between contrasts: if you know that group A has a mean of 5, group B differs from group A by +2 and group C differs from group A by +3, you don’t need to estimate the difference between B and C any more, because you know that based on these numbers, the difference can only be +1. In other words, the contrast C to B totally depends on the contrast A versus B and A versus C. The next section discusses the dependency problem in more detail.

11.2 Independent (orthogonal) contrasts

Whether two contrasts are dependent is easily determined. Suppose we have \(J\) independent samples (groups), each containing values from a population of normally distributed values (assumption of normality). Each group is assumed to come from a population with the same variance \(\sigma^2_e\) (assumption of equal variance). For the moment also assume that the \(J\) groups have an equal sample size \(n\) . Any group \(j\) will have a mean \(\bar{Y}_j\) . Now imagine two contrasts among these means. The first contrast, \(L1\) , has the weights \(c_{1j}\) , and the second contrast, \(L2\) , has the weights \(c_{2j}\) . Then we know that contrasts \(L1\) and \(L2\) are independent if

\[\sum_{j=1}^J c_{1j}c_{2j}=0\]

Thus, if you have \(J\) independent samples (groups), each of size \(n\) , one can decide if two contrasts are dependent by checking if the products of the weights sum to zero:

\[c_{11}c_{21} + c_{12}c_{22} + \dots + c_{1J}c_{2J} = 0\]

Another word for independent is orthogonal . Two contrasts are said to be orthogonal if the two contrasts are independent. Let’s look at some examples for a situation of four groups: one set of dependent contrasts and a set of orthogonal contrasts. For the first example, we look at default dummy coding. For contrast \(L1\) , we estimate the mean of group 1. Hence

\[L1 = \begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix}\]

Let contrast \(L2\) be the contrast between group 2 and group 1:

\[L2 = \begin{bmatrix} -1 & 1 & 0 & 0 \end{bmatrix}\]

If we calculate the products of the weights, we get:

\[\sum_{j} c_{1j}c_{2j} = 1\times -1 + 0 \times 1 + 0 \times 0 + 0\times 0 = -1\]

So we see that when we use dummy coding, the contrasts are not independent (not orthogonal).

For the second example, we look at Helmert contrasts. Helmert contrasts are independent (orthogonal). The Helmert contrast matrix for four groups looks like

\[L = \begin{bmatrix} \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \\ -1 & 1 & 0 & 0 \\ -\frac{1}{2} & -\frac{1}{2} & 1 & 0 \\ -\frac{1}{3} & -\frac{1}{3} & -\frac{1}{3} & 1 \end{bmatrix}\]

For the first two contrasts, we see that the product of the weights equals zero:

\[\sum_{j} c_{1j}c_{2j} = \frac{1}{4} \times -1 + \frac{1}{4} \times 1 + \frac{1}{4} \times 0 + \frac{1}{4} \times 0 = 0\]

Check for yourself that all four Helmert contrasts are independent of each other.
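As a quick check, the pairwise dot products of the contrast weights can be computed in R; zero off-diagonal entries of \(LL^\top\) confirm that the rows are orthogonal:

```r
L <- rbind(c( 1/4,  1/4,  1/4, 1/4),
           c(-1,    1,    0,   0),
           c(-1/2, -1/2,  1,   0),
           c(-1/3, -1/3, -1/3, 1))
round(L %*% t(L), 10)   # off-diagonal zeros = orthogonal contrasts
```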

11.3 The number of independent contrasts is limited

Earlier we saw that there is only so much information you can gain from a data set. Once you have certain information, asking further questions leads to answers that depend on the answers already available.

This dependency has a bearing on the number of orthogonal comparisons that can be made with \(J\) group means. Given \(J\) independent sample means, there can be, apart from the grand mean, no more than \(J-1\) comparisons, without them being dependent on each other. This means that if you have \(J\) completely independent contrasts for \(J\) group means, it is impossible to find one more comparison which is also orthogonal to the first \(J\) ones.

This implies that if you ask more questions (i.e., ask for more contrasts) you should tread carefully. If you ask more questions, the answers to your questions will not be independent of each other (you are to some extent asking the same thing twice).

As an example, earlier we saw that if you know that group B differs from group A by +2 and group C differs from group A by -3, you don’t need to estimate the difference between B and C any more, because you know that based on these numbers, the difference can only be 5. In other words, the contrast C to B totally depends on the contrasts A versus B and A versus C. You can also see this in the contrast matrix for groups A, B and C below:

\[L = \begin{bmatrix} 1 & 0 & 0 \\ -1 & 1 & 0 \\ -1 & 0 & 1 \\ 0 & -1 & 1 \end{bmatrix} \]

The last contrast depends on both the second and the third contrast. Contrast \(L4\) can be calculated as \(L3 - L2\), doing the calculation element-wise:

\[ \begin{bmatrix} -1 & 0 & 1 \end{bmatrix} - \begin{bmatrix} -1 & 1 & 0 \end{bmatrix} = \begin{bmatrix} 0 & -1 & 1 \end{bmatrix} \]

In other words, \(L4\) is a linear combination (weighted sum) of \(L2\) and \(L3\) : \(L4 = 1\times L3 - 1 \times L2\) . Statistically therefore, contrast \(L4\) is completely redundant given the contrasts \(L2\) and \(L3\) : it doesn’t provide any extra information.

It should however be clear that if you have a research question that can be answered with contrast \(L4\) , it is perfectly valid to make this contrast. However, you should realise that the number of independent research questions is limited. It is a wise idea to limit the number of research questions to the number of contrasts you can make: apart from the grand mean, you should make no more than \(J-1\) comparisons (your regression table should have no more than \(J\) parameters).

These contrasts that you specify belong to the a priori research questions. Good research has a limited number of precisely worded research questions that should be answerable by a limited number of contrasts, usually 2 or 3, sometimes only 1. These can be answered using the regular significance level; in the social and behavioural sciences, \(\alpha\) for each individual test or confidence interval often equals 5%. The follow-up questions that arise only after the initial data analysis (computing means, plotting the data, etc.), however, should be tested with a correction that controls the overall Type I error rate.

11.4 Fishing expeditions

Research and data analysis can sometimes be viewed as a fishing expedition. Imagine you fish the high seas for herring. Given your experience and what colleagues tell you (you browse the scientific literature, so to speak), you choose a specific location where you expect a lot of herring. By choosing this location, you maximise the probability of finding herring. This is analogous to setting up a data collection scheme in which you maximise the probability of finding a statistically significant effect, or maximise the precision of your estimates; in other words, you maximise statistical power (see Chapter 5). However, while fishing in that particular spot for herring, irrespective of whether you actually find herring, you find a lot of other fish and seafood. This is all coincidence, as you never planned to find these kinds of fish and seafood in your nets. The fact that you find a crab in your nets might seem very interesting, but it should never be reported as if you were looking for that crab. You would have found it equally interesting had you caught a lobster, or a seahorse, or a baseball hat. You have to realise that it is pure random sampling error: you hang out your nets and just pick up what’s there by chance.

In research it works the same way: if you do a lot of statistical tests, or compute a large number of confidence intervals, you’re bound to find something that seems interesting but is actually merely random noise due to sampling error. If the family-wise error rate is large, say 60%, then you cannot tell your family and friends ashore that the baseball hat you found is very fascinating. Similarly, in research you have to control the number of Type I errors by adjusting the test-wise error rate in such a way that the family-wise error rate is low.

11.5 Several ways to define your post hoc questions

One question that often arises when we find that a categorical variable has an effect in an ANOVA is where this overall significant effect comes from. For instance, we find that the four diets result in different mean weights in the chicks. This was demonstrated with an \(F\)-test at an \(\alpha\) of 5%. A follow-up question might then be which diets differ from each other. You might then set up contrasts for all \({4 \choose 2} = 6\) possible pairs of the four diets.

Alternatively, you may specify your post hoc questions as simple or more complex contrasts in the same way as for your a priori questions, but now with no limit on how many. For instance, you may ask which alternative diets are significantly different from the standard diet (Diet 3). The number of comparisons is then limited to 3. Additionally, you might ask whether the alternative diets combined (the grand mean of diets 1, 2 and 4) are significantly different from Diet 3.

Be aware, however, that the more comparisons you make, the more severe the correction must be to control the family-wise Type I error rate.

The analyses for the questions that you answer by setting up the entire data collection, and that are thus planned before the data collection (a priori), can be called confirmatory analyses. You would like to confirm the workings of an intervention, or you want to precisely estimate the size of a certain effect. Preferably, the questions that you have are statistically independent of each other, that is, the contrasts that you compute should preferably be orthogonal (independent).

In contrast, the analyses that you do for questions that arise while analysing the data (post hoc) are called exploratory analyses. You explore the data for any interesting patterns. Usually, while exploring data, a couple of questions are not statistically independent. Any interesting findings in these exploratory analyses could then be followed up by confirmatory analyses using a new data collection scheme, purposely set up to confirm the earlier findings. It is important to do that with a new or different sample, since the finding could have resulted from mere sampling error (i.e., a Type I error).

Also be aware of the immoral practice of \(p\)-hacking. \(P\)-hacking, sometimes referred to as selective reporting, is defining your research questions and setting up your analysis (contrasts) in such a way that you obtain as many significant results as possible. With \(p\)-hacking, researchers present their work as if they set out to find all these interesting results, ignoring the fact that they selected the results based on what they saw in the data (post hoc). For instance, their research was set up to find evidence for the workings of medicine A on the alleviation of migraine. Their study included a questionnaire on all sorts of other complaints and daily activities, for the sake of completeness. When analysing the results, they might find that the contrast between medicine A and placebo is not significant for migraine. But when exploring the data further, they find that medicine A was significantly better with regard to blood pressure and the number of walks in the park. A \(p\)-hacker would write up the research as a study of the workings of medicine A on blood pressure and walks in the park. This form of \(p\)-hacking is called cherry-picking: reporting only the statistically significant findings and pretending you never set out to find the other things, which go unreported. Another \(p\)-hacking example would be to make a clever selection of the migraine data after which the effect becomes significant, for instance by filtering out the males in the study. Thus, \(p\)-hacking is the practice of selecting the data or choosing the method of analysis in such a way that the \(p\)-values in the report are as small as possible. The research questions are then changed from exploratory to confirmatory, without informing the reader.

11.6 Controlling the family-wise Type I error rate

There are several strategies that control the number of Type I errors. One is the Bonferroni method, where we adjust the test-wise error rate by dividing the family-wise error rate by the number of comparisons \(k\): \(\alpha_{TW} = \alpha_{FW} / k\). This method is pretty conservative, in that \(\alpha_{TW}\) becomes low with even a couple of comparisons, so that the statistical power to spot differences that also exist in the population becomes very low. The upside is that this method is easy to understand and perform. Alternative ways of addressing the problem are Scheffé’s procedure and the Tukey HSD method. Of these two, Scheffé’s procedure is also relatively conservative (i.e., little statistical power). The logic of the Tukey HSD method is based on the probability that a difference between two group means exceeds a critical value by chance alone. This critical value is called the Honestly Significant Difference (HSD). We fix the probability of finding such a difference (or a larger one) between the group means under the null hypothesis at \(\alpha_{FW}\). The details of the actual method will not be discussed here; interested readers may refer to Wikipedia and the references therein.

11.7 Post-hoc analysis in R

In general, post hoc contrasts can be done in the same way as in the previous chapter: specifying the contrasts in an \(\mathbf{L}\) matrix, taking the inverse and assigning the matrix to the variable in your model. Here, you are therefore limited to the number of levels of a factor: you can only have \(J-1\) new variables, apart from the intercept of 1s. You can then adjust \(\alpha\) yourself using Bonferroni. For instance if you want to have a family-wise type I error rate of 0.05, and you look at two post-hoc contrasts, you can declare a contrast significant if the corresponding \(p\) -value is less than 0.025.
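As a minimal sketch of this approach (the data frame dat, the outcome y and the four-level factor f are hypothetical), using the Helmert matrix from earlier in this chapter:

```r
# hypothetical data: outcome y and four-level factor f in data frame dat
# rows of L: grand-mean weights first, then three contrasts of interest
L <- rbind(c( 1/4,  1/4,  1/4, 1/4),
           c(-1,    1,    0,   0),
           c(-1/2, -1/2,  1,   0),
           c(-1/3, -1/3, -1/3, 1))

# take the inverse and drop the intercept column,
# then assign the resulting coding to the factor
contrasts(dat$f) <- solve(L)[, -1]

# the model now estimates exactly these contrasts
model <- lm(y ~ f, data = dat)
summary(model)
```

With two post hoc contrasts of interest and \(\alpha_{FW} = 0.05\), you would then compare each reported \(p\)-value against 0.025, as described above.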

There are also other options in R to get post hoc contrasts, where you can ask for as many comparisons as you want.

There are two ways in which you can control the overall Type I error rate: either by using an adjusted \(\alpha\) yourself (as above), or adjusting the \(p\) -value. For now we assume you generally want to test at an \(\alpha\) of 5%. But of course this can be any value.

In the first approach, you run the model and the contrast just as you would normally do. If the output contains answers to post hoc questions, you do not use \(\alpha = 0.05\) , but you use 0.05 divided by the number of tests that you inspect: \(\alpha_{TW}= \alpha_{FW}/k\) , with \(k\) being the number of tests you do.

For instance, if the output for a linear model with a factor with four levels contains the comparison of groups 1 and 2, and it applies to an a priori question, you simply report the statistics and conclude significance if \(p < .05\).

If the contrast pertains to a post hoc question and you compare all six possible pairs, you report the usual statistics and conclude significance if the \(p < \frac{0.05}{6}\) .

In the second approach, you change the \(p\)-value itself: you multiply the computed \(p\)-value by the number of comparisons (capping the result at 1) and declare a difference to be significant if the corresponding adjusted \(p\)-value is less than 0.05. As an example, suppose you make six comparisons. Then you multiply the usual \(p\)-values by a factor 6: \(p_{adj} = 6p\). Thus, if you see a \(p\)-value of 0.04, you compute \(p_{adj}\) to be 0.24 and conclude that the contrast is not significant. This is often done in R: the output yields adjusted \(p\)-values. Care should be taken with the confidence intervals: make sure that you know whether these are adjusted 95% confidence intervals or not. If not, you should compute your own. Note that when you use the adjusted \(p\)-values, you should no longer adjust the \(\alpha\). Thus, an adjusted \(p\)-value of 0.24 is not significant because \(p_{adj} > .05\).
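In R, Bonferroni-adjusted \(p\)-values can be obtained with the built-in p.adjust() function. A small sketch with made-up \(p\)-values for six comparisons:

```r
# six raw p-values from six comparisons (illustrative numbers)
p_raw <- c(0.040, 0.004, 0.200, 0.620, 0.010, 0.330)

# Bonferroni: multiply each p-value by the number of tests, capped at 1
p.adjust(p_raw, method = "bonferroni")
# equivalent to: pmin(length(p_raw) * p_raw, 1)
```

The raw \(p\)-value of 0.04 becomes an adjusted value of 0.24, which is not significant at \(\alpha = .05\).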

In this section we will see how to perform post hoc comparisons in two situations: either with only one factor in your model, or when you have two factors in your model.

11.7.1 ANOVA with only one factor

We give an example of an ANOVA post hoc analysis with only one factor, using the data from the four diets. We first run an ANOVA to answer the primary research question whether diet has any effect on weight gain in chicks.
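The chick-diet data used here appear to be R’s built-in ChickWeight data (578 observations across four diets, matching the \(F(3, 574)\) reported below); under that assumption, the ANOVA can be run as:

```r
# one-way ANOVA: does diet affect chick weight?
model <- aov(weight ~ Diet, data = ChickWeight)
summary(model)
```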

Seeing these results, noting that there is indeed a significant effect of diet, a secondary question pops up: “Which pairs of two diets show significant differences?” We answer that by doing a post hoc analysis, where we study each pair of diets, and control Type I error rate using the Bonferroni method. We can do that in the following way:
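A sketch of one way to do this, again assuming the ChickWeight data, with the Diet factor releveled so that the standard diet (Diet 3) comes first (which matches the row and column ordering described below):

```r
# make Diet 3 the reference level, then run all pairwise t-tests
# with Bonferroni-adjusted p-values
ChickWeight$Diet <- relevel(ChickWeight$Diet, ref = "3")
pairwise.t.test(ChickWeight$weight, ChickWeight$Diet,
                p.adjust.method = "bonferroni")
```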

In the output we see six Bonferroni-adjusted \(p\)-values for all six possible pairs. The column and row numbers refer to the levels of the Diet factor: Diet 3, Diet 1 and Diet 2 in the three columns, and Diet 1, Diet 2 and Diet 4 in the three rows. We see that all \(p\)-values are non-significant ( \(p > .05\) ), except for two comparisons: the difference between Diet 3 and Diet 1 is significant, \(p < .001\), as well as the difference between Diet 4 and Diet 1, \(p < .001\).

"An analysis of variance showed that the mean weight was signficantly different for the four diets, \(F(3, 574) = 10.8, p < .001\) . We performed post hoc pair-wise comparisons, for all six possible pairs of diets. A family-wise Type I error rate of 0.05 was used, with Bonferroni correction. The difference between Diet 1 and Diet 3 was significant, and the difference between Diets 4 and 1 was significant. All other differences were not signficantly different from 0. "

11.7.2 ANOVA with two factors and moderation

In the previous subsection we did pair-wise comparisons in a one-way ANOVA (i.e., ANOVA with only one factor). In the previous chapter we also discussed how to set up contrasts in the situation of two factors that are modelled with interaction effects (Ch. 10 ). Let’s return to that example.

In the example, we were only interested in the gender effect at each of the education levels. That means only the last three lines of the output are relevant.

In the simple_slopes() code we used the argument ci.width = 0.95 . That was because we had not yet discussed post hoc analysis, nor the adjustment of \(p\) or \(\alpha\). In case we want to control the Type I error rate, we could use a Bonferroni correction. We should then make the relevant \(p\)-values three times larger than their uncorrected values, because we are interested in three contrasts.

Confidence intervals should also be changed. For that we need to adjust the Type I error rate \(\alpha\) .
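A sketch of what this could look like, assuming the interaction model from the previous chapter is stored in model and that simple_slopes() is the function from the reghelper package:

```r
library(reghelper)  # assumed source of simple_slopes()

# Bonferroni-adjusted confidence level for three contrasts:
# 1 - 0.05/3, i.e. roughly 98.33% intervals
simple_slopes(model, confint = TRUE, ci.width = 1 - 0.05/3)
```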

In the output we see adjusted confidence intervals (note that the \(p\)-values are the original ones). We conclude for our three contrasts that in the “school” group females score 0.31 (adjusted 95% CI: -0.31, 0.94) higher than boys, in the “college” group females score 0.24 (adjusted 95% CI: -0.39, 0.86) higher than boys, and in the “university” group females score 0.89 (adjusted 95% CI: -1.49, -0.28) lower than boys.

The same logic of adjusting \(p\) -values and adjusting confidence intervals can be applied in situations with numeric independent variables.

11.8 Take-away points

Your main research questions are generally very limited in number. If they can be translated into contrasts, we call these a priori contrasts.

Your a priori contrasts can be answered using a pre-set level of significance; in the social and behavioural sciences this is often 5% for \(p\)-values and 95% for confidence intervals. No adjustment is necessary.

This pre-set level of significance, \(\alpha\) , should be set before looking at the data (if possible before the collection of the data).

If you are looking at the data and want to answer specific research questions that arise because of what you see in the data (post hoc), you should use adjusted \(p\) -values and confidence intervals.

There are several ways of adjusting the test-wise \(\alpha\)s to obtain a reasonable family-wise \(\alpha\): Bonferroni is the simplest method but rather conservative (low statistical power). Many alternative methods exist, among them Scheffé’s procedure and the Tukey HSD method.

Key concepts

  • Orthogonality/independence
  • \(p\) -hacking
  • Family-wise Type I error rate
  • Test-wise Type I error rate
  • Bonferroni correction


The Complete Guide: Hypothesis Testing in R

A hypothesis test is a formal statistical test we use to reject or fail to reject some statistical hypothesis.

This tutorial explains how to perform the following hypothesis tests in R:

  • One sample t-test
  • Two sample t-test
  • Paired samples t-test

We can use the t.test() function in R to perform each type of test:
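Its default-method signature in base R’s stats package is:

```r
t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)
```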

  • x, y: The two samples of data.
  • alternative: The alternative hypothesis of the test.
  • mu: The true value of the mean.
  • paired: Whether to perform a paired t-test or not.
  • var.equal: Whether to assume the variances are equal between the samples.
  • conf.level: The confidence level to use.

The following examples show how to use this function in practice.

Example 1: One Sample t-test in R

A one sample t-test is used to test whether or not the mean of a population is equal to some value.

For example, suppose we want to know whether or not the mean weight of a certain species of turtle is equal to 310 pounds. We go out and collect a simple random sample of turtles with the following weights:

Weights : 300, 315, 320, 311, 314, 309, 300, 308, 305, 303, 305, 301, 303

The following code shows how to perform this one sample t-test in R:
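Reconstructing the code from the data above (the vector name weights is my choice):

```r
# sample of turtle weights
weights <- c(300, 315, 320, 311, 314, 309, 300,
             308, 305, 303, 305, 301, 303)

# one sample t-test of H0: mu = 310 (two-sided by default)
t.test(weights, mu = 310)
```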

From the output we can see:

  • t-test statistic: -1.5848
  • degrees of freedom:  12
  • p-value:  0.139
  • 95% confidence interval for true mean:  [303.4236, 311.0379]
  • mean of turtle weights:  307.2308

Since the p-value of the test (0.139) is not less than .05, we fail to reject the null hypothesis.

This means we do not have sufficient evidence to say that the mean weight of this species of turtle is different from 310 pounds.

Example 2: Two Sample t-test in R

A two sample t-test is used to test whether or not the means of two populations are equal.

For example, suppose we want to know whether or not the mean weight between two different species of turtles is equal. To test this, we collect a simple random sample of turtles from each species with the following weights:

Sample 1 : 300, 315, 320, 311, 314, 309, 300, 308, 305, 303, 305, 301, 303

Sample 2 : 335, 329, 322, 321, 324, 319, 304, 308, 305, 311, 307, 300, 305

The following code shows how to perform this two sample t-test in R:
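Reconstructing the code from the two samples above (vector names are my choice; R’s default var.equal = FALSE gives the Welch test, consistent with the fractional degrees of freedom below):

```r
# weights for the two turtle species
sample1 <- c(300, 315, 320, 311, 314, 309, 300,
             308, 305, 303, 305, 301, 303)
sample2 <- c(335, 329, 322, 321, 324, 319, 304,
             308, 305, 311, 307, 300, 305)

# two sample (Welch) t-test of H0: equal means
t.test(sample1, sample2)
```

From the output we can see: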

  • t-test statistic: -2.1009
  • degrees of freedom:  19.112
  • p-value:  0.04914
  • 95% confidence interval for true mean difference: [-14.74, -0.03]
  • mean of sample 1 weights: 307.2308
  • mean of sample 2 weights:  314.6154

Since the p-value of the test (0.04914) is less than .05, we reject the null hypothesis.

This means we have sufficient evidence to say that the mean weight between the two species is not equal.

Example 3: Paired Samples t-test in R

A paired samples t-test is used to compare the means of two samples when each observation in one sample can be paired with an observation in the other sample.

For example, suppose we want to know whether or not a certain training program is able to increase the max vertical jump (in inches) of basketball players.

To test this, we may recruit a simple random sample of 12 college basketball players and measure each of their max vertical jumps. Then, we may have each player use the training program for one month and then measure their max vertical jump again at the end of the month.

The following data shows the max jump height (in inches) before and after using the training program for each player:

Before : 22, 24, 20, 19, 19, 20, 22, 25, 24, 23, 22, 21

After : 23, 25, 20, 24, 18, 22, 23, 28, 24, 25, 24, 20

The following code shows how to perform this paired samples t-test in R:
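Reconstructing the code from the data above (vector names are my choice):

```r
# max vertical jump before and after the training program
before <- c(22, 24, 20, 19, 19, 20, 22, 25, 24, 23, 22, 21)
after  <- c(23, 25, 20, 24, 18, 22, 23, 28, 24, 25, 24, 20)

# paired samples t-test of H0: mean difference = 0
t.test(before, after, paired = TRUE)
```

From the output we can see: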

  • t-test statistic: -2.5289
  • degrees of freedom:  11
  • p-value:  0.02803
  • 95% confidence interval for true mean difference: [-2.34, -0.16]
  • mean difference between before and after: -1.25

Since the p-value of the test (0.02803) is less than .05, we reject the null hypothesis.

This means we have sufficient evidence to say that the mean jump height before and after using the training program is not equal.

Additional Resources

Use the following online calculators to automatically perform various t-tests:

  • One Sample t-test Calculator
  • Two Sample t-test Calculator
  • Paired Samples t-test Calculator


Perform Hypothesis Test for a Regression Model, Given R Squared

If you run into a problem, usually in an academic setting, where you only know the multiple coefficient of determination, \(R^2\), and are asked to test whether the beta coefficients are non-zero, you can do this easily using Excel. You could also do it in StatCrunch using the Data > Compute tool, but I find that tedious compared to just building the solution in Excel. And you can save the Excel solution for later reuse if you label and name it in a smart way.

Consider this problem:

Researchers want to use an analysis of social media to forecast the number of viewers, and thus ad revenues, for new TV series. They collected data on the pilot episodes for 33 series. The data included the number of times per minute a series was mentioned in the 24 hours after the pilots aired. It also included an analysis of the sentiment index of the mentions, i.e., the ratio of positive to negative mentions. \(R^2\) for the first-order regression model they produced is 0.937; \(R^2_a = 0.933\).

Test the model to see if it might be useful in forecasting ad revenues for a new TV series.

  • A first-order model consists of terms for quantitative independent variables. Because we have two independent variables, the model will be of the form:

\[E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2\]

  • Each \(\beta\) represents the slope of the line relating \(y\) to an \(x\)-term when all the other \(x\)-terms are held fixed. For example, if \(x_1\) is the mention rate, then \(\beta_1\) represents the change in revenue, \(y\), for every 1-unit increase in the mention rate, holding the sentiment index, \(x_2\), constant.
  • There are two \(x\) terms in the model; thus, \(k = 2\).
  • \(R^2\) is the multiple coefficient of determination. It represents the fraction of the variation in the dependent variable \(y\) explained by the regression equation, the least squares prediction.
  • \(R^2_a\) provides an adjustment to \(R^2\) that takes into account the sample size and the number of predictors in the model. The adjustment increases \(R^2\) only if an added predictor has a strong correlation to \(y\), and decreases it if the added predictor has a weak correlation to \(y\). Generally, you should use the adjusted \(R^2\) to describe the predictive capability of the model. In this problem, I would report that the model explains 93.3% of the variation in \(y\).
  • The null hypothesis for regression models is that the slope coefficients are 0: \(H_0: \beta_1 = \beta_2 = 0\).
  • The alternative hypothesis is that at least one slope coefficient is non-zero: \(H_a:\) at least one \(\beta_i \neq 0\).

The test statistic for this hypothesis test is F:

\[F = \frac{R^2/k}{(1 - R^2)/[n-(k+1)]}\]

  • This global test for model usefulness is always a right-tail test, where the rejection region is \(F > F_\alpha\).
  • The probability distribution for the \(F\) statistic is the \(F\) distribution. It is defined by two factors: the numerator degrees of freedom, \(v_1\), and the denominator degrees of freedom, \(v_2\). I discuss the F-distribution here.
  • For our purposes in this problem, \(v_1 = k\) and \(v_2 = n-(k+1)\).

Here is the Excel solution:

[Figure: Excel worksheet computing \(v_1\), \(v_2\), and the \(F\) statistic from \(R^2\), \(k\), and \(n\)]
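For readers without Excel, the same computation is straightforward in R (a sketch using the values from the problem):

```r
R2 <- 0.937                 # multiple coefficient of determination
n  <- 33                    # number of pilot episodes
k  <- 2                     # number of predictors

v1 <- k                     # numerator degrees of freedom
v2 <- n - (k + 1)           # denominator degrees of freedom

F_stat <- (R2 / v1) / ((1 - R2) / v2)
F_stat                                  # about 223

qf(0.99, v1, v2)                        # critical F at alpha = 0.01, about 5.39
pf(F_stat, v1, v2, lower.tail = FALSE)  # p-value, essentially 0
```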

  • Although you can use the Data > Compute tool in StatCrunch to calculate the F-statistic, I think using Excel for that purpose is easier. Once you have the F-statistic, we can use the StatCrunch F calculator to double-check and draw a sketch, though in this case the F-statistic is so far to the right that you cannot see it (note I added a bit of red color so you can see the rejection area).
  • The first image shows StatCrunch found a right-tail critical value of F to be 5.39 for an alpha of 0.01. The second shows the p-value for the test statistic is 0.

[Figures: StatCrunch F calculator showing the critical value (first image) and the p-value of the test statistic (second image)]

  • Because the p-value in both methods is essentially 0, the decision is to reject the null hypothesis that all the slopes are zero. We can conclude that at least one \(\beta\) is non-zero and the model is statistically useful.
  • Importantly, this does not mean the model is the best model; there may be another model that produces better estimates and predictions.
  • If the global test indicates the model is useful, then tests of one or more of the individual \(\beta\) parameters could be performed. If you are given the results of these individual t-tests, interpret them in a similar fashion. See here for how to test individual betas using regression output.

