12 Interesting Linear Regression Project Ideas & Topics For Beginners [2024]

As someone deeply involved in the field, I can attest to the popularity of linear regression in machine learning. It’s a supervised learning algorithm that finds applications in many sectors. If you’re interested in this topic and want to test your skills, trying out a few linear regression projects is a good idea. In this article, I’ll discuss several of them.

Here, I’ve listed linear regression project ideas for different skill levels and domains so that you can choose one according to your expertise and interests. You can also adjust the difficulty of any project mentioned here by increasing or decreasing the amount of data in your data set. Let’s get into the details now!

What is Linear Regression?

Linear regression is a supervised learning algorithm in machine learning. It predicts a target value from one or more independent variables and helps quantify the relationship between those variables and the prediction. Regression models differ in the kind of relationship they assume between the independent and dependent variables and in the number of independent variables they use.

Linear regression predicts the dependent value (y) according to the independent variable (x). The output here is the dependent value, and the input is the independent value. The hypothesis function for linear regression is the following:

y = θ1 + θ2x

The linear regression model finds the best line, which predicts the value of y according to the provided value of x. To get the best line, it finds the most suitable values for θ1 and θ2. θ1 is the intercept, and θ2 is the coefficient of x. When we find the best values for θ1 and θ2, we find the best line for our linear regression as well.
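
As a minimal sketch of how the best line can be found, the snippet below computes the least-squares intercept and coefficient of x in plain Python. The function name fit_line and the data points are made up for illustration; least squares is the usual criterion for "best", but the article itself does not name one.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up data that roughly follows y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
intercept, slope = fit_line(xs, ys)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

For this toy data the fitted values come out close to the intercept of 1 and coefficient of 2 that were used to make it up.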

Linear regression studies the relationship between quantitative variables. Students benefit from knowing the fundamentals of statistics, irrespective of their career plans. Linear regression projects help students sharpen their analytical thinking, and implementing these project ideas in Python teaches various aspects of linear regression that are useful in their careers.

Types of Linear Regression:

Linear regression is commonly divided into two types: simple linear regression and multiple linear regression. Let’s discuss these types in detail.

1. Simple Linear Regression:

It shows the relationship between a single independent variable and the corresponding output, or dependent variable. This relationship can be expressed as y = b0 + b1x + e.

Here, ‘y’ is the dependent variable or output, and ‘x’ is the independent variable or input. The constants b0 and b1 denote the intercept and the coefficient of x, respectively, and ‘e’ is the error term. This equation can be plotted on a graph for further analysis.

2. Multiple Linear Regression:

It determines the relationship between two or more independent variables (or inputs) and the corresponding dependent variable (or output). The independent variables can be either categorical or continuous. This kind of analysis is quite helpful when working on linear regression projects in Python. For example, it helps in forecasting future values and trends, and it can estimate the effect of changing one input while the others stay fixed.

Simple Linear Regression – Model Assumptions

The linear regression model rests on the following assumptions:

Linear relationship

The relationship between the feature variables and the response should be linear. A scatter plot of the response against each feature variable can be used to assess whether this assumption holds.

Multivariate normality

This assumption is often stated as requiring all variables to be multivariate normal, meaning that any linear combination of the variables is itself normally distributed. In practice, what matters for linear regression is that the residuals are approximately normally distributed; neither the response nor the predictors need to be normal on their own.

No or little multicollinearity

The model assumes little or no multicollinearity. Multicollinearity occurs when the features (or independent variables) are highly correlated with one another.

No autocorrelation

The data should also exhibit little or no autocorrelation. Autocorrelation arises when the residual errors are not statistically independent of one another.

Homoscedasticity

Homoscedasticity means that the variance of the error term (or model noise) is the same for all values of the independent variables. In other words, the residuals should have roughly the same spread all along the regression line. A scatter plot of the residuals against the fitted values allows for verification.
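
Several of these assumptions are checked by looking at the residuals. The sketch below fits a line to made-up data, computes the residuals, and evaluates the Durbin-Watson statistic, a standard check for first-order autocorrelation that the article does not mention by name. Values near 2 suggest no autocorrelation, while values near 0 or 4 suggest positive or negative autocorrelation. The helper names are illustrative, and five points are far too few for a real diagnosis.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

def durbin_watson(errors):
    """Sum of squared successive differences divided by sum of squares."""
    num = sum((errors[i] - errors[i - 1]) ** 2 for i in range(1, len(errors)))
    den = sum(e ** 2 for e in errors)
    return num / den

# Made-up observations, listed in the order they were collected.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
intercept, slope = fit_line(xs, ys)

# Residual = observed value minus fitted value.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
dw = durbin_watson(residuals)
print("residuals:", [round(e, 2) for e in residuals])
print("Durbin-Watson:", round(dw, 2))
```

Plotting the same residuals against the fitted values is the usual visual check for homoscedasticity: the spread should look roughly constant from left to right.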

Practical applications of linear regression:

1. Medical research:

Medical researchers frequently use linear regression to understand the relationship between drug dosage and patients’ blood pressure. They can administer different dosages of a certain drug to patients and monitor their blood pressure response. They can use a simple linear regression model that uses blood pressure as the response variable and dosage as the predictor variable. The equation of the regression model would be:

blood pressure = b0 + b1(dosage)

The coefficient b0 represents the anticipated blood pressure when the dosage is zero.

The coefficient b1 represents the average change in blood pressure when the dosage is raised by one unit.

2. Businesses:

Commonly, businesses use linear regression to know the relationship between their revenue and advertising expenditure. They can use a simple linear regression model that considers revenue as the response variable and advertising expenditure as the predictor variable. The corresponding linear regression projects in python use this equation of the regression model:

revenue = b0 + b1(ad expenditure)

The coefficient b0 represents the overall estimated revenue when ad expenditure is zero.

The coefficient b1 represents the average change in total revenue when the ad expenditure is raised by one unit.

  • A negative value of b1 indicates that more ad expenditure is associated with less revenue.
  • A positive value of b1 indicates that more ad expenditure is associated with more revenue.
  • If b1 is around zero, the ad expenditure doesn’t significantly influence the revenue.

3. Agriculture:

Agricultural scientists widely use linear regression to determine the impact of water and fertilizer on crop yield. For example, they can apply various amounts of water and fertilizer to different fields and observe how the crops are affected. They can use a multiple linear regression model that considers crop yield as the response variable, and water and fertilizer as the predictor variables. The regression model equation would be:

crop yield = b0 + b1(quantity of fertilizer) + b2(quantity of water)

The coefficient b0 represents the expected crop yield with no water or fertilizer.

The coefficient b1 represents the average change in crop yield when the quantity of fertilizer is increased by one unit. It is assumed that the water’s quantity stays the same.

The coefficient b2 represents the average change in crop yield due to an increase in water quantity by one unit. It is assumed that the fertilizer’s quantity stays the same.

Agricultural scientists can adjust the amounts of water and fertilizer based on the values of b1 and b2. The purpose is to maximize crop production.

It is one of those linear regression projects that can benefit a huge number of people.
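
To make the crop yield model concrete, here is a minimal sketch that estimates b0, b1 and b2 by solving the least-squares normal equations in plain Python. The field data is made up and deliberately noise-free (generated from yield = 2 + 0.5 × fertilizer + 0.3 × water), so the fit should recover those coefficients; real measurements would only approximate them. The helper names solve and fit_multiple are illustrative, and in practice a numerical library would normally do this step.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    # Augmented matrix [a | b].
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_multiple(rows, ys):
    """Return [b0, b1, ..., bk] minimizing the sum of squared errors."""
    # Prepend a constant 1 to each row so that b0 acts as the intercept.
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical fields: (quantity of fertilizer, quantity of water).
fields = [(0, 0), (10, 0), (0, 10), (10, 10), (20, 5), (5, 20)]
yields = [2 + 0.5 * f + 0.3 * w for f, w in fields]

b0, b1, b2 = fit_multiple(fields, yields)
print(f"b0={b0:.2f}, b1={b1:.2f}, b2={b2:.2f}")
```

Here b1 is the change in yield per extra unit of fertilizer with water held fixed, and b2 is the change per extra unit of water with fertilizer held fixed, matching the interpretation above.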

4. Data science:

Data scientists can be valuable to professional sports teams. They use linear regression to understand how various training regimens influence players’ performance. For example, data scientists can analyze how the number of workout sessions and yoga sessions influences player scores. They can use a multiple linear regression model. This model considers the total points scored (player’s score) as the response variable, and the workout sessions and yoga sessions as the predictor variables.

The regression model equation would be:

Total points scored = b0 + b1(yoga sessions) + b2(workout sessions)

The coefficient b0 represents the estimated points scored by a player who doesn’t participate in workout sessions and yoga sessions.

The coefficient b1 shows the average change in the score when the yoga sessions’ frequency is increased by one unit. It is assumed that the workout sessions’ frequency stays the same.

The coefficient b2 shows the average change in the score when the workout sessions’ frequency is increased by one unit. It is assumed that the yoga sessions’ frequency stays the same.

Data scientists can use the estimated values of b1 and b2 to recommend how a player should combine yoga and workout sessions to maximize their score.

Preparing the data is often the most challenging part of a linear regression project, so the following section explains how to prepare data for your linear regression model.

How to prepare data for linear regression?

You can implement the following steps when working on your linear regression projects with datasets.

1) Discard outliers:

Least-squares regression is sensitive to extreme values, so a few outliers can pull the fitted line away from the bulk of the data. Hence, it is important to identify outliers and discard those that would distort the results.
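
One common way to flag outliers is the 1.5 × IQR rule, sketched below for a single column of made-up readings. The rule and the multiplier are conventions assumed for illustration, not something the article prescribes, and flagged values are worth inspecting before they are dropped.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k interquartile ranges beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up readings with one obviously extreme value.
readings = [10, 12, 11, 13, 12, 11, 95]
low, high = iqr_bounds(readings)

# Keep only the readings that fall inside the fences.
kept = [v for v in readings if low <= v <= high]
print("fences:", (low, high))
print("kept:", kept)
```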

2) Discard collinearity:

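Collinearity denotes strong correlation between independent variables. It makes the estimated coefficients unstable and hard to interpret, which can lead to inconsistent results, so consider dropping or combining highly correlated variables.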
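
A quick first check for collinearity is the pairwise Pearson correlation between predictors, sketched below with made-up columns. The variable names and the idea of treating values close to ±1 as a warning sign are assumptions for illustration; there is no universal cut-off.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical predictors: water closely tracks fertilizer, rainfall does not.
fertilizer = [1, 2, 3, 4, 5]
water = [2, 4, 6, 8, 10.5]
rainfall = [5, 1, 4, 2, 3]

r_fw = pearson(fertilizer, water)
r_fr = pearson(fertilizer, rainfall)
print(f"fertilizer vs water:    {r_fw:.3f}")
print(f"fertilizer vs rainfall: {r_fr:.3f}")
```

In this toy data, fertilizer and water are almost perfectly correlated, so a model would struggle to separate their effects, whereas fertilizer and rainfall are only weakly related.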

3) Normalize the data:

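Linear regression tends to behave better when the variables, and especially the residuals, are roughly normally distributed. Transforming heavily skewed variables (for example, with a log transform) can help.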

4) Standardize the data:

Standardization is accomplished by subtracting a measure of location (for example, the mean) and dividing by a measure of scale (for example, the standard deviation). This step is quite important when two features have very different ranges.
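
The standardization step can be sketched in a few lines. This version uses the mean and the population standard deviation; using the sample standard deviation instead is an equally reasonable choice, and the data is made up.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to have mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Made-up feature values with mean 5 and standard deviation 2.
feature = [2, 4, 4, 4, 5, 5, 7, 9]
z_scores = standardize(feature)
print(z_scores)
```

After this step, every feature is expressed in units of its own standard deviation, so features measured on very different scales become directly comparable.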

5) Impute missing data:

If some data points have missing values, you can fill them in (impute them), for example with the mean of the observed values. This step is less critical for big data sets, where you can often afford to drop incomplete rows instead.
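
As a minimal sketch of filling in missing values, the snippet below replaces each missing entry (represented here as None, an assumption about how the data is stored) with the mean of the observed entries. Mean imputation is only one simple option among several.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Made-up column with two missing measurements.
column = [4.0, None, 6.0, 8.0, None]
filled = impute_mean(column)
print(filled)
```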

Now that we’ve discussed the basic concepts of linear regression, we can move on to our linear regression project ideas.

Our Top Linear Regression Project Ideas

Idea #1: Budget a Long Drive

Suppose you want to go on a long drive (from Delhi to Lonavala). Before going on a trip this long, it’s best to prepare a budget and figure out how much you need to spend on each part of it. You can use a linear regression model here to estimate how much you’ll have to spend on fuel.

In this linear regression, the total amount of money you’d have to pay would be the dependent variable, which means it would be the output of our model. The distance between the destinations would be the independent variable. To keep the model simple, we can assume that the price of fuel would remain constant during the trip. 

You can choose any two destinations for this project. It’s a great project idea for beginners because it allows you to experiment and understand the concept clearly. Plus, you can use the model whenever you plan a long drive too!

Idea #2: Compare Unemployment Rates with Gains in Stock Market 

If you’re an economics enthusiast, or if you want to use your knowledge of Machine Learning in this field, then this is one of the best linear regression project ideas for you. We all know how unemployment is a significant problem for our country. In this project, we’d find the relation between the unemployment rates and the gains happening in the stock market. 

You can use official data from the government to get the unemployment rates and use it to find out if there’s a relationship between it and the gains in the stock market. 

Idea #3: Compare Salaries of Batsmen with The Average Runs They Score per Game

Cricket is easily the most popular game in India. You can use your knowledge of machine learning in this simple yet exciting project where you’ll plot the relationship between the salaries of batsmen and the average runs they score in every game. Our cricketers are among some of the highest-earning athletes in the world. Working on this project would help you find out how much their batting averages are responsible for their earnings. 

If you’re a beginner, you can start with one team and check the salaries of its batsmen. On the other hand, if you want to take it a step further, you can consider multiple teams (Australia, England, South Africa, etc.) and check the salaries of their batsmen too. 

Idea #4: Compare the Dates in a Month with the Monthly Salary  

This project explores the application of machine learning in human resources and management. It is among the beginner-level linear regression projects, so if you haven’t worked on such a project before, then you can start with this one. Here, you’ll take the dates in a month and compare them with the monthly salary.

After you’ve established the relationship between the two variables, you can explore whether the current wage is optimal or not. You can choose any career and use its average salary as the salary variable. You can make this project more challenging by including many other jobs apart from the original one.

Idea #5: Compare Average Global Temperatures and Levels of Pollution 

Pollution and its impact on the environment is a prominent topic of discussion. The recent pandemic has also shown us how we can still save our environment. You can use your machine learning skills in this field too. This project would help you in understanding how machine learning can solve the various problems present in this domain as well.

Here, you’d take the average global temperatures in several years and compare them with the level of pollution that happened in that duration. Creating a linear regression model on this topic is easy and wouldn’t take a lot of effort. However, it’ll surely help you in trying out your machine learning skills.

Idea #6: Compare Local Temperature with the Amount of Rain

This is another exciting project idea for lovers of nature and the environment. In this project, you have to find the relationship between the local temperature and the amount of rain taking place there. After completing this project, you’d see how you can use linear regression and other machine learning techniques in Geography and related subjects.

You should keep the temperature in Celsius and the amount of rain in mm (millimetres). For starters, you can consider a few prominent cities of the country (such as New Delhi, Mumbai, Pune, Jaipur) and add more as you complete the project. 

Idea #7: Compare the Average Lifespan of Humans with the Amount of Their Sleep

Sleep has always fascinated scientists. If you’re fascinated by this topic too, then you should work on this one. In this project, you have to compare the average lifespan of people with the amount of sleep they get.

If you want to enter the field of biotechnology or neuroscience with expertise in machine learning, then this is an excellent choice for you. It’d help you explore the applications of linear regression in these sectors. There are many research papers on this topic, so you won’t have trouble finding relevant data sources. 

Idea #8: Compare the Percentage of Sediments in a River with Its Discharge

This is another exciting project idea for enthusiasts of the environment and geography. Here, you have to compare the percentage of sediments present in water with the level of its discharge. You can start with one river and make it more challenging by adding more streams. Similarly, you can start with a small stream (or a section of a giant river), if you haven’t worked on linear regression projects before. 

A river’s discharge is the volume of water flowing through its channel. More precisely, it is the total volume of water flowing past a certain point per unit of time, and it is measured in cubic meters per second. Sediments are the solid materials present in a stream that the river moves and deposits in a new location.

Idea #9: Compare Budgets of National Film Awards-nominated Movies with the Number of Movies Winning These Awards

You can apply linear regression in the entertainment sector too. In this project, you have to compare the budgets of the movies nominated for the National Film Awards with the number of films that won these awards. You would find out whether the budget of a film affects its probability of winning an award. You can start with data from a few recent years (for example, 2014-19). And if you want to take it a level further, then you can add data from more years and make the project more challenging.

Idea #10: Linear Regression Project Idea for Stock Price Prediction

This project is concerned with developing a tool that helps investors make sound decisions by predicting stock prices with linear regression. The objective is to build a predictive model from historical stock data, relevant economic indicators, and market trends.

The goal is to look for trends and connections in the data that can make stock price prediction possible. It involves selecting and refining features such as trading volumes, market indices, and other factors that affect stock performance. Through regression analysis, you’ll see the impact of those variables on stock prices and create a model that can be refitted as market conditions change. The tool’s output will be a predicted price together with an analysis of the factors influencing price fluctuations.

Investors can gain from this predictive tool because it helps them anticipate possible market shifts, make better-informed decisions about their investments, and navigate the intricacies of the stock market more confidently. Overall, the project aims to make stock trading more strategic and data-driven.

Idea #11: Linear Regression Project Idea for Movie Rating

Next, we have movie rating prediction as a linear regression project in Python. This project uses multiple regression to predict movie ratings and to show the diverse elements that influence audience preferences. You’ll develop a model that relates a range of parameters (genre, cast, director, release date, and promotional strategy) to the movie ratings.

In the process, you gather and organize a large amount of movie data, extract the essential patterns, and run a multiple regression analysis to quantify how much each variable affects audience ratings. The resulting model aims to go beyond simple pairwise correlations and capture how these often-intertwined factors jointly shape people’s views.

Film producers and studios can use the results of this project to inform creative and marketing decisions. The tool can help forecast audience responses, advise filmmakers on creating content that resonates with their intended audience, and focus promotional efforts. The project is intended to give the film industry actionable knowledge that supports making engaging, appealing movies.

Idea #12: Linear Regression Project to Build a Song Popularity Predictor

This project aims to build a Song Popularity Predictor that uses regression to estimate how popular a song might be. By looking at several musical elements such as genre, artist popularity, tempo, and lyrical content, the aim is to develop a model that captures the factors affecting a song’s popularity.

The process involves gathering a diverse set of songs and examining the relationships between their features using regression analysis. With such an investigation, you can estimate how strongly each variable influences a song’s popularity and build a tool that goes deeper than surface-level measurements.

Artists, producers, and music platforms can learn from this project about the changing terrain of musical tastes. The Song Popularity Predictor is intended to aid decision-making, guiding the development and promotion of music that matches consumer preferences as they change.

This project aims to give music industry stakeholders a tool to create and promote their music successfully by providing a multidimensional analysis of the elements contributing to a song’s success. The linear regression project examples discussed above should give you a good overview of how to select an idea of your own.

Final Thoughts

We’ve reached the end of our project list. We hope you found these linear regression project ideas helpful. If you have any questions regarding linear regression or these project ideas, feel free to ask us. 

On the other hand, if you want to learn more about linear regression, then we recommend heading to our blog, where you’d find many valuable resources, guides, and articles on this topic. For starters, here’s our guide on linear regression in machine learning.

Profile

Pavan Vadapalli

Frequently Asked Questions (FAQs)

Logistic regression goes one step further by fitting the line values to the sigmoid curve, while linear regression aims to determine the best-fitting line. Linear regression uses the mean squared error as its loss function, while logistic regression uses maximum likelihood estimation.

Linear and logistic regression are the two most used types of regression analysis. The nature of the data will ultimately dictate the sort of regression analysis model we choose.

Both linear and non-linear forms of a logistic model exist. A linear model is one in which the predictor function is linear. A non-linear model is one that employs a prediction function that does not follow a straight line. A link function connects the prediction function to the expected value of the response.

In the realm of regression analysis, linear regression (also known as ordinary least squares (OLS) or linear least squares) is the real workhorse. Linear regression shows how a one-unit change in each independent variable is associated with a change in the dependent variable.

Linear regression is a statistical technique that establishes the link between variables, and business projections and decisions can benefit from it. It may be used in economics, corporate strategy, marketing, medicine, and more.

Some users mistakenly believe that linear regression’s normality assumption applies to their raw data. They might plot a histogram of their response variable to see whether it departs from a normal distribution. Others believe the explanatory variables must be normally distributed. Neither is necessary. The normality assumption applies to the distribution of the residuals: the residuals are assumed to be normally distributed, and the regression line is fitted to the data so that the mean of the residuals is zero.

Regression Analysis – Methods, Types and Examples

Regression Analysis

Regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables (or ‘predictors’).

Regression Analysis Methodology

Here is a general methodology for performing regression analysis:

  • Define the research question: Clearly state the research question or hypothesis you want to investigate. Identify the dependent variable (also called the response variable or outcome variable) and the independent variables (also called predictor variables or explanatory variables) that you believe are related to the dependent variable.
  • Collect data: Gather the data for the dependent variable and independent variables. Ensure that the data is relevant, accurate, and representative of the population or phenomenon you are studying.
  • Explore the data: Perform exploratory data analysis to understand the characteristics of the data, identify any missing values or outliers, and assess the relationships between variables through scatter plots, histograms, or summary statistics.
  • Choose the regression model: Select an appropriate regression model based on the nature of the variables and the research question. Common regression models include linear regression, multiple regression, logistic regression, polynomial regression, and time series regression, among others.
  • Assess assumptions: Check the assumptions of the regression model. Some common assumptions include linearity (the relationship between variables is linear), independence of errors, homoscedasticity (constant variance of errors), and normality of errors. Violation of these assumptions may require additional steps or alternative models.
  • Estimate the model: Use a suitable method to estimate the parameters of the regression model. The most common method is ordinary least squares (OLS), which minimizes the sum of squared differences between the observed and predicted values of the dependent variable.
  • Interpret the results: Analyze the estimated coefficients, p-values, confidence intervals, and goodness-of-fit measures (e.g., R-squared) to interpret the results. Determine the significance and direction of the relationships between the independent variables and the dependent variable.
  • Evaluate model performance: Assess the overall performance of the regression model using appropriate measures, such as R-squared, adjusted R-squared, and root mean squared error (RMSE). These measures indicate how well the model fits the data and how much of the variation in the dependent variable is explained by the independent variables.
  • Test assumptions and diagnose problems: Check the residuals (the differences between observed and predicted values) for any patterns or deviations from assumptions. Conduct diagnostic tests, such as examining residual plots, testing for multicollinearity among independent variables, and assessing heteroscedasticity or autocorrelation, if applicable.
  • Make predictions and draw conclusions: Once you have a satisfactory model, use it to make predictions on new or unseen data. Draw conclusions based on the results of the analysis, considering the limitations and potential implications of the findings.
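
The model evaluation step in the list above can be sketched as follows. The observed and predicted values are made up; R-squared is computed as one minus the ratio of the residual sum of squares to the total sum of squares, and RMSE as the square root of the mean squared residual.

```python
from math import sqrt
from statistics import mean

def r_squared(observed, predicted):
    """Share of the variation in the observed values explained by the model."""
    y_bar = mean(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - y_bar) ** 2 for y in observed)
    return 1 - ss_res / ss_tot

def rmse(observed, predicted):
    """Root mean squared error, in the same units as the dependent variable."""
    return sqrt(mean((y - p) ** 2 for y, p in zip(observed, predicted)))

# Made-up observed values and model predictions.
observed = [3, 5, 7, 9]
predicted = [2.5, 5.5, 7, 9.5]

r2 = r_squared(observed, predicted)
error = rmse(observed, predicted)
print(f"R-squared={r2:.4f}, RMSE={error:.4f}")
```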

Types of Regression Analysis

Types of Regression Analysis are as follows:

Linear Regression

Linear regression is the most basic and widely used form of regression analysis. It models the linear relationship between a dependent variable and one or more independent variables. The goal is to find the best-fitting line that minimizes the sum of squared differences between observed and predicted values.

Multiple Regression

Multiple regression extends linear regression by incorporating two or more independent variables to predict the dependent variable. It allows for examining the simultaneous effects of multiple predictors on the outcome variable.

Polynomial Regression

Polynomial regression models non-linear relationships between variables by adding polynomial terms (e.g., squared or cubic terms) to the regression equation. It can capture curved or nonlinear patterns in the data.

Logistic Regression

Logistic regression is used when the dependent variable is binary or categorical. It models the probability of the occurrence of a certain event or outcome based on the independent variables. Logistic regression estimates the coefficients using the logistic function, which transforms the linear combination of predictors into a probability.

Ridge Regression and Lasso Regression

Ridge regression and Lasso regression are techniques used for addressing multicollinearity (high correlation between independent variables) and variable selection. Both methods introduce a penalty term to the regression equation to shrink or eliminate less important variables. Ridge regression uses L2 regularization, while Lasso regression uses L1 regularization.
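
To illustrate the difference between shrinking and eliminating a coefficient, the sketch below uses the special case of a single centered predictor, where both penalized slopes have a closed form. The penalty is written here as lam × slope² for ridge and lam × |slope| for lasso, added to the sum of squared errors; other scalings of the penalty are common, so treat the formulas as one convention rather than the definition. The data is made up.

```python
from statistics import mean

def centered_sums(xs, ys):
    """Return (Sxy, Sxx), the centered cross-product and sum of squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy, sxx

def ridge_slope(xs, ys, lam):
    """Slope minimizing squared error + lam * slope**2 (L2 penalty)."""
    sxy, sxx = centered_sums(xs, ys)
    return sxy / (sxx + lam)

def lasso_slope(xs, ys, lam):
    """Slope minimizing squared error + lam * abs(slope) (L1 penalty)."""
    sxy, sxx = centered_sums(xs, ys)
    # Soft-thresholding: pull Sxy toward zero by lam / 2, stopping at zero.
    shrunk = max(abs(sxy) - lam / 2, 0.0)
    return (shrunk if sxy >= 0 else -shrunk) / sxx

xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

for lam in (0, 10, 50):
    print(lam, round(ridge_slope(xs, ys, lam), 3), round(lasso_slope(xs, ys, lam), 3))
```

With a large enough penalty the lasso slope becomes exactly zero, which is how lasso performs variable selection, while the ridge slope only gets smaller and never reaches zero.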

Time Series Regression

Time series regression analyzes the relationship between a dependent variable and independent variables when the data is collected over time. It accounts for autocorrelation and trends in the data and is used in forecasting and studying temporal relationships.

Nonlinear Regression

Nonlinear regression models are used when the relationship between the dependent variable and independent variables is not linear. These models can take various functional forms and require estimation techniques different from those used in linear regression.

Poisson Regression

Poisson regression is employed when the dependent variable represents count data. It models the relationship between the independent variables and the expected count, assuming a Poisson distribution for the dependent variable.

Generalized Linear Models (GLM)

GLMs are a flexible class of regression models that extend the linear regression framework to handle different types of dependent variables, including binary, count, and continuous variables. GLMs incorporate various probability distributions and link functions.

Regression Analysis Formulas

Regression analysis involves estimating the parameters of a regression model to describe the relationship between the dependent variable (Y) and one or more independent variables (X). Here are the basic formulas for linear regression, multiple regression, and logistic regression:

Linear Regression:

Simple Linear Regression Model: Y = β0 + β1X + ε

Multiple Linear Regression Model: Y = β0 + β1X1 + β2X2 + … + βnXn + ε

In both formulas:

  • Y represents the dependent variable (response variable).
  • X represents the independent variable(s) (predictor variable(s)).
  • β0, β1, β2, …, βn are the regression coefficients or parameters that need to be estimated.
  • ε represents the error term (the part of Y that the model does not explain). Its estimated counterpart, the residual, is the difference between the observed and predicted values.
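As a rough illustration of how β0 and β1 can be estimated for the simple linear model, the sketch below uses the standard least-squares closed form (slope = covariance of X and Y divided by the variance of X). The function name and the data are hypothetical and only serve the example.

```python
def fit_simple_ols(x, y):
    """Least-squares estimates (b0, b1) for Y = b0 + b1*X + e."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of X and Y divided by the variation of X.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: forces the fitted line through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical observations, made up for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

b0, b1 = fit_simple_ols(x, y)
# Residuals: observed value minus the value predicted by the fitted line.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

When the model includes an intercept, the least-squares residuals sum to zero (up to floating-point error), which is a quick sanity check on any fit.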

Multiple Regression:

Multiple regression extends the concept of simple linear regression by including multiple independent variables.

Multiple Regression Model: Y = β0 + β1X1 + β2X2 + … + βnXn + ε

The formulas are similar to those in linear regression, with the addition of more independent variables.
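For more than one independent variable, the coefficients can be estimated by solving the normal equations (XᵀX)β = Xᵀy. The following is a minimal, dependency-free sketch under that assumption. The helper names are hypothetical, and in practice a numerical library would normally do this work.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def fit_multiple_ols(rows, y):
    """Return [b0, b1, ..., bn] that minimise the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is for the intercept b0
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data generated from Y = 1 + 2*X1 + 3*X2 with no noise.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coefs = fit_multiple_ols(rows, y)
print([round(c, 6) for c in coefs])
```

Because the data here are generated without noise, the fit should recover the generating coefficients up to rounding. With real data the estimates would only approximate the underlying relationship.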

Logistic Regression:

Logistic regression is used when the dependent variable is binary or categorical. The logistic regression model applies a logistic or sigmoid function to the linear combination of the independent variables.

Logistic Regression Model: p = 1 / (1 + e^-(β0 + β1X1 + β2X2 + … + βnXn))

In the formula:

  • p represents the probability of the event occurring (e.g., the probability of success or belonging to a certain category).
  • X1, X2, …, Xn represent the independent variables.
  • e is the base of the natural logarithm.

The logistic function ensures that the predicted probabilities lie between 0 and 1, allowing for binary classification.
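The logistic formula above can be evaluated directly. The sketch below assumes the coefficients have already been estimated; the values used are hypothetical, as is the 0.5 cut-off used to turn a probability into a class label.

```python
import math

def predict_probability(coefs, xs):
    """p = 1 / (1 + e^-(b0 + b1*x1 + ... + bn*xn))."""
    b0, *slopes = coefs
    z = b0 + sum(b * x for b, x in zip(slopes, xs))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients [b0, b1, b2]; not estimated from real data.
coefs = [-3.0, 0.8, 1.5]

p_low = predict_probability(coefs, [1.0, 0.0])   # linear part z = -2.2
p_high = predict_probability(coefs, [3.0, 1.0])  # linear part z = 0.9

# A common (but not mandatory) choice: classify as 1 when p >= 0.5.
label = 1 if p_high >= 0.5 else 0
print(p_low, p_high, label)
```

A linear part of exactly zero maps to a probability of 0.5, negative values map below 0.5, and positive values map above it.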

Regression Analysis Examples

Regression Analysis Examples are as follows:

  • Stock Market Prediction: Regression analysis can be used to predict stock prices based on various factors such as historical prices, trading volume, news sentiment, and economic indicators. Traders and investors can use this analysis to make informed decisions about buying or selling stocks.
  • Demand Forecasting: In retail and e-commerce, real-time regression analysis can help forecast demand for products. By analyzing historical sales data along with real-time data such as website traffic, promotional activities, and market trends, businesses can adjust their inventory levels and production schedules to meet customer demand more effectively.
  • Energy Load Forecasting: Utility companies often use real-time regression analysis to forecast electricity demand. By analyzing historical energy consumption data, weather conditions, and other relevant factors, they can predict future energy loads. This information helps them optimize power generation and distribution, ensuring a stable and efficient energy supply.
  • Online Advertising Performance: Regression analysis can be used to assess the performance of online advertising campaigns. By analyzing real-time data on ad impressions, click-through rates, conversion rates, and other metrics, advertisers can adjust their targeting, messaging, and ad placement strategies to maximize their return on investment.
  • Predictive Maintenance: Regression analysis can be applied to predict equipment failures or maintenance needs. By continuously monitoring sensor data from machines or vehicles, regression models can identify patterns or anomalies that indicate potential failures. This enables proactive maintenance, reducing downtime and optimizing maintenance schedules.
  • Financial Risk Assessment: Real-time regression analysis can help financial institutions assess the risk associated with lending or investment decisions. By analyzing real-time data on factors such as borrower financials, market conditions, and macroeconomic indicators, regression models can estimate the likelihood of default or assess the risk-return tradeoff for investment portfolios.

Importance of Regression Analysis

Importance of Regression Analysis is as follows:

  • Relationship Identification: Regression analysis helps in identifying and quantifying the relationship between a dependent variable and one or more independent variables. It allows us to determine how changes in independent variables impact the dependent variable. This information is crucial for decision-making, planning, and forecasting.
  • Prediction and Forecasting: Regression analysis enables us to make predictions and forecasts based on the relationships identified. By estimating the values of the dependent variable using known values of independent variables, regression models can provide valuable insights into future outcomes. This is particularly useful in business, economics, finance, and other fields where forecasting is vital for planning and strategy development.
  • Causality Assessment: While correlation does not imply causation, regression analysis provides a framework for assessing causality by considering the direction and strength of the relationship between variables. It allows researchers to control for other factors and assess the impact of a specific independent variable on the dependent variable. This helps in determining the causal effect and identifying significant factors that influence outcomes.
  • Model Building and Variable Selection: Regression analysis aids in model building by determining the most appropriate functional form of the relationship between variables. It helps researchers select relevant independent variables and eliminate irrelevant ones, reducing complexity and improving model accuracy. This process is crucial for creating robust and interpretable models.
  • Hypothesis Testing: Regression analysis provides a statistical framework for hypothesis testing. Researchers can test the significance of individual coefficients, assess the overall model fit, and determine if the relationship between variables is statistically significant. This allows for rigorous analysis and validation of research hypotheses.
  • Policy Evaluation and Decision-Making: Regression analysis plays a vital role in policy evaluation and decision-making processes. By analyzing historical data, researchers can evaluate the effectiveness of policy interventions and identify the key factors contributing to certain outcomes. This information helps policymakers make informed decisions, allocate resources effectively, and optimize policy implementation.
  • Risk Assessment and Control: Regression analysis can be used for risk assessment and control purposes. By analyzing historical data, organizations can identify risk factors and develop models that predict the likelihood of certain outcomes, such as defaults, accidents, or failures. This enables proactive risk management, allowing organizations to take preventive measures and mitigate potential risks.

When to Use Regression Analysis

  • Prediction: Regression analysis is often employed to predict the value of the dependent variable based on the values of independent variables. For example, you might use regression to predict sales based on advertising expenditure, or to predict a student’s academic performance based on variables like study time, attendance, and previous grades.
  • Relationship analysis: Regression can help determine the strength and direction of the relationship between variables. It can be used to examine whether there is a linear association between variables, identify which independent variables have a significant impact on the dependent variable, and quantify the magnitude of those effects.
  • Causal inference: Regression analysis can be used to explore cause-and-effect relationships by controlling for other variables. For example, in a medical study, you might use regression to determine the impact of a specific treatment while accounting for other factors like age, gender, and lifestyle.
  • Forecasting: Regression models can be utilized to forecast future trends or outcomes. By fitting a regression model to historical data, you can make predictions about future values of the dependent variable based on changes in the independent variables.
  • Model evaluation: Regression analysis can be used to evaluate the performance of a model or test the significance of variables. You can assess how well the model fits the data, determine if additional variables improve the model’s predictive power, or test the statistical significance of coefficients.
  • Data exploration: Regression analysis can help uncover patterns and insights in the data. By examining the relationships between variables, you can gain a deeper understanding of the data set and identify potential patterns, outliers, or influential observations.
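One common way to quantify "how well the model fits the data", as mentioned under model evaluation, is the coefficient of determination (R²). The sketch below is one possible implementation; the observed and predicted values are invented for illustration.

```python
def r_squared(observed, predicted):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed values and the predictions of some fitted model.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.8, 5.3, 6.9, 9.2]

r2 = r_squared(observed, predicted)
print(f"R^2 = {r2:.3f}")
```

An R² close to 1 means the predictions account for most of the variation in the observed values. It does not by itself show that the model is appropriate, so it is usually read alongside residual plots and significance tests.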

Applications of Regression Analysis

Here are some common applications of regression analysis:

  • Economic Forecasting: Regression analysis is frequently employed in economics to forecast variables such as GDP growth, inflation rates, or stock market performance. By analyzing historical data and identifying the underlying relationships, economists can make predictions about future economic conditions.
  • Financial Analysis: Regression analysis plays a crucial role in financial analysis, such as predicting stock prices or evaluating the impact of financial factors on company performance. It helps analysts understand how variables like interest rates, company earnings, or market indices influence financial outcomes.
  • Marketing Research: Regression analysis helps marketers understand consumer behavior and make data-driven decisions. It can be used to predict sales based on advertising expenditures, pricing strategies, or demographic variables. Regression models provide insights into which marketing efforts are most effective and help optimize marketing campaigns.
  • Health Sciences: Regression analysis is extensively used in medical research and public health studies. It helps examine the relationship between risk factors and health outcomes, such as the impact of smoking on lung cancer or the relationship between diet and heart disease. Regression analysis also helps in predicting health outcomes based on various factors like age, genetic markers, or lifestyle choices.
  • Social Sciences: Regression analysis is widely used in social sciences like sociology, psychology, and education research. Researchers can investigate the impact of variables like income, education level, or social factors on various outcomes such as crime rates, academic performance, or job satisfaction.
  • Operations Research: Regression analysis is applied in operations research to optimize processes and improve efficiency. For example, it can be used to predict demand based on historical sales data, determine the factors influencing production output, or optimize supply chain logistics.
  • Environmental Studies: Regression analysis helps in understanding and predicting environmental phenomena. It can be used to analyze the impact of factors like temperature, pollution levels, or land use patterns on phenomena such as species diversity, water quality, or climate change.
  • Sports Analytics: Regression analysis is increasingly used in sports analytics to gain insights into player performance, team strategies, and game outcomes. It helps analyze the relationship between various factors like player statistics, coaching strategies, or environmental conditions and their impact on game outcomes.

Advantages and Disadvantages of Regression Analysis

About the author.

Muhammad Hassan

Researcher, Academic Writer, Web developer

The complete guide to regression analysis.

What is regression analysis and why is it useful? While most of us have heard the term, understanding regression analysis in detail may be something you need to brush up on. Here’s what you need to know about this popular method of analysis.

When you rely on data to drive and guide business decisions, as well as predict market trends, just gathering and analyzing what you find isn’t enough — you need to ensure it’s relevant and valuable.

The challenge, however, is that so many variables can influence business data: market conditions, economic disruption, even the weather! As such, it’s essential you know which variables are affecting your data and forecasts, and what data you can discard.

And one of the most effective ways to determine data value and monitor trends (and the relationships between them) is to use regression analysis, a set of statistical methods used for the estimation of relationships between independent and dependent variables.

In this guide, we’ll cover the fundamentals of regression analysis, from what it is and how it works to its benefits and practical applications.

What is regression analysis?

Regression analysis is a statistical method. It’s used for analyzing different factors that might influence an objective – such as the success of a product launch, business growth, a new marketing campaign – and determining which factors are important and which ones can be ignored.

Regression analysis can also help leaders understand how different variables impact each other and what the outcomes are. For example, when forecasting financial performance, regression analysis can help leaders determine how changes in the business can influence revenue or expenses in the future.

Running an analysis of this kind, you might find that there’s a high correlation between the number of marketers employed by the company, the leads generated, and the opportunities closed.

This seems to suggest that a high number of marketers and a high number of leads generated both influence sales success. But do you need both factors to close those sales? By analyzing the effects of these variables on your outcome, you might learn that when leads increase but the number of marketers employed stays constant, there is no impact on the number of opportunities closed, but if the number of marketers increases, leads and closed opportunities both rise.

Regression analysis can help you tease out these complex relationships so you can determine which areas you need to focus on in order to get your desired results, and avoid wasting time with those that have little or no impact. In this example, that might mean hiring more marketers rather than trying to increase leads generated.

How does regression analysis work?

Regression analysis starts with variables that are categorized into two types: dependent and independent variables. The variables you select depend on the outcomes you’re analyzing.

Understanding variables:

1. Dependent variable

This is the main variable that you want to analyze and predict. Examples include operational (O) data such as your quarterly or annual sales, or experience (X) data such as your net promoter score (NPS) or customer satisfaction score (CSAT).

These variables are also called response variables, outcome variables, or left-hand-side variables (because they appear on the left-hand side of a regression equation).

There are three easy ways to identify them:

  • Is the variable measured as an outcome of the study?
  • Does the variable depend on another in the study?
  • Do you measure the variable only after other variables are altered?

2. Independent variable

Independent variables are the factors that could affect your dependent variables. For example, a price rise in the second quarter could make an impact on your sales figures.

You can identify independent variables with the following list of questions:

  • Is the variable manipulated, controlled, or used as a subject grouping method by the researcher?
  • Does this variable come before the other variable in time?
  • Are you trying to understand whether or how this variable affects another?

Independent variables are often referred to differently in regression depending on the purpose of the analysis. You might hear them called:

Explanatory variables

Explanatory variables are those which explain an event or an outcome in your study. For example, explaining why your sales dropped or increased.

Predictor variables

Predictor variables are used to predict the value of the dependent variable. For example, predicting how much sales will increase when new product features are rolled out.

Experimental variables

These are variables that can be manipulated or changed directly by researchers to assess the impact. For example, assessing how different product pricing ($10 vs $15 vs $20) will impact the likelihood to purchase.

Subject variables (also called fixed effects)

Subject variables can’t be changed directly, but vary across the sample. For example, age, gender, or income of consumers.

Unlike experimental variables, you can’t randomly assign or change subject variables, but you can design your regression analysis to determine the different outcomes of groups of participants with the same characteristics. For example, ‘how do price rises impact sales based on income?’

Carrying out regression analysis

So regression is about the relationships between dependent and independent variables. But how exactly do you do it?

Assuming you have your data collection done already, the first and foremost thing you need to do is plot your results on a graph. Doing this makes interpreting regression analysis results much easier as you can clearly see the correlations between dependent and independent variables.

Let’s say you want to carry out a regression analysis to understand the relationship between the number of ads placed and revenue generated.

On the Y-axis, you place the revenue generated. On the X-axis, the number of digital ads. By plotting the information on the graph, and drawing a line (called the regression line) through the middle of the data, you can see the relationship between the number of digital ads placed and revenue generated.

This regression line is the line that provides the best description of the relationship between your independent variables and your dependent variable. In this example, we’ve used a simple linear regression model.

Statistical analysis software can draw this line for you and precisely calculate the regression line. The software then provides a formula for the slope of the line, adding further context to the relationship between your dependent and independent variables.

Simple linear regression analysis

A simple linear model uses a single straight line to determine the relationship between a single independent variable and a dependent variable.

This regression model is mostly used when you want to determine the relationship between two variables (like price increases and sales) or the value of the dependent variable at certain points of the independent variable (for example the sales levels at a certain price rise).

While linear regression is useful, it does require you to make some assumptions.

For example, it requires you to assume that:

  • the data was collected using a statistically valid sample collection method that is representative of the target population
  • the observed relationship between the variables can’t be explained by a ‘hidden’ third variable – in other words, there are no spurious correlations
  • the relationship between the independent variable and dependent variable is linear – meaning that the best fit along the data points is a straight line and not a curved one

Multiple regression analysis

As the name suggests, multiple regression analysis is a type of regression that uses multiple variables. It uses multiple independent variables to predict the outcome of a single dependent variable. Of the various kinds of multiple regression, multiple linear regression is one of the best-known.

Multiple linear regression is a close relative of the simple linear regression model in that it looks at the impact of several independent variables on one dependent variable. However, like simple linear regression, multiple regression analysis also requires you to make some basic assumptions.

For example, you will be assuming that:

  • there is a linear relationship between the dependent and independent variables (it creates a straight line and not a curve through the data points)
  • the independent variables aren’t highly correlated in their own right
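One simple, partial check on the second assumption is to compute the pairwise Pearson correlation between independent variables before fitting. The sketch below borrows variable names from the share-price example in this section, but the values are invented, and the 0.8 warning threshold is only a rule of thumb assumed for illustration.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical predictor values, made up for illustration.
marketing_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
revenue_growth = [2.1, 3.9, 6.2, 8.0, 9.8]    # moves almost in step with spend
market_sentiment = [5.0, 1.0, 4.0, 2.0, 3.0]  # largely unrelated to spend

r_spend_growth = pearson_r(marketing_spend, revenue_growth)
r_spend_sentiment = pearson_r(marketing_spend, market_sentiment)

THRESHOLD = 0.8  # assumed rule of thumb, not a universal standard
for label, r in [("spend vs growth", r_spend_growth),
                 ("spend vs sentiment", r_spend_sentiment)]:
    flag = "highly correlated" if abs(r) > THRESHOLD else "ok"
    print(f"{label}: r = {r:.2f} ({flag})")
```

Pairwise correlation only catches problems between two variables at a time. A predictor can still be close to a combination of several others, so this check is a starting point rather than a complete diagnostic.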

An example of multiple linear regression would be an analysis of how marketing spend, revenue growth, and general market sentiment affect the share price of a company.

With multiple linear regression models you can estimate how these variables will influence the share price, and to what extent.

Multivariate linear regression

Multivariate linear regression involves more than one dependent variable as well as multiple independent variables, making it more complicated than linear or multiple linear regressions. However, this also makes it much more powerful and capable of making predictions about complex real-world situations.

For example, if an organization wants to establish or estimate how the COVID-19 pandemic has affected employees in its different markets, it can use multivariate linear regression, with the different geographical regions as dependent variables and the different facets of the pandemic as independent variables (such as mental health self-rating scores, proportion of employees working at home, lockdown durations and employee sick days).

Through multivariate linear regression, you can look at relationships between variables in a holistic way and quantify the relationships between them. As you can clearly visualize those relationships, you can make adjustments to dependent and independent variables to see which conditions influence them. Overall, multivariate linear regression provides a more realistic picture than looking at a single variable.

However, because multivariate techniques are complex, they involve high-level mathematics and require a statistical program to analyze the data.

Logistic regression

Logistic regression models the probability of a binary outcome based on independent variables.

So, what is a binary outcome? It’s when there are only two possible scenarios: either the event happens (1) or it doesn’t (0). Examples include yes/no outcomes, pass/fail outcomes, and so on. In other words, the outcome can be described as being in one of two categories.

Logistic regression makes predictions based on independent variables that are assumed or known to have an influence on the outcome. For example, the probability of a sports team winning their game might be affected by independent variables like weather, day of the week, whether they are playing at home or away and how they fared in previous matches.

What are some common mistakes with regression analysis?

Across the globe, businesses are increasingly relying on quality data and insights to drive decision-making — but to make accurate decisions, it’s important that the data collected and statistical methods used to analyze it are reliable and accurate.

Using the wrong data or the wrong assumptions can result in poor decision-making, lead to missed opportunities to improve efficiency and savings, and — ultimately — damage your business long term.

  • Assumptions

When running regression analysis, be it a simple linear or multiple regression, it’s really important to check that the assumptions your chosen method requires have been met. If your data points don’t conform to a straight line of best fit, for example, you need to apply additional statistical modifications to accommodate the non-linear data. For example, if you are looking at income data, which tends to scale logarithmically, you should take the natural log of income as your variable and then adjust the outcome after the model is created.

  • Correlation vs. causation

It’s a well-worn phrase that bears repeating – correlation does not equal causation. While variables that are linked by causality will always show correlation, the reverse is not always true. Moreover, there is no statistic that can determine causality (although the design of your study overall can).

If you observe a correlation in your results, such as in the first example we gave in this article where there was a correlation between leads and sales, you can’t assume that one thing has influenced the other. Instead, you should use it as a starting point for investigating the relationship between the variables in more depth.

  • Choosing the wrong variables to analyze

Before you use any kind of statistical method, it’s important to understand the subject you’re researching in detail. Doing so means you’re making informed choices of variables and you’re not overlooking something important that might have a significant bearing on your dependent variable.

  • Model building

The variables you include in your analysis are just as important as the variables you choose to exclude. That’s because the strength of each independent variable is influenced by the other variables in the model. Other techniques, such as Key Drivers Analysis, are able to account for these variable interdependencies.

Benefits of using regression analysis

There are several benefits to using regression analysis to judge how changing variables will affect your business and to ensure you focus on the right things when forecasting.

Here are just a few of those benefits:

Make accurate predictions

Regression analysis is commonly used when forecasting and forward planning for a business. For example, when predicting sales for the year ahead, a number of different variables will come into play to determine the eventual result.

Regression analysis can help you determine which of these variables are likely to have the biggest impact based on previous events and help you make more accurate forecasts and predictions.

Identify inefficiencies

Using a regression equation a business can identify areas for improvement when it comes to efficiency, either in terms of people, processes, or equipment.

For example, regression analysis can help a car manufacturer determine order numbers based on external factors like the economy or environment.

They can then use that regression equation to determine how many members of staff and how much equipment they need to meet orders.

Drive better decisions

Improving processes or business outcomes is always on the minds of owners and business leaders, but without actionable data, they’re simply relying on instinct, and this doesn’t always work out.

This is particularly true when it comes to issues of price. For example, to what extent will raising the price (and to what level) affect next quarter’s sales?

There’s no way to know this without data analysis. Regression analysis can help provide insights into the correlation between price rises and sales based on historical data.

How do businesses use regression? A real-life example

Marketing and advertising spending are common topics for regression analysis. Companies use regression when trying to assess the value of ad spend and marketing spend on revenue.

A typical example is using a regression equation to assess the correlation between ad costs and conversions of new customers. In this instance,

  • our dependent variable (the factor we’re trying to assess the outcomes of) will be our conversions
  • the independent variable (the factor we’ll change to assess how it changes the outcome) will be the daily ad spend
  • the regression equation will try to determine whether an increase in ad spend has a direct correlation with the number of conversions we have

The analysis is relatively straightforward — using historical data from an ad account, we can use daily data to judge ad spend vs conversions and how changes to the spend alter the conversions.

By assessing this data over time, we can make predictions not only on whether increasing ad spend will lead to increased conversions but also what level of spending will lead to what increase in conversions. This can help to optimize campaign spend and ensure marketing delivers good ROI.

This is an example of a simple linear model. If we wanted to build a more complex regression equation, we could also factor in other independent variables such as seasonality, GDP, and the current reach of our chosen advertising networks.

By increasing the number of independent variables, we can get a better understanding of whether ad spend is resulting in an increase in conversions, whether it’s exerting an influence in combination with another set of variables, or if we’re dealing with a correlation with no causal impact – which might be useful for predictions anyway, but isn’t a lever we can use to increase sales.

Using this predicted value of each independent variable, we can more accurately predict how spend will change the conversion rate of advertising.

Regression analysis tools

Regression analysis is an important tool when it comes to better decision-making and improved business outcomes. To get the best out of it, you need to invest in the right kind of statistical analysis software.

The best option is likely to be one that sits at the intersection of powerful statistical analysis and intuitive ease of use, as this will empower everyone from beginners to expert analysts to uncover meaning from data, identify hidden trends and produce predictive models without statistical training being required.

To help prevent costly errors, choose a tool that automatically runs the right statistical tests and visualizations and then translates the results into simple language that anyone can put into action.

With software that’s both powerful and user-friendly, you can isolate key experience drivers, understand what influences the business, apply the most appropriate regression methods, identify data issues, and much more.

With Qualtrics’ Stats iQ™, you don’t have to worry about the regression equation because our statistical software will run the appropriate equation for you automatically based on the variable type you want to monitor. You can also use several equations, including linear regression and logistic regression, to gain deeper insights into business outcomes and make more accurate, data-driven decisions.

  • Published: 31 January 2022

The clinician’s guide to interpreting a regression analysis

  • Sofia Bzovsky 1 ,
  • Mark R. Phillips   ORCID: orcid.org/0000-0003-0923-261X 2 ,
  • Robyn H. Guymer   ORCID: orcid.org/0000-0002-9441-4356 3 , 4 ,
  • Charles C. Wykoff 5 , 6 ,
  • Lehana Thabane   ORCID: orcid.org/0000-0003-0355-9734 2 , 7 ,
  • Mohit Bhandari   ORCID: orcid.org/0000-0001-9608-4808 1 , 2 &
  • Varun Chaudhary   ORCID: orcid.org/0000-0002-9988-4146 1 , 2

on behalf of the R.E.T.I.N.A. study group

Eye, volume 36, pages 1715–1717 (2022)

  • Outcomes research

Introduction

When researchers are conducting clinical studies to investigate factors associated with, or treatments for disease and conditions to improve patient care and clinical practice, statistical evaluation of the data is often necessary. Regression analysis is an important statistical method that is commonly used to determine the relationship between several factors and disease outcomes or to identify relevant prognostic factors for diseases [ 1 ].

This editorial will acquaint readers with the basic principles of, and an approach to interpreting results from, two types of regression analyses widely used in ophthalmology: linear and logistic regression.

Linear regression analysis

Linear regression is used to quantify a linear relationship or association between a continuous response/outcome variable or dependent variable with at least one independent or explanatory variable by fitting a linear equation to observed data [ 1 ]. The variable that the equation solves for, which is the outcome or response of interest, is called the dependent variable [ 1 ]. The variable that is used to explain the value of the dependent variable is called the predictor, explanatory, or independent variable [ 1 ].

In a linear regression model, the dependent variable must be continuous (e.g. intraocular pressure or visual acuity), whereas, the independent variable may be either continuous (e.g. age), binary (e.g. sex), categorical (e.g. age-related macular degeneration stage or diabetic retinopathy severity scale score), or a combination of these [ 1 ].

When investigating the effect or association of a single independent variable on a continuous dependent variable, this type of analysis is called a simple linear regression [ 2 ]. In many circumstances though, a single independent variable may not be enough to adequately explain the dependent variable. Often it is necessary to control for confounders and in these situations, one can perform a multivariable linear regression to study the effect or association with multiple independent variables on the dependent variable [ 1 , 2 ]. When incorporating numerous independent variables, the regression model estimates the effect or contribution of each independent variable while holding the values of all other independent variables constant [ 3 ].

When interpreting the results of a linear regression, there are a few key outputs for each independent variable included in the model:

Estimated regression coefficient: The estimated regression coefficient indicates the direction and strength of the relationship or association between the independent and dependent variables [ 4 ]. Specifically, the regression coefficient describes the change in the dependent variable for each one-unit change in the independent variable, if continuous [ 4 ]. For instance, if examining the relationship between a continuous predictor variable and intra-ocular pressure (dependent variable), a regression coefficient of 2 means that for every one-unit increase in the predictor, there is a two-unit increase in intra-ocular pressure. If the independent variable is binary or categorical, then the one-unit change represents switching from the reference category to the category being compared [ 4 ]. For instance, if examining the relationship between a binary predictor variable, such as sex, where ‘female’ is set as the reference category, and intra-ocular pressure (dependent variable), a regression coefficient of 2 means that, on average, males have an intra-ocular pressure that is 2 mm Hg higher than females.

Confidence Interval (CI): The CI, typically set at 95%, is a measure of the precision of the coefficient estimate of the independent variable [ 4 ]. A wide CI indicates a low level of precision, whereas a narrow CI indicates a higher precision [ 5 ].

P value: The p value for the regression coefficient indicates whether the relationship between the independent and dependent variables is statistically significant [ 6 ].
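To make the interpretation of a binary predictor concrete, the following minimal sketch fits a simple linear regression on invented intra-ocular pressure readings. The data and the helper `fit_simple_ols` are assumptions made for illustration only. With sex coded 0 for the reference category (female) and 1 for male, the fitted coefficient equals the difference between the two group means, and the intercept equals the mean of the reference group.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical readings in mm Hg; sex coded 0 = female (reference), 1 = male.
sex = [0, 0, 0, 0, 1, 1, 1, 1]
iop = [14.0, 15.0, 16.0, 15.0, 17.0, 16.0, 18.0, 17.0]

intercept, coefficient = fit_simple_ols(sex, iop)
female_mean = mean([v for s, v in zip(sex, iop) if s == 0])
male_mean = mean([v for s, v in zip(sex, iop) if s == 1])

print(f"coefficient for sex: {coefficient:.2f} mm Hg")
print(f"male mean minus female mean: {male_mean - female_mean:.2f} mm Hg")
```

In a multivariable model the same coefficient is read as the adjusted difference, holding the other predictors constant, so it will generally no longer equal the raw difference in means.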

Logistic regression analysis

As with linear regression, logistic regression is used to estimate the association between one or more independent variables with a dependent variable [ 7 ]. However, the distinguishing feature in logistic regression is that the dependent variable (outcome) must be binary (or dichotomous), meaning that the variable can only take two different values or levels, such as ‘1 versus 0’ or ‘yes versus no’ [ 2 , 7 ]. The effect size of predictor variables on the dependent variable is best explained using an odds ratio (OR) [ 2 ]. ORs are used to compare the relative odds of the occurrence of the outcome of interest, given exposure to the variable of interest [ 5 ]. An OR equal to 1 means that the odds of the event in one group are the same as the odds of the event in another group; there is no difference [ 8 ]. An OR > 1 implies that one group has a higher odds of having the event compared with the reference group, whereas an OR < 1 means that one group has a lower odds of having an event compared with the reference group [ 8 ]. When interpreting the results of a logistic regression, the key outputs include the OR, CI, and p-value for each independent variable included in the model.
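As a rough illustration of how an OR and its CI are read, the sketch below computes an unadjusted odds ratio from a 2×2 table, with a 95% CI based on the standard error of the log odds ratio. The counts and the function name `odds_ratio_with_ci` are hypothetical. For a logistic regression with a single binary predictor, the OR obtained by exponentiating the coefficient matches this cross-product ratio; multivariable models yield adjusted ORs that this table-based calculation does not reproduce.

```python
import math

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI from a 2x2 table.

    a: exposed with the event      b: exposed without the event
    c: unexposed with the event    d: unexposed without the event
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR); all four cells must be non-zero.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts: 20 of 100 exposed and 10 of 100 unexposed patients had the event.
or_value, ci_lower, ci_upper = odds_ratio_with_ci(20, 80, 10, 90)

print(f"OR = {or_value:.2f}, 95% CI {ci_lower:.2f} to {ci_upper:.2f}")
if ci_lower <= 1.0 <= ci_upper:
    print("The CI includes 1, so the data are compatible with no difference in odds.")
```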

Clinical example

Sen et al. investigated the association between several factors (independent variables) and visual acuity outcomes (dependent variable) in patients receiving anti-vascular endothelial growth factor therapy for macular oedema (DMO) by means of both linear and logistic regression [ 9 ]. Multivariable linear regression demonstrated that age (Estimate −0.33, 95% CI −0.48 to −0.19, p < 0.001) was significantly associated with best-corrected visual acuity (BCVA) at 100 weeks at the alpha = 0.05 significance level [ 9 ]. The regression coefficient of −0.33 means that the BCVA at 100 weeks decreases by 0.33 with each additional year of age.

Multivariable logistic regression also demonstrated that age and ellipsoid zone status were statistically significantly associated with achieving a BCVA letter score >70 letters at 100 weeks at the alpha = 0.05 significance level. Patients ≥75 years of age were at a decreased odds of achieving a BCVA letter score >70 letters at 100 weeks compared to those <50 years of age, since the OR is less than 1 (OR 0.96, 95% CI 0.94 to 0.98, p = 0.001) [ 9 ]. Similarly, patients between the ages of 50–74 years were also at a decreased odds of achieving a BCVA letter score >70 letters at 100 weeks compared to those <50 years of age, since the OR is less than 1 (OR 0.15, 95% CI 0.04 to 0.48, p = 0.001) [ 9 ]. Likewise, those with a not intact ellipsoid zone were at a decreased odds of achieving a BCVA letter score >70 letters at 100 weeks compared to those with an intact ellipsoid zone (OR 0.20, 95% CI 0.07 to 0.56; p = 0.002). On the other hand, patients with an ungradable/questionable ellipsoid zone were at an increased odds of achieving a BCVA letter score >70 letters at 100 weeks compared to those with an intact ellipsoid zone, since the OR is greater than 1 (OR 2.26, 95% CI 1.14 to 4.48; p = 0.02) [ 9 ].

The narrower the CI, the more precise the estimate is; and the smaller the p value (relative to alpha = 0.05), the greater the evidence against the null hypothesis of no effect or association.

Simply put, linear and logistic regression are useful tools for appreciating the relationship between predictor/explanatory and outcome variables for continuous and dichotomous outcomes, respectively, that can be applied in clinical practice, such as to gain an understanding of risk factors associated with a disease of interest.

References

1. Schneider A, Hommel G, Blettner M. Linear regression analysis. Dtsch Ärztebl Int. 2010;107:776–82.
2. Bender R. Introduction to the use of regression models in epidemiology. In: Verma M, editor. Cancer epidemiology. Methods in molecular biology. Humana Press; 2009:179–95.
3. Schober P, Vetter TR. Confounding in observational research. Anesth Analg. 2020;130:635.
4. Schober P, Vetter TR. Linear regression in medical research. Anesth Analg. 2021;132:108–9.
5. Szumilas M. Explaining odds ratios. J Can Acad Child Adolesc Psychiatry. 2010;19:227–9.
6. Thiese MS, Ronna B, Ott U. P value interpretations and considerations. J Thorac Dis. 2016;8:E928–31.
7. Schober P, Vetter TR. Logistic regression in medical research. Anesth Analg. 2021;132:365–6.
8. Zabor EC, Reddy CA, Tendulkar RD, Patil S. Logistic regression in clinical studies. Int J Radiat Oncol Biol Phys. 2022;112:271–7.
9. Sen P, Gurudas S, Ramu J, Patrao N, Chandra S, Rasheed R, et al. Predictors of visual acuity outcomes after anti-vascular endothelial growth factor treatment for macular edema secondary to central retinal vein occlusion. Ophthalmol Retin. 2021;5:1115–24.


R.E.T.I.N.A. study group

Varun Chaudhary 1,2 , Mohit Bhandari 1,2 , Charles C. Wykoff 5,6 , Sobha Sivaprasad 8 , Lehana Thabane 2,7 , Peter Kaiser 9 , David Sarraf 10 , Sophie J. Bakri 11 , Sunir J. Garg 12 , Rishi P. Singh 13,14 , Frank G. Holz 15 , Tien Y. Wong 16,17 , and Robyn H. Guymer 3,4

Author information

Authors and affiliations.

Department of Surgery, McMaster University, Hamilton, ON, Canada

Sofia Bzovsky, Mohit Bhandari & Varun Chaudhary

Department of Health Research Methods, Evidence & Impact, McMaster University, Hamilton, ON, Canada

Mark R. Phillips, Lehana Thabane, Mohit Bhandari & Varun Chaudhary

Centre for Eye Research Australia, Royal Victorian Eye and Ear Hospital, East Melbourne, VIC, Australia

Robyn H. Guymer

Department of Surgery, (Ophthalmology), The University of Melbourne, Melbourne, VIC, Australia

Retina Consultants of Texas (Retina Consultants of America), Houston, TX, USA

Charles C. Wykoff

Blanton Eye Institute, Houston Methodist Hospital, Houston, TX, USA

Biostatistics Unit, St. Joseph’s Healthcare Hamilton, Hamilton, ON, Canada

Lehana Thabane

NIHR Moorfields Biomedical Research Centre, Moorfields Eye Hospital, London, UK

Sobha Sivaprasad

Cole Eye Institute, Cleveland Clinic, Cleveland, OH, USA

Peter Kaiser

Retinal Disorders and Ophthalmic Genetics, Stein Eye Institute, University of California, Los Angeles, CA, USA

David Sarraf

Department of Ophthalmology, Mayo Clinic, Rochester, MN, USA

Sophie J. Bakri

The Retina Service at Wills Eye Hospital, Philadelphia, PA, USA

Sunir J. Garg

Center for Ophthalmic Bioinformatics, Cole Eye Institute, Cleveland Clinic, Cleveland, OH, USA

Rishi P. Singh

Cleveland Clinic Lerner College of Medicine, Cleveland, OH, USA

Department of Ophthalmology, University of Bonn, Bonn, Germany

Frank G. Holz

Singapore Eye Research Institute, Singapore, Singapore

Tien Y. Wong

Singapore National Eye Centre, Duke-NUS Medical School, Singapore, Singapore

Contributions

SB was responsible for writing, critical review and feedback on manuscript. MRP was responsible for conception of idea, critical review and feedback on manuscript. RHG was responsible for critical review and feedback on manuscript. CCW was responsible for critical review and feedback on manuscript. LT was responsible for critical review and feedback on manuscript. MB was responsible for conception of idea, critical review and feedback on manuscript. VC was responsible for conception of idea, critical review and feedback on manuscript.

Corresponding author

Correspondence to Varun Chaudhary .

Ethics declarations

Competing interests.

SB: Nothing to disclose. MRP: Nothing to disclose. RHG: Advisory boards: Bayer, Novartis, Apellis, Roche, Genentech Inc.—unrelated to this study. CCW: Consultant: Acuela, Adverum Biotechnologies, Inc, Aerpio, Alimera Sciences, Allegro Ophthalmics, LLC, Allergan, Apellis Pharmaceuticals, Bayer AG, Chengdu Kanghong Pharmaceuticals Group Co, Ltd, Clearside Biomedical, DORC (Dutch Ophthalmic Research Center), EyePoint Pharmaceuticals, Gentech/Roche, GyroscopeTx, IVERIC bio, Kodiak Sciences Inc, Novartis AG, ONL Therapeutics, Oxurion NV, PolyPhotonix, Recens Medical, Regeron Pharmaceuticals, Inc, REGENXBIO Inc, Santen Pharmaceutical Co, Ltd, and Takeda Pharmaceutical Company Limited; Research funds: Adverum Biotechnologies, Inc, Aerie Pharmaceuticals, Inc, Aerpio, Alimera Sciences, Allergan, Apellis Pharmaceuticals, Chengdu Kanghong Pharmaceutical Group Co, Ltd, Clearside Biomedical, Gemini Therapeutics, Genentech/Roche, Graybug Vision, Inc, GyroscopeTx, Ionis Pharmaceuticals, IVERIC bio, Kodiak Sciences Inc, Neurotech LLC, Novartis AG, Opthea, Outlook Therapeutics, Inc, Recens Medical, Regeneron Pharmaceuticals, Inc, REGENXBIO Inc, Samsung Pharm Co, Ltd, Santen Pharmaceutical Co, Ltd, and Xbrane Biopharma AB—unrelated to this study. LT: Nothing to disclose. MB: Research funds: Pendopharm, Bioventus, Acumed—unrelated to this study. VC: Advisory Board Member: Alcon, Roche, Bayer, Novartis; Grants: Bayer, Novartis—unrelated to this study.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article.

Bzovsky, S., Phillips, M.R., Guymer, R.H. et al. The clinician’s guide to interpreting a regression analysis. Eye 36 , 1715–1717 (2022). https://doi.org/10.1038/s41433-022-01949-z

Received : 08 January 2022

Revised : 17 January 2022

Accepted : 18 January 2022

Published : 31 January 2022

Issue Date : September 2022

DOI : https://doi.org/10.1038/s41433-022-01949-z

This article is cited by

Factors affecting patient satisfaction at a plastic surgery outpatient department at a tertiary centre in south africa.

  • Chrysis Sofianos

BMC Health Services Research (2023)

Handbook of Market Research, pp 299–327

Regression Analysis

  • Bernd Skiera 4 ,
  • Jochen Reiner 4 &
  • Sönke Albers 5  
  • Reference work entry
  • First Online: 03 December 2021

Linear regression analysis is one of the most important statistical methods. It examines the linear relationship between a metric-scaled dependent variable (also called endogenous, explained, response, or predicted variable) and one or more metric-scaled independent variables (also called exogenous, explanatory, control, or predictor variables). We illustrate how regression analysis works and how it supports marketing decisions, e.g., the derivation of an optimal marketing mix. We also outline how to use linear regression analysis to estimate nonlinear functions such as a multiplicative sales response function. Furthermore, we show how to use the results of a regression to calculate elasticities and to identify outliers, and discuss in detail the problems that occur in case of autocorrelation, multicollinearity and heteroscedasticity. We use a numerical example to illustrate all calculations in detail and to outline the problems that occur in case of endogeneity.

  • Regression analysis
  • Marketing mix modeling
  • Elasticities
  • Multicollinearity
  • Autocorrelation
  • Outlier detection
  • Endogeneity
  • Sales response function

Author information

Authors and affiliations.

Goethe University Frankfurt, Frankfurt, Germany

Bernd Skiera & Jochen Reiner

Kuehne Logistics University, Hamburg, Germany

Sönke Albers

Corresponding author

Correspondence to Bernd Skiera .

Editor information

Editors and affiliations.

Department of Business-to-Business Marketing, Sales, and Pricing, University of Mannheim, Mannheim, Germany

Christian Homburg

Department of Marketing & Sales Research Group, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany

Martin Klarmann

Marketing & Sales Department, University of Mannheim, Mannheim, Germany

Arnd Vomberg

Copyright information

© 2022 Springer Nature Switzerland AG

About this entry

Cite this entry.

Skiera, B., Reiner, J., Albers, S. (2022). Regression Analysis. In: Homburg, C., Klarmann, M., Vomberg, A. (eds) Handbook of Market Research. Springer, Cham. https://doi.org/10.1007/978-3-319-57413-4_17

DOI : https://doi.org/10.1007/978-3-319-57413-4_17

Published : 03 December 2021

Publisher Name : Springer, Cham

Print ISBN : 978-3-319-57411-0

Online ISBN : 978-3-319-57413-4

eBook Packages : Business and Management Reference Module Humanities and Social Sciences Reference Module Business, Economics and Social Sciences

Quantitative Research Methods for Political Science, Public Policy and Public Administration for Undergraduates: 1st Edition With Applications in R

14 Topics in Multiple Regression

Thus far we have developed the basis for multiple OLS regression using matrix algebra, delved into the meaning of the estimated partial regression coefficient, and revisited the basis for hypothesis testing in OLS. In this chapter we turn to one of the key strengths of OLS: its flexibility for model specification. First we will discuss how to include binary variables (referred to as “dummy variables”) as IVs in an OLS model. Next we will show you how to build on dummy variables to model their interactions with other variables in your model. Finally, we will address an alternative way to express the partial regression coefficients, using standardized coefficients, that permits you to compare the magnitudes of the estimated effects of your IVs even when they are measured on different scales. As has been our custom, the examples in this chapter are based on variables from the class data set.

14.1 Dummy Variables

Thus far, we have considered OLS models that include variables measured on interval level scales (or, in a pinch and with caution, ordinal scales). That is fine when we have variables for which we can develop valid and reliable interval (or ordinal) measures. But in the policy and social science worlds, we often want to include in our analysis concepts that do not readily admit to interval measure, including many cases in which a variable has an “on - off”, or “present - absent” quality. In other cases we want to include a concept that is essentially nominal in nature, such that an observation can be categorized as a subset but not measured on a “high-low” or “more-less” type of scale. In these instances we can utilize what is generally known as a dummy variable, which is also referred to as an indicator variable, Boolean variable, or categorical variable.

What the Heck are “Dummy Variables”?

  • A dichotomous variable, with values of 0 and 1;
  • A value of 1 represents the presence of some quality, a zero its absence;
  • The 1s are compared to the 0s, who are known as the “referent group”;
  • Dummy variables are often thought of as a proxy for a qualitative variable.

Dummy variables allow for tests of the differences in overall value of the \(Y\) for different nominal groups in the data. They are akin to a difference of means test for the groups identified by the dummy variable. Dummy variables allow for comparisons between an included (the 1s) and an omitted (the 0s) group. Therefore, it is important to be clear about which group is omitted and serving as the “comparison category.”

It is often the case that there are more than two groups represented by a set of nominal categories. In that case, the variable will consist of two or more dummy variables, with 0/1 codes for each category except the referent group (which is omitted). Several examples of categorical variables that can be represented in multiple regression with dummy variables include:

  • Experimental treatment and control groups (treatment=1, control=0)
  • Gender (male=1, female=0 or vice versa)
  • Race and ethnicity (a dummy for each group, with one omitted referent group)
  • Region of residence (dummy for each region with one omitted reference region)
  • Type of education (dummy for each type with omitted reference type)
  • Religious affiliation (dummy for each religious denomination with omitted reference)

The value of the dummy coefficient represents the estimated difference in \(Y\) between the dummy group and the reference group. Because the estimated difference is the average over all of the \(Y\) observations, the dummy is best understood as a change in the value of the intercept ( \(A\) ) for the “dummied” group. This is illustrated in Figure 14.1 . In this illustration, the value of \(Y\) is a function of \(X_1\) (a continuous variable) and \(X_2\) (a dummy variable). When \(X_2\) is equal to 0 (the referent case) the top regression line applies. When \(X_2 = 1\) , the value of \(Y\) is reduced to the bottom line. In short, \(X_2\) has a negative estimated partial regression coefficient represented by the difference in height between the two regression lines.

Figure 14.1: Dummy Intercept Variables

For a case with multiple nominal categories (e.g., region) the procedure is as follows: (a) determine which category will be assigned as the referent group; (b) create a dummy variable for each of the other categories. For example, if you are coding a dummy for four regions (North, South, East and West), you could designate the South as the referent group. Then you would create dummies for the other three regions. Then, all observations from the North would get a value of 1 in the North dummy, and zeros in all others. Similarly, East and West observations would receive a 1 in their respective dummy category and zeros elsewhere. The observations from the South region would be given values of zero in all three categories. The interpretation of the partial regression coefficients for each of the three dummies would then be the estimated difference in \(Y\) between observations from the North, East and West and those from the South.
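The coding procedure for the four-region example can be sketched as follows. The chapter's own examples are in R; this is a language-neutral illustration in Python, and the function name `make_dummies` and the `is_` column prefix are assumptions made for the example.

```python
def make_dummies(values, referent):
    """Return one dict of 0/1 dummy codes per observation, omitting the referent category."""
    categories = sorted(set(values) - {referent})
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

# Hypothetical observations; South is designated as the referent group.
regions = ["North", "South", "East", "West", "South"]
rows = make_dummies(regions, referent="South")

for region, row in zip(regions, rows):
    print(region, row)
```

Every South observation receives zeros on all three dummies, which is what makes South the comparison category for the North, East and West coefficients.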

Now let’s walk through an example of an R model with a dummy variable and the interpretation of that model. We will predict climate change risk using age, education, income, ideology, and “gend”, a dummy variable for gender for which 1 = male and 0 = female.

First note that the inclusion of the dummy variables does not change the manner in which you interpret the other (non-dummy) variables in the model; the estimated partial regression coefficients for age, education, income and ideology should all be interpreted as described in the prior chapter. Note that the estimated partial regression coefficient for “gend” is negative and statistically significant, indicating that males are less likely to be concerned about the environment than are females. The estimate indicates that, all else being equal, the average difference between men and women on the climate change risk scale is -0.2221178.

14.2 Interaction Effects

Dummy variables can also be used to estimate the ways in which the effect of a variable differs across subsets of cases. These kinds of effects are generally called “interactions.” When an interaction occurs, the effect of one \(X\) is dependent on the value of another. Typically, an OLS model is additive, where each \(X\) is weighted by its own \(B\) and the terms are added together to predict \(Y\):

\(Y_i = A + B_1X_1 + B_2X_2 + B_3X_3 + B_4X_4 + E_i\) .

However, an interaction model also includes a multiplicative term in which two of the IVs are multiplied:

\(Y_i = A + B_1X_1 + B_2X_2 + B_3X_3 + B_4X_4 + B_5(X_3 \times X_4) + E_i\) .

A “slope dummy” is a special kind of interaction in which a dummy variable is interacted with (multiplied by) a scale (ordinal or higher) variable. Suppose, for example, that you hypothesized that the effects of political ideology on perceived risks of climate change were different for men and women. Perhaps men are more likely than women to consistently integrate ideology into climate change risk perceptions. In such a case, a dummy variable (0=women, 1=men) could be interacted with ideology (1=strong liberal, 7=strong conservative) to predict levels of perceived risk of climate change (0=no risk, 10=extreme risk). If your hypothesized interaction was correct, you would observe the kind of pattern shown in Figure 14.2 .

Figure 14.2: Illustration of Slope Interaction
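To see why a slope dummy produces two lines with different slopes, it can help to write the model out separately for each group. The notation here is assumed for illustration: \(X_i\) is ideology, \(D_i\) is the gender dummy (0 = women, 1 = men), and \(B_1\), \(B_2\) and \(B_3\) are the coefficients on ideology, the dummy, and their product.

\[\begin{aligned} Y_i &= A + B_1 X_i + B_2 D_i + B_3 (D_i \times X_i) + E_i \\ \text{women } (D_i = 0):\quad Y_i &= A + B_1 X_i + E_i \\ \text{men } (D_i = 1):\quad Y_i &= (A + B_2) + (B_1 + B_3) X_i + E_i \end{aligned}\]

Under this sketch, \(B_2\) shifts the intercept for men and \(B_3\) is the difference between the men's and women's ideology slopes. If \(B_1\) is negative, a negative \(B_3\) makes the slope for men steeper than the slope for women.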

We can test our hypothesized interaction in R , controlling for the effects of age and income.

The results indicate a negative and significant interaction effect for gender and ideology. Consistent with our hypothesis, this means that the effect of ideology on climate change risk is more pronounced for males than females. Put differently, the slope of ideology is steeper for males than it is for females. This is shown in Figure 14.3 .

Figure 14.3: Interaction of Ideology and Gender

In sum, dummy variables add greatly to the flexibility of OLS model specification. They permit the inclusion of categorical variables, and they allow for testing hypotheses about interactions of groups with other IVs within the model. This kind of flexibility is one reason that OLS models are widely used by social scientists and policy analysts.

14.3 Standardized Regression Coefficients

In most cases, the various IVs in a model are represented on different measurement scales. For example, ideology ranges from 1 to 7, while age ranges from 18 to over 90 years old. These different scales make comparing the effects of the various IVs difficult. If we want to directly compare the magnitudes of the effects of ideology and age on levels of environmental concern, we would need to standardize the variables.

One way to standardize variables is to create a \(Z\)-score based on each variable. Variables are standardized in this way as follows:

\[\begin{equation} Z_i = \frac{X_i-\bar{X}}{s_x} \tag{14.1} \end{equation}\]

where \(s_x\) is the s.d. of \(X\). Standardizing the variables by creating \(Z\)-scores re-scales them so that each variable has a mean of \(0\) and a s.d. of \(1\). Therefore, all variables have the same mean and s.d. It is important to realize (and it is somewhat counter-intuitive) that the standardized variables retain all of the variation that was in the original measure.
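As a rough illustration of equation 14.1, the following Python sketch computes \(Z\)-scores with the standard library. The chapter itself works in R; the function name `z_scores` and the age values are hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev

def z_scores(values):
    """Return (x - mean) / s.d. for each value, using the sample s.d."""
    x_bar = mean(values)
    s_x = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(x - x_bar) / s_x for x in values]

# Hypothetical ages, for illustration only
ages = [18, 25, 37, 52, 64, 90]
z_age = z_scores(ages)

print(mean(z_age))   # approximately 0, up to floating-point error
print(stdev(z_age))  # approximately 1, up to floating-point error
```

The ordering and relative spacing of the observations are unchanged by this transformation, which is the sense in which the standardized variable retains the variation in the original measure.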

A second way to standardize variables converts the unstandardized \(B\) into a standardized \(B'\).

\[\begin{equation} B'_k = B_k\frac{s_k}{s_Y} \tag{14.2} \end{equation}\]

where \(B_k\) is the unstandardized coefficient of \(X_k\), \(s_k\) is the s.d. of \(X_k\), and \(s_Y\) is the s.d. of \(Y\). Standardized regression coefficients, also known as beta weights or "betas", are those we would get if we regressed a standardized \(Y\) onto standardized \(X\)'s.

Interpreting Standardized Betas

  • The standard deviation change in \(Y\) for a one-standard deviation change in \(X\)
  • All \(X\)'s are on an equal footing, so one can compare the strength of the effects of the \(X\)'s
  • Variances will differ across different samples

We can use the scale function in R to calculate a \(Z\) score for each of our variables, and then re-run our model.

In addition, we can convert the original unstandardized coefficient for ideology to a standardized coefficient.

Using either approach, standardized coefficients allow us to compare the magnitudes of the effects of each of the IVs on \(Y\) .
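The equivalence of the two approaches can be checked numerically. The Python sketch below is a minimal one-predictor illustration, not the chapter's R analysis: the helper names `ols_slope` and `z_scores` and the ideology and risk values are all hypothetical. It compares \(B' = B \, s_k / s_Y\) from equation 14.2 with the slope obtained by regressing a standardized \(Y\) on a standardized \(X\).

```python
from statistics import mean, stdev

def ols_slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

def z_scores(values):
    """Standardize values to mean 0 and sample s.d. 1."""
    v_bar, s = mean(values), stdev(values)
    return [(v - v_bar) / s for v in values]

# Hypothetical data: ideology (1-7) and perceived risk (0-10)
ideology = [1, 2, 3, 4, 5, 6, 7, 4]
risk = [9, 8, 8, 6, 5, 3, 2, 7]

b = ols_slope(ideology, risk)                          # unstandardized B
beta_from_formula = b * stdev(ideology) / stdev(risk)  # equation 14.2
beta_from_z = ols_slope(z_scores(ideology), z_scores(risk))

print(beta_from_formula, beta_from_z)  # the two values agree up to rounding
```

With several predictors the same identity holds coefficient by coefficient, but the slopes would then come from a multiple regression rather than from this single-predictor helper.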

14.4 Summary

This chapter has focused on options in designing and using OLS models. We first covered the use of dummy variables to capture the effects of group differences on estimates of \(Y\). We then explained how dummy variables, when interacted with scale variables, can provide estimates of the differences in how the scale variable affects \(Y\) across the different subgroups represented by the dummy variable. Finally, we introduced the use of standardized regression coefficients as a means to compare the effects of different \(X\)'s on \(Y\) when the scales of the \(X\)'s differ. Overall, these refinements in the use of OLS permit great flexibility in the application of regression models to estimation and hypothesis testing in policy analysis and social science research.

14.5 Study Questions

  • What is a dummy variable? When should we use it? How do you interpret coefficients on dummy variables?
  • What is an interaction effect? When should you include an interaction effect in your model?
  • What is the primary benefit of standardizing regression coefficients?

Regression Analysis

Regression analysis is a quantitative research method used when a study involves modelling and analysing several variables, where the relationship includes a dependent variable and one or more independent variables. In simple terms, it is used to test the nature of the relationship between a dependent variable and one or more independent variables.

The basic form of regression models includes unknown parameters (β), independent variables (X), and the dependent variable (Y).

A regression model specifies the relation of the dependent variable (Y) to a function of the independent variables (X) and the unknown parameters (β):

Y ≈ f(X, β)

A regression equation can be used to predict the value of 'y' for a given value of 'x', where 'y' and 'x' are paired measures on a sample of size 'n'. The formulae for the regression equation are:

Regression analysis
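The formula image from the original page did not survive extraction. For reference, the standard textbook least-squares formulae for a fitted line \(y = a + bx\) are sketched below; the notation is an assumption and may differ from what the source displayed.

\[ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} \]

\[ a = \frac{\sum y - b\sum x}{n} \]

Here \(b\) is the slope, \(a\) is the intercept, and each sum runs over the \(n\) paired observations of x and y.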

Do not be intimidated by the visual complexity of the correlation and regression formulae above. You don't have to apply the formulae manually; correlation and regression analyses can be run with popular analytical software such as Microsoft Excel, Microsoft Access, SPSS and others.

Linear regression analysis is based on the following set of assumptions:

1. Assumption of linearity. There is a linear relationship between the dependent and independent variables.

2. Assumption of homoscedasticity. The errors (residuals) have equal variance across all values of the independent variables.

3. Assumption of absence of collinearity or multicollinearity. The independent variables are not strongly correlated with one another.

4. Assumption of normal distribution. The errors (residuals) are normally distributed.

John Dudovskiy


Simple Linear Regression | An Easy Introduction & Examples

Published on February 19, 2020 by Rebecca Bevans . Revised on June 22, 2023.

Simple linear regression is used to estimate the relationship between two quantitative variables. You can use simple linear regression when you want to know:

  • How strong the relationship is between two variables (e.g., the relationship between rainfall and soil erosion).
  • The value of the dependent variable at a certain value of the independent variable (e.g., the amount of soil erosion at a certain level of rainfall).

Regression models describe the relationship between variables by fitting a line to the observed data. Linear regression models use a straight line, while logistic and nonlinear regression models use a curved line. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.

If you have more than one independent variable, use multiple linear regression instead.

Table of contents

  • Assumptions of simple linear regression
  • How to perform a simple linear regression
  • Interpreting the results
  • Presenting the results
  • Can you predict values outside the range of your data?
  • Frequently asked questions about simple linear regression

Assumptions of simple linear regression

Simple linear regression is a parametric test, meaning that it makes certain assumptions about the data. These assumptions are:

  • Homogeneity of variance (homoscedasticity): the size of the error in our prediction doesn't change significantly across the values of the independent variable.
  • Independence of observations: the observations in the dataset were collected using statistically valid sampling methods, and there are no hidden relationships among observations.
  • Normality: the data follows a normal distribution.

Linear regression makes one additional assumption:

  • The relationship between the independent and dependent variable is linear: the line of best fit through the data points is a straight line (rather than a curve or some sort of grouping factor).

If your data do not meet the assumptions of homoscedasticity or normality, you may be able to use a nonparametric test instead, such as the Spearman rank test.

If your data violate the assumption of independence of observations (e.g., if observations are repeated over time), you may be able to perform a linear mixed-effects model that accounts for the additional structure in the data.

How to perform a simple linear regression

Simple linear regression formula

The formula for a simple linear regression is:

\[ y = \beta_0 + \beta_1 x + \epsilon \]

  • \(y\) is the predicted value of the dependent variable for any given value of the independent variable \(x\).
  • \(\beta_0\) is the intercept, the predicted value of \(y\) when \(x\) is 0.
  • \(\beta_1\) is the regression coefficient: how much we expect \(y\) to change as \(x\) increases by one unit.
  • \(x\) is the independent variable (the variable we expect is influencing \(y\)).
  • \(\epsilon\) is the error of the estimate, or how much variation there is in our estimate of the regression coefficient.

Linear regression finds the line of best fit through your data by searching for the regression coefficient (\(\beta_1\)) that minimizes the total error (\(\epsilon\)) of the model.
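The search for the best-fitting line has a closed-form solution for a single predictor. The Python sketch below is a minimal illustration using only the standard library; the function name `fit_simple_ols` and the income and happiness values are hypothetical and are not the article's dataset.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-products and sum of squares around the means
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # the line passes through (x_bar, y_bar)
    return intercept, slope

# Hypothetical observations: income in units of 10,000 and happiness (1-10)
income = [1.5, 3.0, 4.5, 6.0, 7.5]
happiness = [1.3, 2.3, 3.4, 4.6, 5.5]

intercept, slope = fit_simple_ols(income, happiness)
print(intercept, slope)
```

Statistical software performs the same calculation and additionally reports standard errors and test statistics.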

While you can perform a linear regression by hand, this is a tedious process, so most people use statistical programs to help them quickly analyze the data.

Simple linear regression in R

R is a free, powerful, and widely-used statistical program. Download the dataset to try it yourself using our income and happiness example.

Dataset for simple linear regression (.csv)

Load the income.data dataset into your R environment, and then run the following command to generate a linear model describing the relationship between income and happiness:

This code takes the data you have collected (data = income.data) and calculates the effect that the independent variable income has on the dependent variable happiness using R's linear model function, lm().

To learn more, follow our full step-by-step guide to linear regression in R.

Interpreting the results

To view the results of the model, you can use the summary() function in R:

This function takes the most important parameters from the linear model and puts them into a table, which looks like this:

Simple linear regression summary output in R

This output table first repeats the formula that was used to generate the results (‘Call’), then summarizes the model residuals (‘Residuals’), which give an idea of how well the model fits the real data.

Next is the ‘Coefficients’ table. The first row gives the estimates of the y-intercept, and the second row gives the regression coefficient of the model.

Row 1 of the table is labeled (Intercept). This is the y-intercept of the regression equation, with a value of 0.20. You can plug this into your regression equation if you want to predict happiness values across the range of income that you have observed:

The next row in the ‘Coefficients’ table is income. This is the row that describes the estimated effect of income on reported happiness:

The Estimate column is the estimated effect, also called the regression coefficient or slope. The number in the table (0.713) tells us that for every one unit increase in income (where one unit of income = 10,000) there is a corresponding 0.71-unit increase in reported happiness (where happiness is a scale of 1 to 10).
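As a rough sketch of how the two estimates combine into a prediction, the Python snippet below plugs the intercept (0.20) and the income coefficient (0.713) quoted in the text into the regression equation. The function name `predict_happiness` is hypothetical, and income is expressed in units of 10,000 as in the article.

```python
INTERCEPT = 0.20  # (Intercept) estimate quoted in the text
SLOPE = 0.713     # income estimate quoted in the text

def predict_happiness(income_units):
    """Predicted happiness for income given in units of 10,000."""
    return INTERCEPT + SLOPE * income_units

# Predicted happiness at an income of 50,000 (5 units)
print(round(predict_happiness(5.0), 3))

# A one-unit increase in income changes the prediction by the slope
print(round(predict_happiness(4.0) - predict_happiness(3.0), 3))
```

The second printed value equals the slope, which is the interpretation given above: each additional 10,000 of income corresponds to about 0.71 more units of reported happiness.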

The Std. Error column displays the standard error of the estimate. This number shows how much variation there is in our estimate of the relationship between income and happiness.

The t value column displays the test statistic. Unless you specify otherwise, the test statistic used in linear regression is the t value from a two-sided t test. The larger the absolute value of the test statistic, the less likely it is that our results occurred by chance.

The Pr(>|t|) column shows the p value. This number tells us how likely we would be to see an estimated effect of income on happiness at least this large if the null hypothesis of no effect were true.

Because the p value is so low ( p < 0.001),  we can reject the null hypothesis and conclude that income has a statistically significant effect on happiness.

The last three lines of the model summary are statistics about the model as a whole. The most important thing to notice here is the p value of the model. Here it is significant (p < 0.001), which means that the model explains significantly more of the variation in happiness than a model with no predictors would.

Presenting the results

When reporting your results, include the estimated effect (i.e. the regression coefficient), standard error of the estimate, and the p value. You should also interpret your numbers to make it clear to your readers what your regression coefficient means.

It can also be helpful to include a graph with your results. For a simple linear regression, you can simply plot the observations on the x and y axis and then include the regression line and regression function:

Simple linear regression graph

Can you predict values outside the range of your data?

No! We often say that regression models can be used to predict the value of the dependent variable at certain values of the independent variable. However, this is only true for the range of values where we have actually measured the response.

We can use our income and happiness regression analysis as an example. Between 15,000 and 75,000, we found a regression coefficient of 0.73 ± 0.0193. But what if we did a second survey of people making between 75,000 and 150,000?

Extrapolating data in R

The regression coefficient for the relationship between income and happiness is now 0.21, that is, a 0.21-unit increase in reported happiness for every 10,000 increase in income. While the relationship is still statistically significant (p<0.001), the slope is much smaller than before.

Extrapolating data in R graph

What if we hadn’t measured this group, and instead extrapolated the line from the 15–75k incomes to the 75–150k incomes?

You can see that if we simply extrapolated from the 15–75k income data, we would overestimate the happiness of people in the 75–150k income range.

Curved data line

If we instead fit a curve to the data, it seems to fit the actual pattern much better.

It looks as though happiness actually levels off at higher incomes, so we can’t use the same regression line we calculated from our lower-income data to predict happiness at higher levels of income.

Even when you see a strong pattern in your data, you can’t know for certain whether that pattern continues beyond the range of values you have actually measured. Therefore, it’s important to avoid extrapolating beyond what the data actually tell you.
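One way to guard against accidental extrapolation is to make a prediction function refuse inputs outside the observed range. The Python sketch below is a hypothetical illustration: the function name `predict_within_range` is an assumption, the default coefficients are the intercept (0.20) and slope (0.713) quoted earlier, and the bounds correspond to the 15,000 to 75,000 income range in units of 10,000.

```python
OBSERVED_MIN = 1.5  # 15,000 in units of 10,000
OBSERVED_MAX = 7.5  # 75,000 in units of 10,000

def predict_within_range(income_units, intercept=0.20, slope=0.713):
    """Predict happiness only for incomes inside the observed range."""
    if not OBSERVED_MIN <= income_units <= OBSERVED_MAX:
        raise ValueError(
            f"income {income_units} is outside the observed range "
            f"[{OBSERVED_MIN}, {OBSERVED_MAX}]; refusing to extrapolate"
        )
    return intercept + slope * income_units

print(round(predict_within_range(5.0), 3))  # inside the range: allowed

try:
    predict_within_range(10.0)  # 100,000: outside the measured range
except ValueError as err:
    print(err)
```

Raising an error is a deliberately strict design choice; a warning would be a softer alternative when occasional extrapolation is acceptable.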

Frequently asked questions about simple linear regression

A regression model is a statistical model that estimates the relationship between one dependent variable and one or more independent variables using a line (or a plane in the case of two or more independent variables).

A regression model can be used when the dependent variable is quantitative, except in the case of logistic regression, where the dependent variable is binary.

Simple linear regression is a regression model that estimates the relationship between one independent variable and one dependent variable using a straight line. Both variables should be quantitative.

For example, the relationship between temperature and the expansion of mercury in a thermometer can be modeled using a straight line: as temperature increases, the mercury expands. This linear relationship is so certain that we can use mercury thermometers to measure temperature.

Linear regression most often uses mean-square error (MSE) to calculate the error of the model. MSE is calculated by:

  • measuring the distance of the observed y-values from the predicted y-values at each value of x;
  • squaring each of these distances;
  • calculating the mean of the squared distances.

Linear regression fits a line to the data by finding the regression coefficient that results in the smallest MSE.
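The three steps above can be sketched in a few lines of Python. The function name `mean_squared_error`, the data, and the two candidate lines are hypothetical and serve only to show that a better-fitting line has a smaller MSE.

```python
def mean_squared_error(x, y, intercept, slope):
    """MSE of the line y = intercept + slope * x against observed (x, y)."""
    # Step 1: distance of each observed y from its predicted y
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Step 2: square each distance
    squared = [r ** 2 for r in residuals]
    # Step 3: take the mean of the squared distances
    return sum(squared) / len(squared)

# Hypothetical observations
x = [1, 2, 3, 4]
y = [2, 4, 5, 8]

# Compare two candidate lines; the closer line has the smaller MSE
mse_close = mean_squared_error(x, y, intercept=0.0, slope=1.9)
mse_far = mean_squared_error(x, y, intercept=0.0, slope=1.0)
print(mse_close, mse_far)
```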

Cite this Scribbr article

Bevans, R. (2023, June 22). Simple Linear Regression | An Easy Introduction & Examples. Scribbr. Retrieved March 18, 2024, from https://www.scribbr.com/statistics/simple-linear-regression/

Understanding Regression Analysis: An Introductory Guide

  • Edition: Second Edition
  • By: Larry D. Schroeder , David L. Sjoquist & Paula E. Stephan
  • Publisher: SAGE Publications, Inc
  • Publication year: 2017
  • Online pub date: December 14, 2018
  • Discipline: Business and Management
  • Methods: Regression analysis , Independent variables , Dependent variables
  • DOI: https://doi.org/10.4135/9781506361628
  • Keywords: equations, errors, estimates, family size, food consumption, income, population
  • Print ISBN: 9781506332888
  • Online ISBN: 9781506361628

Understanding Regression Analysis: An Introductory Guide presents the fundamentals of regression analysis, from its meaning to uses, in a concise, easy-to-read, and non-technical style. It illustrates how regression coefficients are estimated, interpreted, and used in a variety of settings within the social sciences, business, law, and public policy. Packed with applied examples and using few equations, the book walks readers through elementary material using a verbal, intuitive interpretation of regression coefficients, associated statistics, and hypothesis tests. The Second Edition features updated examples and new references to modern software output.

Front Matter

  • Acknowledgements
  • Series Editor’s Introduction
  • About the Authors
  • Chapter 1 | Linear Regression
  • Chapter 2 | Multiple Linear Regression
  • Chapter 3 | Hypothesis Testing
  • Chapter 4 | Extensions to the Multiple Regression Model
  • Chapter 5 | Problems and Issues Associated With Regression

Back Matter

  • Appendix A: Derivation of a And b
  • Appendix B: Critical Values for Student’s t Distribution
  • Appendix C: Regression Output From SAS, STATA, SPSS, R, and EXCEL
  • Appendix D: Suggested Textbooks

Regression Analysis for COVID-19 Infections and Deaths Based on Food Access and Health Issues

Abrar Almalki, Balakrishna Gokaraju, Yaa Acquaah, and Anish Turlapaty

1 Computational Science and Engineering, North Carolina A&T University, Greensboro, NC 27411, USA; bgokaraju@ncat.edu (B.G.); ytacquaah@aggies.ncat.edu (Y.A.)

2 Department of Electronics and Communication Engineering, Indian Institute of Information Technology, Sri City 517 646, India; [email protected]

Associated Data

Data such as income and food access were downloaded from the USDA Food Desert Locator Map website at https://www.ers.usda.gov/data-products/food-access-research-atlas/go-to-the-atlas.aspx (accessed on 28 October 2019).

COVID-19, the disease caused by SARS-CoV-2, is considered one of the greatest pandemics of modern times. It has affected people’s health, education, employment, the economy, tourism, and transportation systems, and it will take a long time to recover from these effects and return people’s lives to normal. The main objective of this study is to investigate various factors in health and food access, and their spatial correlation and statistical association with the spread of COVID-19. A secondary aim is to explore regression models for examining COVID-19 spread with these variables. To address these objectives, we study the interrelation of various socio-economic factors that could help people better prepare for the next pandemic. One critical factor is food access and food distribution, because food outlets can be places of high population density where the virus spreads. Other variables, such as income and population density, may also influence pandemic spread. In this study, we mapped the spatial extent of COVID-19 cases together with food outlets using the spatial analysis methods of geographic information systems. The methodology consisted of clustering techniques and overlaying the spatial extent maps of the clusters of food outlets and of infected cases. After mapping, we analyzed the proximity of these clusters for spatial variability, correlations between them, and their causal relationships. The quantitative analyses of health issues and food access areas against COVID-19 infections and deaths were performed using machine learning regression techniques to understand the multivariate factors. The results indicate a correlation between the dependent and independent variables, with an R²-score of 0.44 for COVID-19 cases and 0.60 for COVID-19 deaths. The regression model with an R²-score of 0.60 is useful for showing the goodness of fit between COVID-19 deaths and the health and food access factors.

1. Introduction

An outbreak is declared a pandemic when it spreads over a large geographical area, infects a large number of people and causes high mortality, and is caused by a virus that is a subtype of an existing virus [ 1 ]. The first recorded pandemic was in 1580 [ 1 ]. Before 1889, pandemics followed a 50–60-year cycle, while after 1889 they have followed a 10–40-year cycle, with the possibility of further shortening [ 1 ]. Unfortunately, nothing has been done to change this pandemic pattern in the last century [ 1 ].

Research indicates that the current outbreak started to spread between people in late November to December 2019 [ 2 ]. On 31 December, 27 cases of an unknown disease were recorded [ 2 ]. The cause was identified on 7 January 2020 as a virus called SARS-CoV-2, a beta coronavirus that attaches to the lower respiratory tract [ 2 ]. By 18 January, cases had spread around the country owing to travel for the Chinese Lunar New Year [ 3 ]. The government locked down the city of Wuhan, considered ground zero, and closed all routes to the province [ 3 ]. The origin of the cases was linked to visits to Wuhan’s Huanan Seafood Market [ 2 ]. All cases were related to travel from Wuhan until 2 February 2020 [ 3 ]. Later, cases spread all over the world, including to the United States of America. The first case in the United States was recorded on 20 January 2020 [ 4 ]. By October 2021, the United States had recorded 44,518,018 total cases and 716,370 total deaths [ 4 ].

The Chinese government reacted to the spread of COVID-19 by restricting people’s movement, mandating masks, and deploying monitoring machines [ 5 ]. Internationally, responses included social distancing, vaccines, and hand disinfection to control the spread [ 6 ]. The Centers for Disease Control and Prevention (CDC) in the United States reacted to the pandemic by advising mask use, requiring negative tests for people entering the US from a foreign country, and collecting contact information from passengers to minimize incoming infection cases [ 7 ]. However, the World Health Organization recommendations on face masks and sanitizer were difficult to enforce in low-income countries in Africa because of poor facilities and low access to equipment [ 8 ].

Investigating the factors or variables associated with a pandemic is essential to understand its spread. In the case of this pandemic, investigating the COVID-19 spread in relation to food access distribution, income, population density, health issues, and poverty is associated with future pandemic recovery plans and prevention. Food access would be limited by a stay-at-home order, curfew, and social distance rule. At the same time, population density or human traffic in public places, such as food outlets, would increase the chance of infection. Food access in urban areas is a critical factor for human survival. Equal distribution of food outlets supports healthy and active life in communities, while unequal distribution may have a negative impact on people’s health and result in a higher incidence of diabetes and other health risks. Analyzing food distribution is a multi-variate problem as it depends on various factors of influence ranging from income to demography [ 9 ]. More variables, such as income levels, affect people’s ability to buy food and access transportation for takeout. Health issues and chronic diseases may be affected by the pandemic conditions and the consequences associated with weakened immunity and infections.

A healthy life and well-being are some of the United Nations goals and strategies, especially the Sustainable Development Goals SDG 3.3, which aims to end pandemics by 2030 [ 3 ]. However, the spread of a new virus threatens this goal [ 3 ] because this pandemic is not the first and will not be the last, and the frequency of these pandemics might increase as influenza mutates every cold season to form a new strain. Investigating the current stage of the pandemic and its adverse effects helps us as humans to prepare for future pandemics.

2. Literature Review

Scientists have documented outbreaks and pandemics and analyzed them to limit their negative influence. Previous pandemics, such as malaria and H1N1, affected human health and life. In a study by Malik & Abdalla, they mapped the spread of H1N1 by using spatiotemporal analysis. The study analyzed the spatial spread and spatial–temporal distribution with the factors of population density and international flights from Mexico [ 10 ]. The second study indicates the use of spatiotemporal analysis to map the H1N1 outbreak [ 11 ]. The study found that the virus infections did not spread much as clusters between the first and third weeks but increased to larger clusters in the sixth week [ 12 ]. These clusters started to converge further from week six to eighteen, and then started to decline in week 22 [ 12 ]. There have been some studies on pre-existing health risks and their susceptibility to higher infection rates during epidemics. One study presented the effect of obesity on influenza infection duration and concluded that obesity extended the shedding duration by 42% for influenza and by 43% for influenza-H1N1 [ 13 ].

Since COVID-19 was declared a pandemic on 11 March 2020, scientists have studied and analyzed the spread of the virus and its associated factors. Several studies focused on the global scale, while others investigated smaller scales and examined the correlation of specific variables with COVID-19 [ 14 ]. One study presented the sectors that were disrupted globally, including tourism, restaurants, leisure, entertainment, travel, and sports [ 15 ]. Another study compared developed and developing countries and found more COVID-19 cases and deaths in developed countries than in developing countries [ 14 ].

Pandemic spread and prediction can be analyzed by several methods, including the Geographic Information System (GIS) and machine learning (ML). GIS is an effective tool for visualizing the spread of cases with spatial reference maps, time, location, and other overlaying techniques. The role of GIS is clear in mapping cases, mapping case clusters, mapping the outbreak spread, and helping decision-makers act [ 16 ]. The geospatial analysis of COVID-19 with GIS has mostly addressed five topics: spatial–temporal analysis, health and social geography, environmental variables, data mining, and web-based mapping [ 17 ]. As an example, GIS can be used for dashboard tracking, which was applied for the first time at Johns Hopkins University [ 18 ]. GIS was also applied by the World Health Organization to illustrate confirmed cases and deaths [ 9 ]. A further example is HealthMap by Boston Children’s Hospital, USA [ 19 ]. One study demonstrated the effectiveness of ML models for outbreak prediction by applying a multi-layered perceptron (MLP) and an Adaptive Network-based Fuzzy Inference System (ANFIS) [ 20 ].

Currently, GIS is a useful tool for health authorities to map cases and deaths, track spread, and predict future spread so that they can take necessary and precise action on future outbreaks. The use of GIS is critical during and after the pandemic for policymakers deciding how to develop surveillance and tracking systems to control and prevent future pandemics [ 18 ]. South Korea provides the best example, having created a web-GIS pandemic tracking system that traces cases and highly infected sites [ 21 ]. Applying GIS in the South Korean approach provided a decision-making tool for up-to-date tracking and for predicting the procedures needed [ 22 ]. In the same vein, another study investigated the outbreak spread by applying five GIS model sizes in the United States [ 23 ]. It investigated the differences between model scales from local to global and applied those methods to four variables: black female populations, income, household income, and percentage of nurse practitioners [ 23 ].

Further tools for the analysis of COVID-19 cases and spread include statistical regression models. These models have been used to investigate the fluctuation in cases and then connect it to variables. A study in Germany investigated the spikes and decreases in COVID-19 cases in the first two months of the pandemic and suggested that these changes in the curve may be driven by variables that need to be studied [ 24 ]. More investigation of the correlation of COVID-19 with other variables, such as health issues, is critical around the world. The importance of cholesterol and its relation to the virus entering human cells is illustrated in a study, which found that lower cholesterol helps clear the virus sooner and limits infections [ 25 ]. High blood pressure recorded a correlation with a reduction in lung function [ 26 ].

The correlation of COVID-19 with variables has been investigated by several regression models as an efficient method. More specifically, research on the correlation with health issues has been applied and presents a correlation to various health issues. A multivariable linear regression analysis on global data of COVID-19 cases and deaths recorded a high correlation of cases and deaths with high cholesterol and high body mass [ 14 ]. Moreover, the correlation is stronger in the younger population [ 14 ]. An analysis in the United Kingdom on people’s body mass and COVID-19 hospitalization by applying logistic regression demonstrated higher hospitalization for people with obesity [ 27 ]. Further, a study conducted by penalized logistic regression models proved that hypertension illustrates a correlation to COVID-19 cases and mortality [ 28 ]. Nevertheless, moderate blood pressure is considered a dramatic factor in patient survival and limiting organ damage [ 28 ]. COVID-19 affects people’s health, and that effect may be more severe on people with health conditions. A study presented a regression analysis on patients’ clearance after being affected with COVID-19 and concluded that more days are recorded for people suffering from high cholesterol and diabetes [ 29 ]. Additionally, a study presented a positive correlation between COVID-19 and population density in India by computational correlation coefficient models [ 30 ].

The analysis of the geographical spread of a virus provides a tool for decision-making and long-term management for outbreaks [ 21 ]. Mapping the data based on normality would show a visualization, followed by normalizing data, such as showing the percentage based on every 100,000 people [ 31 ]. A recent study on the environmental effects of COVID-19 spread took place in China to analyze the effects of temperature and humidity [ 3 ]. The results illustrated the relationship between infected cases and weather, where low humidity supported the suitability and spread of the virus [ 24 ]. Moreover, strong cases showed a temperature range of 10 °C to 20 °C [ 3 ]. Furthermore, a higher number of cases were shown in economically developed cities, such as Beijing, and lower cases in less developed cities, such as Lhasa, which could be due to air pollution, geographical location, or population density [ 3 ]. Moreover, a study in Malaysia discussed that tourism was affected badly by the outbreak and, in turn, affected the economy and financial development [ 32 ].

Additional variables, such as income, were investigated in various geographical locations. An investigation in Spain demonstrated a negative correlation between mean income and the spread of COVID-19 cases, where more cases spread in districts with lower mean income because of low access to health care, lack of awareness, and poverty rates [ 33 ]. More specifically, low-income districts had 2.5 times higher cases than districts with above-mean income in Spain [ 33 ]. More studies presented the correlation of income to COVID-19 cases and deaths and its influence on food security. For instance, researchers in Kenya analyzed surveys on the influence of COVID-19 and concluded that low-income households that depend on labor jobs are more vulnerable to food insecurity due to financial shock [ 34 ]. More specifically, during the pandemic, people in low-income neighborhoods spent more time at work than those in high-income neighborhoods due to labor shortages [ 35 ].

COVID-19 has a long-term influence on food security: the food-insecure population in the United States increased by 17 million in 2020 compared to 2018 [ 36 ]. Alongside this overall increase, the pandemic has had a dramatic influence on children, with the share of children classified as food insecure 3% higher in 2020 than in 2017 [ 37 ]. Hence, the U.S. government expanded the free food programs in nationwide K–12 public schools.

In studies of Brazilian data, the investigations found a positive correlation with different socio-economic variables, such as population density, and a negative correlation with social isolation rates, which supports the importance of social distancing enforcement [ 38 ]. Another investigation was done in India using Pearson’s correlation coefficient [ 39 ]. A positive correlation between population density and COVID-19 cases was presented in five states. A statistical analysis recorded a correlation of COVID-19 with the number of tests and population density [ 39 ]. More variables, such as public transportation, were investigated for their correlation to COVID-19 cases and deaths. A statistical analysis recorded a correlation of COVID-19 with the number of tests and population density [ 40 ]. In another study, a positive correlation was presented between public transportation sites, such as airports and train stations, and COVID-19 cases, in which people living less than 25 miles from transportation spots showed higher cases than people living more than 50 miles away [ 41 ]. This was further supported by another study on the spatial distribution of COVID-19 cases in China, describing the possible influence of transportation on the spread between neighborhoods [ 42 ].

The demographic variables were also investigated in several studies. A study that took place in the United States analyzed the case and death numbers of COVID-19 and concluded that African Americans have the highest rates because of their low income, low access to transportation, and the high rate of chronic diseases, such as diabetes and obesity [ 43 ]. Also, the study recorded the vulnerability of the Hispanic community to the pandemic because of their high uninsured rate, high rate of chronic diseases, language barrier, and immigration status [ 10 ].

Researchers indicate that there is a lack of application of GIS on pandemic spreads and more application is needed [ 12 ]. There is a need for more GIS analysis on the outbreak with different variables. Further research is needed to investigate more variables, such as food access, in the United States [ 11 ]. The proposed study illustrates the investigation of the spatial distribution of COVID-19 cases and deaths in Guilford County and examines the possibility of correlation with specific variables in food access and health risks. This study investigates variables such as health issues, income, food outlets and access areas, population density, and poverty rates. This study is applying technology by exploring machine learning models’ efficiency to analyze the pandemic distribution.

The research questions in this study are:

  • Is it possible that COVID-19 cases and deaths in geospatial distribution are associated with food outlets and restaurants distribution?
  • Can other variables illustrate a geospatial correlation with COVID-19 cases and deaths?
  • How can machine learning discover a higher quantitative statistical correlation of COVID-19 cases and deaths against various independent variables?
  • Do the machine learning results concur with the GIS regression results?

Our contributions in this study are:

  • Investigated the geospatial association of COVID-19 cases and deaths to food outlets distribution
  • Examined the dependency of various socio-economic and health risk variables on COVID-19 cases and deaths
  • Applied ML techniques to investigate the statistical association between COVID-19 cases and deaths to other variables

3. Study Area and Materials

This study took place in Guilford County (Figure 1) in the state of North Carolina, with an area of 645.70 square miles and a population of 537,174 [ 44 ]. The county population is 35.4% black, 49.4% white, 5.3% Asian, 8.4% Hispanic, and 1.5% other [ 44 ]. The county took steps to maintain people’s health and wellness. Mandatory face masks were officially announced starting from 5 PM on June 26, 2020 [ 45 ]. Guilford County issued a “stay at home” order for transportation on April 17, 2020 [ 46 ]. In June, the county announced 5 testing sites spread around the county [ 46 ]. The county has three zip code areas with a high cluster of cases: 27405, 27407, and 27406 [ 47 ]. By October 14, 2021, North Carolina recorded 1,436,699 total cases [ 4 ]. The datasets were obtained from the health department in Guilford County.

Figure 1. Study area.

4. Methods and Results

This study adopted spatial-based and machine learning regression methods to analyze the correlation between COVID-19 cases, deaths, and independent variables. The spatial method was applied to analyze the correlation and to present it visually on maps showing the varying degree of correlation. ML regression is a strong tool that can be used for different topics and purposes, and cause analysis is one of them. Moreover, applying several models and comparing their results is important to find the most suitable model for this study and document it. In this study, the authors used ArcGIS-ArcMap software version 10.3 for GIS analysis and Jupyter software to apply the regression analysis. The method (Figure 2) used GIS tools for the spatial regression and the scikit-learn software library for the machine learning regression. The GIS regression methods applied four models: the scatterplot matrix graph, spatial autocorrelation (Moran’s I), ordinary least squares (OLS), and geographically weighted regression (GWR). The ML regression method applied four models: linear multioutput regression, K-nearest neighbors multioutput regression, random forest multioutput regression, and support vector regression. These models were applied to analyze the correlation between the dependent variables (COVID-19 cases and deaths) and the independent variables (med-income, poverty rate, population density, high blood pressure, high cholesterol, obesity, number of healthy food outlets, and number of unhealthy food outlets).

Figure 2. Methodology graph.

4.1. GIS Methods

The maps in Figure 3 and Figure 4 present the COVID-19 cases and deaths. In Figure 3, higher numbers of cases are presented in dark blue. The lowest COVID-19 infections are in downtown Greensboro, which has fewer residential homes than businesses, and the highest are located outside of Greensboro in Summerfield, Gibsonville, Sedalia, Burlington, and Pleasant Garden. In Figure 4, the highest numbers of deaths range between 22 and 33 per census tract, displayed in blue, and the lowest numbers of deaths, 0 to 3 per census tract, are displayed in yellow. Low numbers of COVID-19 deaths are reported in Greensboro, and high mortality is reported outside the city. One possible explanation for this distribution is people’s education and mask enforcement in large stores or offices. The scatterplot matrix graph in Figure 5 presents the interaction between COVID-19 cases and the independent variables. The graph illustrates some positive correlations, some negative correlations, and some pairs with no correlation. Positive correlations include obesity with poverty and high blood pressure. A negative correlation is presented between the obesity and med-income variables. However, no apparent strong correlation is observed between COVID-19 cases and the other variables in this scatter matrix visualization.

Figure 3. COVID-19 cases in Guilford County.

Figure 4. COVID-19 deaths distribution.

Figure 5. Scatterplot matrix graph using cases as dependent variable.

The scatterplot matrix graph is also applied to COVID-19 deaths as a dependent variable. The graph ( Figure 6 ) also presents no correlation between COVID-19 deaths and variables. Negative correlations are presented between med-income and poverty and obesity.

Figure 6. Scatterplot matrix graph using deaths as dependent variable.

After that, we applied the spatial autocorrelation (Moran’s I) to find the cluster of cases and deaths on some census tracts. The spatial autocorrelation is applied by this equation:
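The equation itself did not survive in this copy of the text. Assuming the standard global Moran’s I statistic, which matches the symbols defined for Equation (1), it can be written as:

```latex
I = \frac{n}{S_0}\,
    \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{i,j}\, z_i\, z_j}
         {\sum_{i=1}^{n} z_i^{2}},
\qquad
S_0 = \sum_{i=1}^{n}\sum_{j=1}^{n} w_{i,j}
\tag{1}
```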

In Equation (1), $z_i$ is the deviation of an attribute for feature $i$ from its mean ($x_i - \bar{X}$), $w_{i,j}$ is the spatial weight between features $i$ and $j$, $n$ is the total number of features, and $S_0$ is the aggregate of all spatial weights. After applying the equation, the results are presented in Figure 7 and Figure 8. Figure 7 illustrates that COVID-19 cases are significantly clustered in Guilford County, which means there is high dependency of output and independent input variables.
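As an illustration of how the global Moran’s I statistic behaves, the following minimal sketch computes it in plain Python for a toy example. The function name `morans_i`, the four-tract layout, and the binary adjacency weights are assumptions made for this example only; they are not the study’s data or the ArcGIS implementation.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]             # deviations from the mean
    s0 = sum(sum(row) for row in weights)      # aggregate of all spatial weights
    cross = sum(weights[i][j] * z[i] * z[j]
                for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(zi * zi for zi in z)


# Hypothetical example: four tracts in a row, each adjacent to its neighbors.
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([1, 2, 3, 4], W)    # similar values sit next to each other
dispersed = morans_i([1, 4, 1, 4], W)    # high and low values alternate
print(clustered, dispersed)
```

Positive values indicate that similar values cluster in space, while negative values indicate a dispersed (checkerboard-like) pattern.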

Figure 7. Spatial autocorrelation for COVID-19 cases.

Figure 8. Spatial autocorrelation for COVID-19 deaths.

In Figure 8 , the spatial autocorrelation concluded that the cluster of COVID-19 deaths is a result of random chance, which encourages the investigation further on different variables. The Moran’s summary of COVID-19 cases and deaths by the Moran’s I spatial autocorrelation is in Table 1 below.

OLS results for COVID-19 cases and deaths.

Next, local Moran’s was applied based on this formula:
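The formula is missing from this copy of the text. Assuming the commonly used Anselin local Moran’s I, written with the symbols defined for Equation (2), a consistent form is:

```latex
I_i = \frac{x_i - \bar{X}}{S_i^{2}}
      \sum_{j=1,\, j \neq i}^{n} w_{i,j}\,\bigl(x_j - \bar{X}\bigr),
\qquad
S_i^{2} = \frac{\sum_{j=1,\, j \neq i}^{n} \bigl(x_j - \bar{X}\bigr)^{2}}{n - 1}
\tag{2}
```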

In Equation (2), $n$ is the total number of features, $x_i$ is the attribute for feature $i$, and $w_{i,j}$ is the spatial weight between features $i$ and $j$. The output of this equation is presented in Figure 9 and Figure 10. Figure 9, the local Moran’s on COVID-19 cases, presents tracts with high case numbers and their correlation with high values of the variables in the south of Greensboro and east of Guilford County. The pink patch represents high cases of COVID-19 with an increase in variables. The red patch represents high cases with low correlation to the variables. The blue patch illustrates tracts with low case numbers and low variables in downtown Greensboro. In Figure 10, the local Moran’s on COVID-19 deaths is presented with the correlation of variables in each tract. The red patch represents high mortality with low correlation to the variables, and the pink patch represents high mortality with high variables in the north of Greensboro.

Figure 9. The local Moran’s on COVID-19 cases in Guilford County.

Figure 10. The local Moran’s on COVID-19 deaths in Guilford County.

Then, OLS was applied to examine dependent and independent variables. OLS is a linear regression to perform a prediction or detect relationship between dependent and independent variables. We examine COVID-19 cases as a dependent variable with all independent variables. This OLS model uses the equation below:
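The equation is missing from this copy of the text. Assuming the usual multiple linear regression form, which matches the symbols described for it, the OLS model can be written as:

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \varepsilon
\tag{3}
```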

where Y is the dependent variable, β are the coefficients, X are the explanatory (independent) variables, and ε is the random error. In Figure 11, red patches represent areas with higher COVID-19 cases than the model predicted, and the blue shaded census tracts illustrate areas with lower COVID-19 cases than the model expected. In this model, the multiple R-square was 0.358946, and the adjusted R-square was 0.307662. The Akaike’s information criterion (AICc) was 1412.247528. The joint F-statistic was 0.000000, which was a significant result. The Jarque–Bera statistic [g] was 1.511785, which indicates that the independent variables have an influence on the dependent variable. The joint Wald statistic [e] was significant and computed as 0.000000. The Koenker (BP) statistic, which determines whether the independent variables have a consistent relationship to the dependent variable, was 0.009854, which is significant and indicates that the relationship is not consistent.

Figure 11. OLS on COVID-19 cases in Guilford County.

In Figure 12, red patches represent areas with higher COVID-19 deaths than the model predicted, and the blue shaded patches illustrate areas with lower COVID-19 deaths than the model predicted. In this model, the multiple R-square was 0.159614, and the adjusted R-square was 0.092383. The Akaike’s information criterion (AICc) was 685.908921. The joint F-statistic was 0.021994, which was a significant result. The joint Wald statistic [e] was 0.000000, a significant result. The Koenker (BP) statistic, which determines whether the independent variables have a consistent relationship to the dependent variable, was 0.388493, which was not significant. The Jarque–Bera statistic [g] was 0.000000, which is significant and means the model is biased and needs further investigation.

Figure 12. OLS on COVID-19 deaths in Guilford County.
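The adjusted R-square values reported for the two OLS models can be checked from the multiple R-square values with the usual adjustment formula. The sketch below assumes 109 observations (the census tracts named in the conclusions) and 8 independent variables (the list given in the methods); the function name `adjusted_r2` is hypothetical.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-square: penalizes R-square for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Assumed sample size and predictor count (see the lead-in above).
N_TRACTS = 109
N_VARIABLES = 8

cases_adj = adjusted_r2(0.358946, N_TRACTS, N_VARIABLES)
deaths_adj = adjusted_r2(0.159614, N_TRACTS, N_VARIABLES)

print(round(cases_adj, 6))   # 0.307662
print(round(deaths_adj, 6))  # 0.092383
```

Under these assumptions, both results agree with the adjusted R-square values quoted in the text.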

Based on the independent variables’ coefficient of the OLS, variables with higher coefficients than 7.5 will be applied in the GWR. These variables are high cholesterol, high blood pressure, and healthy food outlets. In Figure 13 and Figure 14 GWRs were applied on COVID-19 cases and deaths to visualize the correlation with independent variables by applying this equation:

Figure 13. Geographically weighted regression on COVID-19 cases.

Figure 14. Geographically weighted regression on COVID-19 deaths.

In the equation above, the coefficient β1 gives the increase in y caused by a one-unit increase in x. The map shows fewer tracts with high correlation and more with medium correlation. In Figure 13, the map presents the correlation between the dependent and independent variables. Red patches, which represent high correlation, are in the east of Guilford County in tracts 012803, 015300, and 017200. In Figure 14, the map presents the correlation of COVID-19 deaths with the variables (high cholesterol, high blood pressure, and healthy food outlets) and presents the correlation degrees in color shades. The highest correlation of COVID-19 deaths with the variables is presented in tracts 015703, 012604, and 013700.

4.2. ML Regression Results and Discussion

This study adopted machine learning techniques to investigate the correlation by applying both linear and nonlinear regression models. Linear, multi-output linear, random forest, and K-nearest neighbors regression models were applied to investigate the data. All models investigate all variables at the same time, but linear regression investigates a single output at a time. These four models were applied to evaluate their results. The models predict the values of the dependent variables, COVID-19 cases and COVID-19 deaths, from the independent variables of med-income, poverty rate, population density, number of healthy food outlets, and number of unhealthy food outlets. The dataset was divided into 80% training and 20% testing for multioutput model development. The training set contained eighty-seven (87) observations, and the testing set contained twenty-two (22) observations. Two different metrics, root mean square error (RMSE) and R-squared (R²), were used to evaluate the models developed. The multioutput and multiple linear regression models were implemented with the scikit-learn package in Python and MATLAB 2020a, respectively. The default parameters used for the multioutput regression models are listed in Table 2.
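The 80/20 split described above can be sketched in plain Python as follows. The helper name `split_indices`, the fixed seed, and the choice to round the test set up are assumptions made here because they reproduce the 87/22 split from 109 observations; the study’s own splitting code is not shown in the text.

```python
import math
import random


def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into training and testing lists."""
    n_test = math.ceil(n_samples * test_fraction)   # round the test set up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)            # reproducible shuffle
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(109)
print(len(train_idx), len(test_idx))  # 87 22
```

The models are then fitted on the rows in `train_idx` only, and the rows in `test_idx` are held out for computing the evaluation metrics.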

Table 2. Regression models’ parameters.

The equation below is derived in the linear regression model. In the equation, coefficients of variables were computed based on the linear regression model.

The degree of linear association between all variables is computed by the Pearson correlation coefficient (R) and shown as a correlation matrix heatmap in Figure 15. The results can be read in three ways: R values close to 1 show a positive relationship, R values close to −1 show a negative relationship, and values close to zero show no linear relationship. It can be observed in the heatmap (Figure 15) that there is a positive correlation between obesity and poverty (R = 0.74). There is a high positive correlation between high cholesterol and high blood pressure (R = 0.82). Furthermore, there is a positive correlation between obesity and high blood pressure (R = 0.77). Moreover, there is a strong negative correlation between obesity and med-income (R = −0.7), and a negative correlation between income and poverty (R = −0.75). There is no correlation between COVID-19 cases and health issues (obesity, high cholesterol, and high blood pressure). Moreover, there is no correlation between unhealthy food outlets, healthy food outlets, and health issues.
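For readers who want to see how a single cell of such a correlation matrix is obtained, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name `pearson_r` and the toy numbers are hypothetical and are not taken from the study’s dataset.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical tract-level values, for illustration only.
poverty = [5.0, 10.0, 15.0, 20.0]
obesity = [20.0, 24.0, 30.0, 33.0]
income = [80.0, 60.0, 50.0, 30.0]

r_pos = pearson_r(poverty, obesity)   # rises together: close to +1
r_neg = pearson_r(income, obesity)    # moves in opposite directions: close to -1
print(r_pos, r_neg)
```

Because the coefficient always lies between −1 and 1, its sign carries the direction of the relationship, which is why negative values can appear in the heatmap.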

Figure 15. Correlation matrix with heatmap.

From the tables’ results below ( Table 3 and Table 4 ), the authors applied and compared the regression models results. The COVID-19 cases as a dependent variable have the highest value of R 2 -score as 45% by the application of linear regression for multioutput regression model, and COVID-19 deaths had a higher value of 60% by the application of support vector regression model. The high correlation R 2 -scores of COVID-19 deaths and variables were also presented by the GIS spatial autocorrelation as clustered distribution in Figure 7 . These regression models’ results indicate that independent variables (med-income, poverty rate, population density, number of healthy food outlets, and number of unhealthy food outlets) have more influence on the dependent variable COVID-19 deaths than COVID cases.

Table 3. R-square value of regression models.

Table 4. Root mean square error (RMSE) values of regression models.
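The two evaluation metrics compared in these tables can be sketched in plain Python as follows. The function names and the toy values are assumptions for illustration; they do not reproduce the study’s results.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical held-out observations and model predictions.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]

print(r_squared(observed, predicted))  # approximately 0.9
print(rmse(observed, predicted))       # approximately 0.707
```

A lower RMSE and a higher R² both indicate a better fit, so the two tables are read in opposite directions when ranking the models.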

The application of the multiple linear regression models considered the two dependent variables (COVID-19 cases and deaths). The support vector regression model was applied to examine all the data and errors within the threshold. In Figure 16, the predicted trends for the dependent variable COVID-19 deaths are presented against the original trend values. Both trends match the peaks and troughs well overall, showing similar behavior. However, the residual errors vary on both the positive and negative sides of the trend. The test data are kept out of the sample. The significance of Figure 11 is that the prediction trend matches the peaks and troughs present in the original (ground-truth) trend of the number of COVID cases well. There are still many residual gaps between the original and predicted values, but the trend was predicted well overall. This figure coincides well with the R²-coefficient of 0.60 for the number of COVID-19 deaths.

Figure 16. Support vector regression model.

5. Conclusions

This study implemented GIS and machine learning techniques on COVID-19 data in 109 census tracts in Guilford County to investigate any correlation between the spread of the pandemic and social–economic, food access, and health issues variables. The GIS and machine learning methods were applied to examine the datasets and compare their results regarding if they are equivalent or different.

The GIS results illustrate the distribution of the variables, where COVID-19 cases have a cluster in Guilford County, while COVID-19 deaths have no cluster. The cases cluster was biased and indicated the need for more investigation of independent variables. COVID-19 deaths presented a p-value of 0.00000, which indicates a 99% confidence that independent variables had no influence on the distribution. Moreover, the COVID-19 infection cases p-value result was 0.516475, which indicates less than 90% confidence that independent variables do not influence the distribution. The OLS results did not indicate a high influence of the independent variables on the dependent variables. The R-square of the influence of the independent variables on COVID-19 cases is only 35%. It also indicated only 9% on COVID-19 deaths. These percentages are low, and we suggest more investigation and including more variables.

The application of four spatial regression models indicates some influence on the independent variables. The heat map presented a weak correlation between the dependent and independent variables. There was a positive but not strong correlation between the dependent variables COVID-19 cases and deaths, which means deaths increase where cases are high. However, there were several strong negative correlations between income and two variables (poverty and obesity), and there was a positive correlation between poverty and obesity. More correlations between the independent variables are clear in the positive correlation of high blood pressure with obesity and high cholesterol. These independent variables do not show direct impacts on the dependent variables, but they affect people’s health, which could make them control variables. For example, poverty leads to an unhealthy diet, which affects people’s immune system, and the presence of two health issues in a community makes its members more vulnerable to health issues and risks. The highest R-square for COVID-19 cases was 60% by support vector regression, and for COVID-19 deaths the highest R-square was 44% by the linear regression for multioutput regression. These numbers are not high for correlation, which indicates an unclear influence of the independent variables on the dependent variables.

The machine learning results take the same direction as the GIS results, correlation between variables or independent variables. The study illustrates the need for future investigation on the spread of COVID-19 infections and deaths in Guilford County. Further study may include the distribution of more health issues, such as autoimmune diseases, to investigate more correlations to COVID-19 infections. Further analysis would require more datasets or a larger geographical scale.

In the future, this study would examine several variables exclusively independent in the regression model and investigate feature engineering in machine learning to increase the R²-score. Other independent variables would be related to the distribution of health centers, religion, and public transportation stops and routes. These data could be obtained from the transportation department and state health department. This study has a data limitation. The study area has 118 census tracts, but only 109 census tracts had all the data variables recorded. That affected the results because more data would show more correlation and distribution analysis. More data would provide a clearer picture of the analysis to examine the issues on a state level, which includes many counties, and to analyze patterns and compare the counties.

Acknowledgments

We thank King Abdulaziz University for the financial support of the first author’s degree. We thank Mark Smith from the Health Department in Guilford County, Greensboro, NC for providing the preliminary datasets on health statistical records and other data. The authors also sincerely acknowledge the Health Surveillance and Analysis Unit of the Guilford County Department of Health and Human Services, Division of Public Health as a source as well as the NC Electronic Disease Surveillance System (NC EDSS) of NC DHHS for providing datasets.

Author Contributions

Conceptualization, B.G. and A.A.; methodology, A.A., Y.A. and B.G.; software, Y.A. and A.A.; validation, B.G.; formal analysis, A.A. and Y.A.; investigation, A.A.; resources, B.G.; data curation, B.G., A.A., B.G., and Y.A.; writing—original draft preparation, A.A.; writing—review and editing, A.A., B.G. and A.T.; visualization, A.A, Y.A.; supervision, B.G.; project administration, A.A. and B.G.; funding acquisition, A.A., B.G. All authors have read and agreed to the published version of the manuscript.

This research is sponsored by North Carolina Dept. of Environmental Quality, (NCDEQ), Center for Energy Research and Technology (CERT), Visualization and Computation Advancing Research Center (ViCAR), and Computational Data Science and Engineering Department at NC A&T State University.

Informed Consent Statement

Not applicable.

Data Availability Statement

Conflicts of interest.

The authors declare no conflict of interest.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A Beginner’s Guide to Regression Analysis

Published by Owen Ingram on September 1st, 2021, Revised on July 5, 2022

Are you good with data-driven decisions at work? If not, why? What is stopping you from getting on the crest of a wave? There could be just one answer to these questions, and that is “too much data getting in the way.” Do not worry; there is a solution to every problem in this world, and there is definitely one for parsing through tons of data.

Yes, you heard it right! You will not have to get in trouble with the number crunching and counting with this solution. What is the solution?

Well, without further ado, we would like to introduce you to “regression,” which, put simply, allows one to see into the future.

What is Regression Analysis?

Here is a scenario to help you understand what regression is and how it helps you make better strategic decisions in research.

Let’s say you are the CEO of a company and are trying to predict the profit margin for the next month. Now you might have a lot of factors in your mind that can affect the number. Be it the number of sales you get in the month, the number of employees not taking leaves, or the number of hours each worker gives daily. But what if things do not go as planned? The “what if” list here has no stop; it can go on forever.  All these impacting factors here are variables, and regression analysis is the process of mathematically figuring out which of these variables actually have an impact and which are not plausible.

So, we can say that regression analysis helps you find the relationship between a set of dependent and independent variables. There are different ways to find this relationship between variables, which in statistics are named “regression models.”

We will learn about each in the next heading.

Types of Regression Models

If you are not sure which type of regression model you should use for a particular study, this section might help you.

Though there are numerous types of regression models depending on the type of variables, these are the most common ones.

Linear Regression

Logistic regression, ridge regression, lasso regression, polynomial regression, bayesian linear regression.

Linear regression is the real workhorse of the industry and is probably the first type that comes to mind. It is also known as Linear Least Squares or Ordinary Least Squares, after the method most often used to fit it. The model relates a dependent variable to a predictor (independent) variable through a straight line, hence the name linear regression. If the data you are dealing with contains more than one independent variable, the model is called Multi-Linear Regression (more commonly, multiple linear regression).
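As a minimal sketch of the least-squares idea, assuming a single predictor and made-up numbers (the function name `fit_simple_ols` and the data are illustrative only), the intercept and slope of the best-fit line can be computed in closed form:

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Illustrative data only: hours worked (x) against units sold (y).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
units = [3.1, 4.9, 7.2, 8.8, 11.0]
intercept, slope = fit_simple_ols(hours, units)
print(f"units = {intercept:.2f} + {slope:.2f} * hours")
```

With more than one independent variable the same principle applies, but the coefficients are found by solving a system of equations rather than a single ratio.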

Logistic Regression comes into play when the dependent variable is discrete. This means that the target takes only one of two values, for instance true or false, yes or no, 0 or 1, and so on. In this case, a sigmoid curve describes the relationship between the independent variables and the probability of the dependent variable.

When using this regression model for the data analysis process, two things should strictly be taken into consideration:

  • Make sure there is no multicollinearity (the same requirement as in the linear regression model), that is, no strong correlation among the independent variables in the dataset
  • Also, ensure that the dataset is large and that the values of the target variable are represented in roughly equal proportion
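The sigmoid relationship described above can be sketched as follows. This is a minimal illustration, assuming one predictor, a balanced made-up dataset, and plain gradient descent as the fitting method; the names `sigmoid` and `fit_logistic`, the learning rate, and the step count are all choices made for this example.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Fit P(y = 1 | x) = sigmoid(bias + weight * x) by gradient descent on the log-loss."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every observation at the current parameters.
        errors = [sigmoid(bias + weight * x) - y for x, y in zip(xs, ys)]
        weight -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        bias -= learning_rate * sum(errors) / n
    return bias, weight


# Illustrative data only: five 0s and five 1s, with some overlap between the classes.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
bias, weight = fit_logistic(xs, ys)

p_low = sigmoid(bias + weight * 0.5)
p_high = sigmoid(bias + weight * 5.0)
print(f"P(y=1 | x=0.5) = {p_low:.2f}, P(y=1 | x=5.0) = {p_high:.2f}")
```

The output of the model is a probability; a yes/no prediction is obtained by comparing it with a threshold such as 0.5.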

Ridge Regression is used when the independent variables are highly correlated with one another (multicollinearity). With multicollinear data, least-squares estimates are still unbiased, but their variances are large, so an individual estimate can land far from the true value.

Thus, ridge regression introduces a small amount of bias, in the form of a penalty on the size of the coefficients, in exchange for lower variance. This powerful type of regression is less vulnerable to overfitting. Are you familiar with the word ‘overfitting’?

Overfitting in statistics is a modeling error that occurs when a function is fitted too closely to a limited set of data points. A model compromised by this error may describe the data it was fitted to very well and still perform poorly on new data.
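To make the ridge penalty concrete, here is a small sketch assuming two centered predictors and no separately penalized intercept. The function names, the penalty value, and the data are illustrative. The coefficients solve (X'X + penalty * I) b = X'y, so setting the penalty to zero gives ordinary least squares.

```python
def center(values):
    """Subtract the mean so that no separate intercept term is needed."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def ridge_two_predictors(x1, x2, y, penalty):
    """Solve (X'X + penalty * I) b = X'y for two centered predictors."""
    u, v, t = center(x1), center(x2), center(y)
    s11 = sum(a * a for a in u) + penalty
    s22 = sum(b * b for b in v) + penalty
    s12 = sum(a * b for a, b in zip(u, v))
    r1 = sum(a * c for a, c in zip(u, t))
    r2 = sum(b * c for b, c in zip(v, t))
    det = s11 * s22 - s12 * s12
    # Closed-form inverse of the 2x2 matrix applied to the right-hand side.
    return (s22 * r1 - s12 * r2) / det, (s11 * r2 - s12 * r1) / det


# Illustrative data only: x2 is almost a copy of x1, so the predictors are nearly collinear.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1]
y = [5.2, 9.8, 15.9, 19.6, 25.4]

ols = ridge_two_predictors(x1, x2, y, penalty=0.0)
ridge = ridge_two_predictors(x1, x2, y, penalty=1.0)
print("least squares:", ols)
print("ridge        :", ridge)
```

Comparing the two printed pairs shows the effect of the penalty: the ridge coefficients are pulled toward zero and toward each other, which is what makes them less sensitive to small changes in nearly collinear data.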

Lasso Regression is best suited to performing regularization alongside feature selection. This type of regression penalizes the absolute size of the regression coefficients. What happens next? Some coefficient values are driven all the way to zero, which is different from Ridge Regression, where coefficients shrink but generally stay non-zero.

This is why feature selection makes use of this regression model: it helps select a subset of features from the dataset. Only the required features are kept in Lasso Regression, and the coefficients of all the other features are set to zero. This helps reduce overfitting in the model. But what if the independent variables are highly collinear?

In that case, this model tends to choose only one of the variables and turn the others to zero. We can say that it is somewhat like Ridge Regression but with variable selection.
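Why lasso produces exact zeros can be seen in the simplest case. The sketch below assumes a single centered predictor, for which the lasso solution has a closed form based on soft-thresholding; the names `soft_threshold` and `lasso_slope`, the penalty values, and the data are illustrative.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty, returning exactly 0.0 inside [-penalty, penalty]."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_slope(xs, ys, penalty):
    """Minimize 0.5 * sum((y - b * x)^2) + penalty * |b| for centered x and y."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return soft_threshold(sxy, penalty) / sxx


# Illustrative data only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

for penalty in (0.0, 5.0, 25.0):
    print(f"penalty = {penalty:>4}: slope = {lasso_slope(xs, ys, penalty):.3f}")
```

Once the penalty exceeds the strength of the relationship between x and y, the slope is set to exactly zero, which is the mechanism behind the feature selection described above.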

Polynomial Regression is almost the same as Multi-Linear Regression but with some changes. In the Polynomial Regression Model, the relationship between the dependent and independent variables is modeled as an nth-degree polynomial. While in a Multi-Linear Regression Model the fitted line is straight, here the best-fit line is a curve. The curve follows the overall trend of the data points rather than passing through every one of them, and its shape depends on the degree n and on the values of X.

This model is also prone to overfitting. It is best to examine the curve towards its ends, as higher-degree polynomials might give strange and unexpected results on extrapolation.
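As a rough sketch of how a polynomial fit can be computed, the example below builds the normal equations for powers of x and solves them with Gaussian elimination. The helper names and the data are illustrative, and this direct approach is only reasonable for low degrees and small datasets.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Move the row with the largest entry in this column into pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def fit_polynomial(xs, ys, degree):
    """Return least-squares coefficients [c0, c1, ..., cn] of c0 + c1*x + ... + cn*x^n."""
    powers = range(degree + 1)
    gram = [[sum(x ** (i + j) for x in xs) for j in powers] for i in powers]
    moments = [sum(y * x ** i for x, y in zip(xs, ys)) for i in powers]
    return solve_linear_system(gram, moments)


# Illustrative data generated from y = 1 + 2x + 3x^2, so the fit should recover 1, 2, 3.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]
coefficients = fit_polynomial(xs, ys, degree=2)
print([round(c, 6) for c in coefficients])
```

Raising `degree` lets the curve bend more, which is exactly what makes the model prone to overfitting on small datasets.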

The last type of regression model we are going to discuss is the Bayesian Linear Regression. Have you heard of the Bayes theorem? Well, this regression type basically uses that to figure out the value of regression coefficients.

It is a lot like both Ridge Regression and Linear Regression, but the stability here is much higher. In this model, we find the posterior distribution of the regression coefficients instead of a single least-squares estimate.
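The idea of a posterior distribution over a coefficient can be sketched in the simplest conjugate case. The assumptions here are strong and chosen for illustration: one predictor, no intercept, a known noise variance, and a zero-mean normal prior on the slope. The function name and the numbers are made up.

```python
def posterior_for_slope(xs, ys, noise_variance, prior_variance):
    """Posterior mean and variance of b in y = b * x + noise,
    with prior b ~ Normal(0, prior_variance) and known noise variance."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Precisions (inverse variances) from the prior and from the data add up.
    precision = 1.0 / prior_variance + sxx / noise_variance
    variance = 1.0 / precision
    mean = variance * sxy / noise_variance
    return mean, variance


# Illustrative data only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

mean, variance = posterior_for_slope(xs, ys, noise_variance=1.0, prior_variance=1.0)
print(f"posterior mean = {mean:.4f}, posterior variance = {variance:.4f}")
```

Under these assumptions the posterior mean equals the ridge estimate with penalty noise_variance / prior_variance, which is one way to see the resemblance to Ridge Regression; the posterior variance additionally says how uncertain that estimate is.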

FAQs About Regression Analysis

What is regression?

It is a technique to find out the relationship between the dependent and independent variables.

What is a linear regression model?

A Linear Regression Model helps determine the relationship between continuous variables by fitting a linear equation to the data.

What is the difference between multi-linear regression and polynomial regression?

The only difference between Multi-Linear Regression and Polynomial Regression is that in the latter the relationship between ‘x’ and ‘y’ is modeled as an nth-degree polynomial, so the fitted line is a curve, while in Multi-Linear Regression the line is straight.

What is overfitting in statistics?

When a function in statistics corresponds too closely to a particular set of data, some modeling error is possible. This modeling error is called overfitting.

What is ridge regression?

It is a method of estimating the coefficients of multiple regression models in which the independent variables are highly correlated. In other words, it is a method to develop a parsimonious model when the number of predictor variables is higher than the number of observations in a set.


Course info, instructors.

  • Dr. Peter Kempthorne
  • Dr. Choongbum Lee
  • Dr. Vasily Strela
  • Dr. Jake Xia

Departments

  • Mathematics

As Taught In

  • Applied Mathematics
  • Probability and Statistics

Learning Resource Types

Topics in Mathematics with Applications in Finance: Regression Analysis

This file contains information regarding lecture 6 notes.


II. Explaining the Regression Analyses

A regression analysis is a statistical technique designed to show the relative importance of each of a number of independent variables in predicting a phenomenon of interest– in this case, the likelihood that a respondent is very happy.

For the purpose of this analysis, we constructed two regression models. The first considered party identification along with a series of demographic traits, including age, gender, race, ethnicity, income, educational achievement and marital status. A second model considered all those factors, as well as church attendance and health status, which have long been shown to be correlated with happiness. Predicted probabilities have been computed by varying a given independent variable from its minimum to its maximum value, while holding all other variables in the equation constant (at their mean or modal value). Both regression analyses were performed using a combined database from two different Pew surveys: one conducted in July 2008 among 2,250 adults and the other conducted in October 2005 among 3,014 adults.
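The predicted-probability procedure described above can be sketched as follows. This is an illustration only: it assumes a logistic model, and the variable names, coefficients, intercept, and baseline values are invented for the example and are not the estimates from these surveys.

```python
import math


def predicted_probability(coefficients, intercept, values):
    """Probability from a logistic model: sigmoid(intercept + sum(coefficient * value))."""
    z = intercept + sum(coefficients[name] * values[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-z))


def effect_of(variable, coefficients, intercept, baseline, low, high):
    """Change in predicted probability when one variable moves from low to high
    while every other variable stays at its baseline (mean or modal) value."""
    at_low = predicted_probability(coefficients, intercept, {**baseline, variable: low})
    at_high = predicted_probability(coefficients, intercept, {**baseline, variable: high})
    return at_high - at_low


# Hypothetical coefficients and baseline values, invented for illustration only.
coefficients = {"married": 0.5, "income_level": 0.1, "age_65_plus": 0.4}
intercept = -1.5
baseline = {"married": 1, "income_level": 4.0, "age_65_plus": 0}

income_effect = effect_of("income_level", coefficients, intercept, baseline, low=1, high=9)
print(f"change in predicted probability across the income range: {income_effect:.3f}")
```

Reporting effects this way turns regression coefficients into differences in probability, which are easier to compare across variables measured on different scales.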

The Model One analysis found:

  • The probability of a Republican being very happy is 13 percent greater than the probability of a Democrat being very happy, once all other variables in this model are held equal.
  • The probability of those at the highest annual income levels ($150,000 and above) being very happy is 16 percent greater than the probability of those at the lowest income levels (less than $10,000) being very happy, if all else is held equal.
  • The probability of married respondents being very happy is 12 percent greater than the probability of unmarried respondents being very happy, if all else is held equal.
  • The probability of someone ages 64 and above being very happy is 10 percent greater than the probability of all other adults being very happy, if all else is held equal.
  • The probability of someone ages 29 and below being very happy is 7 percent greater than the probability of all other adults being very happy, if all else is held equal.
  • The probability of someone who has completed college being very happy is 8 percent greater than the probability of someone who has not completed high school being very happy, if all else is held equal.
  • Gender, race and ethnicity have no independent impact on the odds that someone is very happy, once all the other variables in this equation have been controlled.

The Model Two analysis found:

  • The probability of those who report they are in excellent health being very happy is 36 percent greater than the probability of those in poor health being very happy, if all else is held equal. (There was a small difference in the 2008 and 2005 surveys in the number of response categories to the health question. To make the measure more comparable across the years, the 2008 sample combined those who said “poor” and “only fair” into one category for this analysis. In the 2005 sample, the lowest health rating was poor.)
  • The probability of those who attend religious services more than once a week being very happy is 18 percent greater than the probability of those who never attend religious services being very happy, if all else is held equal.
  • In this model, the estimated impact of party identification, marriage, income and age under 30 all decline slightly from their values in Model One, but still stand as predictors of happiness. The value of age over 64 increases slightly from its value in Model One. No other demographic traits have an independent impact on predicted happiness in this model.

We also ran the regression equation using ideological self-identification (conservative versus liberal) rather than partisan self-identification (Republican versus Democrat) as one of the variables. We found that the impact of being conservative as a predictor of happiness is about the same as the effect of being a Republican. In addition, when we did another analysis that combined both party and ideology on a continuum from liberal Democrat to conservative Republican, we found that the effect of these combined variables on predicting happiness is slightly greater than the effect of either variable on its own.



USC Center for Sustainability Solutions

Research papers: regression analysis.

By Andrea Martinez, Joon-Ho Choi

Energy and Buildings

Reducing the energy consumption in existing buildings became one of the critical challenges at the beginning of the 21st century. Several types and levels of retrofits are now being implemented in the building stock. To obtain a better understanding of the actual impact of these actions, evidence-based research has been playing an increasingly important role. This paper describes the collection of data on measured pre- and post-retrofit energy consumption of a group of buildings in the U.S., in order to distinguish the impacts of different levels of retrofits. In particular, the goal has been to distinguish how retrofits including facade improvements compare to those centered exclusively on internal systems. Additionally, energy data was collected for a subset of non-retrofitted buildings and used as the control group. The regression model revealed greater energy savings from retrofits including the facade as compared to those that excluded it. However, those savings are modest considering the energy reductions that are anticipated from deep-energy retrofits. Other relevant factors, such as occupants and their behavior, are vital for determining the value of retrofits and need to be incorporated in the next phases of this study.


Research Roundup: How the Pandemic Changed Management

  • Mark C. Bolino,
  • Jacob M. Whitney,
  • Sarah E. Henry


Lessons from 69 articles published in top management and applied psychology journals.

Researchers recently reviewed 69 articles focused on the management implications of the Covid-19 pandemic that were published between March 2020 and July 2023 in top journals in management and applied psychology. The review highlights the numerous ways in which employees, teams, leaders, organizations, and societies were impacted and offers lessons for managing through future pandemics or other events of mass disruption.

The recent pandemic disrupted life as we know it, including for employees and organizations around the world. To understand such changes, we recently reviewed 69 articles focused on the management implications of the Covid-19 pandemic. These papers were published between March 2020 and July 2023 in top journals in management and applied psychology.

  • Mark C. Bolino is the David L. Boren Professor and the Michael F. Price Chair in International Business at the University of Oklahoma’s Price College of Business. His research focuses on understanding how an organization can inspire its employees to go the extra mile without compromising their personal well-being.
  • Jacob M. Whitney is a doctoral candidate in management at the University of Oklahoma’s Price College of Business and an incoming assistant professor at Kennesaw State University. His research interests include leadership, teams, and organizational citizenship behavior.
  • Sarah E. Henry is a doctoral candidate in management at the University of Oklahoma’s Price College of Business and an incoming assistant professor at the University of South Florida. Her research interests include organizational citizenship behaviors, workplace interpersonal dynamics, and international management.


ORIGINAL RESEARCH article

This article is part of the research topic.

Forest Ecosystems in Mountain Regions: Conditions, Risks and Impacts

Patterns and drivers of tree species diversity in a coniferous forest of northwest China Provisionally Accepted

  • 1 Xinjiang Academy of Forestry, China
  • 2 Sun Yat-sen University, China
  • 3 Kanas State-level Nature Reserve, China

The final, formatted version of the article will be published soon.

Understanding the pattern of species diversity and underlying ecological determinants driving a forest ecosystem is fundamental to conservation biology and forest management. Boreal forests play an irreplaceable role in providing ecosystem services and maintaining the carbon cycle globally, yet research attention remains disproportionately limited and lacking throughout time. Based on field measurement data from a large (25 ha) fully-mapped coniferous forest plot, the present study quantified patterns of species diversity and their determinants in Kanas of Xinjiang, northwest China. We applied linear regression analysis to test the effects of biotic and soil factors on alpha-diversity and local contribution of beta diversity (LCBD), and then we adopted path analysis to test the determinants that affected the species diversity index. Our results revealed that alpha-diversity indices did not vary greatly across different subplots, and richness value (between 2 and 6) was low in Kanas. Noteworthy is the discerned negative association between the average diameter at breast height (DBH) and species richness, suggesting that areas with smaller DBH values tend to harbor greater species richness. For beta-diversity, a higher value was observed in the substory layer (0.221) compared to both the canopy layer (0.161) and the understory layer (0.158). We also found that the species abundance distance matrix of biological and soil environmental factors were significantly correlated with species geographic distance matrices. More importantly, our results showed that average DBH and soil pH would affect the alpha diversity indices, and average DBH, soil pH, average height and soil total phosphorus would affect the beta diversity indices. Soil pH also indirectly affected the LCBDunder, LCBDsub and LCBDcan (p ≤ 0.001), upon mediation of alpha diversity indices.
Overall, our results provide crucial revelations about species diversity patterns in boreal forests, and insights that can support the protection of forest biodiversity in China.

Keywords: β diversity, vertical strata, species composition, community structure, boreal forest

Received: 04 Nov 2023; Accepted: 11 Mar 2024.

Copyright: © 2024 Wang, Zhao, Zhang, Deng, Maimaiti and Guo. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Dr. Jingjing Zhao, Sun Yat-sen University, Guangzhou, China


Published on 18.3.2024 in Vol 26 (2024)

Reducing Loneliness and Social Isolation of Older Adults Through Voice Assistants: Literature Review and Bibliometric Analysis

Authors of this article:


  • Rachele Alessandra Marziali 1 *, MSc
  • Claudia Franceschetti 1 *, MEng
  • Adrian Dinculescu 2 *, PhD
  • Alexandru Nistorescu 2 *, PhD
  • Dominic Mircea Kristály 3 *, PhD
  • Adrian Alexandru Moșoi 4 *, MSc
  • Ronny Broekx 5 *, BSB
  • Mihaela Marin 2 *, MSc
  • Cristian Vizitiu 2, 3 *, PhD
  • Sorin-Aurel Moraru 3 *, PhD
  • Lorena Rossi 1 *, MEng
  • Mirko Di Rosa 6 *, PhD

1 Centre for Innovative Models for Aging Care and Technology, IRCCS INRCA-National Institute of Health and Science on Aging, Ancona, Italy

2 The Space Applications and Technologies Laboratory, Institute of Space Science – Subsidiary of INFLPR (National Institute for Laser, Plasma and Radiation Physics), Magurele, Romania

3 Department of Automatics and Information Technology, Faculty of Electrical Engineering and Computer Science, Transilvania University of Brasov, Brasov, Romania

4 Department of Psychology and Education Sciences, Faculty of Psychology and Education Sciences, Transilvania University of Brasov, Brasov, Romania

5 Innovation Department, ePoint, Hamont, Belgium

6 Centre for Biostatistics and Applied Geriatric Clinical Epidemiology, IRCCS INRCA-National Institute of Health and Science on Aging, Ancona, Italy

*all authors contributed equally

Corresponding Author:

Claudia Franceschetti, MEng

Centre for Innovative Models for Aging Care and Technology, IRCCS INRCA-National Institute of Health and Science on Aging

Via Santa Margherita 5

Ancona, 60124

Phone: 39 0718004788

Email: [email protected]

Background: Loneliness and social isolation are major public health concerns for older adults, with severe mental and physical health consequences. New technologies may have a great impact in providing support to the daily lives of older adults and addressing the many challenges they face. In this scenario, technologies based on voice assistants (VAs) are of great interest and potential benefit in reducing loneliness and social isolation in this population, because they could overcome existing barriers with other digital technologies through easier and more natural human-computer interaction.

Objective: This study aims to investigate the use of VAs to reduce loneliness and social isolation of older adults by performing a systematic literature review and a bibliometric cluster mapping analysis.

Methods: We searched PubMed, Embase, and Scopus databases for articles that were published in the last 6 years, related to the following main topics: voice interface, VA, older adults, isolation, and loneliness. A total of 40 articles were found, of which 16 (40%) were included in this review. The included articles were then assessed through a qualitative scoring method and summarized. Finally, a bibliometric analysis was conducted using VOSviewer software (Leiden University’s Centre for Science and Technology Studies).

Results: Of the 16 articles included in the review, only 2 (13%) were considered of poor methodological quality, whereas 9 (56%) were of medium quality and 5 (31%) were of high quality. Finally, through bibliometric analysis, 221 keywords were extracted, of which 36 (16%) were selected. The most important keywords, by number of occurrences and by total link strength; results of the analysis with the Association Strength normalization method; and default values were then presented. The final bibliometric network consisted of 36 selected keywords, which were grouped into 3 clusters related to 3 main topics (ie, VA use for social isolation among older adults, the significance of age in the context of loneliness, and the impact of sex factors on well-being). For most of the selected articles, the effect of VA on social isolation and loneliness of older adults was a minor theme. However, more investigations were done on user experience, obtaining preliminary positive results.

Conclusions: Most articles on the use of VAs by older adults to reduce social isolation and loneliness focus on usability, acceptability, or user experience. Nevertheless, studies directly addressing the impact that using a VA has on the social isolation and loneliness of older adults find positive and promising results and provide important information for future research, interventions, and policy development in the field of geriatric care and technology.

Introduction

Nowadays, the aging of the population presents new challenges that require consideration and response [ 1 ]. Among the major public health concerns regarding older adults, 2 significant ones are loneliness and social isolation [ 2 ].

In fact, social networks seem to decrease with age and the prevalence of loneliness is estimated to increase as the population ages [ 2 ], to the extent that Valtorta and Hanratty [ 3 ] define loneliness and isolation as being “increasingly part of the experience of growing old.”

Social isolation and loneliness have severe consequences for older adults’ mental and physical health, including depressive symptoms [ 4 ], dementia [ 5 ], coronary heart disease and stroke [ 6 ], and mortality [ 7 ]. Moreover, social isolation and loneliness also have adverse outcomes concerning the use of health services, increasing emergency department and physician visits, hospital readmissions, and long-term care admissions [ 8 ].

New technologies may have a great impact on providing support in the daily lives of older people, especially in the areas of health monitoring, security, and comfort [ 9 ]. Therefore, they could be valuable tools to respond to the many challenges that older adults face.

In this scenario, technologies based on voice assistants (VAs) are of great interest and have potential benefits. VAs are systems based on artificial intelligence techniques that are programmed to be activated at a specific wake word to capture the user’s voice, process and interpret the command via a server, and respond back with a voice response or completed task [ 10 ].

VA systems have the potential to support behavioral interventions using everyday life technologies such as smartphones, tablets, and smart speakers [ 9 ]. The strength behind the use of voice-based technology, having reached a significant stage of maturity, is strictly related to the concept of ubiquitous computing ( Figure 1 ), introduced by Weiser in 1991 when thinking about a paradigm of technology able to adapt to the human environment and vanish into the background [ 11 ]. Indeed, VA technology is physically intangible; it does not force the user to be physically at a particular place to operate, and it provides interaction using natural language [ 9 ].


Concerning the application to older people, this easy and natural human-computer interaction gives VA systems the potential to overcome possible barriers existing with other digital technologies, which appears particularly promising and appropriate [ 9 ].

In light of this, the objective of this study is to investigate the use of VAs to reduce loneliness and social isolation of older adults by performing a literature review and a bibliometric analysis.

Database Creation

A literature search of scientific articles published from January 1, 2018, to April 4, 2023, was conducted. Considering that VA technology had not reached a significant stage of maturity, especially in its application for social purposes, this time range was defined.

The PubMed, Embase, and Scopus databases were searched to extend the range of eligible articles. In particular, the search was performed by setting up the “Title/Abstract” field in PubMed, the “Title or Abstract” field in Embase, and the “Title, Abstract, Keywords” field in Scopus.

The search was performed using an appropriate sequence of keywords, based on the research objectives. The first part of the search string was focused on synonyms for VA, whereas the second part specified the application for isolation and loneliness in older adults. The search string used was as follows: ((voice interface) OR (voice assistant) OR (vocal interface) OR (vocal assistant) OR (speech agent) OR (vocal agent)) AND (olde* OR elder*) AND (isolation OR loneliness).

We collected a total of 40 publications: 34 from Scopus, 4 from PubMed, and 2 from Embase.

Study Selection

The selection of the eligible studies was performed according to the following principles:

  • Including only publications in English language: no documents were excluded.
  • Removal of overlaps between the different databases: 3 overlapping documents were identified.
  • Excluding papers in which the title and abstract were not relevant to the research question: 12 papers were excluded.
  • Removal of articles not retrieved: 1 article was excluded.
  • Excluding articles not pertinent to the research question: 8 documents were excluded.
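The counts in the list above can be tallied as a quick consistency check. The sketch below uses only the numbers reported in this section; the variable names are illustrative.

```python
collected = 40  # 34 from Scopus, 4 from PubMed, 2 from Embase

# (reason, number of documents removed at that step), in the order listed above.
exclusions = [
    ("not in English", 0),
    ("overlaps between databases", 3),
    ("title and abstract not relevant", 12),
    ("article not retrieved", 1),
    ("not pertinent to the research question", 8),
]

remaining = collected
for reason, count in exclusions:
    remaining -= count
    print(f"removed {count:>2} ({reason}); {remaining} remaining")

share_included = remaining / collected
print(f"included: {remaining} of {collected} ({share_included:.0%})")
```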

The studies were assessed independently by 3 authors (CF, RAM, and AD). Any disagreement and uncertainties in the study selection were resolved by discussion. In particular, 2 authors conducted the first assessment, and another one solved the divergences.

Multimedia Appendix 1 [ 12 - 19 ] reports the list of excluded articles concerning eligibility assessment and details about the motivations for their exclusion.

The final database was composed of 40% (16/40) of the collected documents.

Figure 2 reports the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow diagram [ 20 ], summarizing the identification, screening, and inclusion procedures performed.


Quality Scoring

As systematic reviews are comprehensive and rigorous assessments of existing literature on a specific research question and they aim to synthesize the available evidence to provide a reliable and unbiased summary, the “Tool for Scoring Quality of Non-Empirical Data Sources” [ 21 ], owned by the Aerospace Medicine Systematic Review Group, was used to assess the quality of individual studies included in this review. In total, 2 authors (RAM and CF) performed this evaluation independently, solving any disagreements or doubts through discussion. It is important to note that the purpose of quality scoring in systematic reviews is not to exclude studies but rather to provide an evaluation of their methodological strengths and weaknesses. The scoring process helps reviewers assess the overall risk of bias in the body of evidence and inform their conclusions and recommendations.

Data Extraction

To perform the synthesis of findings, a data extraction from the 16 selected articles was conducted. The extraction consisted of a further evaluation of the full text of the articles. In total, 2 authors (MDR and CF) independently extracted information from the selected studies, including reference, population, technological solution, environment, study design, outcomes, and main results. The assessors made the information homogeneous and analyzed the articles together in the case of doubts or missing data. The data extracted were reported in the corresponding section of the synthesis of findings table ( Table 1 ).

a VA: voice assistant.

b PACS: postacute COVID-19 syndrome.

c HRQoL: health-related quality of life.

d DASS-21: Depression Anxiety Scale-21.

e CD-RISC-25: Connor-Davidson Resilience Scale-25.

f EQ-5D-5L: EuroQol-5 Dimensions-5 Levels.

g ISI: Insomnia Severity Index.

h SF-36: 36-Item Short Form Health Survey.

i NGD: normalized Google distance.

j N/A: not applicable.

k SSPA: Social Skills Performance Assessment.

l ADL: activities of daily living.

Bibliometric Analysis

A bibliometric analysis was also conducted to construct a map of the selected articles using VOSviewer software (version 1.6.19; Leiden University’s Centre for Science and Technology Studies). This tool represents one of the most popular programs for bibliometric cluster mapping [ 38 ].

To illustrate the keyword co-occurrence network, keywords were extracted from the list of the 16 included articles.

During the map creation, the authors chose the co-occurrence type of analysis on keywords and selected full counting as the counting method. The threshold for the minimum number of occurrences of a keyword was set at 2. All the keywords were illustrated regardless of the greatest total link strength. At the selected keywords’ verification step, the authors considered it convenient to merge similar words by creating a thesaurus file. Thus, the thesaurus file included a column of similar keywords and another column with the keyword to replace them with. Hence, in the final step, the selected keywords were analyzed using the Association Strength normalization method and default values. In addition, for clustering, the default values of resolution (ie, 1.00), minimum cluster size (ie, 1), and merge small cluster option were used.
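The steps described above (thesaurus merging, an occurrence threshold of 2, co-occurrence counting, and normalization) can be sketched in a few lines. The keyword lists and thesaurus entries below are hypothetical, not the review's data. The normalization is a simplified form of association strength, taken here as the co-occurrence count divided by the product of the two keywords' total link strengths. The software's exact computation and scaling may differ.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword lists, one per article; not the review's actual data.
articles = [
    ["voice assistant", "older adults", "loneliness"],
    ["voice assistants", "elderly", "social isolation"],
    ["voice assistant", "older adults", "social isolation"],
    ["smart speaker", "older adults", "loneliness"],
]

# Thesaurus: similar keyword -> keyword to replace it with.
thesaurus = {"voice assistants": "voice assistant", "elderly": "older adults"}

MIN_OCCURRENCES = 2

# Apply the thesaurus and keep each keyword once per article, in sorted order.
merged = [sorted({thesaurus.get(k, k) for k in keywords}) for keywords in articles]

occurrences = Counter(k for keywords in merged for k in keywords)
selected = {k for k, n in occurrences.items() if n >= MIN_OCCURRENCES}

# Every pair of selected keywords in the same article counts as one co-occurrence.
co_occurrences = Counter()
for keywords in merged:
    kept = [k for k in keywords if k in selected]
    co_occurrences.update(combinations(kept, 2))

# Total link strength of a keyword: sum of its co-occurrence counts.
link_strength = Counter()
for (first, second), count in co_occurrences.items():
    link_strength[first] += count
    link_strength[second] += count

association = {
    pair: count / (link_strength[pair[0]] * link_strength[pair[1]])
    for pair, count in co_occurrences.items()
}

for pair, value in sorted(association.items()):
    print(pair, co_occurrences[pair], round(value, 3))
```

The normalization matters because frequent keywords co-occur with almost everything; dividing by the link strengths highlights pairs that appear together more often than their overall frequency would suggest.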

In the following sections, the synthesis of the findings and results of the bibliometric analysis and qualitative scoring of the 16 selected articles are presented.

Synthesis of Findings

The selected articles were assessed with regard to population, technological solution, environment, study design, outcomes, and main results. Table 1 presents a synthesis of the findings.

In summary, the population most frequently involved in the selected studies is older adults. In some cases, informal caregivers [ 22 ], geriatric experts [ 29 ], the medical community, the general public [ 35 ], or formal caregivers working in a day-care facility with experience in caring for people with dementia [ 36 ] are also involved. All the articles detail the total number of people engaged, except for 31% (5/16) of the articles [ 26 , 31 , 32 , 34 , 35 ]. The remaining articles involve a minimum of 7 and a maximum of 109 older adults. Among the selected articles, the age of the population varies widely, including people aged >50 [ 22 , 24 , 26 , 30 ], >60 [ 23 , 33 ], >65 [ 27 , 29 ], and >75 years [ 28 ]. Naturally, professionals are younger, ranging from 21 [ 29 ] to 33 [ 36 ] years. However, for some articles [ 25 , 31 , 34 - 37 ], there is no information on the age of the population involved. The sex of the participants is specified in only 56% (9/16) of the articles [ 24 , 26 - 30 , 32 , 33 , 36 ], in which most participants are female.

In addition, 25% (4/16) of the articles consider participants’ familiarity with technology, involving only people with no experience with VA technology [ 26 ] and digital devices [ 31 ], involving only people with low technology use [ 32 ], or specifying people’s technological abilities [ 27 ]. In addition, some studies consider clinical conditions: 6% (1/16) of the articles [ 22 ] included people with diabetes or long-term health conditions, whereas others include people with postacute COVID-19 syndrome [ 24 ]; with normative cognitive functioning [ 28 ]; with no severe visual or hearing impairment and no moderate to severe cognitive impairment [ 30 ]; with mild difficulties in social skills, depression and anxiety symptoms, and nonverbal impairment [ 33 ]; and without dementia [ 36 ].

Technological Solution

Regarding VA technology solutions, 44% (7/16) of the articles [ 22 , 24 , 25 , 28 , 29 , 32 , 34 ] report the use of commercially available VAs, for example, Google Assistant, Amazon Alexa, Apple Siri, and Microsoft Cortana. Some studies specify the design of new VA systems developed using the Amazon Alexa platform and Alexa Voice services [ 36 ] or implementing the Google Voice Android Software Development Kit on a tablet [ 27 ]. In other studies, the newly designed VA is embedded in a mobile app [ 23 ], a PC application [ 26 ], or even embodied as a household potted flower [ 35 ]. A total of 13% (2/16) of the articles [ 31 , 32 ] describe the design and the testing of a new VA-based digital intelligent platform. Finally, 1 (6%) article [ 33 ] presents a web-based automated version of a VA designed to improve communication skills, whereas another one [ 37 ] involves a personalized and expressive VA.

Environment

The environment in most of the articles [ 22 , 24 , 25 , 29 - 31 , 33 , 37 ] is the home, which is alternated, in the study by Pradhan et al [ 32 ], with the older adult living community and, in the studies by Bravo et al [ 23 ] and Simpson et al [ 35 ], with the retirement home. Instead, the environments in other articles are the laboratory [ 26 , 27 ], the independent living facility [ 28 ], the older adult care center [ 34 ], and the day-care facility [ 36 ]. Thus, the selected articles concerning the use of a VA for social isolation and loneliness address both older adults living independently at home and those living in a facility.

Study Design

Regarding the study design, among the 16 selected studies, 4 (25%) are quantitative, including 1 (6%) evaluation test [ 23 ], 1 (6%) pre-post study [ 24 ], 1 (6%) development and user test [ 26 ], and 1 (6%) VAs test [ 34 ]. Qualitative studies include 1 (6%) service evaluation [ 22 ], 1 (6%) evaluation test [ 29 ], and 1 (6%) pre-post study [ 32 ]. Then, there are 5 (31%) mixed studies, including both qualitative and quantitative methods, of which 1 (6%) is an evaluation test [ 27 ], 1 (6%) is a single-group quasi-experimental study [ 28 ], 1 (6%) is a pre-post study [ 30 ], 1 (6%) is a randomized controlled trial [ 33 ], and 1 (6%) is a usability study [ 36 ]. Finally, the remaining studies include 1 (6%) mini review [ 25 ], 2 (13%) conference speeches [ 35 , 37 ], and 1 (6%) study protocol [ 31 ]. More detailed information on the methodology results is presented in the Quality Scoring section.

Among the outcomes, only 31% (5/16) of the articles [ 22 , 25 , 28 , 31 , 35 ] consider loneliness or social isolation. Of these 16 studies, only 1 (6%) [ 28 ] uses a standardized instrument—the 8-item University of California, Los Angeles (UCLA) Loneliness Scale—to assess the perception of loneliness. Instead, most articles (9/16, 56%) [ 22 - 24 , 27 , 29 , 30 , 32 , 35 , 36 ] focus on topics related to the acceptability, user experience, satisfaction, and usability of the technological solution, whereas a smaller number (2/16, 13%) [ 26 , 34 ] focuses on its technical performance. To evaluate these aspects, 5-point Likert scales are used only by 19% (3/16) of the articles [ 23 , 27 , 36 ].

Further outcomes addressed are verbal and nonverbal behavior in social communication [ 33 ], definition of project objectives, scientific and technological goals and actions [ 37 ], program impact on health and care trajectories [ 31 ], codes and overarching themes [ 29 ], interaction anthropomorphic aspects [ 28 ], and psychological and physical aspects such as frailty and quality of life [ 24 , 31 ].

Main Results

Turning to the main results of using a VA, the impact on loneliness and social isolation is positive, leading to an improvement in users’ perceptions. Specifically, the participants in 13% (2/16) of the studies [ 22 , 24 ] report that the VA helped them cope with loneliness, whereas another study (1/16, 6%) [ 28 ] finds a significant reduction in perceived loneliness after 4 weeks of use and that the relational greetings from the user to the VA predict this reduction. Moreover, the loneliness experienced by the person forecasts the number of greetings he or she makes to the VA. Finally, a mini review (1/16, 6%) [ 25 ] outlines that the use of VA in older adults improves social connectedness and reduces loneliness.

Other benefits obtained include a positive impact on health and social well-being [ 22 ]; improvement in postacute COVID-19 syndrome symptoms, frailty, and health-related quality of life at 6 months follow-up [ 24 ]; sedentary life changes [ 24 ]; and significant improvement in eye contact and facial expressivity [ 33 ].

Regarding the VA, it is considered useful [ 24 ], satisfying [ 23 , 27 ], and interesting [ 36 ], and it obtains good results in the acknowledgment (the ability to recognize user contextual information) and engagement (the ability to maintain a coherent conversation) performance [ 34 ]. In addition, among participants in the study by Pech et al [ 30 ], 63% have a positive opinion toward the system used, and in the study by Striegl et al [ 36 ], both older adults and formal caregivers report that the VA used is highly feasible for supporting people with dementia in activities of daily living.

The main results also include technical information about the VA. For example, in 1 (6%) study [ 26 ], the VA obtains a correct-answer rate >75% across all commands; another (1/16, 6%) study [ 29 ] identifies 8 major themes as possible VA beneficial functions; and another (1/16, 6%) study [ 32 ] presents crucial information for VA development, whereas in another (1/16, 6%) study [ 35 ], the device prototype is developed. Finally, critical issues emerge: VA interruptions when the person pauses for too long [ 27 ], older adults’ resistance to change, unplanned workload for a formal caregiver, specific technological obstacles [ 30 ], and poor results in the ability to suggest and perform some related activities at the end of the interaction [ 34 ]. The proposed improvements, in turn, include facilitated access to professionals, communication at community events, late-night pharmacy service, customized activity proposals, and videoconferencing [ 30 ].

For 13% (2/16) of the articles [ 31 , 37 ], it is not applicable to define the main results.

Along with the bibliometric analysis, the authors built a thesaurus file containing the words that can be replaced, considering their very close meaning. The thesaurus file is presented in Table 2 .

The bibliometric analysis extracted 221 keywords from the included articles, of which 36 (16%) met the threshold of 2 occurrences. The keyword list is presented in Table 3 , in descending order of occurrence, showing the number of occurrences and the total link strength.

As can be observed in Table 3 , the most used keywords by occurrence were as follows: “social isolation” (n=8), “human” (n=6), “older adults” (n=6), “aged” (n=5), “covid-19” (n=5), “loneliness” (n=5), “human computer interaction” (n=4), and “voice assistant” (n=4).

The most used keywords by total link strength, as shown in Table 3 , were as follows: “human” (n=53), “aged” (n=44), “loneliness” (n=44), “social isolation” (n=42), “covid-19” (n=42), “pandemics” (n=29), “very elderly” (n=29), “older adults” (n=28), “prospective study” (n=25), “quality of life” (n=25).

The bibliometric network is illustrated in Figure 3 and consists of 3 clusters of 36 keywords. The clusters are presented in more detail in Table 4 , where each keyword from a cluster is shown in descending order by occurrence.

Quality Scoring

According to the scoring tool, 13% (2/16) of the documents were assessed as being of poor quality in terms of the methodology. In the study by Simpson et al [ 35 ], it is unclear what the methodological information is based on, how it is presented, and if it is in line with other sources. The document is based on a conference speech on methods for the design-thinking approach. Instead, in the study by Torres et al [ 37 ], most of the information is not clearly sourced; it is unclear what the methodological information is based on and if it is in line with other sources. In addition, this paper is based on a speech at a conference on the objectives, goals, and actions of a research and innovation project.

A total of 56% (9/16) of the documents were considered medium quality. Specifically, 44% (7/16) of the articles [ 22 , 23 , 25 - 27 , 29 , 31 ] contain clear sources, methodological quality, and information value, presenting findings in line with the literature. Nevertheless, the study designs were not of very high quality, consisting mostly of multiple case reports and case studies, whereas the study by Corbett et al [ 25 ] is a literature review.

A total of 13% (2/16) of the articles [ 24 , 34 ] have instead a more rigorous approach in the study design, representing a qualitative study and a single-group quasi-experimental study, respectively. However, the former is an abstract document lacking bibliographic references, while in the latter, it is unclear what the methodological information is based on. In both cases, the information presented is not clearly linked with the literature findings.

Finally, 31% (5/16) of the documents were deemed of high quality, considering that the information presented and the methodological information are clearly referenced. Among these, 1 (6%) article [ 33 ] is a randomized controlled study, while the remaining 25% (4/16) [ 28 , 30 , 32 , 36 ] are descriptive or observational studies.

Multimedia Appendix 2 [ 22 - 37 ] provides details of the quality scoring performed on the selected articles.

Principal Findings

The purpose of this study is to synthesize knowledge about the use of VAs to reduce loneliness and social isolation among older adults.

Initially, after conducting the literature research, the quality of the selected articles is investigated, focusing on the strengths and weaknesses of the methodologies used. Of the 16 articles included in the review, only 2 (13%) articles [ 35 , 37 ] are considered poor quality, 9 (56%) articles [ 22 - 27 , 29 , 31 , 34 ] are medium quality, and 5 (31%) articles are high quality [ 28 , 30 , 32 , 33 , 36 ]. In summary, although recent publications in the literature on the use of VA by older adults for the reduction of loneliness and social isolation are not numerous, most of them are of medium to high methodological quality in terms of study design, authenticity, clear methodological quality, clear informational value, and representativeness of available primary sources.
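As a quick consistency check on the figures above, the following sketch recomputes the reported shares and confirms that the three quality tiers partition the 16 included articles (references 22 to 37). The helper name is hypothetical, and rounding halves up is an assumed convention: Python's built-in `round(12.5)` returns 12, whereas the text reports 2/16 as 13%.

```python
def pct_half_up(n, d):
    """Percentage n/d rounded to the nearest integer, with halves rounded up."""
    return (200 * n + d) // (2 * d)


# Reference numbers per quality tier, as listed in the text.
poor = {35, 37}
medium = {22, 23, 24, 25, 26, 27, 29, 31, 34}
high = {28, 30, 32, 33, 36}

# References 22 to 37 are the 16 included articles.
included = set(range(22, 38))

# The tiers should cover every included article exactly once.
tiers_partition = (
    (poor | medium | high) == included
    and len(poor) + len(medium) + len(high) == len(included)
)

shares = {
    name: pct_half_up(len(tier), len(included))
    for name, tier in [("poor", poor), ("medium", medium), ("high", high)]
}

print(tiers_partition, shares)
# prints: True {'poor': 13, 'medium': 56, 'high': 31}
```

The same helper reproduces the keyword share reported for the bibliometric analysis: `pct_half_up(36, 221)` gives 16.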

After assessing the methodological quality of the selected articles, the findings are summarized, focusing on population, technological solution, environment, study design, outcomes, and main results for a more detailed overview. Among the 16 articles presented, most focus on the evaluation of acceptability, user experience, satisfaction, usability, or performance of the VA, while only 5 (31%) papers deepen the theme of social isolation and loneliness. Of these studies, 1 (6%) [ 31 ] has no available results, as it is a study protocol, and another (6%) [ 35 ] reached the development stage of a VA prototype. Therefore, 3 (19%) articles remain that investigate the possible effect of the use of a VA on social isolation and loneliness by older adults.

The first paper [ 22 ], a service evaluation study, found that using a VA for 2 months at home helped people with diabetes or other long-term health conditions (such as multiple sclerosis, dementia, and depression) combat loneliness. This is particularly relevant because it seems that social isolation increases the risk of mortality through physiological upregulation of chronic inflammation. This impact is significant even for middle-aged people but is greater for older adults, particularly men [ 39 ]. Thus, the results obtained from the use of VAs are particularly relevant considering the population the study targeted, but an assessment of loneliness would be needed to investigate the actual impact of VA use on this dimension.

The second paper, a single-group quasi-experimental study [ 28 ], reported a significant reduction in perceived loneliness, assessed through the 8-item UCLA Loneliness Scale, after older adults living in an independent living facility used a VA for 4 weeks. Thus, loneliness among older adults living alone using a VA has decreased. Moreover, the loneliness perceived at the beginning of the intervention by participants predicts the number of greetings to the VA (such as “Good morning” or “Alexa, I’m home”), and, in addition, these relational greetings forecast loneliness reduction during the month of use. Therefore, according to the authors, VA anthropomorphization might have a role in combating loneliness in older adults.

Finally, the results of a mini review [ 25 ] suggest that the VA reduces loneliness among older adults and increases their connectedness. Older adults perceive the VA as a “companion,” especially those who live alone or have solitary lives for most of the day.

These studies show encouraging results about the potential of a VA in reducing social isolation and loneliness in older adults, in line with the suggestion from a systematic review [ 40 ] that new technologies can be promising opportunities to reduce social isolation and loneliness in this population. For example, 1 (6%) study found that the use of technology by older adults predicts less loneliness, which has in turn been associated with, on the one hand, better self-reported health and subjective well-being and, on the other hand, fewer chronic diseases and less depression [ 41 ]. Therefore, these are preliminary results suggesting that the association between technology use and physical and mental health may be mediated by loneliness.

VAs have the potential to be used by older adults to reduce their social isolation and loneliness, and the results presented go in that direction; however, they are neither exhaustive nor conclusive.

Finally, the bibliometric cluster mapping analysis provides valuable insights into the relationships between keywords in the included articles. The generated keyword co-occurrence network revealed 3 distinct clusters, each representing a specific theme or concept in the literature.

Cluster 1, represented by keywords such as “social isolation,” “elderly people,” “voice assistant,” and “human computer interaction,” highlights the relevance of VA technology in combating social isolation among older adults. A VA could be a promising tool for facilitating social interactions, promoting well-being, and addressing the challenges faced by older people regarding social isolation. The relevance of VAs in addressing social isolation among older adults aligns with the findings of Portet et al [ 9 ] on the design and evaluation of a smart home VA for older adults. This cluster also corresponds to the authors’ focus on the use of quality scoring to evaluate the methodological strengths and weaknesses of the studies, as the inclusion of studies exploring the effectiveness of VAs in combating social isolation would be of particular interest. It further emphasizes the importance of designing user-friendly interfaces and incorporating natural language generation and recognition for effective human-computer interaction. Finally, this cluster aligns with the literature on ambient assisted living, assistive technology, and artificial intelligence, and it is supported by the work presented in 1 (6%) article [ 10 ] on VAs and their applications, as well as in another (1/16, 6%) article [ 8 ] that discusses technological solutions for addressing social isolation and loneliness in primary care.

Cluster 2 emphasizes the significance of age in the context of loneliness. Keywords such as “loneliness,” “human,” and “quality of life” indicate the importance of understanding the psychological and emotional aspects of loneliness, considering the diverse experiences of individuals across different demographics. This is supported by the works presented by Valtorta and Hanratty [ 3 ] and Holt-Lunstad et al [ 7 ], who discuss the association between loneliness, social isolation, and health outcomes in older adults, emphasizing the importance of considering demographic factors in understanding and addressing these issues. Cluster 2 is also relevant in the context of the COVID-19 pandemic, as it includes keywords such as “COVID-19,” “pandemics,” and “digital divide,” which illustrate the impact of the pandemic on social isolation and the need for technological solutions, such as VAs, to bridge the digital divide and ensure connectivity and support for older adults during times of crisis. A study [ 6 ] on the association between social isolation, loneliness, and health outcomes in the context of coronary heart disease and stroke further emphasizes the significance of addressing social isolation during pandemics.

Cluster 3 encapsulates a range of keywords related to sex, clinical research, and well-being. The presence of keywords, such as “adult,” “female,” and “male,” along with “clinical article” and “well-being” underscores the importance of understanding how sex-specific factors can significantly impact overall well-being. This cluster likely refers to studies and investigations that explore the intersection of sex-related variables with clinical research outcomes, shedding light on how these factors can influence health and well-being differently among various demographic groups. Moreover, Cluster 3 may offer valuable insights into the evolving landscape of clinical research and its focus on addressing sex-specific health concerns, thus promoting a more comprehensive approach to well-being across diverse populations.

These clusters shed light on important topics related to social isolation, loneliness, and the use of VAs in addressing these issues among older adults. The findings underlined here can inform future research, interventions, and policy development in the field of geriatric care and technology.

Strengths and Limitations

The study provides a comprehensive exploration of voice assistance systems used by older individuals, highlighting popular examples such as Amazon Alexa, Google Assistant, Apple Siri, Microsoft Cortana, Samsung Bixby, and Huawei HiVoice. The study examines the strengths and limitations of these systems.

One of the notable strengths of this study is its investigation into the use of VAs to alleviate loneliness and social isolation among older adults. This topic is fairly recent, but its relevance is growing in both the scientific and technological communities.

Moreover, this investigation is supported by both a literature review and a bibliometric analysis to gather as much knowledge as possible on the role of technology in combating loneliness and social isolation in older adults.

In addition, the selection of studies included in the article underwent an independent evaluation process by the authors, with any disagreements or uncertainties being resolved through discussion.

Another strength is the consideration of scientific articles published from 2018 onward. This choice was driven by the fact that VAs are relatively new and continually advancing technological solutions. Furthermore, the application of such technology among older individuals is not yet widespread, resulting in a limited number of studies available on the topic. Despite this limitation, the potential benefits of VA solutions for older adults are highly intriguing, and this study aims to shed light on possible applications and the associated impact on older users.

This study also has limitations that need to be pointed out. First, the number of publications in the systematic review is small because the topic has only gained relevance recently. However, the authors decided to proceed with the bibliometric analysis to contribute in terms of interpretation, even though the number of papers on the use of VAs to reduce loneliness and social isolation among older adults is limited. Further limitations relate to the fact that 1 (6%) article [ 42 ] could not be retrieved and that the synthesis of findings is neither comprehensive, as only the abstract was available for 1 article [ 24 ], nor complete, as it was not applicable to define the main results of 13% (2/16) of the articles [ 31 , 37 ]. Moreover, the selected studies were highly heterogeneous, with only 6% (1/16) of studies [ 33 ] having a control group and 6% (1/16) of studies [ 28 ] having follow-up. Concerning the information about the population, it is not specified whether the people involved in the studies live alone. This could limit considerations regarding social isolation and loneliness. Finally, most articles collected qualitative data without providing quantitative instruments to assess the actual impact of VA use.

Future Directions

On the basis of this literature review and bibliometric analysis, several priorities for future research can be identified. First, working with keywords from clusters 1 and 2, it is easy to see that “loneliness” and “social isolation” have a huge impact on older people [ 43 ]. On the basis of our literature review, authors are more interested in aspects such as system use and acceptability [ 30 ], acceptance and user experience [ 22 ], and system usability [ 36 ]. Although the main points are “loneliness” and “social isolation,” we found only 1 study [ 28 ] reporting a reduction in perceived loneliness in older adults. Thus, the use of VAs for social isolation and loneliness among older adults seems to be underexplored in comparison with user experience aspects, which are more deeply investigated in the scientific literature.

Similarly, we encourage researchers to include questionnaires to measure loneliness and social isolation in future studies, for example, the Revised UCLA Loneliness Scale [ 44 ], the De Jong Gierveld Loneliness Scale [ 45 , 46 ], the Steptoe Social Isolation Index for social isolation [ 44 ], and the Cornwell Perceived Isolation Scale for perceived isolation [ 47 ], for use with VA systems based on artificial intelligence techniques or other related systems to improve the life expectancy of older people. For other specific information about these questionnaires, refer to Social Isolation and Loneliness in Older Adults: Opportunities for the Health Care System [ 48 ]. Second, this work shows that the terms social isolation and loneliness are still often treated as interchangeable, although they are actually related but distinct concepts [ 3 ].

In fact, nowadays, the tendency is to refer to loneliness as a subjective negative feeling of perceiving a lack of social network or desired companion, whereas social isolation is the objective lack or scarcity of social contacts and interactions with family, friends, or community [ 3 ]. Therefore, it would be particularly relevant for future studies to clearly define which dimensions they measure, as mentioned in the preceding section. Third, future research should examine the large heterogeneity within the older adult population. Some of the selected articles described different characteristics of the population, but none delved into the possible different impacts of VA use in relation to these variables. Future studies should explore the effects of using a VA on the social isolation and loneliness of older adults, investigating possible differences by sex, socioeconomic background, familiarity with technology, and living conditions.

Conclusions

This paper conducted a literature review and a bibliometric analysis of the use of VAs among older adults to reduce social isolation and loneliness. The findings indicate that most studies focus on the usability, acceptability, or user experience of the VA. However, studies directly addressing the impact that using a VA has on the social isolation and loneliness of older adults have positive results and provide important information for future research, interventions, and policy development in the field of geriatric care and technology.

Acknowledgments

This study has been developed within the framework of the EMILIO (Increase Self Management and Counteract Social Isolation Using a Voice Assistant Enabled Virtual Concierge) project (AAL-2021-8-120-CP), cofinanced under the Ambient Assisted Living Joint Programme of the European Commission [ 49 ] and the National Funding Agencies of Belgium, the Netherlands, Italy, and Switzerland.

The authors are grateful to all consortium partners: Italian National Institute of Health and Science on Aging (IRCCS INRCA), Solving Team SRL, ICT Factory GmbH, Erdmann Design AG, Magicview, ePoint, Vulpia VZW, Institute of Space Science, INFLPR Subsidiary, Transilvania University of Brasov.

The project website is available on the internet [ 50 ].

Authors' Contributions

RAM, CF, AD, AN, MM, and CV contributed to the methodology, investigation, writing of the original draft, and reviewing and editing. DMK, AAM, and S-AM were responsible for the investigation, writing of the original draft, reviewing, and editing. RB conducted reviewing and editing. LR was involved in conceptualization and funding acquisition, whereas MDR was involved in methodology, project administration, conceptualization, supervision, funding acquisition, reviewing, and editing.

Conflicts of Interest

None declared.

Excluded articles and motivations for the exclusion.

Quality scoring of selected articles.

  1. Active ageing: a policy framework. World Health Organization. 2002. URL: https://apps.who.int/iris/handle/10665/67215 [accessed 2023-06-01]
  2. Holt-Lunstad J. The potential public health relevance of social isolation and loneliness: prevalence, epidemiology, and risk factors. Public Policy Aging Rep. 2017;27(4):127-130.
  3. Valtorta N, Hanratty B. Loneliness, isolation and the health of older adults: do we need a new research agenda? J R Soc Med. Dec 2012;105(12):518-522.
  4. Van As BA, Imbimbo E, Franceschi A, Menesini E, Nocentini A. The longitudinal association between loneliness and depressive symptoms in the elderly: a systematic review. Int Psychogeriatr. Apr 14, 2021;34(7):657-669.
  5. Kuiper JS, Zuidersma M, Oude Voshaar RC, Zuidema SU, van den Heuvel ER, Stolk RP, et al. Social relationships and risk of dementia: a systematic review and meta-analysis of longitudinal cohort studies. Ageing Res Rev. Jul 2015;22:39-57.
  6. Valtorta NK, Kanaan M, Gilbody S, Hanratty B. Loneliness, social isolation and risk of cardiovascular disease in the English longitudinal study of ageing. Eur J Prev Cardiol. Sep 2018;25(13):1387-1396.
  7. Holt-Lunstad J, Smith TB, Baker M, Harris T, Stephenson D. Loneliness and social isolation as risk factors for mortality: a meta-analytic review. Perspect Psychol Sci. Mar 2015;10(2):227-237.
  8. Freedman A, Nicolle J. Social isolation and loneliness: the new geriatric giants: approach for primary care. Can Fam Physician. Mar 2020;66(3):176-182.
  9. Portet F, Vacher M, Golanski C, Roux C, Meillon B. Design and evaluation of a smart home voice interface for the elderly: acceptability and objection aspects. Pers Ubiquit Comput. Oct 2, 2011;17(1):127-144.
  10. Hoy MB. Alexa, Siri, Cortana, and more: an introduction to voice assistants. Med Ref Serv Q. Jan 12, 2018;37(1):81-88.
  11. Weiser M. The computer for the 21st century. Sci Am. Sep 1991;265(3):94-104.
  12. Chen J, Yang YT, Zhu X, Zhu Z. Share and care: a senior-friendly family interaction application. In: Proceedings of the IEEE MIT Undergraduate Research Technology Conference (URTC). 2020. Presented at: URTC 2020; October 9-11, 2020; Cambridge, MA. URL: https://ieeexplore.ieee.org/document/9668885
  13. Eimontaite I, Voinescu A, Alford C, Caleb-Solly P, Morgan P. The impact of different human-machine interface feedback modalities on older participants’ user experience of CAVs in a simulator environment. In: Proceedings of the International Conference on Human Factors in Transportation. 2019. Presented at: AHFE 2019; July 24-28, 2019; Washington, DC. URL: https://link.springer.com/chapter/10.1007/978-3-030-20503-4_11
  14. Eirale A, Martini M, Tagliavini L, Gandini D, Chiaberge M, Quaglia G. Marvin: an innovative omni-directional robotic assistant for domestic environments. Sensors (Basel). Jul 14, 2022;22(14):1-22.
  15. Martin-Hammond A, Vemireddy S, Rao K. Exploring older adults' beliefs about the use of intelligent assistants for consumer health information management: a participatory design study. JMIR Aging. Dec 11, 2019;2(2):e15381.
  16. Méndez JI, Mata O, Ponce P, Meier A, Peffer T, Molina A. Multi-sensor system, gamification, and artificial intelligence for benefit elderly people. In: Ponce H, Martínez-Villaseñor L, Brieva J, Moya-Albor E, editors. Challenges and Trends in Multimodal Fall Detection for Healthcare. Cham, Switzerland. Springer; 2020.
  17. Restyandito, Febryandi, Nugraha KA, Sebastian D. Mobile social media interface design for elderly in Indonesia. In: Proceedings of the HCI International 2020 – Late Breaking Posters. 2020. Presented at: HCII 2020; July 19-24, 2020; Copenhagen, Denmark. URL: https://link.springer.com/chapter/10.1007/978-3-030-60703-6_10
  18. Syeda MZ, Park M, Kim Y, Kwon YM. Tangible social content service system: making digital technology easier to use by elderly and its usability evaluation. In: Proceedings of the 12th International Conference on Complex, Intelligent, and Software Intensive Systems. 2018. Presented at: CISIS 2018; July 4-6, 2018; Matsue, Japan.
  19. Zhou D, Barakova EI, An P, Rauterberg M. Assistant robot enhances the perceived communication quality of people with dementia: a proof of concept. IEEE Trans Human Mach Syst. Jun 2022;52(3):332-342.
  20. Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. Mar 29, 2021;372:n71.
  21. Laws JM, Winnard A. Tool for scoring the quality of non-empirical data sources- E.G: technical reports. Aerospace Medicine and Rehabilitation Laboratory, Northumbria University. 2019. URL: https://www.researchgate.net/publication/331385312_Tool_for_Scoring_the_Quality_of_Non-Empirical_Data_Sources-_EG_Technical_Reports [accessed 2024-02-23]
  22. Balasubramanian GV, Beaney P, Chambers R. Digital personal assistants are smart ways for assistive technology to aid the health and wellbeing of patients and carers. BMC Geriatr. Nov 15, 2021;21(1):643.
  23. Bravo SL, Herrera CJ, Valdez EC, Poliquit KJ, Ureta J, Cu J, et al. CATE: an embodied conversational agent for the elderly. In: Proceedings of the 12th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART. 2020. Presented at: ICAART 2020; February 22-24, 2020; Valletta, Malta. URL: https://www.scitepress.org/Link.aspx?doi=10.5220/0009174009410948
  24. Caselgrandi A, Milić J, Motta F, Belli M, Venuta M, Aprile E, et al. Voice assistance to develop a participatory research and action to improve health trajectories of people with PACS. Antivir Ther. Dec 1, 2021;26(1_suppl):13-14.
  • Corbett CF, Wright PJ, Jones K, Parmer M. Voice-activated virtual home assistant use and social isolation and loneliness among older adults: mini review. Front Public Health. 2021;9:742012. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Farías-Barraza B, Reyes-Rogget M, López FA, López-Martínez IN, Contreras-Bolton C, Linfati R. Low-cost voice assistant design and testing for older adults. In: Proceedings of the Computer Information Systems and Industrial Management. 2022. Presented at: CISIM 2022; July 15-17, 2022; Barranquilla, Colombia. [ CrossRef ]
  • Garcia-Mendez S, de Arriba-Perez F, Gonzalez-Castano FJ, Regueiro-Janeiro JA, Gil-Castineira F. Entertainment chatbot for the digital inclusion of elderly people without abstraction capabilities. IEEE Access. May 17, 2021;9:75878-75891. [ CrossRef ]
  • Jones VK, Hanus M, Yan C, Shade MY, Blaskewicz Boron J, Maschieri Bicudo R. Reducing loneliness among aging adults: the roles of personal voice assistants and anthropomorphic interactions. Front Public Health. 2021;9:750736. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • O'Brien K, Light SW, Bradley S, Lindquist L. Optimizing voice-controlled intelligent personal assistants for use by home-bound older adults. J Am Geriatr Soc. May 2022;70(5):1504-1509. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Pech M, Gbessemehlan A, Dupuy L, Sauzéon H, Lafitte S, Bachelet P, et al. Lessons learned from the SoBeezy program for older adults during the COVID-19 pandemic: experimentation and evaluation. JMIR Form Res. Nov 24, 2022;6(11):e39185. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Pérès K, Zamudio-Rodriguez A, Dartigues JF, Amieva H, Lafitte S. Prospective pragmatic quasi-experimental study to assess the impact and effectiveness of an innovative large-scale public health intervention to foster healthy ageing in place: the SoBeezy program protocol. BMJ Open. Apr 29, 2021;11(4):e043082. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Pradhan A, Findlater L, Lazar A. "Phantom friend" or "Just a box with information": personification and ontological categorization of smart speaker-based voice assistants by older adults. Proc ACM Hum Comput Interact. Nov 07, 2019;3(CSCW):1-21. [ CrossRef ]
  • Razavi SZ, Schubert LK, van Orden K, Ali MR, Kane B, Hoque E. Discourse behavior of older adults interacting with a dialogue agent competent in multiple topics. ACM Trans Interact Intell Syst. Jul 23, 2022;12(2):1-21. [ CrossRef ]
  • Reis A, Paulino D, Paredes H, Barroso I, Monteiro MJ, Rodrigues V, et al. Using intelligent personal assistants to assist the elderlies: an evaluation of Amazon Alexa, Google Assistant, Microsoft Cortana, and Apple Siri. In: Proceedings of the 2nd International Conference on Technology and Innovation in Sports, Health and Wellbeing (TISHW). 2018. Presented at: TISHW; June 20-22, 2018; Thessaloniki, Greece. URL: https://ieeexplore.ieee.org/document/8559503/authors#authors [ CrossRef ]
  • Simpson J, Gaiser F, MacÍk M, Breßgott T. Daisy: a friendly conversational agent for older adults. In: Proceedings of the 2nd Conference on Conversational User Interfaces. 2020. Presented at: CUI '20; July 22-24, 2020; Bilbao, Spain. [ CrossRef ]
  • Striegl J, Gollasch D, Loitsch C, Weber G. Designing VUIs for social assistance robots for people with dementia. In: Proceedings of Mensch und Computer 2021. 2021. Presented at: MuC '21; September 5-8, 2021; Ingolstadt, Germany. [ CrossRef ]
  • Torres MI, Chollet G, Montenegro C, Tenorio-Laranga J, Gordeeva O, Esposito A, et al. EMPATHIC, Expressive, Advanced Virtual Coach to Improve Independent Healthy-Life-Years of the Elderdy. Presented at: 4th International Conference on Advances in Speech and Language Technologies for Iberian Languages, IberSPEECH 2018; 21-23 November 2018, 2018;172-173; Barcelona, Spain. [ CrossRef ]
  • van Eck NJ, Waltman L. Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics. Aug 2010;84(2):523-538. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Yang YC, McClintock MK, Kozloski M, Li T. Social isolation and adult mortality: the role of chronic inflammation and sex differences. J Health Soc Behav. Jun 2013;54(2):183-203. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Poscia A, Stojanovic J, La Milia DI, Duplaga M, Grysztar M, Moscato U, et al. Interventions targeting loneliness and social isolation among the older people: an update systematic review. Exp Gerontol. Feb 2018;102:133-144. [ CrossRef ] [ Medline ]
  • Chopik WJ. The benefits of social technology use among older adults are mediated by reduced loneliness. Cyberpsychol Behav Soc Netw. Sep 2016;19(9):551-556. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Chen S, Nakamura M. Generating personalized dialogues based on conversation log summarization and sentiment analysis. In: Proceedings of the 23rd International Conference on Information Integration and Web Intelligence. 2021. Presented at: iiWAS2021; November 29-December 1, 2021; Linz, Austria. [ CrossRef ]
  • OʼSúilleabháin PS, Gallagher S, Steptoe A. Loneliness, living alone, and all-cause mortality: the role of emotional and social loneliness in the elderly during 19 years of follow-up. Psychosom Med. 2019;81(6):521-526. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Steptoe A, Shankar A, Demakakos P, Wardle J. Social isolation, loneliness, and all-cause mortality in older men and women. Proc Natl Acad Sci U S A. Apr 09, 2013;110(15):5797-5801. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • de Jong-Gierveld J, Kamphuls F. The development of a Rasch-type loneliness scale. Appl Psychol Meas. Jul 27, 2016;9(3):289-299. [ CrossRef ]
  • Gierveld JD, Tilburg TV. A 6-item scale for overall, emotional, and social loneliness: confirmatory tests on survey data. Res Aging. 2006;28(5):582-598. [ CrossRef ]
  • Cornwell EY, Waite LJ. Social disconnectedness, perceived isolation, and health among older adults. J Health Soc Behav. Mar 2009;50(1):31-48. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • National Academies of Sciences, Engineering, and Medicine, Division of Behavioral and Social Sciences and Education, Health and Medicine Division, Board on Behavioral, Cognitive, and Sensory Sciences, Board on Health Sciences Policy, Committee on the Health and Medical Dimensions of Social Isolation and Loneliness in Older Adults. Social Isolation and Loneliness in Older Adults: Opportunities for the Health Care System. Washington, DC. National Academies Press; 2020.
  • Ageing well in the digital world. Active Assisted Living Programme. URL: https://www.aal-europe.eu/ [accessed 2024-02-14]
  • Emilio–personal assistant. Active Assisted Living Programme. URL: https://www.emilio-aal.eu/ [accessed 2024-02-26]

Edited by T de Azevedo Cardoso; submitted 04.07.23; peer-reviewed by V Jones, F Yang; comments to author 26.09.23; revised version received 13.10.23; accepted 24.11.23; published 18.03.24.

©Rachele Alessandra Marziali, Claudia Franceschetti, Adrian Dinculescu, Alexandru Nistorescu, Dominic Mircea Kristály, Adrian Alexandru Moșoi, Ronny Broekx, Mihaela Marin, Cristian Vizitiu, Sorin-Aurel Moraru, Lorena Rossi, Mirko Di Rosa. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 18.03.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.

What the Data Says About Pandemic School Closures, Four Years Later

The more time students spent in remote instruction, the further they fell behind. And, experts say, extended closures did little to stop the spread of Covid.

By Sarah Mervosh, Claire Cain Miller and Francesca Paris

Four years ago this month, schools nationwide began to shut down, igniting one of the most polarizing and partisan debates of the pandemic.

Some schools, often in Republican-led states and rural areas, reopened by fall 2020. Others, typically in large cities and states led by Democrats, would not fully reopen for another year.

A variety of data — about children’s academic outcomes and about the spread of Covid-19 — has accumulated in the time since. Today, there is broad acknowledgment among many public health and education experts that extended school closures did not significantly stop the spread of Covid, while the academic harms for children have been large and long-lasting.

While poverty and other factors also played a role, remote learning was a key driver of academic declines during the pandemic, research shows — a finding that held true across income levels.

Source: Fahle, Kane, Patterson, Reardon, Staiger and Stuart, “School District and Community Factors Associated With Learning Loss During the COVID-19 Pandemic.” Score changes are measured from 2019 to 2022. In-person means a district offered traditional in-person learning, even if not all students were in-person.

“There’s fairly good consensus that, in general, as a society, we probably kept kids out of school longer than we should have,” said Dr. Sean O’Leary, a pediatric infectious disease specialist who helped write guidance for the American Academy of Pediatrics, which recommended in June 2020 that schools reopen with safety measures in place.

There were no easy decisions at the time. Officials had to weigh the risks of an emerging virus against the academic and mental health consequences of closing schools. And even schools that reopened quickly, by the fall of 2020, have seen lasting effects.

But as experts plan for the next public health emergency, whatever it may be, a growing body of research shows that pandemic school closures came at a steep cost to students.

The longer schools were closed, the more students fell behind.

At the state level, more time spent in remote or hybrid instruction in the 2020-21 school year was associated with larger drops in test scores, according to a New York Times analysis of school closure data and results from the National Assessment of Educational Progress, an authoritative exam administered to a national sample of fourth- and eighth-grade students.

At the school district level, that finding also holds, according to an analysis of test scores from third through eighth grade in thousands of U.S. districts, led by researchers at Stanford and Harvard. In districts where students spent most of the 2020-21 school year learning remotely, they fell more than half a grade behind in math on average, while in districts that spent most of the year in person they lost just over a third of a grade.

(A separate study of nearly 10,000 schools found similar results.)

Such losses can be hard to overcome without significant interventions. The most recent test scores, from spring 2023, show that students, overall, are not caught up from their pandemic losses, with larger gaps remaining among students who lost the most ground to begin with. Students in districts that were remote or hybrid the longest (at least 90 percent of the 2020-21 school year) still had almost double the ground to make up compared with students in districts that allowed students back for most of the year.

Some time in person was better than no time.

As districts shifted toward in-person learning as the year went on, students who were offered a hybrid schedule (a few hours or days a week in person, with the rest online) did better, on average, than those in places where school was fully remote, but worse than those in places that had school fully in person.

Chart: Students in hybrid or remote learning, 2020-21 (share of students). Annotations: some schools return online as Covid-19 cases surge, and vaccinations start for high-priority groups; teachers are eligible for the Covid vaccine in more than half of states; most districts end the year in-person or hybrid.

Source: Burbio audit of more than 1,200 school districts representing 47 percent of U.S. K-12 enrollment. Note: Learning mode was defined based on the most in-person option available to students.

Income and family background also made a big difference.

A second factor associated with academic declines during the pandemic was a community’s poverty level. Comparing districts with similar remote learning policies, poorer districts had steeper losses.

But in-person learning still mattered: Looking at districts with similar poverty levels, remote learning was associated with greater declines.

A community’s poverty rate and the length of school closures had a “roughly equal” effect on student outcomes, said Sean F. Reardon, a professor of poverty and inequality in education at Stanford, who led a district-level analysis with Thomas J. Kane, an economist at Harvard.

Score changes are measured from 2019 to 2022. Poorest and richest are the top and bottom 20% of districts by percent of students on free/reduced lunch. Mostly in-person and mostly remote are districts that offered traditional in-person learning for more than 90 percent or less than 10 percent of the 2020-21 year.

But the combination — poverty and remote learning — was particularly harmful. For each week spent remote, students in poor districts experienced steeper losses in math than peers in richer districts.

That is notable, because poor districts were also more likely to stay remote for longer.

Some of the country’s largest poor districts are in Democratic-leaning cities that took a more cautious approach to the virus. Poor areas, and Black and Hispanic communities, also suffered higher Covid death rates, making many families and teachers in those districts hesitant to return.

“We wanted to survive,” said Sarah Carpenter, the executive director of Memphis Lift, a parent advocacy group in Memphis, where schools were closed until spring 2021.

“But I also think, man, looking back, I wish our kids could have gone back to school much quicker,” she added, citing the academic effects.

Other things were also associated with worse student outcomes, including increased anxiety and depression among adults in children’s lives, and the overall restriction of social activity in a community, according to the Stanford and Harvard research.

Even short closures had long-term consequences for children.

While being in school was on average better for academic outcomes, it wasn’t a guarantee. Some districts that opened early, like those in Cherokee County, Ga., a suburb of Atlanta, and Hanover County, Va., lost significant learning and remain behind.

At the same time, many schools are seeing more anxiety and behavioral outbursts among students. And chronic absenteeism from school has surged across demographic groups.

These are signs, experts say, that even short-term closures, and the pandemic more broadly, had lasting effects on the culture of education.

“There was almost, in the Covid era, a sense of, ‘We give up, we’re just trying to keep body and soul together,’ and I think that was corrosive to the higher expectations of schools,” said Margaret Spellings, an education secretary under President George W. Bush who is now chief executive of the Bipartisan Policy Center.

Closing schools did not appear to significantly slow Covid’s spread.

Perhaps the biggest question that hung over school reopenings: Was it safe?

That was largely unknown in the spring of 2020, when schools first shut down. But several experts said that had changed by the fall of 2020, when there were initial signs that children were less likely to become seriously ill, and growing evidence from Europe and parts of the United States that opening schools, with safety measures, did not lead to significantly more transmission.

“Infectious disease leaders have generally agreed that school closures were not an important strategy in stemming the spread of Covid,” said Dr. Jeanne Noble, who directed the Covid response at the University of California, San Francisco health system.

Politically, though, there remains some disagreement about when, exactly, it was safe to reopen school.

Republican governors who pushed to open schools sooner have claimed credit for their approach, while Democrats and teachers’ unions have emphasized their commitment to safety and their investment in helping students recover.

“I do believe it was the right decision,” said Jerry T. Jordan, president of the Philadelphia Federation of Teachers, which resisted returning to school in person over concerns about the availability of vaccines and poor ventilation in school buildings. Philadelphia schools waited to partially reopen until the spring of 2021, a decision Mr. Jordan believes saved lives.

“It doesn’t matter what is going on in the building and how much people are learning if people are getting the virus and running the potential of dying,” he said.

Pandemic school closures offer lessons for the future.

Though the next health crisis may have different particulars, with different risk calculations, the consequences of closing schools are now well established, experts say.

In the future, infectious disease experts said, they hoped decisions would be guided more by epidemiological data as it emerged, taking into account the trade-offs.

“Could we have used data to better guide our decision making? Yes,” said Dr. Uzma N. Hasan, division chief of pediatric infectious diseases at RWJBarnabas Health in Livingston, N.J. “Fear should not guide our decision making.”

Source: Fahle, Kane, Patterson, Reardon, Staiger and Stuart, “School District and Community Factors Associated With Learning Loss During the Covid-19 Pandemic.”

The study used estimates of learning loss from the Stanford Education Data Archive. For closure lengths, the study averaged district-level estimates of time spent in remote and hybrid learning compiled by the Covid-19 School Data Hub (C.S.D.H.) and American Enterprise Institute (A.E.I.). The A.E.I. data defines remote status by whether there was an in-person or hybrid option, even if some students chose to remain virtual. In the C.S.D.H. data set, districts are defined as remote if “all or most” students were virtual.

Sarah Mervosh covers education for The Times, focusing on K-12 schools.

Claire Cain Miller writes about gender, families and the future of work for The Upshot. She joined The Times in 2008 and was part of a team that won a Pulitzer Prize in 2018 for public service for reporting on workplace sexual harassment issues.

Francesca Paris is a Times reporter working with data and graphics for The Upshot.

COMMENTS

  1. 12 Interesting Linear Regression Project Ideas & Topics For ...

    The linear regression analysis is quite helpful when working on linear regression projects in python. For example, it helps in forecasting future values and trends. It can also predict the effects of changes. ... There are many research papers on this topic, so you won't have trouble finding relevant data sources. In-demand Machine Learning ...

  2. Regression Analysis

    Regression analysis also helps in predicting health outcomes based on various factors like age, genetic markers, or lifestyle choices. Social Sciences: Regression analysis is widely used in social sciences like sociology, psychology, and education research. Researchers can investigate the impact of variables like income, education level, or ...

  3. A Refresher on Regression Analysis

    A Refresher on Regression Analysis. Understanding one of the most important types of data analysis. By Amy Gallo, November 04, 2015. You probably know by now that ...

  4. Linear Regression Project Ideas

    The standard research project will ask students to select and research variables before using linear regression for statistical analysis. Below are some research suggestions along with project ideas.

  5. Regression Analysis

    Complex Survey Design. Raghunath Arnab, in Survey Sampling Theory and Applications, 2017. 20.1 Introduction. In regression analysis we describe the relationship between a response (dependent) variable and a number of explanatory (independent) variables. We also predict the future value of the dependent variable using the established relationship. The relationships are explained through ...

  6. Regression Analysis: The Complete Guide

    Marketing and advertising spending are common topics for regression analysis. Companies use regression when trying to assess the value of ad spend and marketing spend on revenue.

  7. The clinician's guide to interpreting a regression analysis

    Regression analysis is an important statistical method that is commonly used to determine the relationship between several factors ... Logistic regression in medical research. Anesth Analg. 2021 ...

  8. Regression Tutorial with Analysis Examples

    My tutorial helps you go through the regression content in a systematic and logical order. This tutorial covers many facets of regression analysis including selecting the correct type of regression analysis, specifying the best model, interpreting the results, assessing the fit of the model, generating predictions, and checking the assumptions.

  9. Regression Analysis

    The aim of linear regression analysis is to estimate the coefficients of the regression equation \( b_0 \) and \( b_k \) (\( k \in K \)) so that the sum of the squared residuals (i.e., the sum over all squared differences between the observed values \( y_i \) of the \( i \)th observation and the corresponding predicted values \( \hat{y}_i \)) is minimized. The lower part of Fig. 1 illustrates this approach, which is ...

  10. Sage Research Methods Foundations

    Even though relatively few modern analyses stop with the most basic type of regression analysis, its foundational concepts and techniques lie at the core of advanced modeling strategies. This entry explains these fundamental ideas and approaches based on a linear regression estimated with the ordinary least squares approach, setting the stage ...

  11. Regression Analysis

    Multiple linear regression analysis is essentially similar to the simple linear model, with the exception that multiple independent variables are used in the model. The mathematical representation of multiple linear regression is: Y = a + b X1 + c X2 + d X3 + ϵ. Where: Y - Dependent variable. X1, X2, X3 - Independent (explanatory) variables.

  12. A short intro to linear regression analysis using survey data

    Regression is a statistical method that allows us to look at the relationship between two variables, while holding other factors equal. This post will show how to estimate and interpret linear regression models with survey data using R. We'll use data taken from a Pew Research Center 2016 post-election survey, and you can download the dataset ...

  13. 14 Topics in Multiple Regression

    In short, \(X_2\) has a negative estimated partial regression coefficient represented by the difference in height between the two regression lines. Figure 14.1: Dummy Intercept Variables For a case with multiple nominal categories (e.g., region) the procedure is as follows: (a) determine which category will be assigned as the referent group; (b ...

  14. Regression Analysis

    Regression analysis is a quantitative research method which is used when the study involves modelling and analysing several variables, where the relationship includes a dependent variable and one or more independent variables. In simple terms, regression analysis is a quantitative method used to test the nature of relationships between a dependent variable and one or more independent variables.

  15. Simple Linear Regression

    The formula for a simple linear regression is y = B0 + B1x, where y is the predicted value of the dependent variable (y) for any given value of the independent variable (x). B0 is the intercept, the predicted value of y when x is 0. B1 is the regression coefficient: how much we expect y to change as x increases. x is the independent variable (the ...

  16. Linear Regression in Medical Research

    Linear regression is an extremely versatile technique that can be used to address a variety of research questions and study aims. ... (eg, treatment group or patient sex) and a quantitative outcome (eg, blood pressure). The 2-sample t test and analysis of variance, 3 which are commonly used for this purpose, are essentially special cases of ...

  17. Sage Research Methods

    Understanding Regression Analysis: An Introductory Guide presents the fundamentals of regression analysis, from its meaning to uses, in a concise, easy-to-read, and non-technical style. It illustrates how regression coefficients are estimated, interpreted, and used in a variety of settings within the social sciences, business, law, and public ...

  18. Selected Topics in Regression Analysis

    This chapter covers the specialized topics within the area of regression analysis; using categorical and dummy variables in regression models, constrained least squares, and estimation using the method of moments and its generalization. Categorical variables are variables that represent group membership.

  19. Regression Analysis for COVID-19 Infections and Deaths Based on Food

    More specifically, research on the correlation with health issues has been applied and presents a correlation to various health issues. A multivariable linear regression analysis on global data of COVID-19 cases and deaths recorded a high correlation of cases and deaths with high cholesterol and high body mass .

  20. A Beginner's Guide to Regression Analysis

    Logistic Regression. Logistic Regression comes into play when the dependent variable is discrete. This means that the target value will only have one of two values. For instance, a true or false, a yes or no, a 0 or 1, and so on. In this case, a sigmoid curve describes the relationship between the independent and dependent variables.

  21. Regression Analysis

    Regression Analysis. Description: This file contains information regarding lecture 6 notes. Resource Type: Lecture Notes. pdf. 464 kB ... Topics: Business. Finance. Mathematics. Applied Mathematics. Probability and Statistics. Learning Resource Types: Lecture Videos. Lecture Notes.

  22. II. Explaining the Regression Analyses

    II. Explaining the Regression Analyses. A regression analysis is a statistical technique designed to show the relative importance of each of a number of independent variables in predicting a phenomenon of interest, in this case the likelihood that a respondent is very happy. For the purpose of this analysis, we constructed two regression ...

  23. regression analysis

    The regression model revealed greater energy savings from retrofits including the facade as compared to those that excluded it. However, those savings are modest considering the energy reductions that are anticipated from deep-energy retrofits. Other relevant factors, such as occupants and their behavior, are vital for determining the value of ...

  24. "On the uses and abuses of regression models: a call for reform of

    When students and users of statistical methods first learn about regression analysis there is an emphasis on the technical details of models and estimation methods that invariably runs ahead of the purposes for which these models might be used. ... but unsure where that leaves me. So much research uses regression modeling in some form, and so ...

  25. Degrees of Return: Estimating Internal Rates of Return for College

    Our analysis shows significant differences in the age-earnings trajectories and IRRs across college majors. ... Degrees of return: Estimating internal rates of return for college majors using quantile regression. American Educational Research Journal. Prepublished March ...

  26. Land

    Aimed at advancing the reform of the Paid Use of Residential Land, this study investigates the willingness to pay among farmers and its underlying factors. Based on a Logistic Regression analysis of a micro-survey of 450 pieces of data from the Sichuan Province in 2023, we evaluated the effects of three factors, namely individual, regional and cultural forces. Further, Random Forest analysis ...

  27. Research Roundup: How the Pandemic Changed Management

    Researchers recently reviewed 69 articles focused on the management implications of the Covid-19 pandemic that were published between March 2020 and July 2023 in top journals in management and ...

  28. Frontiers

    Understanding the pattern of species diversity and underlying ecological determinants driving a forest ecosystem is fundamental to conservation biology and forest management. Boreal forests play an irreplaceable role in providing ecosystem services and maintaining the carbon cycle globally, yet research attention remains disproportionately limited and lacking throughout time. Based on field ...

  29. Journal of Medical Internet Research

    Background: Loneliness and social isolation are major public health concerns for older adults, with severe mental and physical health consequences. New technologies may have a great impact in providing support to the daily lives of older adults and addressing the many challenges they face. In this scenario, technologies based on voice assistants (VAs) are of great interest and potential ...

  30. What the Data Says About Pandemic School Closures, Four Years Later

    At the school district level, that finding also holds, according to an analysis of test scores from third through eighth grade in thousands of U.S. districts, led by researchers at Stanford and ...
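
Several entries in the list above describe simple linear regression in terms of an intercept, a slope, and a least-squares criterion. As a rough illustration only, the sketch below fits y = b0 + b1*x by minimizing the sum of squared residuals in plain Python. The function name fit_simple_ols and the four data points are made up for this example and do not come from any of the listed pages.

```python
def fit_simple_ols(xs, ys):
    """Return (b0, b1) minimizing the sum of squared residuals for y = b0 + b1*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and the x-y cross-deviations
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx             # slope (regression coefficient)
    b0 = mean_y - b1 * mean_x  # intercept (predicted y when x is 0)
    return b0, b1


# Hypothetical illustration data lying exactly on y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
b0, b1 = fit_simple_ols(xs, ys)
print(b0, b1)
```

The closed-form expressions are used here instead of a library call so the link to the least-squares definition stays visible. For multiple independent variables, a numerical library would be the more usual choice.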