Data Analysis in Research: Types & Methods


Content Index

  • What is data analysis in research?
  • Why analyze data in research?
  • Types of data in research
  • Finding patterns in the qualitative data
  • Methods used for data analysis in qualitative research
  • Preparing data for analysis
  • Methods used for data analysis in quantitative research
  • Considerations in research data analysis

What is data analysis in research?

Definition of data analysis in research: According to LeCompte and Schensul, research data analysis is a process researchers use to reduce data to a story and interpret it to derive insights. The data analysis process reduces a large body of data into smaller fragments that make sense.

Three essential things occur during the data analysis process. The first is data organization. Summarization and categorization together make up the second, known as data reduction; it helps find patterns and themes in the data for easy identification and linking. The third and last is data analysis itself, which researchers perform in both top-down and bottom-up fashion.


On the other hand, Marshall and Rossman describe data analysis as a messy, ambiguous, and time-consuming, but creative and fascinating, process through which a mass of collected data is brought to order, structure, and meaning.

We can say that data analysis and data interpretation together represent the application of deductive and inductive logic to the research.

Why analyze data in research?

Researchers rely heavily on data, as they have a story to tell or research problems to solve. It starts with a question, and data is nothing but the answer to that question. But what if there is no question to ask? Well, it is still possible to explore data without a problem – we call it ‘Data Mining’, and it often reveals interesting patterns within the data that are worth exploring.

Regardless of the type of data researchers explore, their mission and their audience’s vision guide them to find the patterns that shape the story they want to tell. One of the essential things expected of researchers while analyzing data is to stay open and remain unbiased toward unexpected patterns, expressions, and results. Remember, sometimes data analysis tells the most unforeseen yet exciting stories that were not expected when the analysis began. Therefore, rely on the data you have at hand and enjoy the journey of exploratory research.


Types of data in research

Every kind of data describes something once a specific value is assigned to it. For analysis, these values need to be organized, processed, and presented in a given context to make them useful. Data can come in different forms; here are the primary data types.

  • Qualitative data: When the data is presented in words and descriptions, we call it qualitative data. Although you can observe this data, it is subjective and harder to analyze in research, especially for comparison. Example: anything describing taste, experience, texture, or an opinion is considered qualitative data. This type of data is usually collected through focus groups, personal qualitative interviews, qualitative observation, or open-ended questions in surveys.
  • Quantitative data: Any data expressed in numbers or numerical figures is called quantitative data. This type of data can be categorized, grouped, measured, calculated, or ranked. Example: questions about age, rank, cost, length, weight, scores, etc. all yield this type of data. You can present such data in graphical formats or charts, or apply statistical analysis methods to it. The OMS (Outcomes Measurement Systems) questionnaires in surveys are a significant source of numeric data.
  • Categorical data: This is data presented in groups; an item included in categorical data cannot belong to more than one group. Example: a survey respondent describing their living situation, marital status, smoking habit, or drinking habit provides categorical data. A chi-square test is a standard method used to analyze this data (see the sketch after this list).
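
To illustrate how categorical survey data of this kind might be analyzed, here is a minimal sketch of a chi-square test of independence using SciPy; the contingency table, category labels, and counts are all invented purely for demonstration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = marital status, columns = smoking habit.
# Counts are invented purely to illustrate the mechanics of the test.
observed = np.array([
    [45, 15],   # single:  [non-smoker, smoker]
    [60, 10],   # married: [non-smoker, smoker]
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, degrees of freedom = {dof}, p-value = {p_value:.3f}")
# A small p-value would suggest that smoking habit and marital status are not
# independent in this (fabricated) sample.
```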


Data analysis in qualitative research

Data analysis in qualitative research works a little differently from numerical data analysis, as qualitative data is made up of words, descriptions, images, objects, and sometimes symbols. Getting insight from such complicated information is a complicated process; hence it is typically used for exploratory research and data analysis.

Although there are several ways to find patterns in textual information, a word-based method is the most relied-upon and widely used technique for research and data analysis. Notably, the data analysis process in qualitative research is largely manual: researchers usually read the available data and look for repetitive or commonly used words.

For example, while studying data collected from African countries to understand the most pressing issues people face, researchers might find that “food” and “hunger” are the most commonly used words and will highlight them for further analysis.
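
As a minimal sketch of this word-frequency approach, assuming the open-ended responses have already been loaded as plain strings (the responses and stop-word list below are invented for illustration):

```python
from collections import Counter
import re

# Hypothetical open-ended survey responses (invented for illustration).
responses = [
    "We struggle with food shortages and hunger every dry season.",
    "Hunger is the biggest issue; food prices keep rising.",
    "Access to clean water and food remains difficult.",
]

# Tokenize, lowercase, and drop a few common filler words.
stop_words = {"we", "with", "and", "every", "is", "the", "to", "keep"}
words = re.findall(r"[a-z']+", " ".join(responses).lower())
counts = Counter(word for word in words if word not in stop_words)

# The most frequent terms hint at recurring themes worth closer qualitative review.
print(counts.most_common(5))
```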


Keyword-in-context is another widely used word-based technique. In this method, the researcher tries to understand the concept by analyzing the context in which participants use a particular keyword.

For example, researchers conducting research and data analysis to study the concept of ‘diabetes’ among respondents might analyze the context of when and how a respondent used or referred to the word ‘diabetes.’
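
A hedged sketch of such a keyword-in-context (concordance) pass over free-text responses; the window size and interview fragments are arbitrary choices made only for illustration:

```python
import re

def keyword_in_context(texts, keyword, window=4):
    """Return small windows of words surrounding each occurrence of `keyword`."""
    hits = []
    for text in texts:
        tokens = re.findall(r"\w+", text.lower())
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                hits.append((left, token, right))
    return hits

# Invented interview fragments for demonstration.
fragments = [
    "My mother was diagnosed with diabetes after years of untreated symptoms.",
    "I worry about diabetes because of my diet and family history.",
]

for left, keyword, right in keyword_in_context(fragments, "diabetes"):
    print(f"... {left} [{keyword}] {right} ...")
```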

The scrutiny-based technique is another highly recommended text analysis method for identifying patterns in qualitative data. Compare-and-contrast is the most widely used method under this technique; it is used to work out how specific pieces of text are similar to or different from each other.

For example, to study the “importance of a resident doctor in a company,” the collected data is divided into responses from people who think it is necessary to hire a resident doctor and those who think it is unnecessary. Compare-and-contrast works best for analyzing polls with single-answer question types.

Metaphors can be used to reduce the data pile and find patterns in it so that it becomes easier to connect data with theory.

Variable partitioning is another technique used to split variables so that researchers can find more coherent descriptions and explanations in an enormous dataset.


There are several techniques for analyzing data in qualitative research, but here are some commonly used methods:

  • Content Analysis: This is a widely accepted and frequently employed technique for data analysis in research methodology. It can be used to analyze documented information from text, images, and sometimes physical items. When and where to use this method depends on the research questions.
  • Narrative Analysis: This method is used to analyze content gathered from various sources such as personal interviews, field observations, and surveys. Most of the time, the stories or opinions shared by people are examined to find answers to the research questions.
  • Discourse Analysis: Similar to narrative analysis, discourse analysis is used to analyze interactions with people. However, this particular method considers the social context in which the communication between researcher and respondent takes place. In addition, discourse analysis also considers the respondent’s lifestyle and day-to-day environment when deriving any conclusion.
  • Grounded Theory: When you want to explain why a particular phenomenon happened, grounded theory is the best resort for analyzing qualitative data. Grounded theory is applied to data about a host of similar cases occurring in different settings. When researchers use this method, they may alter their explanations or produce new ones until they arrive at a conclusion.


Data analysis in quantitative research

The first stage in research and data analysis is to prepare the data for analysis so that nominal data can be converted into something meaningful. Data preparation consists of the phases below.

Phase I: Data Validation

Data validation is done to understand whether the collected data sample meets the pre-set standards or is a biased sample. It is divided into four stages:

  • Fraud: To ensure an actual human being records each response to the survey or the questionnaire
  • Screening: To make sure each participant or respondent is selected or chosen in compliance with the research criteria
  • Procedure: To ensure ethical standards were maintained while collecting the data sample
  • Completeness: To ensure that the respondent answered all the questions in an online survey, or that the interviewer asked all the questions devised in the questionnaire (a minimal completeness check is sketched after this list).
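
As an illustration of the completeness stage, here is a hedged pandas sketch that flags respondents with unanswered questions; the column names, values, and flagging rule are assumptions made purely for the example.

```python
import pandas as pd

# Hypothetical raw survey export: NaN (from None) marks an unanswered question.
raw = pd.DataFrame({
    "respondent_id": [101, 102, 103],
    "q1_age": [34, 51, None],
    "q2_satisfaction": [4, None, 5],
    "q3_recommend": ["yes", "no", "yes"],
})
question_cols = ["q1_age", "q2_satisfaction", "q3_recommend"]

# Count unanswered questions per respondent and flag incomplete records.
raw["missing_answers"] = raw[question_cols].isna().sum(axis=1)
incomplete = raw[raw["missing_answers"] > 0]

print(incomplete[["respondent_id", "missing_answers"]])
# Incomplete responses can then be followed up, imputed, or excluded,
# depending on the validation rules the study defines.
```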

Phase II: Data Editing

More often than not, an extensive research data sample comes loaded with errors. Respondents sometimes fill in fields incorrectly or skip them accidentally. Data editing is the process wherein researchers confirm that the provided data is free of such errors. They conduct the necessary checks, including outlier checks, to edit the raw data and make it ready for analysis.

Phase III: Data Coding

Out of all three, this is the most critical phase of data preparation, associated with grouping and assigning values to survey responses. If a survey is completed with a sample size of 1,000, the researcher might create age brackets to distinguish respondents by age. It then becomes easier to analyze small data buckets rather than deal with the massive data pile.
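
A minimal sketch of such coding with pandas, assuming an age column exists in the survey export; the bracket edges and labels are arbitrary choices for illustration:

```python
import pandas as pd

# Hypothetical respondent ages from a survey export.
survey = pd.DataFrame({"age": [19, 24, 35, 42, 58, 67, 71]})

# Code raw ages into brackets so responses can be analyzed per group.
survey["age_bracket"] = pd.cut(
    survey["age"],
    bins=[0, 24, 44, 64, 120],
    labels=["18-24", "25-44", "45-64", "65+"],
)

# Smaller buckets are easier to summarize than the raw age column.
print(survey["age_bracket"].value_counts().sort_index())
```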


After the data is prepared for analysis, researchers can apply different research and data analysis methods to derive meaningful insights. Statistical analysis plans are by far the most favored way to analyze numerical data. In statistical analysis, distinguishing between categorical data and numerical data is essential, as categorical data involves distinct categories or labels, while numerical data consists of measurable quantities. Statistical methods fall into two groups: ‘descriptive statistics’, used to describe the data, and ‘inferential statistics’, which help in comparing the data and drawing conclusions beyond it.

Descriptive statistics

This method is used to describe the basic features of the many types of data in research. It presents the data in such a meaningful way that patterns in the data start making sense. Nevertheless, descriptive analysis does not go beyond describing the data; any conclusions drawn are still based on the hypotheses researchers have formulated so far. Here are a few major types of descriptive analysis methods.

Measures of Frequency

  • Count, Percent, Frequency
  • It is used to denote how often a particular event occurs.
  • Researchers use it when they want to showcase how often a response is given.

Measures of Central Tendency

  • Mean, Median, Mode
  • The method is widely used to summarize the central points of a distribution.
  • Researchers use this method when they want to showcase the most common or average response.

Measures of Dispersion or Variation

  • Range, Variance, Standard deviation
  • The range is the difference between the highest and lowest scores.
  • Variance and standard deviation describe how far observed scores typically fall from the mean.
  • These measures identify the spread of scores by stating intervals.
  • Researchers use this method to show how spread out the data is and how strongly that spread affects the mean.

Measures of Position

  • Percentile ranks, Quartile ranks
  • These measures rely on standardized scores, helping researchers identify the relationship between different scores.
  • They are often used when researchers want to compare an individual score against the overall distribution (the sketch after this list illustrates these descriptive measures).
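
The hedged sketch below computes the descriptive measures listed above (frequency, central tendency, dispersion, and position) on an invented set of survey scores, using only NumPy and Python's standard library:

```python
import numpy as np
from collections import Counter
from statistics import mean, median, mode, stdev

# Invented satisfaction scores (1-5) from a hypothetical survey.
scores = [4, 5, 3, 4, 2, 5, 4, 3, 4, 5, 1, 4]

# Measures of frequency: how often each response was given.
print("frequencies:", Counter(scores))

# Measures of central tendency.
print("mean:", round(mean(scores), 2), "median:", median(scores), "mode:", mode(scores))

# Measures of dispersion or variation.
print("range:", max(scores) - min(scores), "std dev:", round(stdev(scores), 2))

# Measures of position: percentile and quartile ranks.
print("25th/50th/75th percentiles:", np.percentile(scores, [25, 50, 75]))
```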

For quantitative research, descriptive analysis often gives absolute numbers, but those numbers alone are rarely sufficient to explain the rationale behind them. Nevertheless, it is necessary to think about which method of research and data analysis best suits your survey questionnaire and the story researchers want to tell. For example, the mean is the best way to demonstrate students’ average scores in a school. It is better to rely on descriptive statistics when researchers intend to keep the research or outcome limited to the provided sample without generalizing it: for example, when you want to compare the average votes cast in two different cities, descriptive statistics are enough.

Descriptive analysis is also called a ‘univariate analysis’ since it is commonly used to analyze a single variable.

Inferential statistics

Inferential statistics are used to make predictions about a larger population after research and data analysis of a sample representing that population. For example, you can ask some 100 audience members at a movie theater whether they like the movie they are watching. Researchers then use inferential statistics on the collected sample to reason that about 80-90% of people like the movie.
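
As a hedged sketch of that kind of inference, the snippet below computes a simple normal-approximation confidence interval for the share of moviegoers who liked the film; the counts are invented, and the Wald approximation is only one of several ways to estimate such an interval.

```python
import math

# Invented sample: 85 of 100 surveyed moviegoers said they liked the movie.
liked, n = 85, 100
p_hat = liked / n

# 95% confidence interval using the normal (Wald) approximation.
z = 1.96
margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
low, high = p_hat - margin, p_hat + margin

print(f"Estimated share who liked the movie: {p_hat:.0%} "
      f"(95% CI roughly {low:.0%} to {high:.0%})")
# The interval is the inferential step: it generalizes from the sample of
# 100 viewers to the wider audience population.
```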

Here are two significant areas of inferential statistics.

  • Estimating parameters: It takes statistics from the sample research data and demonstrates something about the population parameter.
  • Hypothesis testing: It’s about using sampled research data to answer the survey research questions. For example, researchers might be interested to understand whether a newly launched shade of lipstick is well received, or whether multivitamin capsules help children perform better at games.

These are sophisticated analysis methods used to showcase the relationship between different variables instead of describing a single variable. They are often used when researchers want to go beyond absolute numbers and understand the relationship between variables.

Here are some of the commonly used methods for data analysis in research.

  • Correlation: When researchers are not conducting experimental or quasi-experimental research but are interested in understanding the relationship between two or more variables, they opt for correlational research methods.
  • Cross-tabulation: Also called contingency tables, cross-tabulation is used to analyze the relationship between multiple variables. Suppose the provided data has age and gender categories presented in rows and columns; a two-dimensional cross-tabulation enables seamless data analysis and research by showing the number of males and females in each age category (see the sketch after this list).
  • Regression analysis: To understand how strongly two variables are related, researchers rarely look beyond regression analysis, the primary and most commonly used method, which is also a type of predictive analysis. In this method, you have an essential factor called the dependent variable and one or more independent variables, and you work out the impact of the independent variables on the dependent variable. The values of both independent and dependent variables are assumed to have been obtained in an error-free, random manner.
  • Frequency tables: A frequency table summarizes how often each value or category of a variable occurs in the data. It is a simple way to organize responses before applying further statistical tests.
  • Analysis of variance (ANOVA): This statistical procedure is used to test the degree to which two or more groups vary or differ in an experiment. A considerable degree of variation means the research findings were significant. In many contexts, ANOVA testing and variance analysis are similar (a short sketch of cross-tabulation, regression, and ANOVA follows this list).
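
To make a few of these methods concrete, here is a hedged sketch with invented data and assumed column names, showing a cross-tabulation with pandas, a simple regression with SciPy, and a one-way ANOVA:

```python
import pandas as pd
from scipy import stats

# Invented survey records (column names are assumptions for this example).
df = pd.DataFrame({
    "gender":    ["F", "M", "F", "M", "F", "M", "F", "M"],
    "age_group": ["18-24", "18-24", "25-44", "25-44", "45+", "45+", "25-44", "45+"],
    "ad_spend":  [10, 12, 20, 18, 30, 28, 22, 35],
    "purchases": [1, 2, 3, 3, 5, 4, 3, 6],
})

# Cross-tabulation: counts of respondents per age group and gender.
print(pd.crosstab(df["age_group"], df["gender"]))

# Regression analysis: effect of ad spend (independent) on purchases (dependent).
slope, intercept, r_value, p_value, std_err = stats.linregress(df["ad_spend"], df["purchases"])
print(f"slope = {slope:.2f}, r = {r_value:.2f}, p = {p_value:.3f}")

# One-way ANOVA: do purchase counts differ across age groups?
groups = [group["purchases"].values for _, group in df.groupby("age_group")]
f_stat, p_anova = stats.f_oneway(*groups)
print(f"ANOVA F = {f_stat:.2f}, p = {p_anova:.3f}")
```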

Considerations in research data analysis

  • Researchers must have the necessary research skills to analyze and manipulate the data, and should be trained to demonstrate a high standard of research practice. Ideally, researchers should possess more than a basic understanding of the rationale for selecting one statistical method over another to obtain better data insights.
  • Research and data analytics projects usually differ by scientific discipline; therefore, getting statistical advice at the beginning of the analysis helps in designing the survey questionnaire, selecting data collection methods, and choosing samples.


  • The primary aim of research data analysis is to derive ultimate insights that are unbiased. Any mistake in collecting the data, selecting an analysis method, or choosing an audience sample, or approaching any of these with a biased mindset, will lead to a biased inference.
  • No degree of sophistication in research data analysis can rectify poorly defined objectives or outcome measurements. Whether the design is at fault or the intentions are unclear, the resulting lack of clarity can mislead readers, so avoid the practice.
  • The motive behind data analysis in research is to present accurate and reliable data. As far as possible, avoid statistical errors, and find ways to deal with everyday challenges like outliers, missing data, data altering, data mining, and graphical representation.

The sheer amount of data generated daily is staggering, especially now that data analysis has taken center stage. In 2018 alone, the total data supply amounted to 2.8 trillion gigabytes. Hence, it is clear that enterprises willing to survive in a hypercompetitive world must possess an excellent capability to analyze complex research data, derive actionable insights, and adapt to new market needs.


QuestionPro is an online survey platform that empowers organizations in data analysis and research and provides them with a medium to collect data by creating appealing surveys.


Research Design | Step-by-Step Guide with Examples

Published on 5 May 2022 by Shona McCombes. Revised on 20 March 2023.

A research design is a strategy for answering your research question  using empirical data. Creating a research design means making decisions about:

  • Your overall aims and approach
  • The type of research design you’ll use
  • Your sampling methods or criteria for selecting subjects
  • Your data collection methods
  • The procedures you’ll follow to collect data
  • Your data analysis methods

A well-planned research design helps ensure that your methods match your research aims and that you use the right kind of analysis for your data.

Table of contents

  • Step 1: Consider your aims and approach
  • Step 2: Choose a type of research design
  • Step 3: Identify your population and sampling method
  • Step 4: Choose your data collection methods
  • Step 5: Plan your data collection procedures
  • Step 6: Decide on your data analysis strategies
  • Frequently asked questions

Step 1: Consider your aims and approach

Before you can start designing your research, you should already have a clear idea of the research question you want to investigate.

There are many different ways you could go about answering this question. Your research design choices should be driven by your aims and priorities – start by thinking carefully about what you want to achieve.

The first choice you need to make is whether you’ll take a qualitative or quantitative approach.

Qualitative research designs tend to be more flexible and inductive , allowing you to adjust your approach based on what you find throughout the research process.

Quantitative research designs tend to be more fixed and deductive , with variables and hypotheses clearly defined in advance of data collection.

It’s also possible to use a mixed methods design that integrates aspects of both approaches. By combining qualitative and quantitative insights, you can gain a more complete picture of the problem you’re studying and strengthen the credibility of your conclusions.

Practical and ethical considerations when designing research

As well as scientific considerations, you need to think practically when designing your research. If your research involves people or animals, you also need to consider research ethics .

  • How much time do you have to collect data and write up the research?
  • Will you be able to gain access to the data you need (e.g., by travelling to a specific location or contacting specific people)?
  • Do you have the necessary research skills (e.g., statistical analysis or interview techniques)?
  • Will you need ethical approval ?

At each stage of the research design process, make sure that your choices are practically feasible.

Step 2: Choose a type of research design

Within both qualitative and quantitative approaches, there are several types of research design to choose from. Each type provides a framework for the overall shape of your research.

Types of quantitative research designs

Quantitative designs can be split into four main types. Experimental and   quasi-experimental designs allow you to test cause-and-effect relationships, while descriptive and correlational designs allow you to measure variables and describe relationships between them.

With descriptive and correlational designs, you can get a clear picture of characteristics, trends, and relationships as they exist in the real world. However, you can’t draw conclusions about cause and effect (because correlation doesn’t imply causation ).

Experiments are the strongest way to test cause-and-effect relationships without the risk of other variables influencing the results. However, their controlled conditions may not always reflect how things work in the real world. They’re often also more difficult and expensive to implement.

Types of qualitative research designs

Qualitative designs are less strictly defined. This approach is about gaining a rich, detailed understanding of a specific context or phenomenon, and you can often be more creative and flexible in designing your research.

Common types of qualitative design include case studies, ethnographies, grounded theory, and phenomenological research. They often have similar approaches in terms of data collection, but focus on different aspects when analysing the data.

Step 3: Identify your population and sampling method

Your research design should clearly define who or what your research will focus on, and how you’ll go about choosing your participants or subjects.

In research, a population is the entire group that you want to draw conclusions about, while a sample is the smaller group of individuals you’ll actually collect data from.

Defining the population

A population can be made up of anything you want to study – plants, animals, organisations, texts, countries, etc. In the social sciences, it most often refers to a group of people.

For example, will you focus on people from a specific demographic, region, or background? Are you interested in people with a certain job or medical condition, or users of a particular product?

The more precisely you define your population, the easier it will be to gather a representative sample.

Sampling methods

Even with a narrowly defined population, it’s rarely possible to collect data from every individual. Instead, you’ll collect data from a sample.

To select a sample, there are two main approaches: probability sampling and non-probability sampling . The sampling method you use affects how confidently you can generalise your results to the population as a whole.

Probability sampling is the most statistically valid option, but it’s often difficult to achieve unless you’re dealing with a very small and accessible population.

For practical reasons, many studies use non-probability sampling, but it’s important to be aware of the limitations and carefully consider potential biases. You should always make an effort to gather a sample that’s as representative as possible of the population.
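
A hedged sketch contrasting a simple probability sample with a convenience (non-probability) sample, using Python's standard library and an invented population list:

```python
import random

# Hypothetical sampling frame: IDs of 10,000 registered customers.
population = list(range(1, 10_001))

# Probability sampling: every individual has a known, equal chance of selection.
random.seed(42)  # fixed seed only to make the illustration reproducible
probability_sample = random.sample(population, k=200)

# Non-probability (convenience) sampling: e.g. just the first 200 sign-ups,
# which risks systematic bias toward early adopters.
convenience_sample = population[:200]

print(len(probability_sample), len(convenience_sample))
# Only the probability sample supports confident generalisation to the population.
```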

Case selection in qualitative research

In some types of qualitative designs, sampling may not be relevant.

For example, in an ethnography or a case study, your aim is to deeply understand a specific context, not to generalise to a population. Instead of sampling, you may simply aim to collect as much data as possible about the context you are studying.

In these types of design, you still have to carefully consider your choice of case or community. You should have a clear rationale for why this particular case is suitable for answering your research question.

For example, you might choose a case study that reveals an unusual or neglected aspect of your research problem, or you might choose several very similar or very different cases in order to compare them.

Step 4: Choose your data collection methods

Data collection methods are ways of directly measuring variables and gathering information. They allow you to gain first-hand knowledge and original insights into your research problem.

You can choose just one data collection method, or use several methods in the same study.

Survey methods

Surveys allow you to collect data about opinions, behaviours, experiences, and characteristics by asking people directly. There are two main survey methods to choose from: questionnaires and interviews.

Observation methods

Observations allow you to collect data unobtrusively, observing characteristics, behaviours, or social interactions without relying on self-reporting.

Observations may be conducted in real time, taking notes as you observe, or you might make audiovisual recordings for later analysis. They can be qualitative or quantitative.

Other methods of data collection

There are many other ways you might collect data depending on your field and topic.

If you’re not sure which methods will work best for your research design, try reading some papers in your field to see what data collection methods they used.

Secondary data

If you don’t have the time or resources to collect data from the population you’re interested in, you can also choose to use secondary data that other researchers already collected – for example, datasets from government surveys or previous studies on your topic.

With this raw data, you can do your own analysis to answer new research questions that weren’t addressed by the original study.

Using secondary data can expand the scope of your research, as you may be able to access much larger and more varied samples than you could collect yourself.

However, it also means you don’t have any control over which variables to measure or how to measure them, so the conclusions you can draw may be limited.

Step 5: Plan your data collection procedures

As well as deciding on your methods, you need to plan exactly how you’ll use these methods to collect data that’s consistent, accurate, and unbiased.

Planning systematic procedures is especially important in quantitative research, where you need to precisely define your variables and ensure your measurements are reliable and valid.

Operationalisation

Some variables, like height or age, are easily measured. But often you’ll be dealing with more abstract concepts, like satisfaction, anxiety, or competence. Operationalisation means turning these fuzzy ideas into measurable indicators.

If you’re using observations , which events or actions will you count?

If you’re using surveys , which questions will you ask and what range of responses will be offered?

You may also choose to use or adapt existing materials designed to measure the concept you’re interested in – for example, questionnaires or inventories whose reliability and validity has already been established.

Reliability and validity

Reliability means your results can be consistently reproduced , while validity means that you’re actually measuring the concept you’re interested in.

For valid and reliable results, your measurement materials should be thoroughly researched and carefully designed. Plan your procedures to make sure you carry out the same steps in the same way for each participant.

If you’re developing a new questionnaire or other instrument to measure a specific concept, running a pilot study allows you to check its validity and reliability in advance.
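
For instance, if a pilot study collects responses to a multi-item questionnaire, internal-consistency reliability is often summarised with Cronbach's alpha. The sketch below computes it with NumPy from invented pilot ratings; treat it as an illustration of the idea rather than a complete psychometric workflow.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) matrix of scores."""
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    n_items = item_scores.shape[1]
    return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

# Invented pilot data: 6 respondents answering a 4-item satisfaction scale (1-5).
pilot = np.array([
    [4, 4, 5, 4],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [2, 2, 3, 2],
    [4, 5, 4, 4],
    [3, 4, 3, 3],
])

print(f"Cronbach's alpha: {cronbach_alpha(pilot):.2f}")
# Values around 0.7 or higher are commonly read as acceptable internal consistency.
```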

Sampling procedures

As well as choosing an appropriate sampling method, you need a concrete plan for how you’ll actually contact and recruit your selected sample.

That means making decisions about things like:

  • How many participants do you need for an adequate sample size?
  • What inclusion and exclusion criteria will you use to identify eligible participants?
  • How will you contact your sample – by mail, online, by phone, or in person?

If you’re using a probability sampling method, it’s important that everyone who is randomly selected actually participates in the study. How will you ensure a high response rate?

If you’re using a non-probability method, how will you avoid bias and ensure a representative sample?

Data management

It’s also important to create a data management plan for organising and storing your data.

Will you need to transcribe interviews or perform data entry for observations? You should anonymise and safeguard any sensitive data, and make sure it’s backed up regularly.

Keeping your data well organised will save time when it comes to analysing them. It can also help other researchers validate and add to your findings.

Step 6: Decide on your data analysis strategies

On their own, raw data can’t answer your research question. The last step of designing your research is planning how you’ll analyse the data.

Quantitative data analysis

In quantitative research, you’ll most likely use some form of statistical analysis . With statistics, you can summarise your sample data, make estimates, and test hypotheses.

Using descriptive statistics , you can summarise your sample data in terms of:

  • The distribution of the data (e.g., the frequency of each score on a test)
  • The central tendency of the data (e.g., the mean to describe the average score)
  • The variability of the data (e.g., the standard deviation to describe how spread out the scores are)

The specific calculations you can do depend on the level of measurement of your variables.

Using inferential statistics , you can:

  • Make estimates about the population based on your sample data.
  • Test hypotheses about a relationship between variables.

Regression and correlation tests look for associations between two or more variables, while comparison tests (such as t tests and ANOVAs ) look for differences in the outcomes of different groups.
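
As a small hedged illustration of a comparison test, here is an independent-samples t test run with SciPy on two invented groups of exam scores:

```python
from scipy import stats

# Invented exam scores for two teaching methods (hypothetical data).
group_a = [72, 85, 78, 90, 66, 81, 77]
group_b = [68, 74, 70, 79, 65, 72, 69]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A small p-value would indicate a difference in mean scores between the groups
# that is unlikely to be explained by sampling variation alone.
```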

Your choice of statistical test depends on various aspects of your research design, including the types of variables you’re dealing with and the distribution of your data.

Qualitative data analysis

In qualitative research, your data will usually be very dense with information and ideas. Instead of summing it up in numbers, you’ll need to comb through the data in detail, interpret its meanings, identify patterns, and extract the parts that are most relevant to your research question.

Two of the most common approaches to doing this are thematic analysis and discourse analysis .

There are many other ways of analysing qualitative data depending on the aims of your research. To get a sense of potential approaches, try reading some qualitative research papers in your field.

Frequently asked questions

A sample is a subset of individuals from a larger population. Sampling means selecting the group that you will actually collect data from in your research.

For example, if you are researching the opinions of students in your university, you could survey a sample of 100 students.

Statistical sampling allows you to test a hypothesis about the characteristics of a population. There are various sampling methods you can use to ensure that your sample is representative of the population as a whole.

Operationalisation means turning abstract conceptual ideas into measurable observations.

For example, the concept of social anxiety isn’t directly observable, but it can be operationally defined in terms of self-rating scores, behavioural avoidance of crowded places, or physical anxiety symptoms in social situations.

Before collecting data , it’s important to consider how you will operationalise the variables that you want to measure.

The research methods you use depend on the type of data you need to answer your research question .

  • If you want to measure something or test a hypothesis , use quantitative methods . If you want to explore ideas, thoughts, and meanings, use qualitative methods .
  • If you want to analyse a large amount of readily available data, use secondary data. If you want data specific to your purposes with control over how they are generated, collect primary data.
  • If you want to establish cause-and-effect relationships between variables , use experimental methods. If you want to understand the characteristics of a research subject, use descriptive methods.



Research Design 101

Everything You Need To Get Started (With Examples)

By: Derek Jansen (MBA) | Reviewers: Eunice Rautenbach (DTech) & Kerryn Warren (PhD) | April 2023

Research design for qualitative and quantitative studies

Navigating the world of research can be daunting, especially if you’re a first-time researcher. One concept you’re bound to run into fairly early in your research journey is that of “ research design ”. Here, we’ll guide you through the basics using practical examples , so that you can approach your research with confidence.

Overview: Research Design 101

  • What is research design?
  • Research design types for quantitative studies
  • Video explainer : quantitative research design
  • Research design types for qualitative studies
  • Video explainer : qualitative research design
  • How to choose a research design
  • Key takeaways

What Is Research Design?

Research design refers to the overall plan, structure or strategy that guides a research project, from its conception to the final data analysis. A good research design serves as the blueprint for how you, as the researcher, will collect and analyse data while ensuring consistency, reliability and validity throughout your study.

Understanding different types of research designs is essential, as it helps ensure that your approach is suitable given your research aims, objectives and questions, as well as the resources you have available to you. Without a clear big-picture view of how you’ll design your research, you run the risk of making misaligned methodological choices – especially regarding your sampling, data collection and data analysis decisions.

The problem with defining research design…

One of the reasons students struggle with a clear definition of research design is because the term is used very loosely across the internet, and even within academia.

Some sources claim that the three research design types are qualitative, quantitative and mixed methods , which isn’t quite accurate (these just refer to the type of data that you’ll collect and analyse). Other sources state that research design refers to the sum of all your design choices, suggesting it’s more like a research methodology . Others run off on other less common tangents. No wonder there’s confusion!

In this article, we’ll clear up the confusion. We’ll explain the most common research design types for both qualitative and quantitative research projects, whether that is for a full dissertation or thesis, or a smaller research paper or article.


Research Design: Quantitative Studies

Quantitative research involves collecting and analysing data in a numerical form. Broadly speaking, there are four types of quantitative research designs: descriptive , correlational , experimental , and quasi-experimental . 

Descriptive Research Design

As the name suggests, descriptive research design focuses on describing existing conditions, behaviours, or characteristics by systematically gathering information without manipulating any variables. In other words, there is no intervention on the researcher’s part – only data collection.

For example, if you’re studying smartphone addiction among adolescents in your community, you could deploy a survey to a sample of teens asking them to rate their agreement with certain statements that relate to smartphone addiction. The collected data would then provide insight regarding how widespread the issue may be – in other words, it would describe the situation.

The key defining attribute of this type of research design is that it purely describes the situation . In other words, descriptive research design does not explore potential relationships between different variables or the causes that may underlie those relationships. Therefore, descriptive research is useful for generating insight into a research problem by describing its characteristics . By doing so, it can provide valuable insights and is often used as a precursor to other research design types.

Correlational Research Design

Correlational design is a popular choice for researchers aiming to identify and measure the relationship between two or more variables without manipulating them . In other words, this type of research design is useful when you want to know whether a change in one thing tends to be accompanied by a change in another thing.

For example, if you wanted to explore the relationship between exercise frequency and overall health, you could use a correlational design to help you achieve this. In this case, you might gather data on participants’ exercise habits, as well as records of their health indicators like blood pressure, heart rate, or body mass index. Thereafter, you’d use a statistical test to assess whether there’s a relationship between the two variables (exercise frequency and health).
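
A hedged sketch of that exercise-and-health example using a Pearson correlation from SciPy; the numbers are fabricated solely to show the mechanics, not real participant data:

```python
from scipy.stats import pearsonr

# Invented data: weekly exercise sessions and resting heart rate for 8 participants.
exercise_sessions = [0, 1, 2, 3, 3, 4, 5, 6]
resting_heart_rate = [78, 76, 74, 72, 71, 69, 66, 64]

r, p_value = pearsonr(exercise_sessions, resting_heart_rate)
print(f"Pearson r = {r:.2f}, p = {p_value:.3f}")
# A strong negative r would suggest that more exercise tends to accompany a lower
# resting heart rate, though, as noted below, correlation does not equal causation.
```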

As you can see, correlational research design is useful when you want to explore potential relationships between variables that cannot be manipulated or controlled for ethical, practical, or logistical reasons. It is particularly helpful in terms of developing predictions, and given that it doesn’t involve the manipulation of variables, it can be implemented at a larger scale more easily than experimental designs (which we’ll look at next).

That said, it’s important to keep in mind that correlational research design has limitations – most notably that it cannot be used to establish causality . In other words, correlation does not equal causation . To establish causality, you’ll need to move into the realm of experimental design, coming up next…


Experimental Research Design

Experimental research design is used to determine if there is a causal relationship between two or more variables. With this type of research design, you, as the researcher, manipulate one variable (the independent variable) while controlling other variables, and measure the outcome (the dependent variable). Doing so allows you to observe the effect of the former on the latter and draw conclusions about potential causality.

For example, if you wanted to measure if/how different types of fertiliser affect plant growth, you could set up several groups of plants, with each group receiving a different type of fertiliser, as well as one with no fertiliser at all. You could then measure how much each plant group grew (on average) over time and compare the results from the different groups to see which fertiliser was most effective.

Overall, experimental research design provides researchers with a powerful way to identify and measure causal relationships (and the direction of causality) between variables. However, developing a rigorous experimental design can be challenging as it’s not always easy to control all the variables in a study. This often results in smaller sample sizes , which can reduce the statistical power and generalisability of the results.

Moreover, experimental research design requires random assignment . This means that the researcher needs to assign participants to different groups or conditions in a way that each participant has an equal chance of being assigned to any group (note that this is not the same as random sampling ). Doing so helps reduce the potential for bias and confounding variables . This need for random assignment can lead to ethics-related issues . For example, withholding a potentially beneficial medical treatment from a control group may be considered unethical in certain situations.

Quasi-Experimental Research Design

Quasi-experimental research design is used when the research aims involve identifying causal relations , but one cannot (or doesn’t want to) randomly assign participants to different groups (for practical or ethical reasons). Instead, with a quasi-experimental research design, the researcher relies on existing groups or pre-existing conditions to form groups for comparison.

For example, if you were studying the effects of a new teaching method on student achievement in a particular school district, you may be unable to randomly assign students to either group and instead have to choose classes or schools that already use different teaching methods. This way, you still achieve separate groups, without having to assign participants to specific groups yourself.

Naturally, quasi-experimental research designs have limitations when compared to experimental designs. Given that participant assignment is not random, it’s more difficult to confidently establish causality between variables, and, as a researcher, you have less control over other variables that may impact findings.

All that said, quasi-experimental designs can still be valuable in research contexts where random assignment is not possible and can often be undertaken on a much larger scale than experimental research, thus increasing the statistical power of the results. What’s important is that you, as the researcher, understand the limitations of the design and conduct your quasi-experiment as rigorously as possible, paying careful attention to any potential confounding variables .

The four most common quantitative research design types are descriptive, correlational, experimental and quasi-experimental.

Research Design: Qualitative Studies

There are many different research design types when it comes to qualitative studies, but here we’ll narrow our focus to explore the “Big 4”. Specifically, we’ll look at phenomenological design, grounded theory design, ethnographic design, and case study design.

Phenomenological Research Design

Phenomenological design involves exploring the meaning of lived experiences and how they are perceived by individuals. This type of research design seeks to understand people’s perspectives , emotions, and behaviours in specific situations. Here, the aim for researchers is to uncover the essence of human experience without making any assumptions or imposing preconceived ideas on their subjects.

For example, you could adopt a phenomenological design to study why cancer survivors have such varied perceptions of their lives after overcoming their disease. This could be achieved by interviewing survivors and then analysing the data using a qualitative analysis method such as thematic analysis to identify commonalities and differences.

Phenomenological research design typically involves in-depth interviews or open-ended questionnaires to collect rich, detailed data about participants’ subjective experiences. This richness is one of the key strengths of phenomenological research design but, naturally, it also has limitations. These include potential biases in data collection and interpretation and the lack of generalisability of findings to broader populations.

Grounded Theory Research Design

Grounded theory (also referred to as “GT”) aims to develop theories by continuously and iteratively analysing and comparing data collected from a relatively large number of participants in a study. It takes an inductive (bottom-up) approach, with a focus on letting the data “speak for itself”, without being influenced by preexisting theories or the researcher’s preconceptions.

As an example, let’s assume your research aims involved understanding how people cope with chronic pain from a specific medical condition, with a view to developing a theory around this. In this case, grounded theory design would allow you to explore this concept thoroughly without preconceptions about what coping mechanisms might exist. You may find that some patients prefer cognitive-behavioural therapy (CBT) while others prefer to rely on herbal remedies. Based on multiple, iterative rounds of analysis, you could then develop a theory in this regard, derived directly from the data (as opposed to other preexisting theories and models).

Grounded theory typically involves collecting data through interviews or observations and then analysing it to identify patterns and themes that emerge from the data. These emerging ideas are then validated by collecting more data until a saturation point is reached (i.e., no new information can be squeezed from the data). From that base, a theory can then be developed .

As you can see, grounded theory is ideally suited to studies where the research aims involve theory generation , especially in under-researched areas. Keep in mind though that this type of research design can be quite time-intensive , given the need for multiple rounds of data collection and analysis.


Ethnographic Research Design

Ethnographic design involves observing and studying a culture-sharing group of people in their natural setting to gain insight into their behaviours, beliefs, and values. The focus here is on observing participants in their natural environment (as opposed to a controlled environment). This typically involves the researcher spending an extended period of time with the participants in their environment, carefully observing and taking field notes .

All of this is not to say that ethnographic research design relies purely on observation. On the contrary, this design typically also involves in-depth interviews to explore participants’ views, beliefs, etc. However, unobtrusive observation is a core component of the ethnographic approach.

As an example, an ethnographer may study how different communities celebrate traditional festivals or how individuals from different generations interact with technology differently. This may involve a lengthy period of observation, combined with in-depth interviews to further explore specific areas of interest that emerge as a result of the observations that the researcher has made.

As you can probably imagine, ethnographic research design has the ability to provide rich, contextually embedded insights into the socio-cultural dynamics of human behaviour within a natural, uncontrived setting. Naturally, however, it does come with its own set of challenges, including researcher bias (since the researcher can become quite immersed in the group), participant confidentiality and, predictably, ethical complexities . All of these need to be carefully managed if you choose to adopt this type of research design.

Case Study Design

With case study research design, you, as the researcher, investigate a single individual (or a single group of individuals) to gain an in-depth understanding of their experiences, behaviours or outcomes. Unlike other research designs that are aimed at larger sample sizes, case studies offer a deep dive into the specific circumstances surrounding a person, group of people, event or phenomenon, generally within a bounded setting or context .

As an example, a case study design could be used to explore the factors influencing the success of a specific small business. This would involve diving deeply into the organisation to explore and understand what makes it tick – from marketing to HR to finance. In terms of data collection, this could include interviews with staff and management, review of policy documents and financial statements, surveying customers, etc.

While the above example is focused squarely on one organisation, it’s worth noting that case study research designs can have different variations, including single-case, multiple-case and longitudinal designs. As you can see in the example, a single-case design involves intensely examining a single entity to understand its unique characteristics and complexities. Conversely, in a multiple-case design, multiple cases are compared and contrasted to identify patterns and commonalities. Lastly, in a longitudinal case design, a single case or multiple cases are studied over an extended period of time to understand how factors develop over time.

As you can see, a case study research design is particularly useful where a deep and contextualised understanding of a specific phenomenon or issue is desired. However, this strength is also its weakness. In other words, you can’t generalise the findings from a case study to the broader population. So, keep this in mind if you’re considering going the case study route.

Case study design often involves investigating an individual to gain an in-depth understanding of their experiences, behaviours or outcomes.

How To Choose A Research Design

Having worked through all of these potential research designs, you’d be forgiven for feeling a little overwhelmed and wondering, “ But how do I decide which research design to use? ”. While we could write an entire post covering that alone, here are a few factors to consider that will help you choose a suitable research design for your study.

Data type: The first determining factor is naturally the type of data you plan to be collecting – i.e., qualitative or quantitative. This may sound obvious, but we have to be clear about this – don’t try to use a quantitative research design on qualitative data (or vice versa)!

Research aim(s) and question(s): As with all methodological decisions, your research aim and research questions will heavily influence your research design. For example, if your research aims involve developing a theory from qualitative data, grounded theory would be a strong option. Similarly, if your research aims involve identifying and measuring relationships between variables, one of the experimental designs would likely be a better option.

Time: It’s essential that you consider any time constraints you have, as this will impact the type of research design you can choose. For example, if you’ve only got a month to complete your project, a lengthy design such as ethnography wouldn’t be a good fit.

Resources: Take into account the resources realistically available to you, as these need to factor into your research design choice. For example, if you require highly specialised lab equipment to execute an experimental design, you need to be sure that you’ll have access to that before you make a decision.

Keep in mind that when it comes to research, it’s important to manage your risks and play as conservatively as possible. If your entire project relies on you achieving a huge sample, having access to niche equipment or holding interviews with very difficult-to-reach participants, you’re creating risks that could kill your project. So, be sure to think through your choices carefully and make sure that you have backup plans for any existential risks. Remember that a relatively simple methodology executed well will typically earn better marks than a highly complex methodology executed poorly.


Recap: Key Takeaways

We’ve covered a lot of ground here. Let’s recap by looking at the key takeaways:

  • Research design refers to the overall plan, structure or strategy that guides a research project, from its conception to the final analysis of data.
  • Research designs for quantitative studies include descriptive, correlational, experimental and quasi-experimental designs.
  • Research designs for qualitative studies include phenomenological , grounded theory , ethnographic and case study designs.
  • When choosing a research design, you need to consider a variety of factors, including the type of data you’ll be working with, your research aims and questions, your time and the resources available to you.

Your Modern Business Guide To Data Analysis Methods And Techniques


Table of Contents

1) What Is Data Analysis?

2) Why Is Data Analysis Important?

3) What Is The Data Analysis Process?

4) Types Of Data Analysis Methods

5) Top Data Analysis Techniques To Apply

6) Quality Criteria For Data Analysis

7) Data Analysis Limitations & Barriers

8) Data Analysis Skills

9) Data Analysis In The Big Data Environment

In our data-rich age, understanding how to analyze and extract true meaning from our business’s digital insights is one of the primary drivers of success.

Despite the colossal volume of data we create every day, a mere 0.5% is actually analyzed and used for data discovery , improvement, and intelligence. While that may not seem like much, considering the amount of digital information we have at our fingertips, half a percent still accounts for a vast amount of data.

With so much data and so little time, knowing how to collect, curate, organize, and make sense of all of this potentially business-boosting information can be a minefield – but online data analysis is the solution.

In science, data analysis uses a more complex approach with advanced techniques to explore and experiment with data. On the other hand, in a business context, data is used to make data-driven decisions that will enable the company to improve its overall performance. In this post, we will cover the analysis of data from an organizational point of view while still going through the scientific and statistical foundations that are fundamental to understanding the basics of data analysis. 

To put all of that into perspective, we will answer a host of important analytical questions, explore analytical methods and techniques, and demonstrate how to perform analysis in the real world with a 17-step blueprint for success.

What Is Data Analysis?

Data analysis is the process of collecting, modeling, and analyzing data using various statistical and logical methods and techniques. Businesses rely on analytics processes and tools to extract insights that support strategic and operational decision-making.

All these various methods are largely based on two core areas: quantitative and qualitative research.

To explain the key differences between qualitative and quantitative research, here’s a video for your viewing pleasure:

Gaining a better understanding of different techniques and methods in quantitative research as well as qualitative insights will give your analyzing efforts a more clearly defined direction, so it’s worth taking the time to allow this particular knowledge to sink in. Additionally, you will be able to create a comprehensive analytical report that will skyrocket your analysis.

Apart from qualitative and quantitative categories, there are also other types of data that you should be aware of before diving into complex data analysis processes. These categories include: 

  • Big data: Refers to massive data sets that need to be analyzed using advanced software to reveal patterns and trends. It is considered to be one of the best analytical assets as it provides larger volumes of data at a faster rate. 
  • Metadata: Putting it simply, metadata is data that provides insights about other data. It summarizes key information about specific data that makes it easier to find and reuse for later purposes. 
  • Real time data: As its name suggests, real time data is presented as soon as it is acquired. From an organizational perspective, this is the most valuable data as it can help you make important decisions based on the latest developments. Our guide on real time analytics will tell you more about the topic. 
  • Machine data: This is more complex data that is generated solely by a machine such as phones, computers, or even websites and embedded systems, without previous human interaction.

Why Is Data Analysis Important?

Before we go into detail about the categories of analysis along with its methods and techniques, you must understand the potential that analyzing data can bring to your organization.

  • Informed decision-making : From a management perspective, you can benefit from analyzing your data as it helps you make decisions based on facts and not simple intuition. For instance, you can understand where to invest your capital, detect growth opportunities, predict your income, or tackle uncommon situations before they become problems. Through this, you can extract relevant insights from all areas in your organization, and with the help of dashboard software , present the data in a professional and interactive way to different stakeholders.
  • Reduce costs : Another great benefit is to reduce costs. With the help of advanced technologies such as predictive analytics, businesses can spot improvement opportunities, trends, and patterns in their data and plan their strategies accordingly. In time, this will help you save money and resources on implementing the wrong strategies. And not just that, by predicting different scenarios such as sales and demand you can also anticipate production and supply. 
  • Target customers better : Customers are arguably the most crucial element in any business. By using analytics to get a 360° vision of all aspects related to your customers, you can understand which channels they use to communicate with you, their demographics, interests, habits, purchasing behaviors, and more. In the long run, it will drive success to your marketing strategies, allow you to identify new potential customers, and avoid wasting resources on targeting the wrong people or sending the wrong message. You can also track customer satisfaction by analyzing your client’s reviews or your customer service department’s performance.

What Is The Data Analysis Process?

Data analysis process graphic

When we talk about analyzing data there is an order to follow in order to extract the needed conclusions. The analysis process consists of 5 key stages. We will cover each of them more in detail later in the post, but to start providing the needed context to understand what is coming next, here is a rundown of the 5 essential steps of data analysis. 

  • Identify: Before you get your hands dirty with data, you first need to identify why you need it in the first place. The identification is the stage in which you establish the questions you will need to answer. For example, what is the customer's perception of our brand? Or what type of packaging is more engaging to our potential customers? Once the questions are outlined you are ready for the next step. 
  • Collect: As its name suggests, this is the stage where you start collecting the needed data. Here, you define which sources of data you will use and how you will use them. The collection of data can come in different forms such as internal or external sources, surveys, interviews, questionnaires, and focus groups, among others.  An important note here is that the way you collect the data will be different in a quantitative and qualitative scenario. 
  • Clean: Once you have the necessary data it is time to clean it and leave it ready for analysis. Not all the data you collect will be useful; when collecting large amounts of data in different formats, it is very likely that you will find yourself with duplicate or badly formatted records. To avoid this, before you start working with your data you need to make sure to erase any white spaces, duplicate records, or formatting errors. This way you avoid hurting your analysis with bad-quality data. 
  • Analyze : With the help of various techniques such as statistical analysis, regressions, neural networks, text analysis, and more, you can start analyzing and manipulating your data to extract relevant conclusions. At this stage, you find trends, correlations, variations, and patterns that can help you answer the questions you first thought of in the identify stage. Various technologies in the market assist researchers and average users with the management of their data. Some of them include business intelligence and visualization software, predictive analytics, and data mining, among others. 
  • Interpret: Last but not least you have one of the most important steps: it is time to interpret your results. This stage is where the researcher comes up with courses of action based on the findings. For example, here you would understand if your clients prefer packaging that is red or green, plastic or paper, etc. Additionally, at this stage, you can also find some limitations and work on them. 

Now that you have a basic understanding of the key data analysis steps, let’s look at the top 17 essential methods.

17 Essential Types Of Data Analysis Methods

Before diving into the 17 essential types of methods, it is important that we quickly go over the main analysis categories. Starting with the category of descriptive up to prescriptive analysis, the complexity and effort of data evaluation increase, but so does the added value for the company.

a) Descriptive analysis - What happened.

The descriptive analysis method is the starting point for any analytic reflection, and it aims to answer the question of what happened? It does this by ordering, manipulating, and interpreting raw data from various sources to turn it into valuable insights for your organization.

Performing descriptive analysis is essential, as it enables us to present our insights in a meaningful way. Although it is relevant to mention that this analysis on its own will not allow you to predict future outcomes or tell you the answer to questions like why something happened, it will leave your data organized and ready to conduct further investigations.

b) Exploratory analysis - How to explore data relationships.

As its name suggests, the main aim of the exploratory analysis is to explore. Prior to it, there is still no notion of the relationship between the data and the variables. Once the data is investigated, exploratory analysis helps you to find connections and generate hypotheses and solutions for specific problems. A typical area of application for it is data mining.

c) Diagnostic analysis - Why it happened.

Diagnostic data analytics empowers analysts and executives by helping them gain a firm contextual understanding of why something happened. If you know why something happened as well as how it happened, you will be able to pinpoint the exact ways of tackling the issue or challenge.

Designed to provide direct and actionable answers to specific questions, this is one of the most important methods in research, and it also serves key organizational functions in areas such as retail analytics.

d) Predictive analysis - What will happen.

The predictive method allows you to look into the future to answer the question: what will happen? In order to do this, it uses the results of the previously mentioned descriptive, exploratory, and diagnostic analysis, in addition to machine learning (ML) and artificial intelligence (AI). Through this, you can uncover future trends, potential problems or inefficiencies, connections, and causal relationships in your data.

With predictive analysis, you can unfold and develop initiatives that will not only enhance your various operational processes but also help you gain an all-important edge over the competition. If you understand why a trend, pattern, or event happened through data, you will be able to develop an informed projection of how things may unfold in particular areas of the business.

e) Prescriptive analysis - How will it happen.

Another of the most effective types of analysis methods in research, prescriptive data techniques cross over from predictive analysis in that they revolve around using patterns or trends to develop responsive, practical business strategies.

By drilling down into prescriptive analysis, you will play an active role in the data consumption process by taking well-arranged sets of visual data and using it as a powerful fix to emerging issues in a number of key areas, including marketing, sales, customer experience, HR, fulfillment, finance, logistics analytics , and others.

Top 17 data analysis methods

As mentioned at the beginning of the post, data analysis methods can be divided into two big categories: quantitative and qualitative. Each of these categories holds a powerful analytical value that changes depending on the scenario and type of data you are working with. Below, we will discuss 17 methods that are divided into qualitative and quantitative approaches. 

Without further ado, here are the 17 essential types of data analysis methods with some use cases in the business world: 

A. Quantitative Methods 

To put it simply, quantitative analysis refers to all methods that use numerical data or data that can be turned into numbers (e.g. category variables like gender, age, etc.) to extract valuable insights. It is used to draw conclusions about relationships and differences and to test hypotheses. Below we discuss some of the key quantitative methods. 

1. Cluster analysis

The action of grouping a set of data elements in a way that said elements are more similar (in a particular sense) to each other than to those in other groups – hence the term ‘cluster.’ Since there is no target variable when clustering, the method is often used to find hidden patterns in the data. The approach is also used to provide additional context to a trend or dataset.

Let's look at it from an organizational perspective. In a perfect world, marketers would be able to analyze each customer separately and give them the best personalized service, but let's face it, with a large customer base, it is practically impossible to do that. That's where clustering comes in. By grouping customers into clusters based on demographics, purchasing behaviors, monetary value, or any other factor that might be relevant for your company, you will be able to immediately optimize your efforts and give your customers the best experience based on their needs.
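To make this concrete, here is a minimal sketch of customer clustering with scikit-learn’s KMeans; the customer attributes, values, and choice of three clusters are illustrative assumptions, not a prescription for your own data:

```python
# Minimal sketch: clustering customers with scikit-learn's KMeans.
# Column names, values, and the number of clusters are illustrative assumptions.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer data: age, yearly spend, and number of orders
customers = pd.DataFrame({
    "age":          [23, 45, 31, 52, 36, 29, 61, 40],
    "yearly_spend": [200, 1500, 650, 2200, 800, 300, 2600, 1200],
    "orders":       [2, 12, 6, 18, 7, 3, 22, 10],
})

# Scale features so no single variable dominates the distance metric
scaled = StandardScaler().fit_transform(customers)

# Group customers into 3 clusters (the "right" number is usually chosen
# with the elbow method or silhouette scores)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(scaled)
customers["cluster"] = kmeans.labels_
print(customers.sort_values("cluster"))
```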

2. Cohort analysis

This type of data analysis approach uses historical data to examine and compare a determined segment of users' behavior, which can then be grouped with others with similar characteristics. By using this methodology, it's possible to gain a wealth of insight into consumer needs or a firm understanding of a broader target group.

Cohort analysis can be really useful for performing analysis in marketing as it will allow you to understand the impact of your campaigns on specific groups of customers. To exemplify, imagine you send an email campaign encouraging customers to sign up for your site. For this, you create two versions of the campaign with different designs, CTAs, and ad content. Later on, you can use cohort analysis to track the performance of the campaign for a longer period of time and understand which type of content is driving your customers to sign up, repurchase, or engage in other ways.  

A useful tool for getting started with the cohort analysis method is Google Analytics. You can learn more about the benefits and limitations of using cohorts in GA in this useful guide . In the image below, you see an example of how you can visualize a cohort in this tool. The segments (device traffic) are divided into date cohorts (usage of devices) and then analyzed week by week to extract insights into performance.

Cohort analysis chart example from google analytics
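If you prefer to work directly with raw order data rather than a ready-made tool, here is a minimal sketch of a retention-style cohort table built with pandas; the order records, column names, and monthly granularity are illustrative assumptions:

```python
# Minimal sketch of a retention-style cohort analysis in pandas.
# The order data and column names are illustrative assumptions.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 3, 3, 4],
    "order_date": pd.to_datetime([
        "2023-01-05", "2023-02-10", "2023-01-20", "2023-03-02",
        "2023-02-14", "2023-03-01", "2023-04-11", "2023-03-25",
    ]),
})

# Cohort = month of each customer's first order
orders["order_month"] = orders["order_date"].dt.to_period("M")
orders["cohort"] = orders.groupby("customer_id")["order_month"].transform("min")

# Periods elapsed since the cohort month (0 = the signup month itself)
orders["period"] = (orders["order_month"] - orders["cohort"]).apply(lambda d: d.n)

# Count distinct active customers per cohort and period
cohort_counts = (
    orders.groupby(["cohort", "period"])["customer_id"]
    .nunique()
    .unstack(fill_value=0)
)

# Retention rate relative to each cohort's initial size
retention = cohort_counts.divide(cohort_counts[0], axis=0)
print(retention.round(2))
```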

3. Regression analysis

Regression uses historical data to understand how a dependent variable's value is affected when one (linear regression) or more independent variables (multiple regression) change or stay the same. By understanding each variable's relationship and how it developed in the past, you can anticipate possible outcomes and make better decisions in the future.

Let's bring it down with an example. Imagine you did a regression analysis of your sales in 2019 and discovered that variables like product quality, store design, customer service, marketing campaigns, and sales channels affected the overall result. Now you want to use regression to analyze which of these variables changed or if any new ones appeared during 2020. For example, you couldn’t sell as much in your physical store due to COVID lockdowns. Therefore, your sales could’ve either dropped in general or increased in your online channels. Through this, you can understand which independent variables affected the overall performance of your dependent variable, annual sales.

If you want to go deeper into this type of analysis, check out this article and learn more about how you can benefit from regression.
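For readers who want to try this hands-on, below is a minimal sketch of a multiple regression with scikit-learn; the spend, price, and sales figures are made-up illustrations rather than real benchmarks:

```python
# Minimal sketch: estimating how marketing spend and price relate to sales
# with a multiple linear regression. Data and coefficients are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monthly observations: [marketing_spend, average_price]
X = np.array([
    [1000,  9.9], [1500,  9.9], [2000,  9.5], [2500,  9.5],
    [3000, 10.5], [3500, 10.5], [4000, 11.0], [4500, 11.0],
])
# Monthly sales (units) for the same months
y = np.array([120, 150, 210, 240, 255, 290, 300, 330])

model = LinearRegression().fit(X, y)
print("Coefficients (spend, price):", model.coef_)
print("Intercept:", model.intercept_)

# Predict sales for a planned month with 5000 in spend at a price of 10.0
print("Forecast:", model.predict([[5000, 10.0]])[0])
```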

4. Neural networks

The neural network forms the basis for the intelligent algorithms of machine learning. It is a form of analytics that attempts, with minimal intervention, to understand how the human brain would generate insights and predict values. Neural networks learn from each and every data transaction, meaning that they evolve and advance over time.

A typical area of application for neural networks is predictive analytics. There are BI reporting tools that have this feature implemented within them, such as the Predictive Analytics Tool from datapine. This tool enables users to quickly and easily generate all kinds of predictions. All you have to do is select the data to be processed based on your KPIs, and the software automatically calculates forecasts based on historical and current data. Thanks to its user-friendly interface, anyone in your organization can manage it; there’s no need to be an advanced scientist. 

Here is an example of how you can use the predictive analysis tool from datapine:

Example on how to use predictive analytics tool from datapine

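As a generic, tool-agnostic illustration (not the datapine feature described above), here is a minimal sketch of a small neural network regressor with scikit-learn; the revenue history and network architecture are illustrative assumptions:

```python
# Minimal sketch: a small neural network (multi-layer perceptron) that learns
# to forecast monthly revenue from spend and month number. Data and
# architecture are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Hypothetical history: [marketing_spend, month_number] -> revenue (in thousands)
X = np.array([[1000, 1], [1200, 2], [1500, 3], [1700, 4],
              [2000, 5], [2300, 6], [2500, 7], [2800, 8]])
y = np.array([10.2, 11.0, 12.5, 13.1, 14.8, 16.0, 16.9, 18.3])

# Scale inputs so the network trains smoothly
scaler = StandardScaler().fit(X)
model = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=5000, random_state=0)
model.fit(scaler.transform(X), y)

# Forecast next month's revenue for a planned spend of 3000
print(model.predict(scaler.transform([[3000, 9]])))
```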

5. Factor analysis

Factor analysis, also called “dimension reduction”, is a type of data analysis used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. The aim here is to uncover independent latent variables, an ideal method for streamlining specific segments.

A good way to understand this data analysis method is a customer evaluation of a product. The initial assessment is based on different variables like color, shape, wearability, current trends, materials, comfort, the place where they bought the product, and frequency of usage. The list can be endless, depending on what you want to track. In this case, factor analysis comes into the picture by summarizing all of these variables into homogenous groups, for example, by grouping the variables color, materials, quality, and trends into a broader latent variable of design.

If you want to start analyzing data using factor analysis we recommend you take a look at this practical guide from UCLA.
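To make the idea tangible, here is a minimal sketch using scikit-learn’s FactorAnalysis; the survey ratings and the choice of two factors are illustrative assumptions:

```python
# Minimal sketch: reducing correlated survey ratings to a smaller number of
# latent factors with scikit-learn. Ratings and the factor count are
# illustrative assumptions.
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Hypothetical 1-10 ratings for six product attributes (rows = respondents)
ratings = np.array([
    [8, 7, 8, 3, 4, 3],
    [9, 8, 9, 2, 3, 2],
    [4, 3, 4, 8, 9, 8],
    [3, 4, 3, 9, 8, 9],
    [7, 8, 7, 4, 4, 5],
    [2, 3, 2, 8, 8, 9],
])

fa = FactorAnalysis(n_components=2, random_state=0).fit(ratings)

# Loadings show how strongly each observed variable relates to each factor;
# attributes that load on the same factor can be summarized together
# (e.g. as a broader "design" variable).
print(fa.components_.round(2))
```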

6. Data mining

Data mining is an umbrella term for engineering metrics and insights for additional value, direction, and context. By using exploratory statistical evaluation, data mining aims to identify dependencies, relations, patterns, and trends to generate advanced knowledge. When considering how to analyze data, adopting a data mining mindset is essential to success - as such, it’s an area that is worth exploring in greater detail.

An excellent use case of data mining is datapine intelligent data alerts . With the help of artificial intelligence and machine learning, they provide automated signals based on particular commands or occurrences within a dataset. For example, if you’re monitoring supply chain KPIs , you could set an intelligent alarm to trigger when invalid or low-quality data appears. By doing so, you will be able to drill down deep into the issue and fix it swiftly and effectively.

In the following picture, you can see how the intelligent alarms from datapine work. By setting up ranges on daily orders, sessions, and revenues, the alarms will notify you if the goal was not completed or if it exceeded expectations.

Example on how to use intelligent alerts from datapine

7. Time series analysis

As its name suggests, time series analysis is used to analyze a set of data points collected over a specified period of time. Although analysts use this method to monitor the data points in a specific interval of time rather than just monitoring them intermittently, time series analysis is not used solely for the purpose of collecting data over time. Instead, it allows researchers to understand if variables changed during the duration of the study, how the different variables depend on one another, and how the data reached its end result. 

In a business context, this method is used to understand the causes of different trends and patterns to extract valuable insights. Another way of using this method is with the help of time series forecasting. Powered by predictive technologies, businesses can analyze various data sets over a period of time and forecast different future events. 

A great use case to put time series analysis into perspective is seasonality effects on sales. By using time series forecasting to analyze sales data of a specific product over time, you can understand if sales rise over a specific period of time (e.g. swimwear during summertime, or candy during Halloween). These insights allow you to predict demand and prepare production accordingly.  
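As a hands-on illustration, the sketch below decomposes a made-up monthly sales series into trend and seasonal components with statsmodels; the figures and the 12-month seasonal period are illustrative assumptions:

```python
# Minimal sketch: decomposing monthly sales into trend and seasonal parts
# with statsmodels. The sales figures are illustrative assumptions.
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Three years of hypothetical monthly sales with a summer peak
sales = pd.Series(
    [100, 105, 120, 140, 170, 210, 240, 230, 180, 140, 115, 105] * 3,
    index=pd.date_range("2021-01-01", periods=36, freq="MS"),
)

# Split the series into trend, seasonal, and residual components
result = seasonal_decompose(sales, model="additive", period=12)

# The seasonal component shows how much each month typically deviates
# from the trend, which is useful for demand planning
print(result.seasonal.head(12).round(1))
```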

8. Decision Trees 

The decision tree analysis aims to act as a support tool to make smart and strategic decisions. By visually displaying potential outcomes, consequences, and costs in a tree-like model, researchers and company users can easily evaluate all factors involved and choose the best course of action. Decision trees are helpful to analyze quantitative data and they allow for an improved decision-making process by helping you spot improvement opportunities, reduce costs, and enhance operational efficiency and production.

But how does a decision tree actually work? This method works like a flowchart that starts with the main decision that you need to make and branches out based on the different outcomes and consequences of each decision. Each outcome will outline its own consequences, costs, and gains and, at the end of the analysis, you can compare each of them and make the smartest decision. 

Businesses can use them to understand which project is more cost-effective and will bring more earnings in the long run. For example, imagine you need to decide if you want to update your software app or build a new app entirely.  Here you would compare the total costs, the time needed to be invested, potential revenue, and any other factor that might affect your decision.  In the end, you would be able to see which of these two options is more realistic and attainable for your company or research.
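For a quick, hands-on flavor of the idea, here is a minimal sketch that trains a small decision tree with scikit-learn; the project features, labels, and thresholds are illustrative assumptions:

```python
# Minimal sketch: a decision tree that classifies whether a project option is
# worth pursuing based on cost, time, and expected revenue. Training data and
# labels are illustrative assumptions.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [estimated_cost, months_needed, expected_revenue]
X = [
    [50, 3, 120], [200, 12, 180], [80, 4, 160], [300, 18, 250],
    [60, 2, 90],  [150, 9, 400],  [40, 2, 30],  [250, 14, 600],
]
# Label: 1 = pursued and paid off, 0 = did not pay off
y = [1, 0, 1, 0, 1, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned rules as a readable, flowchart-like structure
print(export_text(tree, feature_names=["cost", "months", "revenue"]))

# Score a new option: build a new app for 220 over 10 months, expecting 500
print(tree.predict([[220, 10, 500]]))
```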

9. Conjoint analysis 

Last but not least, we have the conjoint analysis. This approach is usually used in surveys to understand how individuals value different attributes of a product or service and it is one of the most effective methods to extract consumer preferences. When it comes to purchasing, some clients might be more price-focused, others more features-focused, and others might have a sustainable focus. Whatever your customer's preferences are, you can find them with conjoint analysis. Through this, companies can define pricing strategies, packaging options, subscription packages, and more. 

A great example of conjoint analysis is in marketing and sales. For instance, a cupcake brand might use conjoint analysis and find that its clients prefer gluten-free options and cupcakes with healthier toppings over super sugary ones. Thus, the cupcake brand can turn these insights into advertisements and promotions to increase sales of this particular type of product. And not just that, conjoint analysis can also help businesses segment their customers based on their interests. This allows them to send different messaging that will bring value to each of the segments. 
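One common way to estimate part-worth utilities from rating-based conjoint data is a dummy-coded linear regression; the sketch below illustrates that approach with made-up cupcake profiles (the attributes, levels, and ratings are illustrative assumptions, not survey results):

```python
# Minimal sketch: estimating part-worth utilities from rating-based conjoint
# data with a dummy-coded linear regression. Profiles and ratings are
# illustrative assumptions.
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical cupcake profiles rated by respondents (1-10)
profiles = pd.DataFrame({
    "topping": ["sugary", "healthy", "sugary", "healthy", "sugary", "healthy"],
    "base":    ["regular", "regular", "gluten_free", "gluten_free", "regular", "gluten_free"],
    "price":   ["low", "high", "low", "high", "high", "low"],
    "rating":  [6, 7, 5, 9, 4, 10],
})

# One-hot encode the attribute levels
X = pd.get_dummies(profiles.drop(columns="rating"))
y = profiles["rating"]

model = LinearRegression().fit(X, y)

# Each coefficient approximates how much an attribute level adds to (or
# subtracts from) the overall preference score
for name, coef in zip(X.columns, model.coef_):
    print(f"{name:25s} {coef:+.2f}")
```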

10. Correspondence Analysis

Also known as reciprocal averaging, correspondence analysis is a method used to analyze the relationship between categorical variables presented within a contingency table. A contingency table is a table that displays two (simple correspondence analysis) or more (multiple correspondence analysis) categorical variables across rows and columns that show the distribution of the data, which is usually answers to a survey or questionnaire on a specific topic. 

This method starts by calculating an “expected value” for each cell, which is done by multiplying the row total by the column total and dividing by the grand total of the table. The “expected value” is then subtracted from the original value, resulting in a “residual number”, which is what allows you to extract conclusions about relationships and distribution. The results of this analysis are later displayed using a map that represents the relationship between the different values. The closer two values are on the map, the stronger the relationship. Let’s put it into perspective with an example. 

Imagine you are carrying out a market research analysis about outdoor clothing brands and how they are perceived by the public. For this analysis, you ask a group of people to match each brand with a certain attribute which can be durability, innovation, quality materials, etc. When calculating the residual numbers, you can see that brand A has a positive residual for innovation but a negative one for durability. This means that brand A is not positioned as a durable brand in the market, something that competitors could take advantage of. 
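The expected-value and residual step can be reproduced in a few lines; the sketch below does so for a hypothetical brand-by-attribute contingency table (the counts are illustrative assumptions):

```python
# Minimal sketch of the first step of a correspondence analysis: computing
# expected values and residuals for a brand-by-attribute contingency table.
# The counts are illustrative assumptions.
import numpy as np
import pandas as pd

# Survey counts: how often each brand was matched with each attribute
table = pd.DataFrame(
    [[40, 10, 25],
     [15, 35, 20],
     [20, 25, 40]],
    index=["Brand A", "Brand B", "Brand C"],
    columns=["innovation", "durability", "quality materials"],
)

observed = table.to_numpy(dtype=float)
grand_total = observed.sum()

# Expected value for each cell: row total * column total / grand total
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / grand_total

# Positive residuals mean a brand is associated with an attribute more often
# than expected; negative residuals mean less often than expected
residuals = pd.DataFrame(observed - expected, index=table.index, columns=table.columns)
print(residuals.round(1))
```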

11. Multidimensional Scaling (MDS)

MDS is a method used to observe the similarities or disparities between objects, which can be colors, brands, people, geographical coordinates, and more. The objects are plotted using an “MDS map” that positions similar objects together and disparate ones far apart. The (dis)similarities between objects are represented using one or more dimensions that can be observed using a numerical scale. For example, if you want to know how people feel about the COVID-19 vaccine, you can use 1 for “don’t believe in the vaccine at all” and 10 for “firmly believe in the vaccine”, with 2 to 9 for in-between responses. When analyzing an MDS map, the only thing that matters is the distance between the objects; the orientation of the dimensions is arbitrary and has no meaning at all. 

Multidimensional scaling is a valuable technique for market research, especially when it comes to evaluating product or brand positioning. For instance, if a cupcake brand wants to know how they are positioned compared to competitors, it can define 2-3 dimensions such as taste, ingredients, shopping experience, or more, and do a multidimensional scaling analysis to find improvement opportunities as well as areas in which competitors are currently leading. 

Another business example is in procurement when deciding on different suppliers. Decision makers can generate an MDS map to see how the different prices, delivery times, technical services, and more of the different suppliers differ and pick the one that suits their needs the best. 

A final example comes from a research paper on "An Improved Study of Multilevel Semantic Network Visualization for Analyzing Sentiment Word of Movie Review Data". Researchers picked a two-dimensional MDS map to display the distances and relationships between different sentiments in movie reviews. They used 36 sentiment words and distributed them based on their emotional distance, as we can see in the image below, where the words "outraged" and "sweet" are on opposite sides of the map, marking the distance between the two emotions very clearly.

Example of multidimensional scaling analysis

Aside from being a valuable technique to analyze dissimilarities, MDS also serves as a dimension-reduction technique for large dimensional data. 
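As a minimal, hands-on sketch, the snippet below maps four hypothetical brands onto a two-dimensional MDS map from a precomputed dissimilarity matrix using scikit-learn; the brands and scores are illustrative assumptions:

```python
# Minimal sketch: plotting brands on an MDS map from a dissimilarity matrix.
# The dissimilarity scores are illustrative assumptions.
import numpy as np
from sklearn.manifold import MDS

brands = ["Brand A", "Brand B", "Brand C", "Brand D"]

# Symmetric matrix of perceived dissimilarity between brands (0 = identical)
dissimilarity = np.array([
    [0.0, 2.0, 6.0, 5.0],
    [2.0, 0.0, 5.0, 6.0],
    [6.0, 5.0, 0.0, 1.5],
    [5.0, 6.0, 1.5, 0.0],
])

# dissimilarity="precomputed" tells MDS to use our matrix directly
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissimilarity)

# Brands that land close together are perceived as similar; only the
# distances matter, not the orientation of the axes
for brand, (x, y) in zip(brands, coords):
    print(f"{brand}: ({x:.2f}, {y:.2f})")
```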

B. Qualitative Methods

Qualitative data analysis methods are defined as the observation of non-numerical data that is gathered and produced using methods of observation such as interviews, focus groups, questionnaires, and more. As opposed to quantitative methods, qualitative data is more subjective and highly valuable in analyzing customer retention and product development.

12. Text analysis

Text analysis, also known in the industry as text mining, works by taking large sets of textual data and arranging them in a way that makes it easier to manage. By working through this cleansing process in stringent detail, you will be able to extract the data that is truly relevant to your organization and use it to develop actionable insights that will propel you forward.

Modern software accelerates the application of text analytics. Thanks to the combination of machine learning and intelligent algorithms, you can perform advanced analytical processes such as sentiment analysis. This technique allows you to understand the intentions and emotions of a text, for example, whether it's positive, negative, or neutral, and then give it a score depending on certain factors and categories that are relevant to your brand. Sentiment analysis is often used to monitor brand and product reputation and to understand how successful your customer experience is. To learn more about the topic check out this insightful article.

By analyzing data from various word-based sources, including product reviews, articles, social media communications, and survey responses, you will gain invaluable insights into your audience, as well as their needs, preferences, and pain points. This will allow you to create campaigns, services, and communications that meet your prospects’ needs on a personal level, growing your audience while boosting customer retention. There are various other “sub-methods” that are an extension of text analysis. Each of them serves a more specific purpose and we will look at them in detail next. 
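To show the basic mechanics, here is a deliberately simple lexicon-based sentiment sketch; the word lists and reviews are illustrative assumptions, and real-world text analysis typically relies on trained models or dedicated libraries rather than hand-built lexicons:

```python
# Minimal sketch of a lexicon-based sentiment score for customer reviews.
# The word lists and reviews are illustrative assumptions.
POSITIVE = {"great", "love", "excellent", "fast", "friendly", "recommend"}
NEGATIVE = {"bad", "slow", "broken", "terrible", "disappointed", "refund"}

def sentiment_score(text: str) -> int:
    """Return positive-minus-negative word counts for a piece of text."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

reviews = [
    "Great product, fast delivery, would recommend",
    "Terrible experience, the item arrived broken and I want a refund",
    "It works as described",
]

for review in reviews:
    score = sentiment_score(review)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    print(f"{label:8s} ({score:+d})  {review}")
```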

13. Content Analysis

This is a straightforward and very popular method that examines the presence and frequency of certain words, concepts, and subjects in different content formats such as text, image, audio, or video. For example, the number of times the name of a celebrity is mentioned on social media or online tabloids. It does this by coding text data that is later categorized and tabulated in a way that can provide valuable insights, making it the perfect mix of quantitative and qualitative analysis.

There are two types of content analysis. The first one is the conceptual analysis which focuses on explicit data, for instance, the number of times a concept or word is mentioned in a piece of content. The second one is relational analysis, which focuses on the relationship between different concepts or words and how they are connected within a specific context. 

Content analysis is often used by marketers to measure brand reputation and customer behavior, for example, by analyzing customer reviews. It can also be used to analyze customer interviews and find directions for new product development. It is also important to note that, in order to extract the maximum potential out of this analysis method, it is necessary to have a clearly defined research question. 
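As a minimal sketch of a conceptual content analysis, the snippet below counts how often predefined concepts appear in a small set of reviews; the coding scheme and reviews are illustrative assumptions:

```python
# Minimal sketch of a conceptual content analysis: counting how often coded
# concepts appear across customer reviews. Reviews and concept keywords are
# illustrative assumptions.
import re
from collections import Counter

reviews = [
    "Delivery was fast but the packaging felt cheap",
    "Love the packaging, delivery took a while though",
    "Great quality and fast delivery",
]

# Coding scheme: map each concept to the keywords that signal it
concepts = {
    "delivery": {"delivery", "shipping"},
    "packaging": {"packaging", "box"},
    "quality": {"quality", "cheap"},
}

counts = Counter()
for review in reviews:
    words = set(re.findall(r"[a-z]+", review.lower()))
    for concept, keywords in concepts.items():
        if words & keywords:
            counts[concept] += 1

# Frequency of each concept across the sample of reviews
print(counts.most_common())
```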

14. Thematic Analysis

Very similar to content analysis, thematic analysis also helps in identifying and interpreting patterns in qualitative data, with the main difference being that the former can also be applied to quantitative analysis. The thematic method analyzes large pieces of text data such as focus group transcripts or interviews and groups them into themes or categories that come up frequently within the text. It is a great method when trying to figure out people's views and opinions about a certain topic. For example, if you are a brand that cares about sustainability, you can do a survey of your customers to analyze their views and opinions about sustainability and how they apply it to their lives. You can also analyze customer service call transcripts to find common issues and improve your service. 

Thematic analysis is a very subjective technique that relies on the researcher’s judgment. Therefore,  to avoid biases, it has 6 steps that include familiarization, coding, generating themes, reviewing themes, defining and naming themes, and writing up. It is also important to note that, because it is a flexible approach, the data can be interpreted in multiple ways and it can be hard to select what data is more important to emphasize. 

15. Narrative Analysis 

A bit more complex in nature than the two previous ones, narrative analysis is used to explore the meaning behind the stories that people tell and most importantly, how they tell them. By looking into the words that people use to describe a situation you can extract valuable conclusions about their perspective on a specific topic. Common sources for narrative data include autobiographies, family stories, opinion pieces, and testimonials, among others. 

From a business perspective, narrative analysis can be useful to analyze customer behaviors and feelings towards a specific product, service, feature, or others. It provides unique and deep insights that can be extremely valuable. However, it has some drawbacks.  

The biggest weakness of this method is that the sample sizes are usually very small due to the complexity and time-consuming nature of the collection of narrative data. Plus, the way a subject tells a story will be significantly influenced by his or her specific experiences, making it very hard to replicate in a subsequent study. 

16. Discourse Analysis

Discourse analysis is used to understand the meaning behind any type of written, verbal, or symbolic discourse based on its political, social, or cultural context. It mixes the analysis of languages and situations together. This means that the way the content is constructed and the meaning behind it is significantly influenced by the culture and society it takes place in. For example, if you are analyzing political speeches you need to consider different context elements such as the politician's background, the current political context of the country, the audience to which the speech is directed, and so on. 

From a business point of view, discourse analysis is a great market research tool. It allows marketers to understand how the norms and ideas of the specific market work and how their customers relate to those ideas. It can be very useful to build a brand mission or develop a unique tone of voice. 

17. Grounded Theory Analysis

Traditionally, researchers decide on a method and hypothesis and start to collect the data to prove that hypothesis. Grounded theory, in contrast, doesn't require an initial research question or hypothesis, as its value lies in the generation of new theories. With the grounded theory method, you can go into the analysis process with an open mind and explore the data to generate new theories through tests and revisions. In fact, it is not necessary to collect all the data first and only then start to analyze it. Researchers usually start to find valuable insights as they are gathering the data. 

All of these elements make grounded theory a very valuable method as theories are fully backed by data instead of initial assumptions. It is a great technique to analyze poorly researched topics or find the causes behind specific company outcomes. For example, product managers and marketers might use the grounded theory to find the causes of high levels of customer churn and look into customer surveys and reviews to develop new theories about the causes. 

How To Analyze Data? Top 17 Data Analysis Techniques To Apply

17 top data analysis techniques by datapine

Now that we’ve answered the questions “what is data analysis?” and “why is it important?”, and covered the different data analysis types, it’s time to dig deeper into how to perform your analysis by working through these 17 essential techniques.

1. Collaborate your needs

Before you begin analyzing or drilling down into any techniques, it’s crucial to sit down collaboratively with all key stakeholders within your organization, decide on your primary campaign or strategic goals, and gain a fundamental understanding of the types of insights that will best benefit your progress or provide you with the level of vision you need to evolve your organization.

2. Establish your questions

Once you’ve outlined your core objectives, you should consider which questions will need answering to help you achieve your mission. This is one of the most important techniques as it will shape the very foundations of your success.

To help you get this right and ensure your data works for you, you have to ask the right data analysis questions .

3. Data democratization

After giving your data analytics methodology some real direction, and knowing which questions need answering to extract optimum value from the information available to your organization, you should continue with democratization.

Data democratization is an action that aims to connect data from various sources efficiently and quickly so that anyone in your organization can access it at any given moment. You can extract data in text, images, videos, numbers, or any other format. And then perform cross-database analysis to achieve more advanced insights to share with the rest of the company interactively.  

Once you have decided on your most valuable sources, you need to take all of this into a structured format to start collecting your insights. For this purpose, datapine offers an easy all-in-one data connectors feature to integrate all your internal and external sources and manage them at your will. Additionally, datapine’s end-to-end solution automatically updates your data, allowing you to save time and focus on performing the right analysis to grow your company.

data connectors from datapine

4. Think of governance 

When collecting data in a business or research context you always need to think about security and privacy. With data breaches becoming a topic of concern for businesses, the need to protect your client's or subject’s sensitive information becomes critical. 

To ensure that all this is taken care of, you need to think of a data governance strategy. According to Gartner , this concept refers to “ the specification of decision rights and an accountability framework to ensure the appropriate behavior in the valuation, creation, consumption, and control of data and analytics .” In simpler words, data governance is a collection of processes, roles, and policies, that ensure the efficient use of data while still achieving the main company goals. It ensures that clear roles are in place for who can access the information and how they can access it. In time, this not only ensures that sensitive information is protected but also allows for an efficient analysis as a whole. 

5. Clean your data

After harvesting data from so many sources you will be left with a vast amount of information that can be overwhelming to deal with. At the same time, you may be faced with incorrect data that can be misleading to your analysis. The smartest thing you can do to avoid dealing with this in the future is to clean the data. This is fundamental before visualizing it, as it will ensure that the insights you extract from it are correct.

There are many things that you need to look for in the cleaning process. The most important one is to eliminate any duplicate observations; these usually appear when using multiple internal and external sources of information. You can also add any missing codes, fix empty fields, and eliminate incorrectly formatted data.

Another usual form of cleaning is done with text data. As we mentioned earlier, most companies today analyze customer reviews, social media comments, questionnaires, and several other text inputs. In order for algorithms to detect patterns, text data needs to be revised to avoid invalid characters or any syntax or spelling errors. 

Most importantly, the aim of cleaning is to prevent you from arriving at false conclusions that can damage your company in the long run. By using clean data, you will also help BI solutions to interact better with your information and create better reports for your organization.
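To illustrate what these steps look like in practice, here is a minimal cleaning sketch with pandas; the column names, values, and specific fixes are illustrative assumptions:

```python
# Minimal sketch of common cleaning steps with pandas: trimming whitespace,
# normalizing formats, dropping missing keys, and removing duplicates.
# Column names and values are illustrative assumptions.
import pandas as pd

raw = pd.DataFrame({
    "customer": ["  Alice ", "Bob", "Bob", "Carol", None],
    "country":  ["us", "US ", "US ", "DE", "de"],
    "revenue":  ["1,200", "850", "850", "930", "410"],
})

clean = (
    raw
    .assign(
        customer=raw["customer"].str.strip(),              # remove stray whitespace
        country=raw["country"].str.strip().str.upper(),    # normalize formatting
        revenue=raw["revenue"].str.replace(",", "").astype(float),  # fix number format
    )
    .dropna(subset=["customer"])    # drop records missing a key field
    .drop_duplicates()              # remove duplicate observations
)

print(clean)
```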

6. Set your KPIs

Once you’ve set your sources, cleaned your data, and established clear-cut questions you want your insights to answer, you need to set a host of key performance indicators (KPIs) that will help you track, measure, and shape your progress in a number of key areas.

KPIs are critical to both qualitative and quantitative analysis research. This is one of the primary methods of data analysis you certainly shouldn’t overlook.

To help you set the best possible KPIs for your initiatives and activities, here is an example of a relevant logistics KPI : transportation-related costs. If you want to see more go explore our collection of key performance indicator examples .

Transportation costs logistics KPIs

7. Omit useless data

Having bestowed your data analysis tools and techniques with true purpose and defined your mission, you should explore the raw data you’ve collected from all sources and use your KPIs as a reference for chopping out any information you deem to be useless.

Trimming the informational fat is one of the most crucial methods of analysis as it will allow you to focus your analytical efforts and squeeze every drop of value from the remaining ‘lean’ information.

Any stats, facts, figures, or metrics that don’t align with your business goals or fit with your KPI management strategies should be eliminated from the equation.

8. Build a data management roadmap

While, at this point, this particular step is optional (you will have already gained a wealth of insight and formed a fairly sound strategy by now), creating a data governance roadmap will help your data analysis methods and techniques become successful on a more sustainable basis. These roadmaps, if developed properly, are also built so they can be tweaked and scaled over time.

Invest ample time in developing a roadmap that will help you store, manage, and handle your data internally, and you will make your analysis techniques all the more fluid and functional – one of the most powerful types of data analysis methods available today.

9. Integrate technology

There are many ways to analyze data, but one of the most vital aspects of analytical success in a business context is integrating the right decision support software and technology.

Robust analysis platforms will not only allow you to pull critical data from your most valuable sources while working with dynamic KPIs that will offer you actionable insights; they will also present the data in a digestible, visual, interactive format from one central, live dashboard . A data methodology you can count on.

By integrating the right technology within your data analysis methodology, you’ll avoid fragmenting your insights, saving you time and effort while allowing you to enjoy the maximum value from your business’s most valuable insights.

For a look at the power of software for the purpose of analysis and to enhance your methods of analyzing, glance over our selection of dashboard examples .

10. Answer your questions

By considering each of the above efforts, working with the right technology, and fostering a cohesive internal culture where everyone buys into the different ways to analyze data as well as the power of digital intelligence, you will swiftly start to answer your most burning business questions. Arguably, the best way to make your data concepts accessible across the organization is through data visualization.

11. Visualize your data

Online data visualization is a powerful tool as it lets you tell a story with your metrics, allowing users across the organization to extract meaningful insights that aid business evolution – and it covers all the different ways to analyze data.

The purpose of analyzing is to make your entire organization more informed and intelligent, and with the right platform or dashboard, this is simpler than you think, as demonstrated by our marketing dashboard .

An executive dashboard example showcasing high-level marketing KPIs such as cost per lead, MQL, SQL, and cost per customer.

This visual, dynamic, and interactive online dashboard is a data analysis example designed to give Chief Marketing Officers (CMO) an overview of relevant metrics to help them understand if they achieved their monthly goals.

In detail, this example generated with a modern dashboard creator displays interactive charts for monthly revenues, costs, net income, and net income per customer; all of them are compared with the previous month so that you can understand how the data fluctuated. In addition, it shows a detailed summary of the number of users, customers, SQLs, and MQLs per month to visualize the whole picture and extract relevant insights or trends for your marketing reports .

The CMO dashboard is perfect for c-level management as it can help them monitor the strategic outcome of their marketing efforts and make data-driven decisions that can benefit the company exponentially.

12. Be careful with the interpretation

We already dedicated an entire post to data interpretation as it is a fundamental part of the process of data analysis. It gives meaning to the analytical information and aims to drive a concise conclusion from the analysis results. Since most of the time companies are dealing with data from many different sources, the interpretation stage needs to be done carefully and properly in order to avoid misinterpretations. 

To help you through the process, here we list three common practices that you need to avoid at all costs when looking at your data:

  • Correlation vs. causation: The human brain is wired to find patterns. This behavior leads to one of the most common mistakes when performing interpretation: confusing correlation with causation. Although these two aspects can exist simultaneously, it is not correct to assume that because two things happened together, one provoked the other. A piece of advice to avoid falling into this mistake is never to trust intuition alone; trust the data. If there is no objective evidence of causation, then always stick to correlation. 
  • Confirmation bias: This phenomenon describes the tendency to select and interpret only the data necessary to prove one hypothesis, often ignoring the elements that might disprove it. Even if it's not done on purpose, confirmation bias can represent a real problem, as excluding relevant information can lead to false conclusions and, therefore, bad business decisions. To avoid it, always try to disprove your hypothesis instead of proving it, share your analysis with other team members, and avoid drawing any conclusions before the entire analytical project is finalized.
  • Statistical significance: To put it in short words, statistical significance helps analysts understand if a result is actually accurate or if it happened because of a sampling error or pure chance. The level of statistical significance needed might depend on the sample size and the industry being analyzed. In any case, ignoring the significance of a result when it might influence decision-making can be a huge mistake; a minimal significance check is sketched after this list.
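As a minimal illustration of checking significance, the sketch below runs a two-sample t-test with SciPy; the page load times and the 0.05 threshold are illustrative assumptions, and the appropriate test and threshold depend on your data and context:

```python
# Minimal sketch: checking whether a difference between two groups is likely
# to be more than chance with a two-sample t-test. The data and the 0.05
# threshold are illustrative assumptions.
from scipy import stats

# Hypothetical page load times (seconds) for two versions of a checkout page
version_a = [3.1, 2.9, 3.4, 3.2, 3.0, 3.3, 3.1, 2.8]
version_b = [2.6, 2.8, 2.5, 2.7, 2.9, 2.6, 2.7, 2.5]

t_stat, p_value = stats.ttest_ind(version_a, version_b)

# A small p-value suggests the observed difference is unlikely to be a
# sampling fluke; the right threshold depends on your context
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("Likely significant" if p_value < 0.05 else "Could be chance")
```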

13. Build a narrative

Now, we’re going to look at how you can bring all of these elements together in a way that will benefit your business - starting with a little something called data storytelling.

The human brain responds incredibly well to strong stories or narratives. Once you’ve cleansed, shaped, and visualized your most invaluable data using various BI dashboard tools , you should strive to tell a story - one with a clear-cut beginning, middle, and end.

By doing so, you will make your analytical efforts more accessible, digestible, and universal, empowering more people within your organization to use your discoveries to their actionable advantage.

14. Consider autonomous technology

Autonomous technologies, such as artificial intelligence (AI) and machine learning (ML), play a significant role in the advancement of understanding how to analyze data more effectively.

Gartner predicts that by the end of this year, 80% of emerging technologies will be developed with AI foundations. This is a testament to the ever-growing power and value of autonomous technologies.

At the moment, these technologies are revolutionizing the analysis industry. Some examples that we mentioned earlier are neural networks, intelligent alarms, and sentiment analysis.

15. Share the load

If you work with the right tools and dashboards, you will be able to present your metrics in a digestible, value-driven format, allowing almost everyone in the organization to connect with and use relevant data to their advantage.

Modern dashboards consolidate data from various sources, providing access to a wealth of insights in one centralized location, no matter if you need to monitor recruitment metrics or generate reports that need to be sent across numerous departments. Moreover, these cutting-edge tools offer access to dashboards from a multitude of devices, meaning that everyone within the business can connect with practical insights remotely - and share the load.

Once everyone is able to work with a data-driven mindset, you will catalyze the success of your business in ways you never thought possible. And when it comes to knowing how to analyze data, this kind of collaborative approach is essential.

16. Data analysis tools

In order to perform high-quality analysis of data, it is fundamental to use tools and software that will ensure the best results. Here we leave you a small summary of four fundamental categories of data analysis tools for your organization.

  • Business Intelligence: BI tools allow you to process significant amounts of data from several sources in any format. Through this, you can not only analyze and monitor your data to extract relevant insights but also create interactive reports and dashboards to visualize your KPIs and use them for your company's good. datapine is an amazing online BI software that is focused on delivering powerful online analysis features that are accessible to beginner and advanced users. Like this, it offers a full-service solution that includes cutting-edge analysis of data, KPIs visualization, live dashboards, reporting, and artificial intelligence technologies to predict trends and minimize risk.
  • Statistical analysis: These tools are usually designed for scientists, statisticians, market researchers, and mathematicians, as they allow them to perform complex statistical analyses with methods like regression analysis, predictive analysis, and statistical modeling. A good tool to perform this type of analysis is R-Studio as it offers a powerful data modeling and hypothesis testing feature that can cover both academic and general data analysis. This tool is one of the favorite ones in the industry, due to its capability for data cleaning, data reduction, and performing advanced analysis with several statistical methods. Another relevant tool to mention is SPSS from IBM. The software offers advanced statistical analysis for users of all skill levels. Thanks to a vast library of machine learning algorithms, text analysis, and a hypothesis testing approach it can help your company find relevant insights to drive better decisions. SPSS also works as a cloud service that enables you to run it anywhere.
  • SQL Consoles: SQL is a programming language often used to handle structured data in relational databases. Tools like these are popular among data scientists as they are extremely effective in unlocking these databases' value. Undoubtedly, one of the most used SQL software in the market is MySQL Workbench . This tool offers several features such as a visual tool for database modeling and monitoring, complete SQL optimization, administration tools, and visual performance dashboards to keep track of KPIs.
  • Data Visualization: These tools are used to represent your data through charts, graphs, and maps that allow you to find patterns and trends in the data. datapine's already mentioned BI platform also offers a wealth of powerful online data visualization tools with several benefits. Some of them include: delivering compelling data-driven presentations to share with your entire company, the ability to see your data online with any device wherever you are, an interactive dashboard design feature that enables you to showcase your results in an interactive and understandable way, and to perform online self-service reports that can be used simultaneously with several other people to enhance team productivity.

17. Refine your process constantly 

Last is a step that might seem obvious to some people, but it can be easily ignored if you think you are done. Once you have extracted the needed results, you should always take a retrospective look at your project and think about what you can improve. As you saw throughout this long list of techniques, data analysis is a complex process that requires constant refinement. For this reason, you should always go one step further and keep improving. 

Quality Criteria For Data Analysis

So far we’ve covered a list of methods and techniques that should help you perform efficient data analysis. But how do you measure the quality and validity of your results? This is done with the help of some science quality criteria. Here we will go into a more theoretical area that is critical to understanding the fundamentals of statistical analysis in science. However, you should also be aware of these steps in a business context, as they will allow you to assess the quality of your results in the correct way. Let’s dig in. 

  • Internal validity: The results of a survey are internally valid if they measure what they are supposed to measure and thus provide credible results. In other words , internal validity measures the trustworthiness of the results and how they can be affected by factors such as the research design, operational definitions, how the variables are measured, and more. For instance, imagine you are doing an interview to ask people if they brush their teeth two times a day. While most of them will answer yes, you can still notice that their answers correspond to what is socially acceptable, which is to brush your teeth at least twice a day. In this case, you can’t be 100% sure if respondents actually brush their teeth twice a day or if they just say that they do, therefore, the internal validity of this interview is very low. 
  • External validity: Essentially, external validity refers to the extent to which the results of your research can be applied to a broader context. It basically aims to prove that the findings of a study can be applied in the real world. If the research can be applied to other settings, individuals, and times, then the external validity is high. 
  • Reliability : If your research is reliable, it means that it can be reproduced. If your measurement were repeated under the same conditions, it would produce similar results. This means that your measuring instrument consistently produces reliable results. For example, imagine a doctor building a symptoms questionnaire to detect a specific disease in a patient. Then, various other doctors use this questionnaire but end up diagnosing the same patient with a different condition. This means the questionnaire is not reliable in detecting the initial disease. Another important note here is that in order for your research to be reliable, it also needs to be objective. If the results of a study are the same, independent of who assesses them or interprets them, the study can be considered reliable. Let’s see the objectivity criteria in more detail now. 
  • Objectivity: In data science, objectivity means that the researcher needs to stay fully objective when it comes to its analysis. The results of a study need to be affected by objective criteria and not by the beliefs, personality, or values of the researcher. Objectivity needs to be ensured when you are gathering the data, for example, when interviewing individuals, the questions need to be asked in a way that doesn't influence the results. Paired with this, objectivity also needs to be thought of when interpreting the data. If different researchers reach the same conclusions, then the study is objective. For this last point, you can set predefined criteria to interpret the results to ensure all researchers follow the same steps. 

The discussed quality criteria cover mostly potential influences in a quantitative context. Analysis in qualitative research has by default additional subjective influences that must be controlled in a different way. Therefore, there are other quality criteria for this kind of research such as credibility, transferability, dependability, and confirmability. You can see each of them more in detail on this resource . 

Data Analysis Limitations & Barriers

Analyzing data is not an easy task. As you’ve seen throughout this post, there are many steps and techniques that you need to apply in order to extract useful information from your research. While a well-performed analysis can bring various benefits to your organization it doesn't come without limitations. In this section, we will discuss some of the main barriers you might encounter when conducting an analysis. Let’s see them more in detail. 

  • Lack of clear goals: No matter how good your data or analysis might be, if you don’t have clear goals or a hypothesis, the process might be worthless. While we mentioned some methods that don’t require a predefined hypothesis, it is always better to enter the analytical process with clear guidelines about what you expect to get out of it, especially in a business context in which data is used to support important strategic decisions. 
  • Objectivity: Arguably one of the biggest barriers when it comes to data analysis in research is to stay objective. When trying to prove a hypothesis, researchers might find themselves, intentionally or unintentionally, directing the results toward an outcome that they want. To avoid this, always question your assumptions and avoid confusing facts with opinions. You can also show your findings to a research partner or external person to confirm that your results are objective. 
  • Data representation: A fundamental part of the analytical procedure is the way you represent your data. You can use various graphs and charts to represent your findings, but not all of them will work for all purposes. Choosing the wrong visual can not only damage your analysis but also mislead your audience; therefore, it is important to understand when to use each type of chart depending on your analytical goals. Our complete guide on the types of graphs and charts lists 20 different visuals with examples of when to use them. 
  • Flawed correlation : Misleading statistics can significantly damage your research. We’ve already pointed out a few interpretation issues earlier in the post, but this barrier is important enough to address here as well. Flawed correlations occur when two variables appear related to each other but actually are not. Confusing correlation with causation can lead to misinterpreted results, flawed strategies, and wasted resources; therefore, it is very important to recognize these interpretation mistakes and avoid them. 
  • Sample size: A very common barrier to a reliable and efficient analysis process is the sample size. In order for the results to be trustworthy, the sample size should be representative of what you are analyzing. For example, imagine you have a company of 1,000 employees and you ask the question “do you like working here?” to 50 employees, of whom 48 say yes, which means 96%. Now, imagine you ask the same question to all 1,000 employees and 960 say yes, which also means 96%. Claiming that 96% of employees like working at the company is far less trustworthy when the sample size was only 50. The significance of the results is much more accurate when surveying a bigger sample size (see the sketch after this list for how the margin of error shrinks as the sample grows).   
  • Privacy concerns: In some cases, data collection can be subject to privacy regulations. Businesses gather all kinds of information from their customers, from purchasing behaviors to addresses and phone numbers. If this falls into the wrong hands due to a breach, it can affect the security and confidentiality of your clients. To avoid this issue, collect only the data that is needed for your research and, if you are using sensitive information, anonymize it so customers are protected. The misuse of customer data can severely damage a business's reputation, so it is important to keep an eye on privacy. 
  • Lack of communication between teams : When it comes to performing data analysis on a business level, it is very likely that each department and team will have different goals and strategies. However, they are all working for the same common goal of helping the business run smoothly and keep growing. When teams are not connected and communicating with each other, it can directly affect the way general strategies are built. To avoid these issues, tools such as data dashboards enable teams to stay connected through data in a visually appealing way. 
  • Innumeracy : Businesses are working with data more and more every day. While there are many BI tools available to perform effective analysis, data literacy is still a constant barrier. Not all employees know how to apply analysis techniques or extract insights from them. To prevent this from happening, you can implement different training opportunities that will prepare every relevant user to deal with data. 
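
To put a number on the sample size issue, the sketch below computes an approximate 95% confidence interval for the share of satisfied employees at the two sample sizes from the example above. The counts are the hypothetical ones used in that example, and the simple normal-approximation (Wald) interval is only a rough illustration, not a full statistical treatment.

```python
from math import sqrt

def wald_interval(successes, n, z=1.96):
    """Approximate 95% confidence interval for a proportion (normal approximation)."""
    p = successes / n
    margin = z * sqrt(p * (1 - p) / n)
    # The simple Wald interval can spill past 0% or 100% near the boundaries;
    # clamp it for readability (a Wilson interval would handle this properly).
    return max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical survey: 48 of 50 vs. 960 of 1,000 employees say they like working here.
for successes, n in [(48, 50), (960, 1000)]:
    low, high = wald_interval(successes, n)
    print(f"n = {n:>4}: observed 96%, 95% CI roughly {low:.1%} to {high:.1%}")

# With n = 50 the interval spans several percentage points either way, while with
# n = 1,000 it is much narrower: the same observed percentage is far more
# trustworthy when it comes from the larger sample.
```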

Key Data Analysis Skills

As you've learned throughout this lengthy guide, analyzing data is a complex task that requires a lot of knowledge and skill. That said, thanks to the rise of self-service tools, the process is far more accessible and agile than it once was. Regardless, there are still some key skills that are valuable to have when working with data; we list the most important ones below.

  • Critical and statistical thinking: To successfully analyze data you need to be creative and think outside the box. That might sound like a strange statement considering that data is often tied to facts. However, a great deal of critical thinking is required to uncover connections, come up with a valuable hypothesis, and extract conclusions that go a step beyond the surface. This, of course, needs to be complemented by statistical thinking and an understanding of numbers. 
  • Data cleaning: Anyone who has ever worked with data will tell you that the cleaning and preparation process accounts for around 80% of a data analyst's work, so the skill is fundamental. Beyond that, failing to clean the data adequately can significantly damage the analysis, which can lead to poor decision-making in a business scenario. While there are multiple tools that automate the cleaning process and reduce the possibility of human error, it is still a valuable skill to master (a minimal cleaning sketch follows this list). 
  • Data visualization: Visuals make the information easier to understand and analyze, not only for professional users but especially for non-technical ones. Having the necessary skills to not only choose the right chart type but know when to apply it correctly is key. This also means being able to design visually compelling charts that make the data exploration process more efficient. 
  • SQL: The Structured Query Language or SQL is a programming language used to communicate with databases. It is fundamental knowledge as it enables you to update, manipulate, and organize data from relational databases which are the most common databases used by companies. It is fairly easy to learn and one of the most valuable skills when it comes to data analysis. 
  • Communication skills: This is a skill that is especially valuable in a business environment. Being able to clearly communicate analytical outcomes to colleagues is incredibly important, especially when the information you are trying to convey is complex for non-technical people. This applies to in-person communication as well as written format, for example, when generating a dashboard or report. While this might be considered a “soft” skill compared to the other ones we mentioned, it should not be ignored as you most likely will need to share analytical findings with others no matter the context. 
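
As a small illustration of the cleaning step, here is a hypothetical pandas sketch that removes duplicate respondents, normalizes inconsistent text, and fills in missing values for a toy survey table. The column names and imputation choices are assumptions made for the example, not a prescription for real projects.

```python
import numpy as np
import pandas as pd

# Toy survey data with typical problems: a duplicate respondent, inconsistent
# casing, and missing values (all column names and values are hypothetical).
raw = pd.DataFrame({
    "respondent_id": [1, 2, 2, 3, 4],
    "department": ["Sales", "sales", "sales", "Marketing", None],
    "satisfaction": [4.0, 5.0, 5.0, np.nan, 3.0],
})

clean = (
    raw.drop_duplicates(subset="respondent_id")                   # drop duplicate respondents
       .assign(department=lambda d: d["department"].str.title())  # normalize casing
)
clean["satisfaction"] = clean["satisfaction"].fillna(clean["satisfaction"].median())
clean["department"] = clean["department"].fillna("Unknown")

print(clean)
```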

Data Analysis In The Big Data Environment

Big data is invaluable to today’s businesses, and by using different methods for data analysis, it’s possible to view your data in a way that can help you turn insight into positive action.

To inspire your efforts and put the importance of big data into context, here are some insights that you should know:

  • By 2026, the big data industry is expected to be worth approximately $273.4 billion.
  • 94% of enterprises say that analyzing data is important for their growth and digital transformation. 
  • Companies that exploit the full potential of their data can increase their operating margins by 60% .
  • As discussed earlier in this article, Artificial Intelligence brings considerable benefits; this industry's financial impact is expected to reach $40 billion by 2025.

Data analysis concepts may come in many forms, but fundamentally, any solid methodology will help to make your business more streamlined, cohesive, insightful, and successful than ever before.

Key Takeaways From Data Analysis 

As we reach the end of our data analysis journey, here is a brief summary of the main methods and techniques for performing excellent analysis and growing your business.

17 Essential Types of Data Analysis Methods:

  • Cluster analysis
  • Cohort analysis
  • Regression analysis
  • Factor analysis
  • Neural Networks
  • Data Mining
  • Text analysis
  • Time series analysis
  • Decision trees
  • Conjoint analysis 
  • Correspondence Analysis
  • Multidimensional Scaling 
  • Content analysis 
  • Thematic analysis
  • Narrative analysis 
  • Grounded theory analysis
  • Discourse analysis 

Top 17 Data Analysis Techniques:

  • Collaborate your needs
  • Establish your questions
  • Data democratization
  • Think of data governance 
  • Clean your data
  • Set your KPIs
  • Omit useless data
  • Build a data management roadmap
  • Integrate technology
  • Answer your questions
  • Visualize your data
  • Interpretation of data
  • Consider autonomous technology
  • Build a narrative
  • Share the load
  • Data Analysis tools
  • Refine your process constantly 

We’ve pondered the data analysis definition and drilled down into the practical applications of data-centric analytics, and one thing is clear: by taking measures to arrange your data and making your metrics work for you, it’s possible to transform raw information into action, the kind that will push your business to the next level.

Yes, good data analytics techniques result in enhanced business intelligence (BI). To help you understand this notion in more detail, read our exploration of business intelligence reporting .

And, if you’re ready to perform your own analysis, drill down into your facts and figures while interacting with your data on astonishing visuals, you can try our software for a free, 14-day trial .

Research Design – Types, Methods and Examples

Definition:

Research design refers to the overall strategy or plan for conducting a research study. It outlines the methods and procedures that will be used to collect and analyze data, as well as the goals and objectives of the study. Research design is important because it guides the entire research process and ensures that the study is conducted in a systematic and rigorous manner.

Types of Research Design

Types of Research Design are as follows:

Descriptive Research Design

This type of research design is used to describe a phenomenon or situation. It involves collecting data through surveys, questionnaires, interviews, and observations. The aim of descriptive research is to provide an accurate and detailed portrayal of a particular group, event, or situation. It can be useful in identifying patterns, trends, and relationships in the data.

Correlational Research Design

Correlational research design is used to determine if there is a relationship between two or more variables. This type of research design involves collecting data from participants and analyzing the relationship between the variables using statistical methods. The aim of correlational research is to identify the strength and direction of the relationship between the variables.
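
A minimal sketch of what the analysis step of a correlational design can look like in Python, using invented numbers purely for demonstration: the Pearson correlation summarizes the strength and direction of the relationship between two variables.

```python
from scipy.stats import pearsonr

# Hypothetical paired observations: weekly study hours and exam scores for 8 students.
study_hours = [2, 5, 1, 8, 4, 7, 3, 6]
exam_scores = [55, 70, 50, 90, 65, 85, 60, 78]

r, p_value = pearsonr(study_hours, exam_scores)
print(f"Correlation: r = {r:.2f}, p = {p_value:.4f}")
# A positive r indicates that more study time tends to go with higher scores;
# note that correlation alone does not establish cause and effect.
```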

Experimental Research Design

Experimental research design is used to investigate cause-and-effect relationships between variables. This type of research design involves manipulating one variable and measuring the effect on another variable. It usually involves randomly assigning participants to groups and manipulating an independent variable to determine its effect on a dependent variable. The aim of experimental research is to establish causality.
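
Here is a hedged sketch of the analysis step for such a design: participants are randomly assigned to a treatment or control condition and the outcome means are compared with an independent-samples t-test. The group sizes and scores below are simulated placeholders, not results from an actual experiment.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=42)

# Simulated outcome scores after random assignment to two conditions.
treatment = rng.normal(loc=75, scale=8, size=30)  # received the manipulation
control = rng.normal(loc=70, scale=8, size=30)    # did not

t_stat, p_value = ttest_ind(treatment, control)
print(f"Mean difference: {treatment.mean() - control.mean():.1f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests the difference is unlikely to be due to chance alone;
# the causal claim rests on the random assignment built into the design.
```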

Quasi-experimental Research Design

Quasi-experimental research design is similar to experimental research design, but it lacks one or more of the features of a true experiment. For example, there may not be random assignment to groups or a control group. This type of research design is used when it is not feasible or ethical to conduct a true experiment.

Case Study Research Design

Case study research design is used to investigate a single case or a small number of cases in depth. It involves collecting data through various methods, such as interviews, observations, and document analysis. The aim of case study research is to provide an in-depth understanding of a particular case or situation.

Longitudinal Research Design

Longitudinal research design is used to study changes in a particular phenomenon over time. It involves collecting data at multiple time points and analyzing the changes that occur. The aim of longitudinal research is to provide insights into the development, growth, or decline of a particular phenomenon over time.

Structure of Research Design

The format of a research design typically includes the following sections:

  • Introduction : This section provides an overview of the research problem, the research questions, and the importance of the study. It also includes a brief literature review that summarizes previous research on the topic and identifies gaps in the existing knowledge.
  • Research Questions or Hypotheses: This section identifies the specific research questions or hypotheses that the study will address. These questions should be clear, specific, and testable.
  • Research Methods : This section describes the methods that will be used to collect and analyze data. It includes details about the study design, the sampling strategy, the data collection instruments, and the data analysis techniques.
  • Data Collection: This section describes how the data will be collected, including the sample size, data collection procedures, and any ethical considerations.
  • Data Analysis: This section describes how the data will be analyzed, including the statistical techniques that will be used to test the research questions or hypotheses.
  • Results : This section presents the findings of the study, including descriptive statistics and statistical tests.
  • Discussion and Conclusion : This section summarizes the key findings of the study, interprets the results, and discusses the implications of the findings. It also includes recommendations for future research.
  • References : This section lists the sources cited in the research design.

Example of Research Design

An Example of Research Design could be:

Research question: Does the use of social media affect the academic performance of high school students?

Research design:

  • Research approach : The research approach will be quantitative as it involves collecting numerical data to test the hypothesis.
  • Research design : The research design will be quasi-experimental, using a pretest-posttest control group design.
  • Sample : The sample will be 200 high school students from two schools, with 100 students in the experimental group and 100 students in the control group.
  • Data collection : The data will be collected through surveys administered to the students at the beginning and end of the academic year. The surveys will include questions about their social media usage and academic performance.
  • Data analysis : The data collected will be analyzed using statistical software. The mean scores of the experimental and control groups will be compared to determine whether there is a significant difference in academic performance between the two groups (a minimal sketch of this comparison follows the list).
  • Limitations : The limitations of the study will be acknowledged, including the fact that social media usage can vary greatly among individuals, and the study only focuses on two schools, which may not be representative of the entire population.
  • Ethical considerations: Ethical considerations will be taken into account, such as obtaining informed consent from the participants and ensuring their anonymity and confidentiality.
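
To make the data analysis step of this example concrete, here is a hypothetical Python sketch comparing the pretest-to-posttest change in performance between the experimental and control groups. All scores are simulated placeholders standing in for the kind of survey data the design would produce; they do not represent real findings about social media use.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=7)

# Simulated pretest and posttest performance scores (0-100) for 100 students per group.
exp_pre = rng.normal(70, 10, 100)
exp_post = exp_pre + rng.normal(-3, 5, 100)  # hypothetical change in the experimental group
ctl_pre = rng.normal(70, 10, 100)
ctl_post = ctl_pre + rng.normal(1, 5, 100)   # hypothetical change in the control group

# In a pretest-posttest design, one simple option is to compare gain scores between groups.
exp_gain = exp_post - exp_pre
ctl_gain = ctl_post - ctl_pre

t_stat, p_value = ttest_ind(exp_gain, ctl_gain)
print(f"Mean gain (experimental): {exp_gain.mean():.1f}")
print(f"Mean gain (control):      {ctl_gain.mean():.1f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```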

How to Write Research Design

Writing a research design involves planning and outlining the methodology and approach that will be used to answer a research question or hypothesis. Here are some steps to help you write a research design:

  • Define the research question or hypothesis : Before beginning your research design, you should clearly define your research question or hypothesis. This will guide your research design and help you select appropriate methods.
  • Select a research design: There are many different research designs to choose from, including experimental, survey, case study, and qualitative designs. Choose a design that best fits your research question and objectives.
  • Develop a sampling plan : If your research involves collecting data from a sample, you will need to develop a sampling plan. This should outline how you will select participants and how many participants you will include.
  • Define variables: Clearly define the variables you will be measuring or manipulating in your study. This will help ensure that your results are meaningful and relevant to your research question.
  • Choose data collection methods : Decide on the data collection methods you will use to gather information. This may include surveys, interviews, observations, experiments, or secondary data sources.
  • Create a data analysis plan: Develop a plan for analyzing your data, including the statistical or qualitative techniques you will use.
  • Consider ethical concerns : Finally, be sure to consider any ethical concerns related to your research, such as participant confidentiality or potential harm.

When to Write Research Design

Research design should be written before conducting any research study. It is an important planning phase that outlines the research methodology, data collection methods, and data analysis techniques that will be used to investigate a research question or problem. The research design helps to ensure that the research is conducted in a systematic and logical manner, and that the data collected is relevant and reliable.

Ideally, the research design should be developed as early as possible in the research process, before any data is collected. This allows the researcher to carefully consider the research question, identify the most appropriate research methodology, and plan the data collection and analysis procedures in advance. By doing so, the research can be conducted in a more efficient and effective manner, and the results are more likely to be valid and reliable.

Purpose of Research Design

The purpose of research design is to plan and structure a research study in a way that enables the researcher to achieve the desired research goals with accuracy, validity, and reliability. Research design is the blueprint or the framework for conducting a study that outlines the methods, procedures, techniques, and tools for data collection and analysis.

Some of the key purposes of research design include:

  • Providing a clear and concise plan of action for the research study.
  • Ensuring that the research is conducted ethically and with rigor.
  • Maximizing the accuracy and reliability of the research findings.
  • Minimizing the possibility of errors, biases, or confounding variables.
  • Ensuring that the research is feasible, practical, and cost-effective.
  • Determining the appropriate research methodology to answer the research question(s).
  • Identifying the sample size, sampling method, and data collection techniques.
  • Determining the data analysis method and statistical tests to be used.
  • Facilitating the replication of the study by other researchers.
  • Enhancing the validity and generalizability of the research findings.

Applications of Research Design

There are numerous applications of research design in various fields, some of which are:

  • Social sciences: In fields such as psychology, sociology, and anthropology, research design is used to investigate human behavior and social phenomena. Researchers use various research designs, such as experimental, quasi-experimental, and correlational designs, to study different aspects of social behavior.
  • Education : Research design is essential in the field of education to investigate the effectiveness of different teaching methods and learning strategies. Researchers use various designs such as experimental, quasi-experimental, and case study designs to understand how students learn and how to improve teaching practices.
  • Health sciences : In the health sciences, research design is used to investigate the causes, prevention, and treatment of diseases. Researchers use various designs, such as randomized controlled trials, cohort studies, and case-control studies, to study different aspects of health and healthcare.
  • Business : Research design is used in the field of business to investigate consumer behavior, marketing strategies, and the impact of different business practices. Researchers use various designs, such as survey research, experimental research, and case studies, to study different aspects of the business world.
  • Engineering : In the field of engineering, research design is used to investigate the development and implementation of new technologies. Researchers use various designs, such as experimental research and case studies, to study the effectiveness of new technologies and to identify areas for improvement.

Advantages of Research Design

Here are some advantages of research design:

  • Systematic and organized approach : A well-designed research plan ensures that the research is conducted in a systematic and organized manner, which makes it easier to manage and analyze the data.
  • Clear objectives: The research design helps to clarify the objectives of the study, which makes it easier to identify the variables that need to be measured, and the methods that need to be used to collect and analyze data.
  • Minimizes bias: A well-designed research plan minimizes the chances of bias, by ensuring that the data is collected and analyzed objectively, and that the results are not influenced by the researcher’s personal biases or preferences.
  • Efficient use of resources: A well-designed research plan helps to ensure that the resources (time, money, and personnel) are used efficiently and effectively, by focusing on the most important variables and methods.
  • Replicability: A well-designed research plan makes it easier for other researchers to replicate the study, which enhances the credibility and reliability of the findings.
  • Validity: A well-designed research plan helps to ensure that the findings are valid, by ensuring that the methods used to collect and analyze data are appropriate for the research question.
  • Generalizability : A well-designed research plan helps to ensure that the findings can be generalized to other populations, settings, or situations, which increases the external validity of the study.


Research Methods | Definitions, Types, Examples

Research methods are specific procedures for collecting and analyzing data. Developing your research methods is an integral part of your research design . When planning your methods, there are two key decisions you will make.

First, decide how you will collect data . Your methods depend on what type of data you need to answer your research question :

  • Qualitative vs. quantitative : Will your data take the form of words or numbers?
  • Primary vs. secondary : Will you collect original data yourself, or will you use data that has already been collected by someone else?
  • Descriptive vs. experimental : Will you take measurements of something as it is, or will you perform an experiment?

Second, decide how you will analyze the data .

  • For quantitative data, you can use statistical analysis methods to test relationships between variables.
  • For qualitative data, you can use methods such as thematic analysis to interpret patterns and meanings in the data.

Data is the information that you collect for the purposes of answering your research question . The type of data you need depends on the aims of your research.

Qualitative vs. quantitative data

Your choice of qualitative or quantitative data collection depends on the type of knowledge you want to develop.

For questions about ideas, experiences and meanings, or to study something that can’t be described numerically, collect qualitative data .

If you want to develop a more mechanistic understanding of a topic, or your research involves hypothesis testing , collect quantitative data .

You can also take a mixed methods approach , where you use both qualitative and quantitative research methods.

Primary vs. secondary research

Primary research is any original data that you collect yourself for the purposes of answering your research question (e.g. through surveys , observations and experiments ). Secondary research is data that has already been collected by other researchers (e.g. in a government census or previous scientific studies).

If you are exploring a novel research question, you’ll probably need to collect primary data . But if you want to synthesize existing knowledge, analyze historical trends, or identify patterns on a large scale, secondary data might be a better choice.

Descriptive vs. experimental data

In descriptive research , you collect data about your study subject without intervening. The validity of your research will depend on your sampling method .

In experimental research , you systematically intervene in a process and measure the outcome. The validity of your research will depend on your experimental design .

To conduct an experiment, you need to be able to vary your independent variable , precisely measure your dependent variable, and control for confounding variables . If it’s practically and ethically possible, this method is the best choice for answering questions about cause and effect.

Your data analysis methods will depend on the type of data you collect and how you prepare it for analysis.

Data can often be analyzed both quantitatively and qualitatively. For example, survey responses could be analyzed qualitatively by studying the meanings of responses or quantitatively by studying the frequencies of responses.
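
As a small illustration of the quantitative side of that example, the hypothetical sketch below tallies the frequencies of open-ended survey responses once they have been coded into categories; the categories and counts are invented.

```python
import pandas as pd

# Hypothetical coded responses to an open-ended survey question.
responses = pd.Series([
    "price", "quality", "price", "support", "quality",
    "price", "support", "quality", "price", "other",
])

frequencies = responses.value_counts()
print(frequencies)                   # absolute frequencies per category
print(frequencies / len(responses))  # relative frequencies
# The same responses could also be read qualitatively, interpreting what
# respondents mean by "quality" or "support" rather than counting mentions.
```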

Qualitative analysis methods

Qualitative analysis is used to understand words, ideas, and experiences. You can use it to interpret data that was collected:

  • From open-ended surveys and interviews , literature reviews , case studies , ethnographies , and other sources that use text rather than numbers.
  • Using non-probability sampling methods .

Qualitative analysis tends to be quite flexible and relies on the researcher’s judgement, so you have to reflect carefully on your choices and assumptions and be careful to avoid research bias .

Quantitative analysis methods

Quantitative analysis uses numbers and statistics to understand frequencies, averages and correlations (in descriptive studies) or cause-and-effect relationships (in experiments).

You can use quantitative analysis to interpret data that was collected either:

  • During an experiment .
  • Using probability sampling methods .

Because the data is collected and analyzed in a statistically valid way, the results of quantitative analysis can be easily standardized and shared among researchers.

Quantitative research deals with numbers and statistics, while qualitative research deals with words and meanings.

Quantitative methods allow you to systematically measure variables and test hypotheses . Qualitative methods allow you to explore concepts and experiences in more detail.

In mixed methods research , you use both qualitative and quantitative data collection and analysis methods to answer your research question .

A sample is a subset of individuals from a larger population . Sampling means selecting the group that you will actually collect data from in your research. For example, if you are researching the opinions of students in your university, you could survey a sample of 100 students.

In statistics, sampling allows you to test a hypothesis about the characteristics of a population.

The research methods you use depend on the type of data you need to answer your research question .

  • If you want to measure something or test a hypothesis , use quantitative methods . If you want to explore ideas, thoughts and meanings, use qualitative methods .
  • If you want to analyze a large amount of readily-available data, use secondary data. If you want data specific to your purposes with control over how it is generated, collect primary data.
  • If you want to establish cause-and-effect relationships between variables , use experimental methods. If you want to understand the characteristics of a research subject, use descriptive methods.

Methodology refers to the overarching strategy and rationale of your research project . It involves studying the methods used in your field and the theories or principles behind them, in order to develop an approach that matches your objectives.

Methods are the specific tools and procedures you use to collect and analyze data (for example, experiments, surveys , and statistical tests ).

In shorter scientific papers, where the aim is to report the findings of a specific study, you might simply describe what you did in a methods section .

In a longer or more complex research project, such as a thesis or dissertation , you will probably include a methodology section , where you explain your approach to answering the research questions and cite relevant sources to support your choice of methods.

What is Research Design?

Crafting a well-defined research design is essential for guiding the entire project, ensuring coherence in methodology and analysis, and upholding the validity and reproducibility of outcomes in the complex landscape of research.

Diving into any new project necessitates a solid plan, a blueprint for navigating the very complex research process. It requires a framework that illustrates how all the principal components of the project are intended to work together to address your central research questions - the research design .

This research design is crucial not only for guiding your entire project, from methodology to analysis, but also for ensuring the validity and reproducibility of its outcomes. Let’s take a closer look at research design by focusing on some of its benefits and core elements.

Why do researchers need a research design?

By taking a deliberate approach to research design, you ensure your chosen methods realistically match the project’s objectives. For example:

  • If your project seeks to find out how a certain group of people was influenced by a natural disaster, you could use interviews as methods for gathering data. Then, inductive or deductive coding may be used for analysis.
  • On the other hand, if your project asks how drinking water was affected by that same natural disaster, you would conduct an experiment to measure certain variables. Inferential or descriptive statistical analysis might then be used to assess the data.

Attention to robust research design helps the project run smoothly and efficiently by reducing both errors and unnecessary busywork. Good research design possesses these specific characteristics :

  • Neutrality : Stick to only the facts throughout, creating a plan based on relevant research methods and analysis. Use it as an opportunity to identify possible sources of bias.
  • Reliability : Include reliable methods that support the consistent measurement of project variables. Not only does this improve the legitimacy of your conclusions, but it also improves the possibility of replication.
  • Validity : Apply measurement tools that minimize systematic errors. Show the straightforward connection between your project results and research hypothesis.
  • Generalizability : Verify that research outcomes are applicable to a larger population beyond the sample studied for your project. Employ sensible methods and processes that easily adapt to variations in the population.
  • Flexibility : Consider alternative measures for adjusting to unexpected data or outcomes. Veer away from rigid procedures and requirements and plan for adaptability.

When you make the effort to focus on these characteristics while developing a research design, the process itself weeds out many potential challenges. It illuminates the relationships between the project’s multiple elements and allows for modifications from the start. 

What makes up a research design?

As the overarching strategy for your entire project, the research design outlines the plans, considerations, and feasibility of every facet. To make this task less daunting, divide it into logical sections by asking yourself these questions:

  • What is your general approach for the study?
  • What type of design will you employ?
  • How will you choose the population and sampling methods?
  • Which data collection methods will you use?
  • How will the data be analyzed?

The answers to these questions depend on your research questions and hypothesis. Before starting your research design, make certain that these elements are well thought out, basically solidified, and truly represent your intentions for the project.

When considering the overall approach for your project, decide what kind of data is needed to answer the research questions. Start by asking yourself:

  • Do I want to establish a cause-and-effect relationship, test a hypothesis, or identify patterns in data? If yes, use quantitative methodologies.
  • Or, am I seeking non-numerical textual information, like human beliefs, cultural experiences, or individual behaviors? If so, use qualitative methods.

Quantitative research methods offer a systematic means of investigating complex phenomena by measuring, describing, and testing relationships between variables. On the other hand, the qualitative approach explores subjective experiences and concepts within their natural settings. Here are some key characteristics of both approaches:

Basis

  • Quantitative : The research begins with the formulation of specific research questions or hypotheses that can be tested empirically using numerical data.
  • Qualitative : The exploratory and flexible nature allows researchers to delve deeply into the subject matter and generate insights.

Data collection

  • Quantitative : Typically involves collecting numerical data through methods such as surveys, experiments, structured observations, or existing datasets.
  • Qualitative : To collect detailed, contextually rich information directly from participants, researchers use methods such as interviews, focus groups, participant observation, and document analysis.

Data analysis

  • Quantitative : Quantitative data are analyzed using statistical techniques.
  • Qualitative : Data analysis in qualitative research involves systematic techniques for organizing, coding, and interpreting textual or visual data.

Interpretation of findings

  • Quantitative : Researchers interpret the results of the statistical analysis in relation to the research questions or hypotheses.
  • Qualitative : By paying close attention to context, qualitative researchers focus on interpreting the meanings, patterns, and themes that emerge from the data.

Reporting results

  • Quantitative : Reported in a structured format, often including tables, charts, and graphs to present the data visually.
  • Qualitative : Contributes to theory building and exploration by generating new insights, challenging existing theories, and uncovering unexpected findings.

Types

  • Quantitative : Experimental, quasi-experimental, correlational, and descriptive designs.
  • Qualitative : Ethnography, grounded theory, and phenomenology.

Population and sampling method

In research, the population, or target population, encompasses all individuals, objects, or events that share the specific attributes you’ve decided are relevant to the study’s objectives. As it is impractical to investigate every individual of this broad population, you will need to choose a subset, or sample.

Starting with a comprehensive understanding of the target population is crucial for selecting a sample that will assure the generalizability of your study’s results. However, drawing a truly random sample can be challenging, often resulting in some degree of sampling bias in most studies.

Sampling strategies vary across research fields, but are generally subdivided into these two categories:

  • Probability sampling : every member of the target population has a known, non-zero chance of being included in the sample.
  • Non-probability sampling : selection is non-systematic and does not give every member of the target population a known chance of being included in the sample.

There are several specific sampling methods that fall under these two broad headings:

Probability Sampling Examples

  • Simple random sampling: Each individual is chosen entirely by chance from the population, ensuring an equal probability of selection. 
  • Systematic sampling: Individuals are selected at regular intervals from the sampling frame based on a systematic rule.
  • Stratified sampling: The population is divided into homogeneous subgroups based on shared characteristics, and a random sample is drawn from each subgroup.
  • Cluster sampling: Subgroups, or clusters, of the population are identified and then randomly selected for inclusion.

Non-probability Sampling Examples

  • Convenience sampling: Participants are selected based on availability and willingness to participate.
  • Quota sampling: Interviewers are given quotas of specific types of participants to recruit.
  • Judgmental (purposive) sampling: Researchers select participants based on their judgment or specific criteria.
  • Snowball sampling: Existing subjects nominate further subjects known to them, allowing for sampling of hard-to-reach groups.

While they are often resource intensive, probability sampling methods have the advantage of providing representative samples with reduced biases. Non-probability sampling methods, on the other hand, are more cost-effective and convenient, yet lack representativeness and are prone to bias.
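
The hypothetical pandas sketch below contrasts two of these strategies: a simple random sample drawn from a toy population and a stratified sample that preserves subgroup proportions. The population frame, strata, and sample sizes are assumptions made up for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

# Toy sampling frame: 1,000 hypothetical people spread across three regions.
population = pd.DataFrame({
    "person_id": range(1000),
    "region": rng.choice(["North", "South", "East"], size=1000, p=[0.5, 0.3, 0.2]),
})

# Simple random sampling: every person has the same chance of selection.
simple_sample = population.sample(n=100, random_state=1)

# Stratified sampling: draw 10% from each region so subgroup shares are preserved.
stratified_sample = population.groupby("region").sample(frac=0.10, random_state=1)

print(simple_sample["region"].value_counts(normalize=True).round(2))
print(stratified_sample["region"].value_counts(normalize=True).round(2))
```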

Data collection

Throughout the research process, you'll employ a variety of sources to gather, record, and organize information that is relevant to your study or project. Achieving results that hold validity and significance requires the skillful use of efficient data collection methods.

Primary and secondary data collection methods are two distinct approaches to consider when gathering information for your project. Let's take a look at these methods and their associated techniques:

Primary data collection : involves gathering original data directly from the source or through direct interaction with respondents. 

  • Surveys and Questionnaires: collecting data from individuals or groups through face-to-face interviews, telephone calls, mail, or online platforms.
  • Interviews: direct interaction between the researcher and the respondent, conducted in person, over the phone, or through video conferencing.
  • Observations: researchers observe and record behaviors, actions, or events in their natural setting.
  • Experiments: manipulating variables to observe their impact on outcomes. 
  • Focus Groups: small groups of individuals discuss specific topics in a moderated setting.

Secondary data collection: entails collecting and analyzing existing data already collected by someone else for a different purpose.

  • Published sources: books, academic journals, magazines, newspapers, government reports, and other published materials that contain relevant data.
  • Online sources: databases, websites, repositories, and other platforms that can be consulted and downloaded from the internet. 
  • Government and institutional sources: records, statistics, and other pertinent information that can be accessed or purchased.
  • Publicly available data: shared by individuals, organizations, or communities on public stages, websites, or social media.
  • Past research: studies and results available through libraries, educational institutions, and other communal archives. 

Though primary methods offer significant control over data collection, they can be time-consuming, costly, and susceptible to biases. Secondary methods, in contrast, provide cost-effective and time-saving alternatives but offer reduced control over the data collection process.

Data analysis

To extract maximum value from your collected data, it's essential to engage in purposeful evaluation and interpretation. This process of data analysis involves thorough examination, meticulous cleaning, and insightful modeling to reveal patterns pertinent to your research questions.

The choice of methods depends on the specific research objectives, data characteristics, and analytical requirements of your particular project. Here are a few examples of the diverse range of methods you can use for data analysis:

Descriptive statistics : Summarizes key features of the data, like central tendency, spread, and variability. 

Inferential statistics : Draws conclusions about populations based on sample data to test relationships and make predictions.

Qualitative analysis : Considers non-numerical transcripts to identify themes, patterns, and connections.

Causal analysis : Examines cause-and-effect relationships between variables, going beyond what correlation alone can show.

Survey and questionnaire analysis : Transforms responses into usable data through processes like cross-tabulation and benchmarking.

Machine learning and data mining : Employs algorithms and computational techniques to discover patterns and insights from large datasets.

By integrating various data analysis tools, you can approach research questions from multiple perspectives to enhance the depth and breadth of your analysis.
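
As a small combined illustration of a few of these methods, the hypothetical sketch below produces descriptive statistics, a cross-tabulation, and an inferential test for a toy dataset. The variables and values are assumptions for demonstration only.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy survey dataset (all values invented).
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 52, 33, 47, 38],
    "group": ["A", "A", "B", "A", "B", "B", "B", "A"],
    "satisfied": ["yes", "yes", "no", "yes", "no", "yes", "no", "yes"],
})

# Descriptive statistics: central tendency and spread of a numeric variable.
print(df["age"].describe())

# Survey/questionnaire analysis: cross-tabulate group membership against satisfaction.
crosstab = pd.crosstab(df["group"], df["satisfied"])
print(crosstab)

# Inferential statistics: chi-square test of independence on the cross-tabulation
# (with such a tiny toy sample this is purely illustrative).
chi2, p_value, dof, expected = chi2_contingency(crosstab)
print(f"chi-square = {chi2:.2f}, p = {p_value:.3f}")
```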

Considerations for research design

A meticulous and thorough research design is essential to maintain the quality, reliability, and overall value of your study results. Consider these tips:

Do : Clearly define research questions

Don’t : Rush through the design process

Do : Choose appropriate methods

Don’t : Overlook ethical considerations

Do : Ensure data reliability and validity

Don’t : Neglect practical constraints

Do : Mitigate biases and confounding factors

Don’t : Use overly complex designs

Do : Pilot test the research design

Don’t : Ignore feedback from peers and experts

Do : Document the research design

Don’t : Assume the design is flawless

Final thoughts

A robust research design is undeniably crucial. It sets the framework for data collection, analysis, and interpretation throughout the entire research process. 

Because vagueness and assumptions can jeopardize the success of your project, you must prioritize clarity, make informed choices, and pay meticulous attention to detail. By embracing these strategies, you give your valuable research the best chance of making its maximum impact on the world.


Professional Empowerment in the Software Industry through Experience-Driven Shared Tacit Knowledge, pp. 87–110

Research Design and Process

  • Hui Chen
  • Miguel Baptista Nunes
  • First Online: 20 May 2023

Research design aims to provide a rationale, framework and structure before engaging with data collection and data analysis (De Vaus, Research Design in Social Research, Sage, 2001). A sound research design defines the structure of the research process, the arrangement of the different methods required to answer the research questions, and the outputs expected at each of the stages established.

Baiduchuan. (2012). Company profile . http://weibo.com/baiduchuan or http://www.baiduchuan.com . Accessed 5 December 2012.

Bazeley, P. (2007). Qualitative data analysis with NVivo (2nd ed.). Sage.

Bian, Y. (2001). Guanxi capital and social eating in Chinese cities. In N. Lin, K. Cook, & R. Burt (Eds.), Social capital, theory and research (pp. 275–295). Transaction Publishers.

Bryman, A. (2012). Social research methods (4th ed.). Oxford University Press.

Bryman, A., & Bell, E. (2007). Business research methods (2nd ed.). Oxford University Press.

Cameron, S., & Price, D. (2009). Business research methods: A practical approach. Charted Institute of Personnel and Development, CIPD House.

Conrad, L. Y., & Tucker, V. M. (2019). Making it tangible: Hybrid card sorting within qualitative interviews. Journal of Documentation, 75 (2), 397–416.

Creswell, J. W. (2013). Qualitative inquiry & research design: Choosing among five approaches (3rd ed.). Sage.

Davidson, P., & Griffin, R. W. (2000). Management: Australia in a global context . Wiley.

De Vaus, D. A. (2001). Research design in social research . Sage.

Denscombe, M. (1998). The good research guide . Open University Press.

Denscombe, M. (2007). The good research guide: For small-scale social research projects (3rd ed.). Open University Press.

Diener, E., & Crandall, R. (1978). Ethics and values in social and behavioral research . University of Chicago Press.

Electronic Records and Archives Department. (2015). Department profile . http://www.unisra.com/ucms/ . Accessed 5 November 2015.

Gibbs, G. R. (2002). Qualitative data analysis: Exploration with NVivo. Open University Press.

Goulding, C. (2002). Grounded theory: A practical guide for management, business and market researchers . Sage.

GPRI. (2005). Company information technology and service profile . http://www.sgepri.sgcc.com.cn/html/nari/col1030000503/2014-05/13/20140513150038769627978_1.htmll . Accessed 5 November 2015.

Haberman, M. R., & Bush, N. E. (2012). Quality of life: Methodological and measurement issues. In C. R. King & P. S. Hinds (Eds.), Quality of life: From nursing and patient perspectives (3rd ed.). (pp. 167–188). Jones & Bartlett Learning, LLC.

Hammond, S., & Glenn, L. (2004). The ancient practice of Chinese social networking: Guanxi and social network theory. E:CO Special Double Issue, 6 (1–2), 24–31.

Hartley, J. (2004). Case study research. In C. Cassel & G. Symon (Eds.), Essential guide to qualitative methods in organizational research (pp. 323–333). Sage.

Hatcher, T., & Colton, S. (2007). Using the internet to improve HRD research: The case of the web-based Delphi research technique to achieve content validity of an HRD-oriented measurement. Journal of European Industrial Training, 31 (7), 570–587.

Kazdin, A. E. (1977). Assessing the clinical or applied importance of behavior change. Behavior Modification, 1 (4), 427–452.

Kvale, S. (2007). Doing interview . Sage.

Lavelle, F., McGowan, L., Spence, M., Caraher, M., Raats, M. M., Hollywood, L., McDowell, D., McCloat, A., Mooney, E., & Dean, M. (2016). Barriers and facilitators to cooking from ‘scratch’ using basic or raw ingredients: A qualitative interview study. Appetite, 107 , 383–391.

Millar, R., Crute, V., & Hargie, O. (1992). Professional interview . Routledge.

Millar, R., & Tracey, A. (2009). The interview approach. In O. Hargie & D. Tourish (Eds.), Auditing organizational communication: A handbook of research, theory and practice (pp. 78–102). Routledge.

Nunes, J. M., Martins, J., Zhou, L., Almamari, S., & Alajamy, M. (2010). Contextual sensitivity in grounded theory: The role of pilot studies. The Electronic Journal of Business Research Methods, 8 (2), 73–84.

Punch, K. (2005). Introduction to social research: Quantitative and qualitative approaches . Sage.

Ramasamy, B., Goh, K. W., & Yeung, M. C. H. (2006). Is Guanxi (relationship) a bridge to knowledge transfer? Journal of Business Research, 59 (1), 130–139.

Sampson, H. (2004). Navigating the waves: The usefulness of a pilot in qualitative research. Qualitative Research, 4 (3), 383–402.

Saunders, M., Lewis, P., & Thornhill, A. (2003). Research methods for business students (3rd ed.). Prentice Hall.

SGCC. (2002). Brief introduction . http://www.sgcc.com.cn/ywlm/gsgk-e/gsgk-e/gsgk-e1.shtml . Accessed 5 November 2015.

Sireci, S. G. (1998, April 15). Evaluating content validity using multidimensional scaling. Proceedings of American Educational Research Association Annual Meeting. San Diego, CA. http://files.eric.ed.gov/fulltext/ED428121.pdf . Accessed 11 July 2015.

Somi, M. F., Butler, J. R. G., Vahid, F., Njau, J. D., Kachur, S. P., & Abdulla, S. (2007). Economic burden of malaria in rural Tanzania: Variations by socioeconomic status and season. Tropical Medicine and International Health, 12 (10), 1139–1147.

Strauss, A., & Corbin, J. (1998). Basics of qualitative research: Techniques and procedures for developing grounded theory (2nd ed.). Sage.

Sturges, J. E., & Hanrahan, K. J. (2004). Comparing telephone and face-to-face qualitative interviewing: A research note. Qualitative Research, 4 (1), 107–118.

UNIS. (2015). Company profile . http://www.unissoft.com/ . Accessed 5 November 2015.

van Teijlingen, E., & Hundley, V. (2001, Winter ). The importance of pilot studies. Social Research Update, 35, 1–4. http://sru.soc.surrey.ac.uk/SRU35.pdf . Accessed 10 July 2015.

Wilkie, R., Peat, G., Thomas, E., Hooper, H., & Croft, P. (2005). The Keele assessment of participation: A new instrument to measure participation restriction in population studies. Combined qualitative and quantitative examination of its psychometric properties. Quality of Life Research, 14 (8), 1889–1899.

Yin, R. (2003). Case study research: Design and methods (3rd ed.). Sage.

Zhou, L., & Nunes, J. M. (2010, June 24–25). Doing qualitative research in Chinese contexts: Lessons learned from a grounded theory study in a Chinese healthcare environment. Proceedings of 9th European Conference on Research Methodology for Business and Management Studies (pp. 576–584). IE Business School.

Child Care and Early Education Research Connections

Data Analysis

This section describes the statistics and methods used to summarize the characteristics of the members of a sample or population, explore the relationships between variables, test research hypotheses, and visually represent data. Terms relating to the topics covered are defined in the  Research Glossary .

Descriptive Statistics

Tests of Significance

Graphical/Pictorial Methods

Analytical Techniques

Descriptive statistics can be useful for two purposes:

To provide basic information about the characteristics of a sample or population. These characteristics are represented by variables in a research study dataset.

To highlight potential relationships between these characteristics, or the relationships among the variables in the dataset.

The four most common descriptive statistics are:

Proportions, Percentages and Ratios

Measures of Central Tendency

Measures of Dispersion

Measures of Association

One of the most basic ways of describing the characteristics of a sample or population is to classify its individual members into mutually exclusive categories and count the number of cases in each of the categories. In research, variables with discrete, qualitative categories are called nominal or categorical variables. The categories can be given numerical codes, but they cannot be ranked, added, or multiplied. Examples of nominal variables include gender (male, female), preschool program attendance (yes, no), and race/ethnicity (White, African American, Hispanic, Asian, American Indian). Researchers calculate proportions, percentages and ratios in order to summarize the data from nominal or categorical variables and to allow for comparisons to be made between groups.

Proportion —The number of cases in a category divided by the total number of cases across all categories of a variable.

Percentage —The proportion multiplied by 100 (or the number of cases in a category divided by the total number of cases across all categories of a variable, multiplied by 100).

Ratio —The number of cases in one category to the number of cases in a second category.

A researcher selects a sample of 100 students from a Head Start program. The sample includes 20 White children, 30 African American children, 40 Hispanic children and 10 children of mixed-race/ethnicity.

Proportion of Hispanic children in the program = 40 / (20+30+40+10) = .40.

Percentage of Hispanic children in the program = .40 x 100 = 40%.

Ratio of Hispanic children to White children in the program = 40/20 = 2.0, or the ratio of Hispanic to White children enrolled in the Head Start program is 2 to 1.
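
To make the arithmetic concrete, here is a minimal Python sketch (using the hypothetical Head Start counts above) that computes the proportion, percentage, and ratio:

```python
# Counts from the hypothetical Head Start sample described above
counts = {"White": 20, "African American": 30, "Hispanic": 40, "Mixed": 10}

total = sum(counts.values())                                 # 100 children in total
proportion_hispanic = counts["Hispanic"] / total             # 0.40
percentage_hispanic = proportion_hispanic * 100              # 40.0
ratio_hispanic_white = counts["Hispanic"] / counts["White"]  # 2.0, i.e., 2 to 1

print(proportion_hispanic, percentage_hispanic, ratio_hispanic_white)
```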

Proportions, percentages and ratios are used to summarize the characteristics of a sample or population that fall into discrete categories. Measures of central tendency are the most basic and, often, the most informative description of a population's characteristics, when those characteristics are measured using an interval scale. The values of an interval variable are ordered where the distance between any two adjacent values is the same but the zero point is arbitrary. Values on an interval scale can be added and subtracted. Examples of interval scales or interval variables include household income, years of schooling, hours a child spends in child care and the cost of child care.

Measures of central tendency describe the "average" member of the sample or population of interest. There are three measures of central tendency:

Mean —The arithmetic average of the values of a variable. To calculate the mean, all the values of a variable are summed and divided by the total number of cases.

Median —The value within a set of values that divides the values in half (i.e. 50% of the variable's values lie above the median, and 50% lie below the median).

Mode —The value of a variable that occurs most often.

The annual incomes of five randomly selected people in the United States are $10,000, $10,000, $45,000, $60,000, and $1,000,000.

Mean Income = (10,000 + 10,000 + 45,000 + 60,000 + 1,000,000) / 5 = $225,000.

Median Income = $45,000.

Modal Income = $10,000.

The mean is the most commonly used measure of central tendency. Medians are generally used when a few values are extremely different from the rest of the values (this is called a skewed distribution). For example, the median income is often the best measure of the average income because, while most individuals earn between $0 and $200,000 annually, a handful of individuals earn millions.
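
The income example can be reproduced with Python's built-in statistics module; a minimal sketch:

```python
from statistics import mean, median, mode

incomes = [10_000, 10_000, 45_000, 60_000, 1_000_000]

print(mean(incomes))    # 225000 -- pulled upward by the single extreme income
print(median(incomes))  # 45000  -- unaffected by the extreme value
print(mode(incomes))    # 10000  -- the most frequent value
```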

Measures of dispersion provide information about the spread of a variable's values. There are three key measures of dispersion:

Range  is simply the difference between the smallest and largest values in the data. Researchers often report simply the values of the range (e.g., 75 – 100).

Variance  is a commonly used measure of dispersion, or how spread out a set of values are around the mean. It is calculated by taking the average of the squared differences between each value and the mean. The variance is the standard deviation squared.

Standard deviation , like variance, is a measure of the spread of a set of values around the mean of the values. The wider the spread, the greater the standard deviation and the greater the range of the values from their mean. A small standard deviation indicates that most of the values are close to the mean. A large standard deviation on the other hand indicates that the values are more spread out. The standard deviation is the square root of the variance.

Five randomly selected children were administered a standardized reading assessment. Their scores on the assessment were 50, 50, 60, 75 and 90, with a mean score of 65.

Range = 90 - 50 = 40.

Variance = [(50 - 65)² + (50 - 65)² + (60 - 65)² + (75 - 65)² + (90 - 65)²] / 5 = 1,200 / 5 = 240. (Dividing by n - 1 = 4 instead gives the sample variance, 300.)

Standard Deviation = Square Root (240) ≈ 15.49. (The sample standard deviation, Square Root (300), is approximately 17.32.)
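
These figures can be checked with Python's statistics module, which provides both the population (divide by n) and sample (divide by n - 1) versions of the variance and standard deviation:

```python
from statistics import pvariance, pstdev, variance, stdev

scores = [50, 50, 60, 75, 90]

print(max(scores) - min(scores))  # Range: 40
print(pvariance(scores))          # Population variance (divide by n): 240
print(round(pstdev(scores), 2))   # Population standard deviation: 15.49
print(variance(scores))           # Sample variance (divide by n - 1): 300
print(round(stdev(scores), 2))    # Sample standard deviation: 17.32
```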

Skewness and Kurtosis

The range, variance and standard deviation are measures of dispersion and provide information about the spread of the values of a variable. Two additional measures provide information about the shape of the distribution of values.

Skew  is a measure of whether some values of a variable are extremely different from the majority of the values. Skewness refers to the tendency of the values of a variable to depart from symmetry. A distribution is symmetric if one half of the distribution is exactly equal to the other half. For example, the distribution of annual income in the U.S. is skewed because most people make between $0 and $200,000 a year, but a handful of people earn millions. A variable is positively skewed (skewed to the right) if the extreme values are higher than the majority of values. A variable is negatively skewed (skewed to the left) if the extreme values are lower than the majority of values. In the example of students' standardized test scores, the distribution is slightly positively skewed.

Kurtosis  measures how outlier-prone a distribution is. Outliers are values of a variable that are much smaller or larger than most of the values found in a dataset. The kurtosis of a normal distribution is 0. If the kurtosis is different from 0, then the distribution produces outliers that are either more extreme (positive kurtosis) or less extreme (negative kurtosis) than are produced by the normal distribution.
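
As a brief illustration (assuming SciPy is available), the skew and kurtosis of the reading-score example can be computed directly; note that scipy.stats.kurtosis reports excess kurtosis by default, so a normal distribution scores 0:

```python
from scipy.stats import kurtosis, skew

scores = [50, 50, 60, 75, 90]

print(skew(scores))      # positive value, indicating a slight positive (right) skew
print(kurtosis(scores))  # excess kurtosis; 0 corresponds to a normal distribution
```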

Measures of association indicate whether two variables are related. Two measures are commonly used:

Chi-square test of independence

Correlation

Chi-Square test of independence  is used to evaluate whether there is an association between two variables. (The chi-square test can also be used as a measure of goodness of fit, to test if data from a sample come from a population with a specific distribution, as an alternative to Anderson-Darling and Kolmogorov-Smirnov goodness-of-fit tests.)

It is most often used with nominal data (i.e., data that are put into discrete categories: e.g., gender [male, female] and type of job [unskilled, semi-skilled, skilled]) to determine whether they are associated. However, it can also be used with ordinal data.

Assumes that the samples being compared (e.g., males, females) are independent.

Tests the null hypothesis of no difference between the two variables (i.e., type of job is not related to gender).

To test for associations, a chi-square is calculated in the following way: Suppose a researcher wants to know whether there is a relationship between gender and two types of jobs, construction worker and administrative assistant. To perform a chi-square test, the researcher counts the number of female administrative assistants, the number of female construction workers, the number of male administrative assistants, and the number of male construction workers in the data. These counts are compared with the number that would be expected in each category if there were no association between job type and gender (this expected count is based on statistical calculations). The association between the two variables is determined to be significant (the null hypothesis is rejected), if the value of the chi-square test is greater than or equal to the critical value for a given significance level (typically .05) and the degrees of freedom associated with the test found in a chi-square table. The degrees of freedom for the chi-square are calculated using the following formula:  df  = (r-1)(c-1) where r is the number of rows and c is the number of columns in a contingency or cross-tabulation table. For example, the critical value for a 2 x 2 table with 1 degree of freedom ([2-1][2-1]=1) is 3.841.
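
A hedged sketch of such a test with scipy.stats.chi2_contingency, using an invented 2 x 2 table of counts purely for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Invented counts: columns are administrative assistant, construction worker
observed = np.array([[60, 10],    # female
                     [15, 55]])   # male

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)  # reject the null hypothesis of no association if p <= .05
print(expected)            # counts expected if job type and gender were unrelated
```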

Correlation coefficient  is used to measure the strength and direction of the relationship between numeric variables (e.g., weight and height).

The most common correlation coefficient is the Pearson's product-moment correlation coefficient (or simply  Pearson's r ), which can range from -1 to +1.

Values closer to 1 (either positive or negative) indicate that a stronger association exists between the two variables.

A positive coefficient (values between 0 and 1) suggests that larger values of one of the variables are accompanied by larger values of the other variable. For example, height and weight are usually positively correlated because taller people tend to weigh more.

A negative association (values between 0 and -1) suggests that larger values of one of the variables are accompanied by smaller values of the other variable. For example, age and hours slept per night are often negatively correlated because older people usually sleep fewer hours per night than younger people.
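
A minimal sketch of Pearson's r with SciPy, using invented height and weight values:

```python
from scipy.stats import pearsonr

heights = [150, 160, 165, 170, 180, 185]  # cm (invented values)
weights = [52, 58, 63, 68, 77, 82]        # kg (invented values)

r, p_value = pearsonr(heights, weights)
print(r, p_value)  # r near +1 indicates a strong positive association
```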

The findings reported by researchers are typically based on data collected from a single sample that was drawn from the population of interest (e.g., a sample of children selected from the population of children enrolled in Head Start or Early Head Start). If additional random samples of the same size were drawn from this population, the estimated percentages and means calculated using the data from each of these other samples might differ by chance somewhat from the estimates produced from one sample. Researchers use one of several tests to evaluate whether their findings are statistically significant.

Statistical significance refers to the probability or likelihood that the difference between groups or the relationship between variables observed in statistical analyses is not due to random chance (e.g., that differences between the average scores on a measure of language development between 3- and 4-year-olds are likely to be “real” rather than just observed in this sample by chance). If there is a very small probability that an observed difference or relationship is due to chance, the results are said to reach statistical significance. This means that the researcher concludes that there is a real difference between two groups or a real relationship between the observed variables.

Significance tests and the associated p-value only tell us how likely it is that a statistical result (e.g., a difference between the means of two or more groups, or a correlation between two variables) is due to chance. The p-value is the probability that the results of a statistical test are due to chance. In the social and behavioral sciences, a p-value less than or equal to .05 is usually interpreted to mean that the results are statistically significant (that the statistical results would occur by chance 5 times or fewer out of 100), although sometimes researchers use a p-value of .10 to indicate whether a result is statistically significant. The lower the p-value, the less likely a statistical result is due to chance. A lower p-value is therefore a more rigorous criterion for concluding significance.

Researchers use a variety of approaches to test whether their findings are statistically significant or not. The choice depends on several factors, including the number of groups being compared, whether the groups are independent from one another, and the type of variables used in the analysis.

Three of the more widely used tests of statistical significance, the Chi-square test, the t-test, and the F-test, are described briefly below.

Chi-Square test  is used when testing for associations between categorical variables (e.g., differences in whether a child has been diagnosed as having a cognitive disability by gender or race/ethnicity). It is also used as a goodness-of-fit test to determine whether data from a sample come from a population with a specific distribution.

t-test  is used to compare the means of two independent samples (independent t-test), the means of one sample at different times (paired sample t-test) or the mean of one sample against a known mean (one sample t-test). For example, when comparing the mean assessment scores of boys and girls or the mean scores of 3- and 4-year-old children, an independent t-test would be used. When comparing the mean assessment scores of girls only at two time points (e.g., fall and spring of the program year), a paired t-test would be used. A one sample t-test would be used when comparing the mean scores of a sample of children to the mean score of a population of children. The t-test is appropriate for small sample sizes (less than 30), although it is often used when testing group differences for larger samples. It is also used to test whether correlation and regression coefficients are significantly different from zero.

F-test  is an extension of the t-test and is used to compare the means of three or more independent samples (groups). The F-test is used in Analysis of Variance (ANOVA) to examine the ratio of the between groups to within groups variance. It is also used to test the significance of the total variance explained by a regression model with multiple independent variables.
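
The sketch below, with invented assessment scores, shows an independent-samples t-test for two groups and a one-way ANOVA F-test for three groups using SciPy:

```python
from scipy.stats import f_oneway, ttest_ind

# Invented assessment scores, for illustration only
boys = [52, 60, 48, 55, 63, 58]
girls = [57, 65, 59, 62, 70, 61]
three_year_olds = [40, 45, 42, 50]
four_year_olds = [55, 60, 58, 52]
five_year_olds = [68, 72, 65, 70]

t_stat, p_two_groups = ttest_ind(boys, girls)  # compares the means of two groups
f_stat, p_three_groups = f_oneway(three_year_olds, four_year_olds, five_year_olds)
print(t_stat, p_two_groups)
print(f_stat, p_three_groups)
```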

Significance tests alone do not tell us anything about the size of the difference between groups or the strength of the association between variables. Because significance test results are sensitive to sample size, studies with different sample sizes with the same means and standard deviations would have different t statistics and p values. It is therefore important that researchers provide additional information about the size of the difference between groups or the association and whether the difference/association is substantively meaningful.

See the following for additional information about descriptive statistics and tests of significance:

Descriptive analysis in education: A guide for researchers  (PDF)

Basic Statistics

Effect Sizes and Statistical Significance

Summarizing and Presenting Data

There are several graphical and pictorial methods that enhance understanding of individual variables and the relationships between variables. Graphical and pictorial methods provide a visual representation of the data. Some of these methods include:

Bar charts

Pie charts

Line graphs

Scatter plots

Geographic Information Systems (GIS)

Bar charts visually represent the frequencies or percentages with which different categories of a variable occur.

Bar charts are most often used when describing the percentages of different groups with a specific characteristic. For example, the percentages of boys and girls who participate in team sports. However, they may also be used when describing averages, such as the average amount of time boys and girls spend per week participating in team sports.

Each category of a variable (e.g., gender [boys and girls], children's age [3, 4, and 5]) is displayed along the bottom (or horizontal or X axis) of a bar chart.

The vertical axis (or Y axis) shows the values of the statistic on which the groups are being compared (e.g., percentage participating in team sports).

A bar is drawn for each of the categories along the horizontal axis and the height of the bar corresponds to the frequency or percentage with which that value occurs.

A pie chart (or a circle chart) is one of the most commonly used methods for graphically presenting statistical data.

As its name suggests, it is a circular graphic, which is divided into slices to illustrate the proportion or percentage of a sample or population that belong to each of the categories of a variable.

The size of each slice represents the proportion or percentage of the total sample or population with a specific characteristic (found in a specific category). For example, the percentage of children enrolled in Early Head Start who are members of different racial/ethnic groups would be represented by different slices with the size of each slice proportionate to the group's representation in the total population of children enrolled in the Early Head Start program.

A line graph is a type of chart that displays information as a series of data points connected by straight line segments.

Line graphs are often used to show changes in a characteristic over time.

It has an X-axis (horizontal axis) and a Y axis (vertical axis). The time segments of interest are displayed on the X-axis (e.g., years, months). The range of values that the characteristic of interest can take are displayed along the Y-axis (e.g., annual household income, mean years of schooling, average cost of child care). A data point is plotted coinciding with the value of the Y variable plotted for each of the values of the X variable, and a line is drawn connecting the points.

Scatter plots display the relationship between two quantitative or numeric variables by plotting one variable against the value of another variable.

The values of one of the two variables are displayed on the horizontal axis (x axis) and the values of the other variable are displayed on the vertical axis (y axis).

Each person or subject in a study would receive one data point on the scatter plot that corresponds to his or her values on the two variables. For example, a scatter plot could be used to show the relationship between income and children's scores on a math assessment. A data point for each child in the study showing his or her math score and family income would be shown on the scatter plot. Thus, the number of data points would equal the total number of children in the study.
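
As a small illustration (with invented data and assuming matplotlib is available), such a scatter plot could be drawn as follows:

```python
import matplotlib.pyplot as plt

# Invented data: one point per child
family_income = [20_000, 35_000, 50_000, 65_000, 80_000, 95_000]
math_score = [48, 55, 58, 66, 70, 74]

plt.scatter(family_income, math_score)
plt.xlabel("Family income (dollars)")
plt.ylabel("Math assessment score")
plt.title("Relationship between family income and math scores")
plt.show()
```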

Geographic Information Systems (GIS)

A Geographic Information System is computer software capable of capturing, storing, analyzing, and displaying geographically referenced information; that is, data identified according to location.

Using a GIS program, a researcher can create a map to represent data relationships visually. For example, the National Center for Education Statistics creates maps showing the characteristics of school districts across the United States such as the percentage of children living in married couple households, median family incomes and percentage of population that speaks a language other than English. The data that are linked to school district location come from the American Community Survey.

Some visual methods also display networks of relationships among variables, enabling researchers to identify the nature of relationships that would otherwise be too complex to conceptualize.

See the following for additional information about different graphic methods:

Graphical Analytic Techniques

Geographic Information Systems

Researchers use different analytical techniques to examine complex relationships between variables. There are three basic types of analytical techniques:

Regression analysis

Grouping methods

Multiple equation models

Regression analysis assumes that the dependent, or outcome, variable is directly affected by one or more independent variables. There are four important types of regression analyses:

Ordinary least squares (OLS) regression

OLS regression (also known as linear regression) is used to determine the relationship between a dependent variable and one or more independent variables.

OLS regression is used when the dependent variable is continuous. Continuous variables, in theory, can take on any value within a range. For example, family child care expenses, measured in dollars, is a continuous variable.

Independent variables may be nominal, ordinal or continuous. Nominal variables, which are also referred to as categorical variables, have two or more non-numeric or qualitative categories. Examples of nominal variables are children's gender (male, female), their parents' marital status (single, married, separated, divorced), and the type of child care children receive (center-based, home-based care). Ordinal variables are similar to nominal variables except it is possible to order the categories and the order has meaning. For example, children's families’ socioeconomic status may be grouped as low, middle and high.

When used to estimate the associations between two or more independent variables and a single dependent variable, it is called multiple linear regression.

In multiple regression, the coefficient (i.e., standardized or unstandardized regression coefficient for each independent variable) tells you how much the dependent variable is expected to change when that independent variable increases by one, holding all the other independent variables constant.
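
A hedged sketch of multiple linear regression with scikit-learn, using invented values for weekly child care hours and family income to model annual child care expenses; each fitted coefficient is read as the expected change in the outcome for a one-unit increase in that predictor, holding the other constant:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: [weekly child care hours, family income in $1,000s]
X = np.array([[10, 30], [20, 45], [25, 60], [30, 75], [40, 90], [45, 110]])
y = np.array([3_000, 5_500, 7_000, 8_500, 11_000, 12_500])  # annual expenses ($)

model = LinearRegression().fit(X, y)
print(model.coef_)                # expected change in expenses per unit increase in each predictor
print(model.intercept_)           # predicted expenses when both predictors are zero
print(model.predict([[35, 80]]))  # prediction for a new family
```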

Logistic regression

Logistic regression (or logit regression) is a special form of regression analysis that is used to examine the associations between a set of independent or predictor variables and a dichotomous outcome variable. A dichotomous variable is a variable with only two possible values, e.g. child receives child care before or after the Head Start program day (yes, no).

Like linear regression, the independent variables may be either interval, ordinal, or nominal. A researcher might use logistic regression to study the relationships between parental education, household income, and parental employment and whether children receive child care from someone other than their parents (receives nonparent care/does not receive nonparent care).
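
A corresponding hedged sketch of logistic regression with scikit-learn, using invented predictors (years of parental education, household income in $1,000s) and a dichotomous outcome (1 = child receives nonparental care, 0 = does not):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: [years of parental education, household income in $1,000s]
X = np.array([[10, 25], [12, 35], [12, 40], [14, 55],
              [16, 70], [16, 85], [18, 95], [20, 120]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 1])  # 1 = receives nonparental care

model = LogisticRegression().fit(X, y)
print(model.predict([[15, 60]]))        # predicted category for a new family
print(model.predict_proba([[15, 60]]))  # predicted probability of each category
```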

Hierarchical linear modeling (HLM)

Used when data are nested. Nested data occur when several individuals belong to the same group under study. For example, in child care research, children enrolled in a center-based child care program are grouped into classrooms with several classrooms in a center. Thus, the children are nested within classrooms and classrooms are nested within centers.

Allows researchers to determine the effects of characteristics for each level of nested data, classrooms and centers, on the outcome variables. HLM is also used to study growth (e.g., growth in children’s reading and math knowledge and skills over time).

Duration models

Used to estimate the length of time before a given event occurs or the length of time spent in a state. For example, in child care policy research, duration models have been used to estimate the length of time that families receive child care subsidies.

Sometimes referred to as survival analysis or event history analysis.
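
As a hedged sketch (the lifelines package is just one option, not necessarily what a given study would use), a Kaplan-Meier estimate of how long families remain on a child care subsidy might look like this, with invented durations in months and an indicator of whether each family's exit was observed rather than censored:

```python
from lifelines import KaplanMeierFitter

# Invented data: months on subsidy, and whether the exit was observed (1) or censored (0)
durations = [3, 6, 6, 9, 12, 15, 18, 24]
exit_observed = [1, 1, 0, 1, 1, 0, 1, 1]

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=exit_observed)
print(kmf.median_survival_time_)  # estimated median time on subsidy
print(kmf.survival_function_)     # estimated probability of still receiving the subsidy over time
```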

Grouping methods are techniques for classifying observations into meaningful categories. Two of the most common grouping methods are discriminant analysis and cluster analysis.

Discriminant analysis

Identifies characteristics that distinguish between groups. For example, a researcher could use discriminant analysis to determine which characteristics identify families that seek child care subsidies and which identify families that do not.

It is used when the dependent variable is a categorical variable (e.g., family receives child care subsidies [yes, no], child enrolled in family care [yes, no], type of child care child receives [relative care, non-relative care, center-based care]). The independent variables are interval variables (e.g., years of schooling, family income).
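
A hedged sketch of discriminant analysis with scikit-learn, using invented interval predictors (years of schooling, family income in $1,000s) to distinguish families that do and do not seek subsidies:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Invented data: [years of schooling, family income in $1,000s]
X = np.array([[10, 20], [11, 25], [12, 30], [14, 55], [16, 70], [18, 90]])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = seeks child care subsidies, 0 = does not

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.predict([[13, 40]]))  # predicted group membership for a new family
print(lda.coef_)                # weights on the characteristics that separate the groups
```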

Cluster analysis

Used to classify similar individuals together. It uses a set of measured variables to classify a sample of individuals (or organizations) into a number of groups such that individuals with similar values on the variables are placed in the same group. For example, cluster analysis would be used to group together parents who hold similar views of child care or children who are suspended from school.

Its goal is to sort individuals into groups in such a way that individuals in the same group (cluster) are more similar to each other than to individuals in other groups.

The variables used in cluster analysis may be nominal, ordinal or interval.
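
A minimal k-means clustering sketch with scikit-learn, grouping invented parent survey responses into clusters of similar views:

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented data: each row is one parent's ratings on two survey items (1-5 scale)
ratings = np.array([[1, 2], [2, 1], [1, 1],
                    [4, 5], [5, 4], [5, 5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(ratings)
print(kmeans.labels_)           # cluster assignment for each parent
print(kmeans.cluster_centers_)  # the "average" parent in each cluster
```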

Multiple equation modeling, which is an extension of regression, is used to examine the causal pathways from independent variables to the dependent variable. For example, what are the variables that link (or explain) the relationship between maternal education (independent variable) and children's early reading skills (dependent variable)? These variables might include the nature and quality of mother-child interactions or the frequency and quality of shared book reading.

There are two main types of multiple equation models:

Path analysis

Structural equation modeling

Path analysis is an extension of multiple regression that allows researchers to examine multiple direct and indirect effects of a set of variables on a dependent, or outcome, variable. In path analysis, a direct effect measures the extent to which the dependent variable is influenced by an independent variable. An indirect effect measures the extent to which an independent variable's influence on the dependent variable is due to another variable.

A path diagram is created that identifies the relationships (paths) between all the variables and the direction of the influence between them.

The paths can run directly from an independent variable to a dependent variable (e.g., X→Y), or they can run indirectly from an independent variable, through an intermediary, or mediating, variable, to the dependent variable (e.g. X1→X2→Y).

The paths in the model are tested to determine the relative importance of each.

Because the relationships between variables in a path model can become complex, researchers often avoid labeling the variables in the model as independent and dependent variables. Instead, two types of variables are found in these models:

Exogenous variables  are not affected by other variables in the model. They have straight arrows emerging from them and not pointing to them.

Endogenous variables  are influenced by at least one other variable in the model. They have at least one straight arrow pointing to them.
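
Dedicated path analysis software is normally used for such models; purely to illustrate the underlying idea, the simple path X1→X2→Y can be sketched with two ordinary regressions on simulated data, where the indirect effect of X1 on Y is the product of the two path coefficients (this is a conceptual stand-in, not a full path analysis):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=(200, 1))                                  # exogenous variable
x2 = 0.6 * x1 + rng.normal(scale=0.5, size=(200, 1))            # mediating variable
y = 0.5 * x2 + 0.2 * x1 + rng.normal(scale=0.5, size=(200, 1))  # outcome variable

a = LinearRegression().fit(x1, x2).coef_[0][0]  # path coefficient X1 -> X2
model_y = LinearRegression().fit(np.hstack([x1, x2]), y)
direct, b = model_y.coef_[0]                    # direct effect X1 -> Y and path X2 -> Y
print("direct effect of X1 on Y:", direct)
print("indirect effect of X1 via X2 (a * b):", a * b)
```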

Structural equation modeling (SEM)

Structural equation modeling expands path analysis by allowing for multiple indicators of unobserved (or latent) variables in the model. Latent variables are variables that are not directly observed (measured), but instead are inferred from other variables that are observed or directly measured. For example, children's school readiness is a latent variable with multiple indicators of children's development across multiple domains (e.g., children's scores on standardized assessments of early math and literacy, language, scores based on teacher reports of children's social skills and problem behaviors).

There are two parts to a SEM analysis. First, the measurement model is tested. This involves examining the relationships between the latent variables and their measures (indicators). Second, the structural model is tested in order to examine how the latent variables are related to one another. For example, a researcher might use SEM to investigate the relationships between different types of executive functions and word reading and reading comprehension for elementary school children. In this example, the latent variables word reading and reading comprehension might be inferred from a set of standardized reading assessments and the latent variables cognitive flexibility and inhibitory control from a set of executive function tasks. The measurement model of SEM allows the researcher to evaluate how well children's scores on the standardized reading assessments combine to identify children's word reading and reading comprehension. Assuming that the results of these analyses are acceptable, the researcher would move on to an evaluation of the structural model, examining the predicted relationships between two types of executive functions and two dimensions of reading.

SEM has several advantages over traditional path analysis:

Use of multiple indicators for key variables reduces measurement error.

Can test whether the effects of variables in the model and the relationships depicted in the entire model are the same for different groups (e.g., are the direct and indirect effects of parent investments on children's school readiness the same for White, Hispanic and African American children).

Can test models with multiple dependent variables (e.g., models predicting several domains of child development).

See the following for additional information about multiple equation models:

Finding Our Way: An Introduction to Path Analysis (Streiner)

An Introduction to Structural Equation Modeling (Hox & Bechger)  (PDF)


Data Science and Analytics: An Overview from Data-Driven Smart Computing, Decision-Making and Applications Perspective

Iqbal H. Sarker

1 Swinburne University of Technology, Melbourne, VIC 3122 Australia

2 Department of Computer Science and Engineering, Chittagong University of Engineering & Technology, Chittagong, 4349 Bangladesh

The digital world has a wealth of data, such as internet of things (IoT) data, business data, health data, mobile data, urban data, security data, and many more, in the current age of the Fourth Industrial Revolution (Industry 4.0 or 4IR). The knowledge or useful insights extracted from these data can be used for smart decision-making in various application domains. In the area of data science, advanced analytics methods including machine learning modeling can provide actionable insights or deeper knowledge about data, which makes the computing process automatic and smart. In this paper, we present a comprehensive view on “Data Science” including various types of advanced analytics methods that can be applied to enhance the intelligence and capabilities of an application through smart decision-making in different scenarios. We also discuss and summarize ten potential real-world application domains including business, healthcare, cybersecurity, urban and rural data science, and so on by taking into account data-driven smart computing and decision making. Based on this, we finally highlight the challenges and potential research directions within the scope of our study. Overall, this paper aims to serve as a reference point on data science and advanced analytics for researchers, decision-makers, and application developers, particularly from the data-driven solution point of view for real-world problems.

Introduction

We are living in the age of “data science and advanced analytics”, where almost everything in our daily lives is digitally recorded as data [ 17 ]. Thus the current electronic world is a wealth of various kinds of data, such as business data, financial data, healthcare data, multimedia data, internet of things (IoT) data, cybersecurity data, social media data, etc [ 112 ]. The data can be structured, semi-structured, or unstructured, which increases day by day [ 105 ]. Data science is typically a “concept to unify statistics, data analysis, and their related methods” to understand and analyze the actual phenomena with data. According to Cao et al. [ 17 ] “data science is the science of data” or “data science is the study of data”, where a data product is a data deliverable, or data-enabled or guided, which can be a discovery, prediction, service, suggestion, insight into decision-making, thought, model, paradigm, tool, or system. The popularity of “Data science” is increasing day-by-day, as shown in Fig. 1 according to Google Trends data over the last 5 years [ 36 ]. In addition to data science, we have also shown the popularity trends of the relevant areas such as “Data analytics”, “Data mining”, “Big data”, “Machine learning” in the figure. According to Fig. 1, the popularity indication values for these data-driven domains, particularly “Data science” and “Machine learning”, are increasing day-by-day. This statistical information and the applicability of data-driven smart decision-making in various real-world application areas motivate us to briefly study “Data science” and machine-learning-based “Advanced analytics” in this paper.

Fig. 1 The worldwide popularity scores of data science and related areas, on a scale of 0 (min) to 100 (max) over time, where the x-axis represents the timestamp and the y-axis represents the corresponding score

Usually, data science is the field of applying advanced analytics methods and scientific concepts to derive useful business information from data. The emphasis of advanced analytics is more on anticipating the use of data to detect patterns to determine what is likely to occur in the future. Basic analytics offer a description of data in general, while advanced analytics is a step forward in offering a deeper understanding of data and helping to analyze granular data, which we are interested in. In the field of data science, several types of analytics are popular, such as "Descriptive analytics" which answers the question of what happened; "Diagnostic analytics" which answers the question of why did it happen; "Predictive analytics" which predicts what will happen in the future; and "Prescriptive analytics" which prescribes what action should be taken, discussed briefly in “ Advanced analytics methods and smart computing ”. Such advanced analytics and decision-making based on machine learning techniques [ 105 ], a major part of artificial intelligence (AI) [ 102 ] can also play a significant role in the Fourth Industrial Revolution (Industry 4.0) due to its learning capability for smart computing as well as automation [ 121 ].

Although the area of “data science” is huge, we mainly focus on deriving useful insights through advanced analytics, where the results are used to make smart decisions in various real-world application areas. For this, various advanced analytics methods such as machine learning modeling, natural language processing, sentiment analysis, neural network, or deep learning analysis can provide deeper knowledge about data, and thus can be used to develop data-driven intelligent applications. More specifically, regression analysis, classification, clustering analysis, association rules, time-series analysis, sentiment analysis, behavioral patterns, anomaly detection, factor analysis, log analysis, and deep learning, which originated from the artificial neural network, are taken into account in our study. These machine learning-based advanced analytics methods are discussed briefly in “Advanced analytics methods and smart computing”. Thus, it’s important to understand the principles of the various advanced analytics methods mentioned above and their applicability in various real-world application areas. For instance, in our earlier paper Sarker et al. [ 114 ], we have discussed how data science and machine learning modeling can play a significant role in the domain of cybersecurity for making smart decisions and providing data-driven intelligent security services. In this paper, we broadly take into account the data science application areas and real-world problems in ten potential domains including the area of business data science, health data science, IoT data science, behavioral data science, urban data science, and so on, discussed briefly in “Real-world application domains”.

Based on the importance of machine learning modeling for extracting useful insights from data, as mentioned above, and of data-driven smart decision-making, in this paper we present a comprehensive view on “Data Science” including various types of advanced analytics methods that can be applied to enhance the intelligence and the capabilities of an application. The key contribution of this study is thus understanding data science modeling, explaining different analytic methods from a solution perspective, and discussing their applicability in the various real-world data-driven application areas mentioned earlier. Overall, the purpose of this paper is, therefore, to provide a basic guide or reference for those in academia and industry who want to study, research, and develop automated and intelligent applications or systems based on smart computing and decision making within the area of data science.

The main contributions of this paper are summarized as follows:

  • To define the scope of our study towards data-driven smart computing and decision-making in our real-world life. We also make a brief discussion on the concept of data science modeling from business problems to data product and automation, to understand its applicability and provide intelligent services in real-world scenarios.
  • To provide a comprehensive view on data science including advanced analytics methods that can be applied to enhance the intelligence and the capabilities of an application.
  • To discuss the applicability and significance of machine learning-based analytics methods in various real-world application areas. We also summarize ten potential real-world application areas, from business to personalized applications in our daily life, where advanced analytics with machine learning modeling can be used to achieve the expected outcome.
  • To highlight and summarize the challenges and potential research directions within the scope of our study.

The rest of the paper is organized as follows. The next section provides the background and related work and defines the scope of our study. The following section presents the concepts of data science modeling for building a data-driven application. After that, briefly discuss and explain different advanced analytics methods and smart computing. Various real-world application areas are discussed and summarized in the next section. We then highlight and summarize several research issues and potential future directions, and finally, the last section concludes this paper.

Background and Related Work

In this section, we first discuss various data terms and works related to data science and highlight the scope of our study.

Data Terms and Definitions

There is a range of key terms in the field, such as data analysis, data mining, data analytics, big data, data science, advanced analytics, machine learning, and deep learning, which are highly related and easily confusing. In the following, we define these terms and differentiate them with the term “Data Science” according to our goal.

The term “Data analysis” refers to the processing of data by conventional (e.g., classic statistical, empirical, or logical) theories, technologies, and tools for extracting useful information and for practical purposes [ 17 ]. The term “Data analytics”, on the other hand, refers to the theories, technologies, instruments, and processes that allow for an in-depth understanding and exploration of actionable data insight [ 17 ]. Statistical and mathematical analysis of the data is the major concern in this process. “Data mining” is another popular term over the last decade, which has a similar meaning with several other terms such as knowledge mining from data, knowledge extraction, knowledge discovery from data (KDD), data/pattern analysis, data archaeology, and data dredging. According to Han et al. [ 38 ], it should have been more appropriately named “knowledge mining from data”. Overall, data mining is defined as the process of discovering interesting patterns and knowledge from large amounts of data [ 38 ]. Data sources may include databases, data centers, the Internet or Web, other repositories of data, or data dynamically streamed through the system. “Big data” is another popular term nowadays, which may change the statistical and data analysis approaches as it has the unique features of “massive, high dimensional, heterogeneous, complex, unstructured, incomplete, noisy, and erroneous” [ 74 ]. Big data can be generated by mobile devices, social networks, the Internet of Things, multimedia, and many other new applications [ 129 ]. Several unique features including volume, velocity, variety, veracity, value (5Vs), and complexity are used to understand and describe big data [ 69 ].

In terms of analytics, basic analytics provides a summary of data whereas the term “Advanced Analytics” takes a step forward in offering a deeper understanding of data and helps to analyze granular data. Advanced analytics is characterized or defined as autonomous or semi-autonomous data or content analysis using advanced techniques and methods to discover deeper insights, predict or generate recommendations, typically beyond traditional business intelligence or analytics. “Machine learning”, a branch of artificial intelligence (AI), is one of the major techniques used in advanced analytics, which can automate analytical model building [ 112 ]. This is focused on the premise that systems can learn from data, recognize trends, and make decisions, with minimal human involvement [ 38 , 115 ]. “Deep Learning” is a subfield of machine learning that concerns algorithms inspired by the structure and function of the human brain, called artificial neural networks [ 38 , 139 ].

Unlike the above data-related terms, “Data science” is an umbrella term that encompasses advanced data analytics, data mining, machine, and deep learning modeling, and several other related disciplines like statistics, to extract insights or useful knowledge from the datasets and transform them into actionable business strategies. In [ 17 ], Cao et al. defined data science from the disciplinary perspective as “data science is a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management, and sociology to study data and its environments (including domains and other contextual aspects, such as organizational and social aspects) to transform data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology”. In “ Understanding data science modeling ”, we briefly discuss the data science modeling from a practical perspective starting from business problems to data products that can assist the data scientists to think and work in a particular real-world problem domain within the area of data science and analytics.

Related Work

In the area, several papers have been reviewed by the researchers based on data science and its significance. For example, the authors in [ 19 ] identify the evolving field of data science and its importance in the broader knowledge environment and some issues that differentiate data science and informatics issues from conventional approaches in information sciences. Donoho et al. [ 27 ] present 50 years of data science including recent commentary on data science in mass media, and on how/whether data science varies from statistics. The authors formally conceptualize the theory-guided data science (TGDS) model in [ 53 ] and present a taxonomy of research themes in TGDS. Cao et al. include a detailed survey and tutorial on the fundamental aspects of data science in [ 17 ], which considers the transition from data analysis to data science, the principles of data science, as well as the discipline and competence of data education.

Besides, the authors include a data science analysis in [ 20 ], which aims to provide a realistic overview of the use of statistical features and related data science methods in bioimage informatics. The authors in [ 61 ] study the key streams of data science algorithm use at central banks and show how their popularity has risen over time. This research contributes to the creation of a research vector on the role of data science in central banking. In [ 62 ], the authors provide an overview and tutorial on the data-driven design of intelligent wireless networks. The authors in [ 87 ] provide a thorough understanding of computational optimal transport with application to data science. In [ 97 ], the authors present data science as theoretical contributions in information systems via text analytics.

Unlike the above recent studies, in this paper, we concentrate on the knowledge of data science including advanced analytics methods, machine learning modeling, real-world application domains, and potential research directions within the scope of our study. The advanced analytics methods based on machine learning techniques discussed in this paper can be applied to enhance the capabilities of an application in terms of data-driven intelligent decision making and automation in the final data product or systems.

Understanding Data Science Modeling

In this section, we briefly discuss how data science can play a significant role in the real-world business process. For this, we first categorize various types of data and then discuss the major steps of data science modeling starting from business problems to data product and automation.

Types of Real-World Data

Typically, to build a data-driven real-world system in a particular domain, the availability of data is the key [ 17 , 112 , 114 ]. The data can be in different types such as (i) Structured—that has a well-defined data structure and follows a standard order, examples are names, dates, addresses, credit card numbers, stock information, geolocation, etc.; (ii) Unstructured—has no pre-defined format or organization, examples are sensor data, emails, blog entries, wikis, and word processing documents, PDF files, audio files, videos, images, presentations, web pages, etc.; (iii) Semi-structured—has elements of both the structured and unstructured data containing certain organizational properties, examples are HTML, XML, JSON documents, NoSQL databases, etc.; and (iv) Metadata—that represents data about the data, examples are author, file type, file size, creation date and time, last modification date and time, etc. [ 38 , 105 ].

In the area of data science, researchers use various widely-used datasets for different purposes. These are, for example, cybersecurity datasets such as NSL-KDD [ 127 ], UNSW-NB15 [ 79 ], Bot-IoT [ 59 ], ISCX’12 [ 15 ], CIC-DDoS2019 [ 22 ], etc., smartphone datasets such as phone call logs [ 88 , 110 ], mobile application usages logs [ 124 , 149 ], SMS Log [ 28 ], mobile phone notification logs [ 77 ] etc., IoT data [ 56 , 11 , 64 ], health data such as heart disease [ 99 ], diabetes mellitus [ 86 , 147 ], COVID-19 [ 41 , 78 ], etc., agriculture and e-commerce data [ 128 , 150 ], and many more in various application domains. In “ Real-world application domains ”, we discuss ten potential real-world application domains of data science and analytics by taking into account data-driven smart computing and decision making, which can help the data scientists and application developers to explore more in various real-world issues.

Overall, the data used in data-driven applications can be any of the types mentioned above, and they can differ from one application to another in the real world. Data science modeling, which is briefly discussed below, can be used to analyze such data in a specific problem domain and derive insights or useful information from the data to build a data-driven model or data product.

Steps of Data Science Modeling

Data science, as noted in “Background and related work”, is an umbrella term that encompasses advanced data analytics, data mining, machine and deep learning modeling, and several other related disciplines like statistics, used to extract insights or useful knowledge from datasets and transform them into actionable business strategies. Figure 2 shows an example of data science modeling starting from real-world data to a data-driven product and automation. In the following, we briefly discuss each module of the data science process.

  • Understanding business problems: This involves getting a clear understanding of the problem that is needed to solve, how it impacts the relevant organization or individuals, the ultimate goals for addressing it, and the relevant project plan. Thus to understand and identify the business problems, the data scientists formulate relevant questions while working with the end-users and other stakeholders. For instance, how much/many, which category/group, is the behavior unrealistic/abnormal, which option should be taken, what action, etc. could be relevant questions depending on the nature of the problems. This helps to get a better idea of what business needs and what we should be extracted from data. Such business knowledge can enable organizations to enhance their decision-making process, is known as “Business Intelligence” [ 65 ]. Identifying the relevant data sources that can help to answer the formulated questions and what kinds of actions should be taken from the trends that the data shows, is another important task associated with this stage. Once the business problem has been clearly stated, the data scientist can define the analytic approach to solve the problem.
  • Understanding data: As we know that data science is largely driven by the availability of data [ 114 ]. Thus a sound understanding of the data is needed towards a data-driven model or system. The reason is that real-world data sets are often noisy, missing values, have inconsistencies, or other data issues, which are needed to handle effectively [ 101 ]. To gain actionable insights, the appropriate data or the quality of the data must be sourced and cleansed, which is fundamental to any data science engagement. For this, data assessment that evaluates what data is available and how it aligns to the business problem could be the first step in data understanding. Several aspects such as data type/format, the quantity of data whether it is sufficient or not to extract the useful knowledge, data relevance, authorized access to data, feature or attribute importance, combining multiple data sources, important metrics to report the data, etc. are needed to take into account to clearly understand the data for a particular business problem. Overall, the data understanding module involves figuring out what data would be best needed and the best ways to acquire it.
  • Data pre-processing and exploration: Exploratory data analysis is defined in data science as an approach to analyzing datasets to summarize their key characteristics, often with visual methods [ 135 ]. This examines a broad data collection to discover initial trends, attributes, points of interest, etc. in an unstructured manner to construct meaningful summaries of the data. Thus data exploration is typically used to figure out the gist of data and to develop a first step assessment of its quality, quantity, and characteristics. A statistical model can be used or not, but primarily it offers tools for creating hypotheses by generally visualizing and interpreting the data through graphical representation such as a chart, plot, histogram, etc [ 72 , 91 ]. Before the data is ready for modeling, it’s necessary to use data summarization and visualization to audit the quality of the data and provide the information needed to process it. To ensure the quality of the data, the data  pre-processing technique, which is typically the process of cleaning and transforming raw data [ 107 ] before processing and analysis is important. It also involves reformatting information, making data corrections, and merging data sets to enrich data. Thus, several aspects such as expected data, data cleaning, formatting or transforming data, dealing with missing values, handling data imbalance and bias issues, data distribution, search for outliers or anomalies in data and dealing with them, ensuring data quality, etc. could be the key considerations in this step.
  • Machine learning modeling and evaluation: Once the data is prepared for building the model, data scientists design a model, algorithm, or set of models, to address the business problem. Model building is dependent on what type of analytics, e.g., predictive analytics, is needed to solve the particular problem, which is discussed briefly in “Advanced analytics methods and smart computing”. To best fit the data according to the type of analytics, different types of data-driven or machine learning models that have been summarized in our earlier paper Sarker et al. [ 105 ], can be built to achieve the goal. Data scientists typically separate training and test subsets of the given dataset, usually dividing the data in an 80:20 ratio or using the popular k-fold data splitting (cross-validation) method [ 38 ]. (A minimal code sketch of this step is shown after this list.) This is to observe whether the model performs well or not on the data, to maximize the model performance. Various model validation and assessment metrics, such as error rate, accuracy, true positive, false positive, true negative, false negative, precision, recall, f-score, ROC (receiver operating characteristic curve) analysis, applicability analysis, etc. [ 38 , 115 ] are used to measure the model performance, which can guide the data scientists to choose or design the learning method or model. Besides, machine learning experts or data scientists can take into account several advanced analytics such as feature engineering, feature selection or extraction methods, algorithm tuning, ensemble methods, modifying existing algorithms, or designing new algorithms, etc. to improve the ultimate data-driven model to solve a particular business problem through smart decision making.
  • Data product and automation: A data product is typically the output of any data science activity [ 17 ]. In general terms, a data product is a data deliverable or data-enabled guide, which can be a discovery, prediction, service, suggestion, insight for decision-making, thought, model, paradigm, tool, application, or system that processes data and generates results. Businesses can use the results of such data analysis to obtain useful information, such as churn prediction (a measure of how many customers stop using a product) and customer segmentation, and use these results to make smarter business decisions and automate processes. Thus, to support better decisions in various business problems, a variety of machine learning pipelines and data products can be developed. To highlight this, we summarize several potential real-world data science application areas in “ Real-world application domains ”, where various data products can play a significant role in making the relevant business problems smart and automated.
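The following is a minimal, hypothetical sketch of the pre-processing, splitting, modeling, and evaluation steps described in the list above, using pandas and scikit-learn. The file name customers.csv, the feature columns, and the churn label are illustrative assumptions, not taken from this paper.

```python
# Sketch of a simple data-driven pipeline: clean raw data, split 80:20,
# train a model, and report common evaluation metrics.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Data understanding / pre-processing: load raw data, drop duplicates,
# fill missing numeric values with each column's median.
df = pd.read_csv("customers.csv")            # assumed raw business data
df = df.drop_duplicates()
df = df.fillna(df.median(numeric_only=True))

X = df[["monthly_spend", "tenure_months", "support_calls"]]  # assumed features
y = df["churn"]                                               # assumed 0/1 target label

# 80:20 train/test split, as mentioned in the text.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Machine learning modeling and evaluation.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1-score :", f1_score(y_test, y_pred))
```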

Overall, we can conclude that data science modeling can be used to help drive changes and improvements in business practices. The key to the data science process is a deep understanding of the business problem to solve; without that, it is much harder to gather the right data and extract the most useful information for making decisions that solve the problem. In terms of role, “Data Scientists” typically interpret and manage data to uncover the answers to major questions that help organizations make objective decisions and solve complex problems. In summary, a data scientist proactively gathers and analyzes information from multiple sources to better understand how the business performs, and designs machine learning or data-driven tools, methods, or algorithms, focused on advanced analytics, which can make today’s computing processes smarter and more intelligent, as discussed briefly in the following section.

Fig. 2 An example of data science modeling from real-world data to data-driven system and decision making

Advanced Analytics Methods and Smart Computing

As mentioned earlier in “ Background and related work ”, basic analytics provides a summary of data, whereas advanced analytics goes a step further, offering a deeper understanding of data and supporting more granular analysis. For instance, the predictive capabilities of advanced analytics can be used to forecast trends, events, and behaviors. Thus, “advanced analytics” can be defined as the autonomous or semi-autonomous analysis of data or content using advanced techniques and methods to discover deeper insights, make predictions, or produce recommendations, where machine learning-based analytical modeling is considered the key technology in the area. In the following, we first summarize the various types of analytics and outcomes that are needed to solve the associated business problems, and then briefly discuss machine learning-based analytical modeling.

Types of Analytics and Outcome

In real-world business processes, several key questions, such as “What happened?”, “Why did it happen?”, “What will happen in the future?”, and “What action should be taken?”, are common and important. Based on these questions, in this paper we categorize analytics into four types: descriptive, diagnostic, predictive, and prescriptive, which are discussed below.

  • Descriptive analytics: It is the interpretation of historical data to better understand the changes that have occurred in a business. Thus descriptive analytics answers the question, “what happened in the past?” by summarizing past data such as statistics on sales and operations or marketing strategies, use of social media, and engagement with Twitter, Linkedin or Facebook, etc. For instance, by analyzing trends, patterns, and anomalies in customers’ historical shopping data, descriptive analytics can summarize how, when, and what customers have purchased. Thus, descriptive analytics can play a significant role in providing an accurate picture of what has occurred in a business and how it relates to previous periods, utilizing a broad range of relevant business data. As a result, managers and decision-makers can pinpoint areas of strength and weakness in their business and eventually adopt more effective management strategies and business decisions.
  • Diagnostic analytics: It is a form of advanced analytics that examines data or content to answer the question, “why did it happen?” The goal of diagnostic analytics is to help find the root cause of a problem. For example, the human resource management department of a business organization may use diagnostic analytics to find the best applicant for a position, select them, and compare them to applicants for similar positions to see how well they perform. In a healthcare example, it might help to figure out whether patients’ symptoms such as high fever, dry cough, headache, fatigue, etc. are all caused by the same infectious agent. Overall, diagnostic analytics enables one to extract value from the data by posing the right questions and conducting in-depth investigations into the answers. It is characterized by techniques such as drill-down, data discovery, data mining, and correlations.
  • Predictive analytics: Predictive analytics is an important analytical technique used by many organizations for various purposes such as to assess business risks, anticipate potential market patterns, and decide when maintenance is needed, to enhance their business. It is a form of advanced analytics that examines data or content to answer the question, “what will happen in the future?” Thus, the primary goal of predictive analytics is to identify and typically answer this question with a high degree of probability. Data scientists can use historical data as a source to extract insights for building predictive models using various regression analyses and machine learning techniques, which can be used in various application domains for a better outcome. Companies, for example, can use predictive analytics to minimize costs by better anticipating future demand and changing output and inventory, banks and other financial institutions to reduce fraud and risks by predicting suspicious activity, medical specialists to make effective decisions through predicting patients who are at risk of diseases, retailers to increase sales and customer satisfaction through understanding and predicting customer preferences, manufacturers to optimize production capacity through predicting maintenance requirements, and many more. Thus predictive analytics can be considered as the core analytical method within the area of data science.
  • Prescriptive analytics: Prescriptive analytics focuses on recommending the best way forward with actionable information to maximize overall returns and profitability, which typically answer the question, “what action should be taken?” In business analytics, prescriptive analytics is considered the final step. For its models, prescriptive analytics collects data from several descriptive and predictive sources and applies it to the decision-making process. Thus, we can say that it is related to both descriptive analytics and predictive analytics, but it emphasizes actionable insights instead of data monitoring. In other words, it can be considered as the opposite of descriptive analytics, which examines decisions and outcomes after the fact. By integrating big data, machine learning, and business rules, prescriptive analytics helps organizations to make more informed decisions to produce results that drive the most successful business decisions.

In summary, to clarify what happened and why it happened, both descriptive analytics and diagnostic analytics look at the past. Predictive analytics and prescriptive analytics use historical data to forecast what will happen in the future and what steps should be taken to influence those outcomes. In Table 1, we have summarized these analytics methods with examples. Forward-thinking organizations in the real world can jointly use these analytical methods to make smart decisions that help drive changes and improvements in business processes. In the following, we discuss how machine learning techniques can play a big role in these analytical methods through their learning capabilities from the data.

Table 1 Various types of analytical methods with examples

Machine Learning Based Analytical Modeling

In this section, we briefly discuss various advanced analytics methods based on machine learning modeling, which can make the computing process smart through intelligent decision-making in a business process. Figure 3 shows the general structure of a machine learning-based predictive model, covering both the training and testing phases. In the following, we discuss a wide range of methods, such as regression and classification analysis, association rule analysis, time-series analysis, behavioral analysis, log analysis, and so on, within the scope of our study.

Fig. 3 A general structure of a machine learning based predictive model considering both the training and testing phase

Regression Analysis

In data science, one of the most common statistical approaches used for predictive modeling and data mining tasks is regression [ 38 ]. Regression analysis is a form of supervised machine learning that examines the relationship between a dependent variable (target) and independent variables (predictors) to predict continuous-valued output [ 105 , 117 ]. Equations (1), (2), and (3) [ 85 , 105 ] represent simple, multiple or multivariate, and polynomial regression respectively, where x represents the independent variable(s) and y is the predicted/target output mentioned above:

\( y = a + bx \)  (1)

\( y = a + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n \)  (2)

\( y = a + b_1 x + b_2 x^2 + \cdots + b_n x^n \)  (3)
Regression analysis is typically conducted for one of two purposes: to predict the value of the dependent variable for individuals for whom some knowledge of the explanatory variables is available, or to estimate the effect of an explanatory variable on the dependent variable, i.e., to find the causal relationship between the variables. Linear regression cannot fit non-linear data and may cause an underfitting problem; in that case, polynomial regression performs better, although it increases model complexity. Regularization techniques such as Ridge, Lasso, and Elastic-Net [ 85 , 105 ] can be used to optimize the linear regression model. Besides, support vector regression, decision tree regression, and random forest regression [ 85 , 105 ] can be used to build effective regression models depending on the problem type, e.g., non-linear tasks. Financial forecasting or prediction, cost estimation, trend analysis, marketing, time-series estimation, drug response modeling, etc. are some examples where regression models can be used to solve real-world problems in the domain of data science and analytics.
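As a brief illustration of the point about underfitting and regularization, the sketch below (synthetic data, not from the paper) contrasts a plain linear regression with a regularized polynomial model using scikit-learn.

```python
# Compare linear regression with a degree-2 polynomial model plus Ridge (L2) regularization.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 100).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 - x.ravel() + rng.normal(scale=0.5, size=100)  # non-linear target

linear = LinearRegression().fit(x, y)                       # tends to underfit this data
poly_ridge = make_pipeline(PolynomialFeatures(degree=2),    # polynomial feature expansion
                           Ridge(alpha=1.0)).fit(x, y)      # with L2 regularization

print("linear MSE     :", mean_squared_error(y, linear.predict(x)))
print("poly+ridge MSE :", mean_squared_error(y, poly_ridge.predict(x)))
```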

Classification Analysis

Classification is one of the most widely used and best-known data science processes. It is a form of supervised machine learning and refers to a predictive modeling problem in which a class label is predicted for a given example [ 38 ]. Spam identification, such as labeling email as ‘spam’ or ‘not spam’ in email service providers, is an example of a classification problem. Several forms of classification analysis exist in the area: binary classification, which refers to predicting one of two classes; multi-class classification, which involves predicting one of more than two classes; and multi-label classification, a generalization of multi-class classification in which an example may be assigned to several classes simultaneously [ 105 ].

Several popular classification techniques, such as k-nearest neighbors [ 5 ], support vector machines [ 55 ], naive Bayes [ 49 ], adaptive boosting [ 32 ], extreme gradient boosting [ 85 ], logistic regression [ 66 ], decision trees ID3 [ 92 ] and C4.5 [ 93 ], and random forests [ 13 ], exist to solve classification problems. Tree-based classification techniques, e.g., random forests built from multiple decision trees, often perform better than others on real-world problems due to their capability of producing logic rules [ 103 , 115 ]. Figure 4 shows an example of a random forest structure considering multiple decision trees. In addition, BehavDT, recently proposed by Sarker et al. [ 109 ], and IntrudTree [ 106 ] can be used for building effective classification or prediction models for the relevant tasks within the domain of data science and analytics.
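A short, illustrative sketch (not from the paper) comparing a single decision tree with a random forest on scikit-learn's built-in breast-cancer dataset follows; the dataset choice and hyperparameters are assumptions for demonstration only.

```python
# Tree vs. forest: fit both classifiers on the same split and compare test accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)

print("decision tree accuracy:", accuracy_score(y_te, tree.predict(X_te)))
print("random forest accuracy:", accuracy_score(y_te, forest.predict(X_te)))
```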

Fig. 4 An example of a random forest structure considering multiple decision trees

Cluster Analysis

Clustering is a form of unsupervised machine learning that is well-known in many data science application areas for statistical data analysis [ 38 ]. Clustering techniques usually search for structure inside a dataset and, when no classes have been identified in advance, group homogeneous cases together. This means that data points within a cluster are similar to each other and different from data points in other clusters. Overall, the purpose of cluster analysis is to sort data points into groups (or clusters) that are internally homogeneous and externally heterogeneous [ 105 ]. Clustering is often used to gain insight into how data is distributed in a given dataset or as a preprocessing phase for other algorithms. Data clustering, for example, assists with understanding customer shopping behavior, sales campaigns, and retention of consumers for retail businesses, anomaly detection, etc.

Many clustering algorithms with the ability to group data have been proposed in the machine learning and data science literature [ 98 , 138 , 141 ]. In our earlier paper Sarker et al. [ 105 ], we summarized these from several perspectives, such as partitioning methods, density-based methods, hierarchical methods, model-based methods, etc. In the literature, the popular K-means [ 75 ], K-medoids [ 84 ], CLARA [ 54 ], etc. are known as partitioning methods; DBSCAN [ 30 ], OPTICS [ 8 ], etc. are known as density-based methods; and single linkage [ 122 ], complete linkage [ 123 ], etc. are known as hierarchical methods. In addition, grid-based clustering methods, such as STING [ 134 ] and CLIQUE [ 2 ]; model-based clustering such as neural network learning [ 141 ], GMM [ 94 ], and SOM [ 18 , 104 ]; and constraint-based methods such as COP K-means [ 131 ] and CMWK-Means [ 25 ] are used in the area. Recently, Sarker et al. [ 111 ] proposed a hierarchical clustering method, BOTS [ 111 ], based on a bottom-up agglomerative technique for capturing users’ similar behavioral characteristics over time. The key benefit of agglomerative hierarchical clustering is that the tree-structured hierarchy it creates is more informative than an unstructured set of flat clusters, which can assist in better decision-making in relevant application areas in data science.
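The following is a small, illustrative sketch of a partitioning method (k-means) and a hierarchical agglomerative method on synthetic data with scikit-learn; the data and parameter choices are assumptions for demonstration only.

```python
# Partitioning vs. hierarchical clustering on synthetic blob data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=3, linkage="complete").fit_predict(X)

print("k-means cluster sizes      :", [int((kmeans_labels == c).sum()) for c in range(3)])
print("agglomerative cluster sizes:", [int((agglo_labels == c).sum()) for c in range(3)])
```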

Association Rule Analysis

Association rule learning is a rule-based machine learning approach, an unsupervised learning method that is typically used to establish relationships among variables. It is a descriptive technique often used to analyze large datasets to discover interesting relationships or patterns. The association rule learning technique’s main strength is its comprehensiveness, as it produces all associations that satisfy user-specified constraints, including minimum support and confidence values [ 138 ].

Association rules allow a data scientist to identify trends, associations, and co-occurrences between data sets inside large data collections. In a supermarket, for example, associations infer knowledge about the buying behavior of consumers for different items, which helps to adjust marketing and sales plans. In healthcare, physicians may use association rules to better diagnose patients. Doctors can assess the conditional likelihood of a given illness by comparing symptom associations in the data from previous cases using association rules and machine learning-based data analysis. Similarly, association rules are useful for consumer behavior analysis and prediction, customer market analysis, bioinformatics, weblog mining, recommendation systems, etc.

Several types of association rules have been proposed in the area, such as frequent pattern based [ 4 , 47 , 73 ], logic-based [ 31 ], tree-based [ 39 ], fuzzy-rules [ 126 ], belief rule [ 148 ] etc. The rule learning techniques such as AIS [ 3 ], Apriori [ 4 ], Apriori-TID and Apriori-Hybrid [ 4 ], FP-Tree [ 39 ], Eclat [ 144 ], RARM [ 24 ] exist to solve the relevant business problems. Apriori [ 4 ] is the most commonly used algorithm for discovering association rules from a given dataset among the association rule learning techniques [ 145 ]. The recent association rule-learning technique ABC-RuleMiner proposed in our earlier paper by Sarker et al. [ 113 ] could give significant results in terms of generating non-redundant rules that can be used for smart decision making according to human preferences, within the area of data science applications.
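To make the support/confidence constraints concrete, here is a toy, pure-Python sketch of pair-wise association rule mining over a made-up market-basket dataset; the transactions and thresholds are invented for illustration and the sketch is not any of the algorithms cited above.

```python
# Enumerate item pairs and keep rules meeting minimum support and confidence.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
min_support, min_confidence = 0.4, 0.6
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

items = sorted({i for t in transactions for i in t})
for a, b in combinations(items, 2):
    pair_sup = support({a, b})
    if pair_sup >= min_support:
        for lhs, rhs in ((a, b), (b, a)):
            conf = pair_sup / support({lhs})
            if conf >= min_confidence:
                print(f"{lhs} -> {rhs}  support={pair_sup:.2f}  confidence={conf:.2f}")
```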

Time-Series Analysis and Forecasting

A time series is typically a series of data points indexed in time order, particularly by date or timestamp [ 111 ]. Depending on the frequency, a time series can be of different types, such as annual (e.g., annual budget), quarterly (e.g., expenditure), monthly (e.g., air traffic), weekly (e.g., sales quantity), daily (e.g., weather), hourly (e.g., stock price), minute-wise (e.g., inbound calls in a call center), and even second-wise (e.g., web traffic), depending on the domain.

A mathematical method dealing with such time-series data, or the procedure of fitting a time series to a proper model, is termed time-series analysis. Many different time-series forecasting algorithms and analysis methods can be applied to extract the relevant information. For instance, to forecast future patterns, the autoregressive (AR) model [ 130 ] learns the behavioral trends or patterns of past data. The moving average (MA) model [ 40 ] is another simple and common form of smoothing used in time-series analysis and forecasting that uses past forecast errors in a regression-like model to elaborate an averaged trend across the data. The autoregressive moving average (ARMA) model [ 12 , 120 ] combines these two approaches, where the autoregressive part extracts the momentum and pattern of the trend and the moving average part captures the noise effects. The most popular and frequently used time-series model is the autoregressive integrated moving average (ARIMA) model [ 12 , 120 ]. The ARIMA model, a generalization of the ARMA model, is more flexible than other statistical models such as exponential smoothing or simple linear regression. In terms of data, the ARMA model can only be used for stationary time series, while the ARIMA model also handles non-stationary series. Similarly, the seasonal autoregressive integrated moving average (SARIMA), autoregressive fractionally integrated moving average (ARFIMA), and autoregressive moving average model with exogenous inputs (ARMAX) are also used as time-series models [ 120 ].
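A brief sketch of fitting an ARIMA model with the statsmodels library follows; the synthetic monthly series and the (1, 1, 1) order are assumptions for illustration, not a recommendation from the paper.

```python
# Fit an ARIMA(p, d, q) model to a synthetic monthly series and forecast six steps ahead.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(3)
index = pd.date_range("2020-01-01", periods=48, freq="MS")
series = pd.Series(np.cumsum(rng.normal(1.0, 2.0, 48)), index=index)  # trending (non-stationary) series

model = ARIMA(series, order=(1, 1, 1))   # (p, d, q): AR order, differencing, MA order
fitted = model.fit()
print(fitted.forecast(steps=6))          # forecast the next six months
```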

In addition to the stochastic methods for time-series modeling and forecasting, machine and deep learning-based approaches can be used for effective time-series analysis and forecasting. For instance, in our earlier paper, Sarker et al. [ 111 ] present a bottom-up clustering-based time-series analysis to capture the mobile usage behavioral patterns of users. Figure 5 shows an example of producing aggregate time segments Seg_i from initial time slices TS_i based on similar behavioral characteristics, as used in our bottom-up clustering approach, where D represents the dominant behavior BH_i of the users [ 111 ]. The authors in [ 118 ] used a long short-term memory (LSTM) model, a kind of recurrent neural network (RNN) deep learning model, for time-series forecasting that outperforms traditional approaches such as the ARIMA model. Time-series analysis is commonly used these days in various fields such as finance, manufacturing, business, social media, event data (e.g., clickstreams and system events), IoT and smartphone data, and generally in any temporal measurement domain in applied science and engineering. Thus, it covers a wide range of application areas in data science.

Fig. 5 An example of producing aggregate time segments from initial time slices based on similar behavioral characteristics

Opinion Mining and Sentiment Analysis

Sentiment analysis or opinion mining is the computational study of the opinions, thoughts, emotions, assessments, and attitudes of people towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes [ 71 ]. There are three kinds of sentiments: positive, negative, and neutral, along with more extreme feelings such as angry, happy and sad, or interested or not interested, etc. More refined sentiments to evaluate the feelings of individuals in various situations can also be found according to the problem domain.

Although the task of opinion mining and sentiment analysis is very challenging from a technical point of view, it is very useful in real-world practice. For instance, a business always aims to obtain the opinions of the public or customers about its products and services in order to refine business policy and make better business decisions. Sentiment analysis can thus help a business understand the social opinion of its brand, product, or service. Besides, potential customers want to know what existing consumers think about a service or product before they purchase or use it. Document level, sentence level, aspect level, and concept level are the possible levels of opinion mining in the area [ 45 ].

Several popular techniques such as lexicon-based including dictionary-based and corpus-based methods, machine learning including supervised and unsupervised learning, deep learning, and hybrid methods are used in sentiment analysis-related tasks [ 70 ]. To systematically define, extract, measure, and analyze affective states and subjective knowledge, it incorporates the use of statistics, natural language processing (NLP), machine learning as well as deep learning methods. Sentiment analysis is widely used in many applications, such as reviews and survey data, web and social media, and healthcare content, ranging from marketing and customer support to clinical practice. Thus sentiment analysis has a big influence in many data science applications, where public sentiment is involved in various real-world issues.
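As a minimal supervised sentiment-classification sketch (the labeled reviews are invented for illustration), the example below combines a TF-IDF text representation with logistic regression, one simple instance of the machine learning approach mentioned above.

```python
# Train a tiny sentiment classifier on toy reviews and predict on new text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = ["great product, works perfectly", "terrible quality, broke in a day",
           "really happy with this purchase", "worst service I have ever had",
           "excellent value and fast delivery", "awful, do not buy"]
labels = ["positive", "negative", "positive", "negative", "positive", "negative"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(reviews, labels)
print(clf.predict(["happy with the quality", "broke after one day"]))
```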

Behavioral Data and Cohort Analysis

Behavioral analytics is a recent trend that typically reveals new insights into e-commerce sites, online gaming, mobile and smartphone applications, IoT user behavior, and many more areas [ 112 ]. Behavioral analysis aims to understand how and why consumers or users behave, allowing accurate predictions of how they are likely to behave in the future. For instance, it allows advertisers to make the best offers to the right client segments at the right time. Behavioral analytics uses the large quantities of raw user event information gathered during sessions in which people use apps, games, or websites, including traffic data such as navigation paths, clicks, social media interactions, purchase decisions, and marketing responsiveness. In our earlier papers, Sarker et al. [ 101 , 111 , 113 ], we have discussed how to extract users’ phone usage behavioral patterns from real-life phone log data for various purposes.

In real-world scenarios, behavioral analytics is often used in e-commerce, social media, call centers, billing systems, IoT systems, political campaigns, and other applications to find opportunities for optimization and to achieve particular outcomes. Cohort analysis is a branch of behavioral analytics that involves studying groups of people over time to see how their behavior changes. For instance, it takes data from a given data set (e.g., an e-commerce website, web application, or online game) and separates it into related groups for analysis. Various machine learning techniques, such as behavioral data clustering [ 111 ], behavioral decision tree classification [ 109 ], and behavioral association rules [ 113 ], can be used in the area according to the goal. Besides, the concept of RecencyMiner, proposed in our earlier paper Sarker et al. [ 108 ], which takes into account recent behavioral patterns, could be effective when analyzing behavioral data, as such data is not static and changes over time in the real world.
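The following compact sketch illustrates the cohort idea with pandas on made-up event data: users are grouped by the month of their first activity, and the table counts how many remain active in each subsequent month. The column names and events are assumptions for demonstration only.

```python
# Build a simple retention-style cohort table from user event data.
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2, 3, 3, 4],
    "event_date": pd.to_datetime(["2021-01-05", "2021-02-10", "2021-01-20",
                                  "2021-02-01", "2021-03-15", "2021-02-03",
                                  "2021-03-20", "2021-03-07"]),
})

events["event_month"] = events["event_date"].dt.to_period("M")
events["cohort"] = events.groupby("user_id")["event_month"].transform("min")  # first-activity month
events["period"] = (events["event_month"] - events["cohort"]).apply(lambda d: d.n)  # months since signup

cohort_table = (events.groupby(["cohort", "period"])["user_id"]
                      .nunique()
                      .unstack(fill_value=0))
print(cohort_table)   # rows: signup cohort, columns: months since first activity
```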

Anomaly Detection or Outlier Analysis

Anomaly detection, also known as outlier analysis, is a data mining step that detects data points, events, and/or findings that deviate from the regularities or normal behavior of a dataset. Anomalies are usually referred to as outliers, abnormalities, novelties, noise, inconsistencies, irregularities, or exceptions [ 63 , 114 ]. Anomaly detection techniques may identify new situations or cases as deviant based on historical data by analyzing the data patterns. For instance, identifying fraudulent or irregular transactions in finance is an example of anomaly detection.

Anomaly detection is often used in preprocessing tasks for the removal of anomalous or inconsistent values in real-world data collected from various sources, including user logs, devices, networks, and servers. Several machine learning techniques can be used for anomaly detection, such as k-nearest neighbors, isolation forests, cluster analysis, etc. [ 105 ]. The exclusion of anomalous data from the dataset also results in a statistically significant improvement in accuracy during supervised learning [ 101 ]. However, extracting appropriate features, identifying normal behaviors, managing imbalanced data distributions, addressing variations in abnormal behavior or irregularities, the sparse occurrence of abnormal events, environmental variations, etc. could be challenging in the process of anomaly detection. Anomaly detection is applicable in a variety of domains such as cybersecurity analytics, intrusion detection, fraud detection, fault detection, health analytics, identifying irregularities, detecting ecosystem disturbances, and many more. It can thus be considered a significant task for building effective systems with higher accuracy within the area of data science.
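A short sketch of unsupervised anomaly detection on synthetic data with an isolation forest, one of the techniques listed above, follows; the data and the contamination rate are assumptions for illustration.

```python
# Flag injected outliers in a 2-D point cloud with IsolationForest.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(5)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # regular observations
outliers = rng.uniform(low=6.0, high=9.0, size=(5, 2))   # injected anomalies
X = np.vstack([normal, outliers])

detector = IsolationForest(contamination=0.03, random_state=5).fit(X)
labels = detector.predict(X)                              # -1 = anomaly, 1 = normal
print("flagged anomalies:", int((labels == -1).sum()))
```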

Factor Analysis

Factor analysis is a collection of techniques for describing the relationships or correlations between variables in terms of more fundamental entities known as factors [ 23 ]. It’s usually used to organize variables into a small number of clusters based on their common variance, where mathematical or statistical procedures are used. The goals of factor analysis are to determine the number of fundamental influences underlying a set of variables, calculate the degree to which each variable is associated with the factors, and learn more about the existence of the factors by examining which factors contribute to output on which variables. The broad purpose of factor analysis is to summarize data so that relationships and patterns can be easily interpreted and understood [ 143 ].

Exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) are the two most popular factor analysis techniques. EFA seeks to discover complex trends by analyzing the dataset and testing predictions, while CFA tries to validate hypotheses and uses path analysis diagrams to represent variables and factors [ 143 ]. Factor analysis is also one of the unsupervised machine learning algorithms used for dimensionality reduction. The most common methods for factor analysis are principal components analysis (PCA), principal axis factoring (PAF), and maximum likelihood (ML) [ 48 ]. Correlation analysis methods such as Pearson correlation, canonical correlation, etc. may also be useful in the field as they can quantify the statistical relationship, or association, between two continuous variables. Factor analysis is commonly used in finance, marketing, advertising, product management, psychology, and operations research, and thus can be considered another significant analytical method within the area of data science.
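The brief sketch below applies factor analysis and PCA to scikit-learn's built-in iris measurements to extract two latent components; the dataset and the choice of two components are illustrative assumptions, not prescriptions from the paper.

```python
# Reduce four standardized iris features to two latent components.
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis, PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

fa = FactorAnalysis(n_components=2, random_state=0).fit(X_std)
pca = PCA(n_components=2).fit(X_std)

print("factor loadings shape        :", fa.components_.shape)       # (2 factors, 4 variables)
print("PCA explained variance ratio :", pca.explained_variance_ratio_)
```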

Log Analysis

Logs are commonly used in system management, as logs are often the only data available that record detailed system runtime activities or behaviors in production [ 44 ]. Log analysis can thus be considered the process of analyzing, interpreting, and understanding computer-generated records or messages, also known as logs. These can be device logs, server logs, system logs, network logs, event logs, audit trails, audit records, etc. The process of creating such records is called data logging.

Logs are generated by a wide variety of programmable technologies, including networking devices, operating systems, software, and more. Phone call logs [ 88 , 110 ], SMS logs [ 28 ], mobile app usage logs [ 124 , 149 ], notification logs [ 77 ], game logs [ 82 ], context logs [ 16 , 149 ], web logs [ 37 ], smartphone life logs [ 95 ], etc. are some examples of log data for smartphone devices. The main characteristic of such log data is that it contains users’ actual behavioral activities with their devices. Other similar log data include search logs [ 50 , 133 ], application logs [ 26 ], server logs [ 33 ], network logs [ 57 ], event logs [ 83 ], and network and security logs [ 142 ], etc.

Several techniques such as classification and tagging, correlation analysis, pattern recognition methods, anomaly detection methods, machine learning modeling, etc. [ 105 ] can be used for effective log analysis. Log analysis can assist in compliance with security policies and industry regulations, as well as provide a better user experience by encouraging the troubleshooting of technical problems and identifying areas where efficiency can be improved. For instance, web servers use log files to record data about website visitors. Windows event log analysis can help an investigator draw a timeline based on the logging information and the discovered artifacts. Overall, advanced analytics methods by taking into account machine learning modeling can play a significant role to extract insightful patterns from these log data, which can be used for building automated and smart applications, and thus can be considered as a key working area in data science.
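A tiny log-analysis sketch follows (the log lines are invented): it parses server-style records with a regular expression and summarizes message levels, a common first step before applying the classification, pattern-recognition, or anomaly-detection methods mentioned above.

```python
# Parse simple log records and count severity levels.
import re
from collections import Counter

log_lines = [
    "2021-06-01 10:02:11 INFO  user=42 action=login",
    "2021-06-01 10:02:15 ERROR user=42 action=payment timeout",
    "2021-06-01 10:03:01 INFO  user=7  action=login",
    "2021-06-01 10:05:42 WARN  user=7  action=retry",
]

pattern = re.compile(r"^(\S+ \S+) (INFO|WARN|ERROR)\s+(.*)$")
parsed = [pattern.match(line).groups() for line in log_lines]   # (timestamp, level, message)

level_counts = Counter(level for _, level, _ in parsed)
print(level_counts)                                             # e.g. Counter({'INFO': 2, ...})
print([msg for _, lvl, msg in parsed if lvl == "ERROR"])        # drill into error messages
```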

Neural Networks and Deep Learning Analysis

Deep learning is a form of machine learning that uses artificial neural networks to create a computational architecture that learns from data by combining multiple processing layers, such as the input, hidden, and output layers [ 38 ]. The key benefit of deep learning over conventional machine learning methods is that it performs better in a variety of situations, particularly when learning from large datasets [ 114 , 140 ].

The most common deep learning algorithms are the multi-layer perceptron (MLP) [ 85 ], the convolutional neural network (CNN or ConvNet) [ 67 ], and the long short-term memory recurrent neural network (LSTM-RNN) [ 34 ]. Figure 6 shows the structure of an artificial neural network model with multiple processing layers. The backpropagation technique [ 38 ] is used to adjust the weight values internally while building the model. Convolutional neural networks (CNNs) [ 67 ] improve on the design of traditional artificial neural networks (ANNs) by including convolutional layers, pooling layers, and fully connected layers. CNNs are commonly used in a variety of fields, including natural language processing, speech recognition, image processing, and other autocorrelated data, since they take advantage of the two-dimensional (2D) structure of the input data. AlexNet [ 60 ], Xception [ 21 ], Inception [ 125 ], Visual Geometry Group (VGG) [ 42 ], ResNet [ 43 ], and other advanced deep learning models based on CNNs are also used in the field.

Fig. 6 A structure of an artificial neural network modeling with multiple processing layers

In addition to CNN, recurrent neural network (RNN) architecture is another popular method used in deep learning. Long short-term memory (LSTM) is a popular type of recurrent neural network architecture used broadly in the area of deep learning. Unlike traditional feed-forward neural networks, LSTM has feedback connections. Thus, LSTM networks are well-suited for analyzing and learning sequential data, such as classifying, sorting, and predicting data based on time-series data. Therefore, when the data is in a sequential format, such as time, sentence, etc., LSTM can be used, and it is widely used in the areas of time-series analysis, natural language processing, speech recognition, and so on.
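A minimal sequence-model sketch follows, assuming a TensorFlow/Keras setup and synthetic data; it shows an LSTM classifying short sequences in the spirit of the discussion above, not a production model or an architecture from the paper.

```python
# Small LSTM classifier on synthetic sequences (label: is the sequence sum positive?).
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(9)
X = rng.normal(size=(500, 20, 1))                  # 500 sequences, 20 time steps, 1 feature
y = (X.sum(axis=(1, 2)) > 0).astype("float32")     # binary target derived from the sequence

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20, 1)),
    tf.keras.layers.LSTM(16),                      # recurrent layer with feedback connections
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=32, verbose=0)
print("training accuracy:", model.evaluate(X, y, verbose=0)[1])
```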

In addition to the most popular deep learning methods mentioned above, several other deep learning approaches [ 104 ] exist in the field for various purposes. The self-organizing map (SOM) [ 58 ], for example, uses unsupervised learning to represent high-dimensional data as a 2D grid map, thereby reducing dimensionality. Another technique commonly used for dimensionality reduction and feature extraction in unsupervised learning tasks is the autoencoder (AE) [ 10 ]. Restricted Boltzmann machines (RBMs) can be used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling [ 46 ]. A deep belief network (DBN) is usually composed of unsupervised networks, such as restricted Boltzmann machines (RBMs) or autoencoders, and a backpropagation neural network (BPNN) [ 136 ]. A generative adversarial network (GAN) [ 35 ] is a deep learning network that can produce data with characteristics similar to the input data. Transfer learning, which is usually the re-use of a pre-trained model on a new problem, is now widely used because it can train deep neural networks with a small amount of data [ 137 ]. These deep learning methods can perform well, particularly when learning from large-scale datasets [ 105 , 140 ]. In our previous article, Sarker et al. [ 104 ], we have provided a brief discussion of the various artificial neural network (ANN) and deep learning (DL) models mentioned above, which can be used in a variety of data science and analytics tasks.

Real-World Application Domains

Almost every industry or organization is impacted by data, and thus “data science”, including advanced analytics with machine learning modeling, can be used in business, marketing, finance, IoT systems, cybersecurity, urban management, health care, government policies, and every possible industry where data is generated. In the following, we discuss the ten most popular application areas based on data science and analytics.

  • Business or financial data science: In general, business data science can be considered the study of business or e-commerce data to obtain insights about a business that can typically lead to smart decision-making as well as high-quality actions [ 90 ]. Data scientists can develop algorithms or data-driven models that predict customer behavior and identify patterns and trends based on historical business data, which can help companies reduce costs, improve service delivery, and generate recommendations for better decision-making. Eventually, business automation, intelligence, and efficiency can be achieved through the data science process discussed earlier, where various advanced analytics methods and machine learning modeling based on the collected data are the keys. Many online retailers, such as Amazon [ 76 ], can improve inventory management, avoid out-of-stock situations, and optimize logistics and warehousing using predictive modeling based on machine learning techniques [ 105 ]. In finance, historical data helps financial institutions make high-stakes business decisions and is mostly used for risk management, fraud prevention, credit allocation, customer analytics, personalized services, algorithmic trading, etc. Overall, data science methodologies can play a key role in the next generation of the business and finance industry, particularly in terms of business automation, intelligence, and smart decision-making and systems.
  • Manufacturing or industrial data science: To compete in global production capability, quality, and cost, manufacturing industries have gone through many industrial revolutions [ 14 ]. The latest fourth industrial revolution, also known as Industry 4.0, is the emerging trend of automation and data exchange in manufacturing technology. Thus industrial data science, which is the study of industrial data to obtain insights that can typically lead to optimizing industrial applications, can play a vital role in such revolution. Manufacturing industries generate a large amount of data from various sources such as sensors, devices, networks, systems, and applications [ 6 , 68 ]. The main categories of industrial data include large-scale data devices, life-cycle production data, enterprise operation data, manufacturing value chain sources, and collaboration data from external sources [ 132 ]. The data needs to be processed, analyzed, and secured to help improve the system’s efficiency, safety, and scalability. Data science modeling thus can be used to maximize production, reduce costs and raise profits in manufacturing industries.
  • Medical or health data science: Healthcare is one of the most notable fields where data science is making major improvements. Health data science involves the extrapolation of actionable insights from sets of patient data, typically collected from electronic health records. To help organizations improve the quality of treatment, lower the cost of care, and improve the patient experience, data can be obtained from several sources, e.g., electronic health records, billing claims, cost estimates, and patient satisfaction surveys, and then analyzed. In reality, healthcare analytics using machine learning modeling can minimize medical costs, predict infectious outbreaks, prevent preventable diseases, and generally improve the quality of life [ 81 , 119 ]. Across the global population, the average human lifespan is growing, presenting new challenges to today’s methods of care delivery. Thus, health data science modeling can play a role in analyzing current and historical data to predict trends, improve services, and even better monitor the spread of diseases. Eventually, it may lead to new approaches for improving patient care, clinical expertise, diagnosis, and management.
  • IoT data science: Internet of things (IoT) [ 9 ] is a revolutionary technical field that turns every electronic system into a smarter one and is therefore considered to be the big frontier that can enhance almost all activities in our lives. Machine learning has become a key technology for IoT applications because it uses expertise to identify patterns and generate models that help predict future behavior and events [ 112 ]. One of the IoT’s main fields of application is a smart city, which uses technology to improve city services and citizens’ living experiences. For example, using the relevant data, data science methods can be used for traffic prediction in smart cities, to estimate the total usage of energy of the citizens for a particular period. Deep learning-based models in data science can be built based on a large scale of IoT datasets [ 7 , 104 ]. Overall, data science and analytics approaches can aid modeling in a variety of IoT and smart city services, including smart governance, smart homes, education, connectivity, transportation, business, agriculture, health care, and industry, and many others.
  • Cybersecurity data science: Cybersecurity, or the practice of defending networks, systems, hardware, and data from digital attacks, is one of the most important fields of Industry 4.0 [ 114 , 121 ]. Data science techniques, particularly machine learning, have become a crucial cybersecurity technology that continually learns to identify trends by analyzing data, better detecting malware in encrypted traffic, finding insider threats, predicting where bad neighborhoods are online, keeping people safe while surfing, or protecting information in the cloud by uncovering suspicious user activity [ 114 ]. For instance, machine learning and deep learning-based security modeling can be used to effectively detect various types of cyberattacks or anomalies [ 103 , 106 ]. To generate security policy rules, association rule learning can play a significant role to build rule-based systems [ 102 ]. Deep learning-based security models can perform better when utilizing the large scale of security datasets [ 140 ]. Thus data science modeling can enable professionals in cybersecurity to be more proactive in preventing threats and reacting in real-time to active attacks, through extracting actionable insights from the security datasets.
  • Behavioral data science: Behavioral data is information produced as a result of activities, most commonly commercial behavior, performed on a variety of Internet-connected devices, such as a PC, tablet, or smartphone [ 112 ]. Websites, mobile applications, marketing automation systems, call centers, help desks, and billing systems, etc. are all common sources of behavioral data. Behavioral data is much more than just data; it is not static [ 108 ]. Advanced analytics of such data, including machine learning modeling, can facilitate several areas, such as predicting future sales trends and product recommendations in e-commerce and retail; predicting usage trends, load, and user preferences in future releases in online gaming; determining how users use an application to predict future usage and preferences in application development; breaking users down into similar groups to gain a more focused understanding of their behavior in cohort analysis; and detecting compromised credentials and insider threats by locating anomalous behavior, making suggestions, etc. Overall, behavioral data science modeling typically makes it possible to present the right offers to the right consumers at the right time on various common platforms such as e-commerce platforms, online games, web and mobile applications, and IoT. In a social context, analyzing human behavioral data using advanced analytics methods, and using the insights extracted from social data for data-driven intelligent social services, can be considered social data science.
  • Mobile data science: Today’s smart mobile phones are considered “next-generation, multi-functional cell phones that facilitate data processing, as well as enhanced wireless connectivity” [ 146 ]. In our earlier paper [ 112 ], we have shown that users’ interest in “Mobile Phones” has grown more in recent years than their interest in other platforms such as “Desktop Computer”, “Laptop Computer”, or “Tablet Computer”. People use smartphones for a variety of activities, including e-mailing, instant messaging, online shopping, Internet surfing, entertainment, social media such as Facebook, Linkedin, and Twitter, and various IoT services such as smart cities, health, and transportation services, and many others. Intelligent apps are based on the insights extracted from the relevant datasets, depending on app characteristics such as being action-oriented, adaptive in nature, suggestive and decision-oriented, data-driven, context-aware, and capable of cross-platform operation [ 112 ]. As a result, mobile data science, which involves gathering a large amount of mobile data from various sources and analyzing it using machine learning techniques to discover useful insights or data-driven trends, can play an important role in the development of intelligent smartphone applications.
  • Multimedia data science: Over the last few years, a big data revolution in multimedia management systems has resulted from the rapid and widespread use of multimedia data, such as image, audio, video, and text, as well as the ease of access and availability of multimedia sources. Currently, multimedia sharing websites, such as Yahoo Flickr, iCloud, and YouTube, and social networks such as Facebook, Instagram, and Twitter, are considered as valuable sources of multimedia big data [ 89 ]. People, particularly younger generations, spend a lot of time on the Internet and social networks to connect with others, exchange information, and create multimedia data, thanks to the advent of new technology and the advanced capabilities of smartphones and tablets. Multimedia analytics deals with the problem of effectively and efficiently manipulating, handling, mining, interpreting, and visualizing various forms of data to solve real-world problems. Text analysis, image or video processing, computer vision, audio or speech processing, and database management are among the solutions available for a range of applications including healthcare, education, entertainment, and mobile devices.
  • Smart cities or urban data science: Today, more than half of the world’s population lives in urban areas or cities [ 80 ], which are considered drivers or hubs of economic growth, wealth creation, well-being, and social activity [ 96 , 116 ]. In addition to cities, “urban area” can refer to surrounding areas such as towns, conurbations, or suburbs. Thus, a large amount of data documenting the daily events, perceptions, thoughts, and emotions of citizens is recorded, which can be loosely categorized into personal data, e.g., household, education, employment, health, immigration, crime, etc.; proprietary data, e.g., banking, retail, and online platform data; government data, e.g., citywide crime statistics or data from government institutions; open and public data, e.g., data.gov and ordnance survey data; and organic and crowdsourced data, e.g., user-generated web data, social media, Wikipedia, etc. [ 29 ]. The field of urban data science typically focuses on providing more effective solutions from a data-driven perspective by extracting knowledge and actionable insights from such urban data. Advanced analytics of these data using machine learning techniques [ 105 ] can facilitate the efficient management of urban areas, including real-time management, e.g., traffic flow management; evidence-based planning decisions that pertain to the longer-term strategic role of forecasting for urban planning, e.g., crime prevention, public safety, and security; or framing the future, e.g., political decision-making [ 29 ]. Overall, it can contribute to government and public planning, as well as relevant sectors including retail, financial services, mobility, health, policing, and utilities within a data-rich urban environment, through data-driven smart decision-making and policies, which lead to smart cities and improve the quality of human life.
  • Smart villages or rural data science: Rural areas or countryside are the opposite of urban areas, that include villages, hamlets, or agricultural areas. The field of rural data science typically focuses on making better decisions and providing more effective solutions that include protecting public safety, providing critical health services, agriculture, and fostering economic development from a data-driven perspective, through extracting knowledge and actionable insights from the collected rural data. Advanced analytics of rural data including machine learning [ 105 ] modeling can facilitate providing new opportunities for them to build insights and capacity to meet current needs and prepare for their futures. For instance, machine learning modeling [ 105 ] can help farmers to enhance their decisions to adopt sustainable agriculture utilizing the increasing amount of data captured by emerging technologies, e.g., the internet of things (IoT), mobile technologies and devices, etc. [ 1 , 51 , 52 ]. Thus, rural data science can play a very important role in the economic and social development of rural areas, through agriculture, business, self-employment, construction, banking, healthcare, governance, or other services, etc. that lead to smarter villages.

Overall, we can conclude that data science modeling can be used to help drive changes and improvements in almost every sector in our real-world life, where the relevant data is available to analyze. To gather the right data and extract useful knowledge or actionable insights from the data for making smart decisions is the key to data science modeling in any application domain. Based on our discussion on the above ten potential real-world application domains by taking into account data-driven smart computing and decision making, we can say that the prospects of data science and the role of data scientists are huge for the future world. The “Data Scientists” typically analyze information from multiple sources to better understand the data and business problems, and develop machine learning-based analytical modeling or algorithms, or data-driven tools, or solutions, focused on advanced analytics, which can make today’s computing process smarter, automated, and intelligent.

Challenges and Research Directions

Our study on data science and analytics, particularly data science modeling in “ Understanding data science modeling ”, advanced analytics methods and smart computing in “ Advanced analytics methods and smart computing ”, and real-world application areas in “ Real-world application domains ” open several research issues in the area of data-driven business solutions and eventual data products. Thus, in this section, we summarize and discuss the challenges faced and the potential research opportunities and future directions to build data-driven products.

  • Understanding the real-world business problems and associated data including nature, e.g., what forms, type, size, labels, etc., is the first challenge in the data science modeling, discussed briefly in “ Understanding data science modeling ”. This is actually to identify, specify, represent and quantify the domain-specific business problems and data according to the requirements. For a data-driven effective business solution, there must be a well-defined workflow before beginning the actual data analysis work. Furthermore, gathering business data is difficult because data sources can be numerous and dynamic. As a result, gathering different forms of real-world data, such as structured, or unstructured, related to a specific business issue with legal access, which varies from application to application, is challenging. Moreover, data annotation, which is typically the process of categorization, tagging, or labeling of raw data, for the purpose of building data-driven models, is another challenging issue. Thus, the primary task is to conduct a more in-depth analysis of data collection and dynamic annotation methods. Therefore, understanding the business problem, as well as integrating and managing the raw data gathered for efficient data analysis, may be one of the most challenging aspects of working in the field of data science and analytics.
  • The next challenge is the extraction of relevant and accurate information from the collected data mentioned above. The main focus of data scientists is typically to disclose, describe, represent, and capture data-driven intelligence for actionable insights from data. However, real-world data may contain many ambiguous values, missing values, outliers, and meaningless data [ 101 ]. The quality and availability of the data strongly affect the advanced analytics methods, including the machine and deep learning modeling discussed in “ Advanced analytics methods and smart computing ”. Thus, it is important to understand the real-world business scenario and the associated data, to determine whether, how, and why they are insufficient, missing, or problematic, and then to extend or redevelop existing methods, such as large-scale hypothesis testing or learning under inconsistency and uncertainty, to address the complexities in the data and business problems. Therefore, developing new techniques to effectively pre-process the diverse data collected from multiple sources, according to their nature and characteristics, could be another challenging task.
  • Understanding and selecting the appropriate analytical methods to extract useful insights for smart decision-making for a particular business problem is a main issue in the area of data science. The emphasis of advanced analytics is more on anticipating the use of data to detect patterns and determine what is likely to occur in the future. Basic analytics offers a general description of the data, while advanced analytics is a step forward, offering a deeper understanding of the data and enabling granular analysis. Thus, understanding the advanced analytics methods, especially machine and deep learning-based modeling, is the key. The traditional learning techniques mentioned in “ Advanced analytics methods and smart computing ” may not be directly applicable for the expected outcome in many cases. For instance, in a rule-based system, the traditional association rule learning technique [ 4 ] may produce redundant rules from the data that make the decision-making process complex and ineffective [ 113 ]. Thus, a scientific understanding of the learning algorithms, their mathematical properties, and how robust or fragile the techniques are to input data is needed. Therefore, a deeper understanding of the strengths and drawbacks of the existing machine and deep learning methods [ 38 , 105 ] for solving a particular business problem is needed, and consequently improving or optimizing the learning algorithms according to the data characteristics, or proposing new algorithms/techniques with higher accuracy, becomes a significant challenge for future generations of data scientists.
  • The traditional data-driven models or systems typically use a large amount of business data to generate data-driven decisions. In several application fields, however, the new trends are more likely to be interesting and useful for modeling and predicting the future than older ones. For example, smartphone user behavior modeling, IoT services, stock market forecasting, health or transport service, job market analysis, and other related areas where time-series and actual human interests or preferences are involved over time. Thus, rather than considering the traditional data analysis, the concept of RecencyMiner, i.e., recent pattern-based extracted insight or knowledge proposed in our earlier paper Sarker et al. [ 108 ] might be effective. Therefore, to propose the new techniques by taking into account the recent data patterns, and consequently to build a recency-based data-driven model for solving real-world problems, is another significant challenging issue in the area.
  • The most crucial task for a data-driven smart system is to create a framework that supports data science modeling discussed in “ Understanding data science modeling ”. As a result, advanced analytical methods based on machine learning or deep learning techniques can be considered in such a system to make the framework capable of resolving the issues. Besides, incorporating contextual information such as temporal context, spatial context, social context, environmental context, etc. [ 100 ] can be used for building an adaptive, context-aware, and dynamic model or framework, depending on the problem domain. As a result, a well-designed data-driven framework, as well as experimental evaluation, is a very important direction to effectively solve a business problem in a particular domain, as well as a big challenge for the data scientists.
  • There are several important application areas, such as autonomous cars, criminal justice, health care, recruitment, housing, human resource management, and public safety, where decisions made by models or AI agents have a direct effect on human lives. As a result, there is growing concern about whether these decisions can be trusted to be right, reasonable, ethical, personalized, accurate, robust, and secure, particularly in the context of adversarial attacks [ 104 ]. If we can explain a result in a meaningful way, then the model can be better trusted by the end-user. For machine-learned models, new trust properties yield new trade-offs, such as privacy versus accuracy, robustness versus efficiency, and fairness versus robustness. Therefore, incorporating trustworthy AI, particularly in data-driven or machine learning modeling, could be another challenging issue in the area.

In the above, we have summarized and discussed several challenges and the potential research opportunities and directions within the scope of our study in the area of data science and advanced analytics. Data scientists in academia and industry and researchers in the relevant areas have the opportunity to contribute to each issue identified above and to build effective data-driven models or systems that support smart decisions in the corresponding business domains.

In this paper, we have presented a comprehensive view of data science, including the various types of advanced analytical methods that can be applied to enhance the intelligence and capabilities of an application. We have also visualized the current popularity of data science and machine learning-based advanced analytical modeling and differentiated these from related terms used in the area, to establish the position of this paper. We then provided a thorough study of data science modeling and the various processing modules needed to extract actionable insights from data for a particular business problem and the eventual data product. In line with our goal, we briefly discussed how different data modules can play a significant role in a data-driven business solution through the data science process. We also summarized the various types of advanced analytical methods and outcomes, as well as the machine learning modeling needed to solve the associated business problems. This study's key contribution is thus the explanation of different advanced analytical methods and their applicability in various real-world data-driven application areas, including business, healthcare, cybersecurity, and urban and rural data science, taking into account data-driven smart computing and decision making.

Finally, within the scope of our study, we have outlined and discussed the challenges we faced, as well as possible research opportunities and future directions. The challenges identified provide promising research opportunities that can be explored with effective solutions to improve data-driven models and systems. Overall, we conclude that our study of advanced analytical solutions based on data science and machine learning methods points in a positive direction and can serve as a reference guide for future research and applications in the field of data science and its real-world applications, for both academia and industry professionals.

Declarations

The author declares no conflict of interest.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K N and M. Shivakumar.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


25 year trends in cancer incidence and mortality among adults aged 35-69 years in the UK, 1993-2018: retrospective secondary analysis

Linked editorial: Cancer trends in the UK

  • Jon Shelton, head of cancer intelligence 1,
  • Ewa Zotow, visiting lecturer (statistics) 2,
  • Lesley Smith, senior research fellow 3,
  • Shane A Johnson, senior data and research analyst 1,
  • Catherine S Thomson, service manager (cancer and adult screening) 4,
  • Amar Ahmad, principal statistician 1,
  • Lars Murdock, data analysis and research manager 1,
  • Diana Nagarwalla, data analysis and research manager 1,
  • David Forman, visiting professor of epidemiology 5
  • 1 Cancer Research UK, London, UK
  • 2 University College London, London, UK
  • 3 Leeds Institute of Clinical Trials Research, University of Leeds, Leeds, UK
  • 4 Public Health Scotland, Edinburgh, UK
  • 5 Faculty of Medicine and Health, University of Leeds, Leeds, UK
  • Correspondence to: J Shelton jon.shelton@cancer.org.uk
  • Accepted 19 January 2024

Objective To examine and interpret trends in UK cancer incidence and mortality for all cancers combined and for the most common cancer sites in adults aged 35-69 years.

Design Retrospective secondary data analysis.

Data sources Cancer registration data, cancer mortality and national population data from the Office for National Statistics, Public Health Wales, Public Health Scotland, Northern Ireland Cancer Registry, NHS England, and the General Register Office for Northern Ireland.

Setting 23 cancer sites were included in the analysis in the UK.

Participants Men and women aged 35-69 years diagnosed with or who died from cancer between 1993 and 2018.

Main outcome measures Change in cancer incidence and mortality age standardised rates over time.

Results The number of cancer cases in this age range rose by 57% for men (from 55 014 cases registered in 1993 to 86 297 in 2018) and by 48% for women (60 187 to 88 970) with age standardised rates showing average annual increases of 0.8% in both sexes. The increase in incidence was predominantly driven by increases in prostate (male) and breast (female) cancers. Without these two sites, all cancer trends in age standardised incidence rates were relatively stable. Trends for a small number of less common cancers showed concerning increases in incidence rates, for example, in melanoma skin, liver, oral, and kidney cancers. The number of cancer deaths decreased over the 25 year period, by 20% in men (from 32 878 to 26 322) and 17% in women (28 516 to 23 719); age standardised mortality rates reduced for all cancers combined by 37% in men (−2.0% per year) and 33% in women (−1.6% per year). The largest decreases in mortality were noted for stomach, mesothelioma, and bladder cancers in men and stomach and cervical cancers and non-Hodgkin lymphoma in women. Most incidence and mortality changes were statistically significant even when the size of change was relatively small.

Conclusions Cancer mortality fell substantially during the past 25 years in both men and women aged 35-69 years. This decline is likely a reflection of the successes in cancer prevention (eg, smoking prevention policies and cessation programmes), earlier detection (eg, screening programmes) and improved diagnostic tests, and more effective treatment. By contrast, increased prevalence of non-smoking risk factors is the likely cause of the observed increased incidence for a small number of specific cancers. This analysis also provides a benchmark for the following decade, which will include the impact of covid-19 on cancer incidence and outcomes.

Introduction

The availability of comprehensive cancer registration data across the UK since 1993 makes comparison of cancer incidence and mortality trends over 25 years possible. We examined UK trends in cancer incidence and mortality for men and women, aged 35-69 years, for all cancers combined and for the most common sites (or site groups) of cancer between 1993 and 2018.

This study focuses on the 35-69 years age group because cancer trend data are more reliable and easier to interpret in this age range. 1 Diagnostic accuracy is better in this age range than in older patients who have a greater proportion of clinical and uncertain diagnoses, as evidenced by the relatively low proportion of microscopically verified tumours, 2 especially in the earlier part of the period analysed. By the age of 35 years, the pattern of cancer broadly represents the usual adult profiles because specific cancers that are observed in childhood, adolescence, and young people would not impact on the overall pattern. Trends in the 35-69 years age group are also reflective of causal factors in the more recent and medium term past rather than in the longer term and, therefore, will be more indicative of future patterns of cancer in the older populations.

This time period has also seen the introduction of three population screening programmes across the UK, which have affected trends by diagnosing some cancers at an earlier stage, preventing some cancers, but also potentially diagnosing some cancers that would not otherwise have caused harm to the individual, particularly breast cancer. 3 4 Cervical smear tests have been used since the 1960s and the national screening programme was introduced in 1988, with over 85% coverage of the target population (women and people with a cervix aged 25-64 years) in the UK by 1994. 5 The breast screening programme was introduced in 1988 and covered all UK countries by the mid-1990s, with women aged 50-70 years being invited. 6 The bowel screening programme was introduced from 2006 and took a number of years to reach full roll-out. Currently, people aged 60-74 across England, Wales, and Northern Ireland, and 50-74 in Scotland, are eligible. Prostate specific antigen testing is not part of the national screening programme. Anyone older than 50 years with a prostate can request a prostate specific antigen test from their family doctor (general practitioner).

The past 25 years have seen differing trends in cancer risk factors, with the two most important risk factors moving in opposite directions. Smoking prevalence has fallen, helped by tax rises on tobacco products, further advertising bans, and smokefree policies, including education and support for quitting. In the other direction, the proportion of the population classified as overweight or obese has increased; diet and lack of exercise contribute to this, and both are also independent risk factors for cancer. 7

Cancer registration data are currently collected by four national registries in the UK. These organisations collect detailed information on newly diagnosed primary tumours, referred to as registrations. Prior to 2013, cancer registrations in England were collected by eight regional registries and compiled by the Office for National Statistics, 8 with these regional registries producing complete population coverage for England since 1971. 9 Cancer Research UK aggregate these data from the UK registries, with incidence, mortality, and corresponding national population data provided by the Office for National Statistics, Public Health Wales, 10 Public Health Scotland, 11 the Northern Ireland Cancer Registry, 12 NHS England, 13 and the General Register Office for Northern Ireland. 14 Coding of cancer registrations is consistent between countries of the UK, using internationally accepted codes from the International Classification of Diseases 10th revision (ICD-10) and collaboration through the UK and Ireland Association of Cancer Registries. 15

Cancer sites (for single sites) or site groups (with multiple sites, such as oral) included in these analyses were selected as the most common causes of cancer incidence or death. These cancer sites are: all cancers combined (excluding non-melanoma skin cancer for incidence) (C00-C97, excluding C44); bladder (C67); bowel (C18-C20); brain and central nervous system (C70-C72, C75.1-C75.3, D32-D33, D35.2-D35.4, D42-D43, D44.3-D44.5); breast (women only) (C50); cervix (C53); Hodgkin lymphoma (C81); kidney (C64-C66, C68); larynx (C32); leukaemia (C91-C95); liver (C22); lung (C33-C34); melanoma skin (C43); mesothelioma (C45); myeloma (C90); non-Hodgkin lymphoma (C82-C86); oesophagus (C15); lip, oral cavity, and pharynx (oral) (C00-C06, C09-C10, C12-C14); ovary (C56-C57.4); pancreas (C25); prostate (C61); stomach (C16); testis (C62); and uterus (C54-C55). In addition, sex specific all cancer groups are also presented without breast and prostate cancers to inspect the overall trends in the absence of the most common cancer site for each sex. Sex is reported as recorded by the cancer registries at the time of registration. Mesothelioma was a new specific code introduced in ICD-10 and no reliable mortality data are available for this site before 2001, hence, we have not included this type of cancer prior to then. Non-malignant brain and central nervous system codes (ICD-10 D codes) are included despite their benign nature because they can cause mortality due to their location in the cranial cavity. The codes included for the brain and central nervous system have been chosen following clinical engagement and discussion with cancer registries across the UK. Non-melanoma skin cancer is excluded for incidence data because of the lack of completeness in the recording of these cancers and therefore unreliability of the data; this process is standard practice among UK cancer registries. 16 A proportion of non-melanoma skin cancer cases can be diagnosed and treated within primary care and have not consistently been captured within cancer registration data. 17
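Because the site groups above are defined purely by lists of ICD-10 codes, they can be represented as a simple lookup table. The sketch below is illustrative only: it covers a subset of the groups quoted in the text, the helper for expanding three-character code ranges is hypothetical, and it is not the registries' own tooling (sub-codes such as C75.1 would need extra handling).

```python
# Illustrative mapping from a subset of the cancer site groups above to ICD-10 codes.
SITE_GROUPS = {
    "bladder": ["C67"],
    "bowel": ["C18-C20"],
    "breast (women)": ["C50"],
    "kidney": ["C64-C66", "C68"],
    "lung": ["C33-C34"],
    "melanoma skin": ["C43"],
    "oral": ["C00-C06", "C09-C10", "C12-C14"],
    "prostate": ["C61"],
    "stomach": ["C16"],
}

def expand(code_or_range: str) -> list[str]:
    """Expand a three-character ICD-10 range such as 'C18-C20' into ['C18', 'C19', 'C20']."""
    if "-" not in code_or_range or "." in code_or_range:
        return [code_or_range]
    start, end = code_or_range.split("-")
    prefix = start[0]
    return [f"{prefix}{i:02d}" for i in range(int(start[1:]), int(end[1:]) + 1)]

# Reverse lookup: ICD-10 code -> site group.
SITE_LOOKUP = {code: site for site, parts in SITE_GROUPS.items()
               for part in parts for code in expand(part)}

print(SITE_LOOKUP.get("C19"))  # -> 'bowel'
```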

To overcome yearly variation for sites with low numbers of cases, we calculated three-year rolling average age standardised rates per 100 000 population. 18 These rates were based on the European standard population 2013 for men and women separately for each cancer site or site group for both incidence and mortality, restricted to the 35-69 years age group. 19
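As a rough sketch of this calculation, the function below directly standardises crude rates to the European Standard Population 2013 weights for the 35-69 year age bands and then applies a three-year rolling average. The column names and data layout are assumptions, and the ESP2013 band weights quoted should be checked against the published standard before any real use.

```python
# A minimal sketch of direct age standardisation with a three-year rolling average,
# assuming case counts and populations are available by year and 5-year age band.
import pandas as pd

# Published ESP2013 weights for the 35-69 age bands (out of a full standard of 100,000);
# verify against the standard before reuse.
ESP2013_35_69 = {"35-39": 7000, "40-44": 7000, "45-49": 7000, "50-54": 7000,
                 "55-59": 6500, "60-64": 6000, "65-69": 5500}

def age_standardised_rate(df: pd.DataFrame) -> pd.Series:
    """df needs columns: year, age_band, cases, population (one row per year and band)."""
    df = df.assign(rate=df["cases"] / df["population"] * 100_000,
                   weight=df["age_band"].map(ESP2013_35_69))
    # Weighted average of band-specific rates, per year.
    asr = (df["rate"] * df["weight"]).groupby(df["year"]).sum() / sum(ESP2013_35_69.values())
    # Three-year rolling average to smooth yearly variation for low-count sites.
    return asr.rolling(window=3, center=True).mean()
```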

The estimated annual percentage change is commonly computed using a generalised linear regression model with Gaussian or Poisson link function. 18 20 In this analysis, a generalised linear model was performed with quasi-Poisson link function as overdispersion is very common when modelling rates and count data. 21 The outcome was the age standardised cancer (incidence or mortality) rate per 100 000 and the independent variable was the period variable, which was defined as the three year period for each data point, starting from 1993-95 and ending with 2016-18. Estimated annual percentage change was calculated as (exp(β̂) − 1) × 100, where β̂ is the estimated slope of the period variable, with corresponding 95% confidence interval derived from the fitted quasi-Poisson regression model. 22 The determination of trends was based on the following criteria: firstly, an increasing trend was identified when the estimated annual percentage change value and its 95% confidence interval were greater than zero. This value suggests a statistically significant increase in the age standardised rate over time. Secondly, a decreasing trend was indicated when both the estimated annual percentage change value and its 95% confidence interval were less than zero, signifying a statistically significant decline in the age standardised rate over the period considered. Finally, in cases where these conditions were not met, the age standardised rate was concluded to have remained relatively stable. This designation means that no significant change in the age standardised rate over the period examined was noted. These criteria ensure a thorough and precise interpretation of the estimated annual percentage change values and their corresponding trends. These analyses were carried out for each sex and site or site group separately. Statistical analysis was performed using R version 4.0.2. 23
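A minimal sketch of this calculation is shown below, assuming a vector of age standardised rates for consecutive three-year periods. It fits a Poisson GLM with Pearson chi-squared scaling (one common way to obtain quasi-Poisson standard errors) and converts the period slope to a percentage change. The original analysis was done in R, so this Python version only approximates the approach described, and the example rate series is synthetic.

```python
# Sketch of the estimated annual percentage change (EAPC) calculation described above.
import numpy as np
import statsmodels.api as sm

def eapc(asr: np.ndarray):
    """Return EAPC (%) and its 95% CI for a vector of age standardised rates."""
    period = np.arange(len(asr))
    X = sm.add_constant(period)
    # Poisson GLM with Pearson chi-squared scale: quasi-Poisson-style standard errors
    # that allow for overdispersion.
    fit = sm.GLM(asr, X, family=sm.families.Poisson()).fit(scale="X2")
    beta = fit.params[1]
    lo, hi = fit.conf_int()[1]

    def to_pct(b):
        return (np.exp(b) - 1) * 100  # EAPC = (exp(beta) - 1) x 100

    return to_pct(beta), (to_pct(lo), to_pct(hi))

# Example with an illustrative, synthetic rate series (not real registry data):
change, (ci_lo, ci_hi) = eapc(np.array([120.0, 122.5, 124.0, 127.1, 129.8, 131.0]))
print(f"EAPC = {change:.1f}% (95% CI {ci_lo:.1f} to {ci_hi:.1f})")
# Trend classification: "increasing" if both CI bounds > 0, "decreasing" if both < 0,
# otherwise "relatively stable".
```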

Patient and public involvement

This work uses aggregated and non-identifiable routine data that have been provided by patients and collected by the health services of the UK as part of their care and support. Given the aggregated nature of the data, attempts to identify or involve any of the patients whose data are included is not possible nor permitted. Although patients and the public were not involved in the design and conduct of this research, the aim of this research is to provide an assessment of trends in cancer incidence and mortality and the impacts of treatment and policy changes to improve outcomes for cancer patients across the UK. Dissemination to the public will include a press release and a summary published online, written using layman’s terms, and a webinar to discuss the results.

Table 1 and table 2 show the percentage of all newly diagnosed cancer cases and deaths by age group in 1993 and 2018. For male registrations, around 43% of all registrations were in the 35-69 years age group in 1993 and 2018, while for female registrations, between 47% and 48% of all registrations were in this age group in 1993 and 2018, respectively. For mortality, around 40% of male cancer deaths occurred in the 35-69 years age group in 1993 and this value was lower at 30% in 2018. For female cancer deaths, a slightly smaller reduction was noted, from 38% in the 35-69 years age group in 1993 to 31% in 2018.

Number of newly diagnosed cancer cases (% of total) in the UK for all cancers, excluding non-melanoma skin cancer, (ICD-10 C00-C97 excluding C44) by sex and age group in 1993 and 2018


Number of deaths (% of total) in the UK for all cancers, (ICD-10 C00-C97) by sex and age group in 1993 and 2018

Figure 1 shows the number of newly diagnosed cancer cases and deaths in the 35-69 years age group between 1993 and 2018 by sex. Across the UK, of cancer registrations in 2018, 83% were from England, 5.1% from Wales, 9.2% from Scotland, and 2.7% from Northern Ireland; for deaths in 2018, 81.4%, 5.3%, 10.4%, and 2.9% were from England, Wales, Scotland, and Northern Ireland, respectively. These proportions remained relatively stable over the study period. For men, the number of cancer registrations increased by 57% from 55 014 cases registered in 1993 to 86 297 cases registered in 2018, while for women, cases increased by 48% from 60 187 in 1993 to 88 970 in 2018. The rate of increase in the number of cases of cancer was more marked between 2003 and 2013 for both sexes than in other time periods in the study.

Fig 1

Number of newly diagnosed cancer cases and deaths in the UK for all cancers, excluding non-melanoma skin cancer for incidence (International Classification of Diseases (10th revision) codes C00-C97 (excluding C44 for incidence)), men and women, 35-69 years, 1993 to 2018. An interactive version of this graphic is available at https://bit.ly/4acPDjP


The number of cancer deaths in men and women aged 35-69 years decreased: by 20% in men from 32 878 in 1993 to 26 322 deaths in 2018 and by 17% in women from 28 516 in 1993 to 23 719 deaths in 2018. The main decrease in the number of deaths per year occurred before the year 2000 ( fig 1 ) with a decrease of 14% in males and 11% in females between 1993 and 2000. Since 2000, the number of deaths each year in both men and women has remained fairly constant ( fig 1 ).

Table 3 , table 4 , figure 2 and figure 3 , and figure 4 and figure 5 show the trends over time in both incidence and mortality rates by sex and cancer site or site group. The tables only include specific age standardised incidence and mortality rates for the first (1993-95) and last (2016-18) period to give an indication of the change over the 25 year period. The trends in incidence and mortality age standardised rates for all years are shown in the figures. Figure 6 and figure 7 show the age adjusted average annual percentage change in the rates. Between 1993-95 and 2016-18, the age standardised incidence rate for all cancers (excluding non-melanoma skin cancer) increased slightly in men and women with age adjusted annual increases of 0.8% for both sexes. The trends in prostate and breast cancer, as the two largest cancer sites in men and women, respectively, substantially contribute to the overall all sites trends for cancer incidence. Figure 3 shows the trends for each sex without the largest cancer site. In contrast to the male age standardised incidence rate for all cancers, which showed a general increase, the incidence trend for men for all cancers excluding non-melanoma skin and prostate cancer, showed a decrease before 2000, but very little change in the following period. For women, an increase in age standardised incidence rates for all cancers excluding non-melanoma skin and breast cancer is still observed but the rate of increase is lower, at 0.7% per annum on average, over the 25 year period. Over the same period reductions in age standardised mortality for all cancers, including non-melanoma skin cancer, were −2.0% per year in men and −1.6% in women. Exclusion of prostate cancer from the mortality trends for men had a negligible effect on the average annual percentage change. For women, the exclusion of breast cancer from mortality trends led to a smaller decrease in mortality of −1.3% per annum.

Age standardised* incidence and mortality rates in 1993-95 and 2016-18 and percentage change by cancer type, men aged 35-69 years, UK

Age standardised* incidence and mortality rates in 1993-95 and 2016-18 and percentage change by cancer type, women aged 35-69 years, UK

Fig 2

European 2013 population age standardised incidence and mortality rates in the UK for all cancers, 19 excluding non-melanoma skin cancer for incidence (International Classification of Diseases (10th revision) codes C00-C97 excluding C44 for incidence), men and women, 35-69 years, 1993-95 to 2016-18. An interactive version of this graphic is available at https://bit.ly/4a484aE

Fig 3

European 2013 population age standardised incidence and mortality rates in the UK for all cancers in men and women aged 35-69 years during 1993-95 to 2016-18, 19 excluding non-melanoma skin cancer for incidence, and breast cancer in women and prostate cancer in men were excluded for incidence and mortality (International Classification of Diseases (10th revision) codes C00-C97 excluding C44 for incidence, C50, C61). An interactive version of this graphic is available at https://bit.ly/3vakQoX

Fig 4

European 2013 age standardised incidence and mortality rates by year, 19 in the UK, for men and women aged 35-69 years from 1993-95 to 2016-18, by cancer site. An interactive version of this graphic is available at https://bit.ly/49a6ovn

Fig 5

Relative European 2013 age standardised incidence and mortality rates by year, 19 in the UK, for men and women aged 35-69 years from 1993-95 to 2016-18 (the reference year is 1993-95=100), by cancer site. CNS=central nervous system. An interactive version of this graphic is available at https://bit.ly/3PiKGOk

Fig 6

Average annual percentage change in incidence and mortality rates, in the UK, for men aged 35-69 years from 1993-95 to 2016-18 by cancer site. An interactive version of this graphic is available at https://bit.ly/3wMR6yU

Fig 7

Average annual percentage change in incidence and mortality rates, in the UK, for women aged 35-69 years, from 1993-95 to 2016-18, by cancer site. An interactive version of this graphic is available at https://bit.ly/3v0QdT7

Incidence rates varied over time across the different cancer sites and site groups. The largest average annual percentage increases over time for cancer incidence rates for men aged 35-69 years were for cancers of the liver (4.7%), prostate (4.2%), and melanoma skin cancer (4.2%). Increases of 1% or more per annum were also seen for oral cancer (3.4%), kidney cancer (2.7%), myeloma (1.6%), Hodgkin lymphoma (1.5%), testicular cancer (1.3%), non-Hodgkin lymphoma (1.0%), and leukaemia (1.0%). The largest annual decreases over the two decades were seen for stomach (−4.2%), bladder (−4.1%), and lung cancers (−2.1%), with decreases of more than 1% per annum also observed for mesothelioma (−1.9% from 2001 onwards) and laryngeal cancer (−1.5%).

For women, the largest average annual percentage increases in incidence rates were noted for liver (3.9%), melanoma skin (3.5%), and oral (3.3%) cancers with increases in incidence of more than 1% per annum also observed for kidney (2.9%), uterus (1.9%), brain and central nervous system cancers (1.8%), Hodgkin lymphoma (1.6%), myeloma (1.1%), and non-Hodgkin lymphoma (1.0%). The largest annual decreases were reported for bladder (−3.6%) and stomach (−3.1%) cancers while the only other site showing a decrease of more than 1% per annum was cervical cancer (−1.3%). Although breast cancer represents the largest individual cancer site for women and therefore plays a large part in all cancer trends, the average annual increase was only 0.9%. All the incidence changes mentioned, for both men and women, and most incidence changes shown in table 3 and table 4 and in figure 6 and figure 7 were statistically significant (P<0.05) even when the size of change was relatively small.

Mortality rates mainly decreased over time in both sexes. For men, the cancer sites that showed average annual percentage reductions in mortality rates of more than 1% per annum were stomach (−5.1%), mesothelioma (−4.2% from 2001), bladder (−3.2%), lung (−3.1%), non-Hodgkin lymphoma (−2.9%), testis (−2.8%), Hodgkin lymphoma (−2.6%), bowel (−2.5%), larynx (−2.5%), prostate (−1.8%), myeloma (−1.7%), and leukaemia (−1.6%). Only liver (3.0%) and oral (1.1%) cancers showed an average annual increase in mortality of 1% or more with melanoma skin cancer (0.3%) the only other site showing an increase. For women, the cancer sites with average annual decreases in mortality per year of 1% or more were stomach (−4.2%), cervix (−3.6%), non-Hodgkin lymphoma (−3.2%), breast (−2.8%), Hodgkin lymphoma (−2.8%), ovary (−2.8%), myeloma (−2.3%), bowel (−2.2%), leukaemia (−2.1%), larynx (−2.0%), mesothelioma (−2.0% since 2001), bladder (−1.6%), oesophagus (−1.2%), and kidney (−1.0%). As with men, liver (2.7%) and oral (1.2%) cancers showed average annual increases of more than 1%, in addition to uterine cancer (1.1%). For both men and women, the mortality changes mentioned previously and most mortality changes shown in table 3 and table 4 and in figure 6 and figure 7 were statistically significant (P<0.05), even when the size of change was relatively small.

Principal findings

The most striking finding in this analysis of UK cancer trends among the 35-69 years age group is the substantial decline in cancer mortality rates observed in both sexes (37% decline in men and 33% decline in women) across the period examined. A decrease in mortality was reported across nearly all the specific types of cancer examined (23 in total), with only liver, oral, and uterine cancers showing an increase together with melanoma skin cancer in men and pancreatic cancer in women, both showing small increases. By contrast, the incidence trends in this age group showed varying patterns with some sites increasing, some decreasing and some remaining relatively constant. Over all sites, a modest increase was noted in cancer incidence rates of around 0.8% per annum in both sexes, amounting to an increase of 15% in men and 16% in women over the 25 year time frame.

The increase in prostate cancer incidence over this period, especially in the 35-69 years age group considered here, is very likely to be a direct result of the uptake of prostate specific antigen testing, which results in the detection of early stage disease and, to an unknown extent, indolent disease that may otherwise never have been regarded as clinically significant. 24 25 The results do, however, affect people diagnosed and represent a large increase in workload for clinical staff. The fact that the overall mortality trends for men show no difference whether prostate cancer is included or excluded in the analysis indicates that the incidence increase for this cancer has largely been of non-fatal disease. That the specific mortality rates for prostate cancer showed an appreciable rate of decline during this time (–1.8% per annum) also indicates improved clinical treatment of the disease or an increase in the proportion of men diagnosed with a favourable prognosis, or both. 24 26 However, the increase in prostate cancer incidence still results in thousands of men each year dealing with the concerns of a cancer diagnosis and the impact this may have on their lives.

Breast cancer comprehensively dominated incidence and mortality trends in female cancer. Even though the average annual incidence increase of breast cancer over this period (0.9%) was modest in comparison to the prostate cancer increase in men (4.2%), breast cancer incidence rates remained substantially higher than those for any other cancer site in either sex. Inspection of figure 4 shows that breast cancer incidence rates (age standardised) increased at a faster rate until around 2003-05 (from 194.7 in 1993-95 to 229.9 in 2003-05), a slower rate from then until 2013-15 (240.8), but have levelled off in the most recent years analysed (238.0 in 2016-18). These changes in the incidence trend likely reflect a reduced effect of the initial incidence increases brought about by mammography screening in the UK introduced from the late 1980s or a possible effect of a decline in usage of hormone replacement treatment. 27 28 However, the effect of hormone replacement treatment on breast cancer risk is small in comparison to other risk factors, 7 and trends in this treatment have varied over time, such as changes in preferred formulations, doses, and treatment durations, 29 30 31 which may impact breast cancer risk levels. 32 33 As has been reported elsewhere, 34 35 36 mortality for breast cancer has declined substantially despite the incidence increase, which is indicative of improvements in early detection (including through screening 37 ) and improved treatment.

The other two major sites of cancer in men apart from prostate cancer, namely lung and bowel cancers, showed substantial reductions in mortality. These results are likely from primary prevention (historical reduction in smoking rates) 38 39 40 41 for lung cancer and earlier detection (including screening) and improved treatment for bowel cancer. 42 43 44 While lung cancer incidence substantially decreased, the incidence rates of bowel cancer remained unchanged. However, closer inspection of the bowel cancer incidence trends over the full period shows an increase from the point the bowel screening programme was first introduced from 2006 in the UK. This rate, however, has now decreased back to the observed level prior to the introduction of the screening programme. As others have shown, the introduction of bowel screening leads to an initial short-term increase in cancer incidence due to detection of as-yet undiagnosed cancer cases, followed by a decrease because of removal of adenomas. 42 45 46 Therefore, bowel cancer incidence trends can reasonably be assumed to decrease further over the coming years, unless other preventable risk factors for bowel cancer affect the trend.

Similarly, lung and bowel were the other two major cancer sites for women (alongside breast cancer), and both showed reductions in mortality. The decline in lung cancer mortality was, however, not as extensive as that for men (–0.5% compared with –3.1% per annum) likely reflecting the different demographic pattern in smoking rates that led to peak smoking prevalence in women occurring around 30 years later than men, albeit at around half the peak prevalence observed in men. 40 47 Smoking prevalence in women has always been lower than in men. 39 48 The lung cancer incidence trends showed a significant increase in women of 0.8% per annum as opposed to the –2.1% per annum decrease in men. That the incidence rate in 2016-18 was still higher in men than in women again is almost certainly a reflection of historical differences in smoking patterns. 39 49 50 Bowel cancer incidence in women followed a similar pattern to men and is equally reflective of the introduction of the bowel screening programme. Bowel cancer mortality in women has declined at a similar rate to men (–2.2% compared with –2.5% per annum), indicative of the same improvements in early detection and improved treatment.

These reductions in mortality across the most common cancers in both sexes are likely a representation of considerable success in cancer prevention, diagnosis, and treatment. Further improvements are likely to be realised from the continued reduction in smoking prevalence, of which smoking prevention policies continue to contribute, 51 alongside the recent move to faecal immunochemical testing in the bowel screening programme adopted throughout the UK during 2019. 52 The recommended rollout of targeted lung screening is expected to further help with the earlier diagnosis of lung cancer where surgery is a viable treatment option and outcomes are vastly improved. 53 54

Although four major sites influenced the overall pattern of cancer incidence and mortality, increases in rates among some of the less common sites do raise concerns. Four cancers showed substantial increases in incidence (more than 2% per annum) in both sexes: liver, melanoma skin, oral, and kidney cancers. All have strong associations with established risk factors: alcohol consumption, smoking, and HPV for oral cancer; 7 55 56 overweight and obesity, smoking, alcohol, and hepatitis B and C for liver cancer; 7 57 58 ultraviolet light for melanoma; 59 60 and obesity and smoking for kidney cancer. 61 62 63 Increases in liver cancer incidence and mortality for both men and women are very concerning, with nearly one in two attributable to modifiable risk factors. 7 With high prevalence of overweight and obesity and diabetes in the general population, other studies expect the rates to remain high. 64 For oral and kidney cancer, despite the association with smoking, incidence rates have not followed the decrease seen for lung cancer incidence in men. This is likely to be due to the smaller proportion of cases attributable to smoking in these two sites. Whilst smoking accounts for around 17% of oral cancers, over one in three are attributed to alcohol consumption. 7 For kidney cancer, smoking accounts for around 13% of cases whereas obesity causes around 25%; nevertheless, increasing trends in kidney mortality are shown for this age group and period. 7 Therefore, the increasing incidence trends could potentially have been worse, especially in men, if the reduction in smoking prevalence had not occurred. The increased incidence of melanoma skin cancer is likely to be caused by increased sunlight and ultraviolet exposure resulting from the availability of cheaper air travel to countries with a warmer climate and insufficient regulation of tanning beds until 2010. 65 66

In women, uterine cancer incidence increased by 1.9% per annum; although, this increase was predominantly seen over the period 1993-2007 and since then incidence trends have increased at a slower rate. One of the main risk factors for uterine cancer is the use of oestrogen-based hormone replacement therapy, 67 68 and since around 2000, use has substantially declined. 27 Around a third of uterine cancers in the UK are also attributed to overweight and obesity, but the increase in incidence is also likely to be caused by a decrease in the number of women undergoing hysterectomies for menorrhagia, in favour of endometrial ablation. 69

Other cancers that showed increases in incidence were cancers of the pancreas, brain, and central nervous system, together with Hodgkin and non-Hodgkin lymphoma, myeloma, and leukaemia in both sexes, and oesophageal and testicular cancers in men. With the exception of pancreatic cancer, which only decreased in women, all these cancers also showed a reduction in mortality in both sexes, indicating improving treatment or earlier detection, or both. Generally, the causes of these cancers are not well understood although obesity is associated with the adenocarcinoma histological subtype of oesophageal cancer, 70 especially in men, 7 while a combination of smoking and alcohol is implicated in the squamous cell carcinoma subtype. 71 The considerable male excess in oesophageal adenocarcinoma in comparison with squamous cell carcinoma rates, 72 possibly underlined by the higher incidence of gastroesophageal reflux disease in men 73 and the protective effect of oestrogen, 74 75 may explain the differing trends now observed between men and women.

Several cancer sites showed decreases in both incidence and mortality rates over the time period, notably stomach, larynx, and bladder cancer in both sexes, as well as cervical and ovarian cancers in women and mesothelioma in men. The changes in stomach cancer rates were of a similar magnitude and represented the largest percentage mortality decline in both sexes. This decline can probably be attributed to a combination of a reduction in the prevalence of Helicobacter pylori infection and an increase over time in fruit and vegetable consumption reducing the dependency on preserved foods. 76 77 Challenges in coding of stomach and oesophageal cancer before 2000 may also have had a role in shaping these trends. Laryngeal cancer is associated with tobacco use and alcohol consumption as well as occupational exposures, 56 78 79 and the decline in rates is most likely to be related to the decrease in smoking prevalence as well as decreases in occupational exposure. 80 The refinement of understanding pathology for bladder cancer during this period, in which previously diagnosed malignant disease is now categorised as benign, 81 is likely to have resulted in an artificial decline in incidence rates. 82 83 This artefact should not, however, have affected the decline in mortality rates given the benign nature of these tumours that do not cause death. 81 This decline in mortality, although not as marked as that for incidence, remained appreciable. The changes in cervical cancer rates, which showed the largest percentage mortality decline amongst gynaecological cancers, are almost certainly attributed to the success of the cytological screening programme during the whole of the time period considered. 84 85 With the introduction of the HPV vaccination programme for girls in 2008 86 and the subsequent expansion to boys in 2019, 87 rates of cervical cancer are expected to fall substantially over the following decades as the first cohort of vaccinated women reaches the peak age for cervical cancer incidence (aged 30-34 years). A reduction has already been shown for women aged 20-24. 88 The absolute incidence rates of mesothelioma in women were small in magnitude in 1993-95 (0.8 per 100 000 per annum) and remained similar over time (0.7 per 100 000 per annum in 2016-18). The incidence rates of mesothelioma in men were considerably greater, especially in 1993-95 (around 6.3 per 100 000 per annum), due largely to occupational asbestos exposure, 89 but a significant decrease was noted over time (to 3.6 per 100 000 per annum in 2016-18) resulting from both the decline in asbestos exposure and the decline in heavy industries, such as coal mining. Mortality decreased substantially in both sexes over the period for which data are available (2001-03 to 2016-18).

The conclusions that can be drawn from this analysis are, overall, positive and reassuring. Within the 35-69 year age group, cancer mortality rates have shown a substantial overall decline during the last quarter of a century in both men and women. The most probable causes are a combination of changes in the underlying risk of disease for some cancers (notably lung and stomach), in increased levels of early detection (notably breast 37 and cervix 90 ) and improved treatment (notably breast and bowel) for others. The specific circumstances leading to the increased incidence of breast cancer, of which risk factors are complex, need to be better understood and controlled. Similar results have been shown for incidence within Great Britain and mortality in the UK for some cancer sites. 91 Speculated overdiagnosis, where tumours are detected that would not have caused the patient any harm during their lifetimes, has been thought to increase rates for breast and prostate cancers in particular, of which prostate is especially affected by the widespread use of prostate specific antigen testing. 4 92 However, given the decreases in mortality across the wide set of cancer sites analysed here, improvements in early diagnosis, treatment, or both are having a positive effect for most cancer patients, although cancer mortality in this age group still needs reducing.

After accounting for the major two sites in men and women, the increase in overall incidence rates disappeared in men while it remained significant in women. This difference between sexes is due to a decrease in cancers with substantially higher initial incidence rates in men, such as lung, stomach, and bladder, resulting in a higher overall impact on male incidence, combined with an increase in incidence in uterine cancer, one of the most common cancers in women.

Strengths and limitations

This study benefits from high quality cancer registry data collected by all four cancer registries in each country across the UK, which allows for the inspection of a wide range of cancer sites over 25 years. ICD-10 coding changes have been minimal, only affecting trends in cancer incidence for bladder and ovarian cancers and cancer mortality for mesothelioma, whereas challenges in coding stomach and oesophageal cancer may have affected trends for these sites. Changes in registration practice may well have had a small effect on certain cancer sites. By focusing only on the 35-69 age range, we present a clear and reliable comparative picture of cancer incidence across 25 years within the UK, which provides a reliable indicator regarding future cancer incidence trends. Understanding cancer in older people and changes in the trends of different cancers is also of interest, but subject to a different study given the increasing life expectancy over this period, impact of comorbidities, and differing interaction with health services in this age group.

Limitations include the absence of staging data to substantiate any improvements in earlier diagnosis. Due to the number of sites analysed, we also have not broken down sites by histological type, which could be beneficial in certain sites to understand the trends within cancer sites—eg, small cell and non-small cell lung cancer or oestrogen receptor-positive and oestrogen receptor-negative breast cancer. In focusing on the age group selected, we are excluding older ages where rates of cancer are higher. Although this exclusion reduces the number of cases included, providing a smaller cohort for each year, the age group selected provides a more reliable comparator for future trends given the accuracy of incidence recording and also focuses on the cancers that lead to a larger number of years of life lost. The age range included in this study has been well defined; however, other studies are indicating potentially different trends worldwide in young adults with potential increases in risk factors such as dietary risk factors playing a role. 93 94 The data captured across the UK registries provides a basis for further understanding to see whether different trends are observed across younger age groups and whether the causes of this can be determined. Additionally, although we have included a broad range of cancer sites, cancers that have not been included in this study could well be showing different trends, such as a more recent increase in thyroid cancer in the UK. 95

This study also provides a baseline covering a 25 year period uninterrupted by covid-19. Trends in cancer incidence and mortality beyond these years will be affected and therefore understanding the causes of trends will be more complicated. Having a 25 year baseline provides the observed trend against which expected cases can be assessed. This benchmark will provide a comparison for the following decade, as the presentation, diagnosis, and treatment of cancer have been hugely affected by rules and regulations affecting the public and health service staff. Mortality trends will also be affected by decisions on the coding of deaths: covid-19 is likely to be recorded as the underlying cause of death for people with cancer if it, rather than their cancer, directly led to the patient dying.

This study focuses on the overall sex specific trends for cancer incidence and mortality in the specified age group to observe and understand trends over the 25 year period across the entire UK. Further breakdowns have not been possible. Paucity of numbers for less common cancers precluded separate analyses for the individual UK nations, while data limitations precluded analyses by other demographic characteristics, for example, ethnic group and deprivation. The main obstacle to analysing data by ethnic group is the completeness of recording in hospitals. In England, completeness improved substantially in 2012, but prior to this, the proportion of cases with unknown ethnic group renders results over time incomparable. In other UK countries, completeness of ethnic group recording is still not good enough to conduct country-wide cancer incidence or mortality analyses by ethnicity. For deprivation, the measures currently available are derived within each UK nation, and a specific validated UK-wide deprivation measure does not yet exist. Given the obvious importance of looking at variation in UK trends within ethnic groups and deprivation categories, such analyses represent a priority for further research and highlight the importance of data collection across all UK nations.

Conclusions

Overall, these results substantiate the view that, in this age group, there is no generalised increase in cancer incidence, while there is a substantial decrease in cancer mortality in the UK over the 25 year study period. Specific concerns were raised about individual cancer sites; numerically the most important, apart from the increases in breast and prostate cancer incidence, is the need to accelerate the decrease in female lung cancer. Beyond that, oral, liver, kidney, uterine, and melanoma skin cancers present the most pressing issues. Several cancer sites also showed decreases in both incidence and mortality, notably stomach, larynx, bladder, and cervical cancers.

What is already known on this topic

No recent studies have investigated cancer incidence and mortality rates over such a long time frame within the 35-69 year age group in the UK

Short term trends for specific cancer sites are related to known risk factors, screening programmes, and improved treatment

Trends in the 35-69 years age group can be indicative of future patterns of cancer in older people

What this study adds

Decreased rates of many cancers, including lung and laryngeal, are positive and likely to be driven by the decrease in smoking prevalence across the UK

An increase in rates of other cancer sites, including uterine and kidney, was noted, which may be a result of the increasing prevalence of overweight/obesity and other risk factors

Organised population screening programmes have led to an increase in cancer incidence but also appear to have contributed to a reduction in cancer mortality across the UK

Ethics statements

Ethical approval.

Ethics approval for this work was not required as the study used publicly available data.

Data availability statement

Data sharing may be possible for additional analyses. All code used for analyses in this paper is also available from the Cancer Research UK website and GitHub. Information on how to access the data used in this analysis is available from the Cancer Research UK website.

Acknowledgments

This work uses data that have been provided by patients and collected by the health services as part of their care and support. The data are collated, maintained, and quality assured by NHS England, Public Health Wales, Public Health Scotland, and the Northern Ireland Cancer Registry.

Contributors: All authors participated in study conception and design, and/or the analysis and interpretation of results. Conception and design: DF, LS, and CT. Analysis and interpretation: all authors. Writing manuscript: all authors. Supervision and guarantor: JS and DF. All authors critically reviewed drafts of the manuscript, read and approved the final manuscript. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.

Funding: This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/disclosure-of-interest/ declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.

The manuscript’s guarantor (DF) affirms that this manuscript is an honest, accurate, and transparent account of the study being reported, that no important aspects of the study have been omitted and that any discrepancies from the study as planned have been explained.

Dissemination to participants and related and public communities: study results will be disseminated to the public and health professionals by a press release written using layman’s terms; findings will also be shared through mass media communications and social media postings. A webinar produced alongside a patient advocacy group is also planned to accompany the publication of this study, a recording of which will be made available on the Cancer Research UK website. Since the study analyses cancer registry data collected during routine care, and provided in aggregated form, we are unable to specifically disseminate results to study participants beyond the usual channels of publication.

Provenance and peer review: Not commissioned; externally peer reviewed.

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/ .

  • Queen’s University Belfast, Northern Ireland Cancer Registry
  • van Seijen M ,
  • Thompson AM ,
  • PRECISION team
  • Independent UK Panel on Breast Cancer Screening
  • Kitchener H ,
  • Cancer Research UK. Screening for cancer. https://www.cancerresearchuk.org/about-cancer/screening
  • Office for National Statistics
  • Public Health Wales
  • Public Health Scotland
  • NHS Digital
  • NI Direct Government Services
  • United Kingdom and Ireland Association of Cancer Registries
  • UK Health Security Agency
  • National Cancer Intelligence Network
  • Benhamou E ,
  • Laversanne M ,
  • Gardner W ,
  • Mulvey EP ,
  • Hankey BF ,
  • Kosary CL ,
  • R Core Team
  • Pashayan N ,
  • Pharoah P ,
  • Martin RM ,
  • Donovan JL ,
  • Turner EL ,
  • CAP Trial Group
  • Neuberger MM ,
  • Djulbegovic M ,
  • Johnson A ,
  • Bromley SE ,
  • de Vries CS ,
  • Burkard T ,
  • Alsugeir D ,
  • Adesuyan M ,
  • Vinogradova Y ,
  • Coupland C ,
  • Hippisley-Cox J
  • Collaborative Group on Hormonal Factors in Breast Cancer
  • Early Breast Cancer Trialists’ Collaborative Group (EBCTCG)
  • Coleman DA ,
  • Swerdlow AJ ,
  • Youlden DR ,
  • Cancer Research UK
  • International Agency for Research on Cancer, World Health Organization. Tobacco Smoke and Involuntary Smoking. 2004. IARC Monographs on the Evaluation of Carcinogenic Risks to Humans; vol. 83. https://publications.iarc.fr/Book-And-Report-Series/Iarc-Monographs-On-The-Identification-Of-Carcinogenic-Hazards-To-Humans/Tobacco-Smoke-And-Involuntary-Smoking-2004
  • McClements PL ,
  • Madurasinghe V ,
  • Thomson CS ,
  • Ait Ouakrim D ,
  • Niikura R ,
  • Brenner H ,
  • Schrotz-King P ,
  • Holleczek B ,
  • Katalinic A ,
  • Hoffmeister M
  • Lieberman DA ,
  • McFarland B ,
  • American Cancer Society Colorectal Cancer Advisory Group ,
  • US Multi-Society Task Force ,
  • American College of Radiology Colon Cancer Committee
  • Silcocks P ,
  • Whitley E ,
  • Malvezzi M ,
  • Bertuccio P ,
  • Lortet-Tieulent J ,
  • Renteria E ,
  • UK National Screening Committee
  • Royal College of Physicians
  • McLaughlin JK ,
  • Hashibe M ,
  • Brennan P ,
  • Chuang SC ,
  • Armstrong GL ,
  • Farrington LA ,
  • Hutin YJF ,
  • Miyamura T ,
  • Gilchrest BA ,
  • Geller AC ,
  • Berwick M ,
  • van der Hel OL ,
  • McMillan GP ,
  • Boffetta P ,
  • Zeegers MPA ,
  • van Den Brandt PA
  • Driver RJ ,
  • HCC-UK/BASL/NCRAS Partnership Steering Group
  • Doherty VR ,
  • Brewster DH ,
  • UK government
  • Gebretsadik T ,
  • Kerlikowske K ,
  • Ernster V ,
  • Million Women Study Collaborators
  • Mukhopadhaya N ,
  • Manyonda IT
  • Lagergren J
  • Rubenstein JH
  • Czanner G ,
  • Chandanos E ,
  • Roberts SE ,
  • Morrison-Rees S ,
  • Samuel DG ,
  • Williams JG
  • Rothenbacher D ,
  • Talamini R ,
  • Bosetti C ,
  • La Vecchia C ,
  • Menvielle G ,
  • Goldberg P ,
  • National Institute for Health and Care Excellence
  • Vaccarella S ,
  • Plummer M ,
  • Franceschi S ,
  • University of Oxford
  • Castanon A ,
  • Windridge P ,
  • Hodgson JT ,
  • Matthews FE ,
  • Castañón A ,
  • O’Sullivan JW ,
  • Nicholson BD
  • Bjurlin MA ,
  • Nicholson J ,
  • Sasamoto N ,


ORIGINAL RESEARCH article

Sex-related disparities in vehicle crash injury and hemodynamics.

Susan Cronn,

  • 1 Comprehensive Injury Center, Medical College of Wisconsin, Milwaukee, WI, United States
  • 2 Division of Trauma and Acute Care Surgery, Department of Surgery, Medical College of Wisconsin, Milwaukee, WI, United States
  • 3 Neurosurgery Department, Medical College of Wisconsin, Milwaukee, WI, United States
  • 4 Joint Department of Biomedical Engineering, Marquette University/Medical College of Wisconsin, Milwaukee, WI, United States
  • 5 VA Medical Center-Research, Milwaukee, WI, United States

Objective: Multiple studies evaluate relative risk of female vs. male crash injury; clinical data may offer a more direct injury-specific evaluation of sex disparity in vehicle safety. This study sought to evaluate trauma injury patterns in a large trauma database to identify sex-related differences in crash injury victims.

Methods: Data on lap and shoulder belt wearing patients aged 16 and older with abdominal and pelvic injuries from 2018 to 2021 were extracted from the National Trauma Data Bank for descriptive analysis using injuries, vital signs, International Classification of Diseases (ICD) coding, age, and injury severity measured with the AIS (Abbreviated Injury Scale) and ISS (Injury Severity Score). Multiple linear regression was used to assess the relationship of shock index (SI) with ISS, sex, age, and a sex * age interaction. Regression analysis was performed on multiple injury regions to assess patient characteristics related to increased shock index.
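A minimal sketch of the kind of regression described above is given below, assuming a patient-level table with heart rate, systolic blood pressure, ISS, sex, and age. Shock index is computed with its standard clinical definition (heart rate divided by systolic blood pressure); the column names and demonstration rows are hypothetical and are not National Trauma Data Bank records.

```python
# Sketch: shock index (SI) regressed on ISS, sex, age, and a sex-by-age interaction.
import pandas as pd
import statsmodels.formula.api as smf

def fit_si_model(df: pd.DataFrame):
    """df needs columns: heart_rate, sbp, iss, sex ('F'/'M'), age."""
    # Standard clinical definition of shock index: heart rate / systolic blood pressure.
    df = df.assign(shock_index=df["heart_rate"] / df["sbp"])
    # Multiple linear regression with a sex * age interaction, as in the methods above.
    return smf.ols("shock_index ~ iss + C(sex) * age", data=df).fit()

# Usage with a few synthetic rows (illustration only):
demo = pd.DataFrame({
    "heart_rate": [110, 95, 128, 88, 102, 115],
    "sbp":        [100, 118, 105, 125, 110, 98],
    "iss":        [17, 9, 25, 5, 13, 22],
    "sex":        ["F", "M", "F", "M", "F", "M"],
    "age":        [34, 52, 47, 61, 29, 38],
})
print(fit_si_model(demo).params)
```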

Results: Sex, age, and ISS are strongly related to shock index for most injury regions. Women had greater overall SI than men, even in less severe injuries; women had greater numbers of pelvis and liver injuries across severity categories; men had greater numbers of injury in other abdominal/pelvis injury regions.

Conclusions: Female crash injury victims' tendency for higher (AIS) severity of pelvis and liver injuries may relate to how their bodies interact with safety equipment. Females are entering shock states (SI > 1.0) with lesser injury severity (ISS) than male crash injury victims, which may suggest that female crash patients are somehow more susceptible to compromised hemodynamics than males. These findings indicate an urgent need to conduct vehicle crash injury research within a sex-equity framework; evaluating sex-related clinical data may hold the key to reducing disparities in vehicle crash injury.

Introduction

The global burden of traffic injury has been reduced by innovations in vehicle safety design, but not all demographics have benefitted equally from this protection. Women (we will use this term to discuss our biological sex related study, with the understanding of and respect for the range of gender expression that does not correlate with biological sex) may be more vulnerable to risk of certain types of injury in vehicle crashes, yet safety features are largely based on testing with male-representative dummies. Atwood et al. ( 1 ) demonstrated greater relative fatality risk (on average, 2.9% higher fatality risk for female front row occupants vs. male) for females in vehicles with the newest generation of safety equipment, but the differences between male and female occupants' fatality risk fluctuate across age groups. Stigson et al. ( 2 ) report a greater risk of permanent medical impairment (PMI) in females compared to males, and countermeasures designed to mitigate this risk (specific to “neck” region injury, which is the region most associated with PMI) were not equally effective in men and women ( 3 ). Nutbeam et al. ( 4 ) found that female patients were more frequently entrapped after crash, and that entrapped male and female patients had differing injury patterns. As such, we cannot be certain that current vehicle safety standards in testing, equipment, and crashworthiness accurately reflect how women experience vehicle crashes; the current body of literature suggests that current methods in crash testing may not sufficiently account for male/female body differences.

Much of the literature evaluating sex differences in crash injury discusses relative risk and crash/occupant characteristics, with significant effort toward comparing crashworthiness and crash severity and adjusting for confounding factors that can affect the estimated impact of sex on crash injury. In a recent IRCOBI conference, Brumbelow (5) asserts that “It is important to identify how non-physiological risk factors may affect injury risk estimates for females and males in order to encourage the most robust and effective countermeasures.” Brumbelow (5) also considers that investigating differences in vehicles and crashes between men and women may reveal how these factors confound estimates of relative and fatality risk.

Acknowledging the difficulty in fully accounting for confounding factors is important in an accurate assessment of how men and women are injured in vehicle crashes. Clear representation of the problem of sex-related injury disparity is critical in prioritizing research and design and in allocating funding. The level of complexity of the issue, however, makes this clear representation challenging; true matched-pairs comparisons are nearly impossible to achieve. We posit that by evaluating sex-related injury patterns using clinical data, we will demonstrate where and how male/female differences exist as real patient outcomes, regardless of how previous literature has estimated and quantified sex-related risk.

Atwood et al.'s (1) recent evaluation of the Fatality Analysis Reporting System (FARS) found that recent model year vehicles (2010–2020) with optimal occupant protection systems have reduced estimated female fatality risk relative to males to 5.8%; though this is an improvement, it still indicates disparity between the sexes. Liu and Subramian (6) estimate a female occupant's odds of severe injury as 1.25 times those of a male occupant. Males are more likely to engage in risky behaviors such as speeding and driving while intoxicated, increasing their overall likelihood of crash, death, and serious injury, but even after controlling for these factors, women are significantly more likely to suffer serious injuries due to vehicle crashes (7). Though some studies attribute differences in injury and fatality risk between sexes to driving patterns, behavior, and vehicle size, attempts to control for these factors in describing relative risk have not included physical stature, body mass, or other physiologic differences associated with sex (7).

A recent study by Brumbelow and Jermakian (7) discusses differences in injury severity between side and front crashes as well as differences in extremity injuries, and concludes that current vehicle safety testing has reduced injury risk for both sexes, perhaps more so for female occupants. However, despite careful controlling for as many crash severity factors as possible, they posit that there are multiple sex-related properties as yet unknown, unmeasured, or unaccounted for within retrospective crash data analyses, which may or may not be able to identify female vulnerability to (lower extremity) injury (7). Craig et al. (8) combined multiple crash-related databases to account for a broad range of crash types, crash variables, and occupant characteristics in an analysis of sex-based odds differences in crash outcomes. That study demonstrated the complexity of the issue and concluded that:

“increased or decreased odds of injury for females vs. males is dependent on the type of injury and associated severity, the associated crash type, and other relevant independent variables significantly associated with the respective injury outcomes” (8).

Further, they found that in multiple models, female and male occupants were approximately equal in the number of cases where each held the higher odds of injury (8). However, limitations of the study did not allow for some elements of analysis that may be relevant in comparisons of sex-related differences in crash injury outcomes, such as delta V, post-crash factors, or occupant characteristics like BMI, behavior, or vehicle selection (8).

To ultimately make cars that are safe for all bodies, we must isolate the risk factors which are truly due to male/female physiological differences and evaluate which elements need to be represented in crash testing. Though vehicle safety has improved overall in the past decades, this improved protection may not apply to all occupants equally. Abrams and Bass (9) posit that “there may be unobserved trends in the injury patterns, and therefore outcomes, between male and female occupants.” The objective of this study is to evaluate the disparities in abdominal and pelvic male/female injury patterns through a clinical lens; this novel approach evaluating the NTDB allows for analysis of injury patterns in trauma patients after vehicle crash. By reviewing patient injury data, this study demonstrates how injuries correlate with sex, how male and female patients are affected by these injuries, and how injury patterns demonstrate the clinical picture of known sex-related disparities in crash injury.

Materials and methods

Data source

The National Trauma Data Bank (NTDB) is the largest aggregation of trauma data in the USA (10). It is maintained by the American College of Surgeons (ACS) for the purposes of injury surveillance, hospital benchmarking, research, and quality improvement (10). The NTDB includes extensive patient- and injury-related information from pre-hospital to discharge disposition, entered by trained data registrars using established data definitions and standards. Inclusion in the NTDB is based on clinical coding for traumatic injuries. The data are audited as part of the ACS trauma center verification program, which ensures data integrity and quality (10).

Study design

Data from the NTDB from 2018 to 2021 for patients 16 years and older were considered for analysis. This timeframe was chosen because it represents the most recently available data and includes modern auto safety features available in newer vehicles. Only patients in vehicle crashes who were wearing lap and shoulder belts were extracted, defined using International Classification of Diseases (ICD) external cause codes (N = 125,642) (11). The initial query included age, primary external cause codes specific to traffic-related vehicle crash (V43.5, V43.6, V44.5, V44.6, V47.5, V47.6, V53.5, V53.6, V54.5, V54.6, V57.5, V57.6), and Abbreviated Injury Scale (AIS) (12) injury diagnosis codes related to abdomen and pelvis injuries (i.e., codes beginning with 54 and 856). Injuries were grouped into 9 generalized regions (i.e., kidney, large intestine, liver, pancreas, pelvis bony, pelvis organ, small intestine, spleen, and stomach) and assessed by sex, age group, and shock index (SI). Further variables included sex, age, initial hospital systolic blood pressure, initial hospital heart rate, Injury Severity Score (ISS), and AIS score (both designating severity of injury). AIS-2005 standards were used as those were consistently provided across all years accessed in the NTDB.
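As a minimal sketch of this extraction step (illustrative only, not the authors' code; the raw table name `ntdb_raw` and its column names `age`, `primary_ecode`, and `restraint` are assumptions rather than NTDB data dictionary fields), the cohort filter could be written as:

```r
# Illustrative cohort filter; table and column names are assumptions.
crash_ecodes <- c("V43.5", "V43.6", "V44.5", "V44.6", "V47.5", "V47.6",
                  "V53.5", "V53.6", "V54.5", "V54.6", "V57.5", "V57.6")

cohort <- subset(
  ntdb_raw,
  age >= 16 &
    primary_ecode %in% crash_ecodes &
    restraint == "lap and shoulder belt"
)
nrow(cohort)   # corresponds to N = 125,642 in the study's query
```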

A waiver from the Institutional Review Board at the Medical College of Wisconsin was obtained for this study.

Statistical analysis

We report a descriptive analysis of patient characteristics. Frequencies and proportions of injury by region are described relative to total injuries and relative to total injuries by sex. Sex differences in injury by region were compared using a Chi-square test on the proportions of total injury within sex. For patients who sustained multiple injuries in the same region, the injury with the highest AIS severity was retained for analysis and the less severe injuries in that region were excluded. Multiple linear regression was used to assess the relationship of SI with ISS, sex, age, and the sex * age interaction. Coefficient estimates are reported along with 95% confidence intervals for each term in each model. ISS was used in lieu of AIS scores to account for overall injury severity (rather than just severity within the respective body region examined). ISS is intended to be an objective anatomical scoring system that quantifies injury severity by summing the squares of the AIS scores for the 3 most severely injured body regions. ISS ranges from 0 to 75, with scores of 0–9 indicating mild severity, 9–15 moderate, 16–24 severe, and over 25 profound injury (12). An ISS greater than 15 is usually considered major trauma (13).
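As a minimal worked example (hypothetical AIS values, not drawn from the study data): a patient with an AIS 4 bony pelvis injury, an AIS 3 liver injury, and an AIS 2 chest injury, each in a different ISS body region, would score

\[ \mathrm{ISS} = 4^{2} + 3^{2} + 2^{2} = 16 + 9 + 4 = 29, \]

placing that patient in the profound-injury band and well above the major-trauma threshold of 15.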

Where data were available, the shock index (SI), calculated as the ratio of heart rate (HR) to systolic blood pressure (SBP), was derived (N = 122,557; 97.5% of the total sample) (14). Nine separate regressions were completed (one per injury region) to compare how patient characteristics related to shock index may differ depending on the injury sustained, and a Bonferroni correction was applied to adjust for multiple comparisons. All analyses were completed in R (version 4.3.0).
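The following is a minimal sketch of this analysis pipeline, not the authors' script; the data frame `ntdb` and its column names (`hr`, `sbp`, `iss`, `sex`, `age`, `region`) are assumptions made for illustration, with each row assumed to be one retained injury (highest AIS per region per patient):

```r
# Illustrative sketch only; data frame and column names are assumptions.
ntdb$si <- ntdb$hr / ntdb$sbp                 # shock index = HR / systolic BP

# Sex differences in the distribution of injuries across the nine regions
chisq.test(table(ntdb$region, ntdb$sex))

# One multiple linear regression per injury region: SI ~ ISS + sex + age + sex:age
regions <- unique(ntdb$region)
models <- lapply(regions, function(r) {
  lm(si ~ iss + sex * age, data = ntdb[ntdb$region == r, ])
})
names(models) <- regions

summary(models[["liver"]])                    # coefficient estimates for one region
confint(models[["liver"]], level = 0.95)      # 95% confidence intervals

# Bonferroni-adjusted significance threshold across the nine regional models
alpha_adj <- 0.05 / length(regions)

# Sensitivity analysis: refit restricted to severe injury (ISS > 15)
models_severe <- lapply(regions, function(r) {
  lm(si ~ iss + sex * age, data = ntdb[ntdb$region == r & ntdb$iss > 15, ])
})
```

Fitting one model per region keeps the SI relationships interpretable within each injury type, at the cost of the multiple-comparison penalty that the Bonferroni adjustment addresses.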

Results

Demographic characteristics of crash injury patients

From 2018 to 2021, 56,839 patients sustained at least one abdominal or pelvis injury from a traffic-related vehicle crash and also had complete demographic and injury-related information available in the database (45.2% of total vehicle crash patients). Of these patients, 28,292 (49.7%) were men and 28,547 (50.2%) were women. The age distribution was as follows: 16–25 years (24.7%), 26–34 years (18.1%), 35–44 years (14.1%), 45–54 years (11.9%), 55–64 years (12.5%), 65–74 years (9.96%), 75–84 years (6.9%), and above 85 years (1.8%). Within age categories, male and female representation was as follows: male 51.4% and female 48.5% of age 16–25 years; male 53.5% and female 46.4% of age 26–34 years; male 52.6% and female 47.3% of age 35–44 years; male 50.2% and female 49.7% of age 45–54 years; male 47.0% and female 52.9% of age 55–64 years; male 42.9% and female 57.0% of age 65–74 years; male 40.6% and female 59.3% of age 75–84 years; and male 43.7% and female 56.2% of age 85 and greater.

Injury patterns

Of the 56,839 patients, there were 81,459 total abdominal and pelvis injuries ( Table 1 ). Patients sustained on average 1.43 injuries (abdomen and pelvis only, range = 1–8 injuries per patient, median = 1). Of the total number of injuries sustained, 36.9% were pelvis (bony), 16.3% spleen, 13.8% liver, 11.3% small intestine, 7.6% kidney, 5.58% pancreas, 5.35% large intestine, 3% pelvis (organ), and < 1% stomach.


Table 1. Patient characteristics of total abdominal and pelvis injuries (N = 81,459 injuries).

For female patients, pelvis (bony) injuries were most frequent (38.1%), followed by spleen (15.8%), liver (15.0%), small intestine (10.5%), kidney (7.14%), pancreas (5.45%), large intestine (4.76%), pelvis (organ; 2.9%), and stomach (< 1%). For male patients, injury frequencies were ranked the same as for female patients (Figure 1), with the most frequent being pelvis (bony; 35.6%) followed by spleen (16.8%), liver (12.5%), small intestine (12.0%), kidney (8.0%), large intestine (5.94%), pancreas (5.71%), pelvis (organ; 3.1%), and stomach (< 1%). There were no significant sex differences in proportions of injury by region [χ²(64) = 72.0, p = 0.23].


Figure 1. Proportions calculated relative to total injuries within each sex and then by injury region. Females sustained numerically more pelvis (bony), liver, and stomach injuries compared to men. There were no significant sex differences in rates of injury by region [χ²(64) = 72.0, p = 0.23].

For all injury regions in both male and female patients, except pelvis (organ), AIS 2 injuries were the most common by count (Figure 2). For pelvis (organ) injuries, AIS 3 injury was most common. For injury regions where females sustained greater numbers of injuries than males [i.e., liver, pelvis (bony), and stomach], they did so across AIS severity levels 3 and 4. A similar pattern (across higher AIS severity levels) held for injury regions where males sustained more injuries than females [i.e., kidney, large intestine, pancreas, pelvis (organ), small intestine, and spleen].


Figure 2. Proportions calculated within injury regions for each sex by AIS severity.

Results of the multiple linear regression models examining SI for each injury region are reported in Table 2. For all injury regions, higher shock index was associated with greater injury severity and younger age (Table 2). When one sex had greater numbers of injuries in a region, this was consistent across all levels of severity. Females across all injury regions, except stomach, had significantly higher shock indices than males at lower ISS scores; as ISS increased, sex differences largely dissipated (Figure 3). Except for kidney, pelvis (organ), spleen, and stomach, there was a significant sex * age interaction for all other injury regions. Together these results suggest sex, age, and ISS are strongly related to shock index for most injury regions. Of note, in the full sample, age and ISS were significantly but very weakly correlated (r² = 0.0002, p < 0.0001); in females this pattern held (r² = 0.0009, p < 0.0001), but in males there was no significant relationship (r² = 0.000001, p = 0.38). This suggests age differences in ISS do not diminish sex differences. Further, we conducted a sensitivity analysis whereby we repeated the reported regressions for only those with ISS > 15, which indicates severe injury (n = 45,839; 49.5% female). The only differences were that sex was no longer significant for pelvis (organ) injuries, and that the sex * age interaction term was no longer significant for liver or large intestine injuries. Therefore, results did not substantially change, as there were still significant age and sex differences across most injury regions despite examining only severe injury.


Table 2. Relationships of injury type with shock index via multiple linear regression.


Figure 3. Shock index relative to injury severity scores (ISS) by sex for each injury region. Dashed horizontal line at 1.0 indicates critical clinical status where the heart rate value has exceeded the systolic blood pressure value. Shaded bands depict 95% confidence intervals, with the respective linear regression equation shown in each panel for each group.

Discussion

In the 56,839 patients meeting study criteria, women and men were represented fairly evenly in the overall proportion of crashes. For all patients, the 16–25 age group accounted for 23.2% of the total and the next largest group, 26–34, for 16.9%. These two groups combined made up 40.1%, with age groups 35–44, 45–54, 55–64, and 65–74 represented nearly equally at 13.6, 12.4, 13.7, and 11.7%, respectively. There was a sharp drop-off in crash numbers from age 75 and above, with those groups comprising only 9.1% of the total number of crash patients. Within age groups, men were more highly represented from ages 16–44, but this equalized at ages 45–54 and then reversed, with women making up the greater number of crash victims above age 55 (increasing proportions with each jump in age group). Injuries (by total count) were similarly evenly distributed across male/female patients, with some exceptions: women had a greater number of liver injuries in all AIS categories 2–5, a greater number of bony pelvis injuries in all AIS categories 2–5, a greater number of pelvis organ injuries in AIS category 5, and a greater number of stomach injuries in AIS categories 3 and 4.

A critical piece of information uncovered by this group is the presence of elevated shock index in female crash victims at rates greater than in male crash victims. Women crossed the shock threshold (SI > 1.0) at lower ISS scores in all injury regions. Women also crossed the shock threshold with fewer total injuries than men. However, the difference in SI converged as the number of total injuries increased. This difference was most pronounced in patients under the age of 30 and in patients with pelvis injury.

Shock index is a useful tool in the rapid assessment of a trauma patient (14). A quick look at HR and SBP and a simple calculation (e.g., is the ratio HR:SBP > 1?) can orient the clinician to the hemodynamic status of the patient, even in an austere environment with minimal equipment. Since hemorrhagic shock is a leading cause of death during initial trauma intervention, early recognition of shock is key to timely treatment (14). Because patients can appear “normal” despite significant hemorrhagic loss due to physiologic compensation, an SI > 1.0 can provide an early alert for those patients likely to need mass transfusion, ICU admission, or other interventions to prevent morbidity and mortality (14).
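As a minimal worked example with hypothetical vital signs (not taken from the study data), a patient arriving with HR 112 beats/min and SBP 98 mmHg has

\[ \mathrm{SI} = \frac{\mathrm{HR}}{\mathrm{SBP}} = \frac{112}{98} \approx 1.14 > 1.0, \]

which would flag possible hemorrhagic shock even though neither value on its own looks dramatically abnormal.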

The normal ratio of heart rate to systolic blood pressure ranges between 0.5 and 0.7, with some sources accepting values below 0.9 as within the range of normal. SI is a better predictor of shock than HR and SBP separately, and since not all blood loss is visible, it is critical to identify hemorrhagic shock quickly (15). With hypovolemia (a lower than normal amount of blood and fluid in the body's circulatory system) caused by blood loss due to injury, the initial physiological response in trauma patients is an increased heart rate, which compensates for the reduction in stroke volume (how much blood the heart pumps out to the body with each heartbeat). Heart rate (beats per minute) multiplied by stroke volume (mL) equals cardiac output, which is expressed in liters per minute. This cardiac output is what supplies the body's tissues with oxygen, and perfusion of these tissues with oxygen is the normal state of a healthy patient (16).
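To illustrate this compensation with hypothetical round numbers (not study data):

\[ \mathrm{CO} = \mathrm{HR} \times \mathrm{SV}: \quad 70\ \mathrm{beats/min} \times 70\ \mathrm{mL} \approx 4.9\ \mathrm{L/min}; \qquad 110\ \mathrm{beats/min} \times 45\ \mathrm{mL} \approx 5.0\ \mathrm{L/min}. \]

A bleeding patient whose stroke volume has fallen can therefore maintain near-normal cardiac output, and outwardly near-normal vital signs, largely by raising heart rate, which is why the rising HR:SBP ratio is an earlier signal than blood pressure alone.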

As a traumatically injured patient's cardiac output declines due to blood loss, the body's tissues receive less oxygen than needed. This mismatch leads to multiple other compensatory mechanisms (peripheral vasoconstriction, anaerobic respiration, diversion of blood from non-critical organs to the heart and brain) (16). A patient in early or class I shock (blood loss up to 15% of circulating blood volume) may demonstrate very few clinical signs of trouble; an elevation in HR is usually the first clue (17). As the HR rises, the SI will increase; this will continue and become more pronounced as systolic blood pressure begins to drop (which first happens consistently at 30% or greater blood volume loss) (17). It is dangerous to wait until a patient falls into a precise category of shock before taking action (hemorrhage control, transfusion, operative intervention); hemorrhagic shock is a clinical emergency which requires immediate treatment as soon as it is detected (17).

Higher SI was associated with worse injuries and with youth; the ability to compensate physiologically by increasing heart rate is stronger in younger patients. In addition, many patients above the age of 65 use medications inhibiting their ability to increase HR regardless of need (i.e., beta blockers), a variable not available in this database. Of note, even with lower injury severity scores, females had significantly higher shock indices regardless of injury region, except for stomach. This indicates an inaccuracy in widely held vital-sign standards due to inattention to sex differences, a greater physiologic response to trauma in females, or another factor as yet unknown.

If female crash injury victims are entering shock states with lesser injury severity than male crash injury victims, this may have serious clinical implications. It suggests that female patients are somehow more susceptible to compromised hemodynamics and elevated SI after injury, which indicates a higher likelihood of transfusion, ICU stay, and mortality (14). In trauma care, clinicians consider the mechanism of injury and injury severity as context for expected patient hemodynamic status; if sex is not considered in this calculus, clinicians could be delaying identification of shock in female patients. No current trauma assessment uses sex as a data element or directive, except in injuries of pregnant people (which focuses on pregnant physiology and complications, not sex specifically). It is possible that clinical practice may need to adapt to sex-related hemodynamic differences, which will become clearer with further research.

The implications of evidence showing greater physiologic distress in women than in men with equivalent or lesser injuries from vehicle crash are potentially meaningful across a number of domains. From a clinician standpoint, increasing use of shock index as a tool to assess impending worsening of clinical status may result in earlier identification of those individuals in need of intervention, or may signal a need to EMS personnel for greater haste, a higher level of care, or preparation of a trauma center for their arrival (16). Evaluating crash injury with the knowledge that a female patient may have a higher likelihood of shock could allow trauma clinicians to risk stratify and to recognize and treat shock more aggressively, potentially resulting in fewer complications. As we develop our understanding of the relationship between sex and shock index, it is possible that clinicians caring for patients in the ICU or inpatient unit may need to consider the impact of sex on hemodynamics in their treatment plans.

From a vehicle safety and design standpoint, the implications of lesser physiologic reserve or lower resilience in female occupants, greater vulnerability to injury, and increased likelihood of elevated shock index in certain injuries or injury patterns may require a re-evaluation of current practice in how safety and design are conceived and developed. Aside from the clear imperative to move forward in developing and using an average-sized female dummy, current standards in vehicle safety must include review of the impact of current equipment on those injuries for which females have increased risk of shock. Transparent, equity-focused research and design will require a commitment to eliminating disparities in crash injury from both manufacturing and legislation, areas which have, until recently, allowed these disparities to remain unaddressed.

Prior research and knowledge

Though research has begun to evaluate the differences in crash injury between male and female occupants, the majority of the work arises from the engineering field. This cross-disciplinary group sought to combine engineering and clinical expertise for a new perspective on how to approach reducing sex-related disparities in crash injury. Including a clinical standpoint integrates novel elements into existing research strategies, driving both disciplines to expand upon and amplify their understanding of the problem.

Prior research discusses the statistics related to differences in male/female crash injuries, but it does not explain why these differences occur, examine their impact or significance, or attempt to define the clinical relevance of sex-related crash injury disparities. By combining clinical and engineering-related data, contextualizing specific abdominal and pelvic injury patterns (chosen for their potential relationship with seat belts) could help narrow this gap in understanding. The unexpected finding of sex-related disparity in shock index lends considerable weight to the need for a convening of expertise directed toward the problem of sex-related crash injury differences, with the goal of parsing which elements of crash dynamics, human physiology, and current vehicle safety equipment are interacting to create these disparities. Further implications of investigating the clinical perspective of vehicle crash injury include discovery of how other differences in body types may contribute to unequal protection by vehicle safety equipment. There is a dearth of literature describing how height, weight, weight for height (BMI), age, and disability can be represented in current crash testing practices.

To understand possible root causes of disparity in crash injury, it is essential to briefly review the structure of legislation surrounding vehicle safety. Vehicle crash testing requires only two variations of adult-representative dummies: an average-sized (50th percentile) male and a small (5th percentile) female (18). Furthermore, the female dummy is a scaled-down version of the male dummy, which means it does not account for differences in body composition, mass distribution, or muscle/bone mass, density, and strength (19). There is no requirement that vehicles be tested using an average-sized female dummy, nor is there a requirement for sex differentiation in dummy design and construction, nor are there sex-differentiated injury criteria for use in testing.

Limitations

The initial frequency analysis used to begin the process of sorting through a large dataset does not reflect the full picture of individual injury patterns. This study focused on abdominal and pelvis trauma, but there are multiple other injuries that will need consideration in the context of shock index. Trauma center participation in the NTDB is voluntary, and therefore these data do not constitute registry information from all trauma centers in the U.S. Patients who died prior to emergency room presentation, as well as those who did not seek or receive care from a trauma center, would also not be captured in this database. Data quality for study variables is limited by the NTDB data standard. The NTDB does not document use of beta blockers (or other medications and conditions affecting heart rate and blood pressure), which almost certainly affected the analysis of shock indices; beta blockers, for instance, would likely skew the mean SI downward in higher age groups. We were also not able to use the modified shock index, because diastolic blood pressure was unavailable to calculate mean arterial pressure. We were unable to account for any contextual information regarding the vehicle crash that may affect injury patterns, such as the direction, speed, or force of the crash and the position of the patient in the vehicle relative to impact.

Conclusion/interpretation of findings

• Women and men appear to have some differences in crash-related abdomen and pelvis injury, both in actual injury pattern and in their physiologic response to the trauma.

• Though many of these differences are attenuated in different age groups, the finding that women have greater risk of shock across multiple injury types, severities, and ages indicates that even in comparable situations, women may be more vulnerable to the injuries they experience.

• Injuries to bony pelvis, pelvic organs, liver, and stomach were more frequent in women than men, which may indicate a starting point for safety equipment evaluation.

With greater numbers of pelvis (bony, AIS 2–5), pelvis (organ, AIS 5), liver (AIS 2–5), and stomach (AIS 3–4) injuries in women, it is possible that some element of anatomical difference between male and female bodies is interacting with safety equipment in a way that increases these injuries. Since current vehicle safety equipment is designed for a standard male figure, the question of whether a female occupant may somehow be under-protected, either because of equipment fit (e.g., seat belt positioning) or because female occupants make out-of-standard adjustments to accommodate their size, proportions, or weight distribution (e.g., distance to the steering wheel), needs further investigation. Female drivers are often considered “out-of-position,” but this designation of women as non-standard is the primary root of multiple inequities in research and design of daily-use, safety-related, or health-affecting equipment. The absence of female bodies as their own standard clearly has consequences, which, in the case of vehicle crashes, can be serious and life-altering.

By elucidating the differences between male/female injury patterns and connecting them to male-preferential safety equipment, we may clear a path for research to pursue any number of vehicle occupant variations in injury and vehicle design; whether this will require improved dummy technology, computer modeling, or a combination of the two remains to be seen. Findings from the clinical approach described in this study can be used to address sex-based discrepancies in a critical area of injury-related public health, and can be used to prioritize future directions for sex-related crash injury research.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.

Ethics statement

The studies involving humans were approved by IRB Medical College of Wisconsin. The studies were conducted in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required from the participants or the participants' legal guardians/next of kin in accordance with the national legislation and institutional requirements.

Author contributions

SC: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Writing—original draft, Resources, Supervision, Writing—review & editing. KS: Formal analysis, Validation, Writing—review & editing. KD: Data curation, Formal analysis, Methodology, Writing—review & editing. CT: Data curation, Formal analysis, Methodology, Validation, Writing—original draft, Writing—review & editing. FP: Conceptualization, Formal analysis, Methodology, Supervision, Validation, Writing—review & editing.

Funding

The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This project was supported by the CTSI Team Science-Guided Integrated Clinical and Research Ensemble, National Center for Advancing Translational Sciences, National Institutes of Health, Award Number 2UL1 TR001436.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. Atwood J, Noh EY, Craig MJ. Female crash fatality risk relative to males for similar physical impacts. Traffic Inj Prev. (2023) 24(Suppl 1):S1–8. doi: 10.1080/15389588.2023.2177845

2. Stigson H, Boström M, Kullgren A. Health status and quality of life among road users with permanent medical impairment several years after the crash. Traffic Inj Prev. (2020) 21(Suppl 1):S43–8. doi: 10.1080/15389588.2020.1817416

3. Kullgren A, Stigson H, Krafft M. Development of whiplash associated disorders for male and female car occupants in cars launched since the 80s in different impact directions. In: IRCOBI Conference Proceedings. (2013). Available online at: https://research.chalmers.se/publication/196568

4. Nutbeam T, Weekes L, Heidari S, Fenwick R, Bouamra O, Smith J, et al. Sex-disaggregated analysis of the injury patterns, outcome data and trapped status of major trauma patients injured in motor vehicle collisions: a prespecified analysis of the UK trauma registry (TARN). BMJ Open. (2022) 12:e061076. doi: 10.1136/bmjopen-2022-061076

5. Brumbelow M. Sex-related vehicle and crash differences and their potential to confound relative injury risk analyses. In: IRCOBI Conference. Cambridge, UK (2023).

6. Liu C, Subramian R. National Center for Statistics and Analysis. Washington, DC: US Department of Transportation, National Highway Traffic Safety Administration (2020).

7. Brumbelow M, Jermakian J. Injury risks and crashworthiness benefits for females and males: which differences are physiological? Traff Inj Prev. (2022) 23:11–6. doi: 10.1080/15389588.2021.2004312

8. Craig M, Liu C, Zhang F, Enriquez J. Sex-based differences in odds of motor vehicle crash injury outcomes. Accid Anal Prev. (2024) 195:107100. doi: 10.1016/j.aap.2023.107100

9. Abrams M, Bass C. Female vs. male relative fatality risk in fatal crashes. In: Proceedings of the 2020 IRCOBI Conference. Beijing (2020), p. 11–14.

10. American College of Surgeons. National Trauma Data Bank: NTDB Research Data Set Admission Years 2018–2021. (2021). Available online at: www.facs.org/quality-programs/trauma/quality/national-trauma-data-bank/

11. World Health Organization. International Statistical Classification of Diseases and Related Health Problems. 11th ed. World Health Organization (2019). Available online at: https://icd.who.int/

12. Association for the Advancement of Automotive Medicine. Abbreviated Injury Scale © 2005 Update 2008. Gennarelli T, Woodzin E, editors. Chicago, IL (2016).

13. Bolorunduro OB, Villegas C, Oyetunji TA, Haut ER, Stevens KA, Chang DC, et al. Validating the Injury Severity Score (ISS) in different populations: ISS predicts mortality better among Hispanics and females. J Surg Res. (2011) 166:40–4. doi: 10.1016/j.jss.2010.04.012

14. Koch E, Lovett S, Nghiem T, Riggs RA, Rech MA. Shock index in the emergency department: utility and limitations. Open Access Emerg Med. (2019) 11:179–99. doi: 10.2147/OAEM.S178358

15. Yang YC, Lin PC, Liu CY, Tzeng IS, Lee SJ, Hou YT, et al. Prehospital shock index multiplied by AVPU scale as a predictor of clinical outcomes in traumatic injury. Shock. (2022) 58:524–33. doi: 10.1097/SHK.0000000000002018

16. Fecher A, Stimpson A, Ferrigno L, Pohlman TH. The pathophysiology and management of hemorrhagic shock in the polytrauma patient. J Clin Med. (2021) 10:4793. doi: 10.3390/jcm10204793

17. American College of Surgeons Committee on Trauma. Advanced Trauma Life Support Student Course Manual. 10th ed. Chicago, IL (2018).

18. Linder A, Svedberg W. Review of average sized male and female occupant models in European regulatory safety assessment tests and European laws: gaps and bridging suggestions. Accid Anal Prev. (2019) 127:156–62. doi: 10.1016/j.aap.2019.02.030

19. Frye H, Ko D, Kotnik E, Zelt N. Motor vehicle crash testing regulations for more inclusive populations. J Sci Policy Gov. (2021) 18:e410. doi: 10.38126/JSPG180410

Keywords: crash safety, equity in research, traumatic injury, vehicle crash, shock index, sex differences

Citation: Cronn S, Somasundaram K, Driesslein K, Tomas CW and Pintar F (2024) Sex-related disparities in vehicle crash injury and hemodynamics. Front. Public Health 12:1331313. doi: 10.3389/fpubh.2024.1331313

Received: 31 October 2023; Accepted: 05 February 2024; Published: 15 March 2024.


Copyright © 2024 Cronn, Somasundaram, Driesslein, Tomas and Pintar. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Susan Cronn, scronn@mcw.edu
