Data Analysis: Recently Published Documents
Introduce a Survival Model with Spatial Skew Gaussian Random Effects and its Application in Covid-19 Data Analysis

Futuristic prediction of missing value imputation methods using extended ANN.

Missing data is a universal problem across most research fields and introduces uncertainty into data analysis. It can arise for many reasons, such as mishandled samples, failure to collect an observation, measurement errors, deletion of aberrant values, or simply gaps in the study. The nourishment area is no exception to the problem of missing data. Most frequently, the problem is handled by imputing means or medians from the existing dataset, an approach that needs improvement. This paper proposes a hybrid scheme of MICE and ANN, known as extended ANN, to locate missing values and impute them in a given dataset. The proposed mechanism efficiently identifies blank entries and fills them by examining their neighboring records, improving the accuracy of the dataset. To validate the proposed scheme, the extended ANN is compared against various recent algorithms to assess the efficiency and accuracy of its results.
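To make the imputation ideas concrete, the sketch below pairs a MICE-style iterative imputer with a neighbor-based imputer from scikit-learn. It is a minimal illustration of the two ingredients the abstract mentions (MICE plus examining neighboring records), not the authors' extended-ANN implementation; the toy matrix is hypothetical.

    # Minimal sketch: MICE-style and neighbor-based imputation with
    # scikit-learn; NOT the paper's extended-ANN method.
    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer, KNNImputer

    X = np.array([[1.0, 2.0, np.nan],
                  [3.0, np.nan, 6.0],
                  [5.0, 4.0, 2.0],
                  [np.nan, 5.0, 3.0]])

    # MICE-style: iteratively model each incomplete feature on the others.
    print(IterativeImputer(max_iter=10, random_state=0).fit_transform(X))

    # Neighbor-based: fill each blank entry from the most similar rows,
    # mirroring the idea of examining neighboring records.
    print(KNNImputer(n_neighbors=2).fit_transform(X))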

Applications of multivariate data analysis in shelf life studies of edible vegetal oils – A review of the few past years

Hypothesis formalization: empirical findings, software limitations, and design implications.

Data analysis requires translating higher-level questions and hypotheses into computable statistical models. We present a mixed-methods study aimed at identifying the steps, considerations, and challenges involved in operationalizing hypotheses into statistical models, a process we refer to as hypothesis formalization. In a formative content analysis of 50 research papers, we find that researchers highlight decomposing a hypothesis into sub-hypotheses, selecting proxy variables, and formulating statistical models based on data collection design as key steps. In a lab study, we find that analysts fixated on implementation and shaped their analyses to fit familiar approaches, even if sub-optimal. In an analysis of software tools, we find that tools provide inconsistent, low-level abstractions that may limit the statistical models analysts use to formalize hypotheses. Based on these observations, we characterize hypothesis formalization as a dual-search process balancing conceptual and statistical considerations constrained by data and computation, and discuss implications for future tools.

The Complexity and Expressive Power of Limit Datalog

Motivated by applications in declarative data analysis, in this article we study Datalog_Z, an extension of Datalog with stratified negation and arithmetic functions over integers. This language is known to be undecidable, so we present the fragment of limit Datalog_Z programs, which is powerful enough to naturally capture many important data analysis tasks. In limit Datalog_Z, all intensional predicates with a numeric argument are limit predicates that keep maximal or minimal bounds on numeric values. We show that reasoning in limit Datalog_Z is decidable if a linearity condition restricting the use of multiplication is satisfied. In particular, limit-linear Datalog_Z is complete for Δ2^EXP and captures Δ2^P over ordered datasets in the sense of descriptive complexity. We also provide a comprehensive study of several fragments of limit-linear Datalog_Z. We show that semi-positive limit-linear programs (i.e., programs where negation is allowed only in front of extensional atoms) capture coNP over ordered datasets; furthermore, reasoning becomes coNEXP-complete in combined and coNP-complete in data complexity, where the lower bounds hold already for negation-free programs. In order to satisfy the requirements of data-intensive applications, we also propose an additional stability requirement, which causes the complexity of reasoning to drop to EXP in combined and to P in data complexity, thus obtaining the same bounds as for usual Datalog. Finally, we compare our formalisms with the languages underpinning existing Datalog-based approaches for data analysis and show that core fragments of these languages can be encoded as limit programs; this allows us to transfer decidability and complexity upper bounds from limit programs to other formalisms. Therefore, our article provides a unified logical framework for declarative data analysis which can be used as a basis for understanding the impact on expressive power and computational complexity of the key constructs available in existing languages.

An Empirical Study on Cross-Border E-commerce Talent Cultivation Based on Skill Gap Theory and Big Data Analysis

To resolve the mismatch between the growing demand for cross-border e-commerce talent and students' incompatible skill levels, Industry-University-Research cooperation, an essential pillar of the interdisciplinary talent cultivation model adopted by colleges and universities, draws synergy from the relevant parties and builds a bridge between knowledge and practice. Nevertheless, industry-university-research cooperation has developed only recently in the cross-border e-commerce field and faces several problems, such as unstable collaboration relationships and vague training plans.

The Effects of Cross-border e-Commerce Platforms on Transnational Digital Entrepreneurship

This research examines the important concept of transnational digital entrepreneurship (TDE). The paper integrates the host and home country entrepreneurial ecosystems with the digital ecosystem into the framework of the transnational digital entrepreneurial ecosystem. The authors argue that cross-border e-commerce platforms provide critical foundations in the digital entrepreneurial ecosystem. Entrepreneurs who count on this ecosystem are defined as transnational digital entrepreneurs. Interview data from twelve Chinese immigrant entrepreneurs living in Australia and New Zealand were analyzed as case studies. The results of the data analysis reveal that cross-border entrepreneurs do in fact rely on the framework of the transnational digital ecosystem. Cross-border e-commerce platforms not only play a bridging role between home and host country ecosystems but also provide the entrepreneurial capital that the digital ecosystem promises.

Subsampling and Jackknifing: A Practically Convenient Solution for Large Data Analysis With Limited Computational Resources

A Trajectory Evaluator by Sub-tracks for Detecting VOT-based Anomalous Trajectory

With the popularization of visual object tracking (VOT), more and more trajectory data are obtained and have begun to gain widespread attention in fields such as mobile robots and intelligent video surveillance. How to clean the anomalous trajectories hidden in the massive data has become one of the research hotspots. Anomalous trajectories should be detected and cleaned before the trajectory data can be effectively used. In this article, a Trajectory Evaluator by Sub-tracks (TES) for detecting VOT-based anomalous trajectories is proposed. A Feature of Anomalousness is defined and used as the eigenvector of a classifier to filter tracklet anomalous trajectories and identity-switch anomalous trajectories; it comprises a Feature of Anomalous Pose and a Feature of Anomalous Sub-tracks (FAS). In comparative experiments, TES achieves better results across different scenes than state-of-the-art methods. Moreover, FAS performs better than point flow, least-squares fitting, and Chebyshev polynomial fitting. It is verified that TES is more accurate and effective and is conducive to sub-track trajectory data analysis.

Data Analysis in Research: Types & Methods


Content Index

  • What is data analysis in research?
  • Why analyze data in research?
  • Types of data in research
  • Finding patterns in the qualitative data
  • Methods used for data analysis in qualitative research
  • Preparing data for analysis
  • Methods used for data analysis in quantitative research
  • Considerations in research data analysis

What is data analysis in research?

Definition of data analysis in research: According to LeCompte and Schensul, research data analysis is a process used by researchers to reduce data to a story and interpret it to derive insights. The data analysis process helps reduce a large chunk of data into smaller, meaningful fragments.

Three essential things occur during the data analysis process. The first is data organization. The second is data reduction through summarization and categorization, which together help find patterns and themes in the data for easy identification and linking. The third is data analysis itself, which researchers perform in both top-down and bottom-up fashion.


On the other hand, Marshall and Rossman describe data analysis as a messy, ambiguous, and time-consuming but creative and fascinating process through which a mass of collected data is brought to order, structure and meaning.

We can say that data analysis and data interpretation constitute a process that applies deductive and inductive logic to the research.

Why analyze data in research?

Researchers rely heavily on data, as they have a story to tell or research problems to solve. It starts with a question, and data is nothing but an answer to that question. But what if there is no question to ask? Well, it is possible to explore data even without a problem; we call it 'Data Mining', which often reveals interesting patterns within the data that are worth exploring.

Regardless of the type of data researchers explore, their mission and their audience's vision guide them in finding the patterns that shape the story they want to tell. One of the essential things expected from researchers while analyzing data is to stay open and remain unbiased toward unexpected patterns, expressions, and results. Remember, sometimes data analysis tells the most unforeseen yet exciting stories that were not expected when the analysis began. Therefore, rely on the data you have at hand and enjoy the journey of exploratory research.


Types of data in research

Every kind of data describes things once a specific value is assigned to it. For analysis, you need to organize these values and process and present them in a given context to make them useful. Data can come in different forms; here are the primary data types.

  • Qualitative data: When the data presented consists of words and descriptions, we call it qualitative data. Although you can observe this data, it is subjective and harder to analyze, especially for comparison. Example: anything describing taste, experience, texture, or an opinion is considered qualitative data. This type of data is usually collected through focus groups, personal qualitative interviews, qualitative observation, or open-ended questions in surveys.
  • Quantitative data: Any data expressed in numbers or numerical figures is called quantitative data. This type of data can be divided into categories, grouped, measured, calculated, or ranked. Example: questions about age, rank, cost, length, weight, scores, etc. all produce this type of data. You can present such data in graphical formats or charts, or apply statistical analysis methods to it. The OMS (Outcomes Measurement Systems) questionnaires in surveys are a significant source of numeric data.
  • Categorical data: This is data presented in groups; an item included in categorical data cannot belong to more than one group. Example: a person responding to a survey by describing their living style, marital status, smoking habit, or drinking habit provides categorical data. A chi-square test is a standard method used to analyze this data (see the sketch after this list).
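As a concrete illustration of the chi-square test mentioned above, here is a minimal sketch using scipy; the contingency table of marital status versus smoking habit is made up for the example.

    from scipy.stats import chi2_contingency

    # Hypothetical counts: rows = married, single; columns = smoker, non-smoker.
    table = [[30, 70],
             [45, 55]]

    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
    # A small p-value suggests the two categorical variables are not independent.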


Data analysis in qualitative research

Qualitative data analysis works a little differently from analysis of numerical data, as qualitative data consists of words, descriptions, images, objects, and sometimes symbols. Getting insight from such complex information is an involved process; hence it is typically used for exploratory research and data analysis.

Finding patterns in the qualitative data

Although there are several ways to find patterns in textual information, a word-based method is the most relied-upon and widely used technique for research and data analysis. Notably, the data analysis process in qualitative research is largely manual. Here the researchers usually read the available data and find repetitive or commonly used words.

For example, while studying data collected from African countries to understand the most pressing issues people face, researchers might find "food" and "hunger" to be the most commonly used words and will highlight them for further analysis.
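A minimal sketch of this word-based technique, using only the Python standard library and invented responses, shows how frequent words such as "food" and "hunger" surface:

    # Count repetitive words in open-ended responses (toy data).
    from collections import Counter
    import re

    responses = [
        "Food prices keep rising and hunger is widespread",
        "Hunger affects children the most; food aid is scarce",
        "Clean water and food remain the biggest problems",
    ]

    stopwords = {"the", "and", "is", "are", "a", "of", "to", "in", "most"}
    words = [w for r in responses
             for w in re.findall(r"[a-z]+", r.lower())
             if w not in stopwords]
    print(Counter(words).most_common(5))  # 'food' and 'hunger' rise to the top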


Keyword-in-context is another widely used word-based technique. In this method, the researcher tries to understand the concept by analyzing the context in which the participants use a particular keyword.

For example, researchers conducting research and data analysis to study the concept of 'diabetes' amongst respondents might analyze the context of when and how a respondent has used or referred to the word 'diabetes.'
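A small keyword-in-context sketch (toy responses, standard library only) prints a window of words around each occurrence of the keyword:

    # Show a window of words around each occurrence of a keyword.
    def keyword_in_context(texts, keyword, window=3):
        for text in texts:
            tokens = text.lower().split()
            for i, tok in enumerate(tokens):
                if keyword in tok:
                    left = " ".join(tokens[max(0, i - window):i])
                    right = " ".join(tokens[i + 1:i + 1 + window])
                    print(f"... {left} [{tokens[i]}] {right} ...")

    keyword_in_context(
        ["My mother manages her diabetes with diet alone",
         "I was afraid the test would show diabetes again"],
        "diabetes",
    )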

The scrutiny-based technique is another highly recommended text analysis method used to identify patterns in qualitative data. Compare-and-contrast is the most widely used method under this technique, examining how one piece of text is similar to or different from another.

For example: to find out the "importance of a resident doctor in a company," the collected data is divided into people who think it is necessary to hire a resident doctor and those who think it is unnecessary. Compare-and-contrast is the best method for analyzing polls with single-answer question types.

Metaphors can be used to reduce the data pile and find patterns in it so that it becomes easier to connect data with theory.

Variable partitioning is another technique used to split variables so that researchers can find more coherent descriptions and explanations in the enormous amount of data.


Methods used for data analysis in qualitative research

There are several techniques for analyzing data in qualitative research, but here are some commonly used methods:

  • Content Analysis: This is the most widely accepted and most frequently employed technique for data analysis in research methodology. It can be used to analyze documented information in text, images, and sometimes physical items. When and where to use this method depends on the research questions.
  • Narrative Analysis: This method is used to analyze content gathered from various sources, such as personal interviews, field observation, and surveys. Most of the time, the stories or opinions shared by people are examined to find answers to the research questions.
  • Discourse Analysis: Similar to narrative analysis, discourse analysis is used to analyze interactions with people. However, this particular method considers the social context within which the communication between researcher and respondent takes place. In addition, discourse analysis also considers the respondent's lifestyle and day-to-day environment while deriving any conclusion.
  • Grounded Theory: When you want to explain why a particular phenomenon happened, grounded theory is the best resort for analyzing qualitative data. Grounded theory is applied to study data about a host of similar cases occurring in different settings. When researchers use this method, they may alter their explanations or produce new ones until they arrive at a conclusion.


Data analysis in quantitative research

Preparing data for analysis

The first stage in research and data analysis is to prepare the data for analysis so that raw, nominal data can be converted into something meaningful. Data preparation consists of the phases below.

Phase I: Data Validation

Data validation is done to check whether the collected data sample meets the pre-set standards or is a biased data sample. It is divided into four different stages (a short validation sketch in Python follows the list):

  • Fraud: To ensure an actual human being records each response to the survey or the questionnaire
  • Screening: To make sure each participant or respondent is selected or chosen in compliance with the research criteria
  • Procedure: To ensure ethical standards were maintained while collecting the data sample
  • Completeness: To ensure that the respondent has answered all the questions in an online survey or, in an interview, that the interviewer asked all the questions devised in the questionnaire
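The hedged sketch below runs the screening and completeness checks on a hypothetical survey table with pandas; the column names and criteria are invented for illustration.

    import pandas as pd

    df = pd.DataFrame({
        "respondent_id": [1, 2, 3, 4],
        "age": [25, 17, 34, 41],          # research criterion: adults only
        "q1": ["yes", "no", None, "yes"], # None marks an unanswered question
        "q2": ["a", "b", "a", None],
    })

    # Screening: keep only respondents who meet the research criteria.
    screened = df[df["age"] >= 18]

    # Completeness: keep only respondents who answered every question.
    complete = screened.dropna(subset=["q1", "q2"])
    print(f"{len(complete)} of {len(df)} responses passed validation")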

Phase II: Data Editing

More often than not, an extensive research data sample comes loaded with errors. Respondents sometimes fill in some fields incorrectly or skip them accidentally. Data editing is a process wherein researchers confirm that the provided data is free of such errors. They conduct the necessary checks, including outlier checks, to edit the raw data and make it ready for analysis.

Phase III: Data Coding

Of all three phases, this is the most critical one, as it is associated with grouping and assigning values to the survey responses. If a survey is completed with a sample size of 1,000, the researcher might create age brackets to distinguish the respondents based on their age. It then becomes easier to analyze small data buckets rather than deal with the massive data pile.
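For instance, age brackets can be coded with pandas as in this small sketch (the bracket edges are illustrative):

    import pandas as pd

    # Group raw ages into brackets so responses can be analyzed in buckets.
    ages = pd.Series([19, 24, 37, 45, 52, 61, 70])
    brackets = pd.cut(ages,
                      bins=[17, 25, 35, 50, 65, 100],
                      labels=["18-25", "26-35", "36-50", "51-65", "65+"])
    print(brackets.value_counts().sort_index())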


Methods used for data analysis in quantitative research

After the data is prepared for analysis, researchers can use different research and data analysis methods to derive meaningful insights. Statistical techniques are the most favored for analyzing numerical data. In statistical analysis, distinguishing between categorical data and numerical data is essential, as categorical data involves distinct categories or labels, while numerical data consists of measurable quantities. Statistical methods fall into two groups: 'descriptive statistics', used to describe data, and 'inferential statistics', which help compare and generalize from the data.

Descriptive statistics

This method is used to describe the basic features of the various types of data in research. It presents the data in such a meaningful way that patterns in the data start making sense. However, descriptive analysis does not go beyond summarizing the data; any conclusions drawn are still based on the hypotheses researchers have formulated so far. Here are a few major types of descriptive analysis methods.

Measures of Frequency

  • Count, Percent, Frequency
  • It is used to denote how often a particular event occurs.
  • Researchers use it when they want to showcase how often a response is given.

Measures of Central Tendency

  • Mean, Median, Mode
  • These measures are widely used to summarize where the center of a distribution lies.
  • Researchers use this method when they want to showcase the most common or average response.

Measures of Dispersion or Variation

  • Range, Variance, Standard deviation
  • The range is the difference between the highest and lowest points.
  • Variance and standard deviation summarize the differences between the observed scores and the mean.
  • These measures identify the spread of scores by stating intervals.
  • Researchers use this method to showcase how spread out the data is. It helps them see how widely the scores are dispersed and how that spread affects the mean.

Measures of Position

  • Percentile ranks, Quartile ranks
  • These measures rely on standardized scores, helping researchers identify the relationship between different scores.
  • It is often used when researchers want to compare scores with the average count.
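The sketch below computes the four families of descriptive measures listed above on a toy score sample with pandas and NumPy:

    import numpy as np
    import pandas as pd

    scores = pd.Series([56, 62, 62, 70, 71, 75, 75, 75, 82, 90])

    print(scores.value_counts())                                   # frequency
    print(scores.mean(), scores.median(), scores.mode().tolist())  # central tendency
    print(scores.max() - scores.min(), scores.var(), scores.std()) # dispersion
    print(np.percentile(scores, [25, 50, 75]))                     # position: quartiles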

In quantitative research, descriptive analysis often gives absolute numbers, but those numbers alone are never sufficient to demonstrate the rationale behind them. Nevertheless, it is necessary to choose the method of research and data analysis that suits your survey questionnaire and the story researchers want to tell. For example, the mean is the best way to demonstrate students' average scores in schools. It is better to rely on descriptive statistics when researchers intend to keep the research or outcome limited to the provided sample without generalizing it. For example, when you want to compare the average votes cast in two different cities, descriptive statistics are enough.

Descriptive analysis is also called a ‘univariate analysis’ since it is commonly used to analyze a single variable.

Inferential statistics

Inferential statistics are used to make predictions about a larger population after research and data analysis of a sample representing that population. For example, you can ask some 100-odd audience members at a movie theater whether they like the movie they are watching. Researchers then use inferential statistics on the collected sample to infer that about 80-90% of people like the movie.

Here are two significant areas of inferential statistics.

  • Estimating parameters: This takes statistics from the sample research data and uses them to say something about the population parameter (see the sketch after this list).
  • Hypothesis test: It's about sampling research data to answer the survey research questions. For example, researchers might want to know whether a newly launched shade of lipstick is good or not, or whether multivitamin capsules help children perform better at games.
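For the movie-theater example, a minimal sketch of estimating a population parameter looks like this: the sample proportion plus a 95% confidence interval under the normal approximation (the counts are hypothetical).

    import math

    n, liked = 100, 85                 # 85 of 100 sampled viewers liked the movie
    p_hat = liked / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)
    print(f"Estimated share who like the movie: {p_hat:.0%}, "
          f"95% CI: {ci[0]:.0%} to {ci[1]:.0%}")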

These are sophisticated analysis methods used to showcase the relationship between different variables instead of describing a single variable. It is often used when researchers want something beyond absolute numbers to understand the relationship between variables.

Here are some of the commonly used methods for data analysis in research.

  • Correlation: When researchers are not conducting experimental or quasi-experimental research but are interested in understanding the relationship between two or more variables, they opt for correlational research methods.
  • Cross-tabulation: Also called contingency tables, cross-tabulation is used to analyze the relationship between multiple variables. Suppose the provided data has age and gender categories presented in rows and columns. A two-dimensional cross-tabulation enables seamless data analysis by showing the number of males and females in each age category (a combined sketch follows this list).
  • Regression analysis: To understand the strength of the relationship between two variables, researchers rarely look beyond the primary and commonly used regression analysis method, which is also a type of predictive analysis. In this method, you have an essential factor called the dependent variable and one or more independent variables, and you try to find out the impact of the independent variables on the dependent variable. The values of both independent and dependent variables are assumed to be ascertained in an error-free, random manner.
  • Frequency tables: This statistical procedure summarizes how often each value or response occurs, giving a quick view of how a variable is distributed across the sample.
  • Analysis of variance: This statistical procedure is used to test the degree to which two or more groups vary or differ in an experiment. A considerable degree of variation means the research findings were significant. In many contexts, ANOVA testing and variance analysis are similar.
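As promised above, here is a combined sketch of cross-tabulation and a simple regression fit on a toy survey table; all columns and values are invented for illustration.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "gender": ["m", "f", "f", "m", "f", "m"],
        "age_group": ["18-25", "18-25", "26-35", "26-35", "18-25", "26-35"],
        "hours_online": [4.0, 5.5, 3.0, 2.5, 6.0, 3.5],
        "purchases": [2, 3, 1, 1, 4, 2],
    })

    # Cross-tabulation: respondents per gender within each age group.
    print(pd.crosstab(df["age_group"], df["gender"]))

    # Simple regression: least-squares fit of purchases (dependent) on
    # hours_online (independent).
    slope, intercept = np.polyfit(df["hours_online"], df["purchases"], 1)
    print(f"purchases ≈ {slope:.2f} * hours_online + {intercept:.2f}")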
Considerations in research data analysis

  • Researchers must have the necessary research skills to analyze and manipulate the data and must be trained to demonstrate a high standard of research practice. Ideally, researchers should possess more than a basic understanding of the rationale for selecting one statistical method over another to obtain better data insights.
  • Usually, research and data analytics projects differ by scientific discipline; therefore, getting statistical advice at the beginning of the analysis helps design a survey questionnaire, select data collection methods, and choose samples.


  • The primary aim of data research and analysis is to derive insights that are unbiased. Any mistake in, or bias while, collecting data, selecting an analysis method, or choosing an audience sample is liable to produce a biased inference.
  • No amount of sophistication in research data analysis can rectify poorly defined objective outcome measurements. Whether the design is at fault or the intentions are not clear, a lack of clarity might mislead readers, so avoid that practice.
  • The motive behind data analysis in research is to present accurate and reliable data. As far as possible, avoid statistical errors, and find ways to deal with everyday challenges like outliers, missing data, data alteration, data mining, and graphical representation.

The sheer amount of data generated daily is staggering, especially now that data analysis has taken center stage. In 2018, the total data supply amounted to 2.8 trillion gigabytes. Hence, it is clear that enterprises willing to survive in the hypercompetitive world must possess an excellent capability to analyze complex research data, derive actionable insights, and adapt to new market needs.


QuestionPro is an online survey platform that empowers organizations in data analysis and research and provides them with a medium to collect data by creating appealing surveys.


  • Open access
  • Published: 06 January 2022

The use of Big Data Analytics in healthcare

  • Kornelia Batko (ORCID: orcid.org/0000-0001-6561-3826)
  • Andrzej Ślęzak

Journal of Big Data, volume 9, Article number: 3 (2022)


Abstract

The introduction of Big Data Analytics (BDA) in healthcare will allow the use of new technologies both in the treatment of patients and in health management. The paper aims at analyzing the possibilities of using Big Data Analytics in healthcare. The research is based on a critical analysis of the literature, as well as the presentation of selected results of direct research on the use of Big Data Analytics in medical facilities. The direct research was carried out based on a research questionnaire and conducted on a sample of 217 medical facilities in Poland. Literature studies have shown that the use of Big Data Analytics can bring many benefits to medical facilities, while direct research has shown that medical facilities in Poland are moving towards data-based healthcare: they use structured and unstructured data and reach for analytics in the administrative, business and clinical areas. The research confirmed that medical facilities work on both structured and unstructured data. The following kinds and sources of data can be distinguished: data from databases, transaction data, unstructured content of emails and documents, and data from devices and sensors; the use of data from social media is lower. In their activity, facilities reach for analytics not only in the administrative and business areas but also in the clinical area. This clearly shows that the decisions made in medical facilities are highly data-driven. The results of the study confirm what has been analyzed in the literature: medical facilities are moving towards data-based healthcare, together with its benefits.

Introduction

The main contribution of this paper is to present an analytical overview of using structured and unstructured data (Big Data) analytics in medical facilities in Poland. Medical facilities use both structured and unstructured data in their practice. Structured data has a predetermined schema, while unstructured data is extensive, freeform, and comes in a variety of forms [ 27 ]. Unstructured data, referred to as Big Data (BD), does not fit into the typical data processing format. Big Data is a massive amount of data sets that cannot be stored, processed, or analyzed using traditional tools; it remains stored but not analyzed. Due to the lack of a well-defined schema, it is difficult to search and analyze such data and, therefore, it requires a specific technology and method to transform it into value [ 20 , 68 ]. Integrating data stored in both structured and unstructured formats can add significant value to an organization [ 27 ]. Organizations must approach unstructured data in a different way; therefore, the potential is seen in Big Data Analytics (BDA). Big Data Analytics comprises the techniques and tools used to analyze and extract information from Big Data. The results of Big Data analysis can be used to predict the future, and they also help identify trends in past data. When it comes to healthcare, it makes it possible to analyze large datasets from thousands of patients, identifying clusters and correlations between datasets, as well as developing predictive models using data mining techniques [ 60 ].

This paper is the first study to consolidate and characterize the use of Big Data from different perspectives. The first part consists of a brief literature review of studies on Big Data (BD) and Big Data Analytics (BDA), while the second part presents results of direct research aimed at diagnosing the use of big data analyses in medical facilities in Poland.

Healthcare is a complex system with varied stakeholders: patients, doctors, hospitals, pharmaceutical companies and healthcare decision-makers. This sector is also limited by strict rules and regulations. However, worldwide one may observe a departure from the traditional doctor-patient approach. The doctor becomes a partner and the patient is involved in the therapeutic process [ 14 ]. Healthcare is no longer focused solely on the treatment of patients. The priority for decision-makers should be to promote proper health attitudes and prevent diseases that can be avoided [ 81 ]. This became visible and important especially during the Covid-19 pandemic [ 44 ].

The next challenges that healthcare will have to face are the growing number of elderly people and a decline in fertility. Fertility rates are below the reproductive minimum necessary to keep the population stable [ 10 ]. Both effects, namely the increase in age and lower fertility rates, are reflected in the demographic load indicator, which is constantly growing. Forecasts show that providing healthcare in the form it is provided today will become impossible in the next 20 years [ 70 ]. This is especially visible now, during the Covid-19 pandemic, when healthcare faced quite a challenge related to the analysis of huge amounts of data and the need to identify trends and predict the spread of the coronavirus. The pandemic showed even more clearly that patients should have access to information about their health condition, the possibility of digital analysis of this data and access to reliable medical support online. Health monitoring and cooperation with doctors in order to prevent diseases can actually revolutionize the healthcare system. One of the most important aspects of the necessary change in healthcare is putting the patient at the center of the system.

Technology alone is not enough to achieve these goals. Therefore, changes should be made not only at the technological level but also in the management and design of complete healthcare processes; what is more, they should affect the business models of service providers. The use of Big Data Analytics is becoming more and more common in enterprises [ 17 , 54 ]. However, medical enterprises still cannot keep up with the information needs of patients, clinicians, administrators and policy makers. The adoption of a Big Data approach would allow the implementation of personalized and precise medicine based on personalized information, delivered in real time and tailored to individual patients.

To achieve this goal, it is necessary to implement systems that will be able to learn quickly about the data generated by people within clinical care and everyday life. This will enable data-driven decision making, receiving better personalized predictions about prognosis and responses to treatments; a deeper understanding of the complex factors and their interactions that influence health at the patient level, the health system and society, enhanced approaches to detecting safety problems with drugs and devices, as well as more effective methods of comparing prevention, diagnostic, and treatment options [ 40 ].

In the literature, there is a lot of research showing what opportunities big data analysis can offer to companies and what data can be analyzed. However, there are few studies showing how data analysis in the area of healthcare is performed, what data is used by medical facilities, and what analyses they carry out and in which areas. This paper aims to fill this gap by presenting the results of research carried out in medical facilities in Poland. The goal is to analyze the possibilities of using Big Data Analytics in healthcare, especially under Polish conditions. In particular, the paper is aimed at determining what data is processed by medical facilities in Poland, what analyses they perform and in what areas, and how they assess their analytical maturity. In order to achieve this goal, a critical analysis of the literature was performed, and the direct research was based on a research questionnaire conducted on a sample of 217 medical facilities in Poland. It was hypothesized that medical facilities in Poland are working on both structured and unstructured data and moving towards data-based healthcare and its benefits. Examining the maturity of healthcare facilities in the use of Big Data and Big Data Analytics is crucial in determining the potential future benefits that the healthcare sector can gain from Big Data Analytics. There is also a pressing need to predict whether, in the coming years, healthcare will be able to cope with the threats and challenges it faces.

This paper is divided into eight parts. The first is the introduction which provides background and the general problem statement of this research. In the second part, this paper discusses considerations on use of Big Data and Big Data Analytics in Healthcare, and then, in the third part, it moves on to challenges and potential benefits of using Big Data Analytics in healthcare. The next part involves the explanation of the proposed method. The result of direct research and discussion are presented in the fifth part, while the following part of the paper is the conclusion. The seventh part of the paper presents practical implications. The final section of the paper provides limitations and directions for future research.

Considerations on the use of Big Data and Big Data Analytics in healthcare

In recent years one can observe a constantly increasing demand for solutions offering effective analytical tools. This trend is also noticeable in the analysis of large volumes of data (Big Data, BD). Organizations are looking for ways to use the power of Big Data to improve their decision making, competitive advantage or business performance [ 7 , 54 ]. Big Data is considered to offer potential solutions to public and private organizations, however, still not much is known about the outcome of the practical use of Big Data in different types of organizations [ 24 ].

As already mentioned, in recent years healthcare management worldwide has been changing from a disease-centered model to a patient-centered model, and even to a value-based healthcare delivery model [ 68 ]. In order to meet the requirements of this model and provide effective patient-centered care, it is necessary to manage and analyze healthcare Big Data.

The issue often raised when it comes to the use of data in healthcare is the appropriate use of Big Data. Healthcare has always generated huge amounts of data and nowadays, the introduction of electronic medical records, as well as the huge amount of data sent by various types of sensors or generated by patients in social media causes data streams to constantly grow. Also, the medical industry generates significant amounts of data, including clinical records, medical images, genomic data and health behaviors. Proper use of the data will allow healthcare organizations to support clinical decision-making, disease surveillance, and public health management. The challenge posed by clinical data processing involves not only the quantity of data but also the difficulty in processing it.

In the literature one can find many different definitions of Big Data. This concept has evolved in recent years; however, it is still not clearly understood. Nevertheless, despite the range of and differences in definitions, Big Data can be treated as: a large amount of digital data, large data sets, a tool, a technology, or a phenomenon (cultural or technological).

Big Data can be considered as massive and continually generated digital datasets that are produced via interactions with online technologies [ 53 ]. Big Data can be defined as datasets that are of such large sizes that they pose challenges in traditional storage and analysis techniques [ 28 ]. A similar opinion about Big Data was presented by Ohlhorst who sees Big Data as extremely large data sets, possible neither to manage nor to analyze with traditional data processing tools [ 57 ]. In his opinion, the bigger the data set, the more difficult it is to gain any value from it.

In turn, Knapp perceived Big Data as tools, processes and procedures that allow an organization to create, manipulate and manage very large data sets and storage facilities [ 38 ]. From this point of view, Big Data is identified as a tool to gather information from different databases and processes, allowing users to manage large amounts of data.

Similar perception of the term ‘Big Data’ is shown by Carter. According to him, Big Data technologies refer to a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data by enabling high velocity capture, discovery and/or analysis [ 13 ].

Jordan combines these two approaches by identifying Big Data as a complex system, as it needs data bases for data to be stored in, programs and tools to be managed, as well as expertise and personnel able to retrieve useful information and visualization to be understood [ 37 ].

Following Laney's definition of Big Data, it can be stated that it is a large amount of data generated in very fast motion and containing a lot of content [ 43 ]. Such data comes from unstructured sources, such as streams of clicks on the web, social networks (Twitter, blogs, Facebook), video recordings from shops, recordings of calls in a call center, real-time information from various kinds of sensors, RFID, GPS devices, mobile phones and other devices that identify and monitor something [ 8 ]. Big Data is a powerful digital data silo: raw, collected from all sorts of sources, unstructured and difficult, or even impossible, to analyze using the conventional techniques applied so far to relational databases.

While describing Big Data, it cannot be overlooked that the term refers more to a phenomenon than to a specific technology. Therefore, instead of defining this phenomenon, more and more authors describe Big Data by giving it characteristics, a collection of V's related to its nature [ 2 , 3 , 23 , 25 , 58 ]:

Volume (refers to the amount of data and is one of the biggest challenges in Big Data Analytics),

Velocity (speed with which new data is generated, the challenge is to be able to manage data effectively and in real time),

Variety (heterogeneity of data, many different types of healthcare data, the challenge is to derive insights by looking at all available heterogenous data in a holistic manner),

Variability (inconsistency of data, the challenge is to correct the interpretation of data that can vary significantly depending on the context),

Veracity (how trustworthy the data is, quality of the data),

Visualization (ability to interpret data and resulting insights, challenging for Big Data due to its other features as described above).

Value (the goal of Big Data Analytics is to discover the hidden knowledge from huge amounts of data).

Big Data is defined as an information asset with high volume, velocity, and variety, which requires specific technology and method for its transformation into value [ 21 , 77 ]. Big Data is also a collection of information about high-volume, high volatility or high diversity, requiring new forms of processing in order to support decision-making, discovering new phenomena and process optimization [ 5 , 7 ]. Big Data is too large for traditional data-processing systems and software tools to capture, store, manage and analyze, therefore it requires new technologies [ 28 , 50 , 61 ] to manage (capture, aggregate, process) its volume, velocity and variety [ 9 ].

Undoubtedly, Big Data differs from the data sources used so far by organizations. Therefore, organizations must approach this type of unstructured data in a different way. First of all, organizations must start to see data as flows and not stocks—this entails the need to implement the so-called streaming analytics [ 48 ]. The mentioned features make it necessary to use new IT tools that allow the fullest use of new data [ 58 ]. The Big Data idea, inseparable from the huge increase in data available to various organizations or individuals, creates opportunities for access to valuable analyses, conclusions and enables making more accurate decisions [ 6 , 11 , 59 ].

The Big Data concept is constantly evolving and currently it does not focus on huge amounts of data, but rather on the process of creating value from this data [ 52 ]. Big Data is collected from various sources that have different data properties and are processed by different organizational units, resulting in creation of a Big Data chain [ 36 ]. The aim of the organizations is to manage, process and analyze Big Data. In the healthcare sector, Big Data streams consist of various types of data, namely [ 8 , 51 ]:

clinical data, i.e. data obtained from electronic medical records, data from hospital information systems, image centers, laboratories, pharmacies and other organizations providing health services, patient generated health data, physician’s free-text notes, genomic data, physiological monitoring data [ 4 ],

biometric data provided from various types of devices that monitor weight, pressure, glucose level, etc.,

financial data, constituting a full record of economic operations reflecting the conducted activity,

data from scientific research activities, i.e. results of research, including drug research, design of medical devices and new methods of treatment,

data provided by patients, including description of preferences, level of satisfaction, information from systems for self-monitoring of their activity: exercises, sleep, meals consumed, etc.

data from social media.

These data are provided not only by patients but also by organizations and institutions, as well as by various types of monitoring devices, sensors or instruments [ 16 ]. Data that has been generated so far in the healthcare sector is stored in both paper and digital form. Thus, the essence and the specificity of the process of Big Data analyses means that organizations need to face new technological and organizational challenges [ 67 ]. The healthcare sector has always generated huge amounts of data and this is connected, among others, with the need to store medical records of patients. However, the problem with Big Data in healthcare is not limited to an overwhelming volume but also an unprecedented diversity in terms of types, data formats and speed with which it should be analyzed in order to provide the necessary information on an ongoing basis [ 3 ]. It is also difficult to apply traditional tools and methods for management of unstructured data [ 67 ]. Due to the diversity and quantity of data sources that are growing all the time, advanced analytical tools and technologies, as well as Big Data analysis methods which can meet and exceed the possibilities of managing healthcare data, are needed [ 3 , 68 ].

Therefore, the potential is seen in Big Data analyses, especially in terms of improving the quality of medical care, saving lives or reducing costs [ 30 ]. Extracting association rules, patterns and trends from this tangle of data will allow health service providers and other stakeholders in the healthcare sector to offer more accurate and more insightful diagnoses of patients, personalized treatment, monitoring of patients, preventive medicine, support for medical research and population health, as well as better quality of medical services and patient care while, at the same time, reducing costs (Fig. 1).

Figure 1. Healthcare Big Data Analytics applications (source: own elaboration)

The main challenge with Big Data is how to handle such a large amount of information and use it to make data-driven decisions in plenty of areas [ 64 ]. In the context of healthcare data, another major challenge is to adjust Big Data storage, analysis, presentation of analysis results and inference based on them in a clinical setting. Data analytics systems implemented in healthcare are designed to describe, integrate and present complex data in an appropriate way so that it can be understood better (Fig. 2). This would improve the efficiency of acquiring, storing, analyzing and visualizing Big Data from healthcare [ 71 ].

Figure 2. Process of Big Data Analytics

The result of data processing with the use of Big Data Analytics is appropriate data storytelling which may contribute to making decisions with both lower risk and data support. This, in turn, can benefit healthcare stakeholders. To take advantage of the potential massive amounts of data in healthcare and to ensure that the right intervention to the right patient is properly timed, personalized, and potentially beneficial to all components of the healthcare system such as the payer, patient, and management, analytics of large datasets must connect communities involved in data analytics and healthcare informatics [ 49 ]. Big Data Analytics can provide insight into clinical data and thus facilitate informed decision-making about the diagnosis and treatment of patients, prevention of diseases or others. Big Data Analytics can also improve the efficiency of healthcare organizations by realizing the data potential [ 3 , 62 ].

Big Data Analytics in medicine and healthcare refers to the integration and analysis of a large amount of complex heterogeneous data, such as various omics data (genomics, epigenomics, transcriptomics, proteomics, metabolomics, interactomics, pharmacogenetics, diseasomics), biomedical data, telemedicine data (sensors, medical equipment data) and electronic health records data [ 46 , 65 ].

When analyzing the phenomenon of Big Data in the healthcare sector, it should be noted that it can be considered from the point of view of three areas: epidemiological, clinical and business.

From a clinical point of view, Big Data analysis aims to improve the health and condition of patients, enable long-term predictions about their health status and support the implementation of appropriate therapeutic procedures. Ultimately, the use of data analysis in medicine is meant to allow the adaptation of therapy to a specific patient, that is, personalized (precision) medicine.

From an epidemiological point of view, it is desirable to obtain an accurate prognosis of morbidity in order to implement preventive programs in advance.

In the business context, Big Data analysis may enable offering personalized packages of commercial services or determining the probability of individual disease and infection occurrence. It is worth noting that Big Data means not only the collection and processing of data but, most of all, the inference and visualization of data necessary to obtain specific business benefits.

In order to introduce new management methods and new solutions in terms of effectiveness and transparency, it becomes necessary to make data more accessible, digital, searchable, as well as analyzed and visualized.

Erickson and Rothberg state that the information and data do not reveal their full value until insights are drawn from them. Data becomes useful when it enhances decision making and decision making is enhanced only when analytical techniques are used and an element of human interaction is applied [ 22 ].

Thus, healthcare has experienced much progress in the usage and analysis of data. Large-scale digitalization and transparency in this sector are a key element of almost all countries' government policies. For centuries, the treatment of patients was based on the judgment of doctors who made treatment decisions. In recent years, however, Evidence-Based Medicine has become more and more important, as it relies on the systematic analysis of clinical data and on treatment decisions based on the best available information [ 42 ]. In the healthcare sector, Big Data Analytics is expected to improve the quality of life and reduce operational costs [ 72 , 82 ]. Big Data Analytics enables organizations to improve and increase their understanding of the information contained in data. It also helps identify data that provides valuable insights for current as well as future decisions [ 28 ].

Big Data Analytics refers to technologies that are grounded mostly in data mining: text mining, web mining, process mining, audio and video analytics, statistical analysis, network analytics, social media analytics and web analytics [ 16 , 25 , 31 ]. Different data mining techniques can be applied on heterogeneous healthcare data sets, such as: anomaly detection, clustering, classification, association rules as well as summarization and visualization of those Big Data sets [ 65 ]. Modern data analytics techniques explore and leverage unique data characteristics even from high-speed data streams and sensor data [ 15 , 16 , 31 , 55 ]. Big Data can be used, for example, for better diagnosis in the context of comprehensive patient data, disease prevention and telemedicine (in particular when using real-time alerts for immediate care), monitoring patients at home, preventing unnecessary hospital visits, integrating medical imaging for a wider diagnosis, creating predictive analytics, reducing fraud and improving data security, better strategic planning and increasing patients’ involvement in their own health.

Big Data Analytics in healthcare can be divided into [ 33 , 73 , 74 ]:

descriptive analytics in healthcare is used to understand past and current healthcare decisions, converting data into useful information for understanding and analyzing healthcare decisions, outcomes and quality, as well as making informed decisions [ 33 ]. It can be used to create reports (i.e. about patients’ hospitalizations, physicians’ performance, utilization management), visualization, customized reports, drill down tables, or running queries on the basis of historical data.

predictive analytics operates on past performance in an effort to predict the future by examining historical or summarized health data, detecting patterns of relationships in these data, and then extrapolating these relationships to forecast. It can be used to, for example, predict the response of different patient groups to different drugs (dosages) or reactions (clinical trials), anticipate risk, find relationships in health data and detect hidden patterns [ 62 ]. In this way, it is possible to predict the epidemic spread, anticipate service contracts and plan healthcare resources. Predictive analytics is used in proper diagnosis and for appropriate treatments to be given to patients suffering from certain diseases [ 39 ].

prescriptive analytics—occurs when health problems involve too many choices or alternatives. It uses health and medical knowledge in addition to data or information. Prescriptive analytics is used in many areas of healthcare, including drug prescriptions and treatment alternatives. Personalized medicine and evidence-based medicine are both supported by prescriptive analytics.

discovery analytics—utilizes knowledge about knowledge to discover new “inventions” like drugs (drug discovery), previously unknown diseases and medical conditions, alternative treatments, etc.
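To make the distinction concrete, the hedged sketch below contrasts descriptive and predictive analytics on a small synthetic patient table; it is purely illustrative and not drawn from the study's data or methods.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Synthetic patient records invented for illustration.
    patients = pd.DataFrame({
        "age": [45, 60, 38, 72, 55, 66, 49, 80],
        "prior_admissions": [0, 2, 0, 3, 1, 2, 0, 4],
        "readmitted": [0, 1, 0, 1, 0, 1, 0, 1],
    })

    # Descriptive: summarize what happened in past hospitalizations.
    print(patients.groupby("readmitted")[["age", "prior_admissions"]].mean())

    # Predictive: fit on historical records, then extrapolate to a new patient.
    X = patients[["age", "prior_admissions"]].values
    model = LogisticRegression().fit(X, patients["readmitted"])
    print(model.predict_proba([[58, 1]])[0, 1])  # estimated readmission risk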

Although the models and tools used in descriptive, predictive, prescriptive, and discovery analytics are different, many applications involve all four of them [ 62 ]. Big Data Analytics in healthcare can help enable personalized medicine by identifying optimal patient-specific treatments. This can influence the improvement of life standards, reduce waste of healthcare resources and save costs of healthcare [ 56 , 63 , 71 ]. The introduction of large data analysis gives new analytical possibilities in terms of scope, flexibility and visualization. Techniques such as data mining (computational pattern discovery process in large data sets) facilitate inductive reasoning and analysis of exploratory data, enabling scientists to identify data patterns that are independent of specific hypotheses. As a result, predictive analysis and real-time analysis becomes possible, making it easier for medical staff to start early treatments and reduce potential morbidity and mortality. In addition, document analysis, statistical modeling, discovering patterns and topics in document collections and data in the EHR, as well as an inductive approach can help identify and discover relationships between health phenomena.

Advanced analytical techniques can be used for a large amount of existing (but not yet analytical) data on patient health and related medical data to achieve a better understanding of the information and results obtained, as well as to design optimal clinical pathways [ 62 ]. Big Data Analytics in healthcare integrates analysis of several scientific areas such as bioinformatics, medical imaging, sensor informatics, medical informatics and health informatics [ 65 ]. Big Data Analytics in healthcare allows to analyze large datasets from thousands of patients, identifying clusters and correlation between datasets, as well as developing predictive models using data mining techniques [ 65 ]. Discussing all the techniques used for Big Data Analytics goes beyond the scope of a single article [ 25 ].

The success and accuracy of Big Data analysis depend heavily on the tools and techniques used and on their ability to provide reliable, up-to-date and meaningful information to various stakeholders [12]. It is believed that the implementation of Big Data Analytics by healthcare organizations could bring many benefits in the upcoming years, including lowering healthcare costs, better diagnosis and prediction of diseases and their spread, improving patient care and developing protocols to prevent re-hospitalization, optimizing staff and equipment, forecasting the need for hospital beds, operating rooms and treatments, and improving the drug supply chain [71].

Challenges and potential benefits of using Big Data Analytics in healthcare

Modern analytics makes it possible not only to gain insight into historical data, but also to generate the information needed to anticipate what may happen in the future, including predictions that support evidence-based actions. The emphasis on reform has prompted payers and providers to pursue data analysis to reduce risk, detect fraud, improve efficiency and save lives. Everyone, whether payers, providers or patients, is focusing on doing more with fewer resources. Thus, some of the areas in which enhanced data and analytics can yield the greatest results involve various healthcare stakeholders (Table 1).

Healthcare organizations see the opportunity to grow through investments in Big Data Analytics. In recent years, by collecting patients’ medical data, converting them into Big Data and applying appropriate algorithms, reliable information has been generated that helps patients, physicians and stakeholders in the health sector to identify value and opportunities [31]. It is worth noting that there are many changes and challenges in the structure of the healthcare sector. Digitization and the effective use of Big Data in healthcare can bring benefits to every stakeholder in this sector: an individual doctor stands to benefit just as much as the healthcare system as a whole. The potential opportunities to achieve benefits and effects from Big Data in healthcare can be divided into four groups [8]:

Improving the quality of healthcare services:

assessment of doctors’ diagnoses and of the treatments they indicate, based on decision support systems working on Big Data collections,

detection of medically more effective and more cost-effective ways to diagnose and treat patients,

analysis of large volumes of data to derive practical information useful for identifying needs, introducing new health services, and preventing and overcoming crises,

prediction of the incidence of diseases,

detecting trends that lead to improvements in the health and lifestyle of society,

analysis of the human genome for the introduction of personalized treatment.

Supporting the work of medical personnel:

doctors’ comparison of current medical cases with cases from the past for better diagnosis and treatment adjustment,

detection of diseases at earlier stages, when they can be cured more easily and quickly,

detecting epidemiological risks and improving control of pathogenic hotspots and reaction times,

identification of patients predicted to be at the highest risk of specific, life-threatening diseases, by collating data on the history of the most common diseases of treated patients with reports submitted to insurance companies,

health management of each patient individually (personalized medicine) and health management of the whole society,

capturing and analyzing large amounts of data from hospitals and homes in real time via life-monitoring devices, in order to monitor safety and predict adverse events,

analysis of patient profiles to identify people to whom prevention, a lifestyle change or a preventive care approach should be applied,

the ability to predict the occurrence of specific diseases or the deterioration of patients’ health,

predicting disease progression and its determinants, estimating the risk of complications,

detecting drug interactions and their side effects.

Supporting scientific and research activity:

supporting work on new drugs and clinical trials thanks to the possibility of analyzing “all data” instead of selecting a test sample,

the ability to identify patients with specific, biological features that will take part in specialized clinical trials,

selecting a group of patients for whom the tested drug is likely to have the desired effect and no side effects,

using modeling and predictive analysis to design better drugs and devices.

Business and management:

reduction of costs and counteracting abuse and fraudulent practices,

faster and more effective identification of incorrect or unauthorized financial operations in order to prevent abuse and eliminate errors,

increase in profitability by detecting patients generating high costs or identifying doctors whose work, procedures and treatment methods cost the most and offering them solutions that reduce the amount of money spent,

identification of unnecessary medical activities and procedures, e.g. duplicate tests.

According to research conducted by Wang, Kung and Byrd, Big Data Analytics benefits can be classified into five categories [73]:

IT infrastructure benefits: reducing system redundancy, avoiding unnecessary IT costs, transferring data quickly among healthcare IT systems, better use of healthcare systems, processing standardization among various healthcare IT systems, reducing IT maintenance costs regarding data storage.

Operational benefits: improving the quality and accuracy of clinical decisions, processing a large number of health records in seconds, reducing the time of patient travel, immediate access to clinical data for analysis, shortening the time of diagnostic tests, reductions in surgery-related hospitalizations, exploring previously inconceivable new research avenues.

Organizational benefits: detecting interoperability problems much more quickly than traditional manual methods, improving cross-functional communication and collaboration among administrative staff, researchers, clinicians and IT staff, enabling data sharing with other institutions and adding new services, content sources and research partners.

Managerial benefits: gaining quick insights about changing healthcare trends in the market, providing members of the board and heads of department with sound decision-support information on the daily clinical setting, optimizing business growth-related decisions.

Strategic benefits: providing a big-picture view of treatment delivery to meet future needs, creating highly competitive healthcare services.

The above specification does not constitute a full list of the potential areas of use of Big Data analysis in healthcare, because the possibilities of applying analysis are practically unlimited. In addition, advanced analytical tools make it possible to analyze data from all possible sources and to conduct cross-analyses that provide better data insights [26]. For example, a cross-analysis can combine patient characteristics with costs and care results to help identify the medically best and most cost-effective treatment or treatments, which may allow a better adjustment of the service provider’s offer [62].

In turn, the analysis of patient profiles (e.g. segmentation and predictive modeling) allows the identification of people who should be subject to prophylaxis or prevention, or who should change their lifestyle [8]. A condensed list of the benefits of Big Data Analytics in healthcare is presented in [3] and consists of: better performance, day-to-day guidance, detection of diseases at early stages, predictive analytics, cost effectiveness, evidence-based medicine and effectiveness in patient treatment.

Summarizing, healthcare Big Data represents huge potential for the transformation of healthcare: improvement of patients’ outcomes, prediction of epidemic outbreaks, valuable insights, avoidance of preventable diseases, reduction of the cost of healthcare delivery and improvement of the quality of life in general [1]. Big Data also generates many challenges, such as difficulties in data capture, data storage, data analysis and data visualization [15]. The main challenges concern:

data structure: Big Data should be user-friendly, transparent and menu-driven, but it is fragmented, dispersed, rarely standardized and difficult to aggregate and analyze,

security: data security, privacy and the sensitivity of healthcare data raise significant concerns related to confidentiality,

data standardization: data is stored in formats that are not compatible with all applications and technologies,

storage and transfers: especially the costs associated with securing, storing and transferring unstructured data,

managerial skills: such as data governance and a lack of appropriate analytical skills,

Real-Time Analytics: healthcare needs to be able to utilize Big Data in real time [4, 34, 41].

The research is based on a critical analysis of the literature, as well as the presentation of selected results of direct research on the use of Big Data Analytics in medical facilities in Poland.

The results presented here are part of a larger questionnaire study on Big Data Analytics. The direct research was based on an interview questionnaire which contained 100 questions rated on a 5-point Likert scale (1—strongly disagree, 2—rather disagree, 3—neither agree nor disagree, 4—rather agree, 5—strongly agree) and 4 metric questions. The study was conducted in December 2018 on a sample of 217 medical facilities (110 private, 107 public). The research was conducted by a specialized market research agency: the Center for Research and Expertise of the University of Economics in Katowice.

The selected entities included entities financed from public sources, i.e. the National Health Fund (23.5%), and entities operating commercially (11.5%). More than half of the surveyed group (64.9%) are hybrid financed, both from public and commercial sources. The diversity of the research sample also applies to the size of the entities, defined by the number of employees. Taking into account the proportions of the surveyed entities, it should be noted that medium-sized (10–50 employees, 34% of the sample) and large (51–250 employees, 27%) entities dominate the sector structure. The research was of an all-Poland nature, and the entities included in the research sample come from all voivodships. The largest groups were entities from the Łódzkie (32%), Śląskie (18%) and Mazowieckie (18%) voivodships, as these voivodships have the largest number of medical institutions. Other regions of the country were represented by single units. The selection of the research sample was random and stratified: within a database of medical facilities, groups of private and public medical facilities were identified, and the facilities to which the questionnaire was targeted were drawn from each of these groups. The analyses were performed using the GNU PSPP 0.10.2 software.

The aim of the study was to determine whether medical facilities in Poland use Big Data Analytics and, if so, in which areas. The characteristics of the research sample are presented in Table 2.

The research is non-exhaustive due to the incomplete and uneven regional distribution of the sample, which is overrepresented in three voivodships (Łódzkie, Mazowieckie and Śląskie). The size of the research sample (217 entities) nevertheless allows the authors to formulate specific conclusions on the use of Big Data in the management of medical facilities.

For the purpose of this paper, the following research hypotheses were formulated: (1) medical facilities in Poland are working on both structured and unstructured data; (2) medical facilities in Poland are moving towards data-based healthcare and its benefits.

The paper poses the following research questions and statements that coincide with the selected questions from the research questionnaire:

From what sources do medical facilities obtain data? What types of data are used by the particular organization, whether structured or unstructured, and to what extent?

In which areas (clinical or business) do organizations use data and analytical systems?

Is data analytics performed based on historical data or are predictive analyses also performed?

Do administrative and medical staff receive complete, accurate and reliable data in a timely manner?

Are real-time analyses performed to support the particular organization’s activities?

Results and discussion

On the basis of the literature analysis and the research study, a set of questions and statements related to the researched area was formulated. The results of the survey show that medical facilities use a variety of data sources in their operations, providing both structured and unstructured data (Table 3).

According to the data provided by the respondents, considering the first statement in the questionnaire, almost half of the medical institutions (47.58%) rather agreed that they collect and use structured data (e.g. databases and data warehouses, reports to external entities), and 10.57% entirely agreed with this statement. As much as 23.35% of the representatives of medical institutions neither agreed nor disagreed. The remaining medical facilities rather disagreed (7.93%) or strongly disagreed (6.17%) with the first statement. The median calculated from the obtained results (median: 4) also indicates that medical facilities in Poland collect and use structured data (Table 4).

In turn, 28.19% of the medical institutions rather agreed that they collect and use unstructured data, and 9.25% entirely agreed with this statement. The share of representatives of medical institutions who neither agreed nor disagreed was 27.31%. The remaining medical facilities rather disagreed (17.18%) or strongly disagreed (13.66%) that they collect and use unstructured data. In the case of unstructured data the median is 3, which means that the collection and use of this type of data by medical facilities in Poland is lower.
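
As a minimal sketch of how such 5-point Likert items can be summarized (the response counts below are synthetic and only shaped like the percentages above; the real distributions are in Tables 3 and 4):

    # Summarize a 5-point Likert item: median and share of agreement.
    # Response counts are synthetic placeholders, not the study's data.
    import numpy as np

    responses = np.array([1]*14 + [2]*18 + [3]*53 + [4]*108 + [5]*24)

    print("median:", np.median(responses))
    print("agree or strongly agree: "
          f"{np.mean(responses >= 4) * 100:.1f}%")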

In the further part of the analysis, it was checked whether the size of the medical facility and its form of ownership have an impact on whether it collects and analyzes structured and unstructured data (Tables 4 and 5). To find this out, correlation coefficients were calculated.

Based on the calculations, it can be concluded that there is a weak but statistically significant monotonic correlation between the size of the medical facility and its collection and use of structured data (p < 0.001; τ = 0.16). This means that the use of structured data increases slightly with the size of the medical facility. The size of the medical facility matters more for the use of unstructured data (p < 0.001; τ = 0.23) (Table 4).
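
The reported relationship is an ordinal-by-ordinal association, for which Kendall’s tau is a natural choice. A sketch of such a computation on synthetic data (the size coding and the response-generating rule are assumptions for illustration, not the study’s data):

    # Kendall's tau between an ordinal facility-size code and a 1-5
    # Likert response. All values are synthetic.
    import numpy as np
    from scipy.stats import kendalltau

    rng = np.random.default_rng(2)
    size_code = rng.integers(1, 5, 217)  # 1=smallest ... 4=largest
    # Likert answers drifting weakly upward with facility size.
    likert = np.clip(np.round(2 + 0.4 * size_code + rng.normal(0, 1, 217)), 1, 5)

    tau, p = kendalltau(size_code, likert)
    print(f"tau = {tau:.2f}, p = {p:.4f}")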

To determine whether the form of medical facility ownership affects data collection, the Mann–Whitney U test was used. The calculations show that the form of ownership does not affect what data the organization collects and uses (Table 5 ).
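
A comparison like this one, between two independent groups on an ordinal outcome, could be reproduced along the following lines (synthetic responses; only the group sizes mirror the 107 public and 110 private facilities in the sample):

    # Mann-Whitney U test: do public and private facilities differ in
    # their Likert responses? Responses here are synthetic.
    import numpy as np
    from scipy.stats import mannwhitneyu

    rng = np.random.default_rng(3)
    public = rng.integers(1, 6, 107)   # 1-5 Likert, public facilities
    private = rng.integers(1, 6, 110)  # 1-5 Likert, private facilities

    u, p = mannwhitneyu(public, private, alternative="two-sided")
    # Both groups are drawn alike here, so p should typically exceed 0.05,
    # i.e. no detectable ownership effect.
    print(f"U = {u}, p = {p:.3f}")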

Detailed information on the sources from which medical facilities collect and use data is presented in Table 6.

The questionnaire results show that medical facilities mainly use information published in databases, reports to external units and transaction data, but they also use unstructured data from e-mails, medical devices, sensors, phone calls, and audio and video recordings (Table 6). Data from social media, RFID and geolocation data are used only to a small extent. Similar findings are reported in the literature.

The analysis of the answers shows that more than half of the medical facilities have an integrated hospital information system (HIS) implemented: 43.61% use such a system and 16.30% use it extensively (Table 7), while 19.38% of the examined medical facilities do not use one at all. Moreover, most of the examined medical facilities keep medical documentation in electronic form (34.80% use it, 32.16% use it extensively), which creates an opportunity to use data analytics. Only 4.85% of medical facilities do not use it at all.

Other issues that needed to be investigated were whether medical facilities in Poland use data analytics and, if so, in what form and in what areas (Table 8). The analysis of the answers given by the respondents about the potential of data analytics in medical facilities shows that a similar number of medical facilities use data analytics in administration and business (31.72% agreed with statement no. 5 and 12.33% strongly agreed) as in the clinical area (33.04% agreed with statement no. 6 and 12.33% strongly agreed). When considering decision-making issues, 35.24% agree with the statement “the organization uses data and analytical systems to support business decisions” and 8.37% of respondents strongly agree. As much as 40.09% agree with the statement that “the organization uses data and analytical systems to support clinical decisions (in the field of diagnostics and therapy)” and 15.42% of respondents strongly agree. The examined medical facilities use in their activity analytics based both on historical data (33.48% agree with statement 7 and 12.78% strongly agree) and predictive analytics (33.04% agree with statement 8 and 15.86% strongly agree). Detailed results are presented in Table 8.

Medical facilities focus on development in the field of data processing: they confirm that they conduct analytical planning processes systematically and analyze new opportunities for the strategic use of analytics in business and clinical activities (38.33% rather agree and 10.57% strongly agree with this statement). The situation is different with real-time data analysis, where the picture is less optimistic: only 28.19% rather agree and 14.10% strongly agree with the statement that real-time analyses are performed to support the organization’s activities.

When considering whether a facility’s use of analytics in the clinical area depends on the form of ownership, the mean values and the Mann–Whitney U test indicate that it does. A higher degree of use of analyses in the clinical area can be observed in public institutions.

Whether a medical facility performs descriptive or predictive analyses does not depend on the form of ownership (p > 0.05), although the mean and median are higher in public facilities than in private ones. What is more, the Mann–Whitney U test shows that these variables are dependent on each other (p < 0.05) (Table 9).

When considering whether a facility’s use of analytics in the clinical area depends on its size, Kendall’s tau (τ) indicates that it does (p < 0.001; τ = 0.22); the correlation is weak but statistically significant. This means that the use of data and analytical systems to support clinical decisions (in the field of diagnostics and therapy) increases with the size of the medical facility. A similar, though even weaker, relationship can be found in the use of descriptive and predictive analyses (Table 10).

Considering the results in the area of analytical maturity, 8.81% of medical facilities stated that they are at the first level of maturity, i.e. the organization has not developed analytical skills and does not perform analyses. As much as 13.66% of medical facilities confirmed that they have poor analytical skills, while 38.33% of the medical facilities located themselves at level 3, meaning that “there is a lot to do in analytics”. On the other hand, 28.19% believe that their analytical capabilities are well developed and 6.61% stated that analytics are at the highest level and their analytical capabilities are very well developed. Detailed data is presented in Table 11. The average is 3.11 and the median is 3.

The results of the research enabled the formulation of the following conclusions. Medical facilities in Poland are working on both structured and unstructured data. This data comes from databases, transactions, the unstructured content of e-mails and documents, and from devices and sensors. In their activities, medical facilities use analytics in the administrative and business areas as well as in the clinical area. Also, the decisions made are largely data-driven.

In summary, analysis of the literature shows that the benefits that medical facilities can obtain by using Big Data Analytics in their activities relate primarily to patients, physicians and the facilities themselves. It can be confirmed that patients will be better informed, will receive treatments that work for them, and will have prescribed medications that work for them rather than being given unnecessary medications [78]. Physicians’ roles will likely change to that of a consultant more than a decision maker: they will advise, warn and help individual patients, and will have more time to form positive and lasting relationships with their patients. Medical facilities will see changes as well, for example fewer unnecessary hospitalizations, resulting initially in less revenue but, after the market adjusts, also in long-term accomplishments [78]. The use of Big Data Analytics can literally revolutionize the way healthcare is practiced, for better health and disease reduction.

The analysis of the latest data reveals that data analytics increases the accuracy of diagnoses: physicians can use predictive algorithms to help them make more accurate diagnoses [45]. Moreover, it can be helpful in preventive medicine and public health, because with early intervention many diseases can be prevented or ameliorated [29]. Predictive analytics also makes it possible to identify risk factors for a given patient, and with this knowledge patients will be able to change their lifestyles, which in turn may dramatically change population disease patterns and result in savings in medical costs. Moreover, personalized medicine is the best solution for an individual patient seeking treatment, as it can help doctors decide on the exact treatments for those individuals. Better diagnoses and more targeted treatments will naturally lead to more good outcomes and fewer resources used, including doctors’ time.

The quantitative analysis carried out and presented in this article made it possible to determine whether medical facilities in Poland use Big Data Analytics and, if so, in which areas. The results obtained made it possible to formulate the following conclusions. Medical facilities work on both structured and unstructured data, which comes from databases, transactions, the unstructured content of e-mails and documents, and from devices and sensors. They use analytics in the administrative and business areas as well as in the clinical area, and the decisions made are largely data-driven. The results of the study confirm what has been analyzed in the literature: medical facilities are moving towards data-based healthcare and its benefits.

In conclusion, Big Data Analytics has the potential for positive impact and global implications in healthcare. Future research on the use of Big Data in medical facilities will concern the strategies adopted by medical facilities to promote and implement such solutions, the benefits they gain from the use of Big Data analysis, and how they see the prospects in this area.

Practical implications

This work sought to narrow the gap that exists in analyzing the possibility of using Big Data Analytics in healthcare. Showing how medical facilities in Poland perform in this respect contributes to the global research carried out in this area, including [29, 32, 60].

Limitations and future directions

The research described in this article does not fully exhaust the questions related to the use of Big Data Analytics in Polish healthcare facilities. Only some of the dimensions characterizing the use of data by medical facilities in Poland have been examined. To get the full picture, it would be necessary to examine the results of using structured and unstructured data analytics in healthcare. Future research may examine the benefits that medical institutions achieve as a result of the analysis of structured and unstructured data in the clinical and management areas, and the limitations they encounter in these areas. For this purpose, it is planned to conduct in-depth interviews with selected medical facilities in Poland; these facilities could provide additional data for empirical analyses based on their suggestions. Further research should also include medical institutions from beyond the borders of Poland, enabling international comparative analyses.

Future research in the healthcare field has virtually endless possibilities. It may concern the use of Big Data Analytics to diagnose specific conditions [47, 66, 69, 76], propose approaches that can be transferred to other healthcare applications, and create mechanisms to identify “patients like me” [75, 80]. Big Data Analytics could also be used for studies related to the spread of pandemics and the efficacy of COVID-19 treatment [18, 79], or for psychology and psychiatry studies, e.g. emotion recognition [35].

Availability of data and materials

The datasets for this study are available on request to the corresponding author.

Abouelmehdi K, Beni-Hessane A, Khaloufi H. Big healthcare data: preserving security and privacy. J Big Data. 2018. https://doi.org/10.1186/s40537-017-0110-7 .

Agrawal A, Choudhary A. Health services data: big data analytics for deriving predictive healthcare insights. Health Serv Eval. 2019. https://doi.org/10.1007/978-1-4899-7673-4_2-1 .

Al Mayahi S, Al-Badi A, Tarhini A. Exploring the potential benefits of big data analytics in providing smart healthcare. In: Miraz MH, Excell P, Ware A, Ali M, Soomro S, editors. Emerging technologies in computing—first international conference, iCETiC 2018, proceedings (Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, LNICST). Cham: Springer; 2018. p. 247–58. https://doi.org/10.1007/978-3-319-95450-9_21 .

Bainbridge M. Big data challenges for clinical and precision medicine. In: Househ M, Kushniruk A, Borycki E, editors. Big data, big challenges: a healthcare perspective: background, issues, solutions and research directions. Cham: Springer; 2019. p. 17–31.

Bartuś K, Batko K, Lorek P. Business intelligence systems: barriers during implementation. In: Jabłoński M, editor. Strategic performance management new concept and contemporary trends. New York: Nova Science Publishers; 2017. p. 299–327. ISBN: 978-1-53612-681-5.

Bartuś K, Batko K, Lorek P. Diagnoza wykorzystania big data w organizacjach-wybrane wyniki badań. Informatyka Ekonomiczna. 2017;3(45):9–20.

Bartuś K, Batko K, Lorek P. Wykorzystanie rozwiązań business intelligence, competitive intelligence i big data w przedsiębiorstwach województwa śląskiego. Przegląd Organizacji. 2018;2:33–9.

Batko K. Możliwości wykorzystania Big Data w ochronie zdrowia. Roczniki Kolegium Analiz Ekonomicznych. 2016;42:267–82.

Bi Z, Cochran D. Big data analytics with applications. J Manag Anal. 2014;1(4):249–65. https://doi.org/10.1080/23270012.2014.992985 .

Boerma T, Requejo J, Victora CG, Amouzou A, Asha G, Agyepong I, Borghi J. Countdown to 2030: tracking progress towards universal coverage for reproductive, maternal, newborn, and child health. Lancet. 2018;391(10129):1538–48.

Bollier D, Firestone CM. The promise and peril of big data. Washington, D.C: Aspen Institute, Communications and Society Program; 2010. p. 1–66.

Bose R. Competitive intelligence process and tools for intelligence analysis. Ind Manag Data Syst. 2008;108(4):510–28.

Carter P. Big data analytics: future architectures, skills and roadmaps for the CIO: in white paper, IDC sponsored by SAS. 2011. p. 1–16.

Castro EM, Van Regenmortel T, Vanhaecht K, Sermeus W, Van Hecke A. Patient empowerment, patient participation and patient-centeredness in hospital care: a concept analysis based on a literature review. Patient Educ Couns. 2016;99(12):1923–39.

Chen H, Chiang RH, Storey VC. Business intelligence and analytics: from big data to big impact. MIS Q. 2012;36(4):1165–88.

Chen CP, Zhang CY. Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf Sci. 2014;275:314–47.

Chomiak-Orsa I, Mrozek B. Główne perspektywy wykorzystania big data w mediach społecznościowych. Informatyka Ekonomiczna. 2017;3(45):44–54.

Corsi A, de Souza FF, Pagani RN, et al. Big data analytics as a tool for fighting pandemics: a systematic review of literature. J Ambient Intell Hum Comput. 2021;12:9163–80. https://doi.org/10.1007/s12652-020-02617-4 .

Davenport TH, Harris JG. Competing on analytics, the new science of winning. Boston: Harvard Business School Publishing Corporation; 2007.

Davenport TH. Big data at work: dispelling the myths, uncovering the opportunities. Boston: Harvard Business School Publishing; 2014.

De Cnudde S, Martens D. Loyal to your city? A data mining analysis of a public service loyalty program. Decis Support Syst. 2015;73:74–84.

Erickson S, Rothberg H. Data, information, and intelligence. In: Rodriguez E, editor. The analytics process. Boca Raton: Auerbach Publications; 2017. p. 111–26.

Fang H, Zhang Z, Wang CJ, Daneshmand M, Wang C, Wang H. A survey of big data research. IEEE Netw. 2015;29(5):6–9.

Fredriksson C. Organizational knowledge creation with big data. A case study of the concept and practical use of big data in a local government context. 2016. https://www.abo.fi/fakultet/media/22103/fredriksson.pdf .

Gandomi A, Haider M. Beyond the hype: big data concepts, methods, and analytics. Int J Inf Manag. 2015;35(2):137–44.

Groves P, Kayyali B, Knott D, Van Kuiken S. The ‘big data’ revolution in healthcare. Accelerating value and innovation. 2015. http://www.pharmatalents.es/assets/files/Big_Data_Revolution.pdf (Reading: 10.04.2019).

Gupta V, Rathmore N. Deriving business intelligence from unstructured data. Int J Inf Comput Technol. 2013;3(9):971–6.

Gupta V, Singh VK, Ghose U, Mukhija P. A quantitative and text-based characterization of big data research. J Intell Fuzzy Syst. 2019;36:4659–75.

Hampel H, O’Bryant SE, Castrillo JI, Ritchie C, Rojkova K, Broich K, Escott-Price V. PRECISION MEDICINE-the golden gate for detection, treatment and prevention of Alzheimer’s disease. J Prev Alzheimer’s Dis. 2016;3(4):243.

Harerimana GB, Jang J, Kim W, Park HK. Health big data analytics: a technology survey. IEEE Access. 2018;6:65661–78. https://doi.org/10.1109/ACCESS.2018.2878254 .

Hu H, Wen Y, Chua TS, Li X. Toward scalable systems for big data analytics: a technology tutorial. IEEE Access. 2014;2:652–87.

Hussain S, Hussain M, Afzal M, Hussain J, Bang J, Seung H, Lee S. Semantic preservation of standardized healthcare documents in big data. Int J Med Inform. 2019;129:133–45. https://doi.org/10.1016/j.ijmedinf.2019.05.024 .

Islam MS, Hasan MM, Wang X, Germack H. A systematic review on healthcare analytics: application and theoretical perspective of data mining. In: Healthcare. Basel: Multidisciplinary Digital Publishing Institute; 2018. p. 54.

Ismail A, Shehab A, El-Henawy IM. Healthcare analysis in smart big data analytics: reviews, challenges and recommendations. In: Security in smart cities: models, applications, and challenges. Cham: Springer; 2019. p. 27–45.

Jain N, Gupta V, Shubham S, et al. Understanding cartoon emotion using integrated deep neural network on large dataset. Neural Comput Appl. 2021. https://doi.org/10.1007/s00521-021-06003-9 .

Janssen M, van der Voort H, Wahyudi A. Factors influencing big data decision-making quality. J Bus Res. 2017;70:338–45.

Jordan SR. Beneficence and the expert bureaucracy. Public Integr. 2014;16(4):375–94. https://doi.org/10.2753/PIN1099-9922160404 .

Knapp MM. Big data. J Electron Resourc Med Libr. 2013;10(4):215–22.

Koti MS, Alamma BH. Predictive analytics techniques using big data for healthcare databases. In: Smart intelligent computing and applications. New York: Springer; 2019. p. 679–86.

Krumholz HM. Big data and new knowledge in medicine: the thinking, training, and tools needed for a learning health system. Health Aff. 2014;33(7):1163–70.

Kruse CS, Goswamy R, Raval YJ, Marawi S. Challenges and opportunities of big data in healthcare: a systematic review. JMIR Med Inform. 2016;4(4):e38.

Kyoungyoung J, Gang HK. Potentiality of big data in the medical sector: focus on how to reshape the healthcare system. Healthc Inform Res. 2013;19(2):79–85.

Laney D. Application delivery strategies 2011. http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf .

Lee IK, Wang CC, Lin MC, Kung CT, Lan KC, Lee CT. Effective strategies to prevent coronavirus disease-2019 (COVID-19) outbreak in hospital. J Hosp Infect. 2020;105(1):102.

Lerner I, Veil R, Nguyen DP, Luu VP, Jantzen R. Revolution in health care: how will data science impact doctor-patient relationships? Front Public Health. 2018;6:99.

Lytras MD, Papadopoulou P, editors. Applying big data analytics in bioinformatics and medicine. IGI Global: Hershey; 2017.

Ma K, et al. Big data in multiple sclerosis: development of a web-based longitudinal study viewer in an imaging informatics-based eFolder system for complex data analysis and management. In: Proceedings volume 9418, medical imaging 2015: PACS and imaging informatics: next generation and innovations. 2015. p. 941809. https://doi.org/10.1117/12.2082650 .

Mach-Król M. Analiza i strategia big data w organizacjach. In: Studia i Materiały Polskiego Stowarzyszenia Zarządzania Wiedzą. 2015;74:43–55.

Madsen LB. Data-driven healthcare: how analytics and BI are transforming the industry. Hoboken: Wiley; 2014.

Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Hung BA. Big data: the next frontier for innovation, competition, and productivity. Washington: McKinsey Global Institute; 2011.

Marconi K, Dobra M, Thompson C. The use of big data in healthcare. In: Liebowitz J, editor. Big data and business analytics. Boca Raton: CRC Press; 2012. p. 229–48.

Mehta N, Pandit A. Concurrence of big data analytics and healthcare: a systematic review. Int J Med Inform. 2018;114:57–65.

Michel M, Lupton D. Toward a manifesto for the ‘public understanding of big data.’ Public Underst Sci. 2016;25(1):104–16. https://doi.org/10.1177/0963662515609005 .

Mikalef P, Krogstie J. Big data analytics as an enabler of process innovation capabilities: a configurational approach. In: International conference on business process management. Cham: Springer; 2018. p. 426–41.

Mohammadi M, Al-Fuqaha A, Sorour S, Guizani M. Deep learning for IoT big data and streaming analytics: a survey. IEEE Commun Surv Tutor. 2018;20(4):2923–60.

Nambiar R, Bhardwaj R, Sethi A, Vargheese R. A look at challenges and opportunities of big data analytics in healthcare. In: 2013 IEEE international conference on big data; 2013. p. 17–22.

Ohlhorst F. Big data analytics: turning big data into big money, vol. 65. Hoboken: Wiley; 2012.

Olszak C, Mach-Król M. A conceptual framework for assessing an organization’s readiness to adopt big data. Sustainability. 2018;10(10):3734.

Olszak CM. Toward better understanding and use of business intelligence in organizations. Inf Syst Manag. 2016;33(2):105–23.

Palanisamy V, Thirunavukarasu R. Implications of big data analytics in developing healthcare frameworks—a review. J King Saud Univ Comput Inf Sci. 2017;31(4):415–25.

Provost F, Fawcett T. Data science and its relationship to big data and data-driven decisionmaking. Big Data. 2013;1(1):51–9.

Raghupathi W, Raghupathi V. An overview of health analytics. J Health Med Inform. 2013;4:132. https://doi.org/10.4172/2157-7420.1000132 .

Raghupathi W, Raghupathi V. Big data analytics in healthcare: promise and potential. Health Inf Sci Syst. 2014;2(1):3.

Ratia M, Myllärniemi J. Beyond IC 4.0: the future potential of BI-tool utilization in private healthcare. In: Proceedings IFKAD 2018, Delft, The Netherlands; 2018.

Ristevski B, Chen M. Big data analytics in medicine and healthcare. J Integr Bioinform. 2018. https://doi.org/10.1515/jib-2017-0030 .

Rumsfeld JS, Joynt KE, Maddox TM. Big data analytics to improve cardiovascular care: promise and challenges. Nat Rev Cardiol. 2016;13(6):350–9. https://doi.org/10.1038/nrcardio.2016.42 .

Schmarzo B. Big data: understanding how data powers big business. Indianapolis: Wiley; 2013.

Senthilkumar SA, Rai BK, Meshram AA, Gunasekaran A, Chandrakumarmangalam S. Big data in healthcare management: a review of literature. Am J Theor Appl Bus. 2018;4:57–69.

Shubham S, Jain N, Gupta V, et al. Identify glomeruli in human kidney tissue images using a deep learning approach. Soft Comput. 2021. https://doi.org/10.1007/s00500-021-06143-z .

Thuemmler C. The case for health 4.0. In: Thuemmler C, Bai C, editors. Health 4.0: how virtualization and big data are revolutionizing healthcare. New York: Springer; 2017.

Tsai CW, Lai CF, Chao HC, et al. Big data analytics: a survey. J Big Data. 2015;2:21. https://doi.org/10.1186/s40537-015-0030-3 .

Wamba SF, Gunasekaran A, Akter S, Ji-fan RS, Dubey R, Childe SJ. Big data analytics and firm performance: effects of dynamic capabilities. J Bus Res. 2017;70:356–65.

Wang Y, Byrd TA. Business analytics-enabled decision-making effectiveness through knowledge absorptive capacity in health care. J Knowl Manag. 2017;21(3):517–39.

Wang Y, Kung L, Wang W, Yu C, Cegielski CG. An integrated big data analytics-enabled transformation model: application to healthcare. Inf Manag. 2018;55(1):64–79.

Wicks P, et al. Scaling PatientsLikeMe via a “generalized platform” for members with chronic illness: web-based survey study of benefits arising. J Med Internet Res. 2018;20(5):e175.

Willems SM, et al. The potential use of big data in oncology. Oral Oncol. 2019;98:8–12. https://doi.org/10.1016/j.oraloncology.2019.09.003 .

Williams N, Ferdinand NP, Croft R. Project management maturity in the age of big data. Int J Manag Proj Bus. 2014;7(2):311–7.

Winters-Miner LA. Seven ways predictive analytics can improve healthcare. Medical predictive analytics have the potential to revolutionize healthcare around the world. 2014. https://www.elsevier.com/connect/seven-ways-predictive-analytics-can-improve-healthcare (Reading: 15.04.2019).

Wu J, et al. Application of big data technology for COVID-19 prevention and control in China: lessons and recommendations. J Med Internet Res. 2020;22(10): e21980.

Yan L, Peng J, Tan Y. Network dynamics: how can we find patients like us? Inf Syst Res. 2015;26(3):496–512.

Yang JJ, Li J, Mulder J, Wang Y, Chen S, Wu H, Pan H. Emerging information technologies for enhanced healthcare. Comput Ind. 2015;69:3–11.

Zhang Q, Yang LT, Chen Z, Li P. A survey on deep learning for big data. Inf Fusion. 2018;42:146–57.

Acknowledgements

We would like to thank all those who have influenced our scientific paths.

This research was fully funded as statutory activity (a subsidy of the Ministry of Science and Higher Education granted to the Częstochowa University of Technology for maintaining research potential in 2018). Research Number: BS/PB–622/3020/2014/P. The publication fee for the paper was financed by the University of Economics in Katowice.

Author information

Authors and affiliations

Department of Business Informatics, University of Economics in Katowice, Katowice, Poland

Kornelia Batko

Department of Biomedical Processes and Systems, Institute of Health and Nutrition Sciences, Częstochowa University of Technology, Częstochowa, Poland

Andrzej Ślęzak

Contributions

KB proposed the concept of the research and its design. The manuscript was prepared by KB in consultation with AŚ, and AŚ reviewed the manuscript and refined its final shape. KB prepared the manuscript with respect to the definition of intellectual content, literature search, data acquisition and data analysis. AŚ obtained the research funding. Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Kornelia Batko .

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no conflict of interest.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article

Batko, K., Ślęzak, A. The use of Big Data Analytics in healthcare. J Big Data 9 , 3 (2022). https://doi.org/10.1186/s40537-021-00553-4

Received : 28 August 2021

Accepted : 19 December 2021

Published : 06 January 2022

DOI : https://doi.org/10.1186/s40537-021-00553-4

Keywords: Big Data Analytics, Data-driven healthcare

Published on 22.2.2024 in Vol 26 (2024)

Living Lab Data of Patient Needs and Expectations for eHealth-Based Cardiac Rehabilitation in Germany and Spain From the TIMELY Study: Cross-Sectional Analysis

Authors of this article:

Original Paper

  • Boris Schmitz 1,2, PhD;
  • Svenja Wirtz 1,2, MSc;
  • Manuela Sestayo-Fernández 3, BSc;
  • Hendrik Schäfer 1,2, MSc;
  • Emma R Douma 4, MSc;
  • Marta Alonso Vazquez 3, MSc;
  • Violeta González-Salvado 5, MD;
  • Mirela Habibovic 4, PhD;
  • Dimitris Gatsios 6, PhD;
  • Willem Johan Kop 4, PhD;
  • Carlos Peña-Gil 5, MD;
  • Frank Mooren 1,2, MD

1 Department of Rehabilitation Sciences, Faculty of Health, University of Witten/Herdecke, Witten, Germany

2 Center for Medical Rehabilitation, DRV Clinic Königsfeld, Ennepetal, Germany

3 Health Research Institute of Santiago de Compostela, Santiago de Compostela, Spain

4 Center of Research on Psychological Disorders and Somatic Diseases, Tilburg University, Tilburg, Netherlands

5 Cardiology and Coronary Care Department, IDIS, CIBER CV, University Hospital of Santiago de Compostela, Santiago de Compostela, Spain

6 Capemed, Ioannina, Greece

Corresponding Author:

Boris Schmitz, PhD

Department of Rehabilitation Sciences

Faculty of Health

University of Witten/Herdecke

Alfred-Herrhausen-Straße 50

Witten, 58455

Phone: 49 23339888 ext 156

Email: [email protected]

Background: The use of eHealth technology in cardiac rehabilitation (CR) is a promising approach to enhance patient outcomes since adherence to healthy lifestyles and risk factor management during phase III CR maintenance is often poorly supported. However, patients’ needs and expectations have not been extensively analyzed to inform the design of such eHealth solutions.

Objective: The goal of this study was to provide a detailed patient perspective on the most important functionalities to include in an eHealth solution to assist them in phase III CR maintenance.

Methods: A guided survey as part of a Living Lab approach was conducted in Germany (n=49) and Spain (n=30) involving women (16/79, 20%) and men (63/79, 80%) with coronary artery disease (mean age 57 years, SD 9 years) participating in a structured center-based CR program. The survey covered patients’ perceived importance of different CR components in general, current usage of technology/technical devices, and helpfulness of the potential features of eHealth in CR. Questionnaires were used to identify personality traits (psychological flexibility, optimism/pessimism, positive/negative affect), potentially predisposing patients to acceptance of an app/monitoring devices.

Results: All the patients in this study owned a smartphone, while 30%-40% used smartwatches and fitness trackers. Patients expressed the need for an eHealth platform that is user-friendly, personalized, and easily accessible, and 71% (56/79) of the patients believed that technology could help them maintain their health goals after CR. Among the offered components, support for regular physical exercise, including updated schedules and progress documentation, was rated the highest. In addition, patients rated the availability of information on diagnosis, current medication, test results, and risk scores as (very) useful. Of note, for each item except smoking cessation, 35%-50% of the patients indicated a high need for support to achieve their long-term health goals, suggesting the need for individualized care. No major differences were detected between Spanish and German patients (all P>.05), and only younger age (P=.03), but not sex, education level, or personality traits (all P>.05), was associated with the acceptance of eHealth components.

Conclusions: The patient perspectives collected in this study indicate high acceptance of personalized user-friendly eHealth platforms with remote monitoring to improve adherence to healthy lifestyles among patients with coronary artery disease during phase III CR maintenance. The identified patient needs comprise support in physical exercise, including regular updates on personalized training recommendations. Availability of diagnoses, laboratory results, and medications, as part of a mobile electronic health record were also rated as very useful.

Trial Registration: ClinicalTrials.gov NCT05461729; https://clinicaltrials.gov/study/NCT05461729

Introduction

The application of eHealth technology in cardiac rehabilitation (CR) is being increasingly adopted to enhance patient outcomes. eHealth, which involves the use of digital health technologies, has the potential to facilitate CR programs to offer better, more efficient, and cost-effective care. CR is a crucial aspect of the recovery process after a cardiac event, aiming to reduce the risk of future events and improve the quality of life of patients [ 1 , 2 ]. The European Society of Cardiology defines CR as a multifactorial intervention with core components in patient assessment, physical activity, diet/nutritional counselling, risk factor control, patient education, psychosocial management, vocational advice, and lifestyle behavior change, including patients’ adherence and self-management [ 3 ]. The CR process is typically divided into 3 stages. During phase I, patients discuss their cardiovascular risk factors and health situation in the acute clinic after a coronary intervention or surgery with their treating physician or a CR nurse. This brief phase lasts only a few days and aims to get patients moving as soon as possible, encouraging mild levels of physical activity [ 4 ]. Phase II, the reconditioning phase, occurs at inpatient or outpatient CR centers or even in the home environment with various levels of support. This multidisciplinary phase includes education on risk factors, supervised exercise training, and psychological support, with the goal of improving patients’ exercise capacity, functional mobility, and self-management skills [ 5 ]. In phase III, also referred to as the maintenance phase, patients continue their care in a community or home-based setting. Phase III is the longest and least structured phase of CR, aiming at lifelong self-care with continuous risk factor management and regular physical activity to maintain the achievements made during phase II [ 4 , 6 ]. However, adherence to a healthy lifestyle, including regular physical activity and risk factor management, during phase III maintenance is challenging and often poorly supported [ 7 , 8 ]. The main reasons for suboptimal adherence to phase III CR include patient-related factors (eg, motivation) and unsustainable costs for lifelong patient support in addition to usual care by general practitioners or cardiologists [ 9 , 10 ]. In addition, patient barriers such as time and travel burden may add to lower adherence and uptake of maintenance programs.

Information and communication technology in the form of eHealth applications has undergone recent developments targeting the reduction of possible barriers to initiating and continuing engagement in CR [11]. The advantages of eHealth include less time investment and fewer constraints due to the absence of travel, the option of continuous monitoring, and the possibility for patients to manage their disease independently [12,13]. The use of eHealth technologies allows for personalization and tailoring of CR programs to individual needs, leading to higher effectiveness and improved outcomes for patients. Furthermore, eHealth applications allow different CR aspects to be targeted independently or in a combined and synergistic manner and may have positive effects on physical activity, medication adherence, mood states, anxiety, and depression in cardiac patients [14]. However, there is no uniform eHealth platform available that combines all aspects of CR for patients with cardiovascular disease over the continuum of care, including phase III maintenance. Although challenging on a technological level, user acceptance and applicability in the day-to-day setting are key for the implementation and success of such a solution. In addition, factors such as technological skills, trustworthiness, and overall individual attitude toward eHealth need to be considered [15-17].

Based on this background, the goal of this study was to provide a detailed description of the patient perspective on the most important aspects to be included in an eHealth solution to assist phase III CR maintenance. This report is part of the multistakeholder project TIMELY, which aims at developing a personalized eHealth platform to assist patients over the continuum of the disease according to recent coronary artery disease (CAD) guidelines [ 18 ]. TIMELY employs artificial intelligence–powered CR components in a patient app connected with a patient management platform and decision support tools for case managers and clinicians. Additionally, artificial intelligence–powered conversational agents (chatbots) will be provided to engage in motivational conversations with patients based on behavior change techniques with the goal of optimizing program and exercise adherence. The development of the TIMELY eHealth solution is guided by a Living Lab approach that allows researchers to co-design innovations such as TIMELY with patients in a real-life context to increase acceptance [ 19 ]. Multiple feedback loops are included at pivotal developing stages, incorporating patients and clinicians in a modified Delphi approach [ 20 , 21 ]. Within the TIMELY prospective study, patients are equipped with different devices as part of the envisioned solution, including a long-term 3-channel electrocardiogram (ECG) patch, a hemodynamic monitor for blood pressure measurement and pulse wave analysis, and a wrist-worn activity tracker. This report describes patients’ needs and expectations for eHealth-based CR collected within the TIMELY Living Lab in CR centers from Germany and Spain.

Approach and Participants

To characterize patients’ needs and expectations for an eHealth-based phase III CR maintenance system, a guided survey was conducted at the medical rehabilitation centers Clinic Königsfeld, Germany, and University Hospital of Santiago de Compostela, Spain, between July 2021 and March 2022, aiming at a representative sample of ~80 participants. Patients were asked to participate during their inpatient (Germany) or outpatient (Spain) CR program, and participants were recruited consecutively without further selection. Patients diagnosed with CAD were eligible if they were participating in a structured center-based CR program.

Ethics Approval

This study complied with the Helsinki Declaration “Ethical Principles for Medical Research Involving Human Subjects” and was approved by the ethics committee of University Witten/Herdecke (115/2020) and Servizo Galego de Saúde (2021/190). All participants gave their written informed consent before participating in this study. This study is part of the TIMELY observational trial (ClinicalTrials.gov: NCT05461729), which aims to characterize the progress of patients with CAD during phase II and phase III CR.

Patients’ Characteristics

Patients’ anthropometric and clinical data, including severity of CAD, type of intervention, and comorbidities (rated using the D’Hoore comorbidity index [ 22 ]) were extracted from electronic health records by clinical personnel. Patients’ highest level of education was documented and specified by country. Hauptschule and Educación primaria were defined as primary, Realschule and Educación secundaria obligatoria or vocational training as secondary, and Abitur or Bachillerato as tertiary education in Germany (DE) and Spain (ES), respectively. A university degree was classified as the highest educational category. For comparability and due to differing educational systems in Germany and Spain, the level of education was categorized as “lower/equal to high school” (first two levels) or “higher than high school” (all other higher levels).
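
The two-level recoding described above amounts to a simple mapping from country-specific labels to the shared categories. A sketch (the raw labels are stand-ins for the documented German and Spanish categories):

    # Recode country-specific education labels into the two shared levels.
    import pandas as pd

    raw = pd.Series(["Hauptschule", "Realschule", "Abitur", "University",
                     "Educación primaria", "Bachillerato"])

    low = {"Hauptschule", "Realschule", "Educación primaria",
           "Educación secundaria obligatoria"}  # primary + secondary levels
    recoded = raw.map(lambda level: "lower/equal to high school"
                      if level in low else "higher than high school")
    print(recoded.tolist())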

Interview-Based Survey

This survey was developed with experts from a clinical and theoretical perspective by using the Delphi method until consensus was reached. The survey (20 items) was composed of 3 parts: (1) importance of different CR components in general, (2) digital literacy and current usage of technology/technical devices, and (3) helpfulness of the potential features of eHealth in CR (Multimedia Appendix 1). Closed questions were used with a list of provided answers rated on a 5-point Likert scale (1=unimportant/not useful; 5=very important/very useful). A filter question was used, which optionally exempted participants who indicated that they would never use an eHealth platform linked to devices; these participants were asked for their reasons for refusing to use an eHealth platform. The survey was pretested with selected patients in Clinic Königsfeld, and adaptations to wording were made where necessary. The final version of the survey was translated into German (SW and BS) and Spanish (MSF and MA) by at least 2 researchers per translation. The survey was conducted by researchers of the local rehabilitation center. Questions were read to the patients, and further explanation was provided if needed. Investigators documented the answers by using a paper-pencil version or an electronic version of the survey (Multimedia Appendix 1).

Questionnaires

In a subset of 40 German patients with CAD, questionnaires were used to identify personality traits potentially predisposing patients to acceptance of an app or monitoring devices to document the progress of CR (ie, questions Q12 and Q13 of the survey). Psychological flexibility was assessed using the Acceptance and Action Questionnaire version 2 (AAQ-2) [23], and the Revised Life Orientation Test (LOT-R) [24] was used to assess patients’ optimism/pessimism. The Type D scale for social inhibition (DS-14) [25] was used to assess negative affectivity, social inhibition, and type D personality. In addition, the Positive and Negative Affect Schedule (PANAS) was applied [26].

Statistical Analysis

Statistical analyses were performed using the open-access program Jamovi (version 2.2.2, The Jamovi project) and SPSS (version 29, IBM Corp). Data are presented as mean and standard deviation, as median and range for the Likert rating scales, or as n (%), as indicated. Normality was tested using the Shapiro–Wilk test. Between-group differences were tested using the independent 2-sided t test or analysis of variance. Nonparametric tests were used to investigate group differences in Likert scale data (Mann–Whitney U and Kruskal–Wallis tests). The associations of sex, age, education level, and different psychological constructs with openness to using eHealth were analyzed between groups (general willingness [yes/maybe] vs patients not willing to use eHealth [no]) by using the chi-square test or the Mann–Whitney U test, as indicated. To analyze the combined predictive value of multiple patient characteristics for eHealth acceptance, we used multivariate linear regression and naive Bayes classification. The statistical significance level was set at P<.05.
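
As a sketch of the naive Bayes step, the classifier below predicts willingness to use eHealth (yes/maybe vs no) from two patient characteristics; the features, effect sizes and data are synthetic placeholders for the study variables, not the TIMELY data.

    # Naive Bayes classification of eHealth acceptance on synthetic data.
    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(4)
    n = 79
    age = rng.normal(57, 9, n)
    optimism = rng.normal(15, 4, n)   # e.g. a LOT-R-like total score
    X = np.column_stack([age, optimism])
    # Synthetic label: younger patients slightly more willing, as reported.
    willing = (rng.random(n) < 1 / (1 + np.exp(0.08 * (age - 57)))).astype(int)

    clf = GaussianNB().fit(X, willing)
    print("training accuracy:", clf.score(X, willing))
    print("P(willing) for a 45-year-old scoring 18:",
          clf.predict_proba([[45, 18]])[0, 1].round(2))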

Seventy-nine patients participated in the guided survey (Germany, n=49; Spain, n=30; 16/79, 20% female). The mean age (in years) of the patients was 57 (SD 7; range 37-79) (Table 1). In Germany, our sample population was comparable in terms of sex and age to patients with CAD in general (registry data) [27] and to patients with CAD undergoing CR in particular (mean 54.9, SD 7.0 years, in-house data). Further comparison of the study sample to German patients with CAD undergoing CR showed considerable similarity also in terms of ST-elevation myocardial infarction/non–ST-elevation myocardial infarction (~75%), number of affected vessels (1-vessel disease, ~30%-40%), and performed intervention (bypass, ~20%; all in-house data). For Spain, our study sample was comparable to patients with CAD undergoing CR in terms of age (~61 years), ST-elevation myocardial infarction/non–ST-elevation myocardial infarction (~85%), number of affected vessels (1-vessel disease, ~60%), and performed intervention (bypass, ~5%; all in-house data, region Galicia). Overall, 87% (69/79) of the participants had an education level of ≤high school and 13% (10/79) of >high school (Table 1). Comparisons between countries suggested good comparability, even though the age (in years) of the Spanish participants (mean 62, SD 10) was higher than that of the German participants (mean 56, SD 6; P<.001), which was associated with a significantly higher burden of comorbidities (median ES 2.3, IQR 1-8; median DE 1.6, IQR 0-7; P=.03). The percentage of former smokers among patients with CAD in Germany was significantly higher than that in Spain (27/49, 55% vs 7/30, 24%; P<.001). Overall, 30% (24/79) of the included participants were active smokers. Of the 79 participants, approximately 85% (67/79) indicated that they (highly) appreciated being involved in the planning of a future eHealth solution.

a P values were calculated using independent 2-sided t test (nonnormally distributed data were analyzed by Mann Whitney U test) and analysis of variance (nonnormally distributed variables were analyzed by Kruskal-Wallis rank sum test).

b P <.05 for within-group comparison.

c Comorbidity index was calculated according to the modified D’Hoore comorbidity index.

d Primary education is known as Hauptschule in Germany (DE) and educación primaria in Spain (ES).

e Secondary education is known as Realschule in Germany (DE) and educación secundaria obligatoria or vocational training in Spain (ES).

f Tertiary education is known as Abitur in Germany and Bachillerato in Spain.

Digital Literacy and Current Usage of Technology

To assess the use of technology among patients and their associated digital literacy, participants were asked which devices they owned, for which purposes the devices were used, and how experienced they were with health/fitness apps. All patients owned a smartphone, while a significantly lower proportion of Spanish patients owned a tablet (ES: 11/30, 37%; DE: 34/49, 69%; P=.005) (Figure 1). The majority of patients also owned a notebook or PC (ES: 18/30, 60%; DE: 25/30, 84%). Smartwatches (ES: 10/30, 33%; DE: 16/49, 33%) and fitness trackers (ES: 9/30, 30%; DE: 21/49, 43%) were used by a considerable proportion of the participants, with no differences between centers. Although smartphone, tablet, and notebook/PC were predominantly used for communication and information, a difference emerged for smartwatches/fitness trackers: up to 40% (12/30) of the Spanish patients used those devices also for entertainment, which was reported by only 6% (3/49) of the German patients (P=.06). Instead, 50% (25/49) of the German patients used wearables and associated apps for documentation (including physical activity), which was reported by only 20% (6/30) of the Spanish patients (P>.05). In terms of experience with automatic blood pressure monitors, 62% (49/79) of the patients reported their level of experience as “experienced” to “very experienced,” and 29% (23/79) and 13% (10/79) reported this level of experience for fitness trackers and health apps, respectively (Multimedia Appendix 1). Of note, more than 40% (32/79) of the patients reported at least some experience with health or fitness apps.


Rating of CR Components

To assess how patients rated the importance of different CR components for disease management, we recorded their feedback on separate aspects of CR (using 5-point rating scales). Patients’ overall rating of the importance of CR components along the continuum of care for risk reduction was very high, including regular physical exercise (median 5, IQR 3-5), healthy diet (median 5, IQR 3-5), stress management (median 5, IQR 1-5), smoking cessation (median 5, IQR 1-5), optimal medication (median 5, IQR 3-5), motivation for lifestyle changes (median 5, IQR 3-5), and overall risk factor management (median 5, IQR 2-5), with no significant difference between the 2 centers. Patients also rated their individual need for support during phase III CR maintenance in the aforementioned areas, revealing large interindividual differences, with all items ranging from 1 to 5. In general, patients expressed a high need for support for regular physical exercise (median 4, range 1-5), less need for support for smoking cessation (median 1, range 1-5; only active smokers were asked), and less need for support for healthy diet (median 3, range 1-5), stress management (median 3, range 1-5), medication (median 3, range 1-5), motivation for lifestyle changes (median 3, range 1-5), and risk factor management (median 3, range 1-5). Of note, for each item except for smoking cessation, 35%-50% of the patients indicated a high need for support (≥4) to achieve their long-term health goals, suggesting a need for individualized care. The subgroup of patients expressing low perceived smoking cessation support needs was analyzed further to investigate whether it included patients with high-risk phenotypes. However, this analysis did not suggest an elevated risk for these patients, as age, sex, BMI, disease severity (bypass performed [yes/no]), and comorbidity index were similar to those of the group of smokers indicating a need for smoking cessation support.

Rating of eHealth Components to Assist in Phase III CR Maintenance

Overall, 71% (56/79) of the patients reported that they considered technology, including mobile apps, to be helpful in maintaining health goals after phase II CR. To investigate the specific needs and expectations for an eHealth system to assist in phase III CR maintenance, we asked patients which features would be the most helpful for reaching their individual health goals if they were free to choose from a predefined set of options. The presented features were selected by the TIMELY investigators, involving cardiologists, rehabilitation experts, behavioral change experts, and sports scientists, and by considering recent literature on eHealth in CR [6]. Selected features were grouped into 3 categories for the presentation of results, including exercise-related features, clinical/medical components, and motivational/other features (Figure 2), and were analyzed for differences between nationalities, age groups, and men versus women. No significant differences between nationalities were detected for exercise-related features or medical-related entities. In the domain of other CR components, overall progression documentation was rated as significantly more useful/more needed by German patients (median 5, range 1-5) than by Spanish patients (median 4, range 1-5; P<.001). German patients also rated “individual feedback of a real person” as more useful than Spanish patients did (median 5, range 1-5 vs median 4, range 3-5; P=.005). With respect to motivational features, Spanish patients rated the possibility to “share progress with friends and family” as more useful than German patients did (median 4, range 1-5 vs median 2, range 1-5; P=.02). When asked about the preferred frequency of motivational messages, only 5% of the patients answered “several times a day.” Approximately 27% (21/79) preferred to receive messages once a day, 26% (20/79) every other day, and 9% (7/79) did not want to receive messages. Approximately 32% (25/79) indicated that they would prefer a flexible schedule for messages. Of note, no differences in preference for any suggested feature were detected between women and men or among age groups. However, the score for most items ranged from 1 to 5, highlighting that the perceived usefulness of potential eHealth features differs substantially between individuals.


Factors Associated With Acceptance of eHealth in CR Maintenance

To investigate the factors associated with the acceptance of eHealth, we analyzed sex, age, clinical data, and educational as well as psychological factors. Questionnaires involved the LOT-R for optimism/pessimism, the AAQ-2 for psychological flexibility, the DS-14 for social inhibition, and the PANAS for positive/negative affectivity. Education level was not associated with the acceptance of eHealth components (Table 2). No differences in acceptance were observed between women and men, but younger age was significantly associated with greater acceptance of monitoring devices (P=.03), while only a tendency was seen for willingness to use a mobile app (P=.11). Of note, only 6% (3/49) of the patients indicated that they would likely not use eHealth components because of privacy concerns, and 8% (4/49) of the patients did not like the idea of being monitored. Although multivariate linear regression analysis did not identify a combination of factors associated with eHealth acceptance, naïve Bayes classification suggested that eHealth acceptance may potentially be predicted based on younger age, a lower AAQ-2 score (indicating greater psychological flexibility), and the index event (having experienced myocardial infarction). Willingness to use a mobile app was predicted with an overall accuracy of 97.9% (using age and AAQ-2), and the acceptance of monitoring devices was predicted with an overall accuracy of 91.7% (using age, AAQ-2, and myocardial infarction). However, validation in an independent data set was not performed (a naïve Bayes sketch follows the table notes below).

a Data are given as n (%) and median and range. Patients were asked if they would use a mobile app for their cardiac rehabilitation maintenance support and if they would use monitoring devices (eg, blood pressure monitor, electrocardiogram, activity tracker) during maintenance. Options provided were yes/maybe or no. Between-group comparison was performed using chi-square test or Mann-Whitney U test.

b Three missing. Only German patients (n=40) were involved.

c LOT-R: Revised Life Orientation Test; 2 dimensions; range 0-12 (higher = larger optimism/pessimism).

d AAQ-2: Acceptance and Action Questionnaire version 2; range 7-49 (higher = greater psychological inflexibility).

e DS-14: Type D scale for social inhibition; 2 dimensions; range 0-28 (higher = larger negative affectivity/social inhibition).

f PANAS: Positive and Negative Affect Schedule; 2 dimensions; range 0-10 (higher = larger affect).
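To illustrate the classification step reported above, the sketch below fits a Gaussian naïve Bayes model on age and AAQ-2 score. The arrays are synthetic placeholders, and the Gaussian variant is an assumption (the paper does not specify which naïve Bayes model was used); the resubstitution accuracy mirrors the unvalidated accuracy the authors caution about.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for the real predictors: age and AAQ-2 score.
rng = np.random.default_rng(0)
age = rng.normal(57, 7, 40)
aaq2 = rng.normal(17, 6, 40)
willing_app = (age + aaq2 < 75).astype(int)  # toy label: yes/maybe = 1, no = 0

X = np.column_stack([age, aaq2])
clf = GaussianNB().fit(X, willing_app)
print("apparent accuracy:", clf.score(X, willing_app))  # resubstitution, optimistic

# A cross-validated estimate guards against the optimism noted in the text.
print("5-fold CV accuracy:", cross_val_score(GaussianNB(), X, willing_app, cv=5).mean())
```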

Principal Findings

This study aimed to define patients’ needs and expectations for eHealth-based CR to assist them during the lifelong maintenance phase. A Living Lab approach was used for German and Spanish patients with CAD to characterize their use of technology, their preferences and rating of importance for different components of a future eHealth solution for CR maintenance, as well as their general willingness to use eHealth. In brief, our main findings are as follows: (1) patients with CAD appreciated being involved in the planning of a future eHealth system, and they had sufficient levels of digital literacy; (2) patients rated the importance of CR components along the continuum of care for risk reduction as very high; (3) 71% (56/79) of the patients expected that technology could help them to maintain health goals after center-based CR; and (4) large interindividual heterogeneity was detected in terms of reported needs and perceived usefulness of different eHealth components.

CAD is a chronic disease, necessitating innovative approaches for effective management and support over the lifelong maintenance phase after successful intervention and rehabilitation [1-3]. In recent years, telemedicine and eHealth solutions have emerged as promising tools for improving the care of patients with CAD [6]. In this regard, eHealth has already been shown to be an effective alternative to phase II CR, and a recent meta-analysis suggested that telehealth-based phase II CR may even be superior to center-based programs, at least for enhancing physical activity levels [28-30]. In addition, eHealth may have the potential to involve a large number of patients since it may also be an option for patients who cannot or do not want to attend center-based CR. In terms of cost efficiency, Frederix et al [30] estimated that a 6-month internet-based program consisting of exercise training with telemonitoring support, text messages, and a web service can be cost-efficient for up to 2 years after the end of the intervention. However, the development of eHealth solutions tailored for patients with CAD requires a dynamic and patient-centered approach, since low user acceptance is one of the largest barriers to the success of these solutions. The European Society of Cardiology e-Cardiology Working Group reported that digital health developments are often technically driven and not based on the needs and expectations of patients, thereby calling for cocreation with patient involvement in the design [15]. The European Society of Cardiology position paper strongly emphasized that patient-related barriers and user characteristics may hinder the large-scale deployment of eHealth services. Thus, the TIMELY project includes a Living Lab as a means to involve patients and patient organizations, and our analyses reflect part of this patient-centered approach.

By definition, Living Labs represent open innovation ecosystems to cocreate, assess, and refine innovative (technical) solutions [19]. To achieve a user-centric design, Living Labs prioritize the engagement of patients together with health care professionals to ensure that the resulting applications align with the needs, preferences, and challenges of the specific patient group. It is, however, important to place Living Labs in authentic settings, as implemented in this study, where patients with CAD undergoing center-based phase II CR were involved. These patients had received comprehensive information on the etiology and treatment of their disease as well as on lifestyle factors that modify CAD. The majority of the involved patients indicated that they liked the approach and appreciated being involved in the conception and development of an eHealth solution to assist them during the maintenance phase, even though some indicated that too much effort might keep them from using such a solution. In terms of predictors of eHealth use, previous research on sociodemographic factors among US adult internet users suggested that patients with lower education levels had lower odds of using certain features, including web-based tracking of personal health information, using a website to support physical activity, or downloading health information to a mobile device [31]. That study also indicated that being female was a predictor of eHealth use across health care and user-generated content, while age influenced health information–seeking [31]. In comparison, our data also suggest that younger age was associated with the indicated acceptance of technology, but women were as likely as men to accept eHealth for managing their disease, and education level was not identified as a predictor. These findings might reflect the fact that smartphones, device hardware, and mobile apps are rapidly advancing, and daily exposure lowers the barriers for patients to use technology [32]. Although our study was performed among a selected group of patients with CAD participating in a prospective study, it is interesting to compare our cohort also in terms of the necessary hardware availability, that is, smartphone ownership in this patient group in general. Between 2019 and 2020, a large cross-sectional study among cardiac inpatients in Australia reported a high frequency of smartphone ownership (85%-89%) among patients aged 50-69 years and lower ownership (~60%) in patients aged 70-79 years [33]. In our sample (mean age 57 years, SD 9 years), every patient owned a smartphone, and one-third also used activity trackers/smartwatches, which might also be explained by the differences between countries (Australia vs Germany/Spain). Percentage of technology ownership as well as usage and expectations for eHealth were not different between Germany and Spain, even though the Spanish population was significantly older (P=.001) and clinical characteristics differed to some extent. Further, CR in Spain is based on outpatient care, which, while equally effective in terms of reaching the main CR outcomes, could have affected the estimated need for eHealth in this population. Of the analyzed psychological factors, only psychological flexibility showed some predictive value for eHealth acceptance. This result partly contradicts previous findings among older (>60 years) residents of Hong Kong, wherein optimism was significantly related to perceived eHealth usefulness [34]. To what extent these differences are caused by differences in age or cultural background warrants further investigation.

State-of-the-art digital health care programs face numerous technical and interoperability hurdles that make implementation difficult. These include transmitting physiological measurements from ECGs and blood pressure monitors as well as data from activity trackers and other wearables to a centralized platform. Respective solutions rely on wireless networks; different hardware, software, and algorithms for capturing and processing data; as well as connected dashboards. Challenges include system reliability, data quality, interoperability, and, above all, the highest level of data security. We did not ask the involved patients about their opinions on system availability and stability, as these aspects as well as data security and privacy need to meet the highest standards as a conditio sine qua non when providing eHealth to patients. However, information regarding these aspects needs to be provided to patients in sufficient detail, since privacy-related concerns represent considerable barriers [15,35]. These technical requirements and interdependencies result in high costs for any eHealth solution aiming to improve patients’ self-care. Foreseen functionalities should thus not only be based on current guidelines but should also be aligned with patient needs and expectations. This study shows that patients with CAD saw considerable merit in the documentation and availability of their diagnosis, laboratory results, and current medication, all details that would be part of an electronic health record. Patients also showed interest in their overall risk score, which TIMELY will base on a biomarker score to predict the 10-year mortality risk [36,37]. The majority of patients rated the usefulness of blood pressure and ECG monitors as high or very high. Functionalities supporting daily physical activity and physical exercise were perceived as (very) useful, with most patients indicating a high need for progress documentation and regular updates on personalized training recommendations. This observation is relevant since commercial activity trackers have been reported to significantly increase the daily step count and aerobic capacity in patients undergoing CR [38,39], and a considerable number of patients were already relying on commercial solutions, which, however, do not always provide the necessary level of data protection and have not been tested sufficiently in patient populations. Functionalities related to other important parts of CR, including smoking cessation, stress management, advice on heart-healthy eating, and self-education, were perceived as less useful or rated neutral, likely depending on the individually perceived needs of the patients. This aspect was pronounced for smoking cessation, which was perceived as an important part of CR, although 50% of the smokers indicated that they did not want support with this health-related aspect.

Limitations

Although reporting on 2 samples of participants undergoing CR from Germany and Spain with cultural and socioeconomic differences is a strength of this study, this report may be affected by potential selection bias, since patients participating in scientific research studies differ in terms of motivational aspects. However, our sample did not differ in sociodemographic characteristics from the samples of patients with CAD undergoing CR analyzed in previous reports [22]. It should be noted that health literacy, a central factor in eHealth usage and a pivotal determinant of health in general, is a complex construct and was not assessed in all dimensions in our study population. The results of the naïve Bayes classification should be interpreted with care, since validation in an independent data set was not performed. The timepoint and situation of this survey may also have affected the results, since patients may answer differently when asked in their home environment or with a greater time interval after an acute event. Focus groups may allow for more detailed information on the reasoning underlying the reported answers to this guided survey, and the results of focus groups within TIMELY will be reported elsewhere.

This survey involving patients undergoing CR in Germany and Spain revealed that eHealth for CR maintenance should emphasize support for regular physical activity and physical exercise, including patient feedback on achievements and renewal of training recommendations. Devices for physiological measurements, including blood pressure and ECG monitors, were considered useful, and most patients expressed a need for the documentation of diagnosis, medication, and laboratory results in terms of an electronic health record. In general, the patients who took part in this project showed a sufficient level of digital literacy and current usage of technology to make good use of even more advanced eHealth solutions. Although only minor differences were observed between Spanish and German patients as well as between female and male patients, and educational status did not appear to be a contributing factor, it is crucial to note the substantial variability in patients’ individual needs and expectations. Consequently, eHealth solutions should prioritize personalization to enhance user acceptance. Next steps of the TIMELY Living Lab will involve analyses of details on the implementation of the individual CR functionalities and feedback on the mobile app design.

Acknowledgments

We thank all the patients involved in this study for participating and appreciate the help of our colleagues in answering the Delphi questions to develop the survey used in this project. BS, FM, MH, CP-G, and WJK received funding from the European Commission within the H2020 framework (project TIMELY, grant agreement number 101017424).

Data Availability

The data generated during this study are available from the corresponding author upon reasonable request.

Authors' Contributions

BS, SW, and FM designed this study. SW, MSF, HS, MAV, and VG-S performed the survey and collected the data. SW, MSF, and BS analyzed the data. BS, WJK, and MH interpreted the results. BS, SW, and ERD wrote the manuscript. FM, WJK, MH, CP-G, and DG provided important intellectual content. All authors contributed to the revision of the manuscript and approved the final version of the manuscript.

Conflicts of Interest

BS is the Associate Editor of JMIR Rehabilitation and Assistive Technologies . The other authors declare no conflicts of interest.

Details of the survey.

  • Graham I, Atar D, Borch-Johnsen K, Boysen G, Burell G, Cifkova R, et al. European Society of Cardiology (ESC) Committee for Practice Guidelines (CPG). European guidelines on cardiovascular disease prevention in clinical practice: executive summary: Fourth Joint Task Force of the European Society of Cardiology and Other Societies on Cardiovascular Disease Prevention in Clinical Practice (Constituted by representatives of nine societies and by invited experts). Eur Heart J. Oct 2007;28 (19):2375-2414. [ CrossRef ] [ Medline ]
  • Piepoli MF, Corrà U, Adamopoulos S, Benzer W, Bjarnason-Wehrens B, Cupples M, et al. Secondary prevention in the clinical management of patients with cardiovascular diseases. Core components, standards and outcome measures for referral and delivery: A Policy Statement from the Cardiac Rehabilitation Section of the European Association for Cardiovascular Prevention & Rehabilitation. Endorsed by the Committee for Practice Guidelines of the European Society of Cardiology. Eur J Prev Cardiol. Jun 2014;21 (6):664-681. [ CrossRef ] [ Medline ]
  • Ambrosetti M, Abreu A, Corrà U, Davos CH, Hansen D, Frederix I, et al. Secondary prevention through comprehensive cardiovascular rehabilitation: From knowledge to implementation. 2020 update. A position paper from the Secondary Prevention and Rehabilitation Section of the European Association of Preventive Cardiology. Eur J Prev Cardiol. May 14, 2021;28 (5):460-495. [ CrossRef ] [ Medline ]
  • Simon M, Korn K, Cho L, Blackburn GG, Raymond C. Cardiac rehabilitation: A class 1 recommendation. Cleve Clin J Med. Jul 2018;85 (7):551-558. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Leon AS, Franklin BA, Costa F, Balady GJ, Berra KA, Stewart KJ, et al. Cardiac rehabilitation and secondary prevention of coronary heart disease: an American Heart Association scientific statement from the Council on Clinical Cardiology (Subcommittee on Exercise, Cardiac Rehabilitation, and Prevention) and the Council on Nutrition, Physical Activity, and Metabolism (Subcommittee on Physical Activity), in collaboration with the American Association of Cardiovascular and Pulmonary Rehabilitation. Circulation. Jan 25, 2005;111 (3):369-376. [ CrossRef ] [ Medline ]
  • Heimer M, Schmitz S, Teschler M, Schäfer H, Douma ER, Habibovic M, et al. eHealth for maintenance cardiovascular rehabilitation: a systematic review and meta-analysis. Eur J Prev Cardiol. Oct 26, 2023;30 (15):1634-1651. [ CrossRef ] [ Medline ]
  • Denton F, Waddell A, Kite C, Hesketh K, Atkinson L, Cocks M, et al. Remote maintenance cardiac rehabilitation (MAINTAIN): A protocol for a randomised feasibility study. Digit Health. 2023;9:20552076231152176. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Andersen RM, Skou ST, Clausen MB, Jäger M, Zangger G, Grøntved A, et al. Maintenance of physical activity after cardiac rehabilitation (FAIR): study protocol for a feasibility trial. BMJ Open. Apr 05, 2022;12 (4):e060157. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Bjarnason-Wehrens B, Grande G, Loewel H, Völler H, Mittag O. Gender-specific issues in cardiac rehabilitation: do women with ischaemic heart disease need specially tailored programmes? Eur J Cardiovasc Prev Rehabil. Apr 2007;14 (2):163-171. [ CrossRef ] [ Medline ]
  • Russell MW, Huse DM, Drowns S, Hamel EC, Hartz SC. Direct medical costs of coronary artery disease in the United States. Am J Cardiol. May 01, 1998;81 (9):1110-1115. [ CrossRef ] [ Medline ]
  • Brørs G, Pettersen TR, Hansen TB, Fridlund B, Hølvold LB, Lund H, et al. Modes of e-Health delivery in secondary prevention programmes for patients with coronary artery disease: a systematic review. BMC Health Serv Res. Jun 10, 2019;19 (1):364. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Walsh DMJ, Moran K, Cornelissen V, Buys R, Claes J, Zampognaro P, et al. The development and codesign of the PATHway intervention: a theory-driven eHealth platform for the self-management of cardiovascular disease. Transl Behav Med. Jan 01, 2019;9 (1):76-98. [ CrossRef ] [ Medline ]
  • Silva-Cardoso J, Juanatey JRG, Comin-Colet J, Sousa JM, Cavalheiro A, Moreira E. The future of telemedicine in the management of heart failure patients. Card Fail Rev. Mar 2021;7:e11. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Verburg A, Selder JL, Schalij MJ, Schuuring MJ, Treskes RW. eHealth to improve patient outcome in rehabilitating myocardial infarction patients. Expert Rev Cardiovasc Ther. Mar 2019;17 (3):185-192. [ CrossRef ] [ Medline ]
  • Frederix I, Caiani EG, Dendale P, Anker S, Bax J, Böhm A, et al. ESC e-Cardiology Working Group Position Paper: Overcoming challenges in digital health implementation in cardiovascular medicine. Eur J Prev Cardiol. Jul 2019;26 (11):1166-1177. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Schreiweis B, Pobiruchin M, Strotbaum V, Suleder J, Wiesner M, Bergh B. Barriers and facilitators to the implementation of ehealth services: systematic literature analysis. J Med Internet Res. Nov 22, 2019;21 (11):e14197. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Herrera S, Salazar A, Nazar G. Barriers and supports in eHealth implementation among people with chronic cardiovascular ailments: integrative review. Int J Environ Res Public Health. Jul 07, 2022;19 (14):8296. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Kop W, Schmitz B, Gatsios D, Peña-Gil C, Gonzalez Juanatey J, Cantarero Prieto D, et al. Patient-centered lifestyle intervention using artificial intelligence methodologies: The TIMELY project for cardiac rehabilitation. J Psychosom Res. Jun 2022;157:110872. [ CrossRef ]
  • Zipfel N, Horreh B, Hulshof CTJ, de Boer AGEM, van der Burg-Vermeulen SJ. The relationship between the living lab approach and successful implementation of healthcare innovations: an integrative review. BMJ Open. Jun 28, 2022;12 (6):e058630. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Murphy MK, Black NA, Lamping DL, McKee CM, Sanderson CF, Askham J, et al. Consensus development methods, and their use in clinical guideline development. Health Technol Assess. 1998;2 (3):i-iv, 1. [ FREE Full text ] [ Medline ]
  • Schmitz B, De Maria R, Gatsios D, Chrysanthakopoulou T, Landolina M, Gasparini M, et al. Identification of genetic markers for treatment success in heart failure patients: insight from cardiac resynchronization therapy. Circ Cardiovasc Genet. Dec 2014;7 (6):760-770. [ CrossRef ] [ Medline ]
  • Teschler M, Heimer M, Schmitz B, Kemmler W, Mooren FC. Four weeks of electromyostimulation improves muscle function and strength in sarcopenic patients: a three-arm parallel randomized trial. J Cachexia Sarcopenia Muscle. Aug 2021;12 (4):843-854. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Bond FW, Hayes SC, Baer RA, Carpenter KM, Guenole N, Orcutt HK, et al. Preliminary psychometric properties of the Acceptance and Action Questionnaire-II: a revised measure of psychological inflexibility and experiential avoidance. Behav Ther. Dec 2011;42 (4):676-688. [ CrossRef ] [ Medline ]
  • Scheier MF, Carver CS, Bridges MW. Distinguishing optimism from neuroticism (and trait anxiety, self-mastery, and self-esteem): a reevaluation of the Life Orientation Test. J Pers Soc Psychol. Dec 1994;67 (6):1063-1078. [ CrossRef ] [ Medline ]
  • Denollet J. DS14: standard assessment of negative affectivity, social inhibition, and Type D personality. Psychosom Med. 2005;67 (1):89-97. [ CrossRef ] [ Medline ]
  • Watson D, Clark LA, Tellegen A. Development and validation of brief measures of positive and negative affect: the PANAS scales. J Pers Soc Psychol. Jun 1988;54 (6):1063-1070. [ CrossRef ] [ Medline ]
  • Reinecke H, Breithardt G, Engelbertz C, Schmieder RE, Fobker M, Pinnschmidt HO, et al. Baseline characteristics and prescription patterns of standard drugs in patients with angiographically determined coronary artery disease and renal failure (CAD-REF registry). PLoS One. 2016;11 (2):e0148057. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Kraal JJ, Van den Akker-Van Marle ME, Abu-Hanna A, Stut W, Peek N, Kemps HM. Clinical and cost-effectiveness of home-based cardiac rehabilitation compared to conventional, centre-based cardiac rehabilitation: Results of the FIT@Home study. Eur J Prev Cardiol. Aug 2017;24 (12):1260-1273. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Rawstorn JC, Gant N, Direito A, Beckmann C, Maddison R. Telehealth exercise-based cardiac rehabilitation: a systematic review and meta-analysis. Heart. Aug 01, 2016;102 (15):1183-1192. [ CrossRef ] [ Medline ]
  • Frederix I, Solmi F, Piepoli MF, Dendale P. Cardiac telerehabilitation: A novel cost-efficient care delivery strategy that can induce long-term health benefits. Eur J Prev Cardiol. Nov 2017;24 (16):1708-1717. [ CrossRef ] [ Medline ]
  • Kontos E, Blake KD, Chou WS, Prestin A. Predictors of eHealth usage: insights on the digital divide from the Health Information National Trends Survey 2012. J Med Internet Res. Jul 16, 2014;16 (7):e172. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Gray R, Indraratna P, Lovell N, Ooi S. Digital health technology in the prevention of heart failure and coronary artery disease. Cardiovasc Digit Health J. Dec 2022;3 (6 Suppl):S9-S16. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Indraratna P, Magdy J, Li J, et al. Patterns and predictors of smartphone ownership in a cardiology inpatient population. Eur Heart J. Oct 14, 2021:3110-3111. [ CrossRef ]
  • Kim S, Chow BC, Park S, Liu H. The usage of digital health technology among older adults in Hong Kong and the role of technology readiness and eHealth literacy: path analysis. J Med Internet Res. Apr 12, 2023;25:e41915. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Coorey GM, Neubeck L, Mulley J, Redfern J. Effectiveness, acceptability and usefulness of mobile applications for cardiovascular disease self-management: Systematic review with meta-synthesis of quantitative and qualitative data. Eur J Prev Cardiol. Mar 2018;25 (5):505-521. [ CrossRef ] [ Medline ]
  • Goliasch G, Kleber ME, Richter B, Plischke M, Hoke M, Haschemi A, et al. Routinely available biomarkers improve prediction of long-term mortality in stable coronary artery disease: the Vienna and Ludwigshafen Coronary Artery Disease (VILCAD) risk score. Eur Heart J. Sep 2012;33 (18):2282-2289. [ CrossRef ] [ Medline ]
  • Kleber ME, Goliasch G, Grammer TB, Pilz S, Tomaschitz A, Silbernagel G, et al. Evolving biomarkers improve prediction of long-term mortality in patients with stable coronary artery disease: the BIO-VILCAD score. J Intern Med. Aug 2014;276 (2):184-194. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Ashur C, Cascino TM, Lewis C, Townsend W, Sen A, Pekmezi D, et al. Do wearable activity trackers increase physical activity among cardiac rehabilitation participants? A systematic review and meta-analysis. J Cardiopulm Rehabil Prev. Jul 01, 2021;41 (4):249-256. [ CrossRef ] [ Medline ]
  • Su JJ, Yu DSF, Paguio JT. Effect of eHealth cardiac rehabilitation on health outcomes of coronary heart disease patients: A systematic review and meta-analysis. J Adv Nurs. Mar 2020;76 (3):754-772. [ CrossRef ] [ Medline ]

Abbreviations

Edited by T de Azevedo Cardoso, S He; submitted 26.10.23; peer-reviewed by J Su, D Liu, P Dilaveris; comments to author 20.12.23; revised version received 28.12.23; accepted 30.01.24; published 22.02.24.

©Boris Schmitz, Svenja Wirtz, Manuela Sestayo-Fernández, Hendrik Schäfer, Emma R Douma, Marta Alonso Vazquez, Violeta González-Salvado, Mirela Habibovic, Dimitris Gatsios, Willem Johan Kop, Carlos Peña-Gil, Frank Mooren. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 22.02.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.

  • Open access
  • Published: 19 February 2024

Genomic data in the All of Us Research Program

The All of Us Research Program Genomics Investigators

Nature (2024)


  • Genetic variation
  • Genome-wide association studies

Comprehensively mapping the genetic basis of human disease across diverse individuals is a long-standing goal for the field of human genetics 1 , 2 , 3 , 4 . The All of Us Research Program is a longitudinal cohort study aiming to enrol a diverse group of at least one million individuals across the USA to accelerate biomedical research and improve human health 5 , 6 . Here we describe the programme’s genomics data release of 245,388 clinical-grade genome sequences. This resource is unique in its diversity as 77% of participants are from communities that are historically under-represented in biomedical research and 46% are individuals from under-represented racial and ethnic minorities. All of Us identified more than 1 billion genetic variants, including more than 275 million previously unreported genetic variants, more than 3.9 million of which had coding consequences. Leveraging linkage between genomic data and the longitudinal electronic health record, we evaluated 3,724 genetic variants associated with 117 diseases and found high replication rates across both participants of European ancestry and participants of African ancestry. Summary-level data are publicly available, and individual-level data can be accessed by researchers through the All of Us Researcher Workbench using a unique data passport model with a median time from initial researcher registration to data access of 29 hours. We anticipate that this diverse dataset will advance the promise of genomic medicine for all.

Comprehensively identifying genetic variation and cataloguing its contribution to health and disease, in conjunction with environmental and lifestyle factors, is a central goal of human health research 1 , 2 . A key limitation in efforts to build this catalogue has been the historic under-representation of large subsets of individuals in biomedical research including individuals from diverse ancestries, individuals with disabilities and individuals from disadvantaged backgrounds 3 , 4 . The All of Us Research Program (All of Us) aims to address this gap by enrolling and collecting comprehensive health data on at least one million individuals who reflect the diversity across the USA 5 , 6 . An essential component of All of Us is the generation of whole-genome sequence (WGS) and genotyping data on one million participants. All of Us is committed to making this dataset broadly useful—not only by democratizing access to this dataset across the scientific community but also to return value to the participants themselves by returning individual DNA results, such as genetic ancestry, hereditary disease risk and pharmacogenetics according to clinical standards, to those who wish to receive these research results.

Here we describe the release of WGS data from 245,388 All of Us participants and demonstrate the impact of this high-quality data in genetic and health studies. We carried out a series of data harmonization and quality control (QC) procedures and conducted analyses characterizing the properties of the dataset including genetic ancestry and relatedness. We validated the data by replicating well-established genotype–phenotype associations including low-density lipoprotein cholesterol (LDL-C) and 117 additional diseases. These data are available through the All of Us Researcher Workbench, a cloud platform that embodies and enables programme priorities, facilitating equitable data and compute access while ensuring responsible conduct of research and protecting participant privacy through a passport data access model.

The All of Us Research Program

To accelerate health research, All of Us is committed to curating and releasing research data early and often 6 . Less than five years after national enrolment began in 2018, this fifth data release includes data from more than 413,000 All of Us participants. Summary data are made available through a public Data Browser, and individual-level participant data are made available to researchers through the Researcher Workbench (Fig. 1a and Data availability).

Figure 1. a, The All of Us Research Hub contains a publicly accessible Data Browser for exploration of summary phenotypic and genomic data. The Researcher Workbench is a secure cloud-based environment of participant-level data in a Controlled Tier that is widely accessible to researchers. b, All of Us participants have rich phenotype data from a combination of physical measurements, survey responses, EHRs, wearables and genomic data. Dots indicate the presence of the specific data type for the given number of participants. c, Overall summary of participants under-represented in biomedical research (UBR) with data available in the Controlled Tier. The All of Us logo in a is reproduced with permission of the National Institutes of Health’s All of Us Research Program.

Participant data include a rich combination of phenotypic and genomic data (Fig. 1b ). Participants are asked to complete consent for research use of data, sharing of electronic health records (EHRs), donation of biospecimens (blood or saliva, and urine), in-person provision of physical measurements (height, weight and blood pressure) and surveys initially covering demographics, lifestyle and overall health 7 . Participants are also consented for recontact. EHR data, harmonized using the Observational Medical Outcomes Partnership Common Data Model 8 ( Methods ), are available for more than 287,000 participants (69.42%) from more than 50 health care provider organizations. The EHR dataset is longitudinal, with a quarter of participants having 10 years of EHR data (Extended Data Fig. 1 ). Data include 245,388 WGSs and genome-wide genotyping on 312,925 participants. Sequenced and genotyped individuals in this data release were not prioritized on the basis of any clinical or phenotypic feature. Notably, 99% of participants with WGS data also have survey data and physical measurements, and 84% also have EHR data. In this data release, 77% of individuals with genomic data identify with groups historically under-represented in biomedical research, including 46% who self-identify with a racial or ethnic minority group (Fig. 1c , Supplementary Table 1 and Supplementary Note ).

Scaling the All of Us infrastructure

The genomic dataset generated from All of Us participants is a resource for research and discovery and serves as the basis for return of individual health-related DNA results to participants. Consequently, the US Food and Drug Administration determined that All of Us met the criteria for a significant risk device study. As such, the entire All of Us genomics effort from sample acquisition to sequencing meets clinical laboratory standards 9 .

All of Us participants were recruited through a national network of partners, starting in 2018, as previously described 5. Participants may enrol through All of Us-funded health care provider organizations or direct volunteer pathways, and all biospecimens, including blood and saliva, are sent to the central All of Us Biobank for processing and storage. Genomics data for this release were generated from blood-derived DNA. The programme began return of actionable genomic results in December 2022. As of April 2023, approximately 51,000 individuals were sent notifications asking whether they wanted to view their results, and approximately half have accepted. Return continues on an ongoing basis.

The All of Us Data and Research Center maintains all participant information and biospecimen ID linkage to ensure participant confidentiality; coded identifiers (participant and aliquot level) are used to track each sample through the All of Us genomics workflow. This workflow facilitates weekly automated aliquot and plating requests to the Biobank, supplies relevant metadata for the sample shipments to the Genome Centers, and contains a feedback loop to inform action on samples that fail QC at any stage. Further, the consent status of each participant is checked before sample shipment to confirm that they are still active. Although all participants with genomic data are consented for the same general research use category, the programme accommodates different preferences for the return of genomic data to participants, and only data for those individuals who have consented to the return of individual health-related DNA results are distributed to the All of Us Clinical Validation Labs for further evaluation and health-related clinical reporting. All participants in All of Us who choose to get health-related DNA results have the option to schedule a genetic counselling appointment to discuss their results. Individuals with positive findings who choose to obtain results are required to schedule an appointment with a genetic counsellor to receive those findings.

Genome sequencing

To satisfy the requirements for clinical accuracy, precision and consistency across DNA sample extraction and sequencing, the All of Us Genome Centers and Biobank harmonized laboratory protocols, established standard QC methodologies and metrics, and conducted a series of validation experiments using previously characterized clinical samples and commercially available reference standards 9. Briefly, PCR-free barcoded WGS libraries were constructed with the Illumina Kapa HyperPrep kit. Libraries were pooled and sequenced on the Illumina NovaSeq 6000 instrument. After demultiplexing, initial QC analysis was performed with the Illumina DRAGEN pipeline (Supplementary Table 2), leveraging lane, library, flow cell, barcode and sample level metrics as well as assessing contamination, mapping quality and concordance to genotyping array data independently processed from a different aliquot of DNA. The Genome Centers use these metrics to determine whether each sample meets programme specifications and then submit sequencing data to the Data and Research Center for further QC, joint calling and distribution to the research community (Methods).

This effort to harmonize sequencing methods, multi-level QC and use of identical data processing protocols mitigated the variability in sequencing location and protocols that often leads to batch effects in large genomic datasets 9 . As a result, the data are not only of clinical-grade quality, but also consistent in coverage (≥30× mean) and uniformity across Genome Centers (Supplementary Figs. 1 – 5 ).

Joint calling and variant discovery

We carried out joint calling across the entire All of Us WGS dataset (Extended Data Fig. 2). Joint calling leverages information across samples to prune artefact variants, which increases sensitivity, and enables flagging samples with potential issues that were missed during single-sample QC 10 (Supplementary Table 3). Scaling conventional approaches to whole-genome joint calling beyond 50,000 individuals is a notable computational challenge 11, 12. To address this, we developed a new cloud variant storage solution, the Genomic Variant Store (GVS), based on a schema designed for querying and rendering variants: variants are stored in GVS and rendered to an analysable variant file on demand, rather than the variant file serving as the primary storage mechanism (Code availability). We carried out QC on the joint call set on the basis of the approach developed for gnomAD 3.1 (ref. 13). This included flagging samples with outlying values in eight metrics (Supplementary Table 4, Supplementary Fig. 2 and Methods).
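The actual GVS implementation is not detailed here; the toy sketch below only illustrates the stated design idea, that variants live in a queryable store and an analysable, VCF-like record is rendered from it on demand, using SQLite purely as a stand-in backend.

```python
import sqlite3

# Toy stand-in for a variant store: variants are rows in a queryable schema,
# and a VCF-like line is *rendered* on demand rather than stored as a file.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE variant (
    sample_id TEXT, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, gt TEXT)""")
con.executemany("INSERT INTO variant VALUES (?, ?, ?, ?, ?, ?)", [
    ("S1", "chr1", 12345, "A", "G", "0/1"),
    ("S2", "chr1", 12345, "A", "G", "1/1"),
])

def render_site(chrom: str, pos: int) -> str:
    """Render one site across all samples as a minimal VCF-like record."""
    rows = con.execute(
        "SELECT sample_id, ref, alt, gt FROM variant "
        "WHERE chrom = ? AND pos = ? ORDER BY sample_id",
        (chrom, pos)).fetchall()
    ref, alt = rows[0][1], rows[0][2]
    genotypes = "\t".join(gt for _, _, _, gt in rows)
    return f"{chrom}\t{pos}\t.\t{ref}\t{alt}\t.\t.\t.\tGT\t{genotypes}"

print(render_site("chr1", 12345))
```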

To calculate the sensitivity and precision of the joint call dataset, we included four well-characterized samples. We sequenced the National Institute of Standards and Technology reference materials (DNA samples) from the Genome in a Bottle consortium 13 and carried out variant calling as described above. We used the corresponding published set of variant calls for each sample as the ground truth in our sensitivity and precision calculations 14 . The overall sensitivity for single-nucleotide variants was over 98.7% and precision was more than 99.9%. For short insertions or deletions, the sensitivity was over 97% and precision was more than 99.6% (Supplementary Table 5 and Methods ).
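The arithmetic behind these figures is simple set comparison against the truth calls. The sketch below shows the calculation on toy variant keys; a real benchmark against Genome in a Bottle truth sets additionally matches genotypes and normalizes variant representation, which this sketch omits.

```python
def sensitivity_precision(called: set, truth: set) -> tuple[float, float]:
    """Sensitivity = TP/(TP+FN); precision = TP/(TP+FP).
    Variants are keyed as (chrom, pos, ref, alt); genotype matching and
    variant normalization, done by real benchmarking tools, are omitted."""
    tp = len(called & truth)
    fn = len(truth - called)
    fp = len(called - truth)
    return tp / (tp + fn), tp / (tp + fp)

truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")}
called = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 99, "T", "C")}
sens, prec = sensitivity_precision(called, truth)
print(f"sensitivity={sens:.3f} precision={prec:.3f}")  # 0.667, 0.667
```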

The joint call set included more than 1 billion genetic variants. We annotated the joint call dataset on the basis of functional annotation (for example, gene symbol and protein change) using Illumina Nirvana 15. We defined coding variants as those inducing an amino acid change on a canonical ENSEMBL transcript and found 272,051,104 non-coding and 3,913,722 coding variants that have not been described previously in dbSNP 16 v153 (Extended Data Table 1). A total of 3,912,832 (99.98%) of the coding variants are rare (allelic frequency < 0.01) and the remaining 883 (0.02%) are common (allelic frequency > 0.01). Of the coding variants, 454 (0.01%) are common in one or more of the non-European computed ancestries in All of Us, rare among participants of European ancestry, and have an allelic number greater than 1,000 (Extended Data Table 2 and Extended Data Fig. 3). The distributions of pathogenic, or likely pathogenic, ClinVar variant counts per participant, stratified by computed ancestry and filtered to only those variants that are found in individuals with an allele count of <40, are shown in Extended Data Fig. 4. The potential medical implications of these known and new variants with respect to variant pathogenicity by ancestry are highlighted in a companion paper 17. In particular, we find that the European ancestry subset has the highest rate of pathogenic variation (2.1%), which was twice the rate of pathogenic variation in individuals of East Asian ancestry 17. The lower frequency of variants in East Asian individuals may be partially explained by the fact that the sample size in that group is small, and there may be knowledge bias in the variant databases that reduces the number of findings in some of the less-studied ancestry groups.
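The frequency-based bucketing described above reduces to a couple of threshold checks per variant. The sketch below applies the text’s thresholds (allele frequency 0.01 for rare versus common; allele number > 1,000 for the cross-ancestry class); the record layout is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    af_overall: float
    # ancestry -> (allele frequency, allele number); layout is hypothetical
    af_by_ancestry: dict

def bucket(v: Variant) -> str:
    # Thresholds from the text: rare if AF < 0.01, common otherwise.
    return "rare" if v.af_overall < 0.01 else "common"

def common_non_eur_rare_eur(v: Variant) -> bool:
    """The class highlighted above: common in at least one non-European
    computed ancestry (with allele number > 1,000 there) but rare in EUR."""
    eur_af, _ = v.af_by_ancestry["EUR"]
    return eur_af < 0.01 and any(
        af > 0.01 and an > 1000
        for ancestry, (af, an) in v.af_by_ancestry.items()
        if ancestry != "EUR")

v = Variant(0.004, {"EUR": (0.002, 90000), "AFR": (0.03, 25000)})
print(bucket(v), common_non_eur_rare_eur(v))  # rare True
```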

Genetic ancestry and relatedness

Genetic ancestry inference confirmed that 51.1% of the All of Us WGS dataset is derived from individuals of non-European ancestry. Briefly, the ancestry categories are based on the same labels used in gnomAD 18 . We trained a classifier on a 16-dimensional principal component analysis (PCA) space of a diverse reference based on 3,202 samples and 151,159 autosomal single-nucleotide polymorphisms. We projected the All of Us samples into the PCA space of the training data, based on the same single-nucleotide polymorphisms from the WGS data, and generated categorical ancestry predictions from the trained classifier ( Methods ). Continuous genetic ancestry fractions for All of Us samples were inferred using the same PCA data, and participants’ patterns of ancestry and admixture were compared to their self-identified race and ethnicity (Fig. 2 and Methods ). Continuous ancestry inference carried out using genome-wide genotypes yields highly concordant estimates.
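The projection-and-classify approach reads directly as a two-step scikit-learn pipeline: fit PCA on the labelled reference, project new samples into that space, and predict from a classifier trained on the reference projections. Everything below is synthetic, including the random-forest choice, since the programme’s exact classifier is not specified in this excerpt.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins: genotype dosages (0/1/2) for a labelled reference panel
# (3,202 samples in the text; far fewer SNPs here than the 151,159 used).
ref_genotypes = rng.integers(0, 3, size=(3202, 5000)).astype(float)
ref_labels = rng.choice(["AFR", "AMR", "EAS", "EUR", "SAS"], size=3202)

pca = PCA(n_components=16).fit(ref_genotypes)            # 16-dimensional PCA space
clf = RandomForestClassifier(random_state=0).fit(
    pca.transform(ref_genotypes), ref_labels)

# Project cohort samples into the reference PCA space, then classify.
cohort = rng.integers(0, 3, size=(10, 5000)).astype(float)
print(clf.predict(pca.transform(cohort)))
```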

Figure 2. a, b, Uniform manifold approximation and projection (UMAP) representations of All of Us WGS PCA data with self-described race (a) and ethnicity (b) labels. c, Proportion of genetic ancestry per individual in six distinct and coherent ancestry groups defined by Human Genome Diversity Project and 1000 Genomes samples.

Kinship estimation confirmed that the All of Us WGS data consist largely of unrelated individuals, with about 85% (215,107) having no first- or second-degree relatives in the dataset (Supplementary Fig. 6). As many genomic analyses leverage unrelated individuals, we identified the smallest set of samples whose removal leaves no remaining first- or second-degree relative pairs, retaining one individual from each kindred. This procedure yielded a maximal independent set of 231,442 individuals (about 94%) with genome sequence data in the current release (Methods).
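Selecting the unrelated subset can be framed as finding an independent set in a relatedness graph whose edges are first- or second-degree pairs. A minimal sketch with networkx follows; the kinship pairs are made up, and networkx’s routine returns a maximal (not maximum) independent set, so a production pipeline would tune the pruning to retain as many samples as possible.

```python
import networkx as nx

# Toy relatedness graph: edges connect first- or second-degree relative pairs.
related_pairs = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
all_samples = {"P1", "P2", "P3", "P4", "P5", "P6"}

g = nx.Graph()
g.add_nodes_from(all_samples)
g.add_edges_from(related_pairs)

# Keep one individual per kindred: no two retained samples share an edge.
unrelated = nx.maximal_independent_set(g, seed=1)
print(sorted(unrelated))
```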

Genetic determinants of LDL-C

As a measure of data quality and utility, we carried out a single-variant genome-wide association study (GWAS) for LDL-C, a trait with well-established genomic architecture (Methods). Of the 245,388 WGS participants, 91,749 had one or more LDL-C measurements. The All of Us LDL-C GWAS identified 20 well-established genome-wide significant loci, with minimal genomic inflation (Fig. 3, Extended Data Table 3 and Supplementary Fig. 7). We compared the results to those of a recent multi-ethnic LDL-C GWAS in the National Heart, Lung, and Blood Institute (NHLBI) TOPMed study that included 66,329 ancestrally diverse (56% non-European ancestry) individuals 19. We found a strong correlation between the effect estimates for NHLBI TOPMed genome-wide significant loci and those of All of Us (R² = 0.98, P < 1.61 × 10−45; Fig. 3, inset). Notably, the per-locus effect sizes observed in All of Us are decreased compared to those in TOPMed, which is in part due to differences in the underlying statistical model, differences in the ancestral composition of these datasets and differences in laboratory value ascertainment between EHR-derived data and epidemiology studies. A companion manuscript extended this work to identify common and rare genetic associations for three diseases (atrial fibrillation, coronary artery disease and type 2 diabetes) and two quantitative traits (height and LDL-C) in the All of Us dataset and identified very high concordance with previous efforts across all of these diseases and traits 20.
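A single-variant GWAS of a quantitative trait boils down to one covariate-adjusted regression per variant, followed here by an effect-size concordance check against a second study. The sketch below uses statsmodels on synthetic data; the covariates and regression model are illustrative, not the study’s exact specification (which would include, for example, genetic principal components).

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 1000

# Synthetic phenotype, covariates and genotype dosage for one variant.
ldl = rng.normal(120, 30, n)
age = rng.normal(57, 8, n)
sex = rng.integers(0, 2, n)
dosage = rng.integers(0, 3, n).astype(float)

X = sm.add_constant(np.column_stack([dosage, age, sex]))
fit = sm.OLS(ldl, X).fit()
print("beta:", fit.params[1], "P:", fit.pvalues[1])  # variant effect and P value

# Effect-size concordance between two studies, as in the TOPMed comparison.
beta_topmed = rng.normal(0, 0.3, 194)
beta_aou = beta_topmed + rng.normal(0, 0.02, 194)  # toy, near-identical effects
r, _ = pearsonr(beta_topmed, beta_aou)
print("R^2 =", r ** 2)
```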

Figure 3. Manhattan plot demonstrating robust replication of 20 well-established LDL-C genetic loci among 91,749 individuals with 1 or more LDL-C measurements. The red horizontal line denotes the genome-wide significance threshold of P = 5 × 10−8. Inset, effect estimate (β) comparison between the NHLBI TOPMed LDL-C GWAS (x axis) and the All of Us LDL-C GWAS (y axis) for the subset of 194 independent variants (clumped with a 250-kb window and r² = 0.5) that reached genome-wide significance in NHLBI TOPMed.

Genotype-by-phenotype associations

As another measure of data quality and utility, we tested replication rates of previously reported phenotype–genotype associations in the five predicted genetic ancestry populations present in the Phenotype/Genotype Reference Map (PGRM): AFR, African ancestry; AMR, Latino/admixed American ancestry; EAS, East Asian ancestry; EUR, European ancestry; SAS, South Asian ancestry. The PGRM contains published associations in the GWAS catalogue in these ancestry populations that map to International Classification of Diseases-based phenotype codes 21. This replication study specifically looked across 4,947 variants, calculating replication rates for powered associations in each ancestry population. The overall replication rates for associations powered at 80% were: 72.0% (18/25) in AFR, 100% (13/13) in AMR, 46.6% (7/15) in EAS, 74.9% (1,064/1,421) in EUR, and 100% (1/1) in SAS. With the exception of the EAS ancestry results, these powered replication rates are comparable to those of the published PGRM analysis, in which the replication rates of several single-site EHR-linked biobanks ranged from 76% to 85%. These results demonstrate the utility of the data and also highlight opportunities for further work on understanding the specifics of the All of Us population and the potential contribution of gene–environment interactions to genotype–phenotype mapping; they also motivate the development of methods for multi-site EHR phenotype data extraction, harmonization and genetic association studies.
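Operationally, the replication rate is the fraction of adequately powered prior associations that reach significance with a concordant effect direction in the new cohort. The sketch below computes per-ancestry rates on a toy table; the column names and thresholds are illustrative, not the actual PGRM schema.

```python
import pandas as pd

# Toy table of previously reported associations tested in the new cohort.
assoc = pd.DataFrame({
    "ancestry":  ["AFR", "AFR", "EUR", "EUR"],
    "power":     [0.92,  0.85,  0.99,  0.40],   # power to replicate
    "beta_prev": [0.30, -0.20,  0.15,  0.10],   # published effect
    "beta_new":  [0.25,  0.10,  0.12,  0.09],   # effect in the new cohort
    "p_new":     [1e-4,  0.30,  1e-6,  0.04],
})

powered = assoc[assoc["power"] >= 0.80]
replicated = (powered["p_new"] < 0.05) & (powered["beta_prev"] * powered["beta_new"] > 0)
print(replicated.groupby(powered["ancestry"]).mean())  # per-ancestry replication rate
```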

More broadly, the All of Us resource highlights opportunities to identify genotype–phenotype associations that differ across diverse populations 22. For example, variation at the Duffy blood group locus (ACKR1) is more prevalent in individuals of AFR ancestry and individuals of AMR ancestry than in individuals of EUR ancestry. Although the phenome-wide association study of this locus highlights the well-established association of the Duffy blood group with lower white blood cell counts in individuals of both AFR and AMR ancestry 23, 24, it also revealed genetic-ancestry-specific phenotype patterns, with minimal phenotypic associations in individuals of EAS or EUR ancestry (Fig. 4 and Extended Data Table 4). Conversely, rs9273363 in the HLA-DQB1 locus is associated with increased risk of type 1 diabetes 25, 26 and diabetic complications across ancestries, but associates with increased risk of coeliac disease only in individuals of EUR ancestry (Extended Data Fig. 5). Similarly, the TCF7L2 locus 27 strongly associates with increased risk of type 2 diabetes and associated complications across several ancestries (Extended Data Fig. 6). Association testing results are available in Supplementary Dataset 1.

Figure 4: Results of genetic-ancestry-stratified phenome-wide association analysis among unrelated individuals, highlighting ancestry-specific disease associations across the four most common genetic ancestries of participants. The Bonferroni-adjusted phenome-wide significance threshold (P < 2.88 × 10⁻⁵) is plotted as a red horizontal line. AFR (n = 34,037, minor allele fraction (MAF) 0.82); AMR (n = 28,901, MAF 0.10); EAS (n = 32,55, MAF 0.003); EUR (n = 101,613, MAF 0.007).

The cloud-based Researcher Workbench

All of Us genomic data are available in a secure, access-controlled cloud-based analysis environment: the All of Us Researcher Workbench. Unlike traditional data access models that require per-project approval, access in the Researcher Workbench is governed by a data passport model based on a researcher’s authenticated identity, institutional affiliation, and completion of self-service training and compliance attestation 28 . After gaining access, a researcher may create a new workspace at any time to conduct a study, provided that they comply with all Data Use Policies and self-declare their research purpose. This information is regularly audited and made accessible publicly on the All of Us Research Projects Directory. This streamlined access model is guided by the principles that: participants are research partners and maintaining their privacy and data security is paramount; their data should be made as accessible as possible for authorized researchers; and we should continually seek to remove unnecessary barriers to accessing and using All of Us data.

For researchers at institutions with an existing institutional data use agreement, access can be gained as soon as they complete the required verification and compliance steps. As of August 2023, 556 institutions have agreements in place, allowing more than 5,000 approved researchers to actively work on more than 4,400 projects. The median time for a researcher from initial registration to completion of these requirements is 28.6 h (10th percentile: 48 min, 90th percentile: 14.9 days), a fraction of the weeks to months it can take to assemble a project-specific application and have it reviewed by an access board with conventional access models.

Given that the size of the project’s phenotypic and genomic dataset is expected to reach 4.75 PB in 2023, the use of a central data store and cloud analysis tools will save funders an estimated US$16.5 million per year when compared to the typical approach of allowing researchers to download genomic data. Storing one copy of these data per institution at the 556 registered institutions would cost about US$1.16 billion per year. By contrast, storing a central cloud copy costs about US$1.14 million per year, a 99.9% saving. Importantly, cloud infrastructure also democratizes data access, particularly for researchers who do not have high-performance local compute resources.
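
To give a sense of the arithmetic, here is a back-of-the-envelope check of the storage figures, using only the numbers quoted in the paragraph above:

```python
# Back-of-the-envelope check of the storage cost comparison quoted above.
institutions = 556
all_copies = 1.16e9  # US$ per year: one copy at each registered institution
central = 1.14e6     # US$ per year: a single central cloud copy

print(f"per-institution copy: ${all_copies / institutions:,.0f} per year")  # ~$2.1M
print(f"saving from a central copy: {1 - central / all_copies:.1%}")        # 99.9%
```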

Discussion

Here we present the All of Us Research Program’s approach to generating diverse clinical-grade genomic data at an unprecedented scale. We present the data release of about 245,000 genome sequences as part of a scalable framework that will grow to include genetic information and health data for one million or more people living across the USA. Our observations permit several conclusions.

First, the All of Us programme is making a notable contribution to improving the study of human biology through purposeful inclusion of under-represented individuals at scale 29 , 30 . Of the participants with genomic data in All of Us, 45.92% self-identified as a non-European race or ethnicity. This diversity enabled identification of more than 275 million new genetic variants across the dataset not previously captured by other large-scale genome aggregation efforts with diverse participants that have submitted variation to dbSNP v153, such as NHLBI TOPMed 31 freeze 8 (Extended Data Table 1 ). In contrast to gnomAD, All of Us permits individual-level genotype access with detailed phenotype data for all participants. Furthermore, unlike many genomics resources, All of Us is uniformly consented for general research use and enables researchers to go from initial account creation to individual-level data access in as little as a few hours. The All of Us cohort is significantly more diverse than those of other large contemporary research studies generating WGS data 32 , 33 . This enables a more equitable future for precision medicine (for example, through constructing polygenic risk scores that are appropriately calibrated to diverse populations 34 , 35 as the eMERGE programme has done leveraging All of Us data 36 , 37 ). Developing new tools and regulatory frameworks to enable analyses across multiple biobanks in the cloud to harness the unique strengths of each is an active area of investigation addressed in a companion paper to this work 38 .

Second, the All of Us Researcher Workbench embodies the programme’s design philosophy of open science, reproducible research, equitable access and transparency to researchers and to research participants 26 . Importantly, for research studies, no group of data users should have privileged access to All of Us resources based on anything other than data protection criteria. Although the All of Us Researcher Workbench initially targeted onboarding US academic, health care and non-profit organizations, it has recently expanded to international researchers. We anticipate further genomic and phenotypic data releases at regular intervals with data available to all researcher communities. We also anticipate additional derived data and functionality to be made available, such as reference data, structural variants and a service for array imputation using the All of Us genomic data.

Third, All of Us enables studying human biology at an unprecedented scale. The programmatic goal of sequencing one million or more genomes has required harnessing the output of multiple sequencing centres. Previous work has focused on achieving functional equivalence in data processing and joint calling pipelines 39. To achieve clinical-grade data equivalence, All of Us required protocol equivalence at both the sequencing production level and in data processing across the sequencing centres. Furthermore, previous work has demonstrated the value of joint calling at scale 10, 18. The new GVS framework developed by the All of Us programme enables joint calling at extreme scales (Code availability). Finally, the provision of data access through cloud-native tools gives researchers scalable and secure access and analysis while maintaining the trust of research participants through the transparency underlying the All of Us data passport access model.

The clinical-grade sequencing carried out by All of Us enables not only research but also the return of value to participants through clinically relevant genetic results and health-related traits for those who opt in to receiving this information. In the years ahead, we anticipate that this partnership with All of Us participants will enable researchers to move beyond large-scale genomic discovery to understanding the consequences of implementing genomic medicine at scale.

Methods

The All of Us cohort

All of Us aims to engage a longitudinal cohort of one million or more US participants, with a focus on including populations that have historically been under-represented in biomedical research. Details of the All of Us cohort have been described previously 5 . Briefly, the primary objective is to build a robust research resource that can facilitate the exploration of biological, clinical, social and environmental determinants of health and disease. The programme will collect and curate health-related data and biospecimens, and these data and biospecimens will be made broadly available for research uses. Health data are obtained through the electronic medical record and through participant surveys. Survey templates can be found on our public website: https://www.researchallofus.org/data-tools/survey-explorer/ . Adults 18 years and older who have the capacity to consent and reside in the USA or a US territory at present are eligible. Informed consent for all participants is conducted in person or through an eConsent platform that includes primary consent, HIPAA Authorization for Research use of EHRs and other external health data, and Consent for Return of Genomic Results. The protocol was reviewed by the Institutional Review Board (IRB) of the All of Us Research Program. The All of Us IRB follows the regulations and guidance of the NIH Office for Human Research Protections for all studies, ensuring that the rights and welfare of research participants are overseen and protected uniformly.

Data accessibility through a ‘data passport’

Authorization for access to participant-level data in All of Us is based on a ‘data passport’ model, through which authorized researchers do not need IRB review for each research project. The data passport is required for gaining data access to the Researcher Workbench and for creating workspaces to carry out research projects using All of Us data. At present, data passports are authorized through a six-step process that includes affiliation with an institution that has signed a Data Use and Registration Agreement, account creation, identity verification, completion of ethics training, and attestation to a data user code of conduct. Results reported here follow the All of Us Data and Statistics Dissemination Policy, which, to protect participant privacy, disallows disclosure of group counts under 20 without prior approval 40.

At present, All of Us gathers EHR data from about 50 health care organizations that are funded to recruit and enrol participants as well as transfer EHR data for those participants who have consented to provide them. Data stewards at each provider organization harmonize their local data to the Observational Medical Outcomes Partnership (OMOP) Common Data Model, and then submit it to the All of Us Data and Research Center (DRC) so that it can be linked with other participant data and further curated for research use. OMOP is a common data model standardizing health information from disparate EHRs to common vocabularies and organized into tables according to data domains. EHR data are updated from the recruitment sites and sent to the DRC quarterly. Updated data releases to the research community occur approximately once a year. Supplementary Table 6 outlines the OMOP concepts collected by the DRC quarterly from the recruitment sites.
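
To make the OMOP structure concrete, here is a sketch of the kind of query a researcher might run against standard OMOP CDM tables (person, condition_occurrence). The dataset name is a placeholder, the concept ID is only illustrative, and the use of the BigQuery client is an assumption based on the Workbench's cloud environment:

```python
from google.cloud import bigquery

# Assumes application-default credentials and a default project are configured.
client = bigquery.Client()

# Hypothetical query against standard OMOP CDM tables; `my_cdr` stands in
# for the real dataset name.
sql = """
SELECT p.person_id, p.year_of_birth, c.condition_start_date
FROM `my_cdr.person` AS p
JOIN `my_cdr.condition_occurrence` AS c USING (person_id)
WHERE c.condition_concept_id = 201826  -- OMOP concept commonly used for type 2 diabetes
"""
df = client.query(sql).to_dataframe()
print(df.head())
```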

Biospecimen collection and processing

Participants who consented to participate in All of Us donated fresh whole blood (4 ml EDTA and 10 ml EDTA) as a primary source of DNA. The All of Us Biobank managed by the Mayo Clinic extracted DNA from 4 ml EDTA whole blood, and DNA was stored at −80 °C at an average concentration of 150 ng µl −1 . The buffy coat isolated from 10 ml EDTA whole blood has been used for extracting DNA in the case of initial extraction failure or absence of 4 ml EDTA whole blood. The Biobank plated 2.4 µg DNA with a concentration of 60 ng µl −1 in duplicate for array and WGS samples. The samples are distributed to All of Us Genome Centers weekly, and a negative (empty well) control and National Institute of Standards and Technology controls are incorporated every two months for QC purposes.

Genome Center sample receipt, accession and QC

On receipt of DNA sample shipments, the All of Us Genome Centers carry out an inspection of the packaging and sample containers to ensure that sample integrity has not been compromised during transport and to verify that the sample containers correspond to the shipping manifest. QC of the submitted samples also includes DNA quantification, using routine procedures to confirm volume and concentration (Supplementary Table 7 ). Any issues or discrepancies are recorded, and affected samples are put on hold until resolved. Samples that meet quality thresholds are accessioned in the Laboratory Information Management System, and sample aliquots are prepared for library construction processing (for example, normalized with respect to concentration and volume).

WGS library construction, sequencing and primary data QC

The DNA sample is first sheared using a Covaris sonicator and is then size-selected using AMPure XP beads to restrict the range of library insert sizes. Using the PCR Free Kapa HyperPrep library construction kit, enzymatic steps are completed to repair the jagged ends of DNA fragments, add proper A-base segments, and ligate indexed adapter barcode sequences onto samples. Excess adaptors are removed using AMPure XP beads for a final clean-up. Libraries are quantified using quantitative PCR with the Illumina Kapa DNA Quantification Kit and then normalized and pooled for sequencing (Supplementary Table 7 ).

Pooled libraries are loaded on the Illumina NovaSeq 6000 instrument. The data from the initial sequencing run are used to QC individual libraries and to remove non-conforming samples from the pipeline. The data are also used to calibrate the pooling volume of each individual library and re-pool the libraries for additional NovaSeq sequencing to reach an average coverage of 30×.

After demultiplexing, WGS analysis occurs on the Illumina DRAGEN platform. The DRAGEN pipeline consists of highly optimized algorithms for mapping, aligning, sorting, duplicate marking and haplotype variant calling and makes use of platform features such as compression and BCL conversion. Alignment uses the GRCh38dh reference genome. QC data are collected at every stage of the analysis protocol, providing high-resolution metrics required to ensure data consistency for large-scale multiplexing. The DRAGEN pipeline produces a large number of metrics that cover lane, library, flow cell, barcode and sample-level metrics for all runs as well as assessing contamination and mapping quality. The All of Us Genome Centers use these metrics to determine pass or fail for each sample before submitting the CRAM files to the All of Us DRC. For mapping and variant calling, all Genome Centers have harmonized on a set of DRAGEN parameters, which ensures consistency in processing (Supplementary Table 2 ).

Every step through the WGS procedure is rigorously controlled by predefined QC measures. Various control mechanisms and acceptance criteria were established during WGS assay validation. Specific metrics for reviewing and releasing genome data are: mean coverage (threshold of ≥30×), genome coverage (threshold of ≥90% at 20×), coverage of hereditary disease risk genes (threshold of ≥95% at 20×), aligned Q30 bases (threshold of ≥8 × 10 10 ), contamination (threshold of ≤1%) and concordance to independently processed array data.
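
These release criteria are simple threshold checks. The sketch below shows one way they might be encoded; the metric field names are hypothetical, and the thresholds are those quoted above:

```python
# Illustrative encoding of the WGS release criteria listed above.
# Field names are hypothetical; thresholds are those quoted in the text.
def passes_release_qc(m: dict) -> bool:
    return (
        m["mean_coverage"] >= 30                # mean coverage >= 30x
        and m["pct_genome_at_20x"] >= 0.90      # >= 90% of genome at 20x
        and m["pct_hdr_genes_at_20x"] >= 0.95   # hereditary disease risk genes at 20x
        and m["aligned_q30_bases"] >= 8e10      # aligned Q30 bases
        and m["contamination"] <= 0.01          # contamination <= 1%
        and m["array_concordant"]               # concordance with array data
    )

sample = {"mean_coverage": 32.1, "pct_genome_at_20x": 0.93,
          "pct_hdr_genes_at_20x": 0.97, "aligned_q30_bases": 9.2e10,
          "contamination": 0.002, "array_concordant": True}
print(passes_release_qc(sample))  # True
```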

Array genotyping

Samples are processed for genotyping at three All of Us Genome Centers (Broad, Johns Hopkins University and University of Washington). DNA samples are received from the Biobank and the process is facilitated by the All of Us genomics workflow described above. All three centres used an identical array product, scanners, resource files and genotype calling software for array processing to reduce batch effects. Each centre has its own Laboratory Information Management System that manages workflow control, sample and reagent tracking, and centre-specific liquid handling robotics.

Samples are processed using the Illumina Global Diversity Array (GDA) with Illumina Infinium LCG chemistry using the automated protocol and scanned on Illumina iSCANs with Automated Array Loaders. Illumina IAAP software converts raw data (IDAT files; 2 per sample) into a single GTC file per sample using the BPM file (defines strand, probe sequences and illumicode address) and the EGT file (defines the relationship between intensities and genotype calls). Files used for this data release are: GDA-8v1-0_A5.bpm, GDA-8v1-0_A1_ClusterFile.egt, gentrain v3, reference hg19 and gencall cutoff 0.15. The GDA array assays a total of 1,914,935 variant positions including 1,790,654 single-nucleotide variants, 44,172 indels, 9,935 intensity-only probes for CNV calling, and 70,174 duplicates (same position, different probes). Picard GtcToVcf is used to convert the GTC files to VCF format. Resulting VCF and IDAT files are submitted to the DRC for ingestion and further processing. The VCF file contains assay name, chromosome, position, genotype calls, quality score, raw and normalized intensities, B allele frequency and log R ratio values. Each genome centre runs the GDA array under Clinical Laboratory Improvement Amendments-compliant protocols. The GTC files are parsed and metrics are uploaded to in-house Laboratory Information Management Systems for QC review.

At batch level (each set of 96-well plates run together in the laboratory at one time), each genome centre includes positive control samples that are required to have >98% call rate and >99% concordance to existing data to approve release of the batch of data. At the sample level, the call rate and sex are the key QC determinants 41 . Contamination is also measured using BAFRegress 42 and reported out as metadata. Any sample with a call rate below 98% is repeated one time in the laboratory. Genotyped sex is determined by plotting normalized x versus normalized y intensity values for a batch of samples. Any sample discordant with ‘sex at birth’ reported by the All of Us participant is flagged for further detailed review and repeated one time in the laboratory. If several sex-discordant samples are clustered on an array or on a 96-well plate, the entire array or plate will have data production repeated. Samples identified with sex chromosome aneuploidies are also reported back as metadata (XXX, XXY, XYY and so on). A final processing status of ‘pass’, ‘fail’ or ‘abandon’ is determined before release of data to the All of Us DRC. An array sample will pass if the call rate is >98% and the genotyped sex and sex at birth are concordant (or the sex at birth is not applicable). An array sample will fail if the genotyped sex and the sex at birth are discordant. An array sample will have the status of abandon if the call rate is <98% after at least two attempts at the genome centre.
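
The pass/fail/abandon rules above amount to a small decision procedure; a minimal sketch of that logic (illustrative only) is:

```python
# Illustrative pass/fail/abandon logic for array samples, per the rules above.
def array_status(call_rate: float, sex_concordant: bool,
                 sex_not_applicable: bool, attempts: int) -> str:
    if call_rate > 0.98:
        if sex_concordant or sex_not_applicable:
            return "pass"
        return "fail"      # genotyped sex discordant with sex at birth
    if attempts >= 2:
        return "abandon"   # call rate still <= 98% after two attempts
    return "repeat"        # rerun once in the laboratory

print(array_status(0.991, True, False, 1))   # pass
print(array_status(0.975, True, False, 2))   # abandon
```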

Data from the arrays are used for participant return of genetic ancestry and non-health-related traits for those who consent, and they are also used to facilitate additional QC of the matched WGS data. Contamination is assessed in the array data to determine whether DNA re-extraction is required before WGS. Re-extraction is prompted by level of contamination combined with consent status for return of results. The arrays are also used to confirm sample identity between the WGS data and the matched array data by assessing concordance at 100 unique sites. To establish concordance, a fingerprint file of these 100 sites is provided to the Genome Centers to assess concordance with the same sites in the WGS data before CRAM submission.

Genomic data curation

As seen in Extended Data Fig. 2 , we generate a joint call set for all WGS samples and make these data available in their entirety and by sample subsets to researchers. A breakdown of the frequencies, stratified by computed ancestries for which we had more than 10,000 participants can be found in Extended Data Fig. 3 . The joint call set process allows us to leverage information across samples to improve QC and increase accuracy.

Single-sample QC

If a sample fails single-sample QC, it is excluded from the release and is not reported in this document. These tests detect sample swaps, cross-individual contamination and sample preparation errors. In some cases, we carry out these tests twice (at both the Genome Center and the DRC), for two reasons: to confirm internal consistency between sites; and to mark samples as passing (or failing) QC on the basis of the research pipeline criteria. The single-sample QC process accepts a higher contamination rate than the clinical pipeline (0.03 for the research pipeline versus 0.01 for the clinical pipeline), but otherwise uses identical thresholds. The list of specific QC processes, passing criteria, error modes addressed and an overview of the results can be found in Supplementary Table 3 .

Joint call set QC

During joint calling, we carry out additional QC steps using information that is available across samples including hard thresholds, population outliers, allele-specific filters, and sensitivity and precision evaluation. Supplementary Table 4 summarizes both the steps that we took and the results obtained for the WGS data. More detailed information about the methods and specific parameters can be found in the All of Us Genomic Research Data Quality Report 36 .

Batch effect analysis

We analysed cross-sequencing-centre batch effects in the joint call set. To quantify the batch effect, we calculated Cohen’s d (ref. 43) for four metrics (insertion/deletion ratio, single-nucleotide polymorphism count, indel count and single-nucleotide polymorphism transition/transversion ratio) across the three genome sequencing centres (Baylor College of Medicine, Broad Institute and University of Washington), stratified by computed ancestry and by region of the genome (whole genome, high-confidence calling, repetitive, GC content of >0.85, GC content of <0.15, low mappability, the ACMG59 genes and regions of large duplications (>1 kb)). Using random batches as a control set, all comparisons had a Cohen’s d of <0.35. Here we report any Cohen’s d results >0.5, a threshold chosen before this analysis and conventionally regarded as a medium effect size 44.

We found an effect in indel counts (Cohen’s d of 0.53) across the entire genome between the Broad Institute and the University of Washington, but this was driven by repetitive and low-mappability regions. We found no batch effects with Cohen’s d of >0.5 in the ratio metrics or in any metrics in the high-confidence calling, low or high GC content, or ACMG59 regions. A complete list of the batch effects with Cohen’s d of >0.5 is given in Supplementary Table 8.
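
For reference, Cohen's d is the difference in group means divided by the pooled standard deviation; a standard implementation, applied here to toy per-sample metric values, is:

```python
import numpy as np

# Cohen's d between two batches of a per-sample metric (e.g. indel count).
def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(0)
batch_a = rng.normal(1000, 50, 500)  # hypothetical per-sample indel counts
batch_b = rng.normal(1020, 50, 500)
print(round(cohens_d(batch_b, batch_a), 2))  # ~0.4 for this toy example
```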

Sensitivity and precision evaluation

To determine sensitivity and precision, we included four well-characterized National Institute of Standards and Technology Genome in a Bottle control samples (HG-001, HG-003, HG-004 and HG-005). The samples were sequenced with the same protocol as All of Us samples. Of note, these control samples were not included in the data released to researchers. We used the corresponding published set of variant calls for each sample as the ground truth in our sensitivity and precision calculations, restricted to the high-confidence calling region defined by Genome in a Bottle v4.2.1. To be called a true positive, a variant must match the chromosome, position, reference allele, alternate allele and zygosity. At sites with multiple alternative alleles, each alternative allele is considered separately. Sensitivity and precision results are reported in Supplementary Table 5.
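
Schematically, with variants keyed by chromosome, position, reference allele, alternate allele and zygosity, sensitivity and precision reduce to set comparisons against the truth set, as in this toy sketch:

```python
# Sensitivity = TP / (TP + FN); precision = TP / (TP + FP).
# A call is a true positive only if chromosome, position, ref, alt and
# zygosity all match the truth set.
def sensitivity_precision(calls: set, truth: set) -> tuple[float, float]:
    tp = len(calls & truth)
    fp = len(calls - truth)
    fn = len(truth - calls)
    return tp / (tp + fn), tp / (tp + fp)

truth = {("chr1", 1001, "A", "G", "het"), ("chr1", 2050, "C", "T", "hom")}
calls = {("chr1", 1001, "A", "G", "het"), ("chr1", 3000, "G", "A", "het")}
sens, prec = sensitivity_precision(calls, truth)
print(f"sensitivity={sens:.2f}, precision={prec:.2f}")  # 0.50, 0.50
```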

Genetic ancestry inference

We computed categorical ancestry for all WGS samples in All of Us and made these available to researchers. These predictions are also the basis for population allele frequency calculations in the Genomic Variants section of the public Data Browser. We used the high-quality set of sites to determine an ancestry label for each sample. The ancestry categories are based on the same labels used in gnomAD 18, the Human Genome Diversity Project (HGDP) 45 and 1000 Genomes 1: African (AFR); Latino/admixed American (AMR); East Asian (EAS); Middle Eastern (MID); European (EUR), composed of Finnish (FIN) and Non-Finnish European (NFE); Other (OTH), for samples not belonging to one of the other ancestries or that are an admixture; and South Asian (SAS).

We trained a random forest classifier 46 on a training set of autosomal variants from the HGDP and 1000 Genomes samples, obtained from gnomAD 11. We generated the first 16 principal components (PCs) of the training sample genotypes (using hwe_normalized_pca in Hail) at the high-quality variant sites for use as the feature vector for each training sample. We used the truth labels from the sample metadata, which can be found alongside the VCFs. Note that we did not train the classifier on the samples labelled as Other; instead, we use the label probabilities (‘confidence’) of the classifier on the other ancestries to assign the Other category.

To determine the ancestry of All of Us samples, we project the All of Us samples into the PCA space of the training data and apply the classifier. As a proxy for the accuracy of our All of Us predictions, we look at the concordance between the survey results and the predicted ancestry. The concordance between self-reported ethnicity and the ancestry predictions was 87.7%.
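
A minimal sketch of this classify-by-PCs approach using scikit-learn is shown below. The random arrays are placeholders for the first 16 PCs of the reference and projected All of Us samples, and the 0.75 probability cutoff for assigning OTH is an assumed value rather than the programme's actual threshold (the production pipeline runs on Hail):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
train_pcs = rng.normal(size=(3000, 16))   # placeholder reference PCs (HGDP + 1000G)
train_labels = rng.choice(["AFR", "AMR", "EAS", "EUR", "MID", "SAS"], 3000)
aou_pcs = rng.normal(size=(500, 16))      # placeholder PCs projected into the same space

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train_pcs, train_labels)

# Low-confidence predictions fall back to OTH, mirroring the use of the
# classifier's label probabilities to assign the Other category.
proba = clf.predict_proba(aou_pcs)
pred = np.where(proba.max(axis=1) >= 0.75,  # assumed cutoff, for illustration
                clf.classes_[proba.argmax(axis=1)], "OTH")
print(pred[:10])
```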

PC data from All of Us samples and the HGDP and 1000 Genomes samples were used to compute individual participant genetic ancestry fractions for All of Us samples using the Rye program. Rye uses PC data to carry out rapid and accurate genetic ancestry inference on biobank-scale datasets 47 . HGDP and 1000 Genomes reference samples were used to define a set of six distinct and coherent ancestry groups—African, East Asian, European, Middle Eastern, Latino/admixed American and South Asian—corresponding to participant self-identified race and ethnicity groups. Rye was run on the first 16 PCs, using the defined reference ancestry groups to assign ancestry group fractions to individual All of Us participant samples.

Relatedness

We calculated the kinship score using the Hail pc_relate function and reported any pairs with a kinship score above 0.1. The kinship score is half of the fraction of genetic material shared (it ranges from 0.0 to 0.5). We then determined the maximal independent set 41 for related samples, identifying a maximally unrelated set of 231,442 samples (94%) at a kinship score threshold of 0.1.
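
A sketch of this step with Hail's public pc_relate and maximal_independent_set functions (the file path and parameter values are illustrative):

```python
import hail as hl

hl.init()
mt = hl.read_matrix_table("aou_wgs.mt")  # hypothetical path

# Kinship via PC-Relate; scores range from 0.0 to 0.5.
rel = hl.pc_relate(mt.GT, min_individual_maf=0.01, k=10, statistics="kin")
related = rel.filter(rel.kin > 0.1)

# Remove the minimum number of samples needed so that no pair with
# kinship > 0.1 remains.
to_remove = hl.maximal_independent_set(related.i, related.j, keep=False)
unrelated = mt.filter_cols(hl.is_defined(to_remove[mt.col_key]), keep=False)
```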

LDL-C common variant GWAS

The phenotypic data were extracted from the Curated Data Repository (CDR, Controlled Tier Dataset v7) in the All of Us Researcher Workbench. The All of Us Cohort Builder and Dataset Builder were used to extract all LDL cholesterol measurements from the Lab and Measurements criteria in EHR data for all participants who have WGS data. The most recent measurement was selected as the phenotype and adjusted for statin use 19, age and sex. A rank-based inverse normal transformation was applied to this continuous trait to increase power and deflate type I error. Analysis was carried out on the Hail MatrixTable representation of the All of Us WGS joint-called data, removing monomorphic variants, variants with a call rate of <95% and variants with extreme Hardy–Weinberg equilibrium values (P < 10⁻¹⁵). A linear regression was carried out with REGENIE 48 on variants with a minor allele frequency >5%, further adjusting for relatedness and the first five ancestry PCs. The final analysis included 34,924 participants and 8,589,520 variants.
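
The rank-based inverse normal transformation replaces each phenotype value with the normal quantile of its rank. A common form uses the Blom offset, sketched below; the exact offset used in this analysis is not specified above, so treat this as an illustration:

```python
import numpy as np
from scipy.stats import norm, rankdata

# Rank-based inverse normal transformation (Blom offset c = 3/8).
def inverse_normal_transform(x: np.ndarray, c: float = 3 / 8) -> np.ndarray:
    ranks = rankdata(x)  # average ranks for ties
    return norm.ppf((ranks - c) / (len(x) - 2 * c + 1))

ldl = np.array([77.0, 103.0, 150.0, 150.0, 210.0])
print(inverse_normal_transform(ldl).round(2))
```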

Genotype-by-phenotype replication

We tested replication rates of known phenotype–genotype associations in three of the four largest populations: EUR, AFR and EAS. The AMR population was not included because it has no registered GWAS. This method is a conceptual extension of the original GWAS × phenome-wide association study, which replicated 66% of powered associations in a single EHR-linked biobank 49. The PGRM is an expansion of this work by Bastarache et al., based on associations in the GWAS catalogue 50 as of June 2020 (ref. 51). After directly matching the Experimental Factor Ontology terms to phecodes, the authors identified 8,085 unique loci and 170 unique phecodes that compose the PGRM, and showed replication rates in several EHR-linked biobanks ranging from 76% to 85%. For this analysis, we used the EUR-, AFR- and EAS-based maps, considering only catalogue associations that reached P < 5 × 10⁻⁸.

The main tools used were the Python package Hail for data extraction, plink for genomic associations, and the R packages PheWAS and pgrm for further analysis and visualization. The phenotypes, participant-reported sex at birth and year of birth were extracted from the All of Us CDR (Controlled Tier Dataset v7). These phenotypes were then loaded into a plink-compatible format using the PheWAS package, and related samples were removed by subsetting to the maximally unrelated dataset (n = 231,442). Only samples with EHR data were kept, filtered by the selected loci, and annotated with demographic and phenotypic information extracted from the CDR and with the ancestry predictions provided by All of Us, ultimately resulting in 181,345 participants for downstream analysis. The variants in the PGRM were filtered by a minimum population-specific allele frequency of >1% or a population-specific allele count of >100, leaving 4,986 variants. Results with at least 20 cases in the ancestry group were included. A series of Firth logistic regression tests was then carried out with phecodes as the outcome and variants as the predictor, adjusting for age, sex (for non-sex-specific phenotypes) and the first three genomic PCs as covariates. The PGRM was annotated with power calculations based on the case counts and reported allele frequencies; associations with power of 80% or greater were considered powered for this analysis.
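
Schematically, the association loop looks like the sketch below. For simplicity it fits ordinary logistic regression with statsmodels, whereas the analysis above used Firth (penalized-likelihood) logistic regression, typically via R's logistf; all data here are synthetic and the column names are invented:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic layout: one row per participant; covariates are numeric
# (sex encoded 0/1) and each phecode column is a 0/1 case indicator.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "dosage": rng.binomial(2, 0.3, n).astype(float),
    "age": rng.uniform(20, 80, n), "sex": rng.integers(0, 2, n),
    "PC1": rng.normal(size=n), "PC2": rng.normal(size=n), "PC3": rng.normal(size=n),
    "phecode_250_2": rng.binomial(1, 0.1, n),  # invented phecode column
})

covars = ["dosage", "age", "sex", "PC1", "PC2", "PC3"]
results = []
for code in ["phecode_250_2"]:
    X = sm.add_constant(df[covars])
    fit = sm.Logit(df[code], X).fit(disp=0)  # Firth regression in the actual analysis
    results.append({"phecode": code, "beta": fit.params["dosage"],
                    "p": fit.pvalues["dosage"]})
print(pd.DataFrame(results))
```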

Reporting summary

Further information on research design is available in the  Nature Portfolio Reporting Summary linked to this article.

Data availability

The All of Us Research Hub has a tiered data access data passport model with three data access tiers. The Public Tier dataset contains only aggregate data with identifiers removed. These data are available to the public through Data Snapshots ( https://www.researchallofus.org/data-tools/data-snapshots/ ) and the public Data Browser ( https://databrowser.researchallofus.org/ ). The Registered Tier curated dataset contains individual-level data, available only to approved researchers on the Researcher Workbench. At present, the Registered Tier includes data from EHRs, wearables and surveys, as well as physical measurements taken at the time of participant enrolment. The Controlled Tier dataset contains all data in the Registered Tier and additionally genomic data in the form of WGS and genotyping arrays, previously suppressed demographic data fields from EHRs and surveys, and unshifted dates of events. At present, Registered Tier and Controlled Tier data are available to researchers at academic institutions, non-profit institutions, and both non-profit and for-profit health care institutions. Work is underway to begin extending access to additional audiences, including industry-affiliated researchers. Researchers have the option to register for Registered Tier and/or Controlled Tier access by completing the All of Us Researcher Workbench access process, which includes identity verification and All of Us-specific training in research involving human participants ( https://www.researchallofus.org/register/ ). Researchers may create a new workspace at any time to conduct any research study, provided that they comply with all Data Use Policies and self-declare their research purpose. This information is made accessible publicly on the All of Us Research Projects Directory at https://allofus.nih.gov/protecting-data-and-privacy/research-projects-all-us-data .

Code availability

The GVS code is available at https://github.com/broadinstitute/gatk/tree/ah_var_store/scripts/variantstore . The LDL GWAS pipeline is available as a demonstration project in the Featured Workspace Library on the Researcher Workbench ( https://workbench.researchallofus.org/workspaces/aou-rw-5981f9dc/aouldlgwasregeniedsubctv6duplicate/notebooks ).

The 1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature 526 , 68–74 (2015).

Claussnitzer, M. et al. A brief history of human disease genetics. Nature 577 , 179–189 (2020).

Wojcik, G. L. et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature 570 , 514–518 (2019).

Lewis, A. C. F. et al. Getting genetic ancestry right for science and society. Science 376 , 250–252 (2022).

All of Us Program Investigators. The “All of Us” Research Program. N. Engl. J. Med. 381 , 668–676 (2019).

Ramirez, A. H., Gebo, K. A. & Harris, P. A. Progress with the All of Us Research Program: opening access for researchers. JAMA 325 , 2441–2442 (2021).

Ramirez, A. H. et al. The All of Us Research Program: data quality, utility, and diversity. Patterns 3 , 100570 (2022).

Overhage, J. M., Ryan, P. B., Reich, C. G., Hartzema, A. G. & Stang, P. E. Validation of a common data model for active safety surveillance research. J. Am. Med. Inform. Assoc. 19 , 54–60 (2012).

Venner, E. et al. Whole-genome sequencing as an investigational device for return of hereditary disease risk and pharmacogenomic results as part of the All of Us Research Program. Genome Med. 14 , 34 (2022).

Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536 , 285–291 (2016).

Tiao, G. & Goodrich, J. gnomAD v3.1 New Content, Methods, Annotations, and Data Availability ; https://gnomad.broadinstitute.org/news/2020-10-gnomad-v3-1-new-content-methods-annotations-and-data-availability/ .

Chen, S. et al. A genomic mutational constraint map using variation in 76,156 human genomes. Nature 625, 92–100 (2024).

Zook, J. M. et al. An open resource for accurately benchmarking small variant and reference calls. Nat. Biotechnol. 37 , 561–566 (2019).

Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37 , 555–560 (2019).

Stromberg, M. et al. Nirvana: clinical grade variant annotator. In Proc. 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics 596 (Association for Computing Machinery, 2017).

Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29 , 308–311 (2001).

Venner, E. et al. The frequency of pathogenic variation in the All of Us cohort reveals ancestry-driven disparities. Commun. Biol. https://doi.org/10.1038/s42003-023-05708-y (2024).

Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).

Selvaraj, M. S. et al. Whole genome sequence analysis of blood lipid levels in >66,000 individuals. Nat. Commun. 13 , 5995 (2022).

Wang, X. et al. Common and rare variants associated with cardiometabolic traits across 98,622 whole-genome sequences in the All of Us research program. J. Hum. Genet. 68 , 565–570 (2023).

Bastarache, L. et al. The phenotype-genotype reference map: improving biobank data science through replication. Am. J. Hum. Genet. 110 , 1522–1533 (2023).

Bianchi, D. W. et al. The All of Us Research Program is an opportunity to enhance the diversity of US biomedical research. Nat. Med. https://doi.org/10.1038/s41591-023-02744-3 (2024).

Van Driest, S. L. et al. Association between a common, benign genotype and unnecessary bone marrow biopsies among African American patients. JAMA Intern. Med. 181 , 1100–1105 (2021).

Chen, M.-H. et al. Trans-ethnic and ancestry-specific blood-cell genetics in 746,667 individuals from 5 global populations. Cell 182 , 1198–1213 (2020).

Chiou, J. et al. Interpreting type 1 diabetes risk with genetics and single-cell epigenomics. Nature 594 , 398–402 (2021).

Hu, X. et al. Additive and interaction effects at three amino acid positions in HLA-DQ and HLA-DR molecules drive type 1 diabetes risk. Nat. Genet. 47 , 898–905 (2015).

Grant, S. F. A. et al. Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nat. Genet. 38 , 320–323 (2006).

All of Us Research Program. Framework for Access to All of Us Data Resources v1.1 (2021); https://www.researchallofus.org/wp-content/themes/research-hub-wordpress-theme/media/data&tools/data-access-use/AoU_Data_Access_Framework_508.pdf .

Abul-Husn, N. S. & Kenny, E. E. Personalized medicine and the power of electronic health records. Cell 177 , 58–69 (2019).

Mapes, B. M. et al. Diversity and inclusion for the All of Us research program: A scoping review. PLoS ONE 15 , e0234962 (2020).

Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590 , 290–299 (2021).

Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562 , 203–209 (2018).

Halldorsson, B. V. et al. The sequences of 150,119 genomes in the UK Biobank. Nature 607 , 732–740 (2022).

Kurniansyah, N. et al. Evaluating the use of blood pressure polygenic risk scores across race/ethnic background groups. Nat. Commun. 14 , 3202 (2023).

Hou, K. et al. Causal effects on complex traits are similar for common variants across segments of different continental ancestries within admixed individuals. Nat. Genet. 55, 549–558 (2023).

Linder, J. E. et al. Returning integrated genomic risk and clinical recommendations: the eMERGE study. Genet. Med. 25 , 100006 (2023).

Lennon, N. J. et al. Selection, optimization and validation of ten chronic disease polygenic risk scores for clinical implementation in diverse US populations. Nat. Med. https://doi.org/10.1038/s41591-024-02796-z (2024).

Deflaux, N. et al. Demonstrating paths for unlocking the value of cloud genomics through cross cohort analysis. Nat. Commun. 14 , 5419 (2023).

Regier, A. A. et al. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. Nat. Commun. 9 , 4038 (2018).

All of Us Research Program. Data and Statistics Dissemination Policy (2020); https://www.researchallofus.org/wp-content/themes/research-hub-wordpress-theme/media/2020/05/AoU_Policy_Data_and_Statistics_Dissemination_508.pdf .

Laurie, C. C. et al. Quality control and quality assurance in genotypic data for genome-wide association studies. Genet. Epidemiol. 34 , 591–602 (2010).

Jun, G. et al. Detecting and estimating contamination of human DNA samples in sequencing and array-based genotype data. Am. J. Hum. Genet. 91 , 839–848 (2012).

Cohen, J. Statistical Power Analysis for the Behavioral Sciences (Routledge, 2013).

Andrade, C. Mean difference, standardized mean difference (SMD), and their use in meta-analysis. J. Clin. Psychiatry 81 , 20f13681 (2020).

Cavalli-Sforza, L. L. The Human Genome Diversity Project: past, present and future. Nat. Rev. Genet. 6 , 333–340 (2005).

Ho, T. K. Random decision forests. In Proc. 3rd International Conference on Document Analysis and Recognition (IEEE Computer Society Press, 1995).

Conley, A. B. et al. Rye: genetic ancestry inference at biobank scale. Nucleic Acids Res. 51 , e44 (2023).

Mbatchou, J. et al. Computationally efficient whole-genome regression for quantitative and binary traits. Nat. Genet. 53 , 1097–1103 (2021).

Denny, J. C. et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 31, 1102–1111 (2013).

Buniello, A. et al. The NHGRI-EBI GWAS catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47 , D1005–D1012 (2019).

Bastarache, L. et al. The phenotype-genotype reference map: improving biobank data science through replication. Am. J. Hum. Genet. 110, 1522–1533 (2023).

Acknowledgements

The All of Us Research Program is supported by the National Institutes of Health, Office of the Director: Regional Medical Centers (OT2 OD026549; OT2 OD026554; OT2 OD026557; OT2 OD026556; OT2 OD026550; OT2 OD 026552; OT2 OD026553; OT2 OD026548; OT2 OD026551; OT2 OD026555); Inter agency agreement AOD 16037; Federally Qualified Health Centers HHSN 263201600085U; Data and Research Center: U2C OD023196; Genome Centers (OT2 OD002748; OT2 OD002750; OT2 OD002751); Biobank: U24 OD023121; The Participant Center: U24 OD023176; Participant Technology Systems Center: U24 OD023163; Communications and Engagement: OT2 OD023205; OT2 OD023206; and Community Partners (OT2 OD025277; OT2 OD025315; OT2 OD025337; OT2 OD025276). In addition, the All of Us Research Program would not be possible without the partnership of its participants. All of Us and the All of Us logo are service marks of the US Department of Health and Human Services. E.E.E. is an investigator of the Howard Hughes Medical Institute. We acknowledge the foundational contributions of our friend and colleague, the late Deborah A. Nickerson. Debbie’s years of insightful contributions throughout the formation of the All of Us genomics programme are permanently imprinted, and she shares credit for all of the successes of this programme.

Author information

Authors and Affiliations

Division of Genetic Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA

Alexander G. Bick & Henry R. Condon

Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA

Ginger A. Metcalf, Eric Boerwinkle, Richard A. Gibbs, Donna M. Muzny, Eric Venner, Kimberly Walker, Jianhong Hu, Harsha Doddapaneni, Christie L. Kovar, Mullai Murugan, Shannon Dugan & Ziad Khan

Vanderbilt Institute of Clinical and Translational Research, Vanderbilt University Medical Center, Nashville, TN, USA

Kelsey R. Mayo, Jodell E. Linder, Melissa Basford, Ashley Able, Ashley E. Green, Robert J. Carroll, Jennifer Zhang & Yuanyuan Wang

Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, MA, USA

Lee Lichtenstein, Anthony Philippakis, Sophie Schwartz, M. Morgan T. Aster, Kristian Cibulskis, Andrea Haessly, Rebecca Asch, Aurora Cremer, Kylee Degatano, Akum Shergill, Laura D. Gauthier, Samuel K. Lee, Aaron Hatcher, George B. Grant, Genevieve R. Brandt, Miguel Covarrubias, Eric Banks & Wail Baalawi

Verily, South San Francisco, CA, USA

Shimon Rura, David Glazer, Moira K. Dillon & C. H. Albach

Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA

Robert J. Carroll, Paul A. Harris & Dan M. Roden

All of Us Research Program, National Institutes of Health, Bethesda, MD, USA

Anjene Musick, Andrea H. Ramirez, Sokny Lim, Siddhartha Nambiar, Bradley Ozenberger, Anastasia L. Wise, Chris Lunt, Geoffrey S. Ginsburg & Joshua C. Denny

School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA

I. King Jordan, Shashwat Deepali Nagar & Shivam Sharma

Neuroscience Institute, Institute of Translational Genomic Medicine, Morehouse School of Medicine, Atlanta, GA, USA

Robert Meller

Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, USA

Mine S. Cicek & Stephen N. Thibodeau

Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA

Kimberly F. Doheny, Michelle Z. Mawhinney, Sean M. L. Griffith, Elvin Hsu, Hua Ling & Marcia K. Adams

Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA

Evan E. Eichler, Joshua D. Smith, Christian D. Frazar, Colleen P. Davis, Karynne E. Patterson, Marsha M. Wheeler, Sean McGee, Mitzi L. Murray, Valeria Vasta, Dru Leistritz, Matthew A. Richardson, Aparna Radhakrishnan & Brenna W. Ehmen

Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA

Evan E. Eichler

Broad Institute of MIT and Harvard, Cambridge, MA, USA

Stacey Gabriel, Heidi L. Rehm, Niall J. Lennon, Christina Austin-Tse, Eric Banks, Michael Gatzen, Namrata Gupta, Katie Larsson, Sheli McDonough, Steven M. Harrison, Christopher Kachulis, Matthew S. Lebo, Seung Hoan Choi & Xin Wang

Division of Medical Genetics, Department of Medicine, University of Washington School of Medicine, Seattle, WA, USA

Gail P. Jarvik & Elisabeth A. Rosenthal

Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA

Dan M. Roden

Department of Pharmacology, Vanderbilt University Medical Center, Nashville, TN, USA

Center for Individualized Medicine, Biorepository Program, Mayo Clinic, Rochester, MN, USA

Stephen N. Thibodeau, Ashley L. Blegen, Samantha J. Wirkus, Victoria A. Wagner, Jeffrey G. Meyer & Mine S. Cicek

Color Health, Burlingame, CA, USA

Scott Topper, Cynthia L. Neben, Marcie Steeves & Alicia Y. Zhou

School of Public Health, University of Texas Health Science Center at Houston, Houston, TX, USA

Eric Boerwinkle

Laboratory for Molecular Medicine, Massachusetts General Brigham Personalized Medicine, Cambridge, MA, USA

Christina Austin-Tse, Emma Henricks & Matthew S. Lebo

Department of Laboratory Medicine and Pathology, University of Washington School of Medicine, Seattle, WA, USA

Christina M. Lockwood, Brian H. Shirts, Colin C. Pritchard, Jillian G. Buchan & Niklas Krumm

Manuscript Writing Group

  • Alexander G. Bick
  • , Ginger A. Metcalf
  • , Kelsey R. Mayo
  • , Lee Lichtenstein
  • , Shimon Rura
  • , Robert J. Carroll
  • , Anjene Musick
  • , Jodell E. Linder
  • , I. King Jordan
  • , Shashwat Deepali Nagar
  • , Shivam Sharma
  •  & Robert Meller

All of Us Research Program Genomics Principal Investigators

  • Melissa Basford
  • , Eric Boerwinkle
  • , Mine S. Cicek
  • , Kimberly F. Doheny
  • , Evan E. Eichler
  • , Stacey Gabriel
  • , Richard A. Gibbs
  • , David Glazer
  • , Paul A. Harris
  • , Gail P. Jarvik
  • , Anthony Philippakis
  • , Heidi L. Rehm
  • , Dan M. Roden
  • , Stephen N. Thibodeau
  •  & Scott Topper

Biobank, Mayo

  • Ashley L. Blegen
  • , Samantha J. Wirkus
  • , Victoria A. Wagner
  • , Jeffrey G. Meyer
  •  & Stephen N. Thibodeau

Genome Center: Baylor-Hopkins Clinical Genome Center

  • Donna M. Muzny
  • , Eric Venner
  • , Michelle Z. Mawhinney
  • , Sean M. L. Griffith
  • , Elvin Hsu
  • , Marcia K. Adams
  • , Kimberly Walker
  • , Jianhong Hu
  • , Harsha Doddapaneni
  • , Christie L. Kovar
  • , Mullai Murugan
  • , Shannon Dugan
  • , Ziad Khan
  •  & Richard A. Gibbs

Genome Center: Broad, Color, and Mass General Brigham Laboratory for Molecular Medicine

  • Niall J. Lennon
  • , Christina Austin-Tse
  • , Eric Banks
  • , Michael Gatzen
  • , Namrata Gupta
  • , Emma Henricks
  • , Katie Larsson
  • , Sheli McDonough
  • , Steven M. Harrison
  • , Christopher Kachulis
  • , Matthew S. Lebo
  • , Cynthia L. Neben
  • , Marcie Steeves
  • , Alicia Y. Zhou
  • , Scott Topper
  •  & Stacey Gabriel

Genome Center: University of Washington

  • Gail P. Jarvik
  • , Joshua D. Smith
  • , Christian D. Frazar
  • , Colleen P. Davis
  • , Karynne E. Patterson
  • , Marsha M. Wheeler
  • , Sean McGee
  • , Christina M. Lockwood
  • , Brian H. Shirts
  • , Colin C. Pritchard
  • , Mitzi L. Murray
  • , Valeria Vasta
  • , Dru Leistritz
  • , Matthew A. Richardson
  • , Jillian G. Buchan
  • , Aparna Radhakrishnan
  • , Niklas Krumm
  •  & Brenna W. Ehmen

Data and Research Center

  • Lee Lichtenstein
  • , Sophie Schwartz
  • , M. Morgan T. Aster
  • , Kristian Cibulskis
  • , Andrea Haessly
  • , Rebecca Asch
  • , Aurora Cremer
  • , Kylee Degatano
  • , Akum Shergill
  • , Laura D. Gauthier
  • , Samuel K. Lee
  • , Aaron Hatcher
  • , George B. Grant
  • , Genevieve R. Brandt
  • , Miguel Covarrubias
  • , Melissa Basford
  • , Alexander G. Bick
  • , Ashley Able
  • , Ashley E. Green
  • , Jennifer Zhang
  • , Henry R. Condon
  • , Yuanyuan Wang
  • , Moira K. Dillon
  • , C. H. Albach
  • , Wail Baalawi
  •  & Dan M. Roden

All of Us Research Demonstration Project Teams

  • Seung Hoan Choi
  • , Elisabeth A. Rosenthal

NIH All of Us Research Program Staff

  • Andrea H. Ramirez
  • , Sokny Lim
  • , Siddhartha Nambiar
  • , Bradley Ozenberger
  • , Anastasia L. Wise
  • , Chris Lunt
  • , Geoffrey S. Ginsburg
  •  & Joshua C. Denny

Contributions

The All of Us Biobank (Mayo Clinic) collected, stored and plated participant biospecimens. The All of Us Genome Centers (Baylor-Hopkins Clinical Genome Center; Broad, Color, and Mass General Brigham Laboratory for Molecular Medicine; and University of Washington School of Medicine) generated and QCed the whole-genomic data. The All of Us Data and Research Center (Vanderbilt University Medical Center, Broad Institute of MIT and Harvard, and Verily) generated the WGS joint call set, carried out quality assurance and QC analyses and developed the Researcher Workbench. All of Us Research Demonstration Project Teams contributed analyses. The other All of Us Genomics Investigators and NIH All of Us Research Program Staff provided crucial programmatic support. Members of the manuscript writing group (A.G.B., G.A.M., K.R.M., L.L., S.R., R.J.C. and A.M.) wrote the first draft of this manuscript, which was revised with contributions and feedback from all authors.

Corresponding author

Correspondence to Alexander G. Bick .

Ethics declarations

Competing interests.

D.M.M., G.A.M., E.V., K.W., J.H., H.D., C.L.K., M.M., S.D., Z.K., E. Boerwinkle and R.A.G. declare that Baylor Genetics is a Baylor College of Medicine affiliate that derives revenue from genetic testing. Eric Venner is affiliated with Codified Genomics, a provider of genetic interpretation. E.E.E. is a scientific advisory board member of Variant Bio, Inc. A.G.B. is a scientific advisory board member of TenSixteen Bio. The remaining authors declare no competing interests.

Peer review

Peer review information.

Nature thanks Timothy Frayling and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Historic availability of EHR records in the All of Us v7 Controlled Tier Curated Data Repository (n = 413,457).

For better visibility, the plot shows growth starting in 2010.

Extended Data Fig. 2 Overview of the Genomic Data Curation Pipeline for WGS samples.

The Data and Research Center (DRC) performs additional single sample quality control (QC) on the data as it arrives from the Genome Centers. The variants from samples that pass this QC are loaded into the Genomic Variant Store (GVS), where we jointly call the variants and apply additional QC. We apply a joint call set QC process, which is stored with the call set. The entire joint call set is rendered as a Hail Variant Dataset (VDS), which can be accessed from the analysis notebooks in the Researcher Workbench. Subsections of the genome are extracted from the VDS and rendered in different formats with all participants. Auxiliary data can also be accessed through the Researcher Workbench. This includes variant functional annotations, joint call set QC results, predicted ancestry, and relatedness. Auxiliary data are derived from GVS (arrow not shown) and the VDS. The Cohort Builder directly queries GVS when researchers request genomic data for subsets of samples. Aligned reads, as cram files, are available in the Researcher Workbench (not shown). The graphics of the dish, gene and computer and the All of Us logo are reproduced with permission of the National Institutes of Health’s All of Us Research Program.

Extended Data Fig. 3 Proportion of allelic frequencies (AF), stratified by computed ancestries with more than 10,000 participants.

Bar counts are not cumulative (e.g., “pop AF < 0.01” does not include “pop AF < 0.001”).

Extended Data Fig. 4 Distribution of pathogenic and likely pathogenic ClinVar variants.

Stratified by ancestry and filtered to variants found in fewer than 40 individuals (allele count (AC) < 40) among the 245,388 short-read WGS samples.

Extended Data Fig. 5 Ancestry-specific HLA-DQB1 (rs9273363) locus associations in 231,442 unrelated individuals.

Phenome-wide association study (PheWAS) results highlight ancestry-specific consequences.

Extended Data Fig. 6 Ancestry-specific TCF7L2 (rs7903146) locus associations in 231,442 unrelated individuals.

Phenome-wide association study (PheWAS) results highlight diabetic consequences across ancestries.

Supplementary information

Supplementary Information

Supplementary Figs. 1–7, Tables 1–8 and Note.

Reporting Summary

Supplementary Dataset 1

Associations of ACKR1, HLA-DQB1 and TCF7L2 loci with all Phecodes stratified by genetic ancestry.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

The All of Us Research Program Genomics Investigators. Genomic data in the All of Us Research Program. Nature (2024). https://doi.org/10.1038/s41586-023-06957-x

Received : 22 July 2022

Accepted : 08 December 2023

Published : 19 February 2024

DOI : https://doi.org/10.1038/s41586-023-06957-x


The use of Big Data Analytics in healthcare

Kornelia Batko

1 Department of Business Informatics, University of Economics in Katowice, Katowice, Poland

Andrzej Ślęzak

2 Department of Biomedical Processes and Systems, Institute of Health and Nutrition Sciences, Częstochowa University of Technology, Częstochowa, Poland

Associated Data

The datasets for this study are available on request from the corresponding author.

The introduction of Big Data Analytics (BDA) in healthcare will allow new technologies to be used both in the treatment of patients and in health management. The paper aims at analyzing the possibilities of using Big Data Analytics in healthcare. The research is based on a critical analysis of the literature, as well as the presentation of selected results of direct research on the use of Big Data Analytics in medical facilities. The direct research was carried out with a research questionnaire on a sample of 217 medical facilities in Poland. Literature studies have shown that the use of Big Data Analytics can bring many benefits to medical facilities, while the direct research has shown that medical facilities in Poland are moving towards data-based healthcare: they use structured and unstructured data and apply analytics in the administrative, business and clinical areas. The research confirmed that medical facilities work on both structured and unstructured data. The following kinds and sources of data can be distinguished: data from databases, transaction data, unstructured content of emails and documents, and data from devices and sensors; the use of data from social media is lower. It clearly shows that the decisions made in medical facilities are highly data-driven. The results of the study confirm what has been analyzed in the literature: medical facilities are moving towards data-based healthcare, together with its benefits.

Introduction

The main contribution of this paper is to present an analytical overview of the use of structured and unstructured (Big Data) analytics in medical facilities in Poland. Medical facilities use both structured and unstructured data in their practice. Structured data has a predetermined schema and fits into the typical relational processing format [ 27 ]. In contrast, unstructured data, referred to as Big Data (BD), is extensive, freeform, comes in a variety of forms and does not fit into the typical data processing format. Big Data is a massive amount of data sets that cannot be stored, processed, or analyzed using traditional tools; it often remains stored but not analyzed. Due to the lack of a well-defined schema, it is difficult to search and analyze such data and, therefore, it requires specific technologies and methods to transform it into value [ 20 , 68 ]. Integrating data stored in both structured and unstructured formats can add significant value to an organization [ 27 ], but organizations must approach unstructured data in a different way. This is where the potential of Big Data Analytics (BDA) lies. Big Data Analytics comprises the techniques and tools used to analyze and extract information from Big Data. The results of Big Data analysis can be used to predict the future as well as to identify trends from the past. In healthcare, it makes it possible to analyze large datasets from thousands of patients, identify clusters and correlations between datasets, and develop predictive models using data mining techniques [ 60 ].
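To make the distinction concrete, the short sketch below contrasts a structured table with an unstructured clinical note in Python. All column names, values and the note text are invented for illustration and are not taken from the study.

```python
# A small illustration of the two kinds of data contrasted above.
# All column names, values and the note text are invented examples.
import pandas as pd

# Structured data: a predetermined schema that fits a relational table.
admissions = pd.DataFrame({
    "patient_id": [101, 102],
    "age": [67, 54],
    "ward": ["cardiology", "neurology"],
})

# Unstructured data: a freeform clinical note with no fixed schema.
note = "Pt reports chest pain on exertion; ECG ordered, BP 150/95."

# The structured table can be queried directly ...
print(admissions[admissions["age"] > 60])

# ... while the note needs extra processing before it yields value;
# a naive keyword scan stands in here for real text-mining techniques.
print([kw for kw in ("chest pain", "ECG", "BP") if kw in note])
```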

This paper is the first study to consolidate and characterize the use of Big Data from different perspectives. The first part consists of a brief literature review of studies on Big Data (BD) and Big Data Analytics (BDA), while the second part presents results of direct research aimed at diagnosing the use of big data analyses in medical facilities in Poland.

Healthcare is a complex system with varied stakeholders: patients, doctors, hospitals, pharmaceutical companies and healthcare decision-makers. This sector is also limited by strict rules and regulations. However, worldwide one may observe a departure from the traditional doctor-patient approach. The doctor becomes a partner and the patient is involved in the therapeutic process [ 14 ]. Healthcare is no longer focused solely on the treatment of patients. The priority for decision-makers should be to promote proper health attitudes and prevent diseases that can be avoided [ 81 ]. This became visible and important especially during the Covid-19 pandemic [ 44 ].

The next challenges that healthcare will have to face are the growing number of elderly people and a decline in fertility. Fertility rates have fallen below the replacement minimum necessary to keep the population stable [ 10 ]. Both effects, population ageing and lower fertility, are reflected in the demographic dependency ratio, which is constantly growing. Forecasts show that providing healthcare in the form it is provided today will become impossible in the next 20 years [ 70 ]. This is especially visible now, during the Covid-19 pandemic, when healthcare faced the challenge of analyzing huge amounts of data, identifying trends and predicting the spread of the coronavirus. The pandemic made it even clearer that patients should have access to information about their health condition, to digital analysis of this data and to reliable medical support online. Health monitoring and cooperation with doctors in order to prevent diseases can actually revolutionize the healthcare system. One of the most important aspects of the necessary change in healthcare is putting the patient at the center of the system.

Technology alone is not enough to achieve these goals. Therefore, changes should be made not only at the technological level but also in the management and design of complete healthcare processes; what is more, they should affect the business models of service providers. The use of Big Data Analytics is becoming more and more common in enterprises [ 17 , 54 ]. However, medical enterprises still cannot keep up with the information needs of patients, clinicians, administrators and policymakers. The adoption of a Big Data approach would allow the implementation of personalized and precise medicine based on personalized information, delivered in real time and tailored to individual patients.

To achieve this goal, it is necessary to implement systems that can learn quickly from the data generated by people in clinical care and everyday life. This will enable data-driven decision-making; better personalized predictions about prognosis and responses to treatment; a deeper understanding of the complex factors and interactions that influence health at the level of the patient, the health system and society; enhanced approaches to detecting safety problems with drugs and devices; and more effective methods of comparing prevention, diagnostic and treatment options [ 40 ].

In the literature, there is a lot of research showing what opportunities big data analysis can offer companies and what data can be analyzed. However, there are few studies showing how data analysis in healthcare is performed, what data medical facilities use, and what analyses they carry out and in which areas. This paper aims to fill this gap by presenting the results of research carried out in medical facilities in Poland. The goal is to analyze the possibilities of using Big Data Analytics in healthcare, especially in Polish conditions. In particular, the paper is aimed at determining what data is processed by medical facilities in Poland, what analyses they perform and in what areas, and how they assess their analytical maturity. In order to achieve this goal, a critical analysis of the literature was performed, and the direct research was based on a research questionnaire conducted on a sample of 217 medical facilities in Poland. It was hypothesized that medical facilities in Poland work on both structured and unstructured data and are moving towards data-based healthcare and its benefits. Examining the maturity of healthcare facilities in the use of Big Data and Big Data Analytics is crucial in determining the potential future benefits that the healthcare sector can gain from Big Data Analytics. There is also a pressing need to predict whether, in the coming years, healthcare will be able to cope with the threats and challenges it faces.

This paper is divided into eight parts. The first is the introduction, which provides the background and the general problem statement of this research. The second part discusses considerations on the use of Big Data and Big Data Analytics in healthcare, and the third part moves on to the challenges and potential benefits of using Big Data Analytics in healthcare. The next part explains the proposed method. The results of the direct research and the discussion are presented in the fifth part, and the sixth part is the conclusion. The seventh part presents practical implications. The final section provides limitations and directions for future research.

Considerations on the use of Big Data and Big Data Analytics in healthcare

In recent years one can observe a constantly increasing demand for solutions offering effective analytical tools. This trend is also noticeable in the analysis of large volumes of data (Big Data, BD). Organizations are looking for ways to use the power of Big Data to improve their decision-making, competitive advantage or business performance [ 7 , 54 ]. Big Data is considered to offer potential solutions to public and private organizations; however, still little is known about the outcomes of its practical use in different types of organizations [ 24 ].

As already mentioned, in recent years healthcare management worldwide has been shifting from a disease-centered model to a patient-centered model, and even to a value-based healthcare delivery model [ 68 ]. In order to meet the requirements of this model and provide effective patient-centered care, it is necessary to manage and analyze healthcare Big Data.

The issue often raised when it comes to the use of data in healthcare is the appropriate use of Big Data. Healthcare has always generated huge amounts of data and nowadays, the introduction of electronic medical records, as well as the huge amount of data sent by various types of sensors or generated by patients in social media causes data streams to constantly grow. Also, the medical industry generates significant amounts of data, including clinical records, medical images, genomic data and health behaviors. Proper use of the data will allow healthcare organizations to support clinical decision-making, disease surveillance, and public health management. The challenge posed by clinical data processing involves not only the quantity of data but also the difficulty in processing it.

In the literature one can find many different definitions of Big Data. The concept has evolved in recent years; however, it is still not clearly understood. Nevertheless, despite the range of and differences between definitions, Big Data can be treated as: a large amount of digital data, large data sets, a tool, a technology, or a phenomenon (cultural or technological).

Big Data can be considered as massive and continually generated digital datasets that are produced via interactions with online technologies [ 53 ]. Big Data can be defined as datasets that are of such large sizes that they pose challenges in traditional storage and analysis techniques [ 28 ]. A similar opinion about Big Data was presented by Ohlhorst who sees Big Data as extremely large data sets, possible neither to manage nor to analyze with traditional data processing tools [ 57 ]. In his opinion, the bigger the data set, the more difficult it is to gain any value from it.

In turn, Knapp perceived Big Data as tools, processes and procedures that allow an organization to create, manipulate and manage very large data sets and storage facilities [ 38 ]. From this point of view, Big Data is identified as a tool to gather information from different databases and processes, allowing users to manage large amounts of data.

Similar perception of the term ‘Big Data’ is shown by Carter. According to him, Big Data technologies refer to a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data by enabling high velocity capture, discovery and/or analysis [ 13 ].

Jordan combines these two approaches by identifying Big Data as a complex system, as it needs databases in which the data can be stored, programs and tools with which it can be managed, as well as expertise and personnel able to retrieve useful information, and visualization so that it can be understood [ 37 ].

Following Laney's definition of Big Data, it can be stated that it is a large amount of data generated at very high velocity and containing a great variety of content [ 43 ]. Such data comes from unstructured sources, such as streams of clicks on the web, social networks (Twitter, blogs, Facebook), video recordings from shops, recordings of calls in call centers, real-time information from various kinds of sensors, RFID, GPS devices, mobile phones and other devices that identify and monitor something [ 8 ]. Big Data is a powerful digital data silo: raw, collected from all sorts of sources, unstructured and difficult, or even impossible, to analyze using the conventional techniques applied so far to relational databases.

While describing Big Data, it cannot be overlooked that the term refers more to a phenomenon than to a specific technology. Therefore, instead of defining the phenomenon, many authors describe Big Data by giving it a collection of characteristics, the so-called V's, related to its nature [ 2 , 3 , 23 , 25 , 58 ]:

  • Volume (refers to the amount of data and is one of the biggest challenges in Big Data Analytics),
  • Velocity (speed with which new data is generated, the challenge is to be able to manage data effectively and in real time),
  • Variety (heterogeneity of data, many different types of healthcare data, the challenge is to derive insights by looking at all available heterogenous data in a holistic manner),
  • Variability (inconsistency of data, the challenge is to correct the interpretation of data that can vary significantly depending on the context),
  • Veracity (how trustworthy the data is, quality of the data),
  • Visualization (ability to interpret data and resulting insights, challenging for Big Data due to its other features as described above),
  • Value (the goal of Big Data Analytics is to discover the hidden knowledge in huge amounts of data).

Big Data is defined as an information asset characterized by high volume, velocity and variety, which requires specific technologies and methods for its transformation into value [ 21 , 77 ]. Big Data is also a collection of information of high volume, high volatility or high diversity, requiring new forms of processing in order to support decision-making, the discovery of new phenomena and process optimization [ 5 , 7 ]. Big Data is too large for traditional data-processing systems and software tools to capture, store, manage and analyze; therefore, it requires new technologies [ 28 , 50 , 61 ] to manage (capture, aggregate, process) its volume, velocity and variety [ 9 ].

Undoubtedly, Big Data differs from the data sources used so far by organizations. Therefore, organizations must approach this type of unstructured data in a different way. First of all, organizations must start to see data as flows and not stocks—this entails the need to implement the so-called streaming analytics [ 48 ]. The mentioned features make it necessary to use new IT tools that allow the fullest use of new data [ 58 ]. The Big Data idea, inseparable from the huge increase in data available to various organizations or individuals, creates opportunities for access to valuable analyses, conclusions and enables making more accurate decisions [ 6 , 11 , 59 ].

The Big Data concept is constantly evolving and currently it does not focus on huge amounts of data, but rather on the process of creating value from this data [ 52 ]. Big Data is collected from various sources that have different data properties and are processed by different organizational units, resulting in creation of a Big Data chain [ 36 ]. The aim of the organizations is to manage, process and analyze Big Data. In the healthcare sector, Big Data streams consist of various types of data, namely [ 8 , 51 ]:

  • clinical data, i.e. data obtained from electronic medical records, data from hospital information systems, image centers, laboratories, pharmacies and other organizations providing health services, patient generated health data, physician’s free-text notes, genomic data, physiological monitoring data [ 4 ],
  • biometric data provided from various types of devices that monitor weight, pressure, glucose level, etc.,
  • financial data, constituting a full record of economic operations reflecting the conducted activity,
  • data from scientific research activities, i.e. results of research, including drug research, design of medical devices and new methods of treatment,
  • data provided by patients, including description of preferences, level of satisfaction, information from systems for self-monitoring of their activity: exercises, sleep, meals consumed, etc.
  • data from social media.

These data are provided not only by patients but also by organizations and institutions, as well as by various types of monitoring devices, sensors or instruments [ 16 ]. Data that has been generated so far in the healthcare sector is stored in both paper and digital form. Thus, the essence and the specificity of the process of Big Data analyses means that organizations need to face new technological and organizational challenges [ 67 ]. The healthcare sector has always generated huge amounts of data and this is connected, among others, with the need to store medical records of patients. However, the problem with Big Data in healthcare is not limited to an overwhelming volume but also an unprecedented diversity in terms of types, data formats and speed with which it should be analyzed in order to provide the necessary information on an ongoing basis [ 3 ]. It is also difficult to apply traditional tools and methods for management of unstructured data [ 67 ]. Due to the diversity and quantity of data sources that are growing all the time, advanced analytical tools and technologies, as well as Big Data analysis methods which can meet and exceed the possibilities of managing healthcare data, are needed [ 3 , 68 ].

Therefore, the potential is seen in Big Data analyses, especially in terms of improving the quality of medical care, saving lives or reducing costs [ 30 ]. Extracting association rules, patterns and trends from this tangle of data will allow health service providers and other stakeholders in the healthcare sector to offer more accurate and more insightful diagnoses, personalized treatment, patient monitoring, preventive medicine, support for medical research and population health, as well as better quality of medical services and patient care while, at the same time, being able to reduce costs (Fig. 1).

Fig. 1 Healthcare Big Data Analytics applications (Source: own elaboration)

The main challenge with Big Data is how to handle such a large amount of information and use it to make data-driven decisions in a wide range of areas [ 64 ]. In the context of healthcare data, another major challenge is to adapt Big Data storage, analysis, the presentation of analysis results, and the inference based on them to a clinical setting. Data analytics systems implemented in healthcare are designed to describe, integrate and present complex data in an appropriate way so that it can be understood better (Fig. 2). This would improve the efficiency of acquiring, storing, analyzing and visualizing Big Data from healthcare [ 71 ].

Fig. 2 Process of Big Data Analytics

The result of data processing with the use of Big Data Analytics is appropriate data storytelling, which may contribute to decisions that carry lower risk and are supported by data. This, in turn, can benefit healthcare stakeholders. To take advantage of the potentially massive amounts of data in healthcare and to ensure that the right intervention for the right patient is properly timed, personalized, and potentially beneficial to all components of the healthcare system, such as the payer, patient and management, the analytics of large datasets must connect the communities involved in data analytics and healthcare informatics [ 49 ]. Big Data Analytics can provide insight into clinical data and thus facilitate informed decision-making about the diagnosis and treatment of patients, the prevention of diseases and more. Big Data Analytics can also improve the efficiency of healthcare organizations by realizing the potential of data [ 3 , 62 ].

Big Data Analytics in medicine and healthcare refers to the integration and analysis of a large amount of complex heterogeneous data, such as various omics data (genomics, epigenomics, transcriptomics, proteomics, metabolomics, interactomics, pharmacogenetics, diseasomics), biomedical data, telemedicine data (sensors, medical equipment data) and electronic health records data [ 46 , 65 ].

When analyzing the phenomenon of Big Data in the healthcare sector, it should be noted that it can be considered from the point of view of three areas: epidemiological, clinical and business.

From a clinical point of view, Big Data analysis aims to improve the health and condition of patients, enable long-term predictions about their health status and support the implementation of appropriate therapeutic procedures. Ultimately, the use of data analysis in medicine is meant to allow the adaptation of therapy to a specific patient, that is, personalized (precision) medicine.

From an epidemiological point of view, it is desirable to obtain an accurate prognosis of morbidity in order to implement preventive programs in advance.

In the business context, Big Data analysis may enable offering personalized packages of commercial services or determining the probability of individual disease and infection occurrence. It is worth noting that Big Data means not only the collection and processing of data but, most of all, the inference and visualization of data necessary to obtain specific business benefits.

In order to introduce new management methods and new solutions in terms of effectiveness and transparency, it becomes necessary to make data more accessible, digital, searchable, as well as analyzed and visualized.

Erickson and Rothberg state that information and data do not reveal their full value until insights are drawn from them. Data becomes useful when it enhances decision-making, and decision-making is enhanced only when analytical techniques are used and an element of human interaction is applied [ 22 ].

Thus, healthcare has experienced much progress in the usage and analysis of data. Large-scale digitalization and transparency in this sector are key commitments of almost all governments' policies. For centuries, the treatment of patients was based on the judgment of doctors who made treatment decisions. In recent years, however, Evidence-Based Medicine has become more and more important, as it is related to the systematic analysis of clinical data and treatment decision-making based on the best available information [ 42 ]. In the healthcare sector, Big Data Analytics is expected to improve the quality of life and reduce operational costs [ 72 , 82 ]. Big Data Analytics enables organizations to improve and increase their understanding of the information contained in data. It also helps identify data that provides valuable insights for current as well as future decisions [ 28 ].

Big Data Analytics refers to technologies that are grounded mostly in data mining: text mining, web mining, process mining, audio and video analytics, statistical analysis, network analytics, social media analytics and web analytics [ 16 , 25 , 31 ]. Different data mining techniques can be applied on heterogeneous healthcare data sets, such as: anomaly detection, clustering, classification, association rules as well as summarization and visualization of those Big Data sets [ 65 ]. Modern data analytics techniques explore and leverage unique data characteristics even from high-speed data streams and sensor data [ 15 , 16 , 31 , 55 ]. Big Data can be used, for example, for better diagnosis in the context of comprehensive patient data, disease prevention and telemedicine (in particular when using real-time alerts for immediate care), monitoring patients at home, preventing unnecessary hospital visits, integrating medical imaging for a wider diagnosis, creating predictive analytics, reducing fraud and improving data security, better strategic planning and increasing patients’ involvement in their own health.
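As a minimal illustration of one of the data mining techniques named above (clustering), the sketch below groups synthetic patient records with scikit-learn. The features, values and cluster count are illustrative assumptions, not part of the cited studies.

```python
# A minimal sketch of one data mining technique named above (clustering),
# applied to synthetic patient vitals. All field names and values are
# illustrative assumptions, not data from the study.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic records: [age, systolic blood pressure, fasting glucose]
patients = np.vstack([
    rng.normal([45, 120, 90], [8, 10, 10], size=(50, 3)),   # broadly healthy
    rng.normal([67, 150, 140], [7, 12, 20], size=(50, 3)),  # higher risk
])

# Standardize features so no single unit dominates the distance metric.
X = StandardScaler().fit_transform(patients)

# Partition patients into two clusters; in practice the cluster count
# would be chosen with a criterion such as the silhouette score.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))  # size of each discovered patient cluster
```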

Big Data Analytics in healthcare can be divided into [ 33 , 73 , 74 ]:

  • descriptive analytics—used to understand past and current healthcare decisions, converting data into useful information for understanding and analyzing healthcare decisions, outcomes and quality, as well as for making informed decisions [ 33 ]. It can be used to create reports (e.g. about patients’ hospitalizations, physicians’ performance, utilization management), visualizations, customized reports and drill-down tables, or to run queries on the basis of historical data.
  • predictive analytics—operates on past performance in an effort to predict the future by examining historical or summarized health data, detecting patterns of relationships in these data, and then extrapolating these relationships to forecasts (a minimal code sketch follows this list). It can be used, for example, to predict the response of different patient groups to different drugs (dosages) or reactions (clinical trials), to anticipate risk, and to find relationships in health data and detect hidden patterns [ 62 ]. In this way, it is possible to predict the spread of epidemics, anticipate service contracts and plan healthcare resources. Predictive analytics is used in proper diagnosis and for appropriate treatments to be given to patients suffering from certain diseases [ 39 ].
  • prescriptive analytics—used when health problems involve too many choices or alternatives. It uses health and medical knowledge in addition to data or information. Prescriptive analytics is used in many areas of healthcare, including drug prescriptions and treatment alternatives. Personalized medicine and evidence-based medicine are both supported by prescriptive analytics.
  • discovery analytics—utilizes knowledge about knowledge to discover new “inventions” such as drugs (drug discovery), previously unknown diseases and medical conditions, alternative treatments, etc.
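As announced above, here is a minimal sketch of the predictive analytics idea: a model is fitted to historical records and used to forecast a future outcome. The "readmission" label, the features and the synthetic risk rule are all assumptions made for illustration, not the study's data.

```python
# A minimal sketch of predictive analytics: a model trained on historical
# records to forecast a future outcome. The "readmission" outcome and the
# features are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 400
# Synthetic history: [age, prior admissions, length of stay in days]
X = np.column_stack([
    rng.normal(60, 12, n),
    rng.poisson(1.5, n),
    rng.normal(5, 2, n),
])
# Readmission risk rises with age and prior admissions (synthetic rule).
p = 1 / (1 + np.exp(-(0.04 * (X[:, 0] - 60) + 0.8 * X[:, 1] - 1.5)))
y = rng.random(n) < p

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```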

Although the models and tools used in descriptive, predictive, prescriptive and discovery analytics are different, many applications involve all four of them [ 62 ]. Big Data Analytics in healthcare can help enable personalized medicine by identifying optimal patient-specific treatments. This can improve living standards, reduce the waste of healthcare resources and save healthcare costs [ 56 , 63 , 71 ]. The introduction of large-scale data analysis gives new analytical possibilities in terms of scope, flexibility and visualization. Techniques such as data mining (the computational process of discovering patterns in large data sets) facilitate inductive reasoning and exploratory data analysis, enabling scientists to identify data patterns that are independent of specific hypotheses. As a result, predictive analysis and real-time analysis become possible, making it easier for medical staff to start treatment early and reduce potential morbidity and mortality. In addition, document analysis, statistical modeling, the discovery of patterns and topics in document collections and EHR data, as well as an inductive approach can help identify and discover relationships between health phenomena.

Advanced analytical techniques can be applied to the large amount of existing (but not yet analyzed) data on patient health and related medical data to achieve a better understanding of the information and results obtained, as well as to design optimal clinical pathways [ 62 ]. Big Data Analytics in healthcare integrates the analysis of several scientific areas such as bioinformatics, medical imaging, sensor informatics, medical informatics and health informatics [ 65 ]. Big Data Analytics in healthcare makes it possible to analyze large datasets from thousands of patients, identifying clusters and correlations between datasets, as well as developing predictive models using data mining techniques [ 65 ]. Discussing all the techniques used for Big Data Analytics goes beyond the scope of a single article [ 25 ].

The success and accuracy of Big Data analysis depend heavily on the tools and techniques used to analyze the data and on their ability to provide reliable, up-to-date and meaningful information to various stakeholders [ 12 ]. It is believed that the implementation of Big Data Analytics by healthcare organizations could bring many benefits in the upcoming years, including lowering healthcare costs, better diagnosis and prediction of diseases and their spread, improving patient care and developing protocols to prevent re-hospitalization, optimizing staff and equipment, forecasting the need for hospital beds, operating rooms and treatments, and improving the drug supply chain [ 71 ].

Challenges and potential benefits of using Big Data Analytics in healthcare

Modern analytics offers insight not only into historical data but also into what may happen in the future, up to and including the prediction of evidence-based actions. The emphasis on reform has prompted payers and providers to pursue data analysis in order to reduce risk, detect fraud, improve efficiency and save lives. Everyone—payers, providers, even patients—is focusing on doing more with fewer resources. Table 1 summarizes the areas in which enhanced data and analytics can yield the greatest results for various healthcare stakeholders.

Table 1 The use of analytics by various healthcare stakeholders (Source: own elaboration on the basis of [ 19 , 20 ])

Healthcare organizations see the opportunity to grow through investments in Big Data Analytics. In recent years, by collecting patients' medical data, converting it into Big Data and applying appropriate algorithms, reliable information has been generated that helps patients, physicians and stakeholders in the health sector to identify values and opportunities [ 31 ]. It is worth noting that there are many changes and challenges in the structure of the healthcare sector. Digitization and the effective use of Big Data in healthcare can bring benefits to every stakeholder in this sector: a single doctor would benefit just as much as the entire healthcare system. Potential opportunities to achieve benefits and effects from Big Data in healthcare include [ 8 ]:

  • assessment of diagnoses made by doctors and the manner of treatment of diseases indicated by them based on the decision support system working on Big Data collections,
  • detection of more effective, from a medical point of view, and more cost-effective ways to diagnose and treat patients,
  • analysis of large volumes of data to reach practical information useful for identifying needs, introducing new health services, preventing and overcoming crises,
  • prediction of the incidence of diseases,
  • detecting trends that lead to an improvement in health and lifestyle of the society,
  • analysis of the human genome for the introduction of personalized treatment.
  • doctors’ comparison of current medical cases to cases from the past for better diagnosis and treatment adjustment,
  • detection of diseases at earlier stages when they can be more easily and quickly cured,
  • detecting epidemiological risks and improving control of pathogenic spots and reaction rates,
  • identification of patients who are predicted to have the highest risk of specific, life-threatening diseases by collating data on the history of the most common diseases, in healing people with reports entering insurance companies,
  • health management of each patient individually (personalized medicine) and health management of the whole society,
  • capturing and analyzing large amounts of data from hospitals and homes in real time, life monitoring devices to monitor safety and predict adverse events,
  • analysis of patient profiles to identify people for whom prevention should be applied, lifestyle change or preventive care approach,
  • the ability to predict the occurrence of specific diseases or worsening of patients’ results,
  • predicting disease progression and its determinants, estimating the risk of complications,
  • detecting drug interactions and their side effects.
  • supporting work on new drugs and clinical trials thanks to the possibility of analyzing “all data” instead of selecting a test sample,
  • the ability to identify patients with specific, biological features that will take part in specialized clinical trials,
  • selecting a group of patients for which the tested drug is likely to have the desired effect and no side effects,
  • using modeling and predictive analysis to design better drugs and devices.
  • reduction of costs and counteracting abuse and counseling practices,
  • faster and more effective identification of incorrect or unauthorized financial operations in order to prevent abuse and eliminate errors,
  • increase in profitability by detecting patients generating high costs or identifying doctors whose work, procedures and treatment methods cost the most and offering them solutions that reduce the amount of money spent,
  • identification of unnecessary medical activities and procedures, e.g. duplicate tests.

According to research conducted by Wang, Kung and Byrd, the benefits of Big Data Analytics can be classified into five categories [ 73 ]:

  • IT infrastructure benefits—reducing system redundancy, avoiding unnecessary IT costs, transferring data quickly among healthcare IT systems, better use of healthcare systems, processing standardization among various healthcare IT systems, reducing IT maintenance costs regarding data storage,
  • operational benefits—improving the quality and accuracy of clinical decisions, processing a large number of health records in seconds, reducing the time of patient travel, immediate access to clinical data for analysis, shortening the time of diagnostic tests, reductions in surgery-related hospitalizations, exploring previously inconceivable new research avenues,
  • organizational benefits—detecting interoperability problems much more quickly than traditional manual methods, improving cross-functional communication and collaboration among administrative staff, researchers, clinicians and IT staff, enabling data sharing with other institutions and adding new services, content sources and research partners,
  • managerial benefits—gaining quick insights about changing healthcare trends in the market, providing members of the board and heads of departments with sound decision-support information on the daily clinical setting, optimizing business growth-related decisions,
  • strategic benefits—providing a big-picture view of treatment delivery for meeting future needs, creating highly competitive healthcare services.

The above specification does not constitute a full list of the potential areas of use of Big Data analysis in healthcare, because the possibilities of applying such analysis are practically unlimited. In addition, advanced analytical tools allow data from all possible sources to be analyzed and cross-analyses to be conducted for better insights [ 26 ]. For example, a cross-analysis can combine patient characteristics with costs and care results to help identify the medically best and most cost-effective treatments, which may allow a better adjustment of the service provider's offer [ 62 ].

In turn, the analysis of patient profiles (e.g. segmentation and predictive modeling) allows the identification of people who should be subject to prophylaxis or prevention or who should change their lifestyle [ 8 ]. A shortened list of the benefits of Big Data Analytics in healthcare is presented in [ 3 ] and consists of: better performance, day-to-day guidance, detection of diseases at early stages, predictive analytics, cost-effectiveness, Evidence-Based Medicine and effectiveness in patient treatment.

Summarizing, healthcare Big Data represents a huge potential for the transformation of healthcare: improvement of patient outcomes, prediction of epidemic outbreaks, valuable insights, avoidance of preventable diseases, reduction of the cost of healthcare delivery and improvement of the quality of life in general [ 1 ]. Big Data also generates many challenges, such as difficulties in data capture, data storage, data analysis and data visualization [ 15 ]. The main challenges are connected with the issues of: data structure (Big Data should be user-friendly, transparent and menu-driven, but it is fragmented, dispersed, rarely standardized and difficult to aggregate and analyze), security (data security, privacy and the sensitivity of healthcare data raise significant confidentiality concerns), data standardization (data is stored in formats that are not compatible with all applications and technologies), storage and transfers (especially the costs associated with securing, storing and transferring unstructured data), managerial skills such as data governance, the lack of appropriate analytical skills, and problems with real-time analytics (healthcare needs to be able to utilize Big Data in real time) [ 4 , 34 , 41 ].

Research methodology

The research is based on a critical analysis of the literature, as well as the presentation of selected results of direct research on the use of Big Data Analytics in medical facilities in Poland.

The presented research results are part of a larger questionnaire study on Big Data Analytics. The direct research was based on an interview questionnaire which contained 100 questions with a 5-point Likert scale (1—strongly disagree, 2—rather disagree, 3—neither agree nor disagree, 4—rather agree, 5—strongly agree) and 4 demographic questions. The study was conducted in December 2018 on a sample of 217 medical facilities (110 private, 107 public). The research was conducted by a specialized market research agency: the Center for Research and Expertise of the University of Economics in Katowice.

In the direct research, the selected entities included entities financed from public sources—the National Health Fund (23.5%)—and entities operating commercially (11.5%). More than half of the surveyed entities (64.9%) are financed in a hybrid way, from both public and commercial sources. The diversity of the research sample also applies to the size of the entities, defined by the number of employees. In the sample structure, medium-sized (10–50 employees—34% of the sample) and large (51–250 employees—27%) entities dominate. The research was nationwide, and the entities included in the research sample come from all voivodships. The largest groups were entities from the Łódzkie (32%), Śląskie (18%) and Mazowieckie (18%) voivodships, as these voivodships have the largest numbers of medical institutions. Other regions of the country were represented by single units. The selection of the research sample was random and stratified: within a database of medical facilities, groups of private and public medical facilities were identified, and the facilities to which the questionnaire was targeted were drawn from each of these groups. The analyses were performed using the GNU PSPP 0.10.2 software.

The aim of the study was to determine whether medical facilities in Poland use Big Data Analytics and, if so, in which areas. The characteristics of the research sample are presented in Table 2.

Table 2 Characteristics of the research sample

The research is non-exhaustive due to the incomplete and uneven regional distribution of the samples, overrepresented in three voivodeships (Łódzkie, Mazowieckie and Śląskie). The size of the research sample (217 entities) allows the authors of the paper to formulate specific conclusions on the use of Big Data in the process of its management.

For the purpose of this paper, the following research hypotheses were formulated: (1) medical facilities in Poland work on both structured and unstructured data; (2) medical facilities in Poland are moving towards data-based healthcare and its benefits.

The paper poses the following research questions and statements that coincide with the selected questions from the research questionnaire:

  • What types of data, structured or unstructured, are used by the organization, and to what extent?
  • From what sources do medical facilities obtain data?
  • In which areas (clinical or business) do organizations use data and analytical systems?
  • Is data analytics performed only on historical data, or are predictive analyses also performed?
  • Do administrative and medical staff receive complete, accurate and reliable data in a timely manner?
  • Are real-time analyses performed to support the organization's activities?

Results and discussion

On the basis of the literature analysis and the research study, a set of questions and statements related to the researched area was formulated. The survey results show that medical facilities use a variety of data sources in their operations, covering both structured and unstructured data (Table 3).

Table 3 Types of data sources used in medical facilities (%)

1—strongly disagree, 2—rather disagree, 3—neither agree nor disagree, 4—rather agree, 5—strongly agree

According to the data provided by the respondents, considering the first statement in the questionnaire, almost half of the medical institutions (47.58%) rather agreed that they collect and use structured data (e.g. databases and data warehouses, reports to external entities), and 10.57% entirely agreed with this statement. As much as 23.35% of the representatives of medical institutions answered "neither agree nor disagree". The remaining medical facilities rather do not collect and use structured data (7.93%), and 6.17% strongly disagreed with the first statement. The median calculated from the obtained results (median: 4) also confirms that medical facilities in Poland collect and use structured data (Table 4).

Table 4 Collection and use of data by size of medical facility (number of employees)

In turn, 28.19% of the medical institutions rather agreed that they collect and use unstructured data, and 9.25% entirely agreed with this statement. The share of representatives of medical institutions who answered "neither agree nor disagree" was 27.31%. The remaining medical facilities rather do not collect and use unstructured data (17.18%), and 13.66% strongly disagreed with the statement. In the case of unstructured data the median is 3, which means that the collection and use of this type of data by medical facilities in Poland is lower.
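The kind of summary reported above (percentage distributions and medians of 5-point Likert answers) can be reproduced with a few lines of code. The sketch below uses a synthetic response vector; the proportions are chosen only to mimic the shape of the reported results, not to reproduce them.

```python
# A minimal sketch of the summaries reported above: the percentage
# distribution and median of 5-point Likert responses. The response
# vector is synthetic, shaped only loosely like the reported results.
import numpy as np
import pandas as pd

# Synthetic Likert codes (1 = strongly disagree ... 5 = strongly agree)
responses = pd.Series(np.random.default_rng(2).choice(
    [1, 2, 3, 4, 5], size=217, p=[0.06, 0.08, 0.23, 0.48, 0.15]))

distribution = responses.value_counts(normalize=True).sort_index() * 100
print(distribution.round(2))   # share of each answer in percent
print("median:", responses.median())
```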

In the further part of the analysis, it was checked whether the size of the medical facility and its form of ownership have an impact on whether it analyzes unstructured data (Tables 4 and 5). To find this out, correlation coefficients were calculated.

Table 5 Collection and use of data by form of ownership of medical facility

Based on the calculations, it can be concluded that there is a weak but statistically significant monotonic correlation between the size of the medical facility and its collection and use of structured data (p < 0.001; τ = 0.16). This means that the use of structured data increases slightly in larger medical facilities. The size of the medical facility matters more for the use of unstructured data (p < 0.001; τ = 0.23) (Table 4).

To determine whether the form of medical facility ownership affects data collection, the Mann–Whitney U test was used. The calculations show that the form of ownership does not affect what data the organization collects and uses (Table 5).
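For reference, here is a minimal sketch of the two tests used in this analysis, Kendall's tau and the Mann–Whitney U test, as implemented in SciPy. The data below are synthetic stand-ins, not the survey responses; only the sample sizes (217 facilities, 110 private and 107 public) follow the study.

```python
# A minimal sketch of the two tests used above: Kendall's tau for a
# monotonic association with facility size, and the Mann-Whitney U test
# for a difference between ownership forms. Synthetic data only.
import numpy as np
from scipy.stats import kendalltau, mannwhitneyu

rng = np.random.default_rng(3)
size_class = rng.integers(1, 5, 217)            # facility size category
likert = np.clip(size_class + rng.integers(-1, 3, 217), 1, 5)

tau, p_tau = kendalltau(size_class, likert)
print(f"Kendall tau = {tau:.2f}, p = {p_tau:.4f}")

public = rng.integers(1, 6, 107)                # Likert answers, public
private = rng.integers(1, 6, 110)               # Likert answers, private
u, p_u = mannwhitneyu(public, private)
print(f"Mann-Whitney U = {u:.0f}, p = {p_u:.4f}")
```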

Detailed information on the sources from which medical facilities collect and use data is presented in Table 6.

Table 6 Data sources used in medical facilities

1—we do not use it at all, 5—we use it extensively

The questionnaire results show that medical facilities mainly use information published in databases, reports to external units and transaction data, but they also use unstructured data from e-mails, medical devices, sensors, phone calls, and audio and video recordings (Table 6). Data from social media, RFID and geolocation data are used to a small extent. Similar findings are reported in the literature.

From the analysis of the answers given by the respondents, more than half of the medical facilities have a hospital information system (HIS) implemented: 43.61% use an integrated hospital system and 16.30% use it extensively (Table 7), while 19.38% of the examined medical facilities do not use one at all. Moreover, most of the examined medical facilities keep medical documentation in electronic form (34.80% use it, 32.16% use it extensively), which creates an opportunity to use data analytics. Only 4.85% of medical facilities do not use it at all.

Table 7 The use of HIS and electronic documentation in medical facilities (%)

Other questions that needed to be investigated were whether medical facilities in Poland use data analytics and, if so, in what form and in what areas (Table 8). The analysis of the respondents' answers about the potential of data analytics in medical facilities shows that a similar number of medical facilities use data analytics in administration and business (31.72% agreed with statement no. 5 and 12.33% strongly agreed) as in the clinical area (33.04% agreed with statement no. 6 and 12.33% strongly agreed). When considering decision-making issues, 35.24% agreed with the statement "the organization uses data and analytical systems to support business decisions" and 8.37% of respondents strongly agreed. Almost 40.09% agreed with the statement that "the organization uses data and analytical systems to support clinical decisions (in the field of diagnostics and therapy)" and 15.42% of respondents strongly agreed. The examined medical facilities use in their activity both analytics based on historical data (33.48% agreed with statement 7 and 12.78% strongly agreed) and predictive analytics (33.04% agreed with statement 8 and 15.86% strongly agreed). Detailed results are presented in Table 8.

Table 8 Conditions of using Big Data Analytics in medical facilities (%)

Medical facilities focus on development in the field of data processing, as they confirm that they conduct analytical planning processes systematically and analyze new opportunities for the strategic use of analytics in business and clinical activities (38.33% rather agree and 10.57% strongly agree with this statement). The situation is different with real-time data analysis, where the picture is less optimistic: only 28.19% rather agree and 14.10% strongly agree with the statement that real-time analyses are performed to support the organization's activities.

When considering whether a facility's performance in the clinical area depends on the form of ownership, the averages and the Mann–Whitney U test indicate that it does. A higher degree of use of analyses in the clinical area can be observed in public institutions.

Whether a medical facility performs descriptive or predictive analyses does not depend on the form of ownership (p > 0.05), although the means and medians are higher in public facilities than in private ones. However, the Mann–Whitney U test shows that these variables are dependent on each other (p < 0.05) (Table 9).

Table 9 Conditions of using Big Data Analytics in medical facilities by form of ownership

When considering whether a facility's performance in the clinical area depends on its size, Kendall's tau indicates that it does (p < 0.001; τ = 0.22); the correlation is weak but statistically significant. This means that the use of data and analytical systems to support clinical decisions (in the field of diagnostics and therapy) increases with the size of the medical facility. A similar, though weaker, relationship can be found in the use of descriptive and predictive analyses (Table 10).

Table 10 Conditions of using Big Data Analytics in medical facilities by size of medical facility (number of employees)

Considering the results of the research on the analytical maturity of medical facilities, 8.81% of medical facilities stated that they are at the first level of maturity, i.e. the organization has not developed analytical skills and does not perform analyses. As much as 13.66% of medical facilities confirmed that they have poor analytical skills, while 38.33% of medical facilities located themselves at level 3, meaning that "there is a lot to do in analytics". On the other hand, 28.19% believe that their analytical capabilities are well developed, and 6.61% stated that their analytics are at the highest level and very well developed. Detailed data is presented in Table 11. The average is 3.11 and the median is 3.

Table 11 Analytical maturity of the examined medical facilities (%)

The results of the research have enabled the formulation of the following conclusions. Medical facilities in Poland work on both structured and unstructured data. This data comes from databases, transactions, the unstructured content of emails and documents, and devices and sensors; the use of data from social media is smaller. In their activity, facilities apply analytics in the administrative and business areas, as well as in the clinical area. The decisions made are also largely data-driven.

In summary, the analysis of the literature shows that the benefits medical facilities can obtain from using Big Data Analytics in their activities relate primarily to patients, physicians and the medical facilities themselves. It can be confirmed that patients will be better informed, will receive treatments that work for them, will have prescribed medications that work for them and will not be given unnecessary medications [ 78 ]. Physician roles will likely shift from decision maker to consultant: they will advise, warn and help individual patients, and have more time to form positive and lasting relationships with their patients. Medical facilities will see changes as well, for example fewer unnecessary hospitalizations, resulting initially in less revenue but, after the market adjusts, in better outcomes [ 78 ]. The use of Big Data Analytics can literally revolutionize the way healthcare is practiced, for better health and disease reduction.

The analysis of the latest data reveals that data analytics increases the accuracy of diagnoses. Physicians can use predictive algorithms to help them make more accurate diagnoses [ 45 ]. Moreover, it can be helpful in preventive medicine and public health because, with early intervention, many diseases can be prevented or ameliorated [ 29 ]. Predictive analytics also makes it possible to identify risk factors for a given patient; with this knowledge, patients will be able to change their lives, which in turn may dramatically change population disease patterns, resulting in savings in medical costs. Moreover, personalized medicine is the best solution for an individual patient seeking treatment, as it can help doctors decide on the exact treatments for those individuals. Better diagnoses and more targeted treatments will naturally lead to more good outcomes and fewer resources used, including doctors' time.

The quantitative analysis of the research carried out and presented in this article made it possible to determine whether medical facilities in Poland use Big Data Analytics and, if so, in which areas. Thanks to the results obtained, it was possible to formulate the following conclusions. Medical facilities work on both structured and unstructured data, which comes from databases, transactions, the unstructured content of emails and documents, and devices and sensors. They apply analytics in the administrative and business areas, as well as in the clinical area. This clearly shows that the decisions made are largely data-driven. The results of the study confirm what has been analyzed in the literature: medical facilities are moving towards data-based healthcare and its benefits.

In conclusion, Big Data Analytics has the potential for positive impact and global implications in healthcare. Future research on the use of Big Data in medical facilities will concern the strategies adopted by medical facilities to promote and implement such solutions, the benefits they gain from the use of Big Data analysis, and how they see the prospects in this area.

Practical implications

This work sought to narrow the gap that exists in analyzing the possibility of using Big Data Analytics in healthcare. Showing how medical facilities in Poland perform in this respect forms part of the global research carried out in this area, including [ 29 , 32 , 60 ].

Limitations and future directions

The research described in this article does not fully exhaust the questions related to the use of Big Data Analytics in Polish healthcare facilities. Only some of the dimensions characterizing the use of data by medical facilities in Poland have been examined. In order to get the full picture, it would be necessary to examine the results of using structured and unstructured data analytics in healthcare. Future research may examine the benefits that medical institutions achieve as a result of the analysis of structured and unstructured data in the clinical and management areas, and what limitations they encounter in these areas. For this purpose, it is planned to conduct in-depth interviews with chosen medical facilities in Poland; these facilities could provide additional data for empirical analyses based on their suggestions. Further research should also include medical institutions from beyond the borders of Poland, enabling international comparative analyses.

Future research in the healthcare field has virtually endless possibilities. It may address the use of Big Data Analytics to diagnose specific conditions [ 47 , 66 , 69 , 76 ], propose approaches that can be used in other healthcare applications, and create mechanisms to identify "patients like me" [ 75 , 80 ]. Big Data Analytics could also be used for studies related to the spread of pandemics and the efficacy of COVID-19 treatments [ 18 , 79 ], or for psychology and psychiatry studies, e.g. emotion recognition [ 35 ].

Acknowledgements

We would like to thank those who have touched our science paths.

Authors’ contributions

KB proposed the concept of the research and its design. The manuscript was prepared by KB in consultation with AŚ. AŚ reviewed the manuscript to give it its final shape. KB prepared the manuscript with regard to the definition of intellectual content, literature search, data acquisition, data analysis, and so on. AŚ obtained the research funding. Both authors read and approved the final manuscript.

This research was fully funded as statutory activity—a subsidy of the Ministry of Science and Higher Education granted to the Częstochowa University of Technology for maintaining research potential in 2018 (Research Number: BS/PB–622/3020/2014/P). The publication fee for the paper was financed by the University of Economics in Katowice.

Availability of data and materials

Declarations.

Not applicable.

The authors declare no conflict of interest.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Kornelia Batko, Email: [email protected].

Andrzej Ślęzak, Email: aslezak52@gmail.com.

Emerging Research Trends in Data Deduplication: A Bibliometric Analysis from 2010 to 2023

  • Review article
  • Published: 26 February 2024


  • Anjuli Goel 1,
  • Chander Prabha 1,
  • Preeti Sharma 1,
  • Nitin Mittal 2 &
  • Vikas Mittal 3

In industry and academia today, the efficient use of data storage demands attention, as large volumes of duplicate data in the cloud waste storage space. This creates a need to explore and propose algorithms that increase the efficiency of cloud storage. Data deduplication is a technique that addresses the need for efficient storage management by removing duplicate data. It is important to study the existing state-of-the-art deduplication techniques in the literature that address the storage problem. This paper discusses the research impact of data deduplication via a bibliometric analysis covering the period from 2010 to 2023. The analysis is based on a sample of 461 documents taken from the Scopus database. The bibliometric review was performed with the Biblioshiny application, which is included in the bibliometrix package for the R language. The analysis covers various aspects such as annual scientific production, total citations per year, author and document citations, common key terms, the most relevant authors and sources, and trending topics in the field. The results are structured and organized so as to help future researchers by providing directions to explore. The findings demonstrate that as research advances, experts pay greater attention to the consequences of duplicate data in the cloud addressed by the data deduplication process, and the research goals are becoming more focused.


1 Introduction

Data deduplication is a technique that minimizes storage space and allows data to be transferred over a lower-bandwidth network [1, 2, 3, 4]. The technique removes redundant data, leaving only one copy of each piece of data to avoid unnecessary use of space. The main objective of data deduplication is to increase the efficiency of storage utilization in cloud computing [5]. Cloud vendors can use data deduplication to make efficient use of the storage in their existing infrastructure. Figure 1 represents the data deduplication process: redundant data is removed, a single copy is maintained, and pointers to the original file are set for the removed copies [5, 6, 7].

figure 1

Data deduplication process [ 8 ]

1.1 Data Deduplication Techniques

Data deduplication techniques are divided into three categories based on granularity, location, and time [9, 10, 11]. Techniques based on granularity are divided further into file level and chunk level. At the file level, the whole file is treated as a single chunk and compared with the backup copy to verify its contents. The chunk level is of two types: fixed size and variable size. In fixed-size chunking, the file is divided into smaller blocks of equal size [9, 10, 12], whereas in variable-size chunking, the file is split into content-defined blocks. In both cases each block is stored under its own index and compared with the stored file block by block [9, 10, 13]. Techniques based on location are likewise divided into two: client side and server side. Client-side deduplication removes redundant data at the source before transmitting it to the server, whereas server-side deduplication deletes replicate copies at the server, which requires high bandwidth to transmit the data from source to target. Post-process and inline deduplication are the time-based techniques. Post-process deduplication removes replicated data after it reaches the server; since the data is transferred as-is, it needs more bandwidth [9, 10, 14]. In inline deduplication, duplicates are removed at the source and only then transferred to the server, which needs less bandwidth and is the most widely used technique today [9, 10, 14, 15, 16]. Table 1 briefly describes all deduplication techniques with respect to their various attributes [15, 16]. The merits and demerits of data deduplication are shown in Fig. 2 [5, 12, 16, 17, 18, 19].
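To make the granularity distinction concrete, here is a minimal Python sketch, an illustration rather than the implementation of any surveyed system, contrasting a whole-file fingerprint with fixed-size chunk-level deduplication via SHA-256; the `store` dictionary stands in for the cloud storage index, and all names are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking; variable-size systems cut at content-defined boundaries

def file_fingerprint(data: bytes) -> str:
    """File-level deduplication: a single hash identifies the whole file."""
    return hashlib.sha256(data).hexdigest()

def store_chunks(data: bytes, store: dict) -> list:
    """Chunk-level deduplication: split the file into fixed-size blocks,
    keep each unique block once, and return the hash pointers that rebuild it."""
    pointers = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        h = hashlib.sha256(chunk).hexdigest()
        store.setdefault(h, chunk)  # a duplicate block is never stored twice
        pointers.append(h)
    return pointers

store = {}
f1 = b"A" * 8192 + b"header"          # two identical 4 KiB blocks plus a tail
f2 = b"A" * 8192 + b"different tail"  # shares those two blocks with f1
p1, p2 = store_chunks(f1, store), store_chunks(f2, store)
print(len(p1) + len(p2), "pointers,", len(store), "stored chunks")  # 6 pointers, 3 chunks
```

Rebuilding a file then amounts to following the pointers, e.g. `b"".join(store[h] for h in p1)`.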

figure 2

Merits and demerits of data deduplication

The following are the benefits of data deduplication:

Less expensive: deduplication stores each unique piece of data only once, which reduces hosting costs.

Improved storage: more capacity becomes available because only modified files or data are stored.

Reduced energy consumption: there is no need to install additional disks or hardware to store customers' data, which results in less energy consumption.

Improved network bandwidth: less bandwidth is needed to transmit the data after the deduplication process is applied.

Better data recovery: data recovery services can ensure recovery and backup of the data accurately, quickly, and reliably.

The following are the demerits of data deduplication:

High computational overhead: chunking large files, calculating hash values of the chunks, and matching those hashes against existing ones increases the number of operations.

Inability to deduplicate encrypted data: deduplication cannot match and remove duplicate encrypted data, because the same data may be encrypted under different keys (see the sketch after this list).

Data breaches: despite the various protection techniques available, data on the cloud is still not fully secure, which can result in a data breach.

Backup appliance issue: the deduplication process may need an extra hardware device, known as a backup appliance, which increases cost and slows down performance.

Loss of data integrity: improper matching of data during deduplication results in loss of data integrity, in which case the data can never be decoded successfully.
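The interaction between encryption and deduplication noted above can be illustrated with a toy sketch of convergent encryption, the textbook remedy in which the key is derived from the content itself. The keystream below is a SHA-256 stand-in for a real cipher, and the whole snippet is a simplified illustration, not a description of any surveyed scheme.

```python
import hashlib
import os

def keystream(key: bytes, length: int) -> bytes:
    """Toy keystream built from SHA-256 in counter mode: illustration only,
    not a vetted cipher."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def encrypt(key: bytes, plaintext: bytes) -> bytes:
    return bytes(p ^ k for p, k in zip(plaintext, keystream(key, len(plaintext))))

data = b"the same file uploaded by two different users"

# Independent random keys: identical plaintexts become different ciphertexts,
# so the storage server sees no duplicates to remove.
print(encrypt(os.urandom(32), data) == encrypt(os.urandom(32), data))  # False

# Convergent encryption: the key is the hash of the content, so identical
# plaintexts always produce identical ciphertexts and deduplicate again.
ck = hashlib.sha256(data).digest()
print(encrypt(ck, data) == encrypt(ck, data))  # True
```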

Moving forward, the second section provides a detailed explanation of the research methodology, including the selection of databases and search patterns. The third section illustrates the outcomes of the bibliometric analysis with graphical representations, covering keyword analysis, author-related patterns, and various publication patterns. The fourth section details the data-visualization side of the bibliometric analysis, discussing conceptual, intellectual, and social patterns with their graphical representations. The next section discusses all results and analyses together with future trends in deduplication research. The last section briefly concludes the paper.

2 Research Methodology

By examining the links between various research components, the bibliometric approach applies quantitative tools to bibliographic data and summarizes the bibliometric and intellectual structure of data deduplication. It is necessary to highlight the contributions of a particular field and point out links, silos, trends, and possible gaps [20, 21]. As a result, bibliometric analysis offers a performance analysis and a science mapping that help determine the thematic progression of a subject [22, 23]. To conduct a bibliometric study, it is necessary to set some research questions. Formulating a research question is crucial because it helps focus a wide topic of interest onto a particular field of study, and research questions guide the plan for carrying out the research and the resulting articles. Table 2 presents the research questions along with their objectives.

A well-established approach, PRISMA, has been adopted for conducting the study. PRISMA enhances the scientific value and transparency of a systematic review. Its advantages for a research study include demonstrating the quality of the review, enabling readers to evaluate strengths and weaknesses, and permitting replication of the review process. The steps to obtain the data file on which the bibliometric analysis is performed are described below: selection of the database, search query, inclusion and exclusion criteria, selection of the analysis tool, and the resulting data for the bibliometric analysis. Descriptive analysis has been conducted in terms of authors, citations, sources, keywords, and documents. The steps are shown in Fig. 3 and are explained below:

Database selection: handling bibliographic data manually is very difficult due to the cost and labor involved [24, 25]. Various databases are available for extracting the data, but the Scopus database is preferred here due to its compatibility with the bibliometrix R package, so Scopus was selected for gathering data and conducting the bibliographic study. Scopus returned roughly twice as many articles as the alternatives and also indexes articles that are currently in press; excluding in-press articles would decrease the number of articles [24].

Preparing for data analysis: data were exported in CSV format, and the analysis was done with the R studio bibliometric application. To capture as many publications as possible in the data deduplication field, the search query TITLE-ABS-KEY (data AND deduplication AND on AND cloud AND computing) was used. Running this query extracted 493 documents from 2010 to 2023. The query matched the abstract, keywords, and title against domain terms such as data deduplication, de-duplication, and cloud computing [25, 26, 27, 28, 29]. The data file includes all document types, such as articles, early access, book chapters, and reviews, all in the English language only. To capture the research trends and patterns in the field of data deduplication, all articles from 2010 to 2023 were included.

Cleaning of data: cleaning the data means removing irrelevant documents. The process removed records such as abstracts, book chapters, early access, and editorial materials. After cleaning, 32 documents were removed and 461 remained relevant for further study (a pandas sketch of this step is given after Fig. 3).

Selection of tool: a bibliometric technique has been adopted for the scientific analysis. This kind of analysis is very popular among authors and has been conducted for many years using software tools. Various bibliometric tools are available for data analysis, such as Publish or Perish [30], HistCite [31], CiteSpace [32], BibExcel [33], Gephi [34], and VOSviewer [35, 36], among others. In this paper, the bibliometrix software developed in the R language is used [37]. It is an easy-to-use bibliometric tool: user friendly, easy to understand, requiring no prior training, open source (no license to buy), and capable of being upgraded, so it can be grasped easily even by novice users.

figure 3

Steps of bibliometric analysis
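As a rough illustration of the cleaning step above, the following Python sketch filters a Scopus CSV export by document type with pandas. The file name is hypothetical, the retained types are an assumption chosen to mirror the exclusion criteria, and the "Document Type" column name follows the standard Scopus export.

```python
import pandas as pd

# Hypothetical Scopus CSV export; column names follow the standard export format.
df = pd.read_csv("scopus_export.csv")

# Keep only the document types retained in the study; abstracts-only records,
# book chapters, early access, and editorial materials are dropped.
keep = {"Article", "Review", "Conference Paper"}
cleaned = df[df["Document Type"].isin(keep)]

print(len(df), "->", len(cleaned), "documents after cleaning")
cleaned.to_csv("scopus_cleaned.csv", index=False)
```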

Table 3 displays the types and frequencies of the documents studied related to data deduplication. A total of 461 documents were considered for study and analysis in this research after applying the exclusion criteria shown in Fig. 3. Articles have the highest frequency, review articles are far less numerous, and there is a single retracted article.

3 Descriptive Bibliometrics Trend Analysis

This section describes the descriptive analysis done on various parameters such as author patterns, sources, citations, and document patterns in the data set. The growth of recent research topics can be recognized immediately from the key terms and their frequency of reuse within a particular time interval. Figure 4 shows the relation between the top 20 authors, the top 20 most used keywords, and the top 20 sources in a three-field plot, also known as a Sankey plot [38]. The significance of this plot is to show the interrelation among the three fields: authors appear on the left side of the plot, keywords in the middle, and sources on the right. Sources indicate where authors have mostly published their articles. The most prominent keywords shown in the plot, alongside the sources and most cited authors, are cloud computing, data deduplication, deduplication, storage, data security, and data encryption; all are completely or partially related to the domain, and the listed authors gave these keywords new and meaningful shape by solving real-world problems. Analyzing sources is equally important; the figure shows that various IEEE and ACM journals were preferred for publishing the articles. Virtualization, privacy, edge computing, convergent encryption, and secure deduplication are emerging topics addressed by authors for future research work [39, 40, 41, 42].

figure 4

Three field plot

Annual scientific production is a graphical representation of the documents published from 2010 to 2023, with data taken from the Scopus database and processed in the R studio Biblioshiny tool. The results show that publication activity in data deduplication peaked first in 2016 with 50 publications and again in 2019 with 70 publications, with a slight dip to 39 articles in 2017. In Fig. 5, the y-axis represents the number of published articles and the x-axis the year of publication. The dots on the line graph indicate the number of articles per year and show the growth of data deduplication, which in turn helps researchers and academicians understand its importance. A short pandas and matplotlib sketch for reproducing such a plot is given after the figure.

figure 5

Annual scientific production
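A plot like Fig. 5 can be reproduced from the cleaned export with a few lines of pandas and matplotlib. This is a generic sketch rather than the Biblioshiny code; the file name is the hypothetical output of the cleaning step, and the "Year" column name follows the Scopus export.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("scopus_cleaned.csv")   # hypothetical cleaned export

per_year = df.groupby("Year").size()     # documents published per year

per_year.plot(marker="o")                # line graph with a dot per year
plt.xlabel("Year")
plt.ylabel("Number of published articles")
plt.title("Annual scientific production, 2010-2023")
plt.tight_layout()
plt.show()
```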

Figure 6 depicts the average citations per year for articles on data deduplication. The highest-cited year was 2011, with an average of 5.2, and the lowest was 2010, with an average of 0.8. The topic was not very popular until 2010; after 2010 it entered a maturity phase in which researchers recognized it as a research domain. The graph shows continuous growth with some setbacks and upward trends. The development of this research domain spans several phases: a forerunner stage, rampant growth, and a decline in the number of published articles. The literature study showed continued growth in publications through 2023.

figure 6

Average citations per year

Figure 7 shows the top 20 most cited sources. According to the figure, Lecture Notes in Computer Science has the highest citations and is also the top-impact source in terms of h-index. The journals related to data deduplication span various outlets, such as Lecture Notes in Computer Science, intelligent systems journals, IEEE Transactions, Cluster Computing, and cloud computing venues. Table 4 presents the top 20 sources, including books, journals, and conference proceedings, with their publishers and impact factors. IEEE [43, 44], Springer [13, 45, 46, 47], and Elsevier [48, 49, 50, 51, 52] are the top publishers in the data deduplication field, and most titles are submitted to the journals of these three.

figure 7

Most cited sources

The impact of journals is analyzed by the h-index, as presented in Fig. 8. The h-index is a convenient single metric for gauging the efficacy and quality of a journal from its published and cited articles: it relates the number of published articles to how often each is cited. Neither the sheer number of papers a journal publishes nor its single most cited documents alone determine the value; both volume and citations contribute. A short sketch of the computation is given after the figure.

figure 8

Impact of sources
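Because the h-index drives Figs. 8 and 10, a short sketch of how it is computed may help; this is the standard definition, not anything specific to Biblioshiny.

```python
def h_index(citations: list[int]) -> int:
    """Largest h such that at least h publications have h or more citations each."""
    h = 0
    for rank, cites in enumerate(sorted(citations, reverse=True), start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# A source whose papers are cited 10, 8, 5, 4, and 3 times has h-index 4:
# four of its papers have at least four citations each.
print(h_index([10, 8, 5, 4, 3]))  # 4
```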

Authors Zhang Y, Li J, Hur J, Zhang W, and Yan Z are the most relevant authors, having the maximum number of publications in the domain. Li J and Hur J each have more than 10 articles. Zhang Y has 15 published documents, the highest publication count in the field so far, while Zhang W and Yan Z share second position with nine articles each. This analysis can be seen clearly in Fig. 9.

figure 9

Most relevant authors

Author Li J has the highest h-index of 15. Figure 10 shows the authors' h-index analysis. Chen X and Hur J come next with an h-index of 7, whereas Koo D and Wang Y are in third position with a solid h-index of 6. From a future perspective, it is important to know the most productive authors: their work can be considered a seed for new researchers, and their publications are influential and productive.

figure 10

Author’s impact

Table 5 shows the contributions of different countries to the domain, i.e., which countries publish most of the literature on data deduplication. China is on top with the highest number of publications, while India is in second place; the USA also has an excellent production rate, followed by South Korea. Publications translate into citations, and more citations indicate higher-quality articles from a country. China has the maximum number of citations, more than 700, followed by Korea and India with smaller citation counts. This indicates gradually increasing awareness of the field among researchers. According to the bibliometric analysis, China outclasses the rest of the world in both publications and citations.

Figure 11 shows the top 20 most cited documents. The top 5 documents have more than 100 citations each. Halevi has the highest citation count of 357; Irazoqui has the second highest with 160; Li J follows with 134 citations in the domain. Koo D has 58 citations, and Yan Z, who stands in fifth place on the relevant-authors chart, also has an excellent document citation count of 54. The work of these authors is seminal and can be considered a reference for future work.

figure 11

Most cited documents

The word cloud visualizes the number of occurrences of the keywords researchers use in their articles. The bibliometric analysis of the data set produces the word cloud in Fig. 12. Keywords in a large font have high frequency, while decreasing size indicates lower frequency of use in the articles. Digital storage is clearly the most used term, followed by data deduplication, cloud computing, de-duplication, and cryptography. Besides these, terms like data privacy, cloud storage, big data, information management, network security, security of data, and secure data deduplication also feature prominently. The analysis suggests that various terms are still in an emerging phase. A sketch of the underlying keyword counting is given after the figure.

figure 12
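The frequencies behind such a word cloud are simple keyword counts. A hedged sketch, assuming the hypothetical cleaned Scopus export from earlier and its semicolon-separated "Author Keywords" column:

```python
from collections import Counter

import pandas as pd

df = pd.read_csv("scopus_cleaned.csv")  # hypothetical cleaned export

# Scopus stores author keywords as one semicolon-separated string per document.
counts = Counter()
for cell in df["Author Keywords"].dropna():
    for kw in cell.split(";"):
        kw = kw.strip().lower()
        if kw:
            counts[kw] += 1

# The 20 most frequent keywords are what the word cloud renders largest.
for kw, n in counts.most_common(20):
    print(f"{kw}: {n}")
```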

Various categories were drawn from the bibliometric analysis related to data deduplication and are arranged in the TreeMap shown in Fig. 13. The categories most used by authors are digital storage, cloud computing, data deduplication, de-duplication, and cryptography. Comparing the word cloud with the TreeMap, the results are mostly the same: the most prominent keyword in the word cloud has the highest percentage in the TreeMap, and the remaining keywords are used similarly in Figs. 12 and 13.

figure 13

TreeMap of top 20 categories of data deduplication field

Figure 14 represents the evolving topics that are currently in trend. The year is shown on the y-axis against the terms on the horizontal axis. The main topics in 2016 revolved around data reduction, data handling, and cloud computing environments. In 2017 and 2018, more topics entered the research agenda, such as cloud computing, de-duplication, big data, and digital storage; the most used, highest-frequency term in 2018 was digital storage. The term data deduplication emerged and became widely used in 2019. From 2019 to the present, data deduplication and cloud computing have remained the trending terms on which authors focus their research and publications.

figure 14

Trend topics

4 Data Visualisation Analysis

Data visualization employs network analysis for quantitative evaluation of the number of developing clusters, node frequency, linkages between different units of study, overall link strengths, and the number of citations. To extract the networks, several methodologies based on various analytic units, such as documents, authors, and keywords, must be drawn upon. The nodes in these networks are linked together through connections. Network-analysis-based science mapping yields three types of knowledge structure: conceptual, intellectual, and social [53].

4.1 Conceptual Analysis

Conceptual structures indicate the links between trends, topics, and themes by applying co-occurrence network analysis [54]. Results are drawn for the authors' keywords with 20 nodes; the layout is automatic, normalization is by association, and the clustering algorithm is Louvain's. The authors' keywords are based on the titles and content of documents; they are favored because they are used throughout the text and capture the information in documents from diverse angles. In Fig. 15, the cluster data is shown in three colors, red, blue, and green, representing different clusters. Edges indicate the relationship between two nodes, each word is represented by a vertex in the network, and a node's size reflects its number of occurrences. The green cluster relates to cloud security and secure deduplication, the red cluster indicates cloud computing and data deduplication, and the blue cluster highlights words like deduplication, cloud storage, security, privacy, and cloud [13, 55, 56]. A sketch of how such a network can be built is given after the figure.

figure 15

Co-occurrence network. (colour figure online)
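A co-occurrence network of this kind can be sketched in a few lines: two keywords are linked whenever they appear on the same document, and edge weights count how many documents link them. The snippet below uses networkx (whose `louvain_communities` function is available from version 2.8 onward) on the hypothetical cleaned export; it is an illustration of the technique, not the Biblioshiny implementation.

```python
from itertools import combinations

import networkx as nx
import pandas as pd

df = pd.read_csv("scopus_cleaned.csv")  # hypothetical cleaned export

G = nx.Graph()
for cell in df["Author Keywords"].dropna():
    kws = sorted({kw.strip().lower() for kw in cell.split(";") if kw.strip()})
    for a, b in combinations(kws, 2):
        # Increment the edge weight for every document where both keywords appear.
        weight = G.get_edge_data(a, b, default={"weight": 0})["weight"]
        G.add_edge(a, b, weight=weight + 1)

# Louvain community detection recovers keyword clusters like those in Fig. 15.
for i, community in enumerate(nx.community.louvain_communities(G, seed=1)):
    print(i, sorted(community)[:5])
```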

A thematic map is a classification plot based on themes, drawn in two dimensions, with themes categorized into four quadrants according to their centrality (x-axis) and density (y-axis) (Fig. 16). The upper right quadrant holds the motor themes, clusters that are both well developed and central to the field (digital storage, cloud computing, and data de-duplication). The lower right quadrant holds the basic themes (deduplication, efficiency, cryptography, and data privacy), which are central but less developed. The upper left quadrant holds the niche themes (storage space, data handling, network security, virtual machines, and cloud environments), which are well developed internally but more specialized; the keywords in this quadrant indicate topics on which authors publish frequently and where a lot of work has appeared. In the lower left quadrant, the cluster of themes (information management, storage management, and bandwidth) is emerging or declining, with weak internal and external ties. A sketch of the quadrant assignment is given after the figure.

figure 16

Thematic map
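The quadrant assignment itself is mechanical once each theme has a centrality and a density score: themes are split at the median of each axis. A minimal sketch with made-up scores (the real values come from the co-word analysis):

```python
from statistics import median

# Hypothetical (centrality, density) scores for four theme clusters.
themes = {
    "digital storage":    (0.9, 0.8),
    "deduplication":      (0.8, 0.3),
    "network security":   (0.3, 0.9),
    "storage management": (0.2, 0.2),
}

cx = median(c for c, _ in themes.values())  # median centrality
dy = median(d for _, d in themes.values())  # median density

def quadrant(c: float, d: float) -> str:
    if c >= cx and d >= dy:
        return "motor theme"          # upper right: central and well developed
    if c >= cx:
        return "basic theme"          # lower right: central, less developed
    if d >= dy:
        return "niche theme"          # upper left: developed but specialized
    return "emerging or declining"    # lower left: weak on both axes

for name, (c, d) in themes.items():
    print(f"{name}: {quadrant(c, d)}")
```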

Figure 17 depicts the thematic evolution map period by period. This conceptual plot is divided into three intervals: 2010 to 2017, 2018 to 2020, and 2021 to 2023. The evolution map is drawn from keyword analysis and illustrates which keywords authors used most prominently in each period. As seen in the figure, the keywords digital storage and cryptography were used from 2010 to 2017, carried forward through 2020, and continue to the present. Information security (2018 to 2020) and efficiency (2021 to 2023) are the next most adopted keywords. This plot is important for understanding the trending keywords period by period.

figure 17

Period wise thematic map

4.2 Intellectual Analysis

By examining author and country collaborations, intellectual structure analysis shows how diverse writers affect the scientific community. It demonstrates the degree of collaboration between research organizations and research fellowships, as well as their ties to various institutes. Intellectual structures resulting from citation and co-citation analyses show the different viewpoints and schools of thought that have developed over time. A citation means that one document appears in the reference list of another; co-citation analysis examines the correlation between citing and cited articles and also measures the similarity among authors, documents, and journals. Citation and co-citation analyses are quantitative approaches [57]. Figure 18 presents the co-citation network of papers, with 20 nodes, clustered with the Louvain algorithm. The figure reveals three clusters of authors in red, blue, and green. In the blue cluster, Jiang H. and Zhang Y. are the most prominent authors, with the greatest number of connections. In the red cluster, Ballare is the author most followed by the others, with the highest centrality measures. Li, the most influential author of the green cluster [58, 59, 60, 61], has the highest betweenness. Figure 19 depicts the co-citation analysis of the sources of published articles. The most preferred source is IEEE Transactions on Parallel and Distributed Systems; IEEE sources dominate owing to their reliability, and only IEEE sources are visible and highlighted in the figure, such as IEEE Access, IEEE Transactions on Computers, IEEE Transactions on Parallel and Distributed Systems, IEEE Transactions on Information Forensics and Security, and more. The cumulative degree of all sources is shown in Fig. 20: the dots represent the various sources on a scale of nodes versus cumulative degree. IEEE Transactions on Parallel and Distributed Systems has the highest degree of 1, whereas the rest lie between 0 and 1. A sketch of co-citation counting is given after the figures.

figure 18

Co-citation network-papers. (colour figure online)

figure 19

Co-citation network-sources

figure 20

Cumulative degree of co-citation network-sources
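Co-citation counting is easy to state precisely: two references are co-cited once for every citing paper whose reference list contains both. A toy sketch (the reference lists are invented; real input comes from the bibliographic records):

```python
from collections import Counter
from itertools import combinations

# Each citing paper is reduced to the set of references it cites (toy data).
reference_lists = [
    {"Halevi 2011", "Li 2014", "Hur 2016"},
    {"Halevi 2011", "Li 2014"},
    {"Halevi 2011", "Irazoqui 2015"},
]

cocitations = Counter()
for refs in reference_lists:
    for pair in combinations(sorted(refs), 2):
        cocitations[pair] += 1

# Frequently co-cited pairs become the strong edges of a network like Fig. 18.
for pair, n in cocitations.most_common(3):
    print(pair, n)
```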

4.3 Social Network Analysis

Social network analysis was performed to draw out the interconnections within the research field. Nodes denote players in a social network, such as researchers and institutions, and sets of nodes show related relationships; collaborations among these actors represent the network's dynamics. Geographical collaboration was analyzed through the interlinks among countries sharing common topics. Figure 21 shows that China is the most dominant country in the data deduplication field, collaborating strongly with the USA, Australia, and Canada based on shared author interests. In the second cluster, India is the most dominant country, followed by the United Kingdom, Saudi Arabia, Australia, and Germany. Institutional collaboration is presented in Fig. 22, where Xidian University is the leading university.

figure 21

Collaboration network-countries

figure 22

Collaboration network-institutes

5 Discussion

Through the bibliometric analysis, this paper has surveyed the evolution and structure of the data deduplication field and explained its major themes. Some outcomes may be missing because only one database, Scopus, was considered; other important databases such as the Web of Science, which could provide a wider range of results, were not included. The outcomes show that development in data deduplication has increased gradually. The data used for the bibliometric analysis covers 2010 to 2023, extracted with a search query, and only English-language articles were considered.

Figure 4 illustrates the three-field plot with 20 nodes per field. It shows the top 20 keywords on which the top 20 authors are working, and the top 20 sources in which they publish. This helps future researchers learn the most used keywords, the main publishers in the field, and promising future keywords. Publications in the data deduplication field are increasing year by year as awareness grows among authors. Research papers have been published every year, but according to the results, the most articles, 70, were published in 2019 (Fig. 5). Citations play an even more important role in research: Fig. 6 shows the average citations in a zig-zag pattern, with the most citations received in 2011. Since 2012, citation trends have gone up and down to the present day; the average citation count from 2010 to 2023 is 1.4. The findings indicate that Elsevier, Springer, and IEEE are the most prolific publishers, and China, India, and the USA are the most prolific countries in the field, with high citation counts.

A keyword analysis also supports some important conclusions. Many disciplines are associated with the field, but researchers have focused on a few common and central keys. As shown in Figs. 12 and 13, the most prominent keywords were digital storage, cloud computing, and data deduplication, and these three are used together: data deduplication is performed in cloud computing because of the shortage of cloud storage. The thematic map in Fig. 16 shows the density and centrality of data deduplication. Few authors have focused on data deduplication so far, but the topic is developing at the same time.

This paper also provides directions for future researchers, who can draw on the outcomes presented here to identify keywords, understand the patterns of author and document citations, identify the most prolific sources, countries, and institutions, and much more.

6 Conclusions

This bibliometric study covers data deduplication over the period from 2010 to 2023 and has mapped out the structure of research in the field, showing researchers' interest in the domain. The results show an increasing number of articles and citations in data deduplication. Several themes are associated with data deduplication, such as storage, cloud computing, and de-duplication. The largest numbers of publications came from China and India; awareness of data deduplication is increasing among authors slowly and gradually, and they have begun more research in the field. This work carried out a thorough and comprehensible visual analysis of data deduplication papers in the key journals indexed by Scopus. The quantity of publications has generally increased since 2010, although collaboration ties among experts in the sector remain fragmented. China was the first country to produce articles on deduplication regularly and is the country with the highest productivity and centrality; India and the USA started after China but grew faster and became the second and third countries in article production. The authors and papers with the greatest numbers of citations and co-citations were identified as the most productive; the article titled "Proofs of ownership in remote storage systems" received the most co-citations. Journals from IEEE, followed by journals from Springer and Elsevier, dominate research in this field. Hot subjects include cloud security, cloud computing, deduplication, cloud storage, cloud servers, data deduplication, big data, digital storage, and data handling. The findings demonstrate that as research advances, experts pay greater attention to the consequences of duplicate data in the cloud addressed by the data deduplication process, and the research goals are becoming more focused. Future research may examine other aspects and management patterns of data deduplication and how the identified gaps affect its sustainability. In conclusion, a thorough examination of a specific subject in the literature yields a more complete and objective scientific interpretation of it.

Data Availability

Data is available upon reasonable request from the corresponding author.

Abbreviations

PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses

CSV: Comma-Separated Values

Rasina Begum B, Chitra P (2021) SEEDDUP: a three-tier secure data deduplication architecture-based storage and retrieval for cross-domains over cloud. IETE J Res. https://doi.org/10.1080/03772063.2021.1886882


Mao Z, Xue Y, Wang H, Ou W (2019) Research on big data encryption algorithms based on data deduplication technology. In: 2019 international conference on electronic engineering and informatics (EEI). pp 520–522. https://doi.org/10.1109/EEI48997.2019.00118

Malathi P, Suganthidevi S (2021) Comparative study and secure data deduplication techniques for cloud computing storage. In: 2021 international conference on innovative computing, intelligent communication and smart electrical systems (ICSES). pp 1–5. https://doi.org/10.1109/ICSES52305.2021.9633960

Zhang D, Le J, Mu N, Wu J, Liao X (2023) Secure and efficient data deduplication in jointcloud storage. IEEE Trans Cloud Comput 11(1):156–167. https://doi.org/10.1109/TCC.2021.3081702

Viji D, Revathy S (2021) Comparative analysis for content defined chunking algorithms in data deduplication. Spec Issue Inf Retr Web Search 8:255–268. https://doi.org/10.14704/WEB/V18SI02/WEB18070

Wang C, Fu Y, Yan J, Wu X, Zhang Y, Xia H, Yuan Y (2021) A cost-efficient resemblance detection scheme for post-deduplication delta compression in backup systems. wileyonlinelibrary.com/journal/cpe:1-13. https://doi.org/10.1002/cpe.6558

Kumar PMA, Pugazhendhi E, Nayak RK (2022) Cloud storage performance improvement using deduplication and compression techniques. In: 2022 4th international conference on smart systems and inventive technology (ICSSIT). pp 443–449. https://doi.org/10.1109/ICSSIT53264.2022.9716524

Keith W. How does data deduplication work? https://www.actualtechmedia.com/io/how-data-deduplication-works . Accessed 20 May 2014

Chhabra N, Bala M (2020) A comparative study of data deduplication strategies. In: First international conference on secure cyber computing and communication (ICSCCC) 2020. pp 68–72. https://doi.org/10.1109/ICSCCC.2018.8703363

Priya J, Vinothini C, Dinesh PS, Reshmi TS (2021) Data deduplication techniques: a comparative analysis. Int J Aquat Sci 12(3):1057–1065


Prajapati P, Shah P (2022) A review on secure data deduplication: cloud storage security issue. J King Saud Univ Comput Inf Sci 34(7):3996–4007. https://doi.org/10.1016/j.jksuci.2020.10.021

Ni F, Jiang S (2019) RapidCDC: leveraging duplicate locality to accelerate chunking in CDC-based deduplication systems. In: Proceedings of the ACM symposium on cloud computing. pp 220–232. https://doi.org/10.1145/3357223.3362731

Shakarami A, Ghobaei-Arani M, Shahidinejad A, Masdari M, Shakarami H (2021) Data replication schemes in cloud computing: a survey. Clust Comput 24(3):2545–2579. https://doi.org/10.1007/s10586-021-03283-7

Lakshmi Narayana N, Tirapathi Reddy B (2020) A comprehensive study on data deduplication techniques in cloud storage systems. High Technol Lett 26(10):670–678

Kim WB, Lee IY (2021) Survey on data deduplication in cloud storage environments. J Inf Process Syst 17(3):658–673

Satish V, Singh DK (2016) Secure deduplication techniques: a study. Int J Comp Appl 137(8):41–43. https://doi.org/10.5120/ijca2016908874

Rajput U, Shinde S, Thakur P, Patil G, Deokar P (2022) Analysis on deduplication techniques for storage of data in cloud. Int Res J Eng Technol 9(5):296–304

What are the real benefits of data deduplication in Cloud? https://www.webwerks.in/blogs/what-are-real-benefits-data-deduplication-cloud . Accessed 5 Dec 2022

Data deduplication 101. https://www.computerweekly.com/tutorial/data-deduplication-101 . Accessed 5 Dec 2022

Donthu N, Kumar S, Mukherjee D, Pandey N, Lim WM (2021) How to conduct a bibliometric analysis: an overview and guidelines. J Bus Res 133:285–296. https://doi.org/10.1016/j.jbusres.2021.04.070

Block JH, Fisch C (2020) Eight tips and questions for your bibliographic study in business and management research. Springer Manag Rev Q 70:307–312. https://doi.org/10.1007/s11301-020-00188-4

Cobo MJ, Lopez-Herrera AG, Herrera-Viedma E, Herrera F (2011) An approach for detecting, quantifying, and visualizing the evolution of a research field: a practical application to the fuzzy sets theory field. J Informetr 5:146–166. https://doi.org/10.1016/j.joi.2010.10.002

Rojas-Sánchez MA, Palos-Sánchez PR, Folgado-Fernández JA (2023) Systematic literature review and bibliometric analysis on virtual reality and education. Educ Inf Technol 28:155–192. https://doi.org/10.1007/s10639-022-11167-5

Garg D, Sidhu J, Rani S (2019) Emerging trends in cloud computing security: a bibliometric analyses. IET Inst Eng Technol 13(3):223–231. https://doi.org/10.1049/iet-sen.2018.5222

Hr S, Thangam (2021) A hybrid cloud approach for efficient data storage and security. In: 6th international conference on communication and electronics systems (ICCES). pp 1072–1076. https://doi.org/10.1109/ICCES51350.2021.9488938

Sharma D, Kumar G, Sharma R (2021) Analysis of heterogeneous data storage and access control management for cloud computing under M/M/c queueing model. Int J Cloud Appl Comput 11(3):58–71. https://doi.org/10.4018/IJCAC.2021070104

Khattar N, Singh J, Sidhu J (2019) Multi-criteria-based energy-efficient framework for VM placement in cloud data centers. Arab J Sci Eng 44:9455–9469. https://doi.org/10.1007/s13369-019-04048-6

Nivedha R, Arshiya SS (2019) An effective system for storing data and resources using cloud computing. Int J Innov Technol Explor Eng 8(6S4):435–437

Serenko A, Bontis N (2004) Meta-review of knowledge management and intellectual capital literature: citation impact and research productivity rankings, knowledge and process management. Wiley Publisher 11(3):185–190. https://doi.org/10.1002/kpm.203

“The Publish or Perish Book,” Harzing.com. https://harzing.com/publications/publish-or-perish-book/pdf . Accessed 10 May 2023

Garfield E (2004) Historiographic mapping of knowledge domains literature. J Inf Sci 30(2):119–145. https://doi.org/10.1177/0165551504042

Jayantha WM, Oladinrin OT (2019) Bibliometric analysis of hedonic price model using CiteSpace. Int J Hous Mark Anal 13(2):357–371. https://doi.org/10.1108/IJHMA-04-2019-0044

Perrson O, Danell R, Schneider JW (2009) How to use Bibexcel for various types of bibliometric analysis. In: Celebrating scholarly communication studies. pp 9–24

Bastian M, Heymann S, Jacomy M (2009) Gephi: an open source software for exploring and manipulating networks. In: Proceedings of the international AAAI conference on web and social media, vol 3, no 1. pp 361–362. https://doi.org/10.1609/icwsm.v3i1.13937

Rialti R, Marzi G, Ciappei C, Busso D (2019) Big data and dynamic capabilities: a bibliometric analysis and systematic literature review. Manag Decis 57(2):2052–2068. https://doi.org/10.1108/MD-07-2018-0821

Eck VNJ, Waltman L (2010) Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics 84:523–538. https://doi.org/10.1007/s11192-009-0146-3


Ariaa M, Cuccurullo C (2017) Bibliometrix: an R-tool for comprehensive science mapping analysis. J Informetr 11(4):959–975. https://doi.org/10.1016/j.joi.2017.08.007

Riehmann P, Hanfler M, Froehlich B (2005) Interactive Sankey diagrams. In: IEEE symposium on information visualization INFOVIS. pp 233–240. https://doi.org/10.1109/INFVIS.2005.1532152

Wang T, Yang M, Guo Y, Wang J (2021) Virtualized resource image storage system based on data deduplication techniques. In: 2021 IEEE international conference on computer science, electronic information engineering and intelligent control technology (CEI). pp 298–302. https://doi.org/10.1109/CEI52496.2021.9574536

Vianny MM, Vempati S, Pazhanivel K, Khasim S (2022) Intelligent compression scheme for securing storage preservation in virtualized hybrid cloud. ECS Trans 107(1):16689–16697. https://doi.org/10.1149/10701.16689ecst


Ming Y, Wang C, Liu H, Zhao Y, Feng J, Zhang N, Shi W (2022) Blockchain-enabled efficient dynamic cross-domain deduplication in edge computing. IEEE Internet Things J 9(17):15639–15656. https://doi.org/10.1109/JIOT.2022.3150042

Yuvaraj D, Kumar VP, Anandaram H, Samatha B, Krishnamoorthy R, Thiyagarajan R (2022) Secure DE-duplication over wireless sensing data using convergent encryption. In: 2022 IEEE 3rd global conference for advancement in technology (GCAT). pp 1–5. https://doi.org/10.1109/GCAT55367.2022.9971983

Teng Y, Xian H, Lu Q, Guo F (2023) A data deduplication scheme based on DBSCAN with tolerable clustering deviation. IEEE Access 11:9742–9750. https://doi.org/10.1109/ACCESS.2022.3231604

Xia W, Wei C, Li Z, Wang X, Zou X (2022) NetSync: a network adaptive and deduplication-inspired delta synchronization approach for cloud storage services. IEEE Trans Parallel Distrib Syst 33(10):2554–2570. https://doi.org/10.1109/TPDS.2022.3145025

Afek Y, Giladi G, Patt-Shamir B (2021) Distributed computing with the cloud. In: Lecture notes in computer science. pp 1–20. https://doi.org/10.48550/arXiv.2109.12930

You W, Chen B (2020) Proofs of ownership on encrypted cloud data via Intel SGX. In: Lecture notes in computer science, vol 12418. pp 400–416. https://doi.org/10.1007/978-3-030-61638-0_22

Wang Z, Gao W, Yang M, Hao R (2022) Enabling secure data sharing with data deduplication and sensitive information hiding in cloud-assisted Electronic Medical Systems. Clust Comput. https://doi.org/10.1007/s10586-022-03785-y

Phyu MP, Sinha GR (eds) (2021) Efficient data deduplication scheme for scale-out distributed storage. In: Data deduplication approaches. pp 153–182

Patra SS, Jena S, Mohanty JR, Gourisaria MK (eds) (2021) DedupCloud: an optimized efficient virtual machine deduplication algorithm in cloud computing environment. In: Data deduplication approaches. Elsevier, pp 281–306

Girish DS, Bhurane AA (eds) (2021) Essentials of data deduplication using open-source toolkit. In: Data deduplication approaches. Elsevier, pp 125–151

Koushik CSN, Choubey SB, Choubey A, Sinha GR (eds) (2021) Data deduplication for cloud storage. In: Data deduplication approaches. Elsevier, pp 307–317

Mandal R, Mondal MK, Banerjee S, Chakraborty C, Biswas U (2021) A survey and critical analysis on energy generation from datacenter. In: Data deduplication approaches. Elsevier, pp 203–230

Muskan, Singh G, Singh J, Prabha C (2022) Data visualization and its key fundamentals: a comprehensive survey. In: 7th international conference on communication and electronics systems (ICCES). pp 1710–1714. https://doi.org/10.1109/ICCES54183.2022.9835803

Li T, Bai J, Yang X, Liu Q, Chen Y (2018) Co-occurrence network of high-frequency words in the bioinformatics literature: structural characteristics and evolution. Appl Sci 8(10):1–14. https://doi.org/10.3390/app8101994

Jayanthi MK, Saithya PVN, Vaibhavi PS, Reddy YH (2022) Achieving efficient data deduplication and key aggregation encryption system in cloud. In: International conference on intelligent emerging methods of artificial intelligence & cloud computing, vol 273. pp 328–340. https://doi.org/10.1007/978-3-030-92905-3_42

Panyam AS, Jakkula PK, Rao N (2021) Significant cloud computing service for secured heterogeneous data storing and its managing by cloud users. In: 5th international conference on trends in electronics and informatics (ICOEI). pp 1447–1450. https://doi.org/10.1109/ICOEI51242.2021.9452970

Rodríguez-Ruiz F, Almodóvar P, Nguyen Q (2019) Intellectual structure of international new venture research: a bibliometric analysis and suggestions for a future research agenda. Multinatl Bus Rev 27(4):285–316. https://doi.org/10.1108/MBR-01-2018-0003

Zhang D, Deng Y, Zhou Y, Li J, Zhu W, Min G (2022) MGRM: a multi-segment greedy rewriting method to alleviate data fragmentation in deduplication-based cloud backup systems. IEEE Trans Cloud Comput. https://doi.org/10.1109/TCC.2022.3214816

Li J, Li T, Liu Z, Chen X (2019) Secure deduplication system with active key update and its application in IoT. ACM Trans Intell Syst Technol 10(6):1–21. https://doi.org/10.1145/3356468

Li J, Hou M (2018) Improving data availability for deduplication in cloud storage. Int J Grid High Perform Comput 10(2):70–89. https://doi.org/10.4018/IJGHPC.2018040106

Wei J, Niu X, Zhang R, Liu J, Yao Y (2017) Efficient data possession–checking protocol with deduplication in cloud. Int J Distrib Sens Netw. https://doi.org/10.1177/1550147717727461


This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations

Chitkara University Institute of Engineering and Technology, Chitkara University, Punjab, India

Anjuli Goel, Chander Prabha & Preeti Sharma

Skill Faculty of Engineering and Technology, Shri Viswakarma Skill University, Palwal, Haryana, 121102, India

Nitin Mittal

Electronics and Communication Engineering Department, Chandigarh University, Mohali, Punjab, 140413, India

Vikas Mittal


Contributions

All authors contributed sufficiently to the article to be recognized as co-authors. Conceptualization, methodology, and implementation: AG; draft preparation: CP; supervision, writing, review, and editing: PS; writing, review, and editing: NM; writing, review, and editing: VM.

Corresponding author

Correspondence to Nitin Mittal .

Ethics declarations

Conflict of Interest

The authors affirm that they have no known financial or interpersonal conflicts that could have appeared to influence the research presented in this study.

Ethical Approval

All procedures followed were in accordance with the ethical standards of the responsible committee on human experimentation (institutional and national).

Human and Animal Rights

This article does not contain any studies with human or animal subjects performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Goel, A., Prabha, C., Sharma, P. et al. Emerging Research Trends in Data Deduplication: A Bibliometric Analysis from 2010 to 2023. Arch Computat Methods Eng (2024). https://doi.org/10.1007/s11831-024-10074-x


Received: 23 November 2023

Accepted: 17 January 2024

Published: 26 February 2024

DOI: https://doi.org/10.1007/s11831-024-10074-x



Stanford Medicine study identifies distinct brain organization patterns in women and men

Stanford Medicine researchers have developed a powerful new artificial intelligence model that can distinguish between male and female brains.

February 20, 2024


A new study by Stanford Medicine investigators unveils a new artificial intelligence model that was more than 90% successful at determining whether scans of brain activity came from a woman or a man.

The findings, published Feb. 20 in the Proceedings of the National Academy of Sciences, help resolve a long-term controversy about whether reliable sex differences exist in the human brain and suggest that understanding these differences may be critical to addressing neuropsychiatric conditions that affect women and men differently.

“A key motivation for this study is that sex plays a crucial role in human brain development, in aging, and in the manifestation of psychiatric and neurological disorders,” said Vinod Menon, PhD, professor of psychiatry and behavioral sciences and director of the Stanford Cognitive and Systems Neuroscience Laboratory. “Identifying consistent and replicable sex differences in the healthy adult brain is a critical step toward a deeper understanding of sex-specific vulnerabilities in psychiatric and neurological disorders.”

Menon is the study’s senior author. The lead authors are senior research scientist Srikanth Ryali, PhD, and academic staff researcher Yuan Zhang, PhD.

“Hotspots” that most helped the model distinguish male brains from female ones include the default mode network, a brain system that helps us process self-referential information, and the striatum and limbic network, which are involved in learning and how we respond to rewards.

The investigators noted that this work does not weigh in on whether sex-related differences arise early in life or may be driven by hormonal differences or the different societal circumstances that men and women may be more likely to encounter.

Uncovering brain differences

The extent to which a person’s sex affects how their brain is organized and operates has long been a point of dispute among scientists. While we know the sex chromosomes we are born with help determine the cocktail of hormones our brains are exposed to — particularly during early development, puberty and aging — researchers have long struggled to connect sex to concrete differences in the human brain. Brain structures tend to look much the same in men and women, and previous research examining how brain regions work together has also largely failed to turn up consistent brain indicators of sex.


In their current study, Menon and his team took advantage of recent advances in artificial intelligence, as well as access to multiple large datasets, to pursue a more powerful analysis than has previously been employed. First, they created a deep neural network model, which learns to classify brain imaging data: As the researchers showed brain scans to the model and told it that it was looking at a male or female brain, the model started to “notice” what subtle patterns could help it tell the difference.

This model demonstrated superior performance compared with those in previous studies, in part because it used a deep neural network that analyzes dynamic MRI scans. This approach captures the intricate interplay among different brain regions. When the researchers tested the model on around 1,500 brain scans, it could almost always tell if the scan came from a woman or a man.

The model’s success suggests that detectable sex differences do exist in the brain but just haven’t been picked up reliably before. The fact that it worked so well in different datasets, including brain scans from multiple sites in the U.S. and Europe, makes the findings especially convincing, as it controls for many confounds that can plague studies of this kind.

“This is a very strong piece of evidence that sex is a robust determinant of human brain organization,” Menon said.

Making predictions

Until recently, a model like the one Menon’s team employed would help researchers sort brains into different groups but wouldn’t provide information about how the sorting happened. Today, however, researchers have access to a tool called “explainable AI,” which can sift through vast amounts of data to explain how a model’s decisions are made.

Using explainable AI, Menon and his team identified the brain networks that were most important to the model’s judgment of whether a brain scan came from a man or a woman. They found the model was most often looking to the default mode network, striatum, and the limbic network to make the call.

The team then wondered if they could create another model that could predict how well participants would do on certain cognitive tasks based on functional brain features that differ between women and men. They developed sex-specific models of cognitive abilities: One model effectively predicted cognitive performance in men but not women, and another in women but not men. The findings indicate that functional brain characteristics varying between sexes have significant behavioral implications.

“These models worked really well because we successfully separated brain patterns between sexes,” Menon said. “That tells me that overlooking sex differences in brain organization could lead us to miss key factors underlying neuropsychiatric disorders.”

While the team applied their deep neural network model to questions about sex differences, Menon says the model can be applied to answer questions regarding how just about any aspect of brain connectivity might relate to any kind of cognitive ability or behavior. He and his team plan to make their model publicly available for any researcher to use.

“Our AI models have very broad applicability,” Menon said. “A researcher could use our models to look for brain differences linked to learning impairments or social functioning differences, for instance — aspects we are keen to understand better to aid individuals in adapting to and surmounting these challenges.”

The research was sponsored by the National Institutes of Health (grants MH084164, EB022907, MH121069, K25HD074652 and AG072114), the Transdisciplinary Initiative, the Uytengsu-Hamilton 22q11 Programs, the Stanford Maternal and Child Health Research Institute, and the NARSAD Young Investigator Award.

About Stanford Medicine

Stanford Medicine is an integrated academic health system comprising the Stanford School of Medicine and adult and pediatric health care delivery systems. Together, they harness the full potential of biomedicine through collaborative research, education and clinical care for patients. For more information, please visit med.stanford.edu.



COMMENTS

  1. Learning to Do Qualitative Data Analysis: A Starting Point

    Research article First published online February 9, 2020 Learning to Do Qualitative Data Analysis: A Starting Point Jessica Nina Lester, Yonjoo Cho https://orcid.org/0000-0003-2757-5712, and Chad R. Lochmiller View all authors and affiliations Volume 19, Issue 1 https://doi.org/10.1177/1534484320903890 PDF / ePub More Abstract

  2. data analysis Latest Research Papers

    pp. 567-590 Author (s): Kiomars Motarjem Keyword (s): Data Analysis Random Effects Survival Model Download Full-text Futuristic Prediction of Missing Value Imputation Methods Using Extended ANN International Journal of Business Analytics 10.4018/ijban.292055 2022 Vol 9 (3) pp. 0-0 Keyword (s): Data Analysis Missing Data

  3. (PDF) ANALYSIS OF DATA

    Data Analysis is a process of applying statistical practices to organize, represent, describe, evaluate, and interpret data. In statistical applications data analysis can be divided into-...

  4. Home

    The International Journal of Data Science and Analytics is a pioneering journal in data science and analytics, publishing original and applied research outcomes. Focuses on fundamental and applied research outcomes in data and analytics theories, technologies and applications.

  5. Data Science and Analytics: An Overview from Data-Driven Smart

    In this paper, we present a comprehensive view on "Data Science" including various types of advanced analytics methods that can be applied to enhance the intelligence and capabilities of an application through smart decision-making in different scenarios.

  6. A practical guide to data analysis in general literature reviews

    A general literature review starts with formulating a research question, defining the population, and conducting a systematic search in scientific databases, steps that are well-described elsewhere [1, 2, 3]. Once students feel confident that they have thoroughly combed through relevant databases and found the most relevant research on the topic, ...

  7. Rapid and Rigorous Qualitative Data Analysis: The "RADaR" Technique for

    Similarly, some researchers argue that the most time-consuming step in a qualitative research project occurs during data analysis, as the amount of data (e.g., number of pages) and depth of the data generated from qualitative data collection methods can exceed that of quantitative data collection methods. ... papers, and data reports. Popular ...

  8. Computational Statistics & Data Analysis

    1) the explicit impact of computers on statistical methodology (e.g., Bayesian computing, bioinformatics, computer graphics, computer intensive inferential methods, data exploration, data mining, expert systems, heuristics, knowledge based systems, machine learning, neural networks, numerical and optimization methods, parallel computing, statisti...

  9. Data Analysis in Quantitative Research

    Yong Moon Jung. Reference work entry, first online: 13 January 2019. Abstract: Quantitative data analysis serves as an essential part of the process of evidence-making in health and social sciences.

  10. (PDF) Different Types of Data Analysis; Data Analysis Methods and

    The data analysis process, generally, aims to achieve statistical relationships between the variables [8]. ... What are Different Research Approaches? Comprehensive Review of Qualitative, ...

  11. Research themes in big data analytics for policymaking: Insights from a

    This approach has been used in several research papers to form the basis for research agenda building (Suominen et al., 2019; Yuan et al., 2015). Using the retrieved publication metadata, the VOSviewer tool (van Eck & Waltman, 2009) was selected to calculate bibliographical coupling weights for all the documents in our data set. VOSviewer is a ...
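
    The bibliographic coupling that VOSviewer computes here has a simple core definition: the coupling weight of two documents is the number of cited references they share. A toy illustration in plain Python with made-up reference lists (VOSviewer's own normalization and clustering go further):

    ```python
    # Bibliographic coupling: two papers are coupled when their reference
    # lists overlap; the coupling weight is the size of that overlap.
    from itertools import combinations

    refs = {  # hypothetical reference lists keyed by document id
        "doc1": {"r1", "r2", "r3"},
        "doc2": {"r2", "r3", "r4"},
        "doc3": {"r5"},
    }

    weights = {
        (a, b): len(refs[a] & refs[b])
        for a, b in combinations(sorted(refs), 2)
        if refs[a] & refs[b]
    }
    print(weights)  # {('doc1', 'doc2'): 2}
    ```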

  12. A new theoretical understanding of big data analytics capabilities in

    Data analysis is complex, but one data-handling method, "Big Data Analytics" (BDA), is the application of advanced analytic techniques, ... The number of published papers on Big Data is increasing. Between 2015 and May 2021, the highest proportion of journal articles for any given year (21%) occurred in the period up to May 2021, with the inclusion of ...

  13. Data Analysis in Research: Types & Methods

    Definition of research in data analysis: According to LeCompte and Schensul, research data analysis is a process used by researchers to reduce data to a story and interpret it to derive insights. The data analysis process helps reduce a large chunk of data into smaller fragments, which makes sense.

  14. Big data analytics: a survey

    Chun-Wei Tsai, Chin-Feng Lai, Han-Chieh Chao & Athanasios V. Vasilakos. Journal of Big Data 2, Article number 21 (2015). Abstract: The age of big data is now coming.

  15. (PDF) Data Analytics: A Literature Review Paper

    The papers were mapped onto a framework according to their methodological stance, approaches to data gathering, and data analysis. This paper also discusses the implications of the analysis in ...

  16. Big data analytics and firm performance: Findings from a mixed-method

    A big data analytics capability is defined as the ability of a firm to capture and analyze data towards the generation of insights by effectively orchestrating and deploying its data, technology, and talent (Mikalef et al., 2018).

  17. Data Science and Analytics: An Overview from Data-Driven Smart

    The digital world has a wealth of data, such as internet of things (IoT) data, business data, health data, mobile data, urban data, security data, and many more, in the current age of the Fourth Industrial Revolution (Industry 4.0 or 4IR). Extracting knowledge or useful insights from these data can be used for smart decision-making in various application domains. In the area of data science ...

  18. The use of Big Data Analytics in healthcare

    The introduction of Big Data Analytics (BDA) in healthcare will allow the use of new technologies both in treatment of patients and in health management. The paper aims at analyzing the possibilities of using Big Data Analytics in healthcare. The research is based on a critical analysis of the literature, as well as the presentation of selected results of direct research on the use of Big Data ...

  19. Data science and big data analytics: a systematic review of ...

    Data science and big data analytics (DS&BDA) methodologies and tools are used extensively in supply chains and logistics (SC&L). However, the existing insights are scattered over different literature sources, and there is a lack of a structured and unbiased review methodology to systematise DS&BDA application areas in the SC&L comprehensively, covering efficiency, resilience and ...

  20. Research on Data Science, Data Analytics and Big Data

    Data Analytics has shown such tremendous growth across the globe that Big Data market revenue is soon expected to grow by 50 percent. It has an impact on various sectors such as traveling and transportation, financial analysis, retail, research, energy management, and healthcare. Keywords: Data Science, Data Analytics, Big Data

  21. PDF Structure of a Data Analysis Report

    Data - Methods - Analysis - Results. This format is very familiar to those who have written psych research papers. It often works well for a data analysis paper as well, though one problem with it is that the Methods section often sounds like a bit of a stretch: in a psych research paper, the Methods section describes what you did to ...

  22. An Invitation to Intrinsic Compositional Data Analysis Using ...

    We close by drawing some perspectives for further research, inviting new directions for CoDa analysis based on the intrinsic projective viewpoint. Keywords: compositional data, projective geometry, intrinsic statistical analysis, Hilbert's projective metric, Fréchet mean, nonparametric regression, Gaussian distribution
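
    One keyword above, Hilbert's projective metric, has a compact form worth spelling out: for strictly positive vectors x and y, d(x, y) = log(max_i(x_i/y_i) / min_i(x_i/y_i)). It is invariant to positive rescaling of either vector, which is why it suits compositional data (vectors meaningful only up to their total). A minimal numpy sketch, illustrative rather than the paper's code:

    ```python
    # Hilbert's projective metric between strictly positive vectors:
    #   d(x, y) = log( max_i(x_i / y_i) / min_i(x_i / y_i) )
    # Scaling x or y by a positive constant leaves d unchanged, so the
    # metric descends to compositions (points of the simplex).
    import numpy as np

    def hilbert_distance(x: np.ndarray, y: np.ndarray) -> float:
        ratios = x / y
        return float(np.log(ratios.max() / ratios.min()))

    x = np.array([0.2, 0.3, 0.5])
    print(hilbert_distance(x, np.array([0.1, 0.4, 0.5])))  # ~0.981
    print(hilbert_distance(x, 10 * x))  # 0.0: rescaling changes nothing
    ```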

  23. (PDF) Data Analytics and Techniques: A Review

    This paper presents several innovative methods that use data analytics techniques to improve the analysis process and data management. Furthermore, this paper discusses how the revolution...

  24. Journal of Medical Internet Research

    This paper is in the following e-collection/theme issues: Mobile Health (mHealth); Human Factors and Usability Case Studies; mHealth for Telemedicine and Homecare; Advanced Data Analytics in eHealth; Decision Support for Health Professionals; mHealth for Symptom and Disease Monitoring, Chronic Disease Management; Cardiac Disease Management.

  25. Genomic data in the All of Us Research Program

    A study describes the release of clinical-grade whole-genome sequence data for 245,388 diverse participants by the All of Us Research Program and characterizes the properties of the dataset.

  26. Listening to Children: A Childist Analysis of Children's ...

    Building on critical childhood studies and childism, this paper analyses children's participation in family law cases in Denmark. Spurred particularly by the UN Convention on the Rights of the Child, together with a general shift in the view on children, several jurisdictions, including Denmark, have implemented legislative reform in the last decades to accommodate children's participation ...

  27. Emerging Research Trends in Data Deduplication: A Bibliometric Analysis

    This paper assesses, via bibliometric analysis, the research impact of data deduplication over the period 2010 to 2023. The bibliometric analysis is based on a sample of 461 documents taken from the Scopus database. ... Research papers have been published every year, but according to the results, the most articles (70) were published in 2019 ...
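
    The yearly tally behind a bibliometric claim like "70 articles in 2019" is easy to reproduce from a database export. A minimal pandas sketch, assuming a Scopus CSV export with a "Year" column (the file name and column name are hypothetical):

    ```python
    # Count publications per year from a bibliographic export.
    # "scopus_export.csv" and its "Year" column are assumptions,
    # not given by the entry above.
    import pandas as pd

    records = pd.read_csv("scopus_export.csv")
    per_year = records["Year"].value_counts().sort_index()
    print(per_year)  # e.g. 2019 -> 70 if that year dominates the sample
    ```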

  28. Stanford Medicine study identifies distinct brain organization patterns

    A new study by Stanford Medicine investigators unveils an artificial intelligence model that was more than 90% successful at determining whether scans of brain activity came from a woman or a man. The findings, published Feb. 20 in the Proceedings of the National Academy of Sciences, help resolve a long-standing controversy about whether reliable sex differences exist in the human brain and ...