Chapter 5: Collecting data

Tianjing Li, Julian PT Higgins, Jonathan J Deeks

Key Points:

  • Systematic reviews have studies, rather than reports, as the unit of interest, and so multiple reports of the same study need to be identified and linked together before or after data extraction.
  • Because of the increasing availability of data sources (e.g. trials registers, regulatory documents, clinical study reports), review authors should decide on which sources may contain the most useful information for the review, and have a plan to resolve discrepancies if information is inconsistent across sources.
  • Review authors are encouraged to develop outlines of tables and figures that will appear in the review to facilitate the design of data collection forms. The key to successful data collection is to construct easy-to-use forms and collect sufficient and unambiguous data that faithfully represent the source in a structured and organized manner.
  • Effort should be made to identify data needed for meta-analyses, which often need to be calculated or converted from data reported in diverse formats.
  • Data should be collected and archived in a form that allows future access and data sharing.

Cite this chapter as: Li T, Higgins JPT, Deeks JJ (editors). Chapter 5: Collecting data. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors). Cochrane Handbook for Systematic Reviews of Interventions version 6.4 (updated August 2023). Cochrane, 2023. Available from www.training.cochrane.org/handbook.

5.1 Introduction

Systematic reviews aim to identify all studies that are relevant to their research questions and to synthesize data about the design, risk of bias, and results of those studies. Consequently, the findings of a systematic review depend critically on decisions relating to which data from these studies are presented and analysed. Data collected for systematic reviews should be accurate, complete, and accessible for future updates of the review and for data sharing. Methods used for these decisions must be transparent; they should be chosen to minimize biases and human error. Here we describe approaches that should be used in systematic reviews for collecting data, including extraction of data directly from journal articles and other reports of studies.

5.2 Sources of data

Studies are reported in a range of sources, which are detailed below. As discussed in Section 5.2.1, it is important to link together multiple reports of the same study. The relative strengths and weaknesses of each type of source are discussed in Section 5.2.2. For guidance on searching for and selecting reports of studies, refer to Chapter 4.

Journal articles are the source of the majority of data included in systematic reviews. Note that a study can be reported in multiple journal articles, each focusing on some aspect of the study (e.g. design, main results, and other results).

Conference abstracts are commonly available. However, the information presented in conference abstracts is highly variable in reliability, accuracy, and level of detail (Li et al 2017).

Errata and letters can be important sources of information about studies, including critical weaknesses and retractions, and review authors should examine these if they are identified (see MECIR Box 5.2.a).

Trials registers (e.g. ClinicalTrials.gov) catalogue trials that have been planned or started, and have become an important data source for identifying trials, for comparing published outcomes and results with those planned, and for obtaining efficacy and safety data that are not available elsewhere (Ross et al 2009, Jones et al 2015, Baudard et al 2017).

Clinical study reports (CSRs) contain unabridged and comprehensive descriptions of the clinical problem, design, conduct and results of clinical trials, following structure and content guidance prescribed by the International Conference on Harmonisation (ICH 1995). To obtain marketing approval of drugs and biologics for a specific indication, pharmaceutical companies submit CSRs and other required materials to regulatory authorities. Because CSRs also incorporate tables and figures, with appendices containing the protocol, statistical analysis plan, sample case report forms, and patient data listings (including narratives of all serious adverse events), they can be thousands of pages in length. CSRs often contain more data about trial methods and results than any other single data source (Mayo-Wilson et al 2018). CSRs are often difficult to access, and are usually not publicly available. Review authors could request CSRs from the European Medicines Agency (Davis and Miller 2017). The US Food and Drug Administration historically avoided releasing CSRs but launched a pilot programme in 2018 whereby selected portions of CSRs for new drug applications were posted on the agency’s website. Many CSRs are obtained through unsealed litigation documents, repositories (e.g. clinicalstudydatarequest.com), and other open data and data-sharing channels (e.g. The Yale University Open Data Access Project) (Doshi et al 2013, Wieland et al 2014, Mayo-Wilson et al 2018).

Regulatory reviews such as those available from the US Food and Drug Administration or European Medicines Agency provide useful information about trials of drugs, biologics, and medical devices submitted by manufacturers for marketing approval (Turner 2013). These documents are summaries of CSRs and related documents, prepared by agency staff as part of the process of approving the products for marketing, after reanalysing the original trial data. Regulatory reviews often are available only for the first approved use of an intervention and not for later applications (although review authors may request those documents, which are usually brief). Using regulatory reviews from the US Food and Drug Administration as an example, drug approval packages are available on the agency’s website for drugs approved since 1997 (Turner 2013); for drugs approved before 1997, information must be requested through a freedom of information request. The drug approval packages contain various documents: approval letter(s), medical review(s), chemistry review(s), clinical pharmacology review(s), and statistical review(s).

Individual participant data (IPD) are usually sought directly from the researchers responsible for the study, or may be identified from open data repositories (e.g. www.clinicalstudydatarequest.com). These data typically include variables that represent the characteristics of each participant, intervention (or exposure) group, prognostic factors, and measurements of outcomes (Stewart et al 2015). Access to IPD has the advantage of allowing review authors to reanalyse the data flexibly, in accordance with the preferred analysis methods outlined in the protocol, and can reduce the variation in analysis methods across studies included in the review. IPD reviews are addressed in detail in Chapter 26.

MECIR Box 5.2.a Relevant expectations for conduct of intervention reviews

5.2.1 Studies (not reports) as the unit of interest

In a systematic review, studies rather than reports of studies are the principal unit of interest. Since a study may have been reported in several sources, a comprehensive search for studies for the review may identify many reports from a potentially relevant study (Mayo-Wilson et al 2017a, Mayo-Wilson et al 2018). Conversely, a report may describe more than one study.

Multiple reports of the same study should be linked together (see MECIR Box 5.2.b). Some authors prefer to link reports before they collect data, and collect data from across the reports onto a single form. Other authors prefer to collect data from each report and then link together the collected data across reports. Either strategy may be appropriate, depending on the nature of the reports at hand. It may not be clear that two reports relate to the same study until data collection has commenced. Although sometimes there is a single report for each study, it should never be assumed that this is the case.

MECIR Box 5.2.b Relevant expectations for conduct of intervention reviews

It can be difficult to link multiple reports from the same study, and review authors may need to do some ‘detective work’. Multiple sources about the same trial may not reference each other, may not share common authors (Gøtzsche 1989, Tramèr et al 1997), or may report discrepant information about the study design, characteristics, outcomes, and results (von Elm et al 2004, Mayo-Wilson et al 2017a).

Some of the most useful criteria for linking reports are:

  • trial registration numbers;
  • authors’ names;
  • sponsor for the study and sponsor identifiers (e.g. grant or contract numbers);
  • location and setting (particularly if institutions, such as hospitals, are named);
  • specific details of the interventions (e.g. dose, frequency);
  • numbers of participants and baseline data; and
  • date and duration of the study (which also can clarify whether different sample sizes are due to different periods of recruitment), length of follow-up, or subgroups selected to address secondary goals.

Review authors should use as many trial characteristics as possible to link multiple reports. When uncertainties remain after considering these and other factors, it may be necessary to correspond with the study authors or sponsors for confirmation.
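As a rough illustration, when reports carry structured metadata these linking criteria can be applied programmatically before falling back to manual checking. The sketch below (in Python) groups reports by trial registration number, the most reliable single criterion above; the field names (`id`, `registry_id`) are this sketch's own assumptions, not a standard schema, and reports without a registration number are set aside for manual linking using the other criteria.

```python
from collections import defaultdict

def link_reports(reports):
    """Group reports of the same study by trial registration number.

    `reports` is a list of dicts with hypothetical keys 'id' and
    'registry_id' (e.g. an NCT number). Reports lacking a registration
    number cannot be linked automatically and are returned separately
    for manual 'detective work' using the other criteria (authors,
    sponsor, setting, sample sizes, dates).
    """
    linked = defaultdict(list)   # registry_id -> report ids
    unlinked = []                # reports needing manual review
    for report in reports:
        reg = report.get("registry_id")
        if reg:
            linked[reg].append(report["id"])
        else:
            unlinked.append(report["id"])
    return dict(linked), unlinked

reports = [
    {"id": "Smith 2010", "registry_id": "NCT00000001"},
    {"id": "Smith 2012", "registry_id": "NCT00000001"},
    {"id": "Jones 2011", "registry_id": None},
]
studies, to_review = link_reports(reports)
# studies   -> {"NCT00000001": ["Smith 2010", "Smith 2012"]}
# to_review -> ["Jones 2011"]
```

Automated grouping of this kind only narrows the problem: as noted above, apparent matches and the residual unlinked reports still need confirmation against the full set of trial characteristics, or by correspondence with authors or sponsors.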

5.2.2 Determining which sources might be most useful

A comprehensive search to identify all eligible studies from all possible sources is resource-intensive but necessary for a high-quality systematic review (see Chapter 4 ). Because some data sources are more useful than others (Mayo-Wilson et al 2018), review authors should consider which data sources may be available and which may contain the most useful information for the review. These considerations should be described in the protocol. Table 5.2.a summarizes the strengths and limitations of different data sources (Mayo-Wilson et al 2018). Gaining access to CSRs and IPD often takes a long time. Review authors should begin searching repositories and contact trial investigators and sponsors as early as possible to negotiate data usage agreements (Mayo-Wilson et al 2015, Mayo-Wilson et al 2018).

Table 5.2.a Strengths and limitations of different data sources for systematic reviews

5.2.3 Correspondence with investigators

Review authors often find that they are unable to obtain all the information they seek from available reports about the details of the study design, the full range of outcomes measured and the numerical results. In such circumstances, authors are strongly encouraged to contact the original investigators (see MECIR Box 5.2.c). Contact details of study authors, when not available from the study reports, often can be obtained from more recent publications, from university or institutional staff listings, from membership directories of professional societies, or by a general search of the web. If the contact author named in the study report cannot be contacted or does not respond, it is worthwhile attempting to contact other authors.

Review authors should consider the nature of the information they require and make their request accordingly. For descriptive information about the conduct of the trial, it may be most appropriate to ask open-ended questions (e.g. how was the allocation process conducted, or how were missing data handled?). If specific numerical data are required, it may be more helpful to request them specifically, possibly providing a short data collection form (either uncompleted or partially completed). If IPD are required, they should be specifically requested (see also Chapter 26). In some cases, study investigators may find it more convenient to provide IPD rather than conduct additional analyses to obtain the specific statistics requested.

MECIR Box 5.2.c Relevant expectations for conduct of intervention reviews

5.3 What data to collect

5.3.1 What are data?

For the purposes of this chapter, we define ‘data’ to be any information about (or derived from) a study, including details of methods, participants, setting, context, interventions, outcomes, results, publications, and investigators. Review authors should plan in advance what data will be required for their systematic review, and develop a strategy for obtaining them (see MECIR Box 5.3.a). The involvement of consumers and other stakeholders can be helpful in ensuring that the categories of data collected are sufficiently aligned with the needs of review users (Chapter 1, Section 1.3). The data to be sought should be described in the protocol, with consideration wherever possible of the issues raised in the rest of this chapter.

The data collected for a review should adequately describe the included studies, support the construction of tables and figures, facilitate the risk of bias assessment, and enable syntheses and meta-analyses. Review authors should familiarize themselves with reporting guidelines for systematic reviews (see online Chapter III and the PRISMA statement (Liberati et al 2009)) to ensure that relevant elements and sections are incorporated. The following sections review the types of information that should be sought, and these are summarized in Table 5.3.a (Li et al 2015).

MECIR Box 5.3.a Relevant expectations for conduct of intervention reviews

Table 5.3.a Checklist of items to consider in data collection

*Full description required for assessments of risk of bias (see Chapter 8, Chapter 23 and Chapter 25).

5.3.2 Study methods and potential sources of bias

Different research methods can influence study outcomes by introducing different biases into results. Important study design characteristics should be collected to allow the selection of appropriate methods for assessment and analysis, and to enable description of the design of each included study in a table of ‘Characteristics of included studies’, including whether the study is randomized, whether the study has a cluster or crossover design, and the duration of the study. If the review includes non-randomized studies, appropriate features of the studies should be described (see Chapter 24 ).

Detailed information should be collected to facilitate assessment of the risk of bias in each included study. Risk-of-bias assessment should be conducted using the tool most appropriate for the design of each study, and the information required to complete the assessment will depend on the tool. Randomized studies should be assessed using the tool described in Chapter 8. The tool covers bias arising from the randomization process, due to deviations from intended interventions, due to missing outcome data, in measurement of the outcome, and in selection of the reported result. For each item in the tool, a description of what happened in the study is required, which may include verbatim quotes from study reports. Information for assessment of bias due to missing outcome data and selection of the reported result may be most conveniently collected alongside information on outcomes and results. Chapter 7 (Section 7.3.1) discusses some issues in the collection of information for assessments of risk of bias. For non-randomized studies, the most appropriate tool is described in Chapter 25. A separate tool also covers bias due to missing results in meta-analysis (see Chapter 13).

A particularly important piece of information is the funding source of the study and potential conflicts of interest of the study authors.

Some review authors will wish to collect additional information on study characteristics that bear on the quality of the study’s conduct but that may not lead directly to risk of bias, such as whether ethical approval was obtained and whether a sample size calculation was performed a priori.

5.3.3 Participants and setting

Details of participants are collected to enable an understanding of the comparability of, and differences between, the participants within and between included studies, and to allow assessment of how directly or completely the participants in the included studies reflect the original review question.

Typically, aspects that should be collected are those that could (or are believed to) affect presence or magnitude of an intervention effect and those that could help review users assess applicability to populations beyond the review. For example, if the review authors suspect important differences in intervention effect between different socio-economic groups, this information should be collected. If intervention effects are thought to be constant over such groups, and if such information would not be useful to help apply results, it should not be collected. Participant characteristics that are often useful for assessing applicability include age and sex. Summary information about these should always be collected unless they are obvious from the context. These characteristics are likely to be presented in different formats (e.g. ages as means or medians, with standard deviations or ranges; sex as percentages or counts for the whole study or for each intervention group separately). Review authors should seek consistent quantities where possible, and decide whether it is more relevant to summarize characteristics for the study as a whole or by intervention group. It may not be possible to select the most consistent statistics until data collection is complete across all or most included studies. Other characteristics that are sometimes important include ethnicity, socio-demographic details (e.g. education level) and the presence of comorbid conditions. Clinical characteristics relevant to the review question (e.g. glucose level for reviews on diabetes) also are important for understanding the severity or stage of the disease.
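Seeking consistent quantities across studies that report in different formats can be handled with small normalization helpers at extraction time. The following is a minimal sketch, assuming a characteristic (here, sex) reported either as counts or as a percentage; the function name and parameters are this sketch's own, not a standard tool.

```python
def sex_as_percent(n_female=None, n_total=None, percent_female=None):
    """Normalize sex, reported either as counts or as a percentage,
    to a single consistent quantity (percentage female), so that
    studies reporting in different formats can be tabulated together.
    Returns None when the study reports neither form, in which case
    correspondence with the authors may be needed.
    """
    if percent_female is not None:
        return percent_female
    if n_female is not None and n_total:
        return 100.0 * n_female / n_total
    return None  # not reported in either format

pct_a = sex_as_percent(n_female=45, n_total=90)   # counts -> 50.0
pct_b = sex_as_percent(percent_female=62.0)       # already a percentage
pct_c = sex_as_percent()                          # not reported -> None
```

The same pattern applies to other characteristics reported inconsistently (e.g. means versus medians of age), although some conversions require statistical assumptions and should be documented in the review's methods.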

Diagnostic criteria that were used to define the condition of interest can be a particularly important source of diversity across studies and should be collected. For example, in a review of drug therapy for congestive heart failure, it is important to know how the definition and severity of heart failure was determined in each study (e.g. systolic or diastolic dysfunction, severe systolic dysfunction with ejection fractions below 20%). Similarly, in a review of antihypertensive therapy, it is important to describe baseline levels of blood pressure of participants.

If the settings of studies may influence intervention effects or applicability, then information on these should be collected. Typical settings of healthcare intervention studies include acute care hospitals, emergency facilities, general practice, and extended care facilities such as nursing homes, as well as offices, schools, and communities. Sometimes studies are conducted in different geographical regions with important differences that could affect delivery of an intervention and its outcomes, such as cultural characteristics, economic context, or rural versus city settings. Timing of the study may be associated with important technology differences or trends over time. If such information is important for the interpretation of the review, it should be collected.

Important characteristics of the participants in each included study should be summarized for the reader in the table of ‘Characteristics of included studies’.

5.3.4 Interventions

Details of all experimental and comparator interventions of relevance to the review should be collected. Again, details are required for aspects that could affect the presence or magnitude of an effect or that could help review users assess applicability to their own circumstances. Where feasible, information should be sought (and presented in the review) that is sufficient for replication of the interventions under study. This includes any co-interventions administered as part of the study, and applies similarly to comparators such as ‘usual care’. Review authors may need to request missing information from study authors.

The Template for Intervention Description and Replication (TIDieR) provides a comprehensive framework for full description of interventions and has been proposed for use in systematic reviews as well as reports of primary studies (Hoffmann et al 2014). The checklist includes descriptions of:

  • the rationale for the intervention and how it is expected to work;
  • any documentation that instructs the recipient on the intervention;
  • what the providers do to deliver the intervention (procedures and processes);
  • who provides the intervention (including their skill level), how (e.g. face to face, web-based) and in what setting (e.g. home, school, or hospital);
  • the timing and intensity;
  • whether any variation is permitted or expected, and whether modifications were actually made; and
  • any strategies used to ensure or assess fidelity or adherence to the intervention, and the extent to which the intervention was delivered as planned.

For clinical trials of pharmacological interventions, key information to collect will often include routes of delivery (e.g. oral or intravenous delivery), doses (e.g. amount or intensity of each treatment, frequency of delivery), timing (e.g. within 24 hours of diagnosis), and length of treatment. For other interventions, such as those that evaluate psychotherapy, behavioural and educational approaches, or healthcare delivery strategies, the amount of information required to characterize the intervention will typically be greater, including information about multiple elements of the intervention, who delivered it, and the format and timing of delivery. Chapter 17 provides further information on how to manage intervention complexity, and how the intervention Complexity Assessment Tool (iCAT) can facilitate data collection (Lewin et al 2017).

Important characteristics of the interventions in each included study should be summarized for the reader in the table of ‘Characteristics of included studies’. Additional tables or diagrams such as logic models (Chapter 2, Section 2.5.1) can assist descriptions of multi-component interventions so that review users can better assess review applicability to their context.

5.3.4.1 Integrity of interventions

The degree to which specified procedures or components of the intervention are implemented as planned can have important consequences for the findings from a study. We describe this as intervention integrity; related terms include adherence, compliance and fidelity (Carroll et al 2007). The verification of intervention integrity may be particularly important in reviews of non-pharmacological trials such as behavioural interventions and complex interventions, which are often implemented in conditions that present numerous obstacles to idealized delivery.

It is generally expected that reports of randomized trials provide detailed accounts of intervention implementation (Zwarenstein et al 2008, Moher et al 2010). In assessing whether interventions were implemented as planned, review authors should bear in mind that some interventions are standardized (with no deviations permitted in the intervention protocol), whereas others explicitly allow a degree of tailoring (Zwarenstein et al 2008). In addition, the growing field of implementation science has led to an increased awareness of the impact of setting and context on delivery of interventions (Damschroder et al 2009). (See Chapter 17, Section 17.1.2.1 for further information and discussion about how an intervention may be tailored to local conditions in order to preserve its integrity.)

Information about integrity can help determine whether unpromising results are due to a poorly conceptualized intervention or to an incomplete delivery of the prescribed components. It can also reveal important information about the feasibility of implementing a given intervention in real life settings. If it is difficult to achieve full implementation in practice, the intervention will have low feasibility (Dusenbury et al 2003).

Whether a lack of intervention integrity leads to a risk of bias in the estimate of its effect depends on whether review authors and users are interested in the effect of assignment to intervention or the effect of adhering to intervention, as discussed in more detail in Chapter 8, Section 8.2.2. Assessment of deviations from intended interventions is important for assessing risk of bias in the latter but not the former (see Chapter 8, Section 8.4), although both may be of interest to decision makers in different ways.

An example of a Cochrane Review evaluating intervention integrity is provided by a review of smoking cessation in pregnancy (Chamberlain et al 2017). The authors found that process evaluation of the intervention occurred in only some trials and that the implementation was less than ideal in others, including some of the largest trials. The review highlighted how the transfer of an intervention from one setting to another may reduce its effectiveness when elements are changed, or aspects of the materials are culturally inappropriate.

5.3.4.2 Process evaluations

Process evaluations seek to evaluate the process (and mechanisms) between the intervention’s intended implementation and the actual effect on the outcome (Moore et al 2015). Process evaluation studies are characterized by a flexible approach to data collection and the use of numerous methods to generate a range of different types of data, encompassing both quantitative and qualitative methods. Guidance for including process evaluations in systematic reviews is provided in Chapter 21 . When it is considered important, review authors should aim to collect information on whether the trial accounted for, or measured, key process factors and whether the trials that thoroughly addressed integrity showed a greater impact. Process evaluations can be a useful source of factors that potentially influence the effectiveness of an intervention.

5.3.5 Outcomes

An outcome is an event or a measurement value observed or recorded for a particular person or intervention unit in a study during or following an intervention, and that is used to assess the efficacy and safety of the studied intervention (Meinert 2012). Review authors should indicate in advance whether they plan to collect information about all outcomes measured in a study or only those outcomes of (pre-specified) interest in the review. Research has shown that trials addressing the same condition and intervention seldom agree on which outcomes are the most important, and consequently report on numerous different outcomes (Dwan et al 2014, Ismail et al 2014, Denniston et al 2015, Saldanha et al 2017a). The selection of outcomes across systematic reviews of the same condition is also inconsistent (Page et al 2014, Saldanha et al 2014, Saldanha et al 2016, Liu et al 2017). Outcomes used in trials and in systematic reviews of the same condition have limited overlap (Saldanha et al 2017a, Saldanha et al 2017b).

We recommend that only the outcomes defined in the protocol be described in detail. However, a complete list of the names of all outcomes measured may allow a more detailed assessment of the risk of bias due to missing outcome data (see Chapter 13).

Review authors should collect all five elements of an outcome (Zarin et al 2011, Saldanha et al 2014):

1. outcome domain or title (e.g. anxiety);

2. measurement tool or instrument (including definition of clinical outcomes or endpoints); for a scale, name of the scale (e.g. the Hamilton Anxiety Rating Scale), upper and lower limits, and whether a high or low score is favourable, definitions of any thresholds if appropriate;

3. specific metric used to characterize each participant’s results (e.g. post-intervention anxiety, or change in anxiety from baseline to a post-intervention time point, or post-intervention presence of anxiety (yes/no));

4. method of aggregation (e.g. mean and standard deviation of anxiety scores in each group, or proportion of people with anxiety);

5. timing of outcome measurements (e.g. assessments at end of eight-week intervention period, events occurring during eight-week intervention period).
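The five elements above amount to a small structured record that can be built directly into a data collection form or extraction database. A minimal sketch in Python follows; the field names are this sketch's own shorthand for the five elements, not a standard schema.

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    """One fully specified outcome, one field per element."""
    domain: str       # 1. outcome domain or title
    instrument: str   # 2. measurement tool or instrument (incl. scale
                      #    limits and direction, thresholds if relevant)
    metric: str       # 3. metric characterizing each participant's result
    aggregation: str  # 4. method of aggregation across participants
    timing: str       # 5. timing of outcome measurements

# Example corresponding to the anxiety illustration in the text:
anxiety = Outcome(
    domain="anxiety",
    instrument="Hamilton Anxiety Rating Scale",
    metric="change in anxiety from baseline to post-intervention",
    aggregation="mean and standard deviation per group",
    timing="end of eight-week intervention period",
)
```

Recording all five elements for every extracted outcome makes it straightforward to detect later that two studies measured the same domain with different instruments, metrics, or time points, which matters when deciding what can be combined in meta-analysis.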

Further considerations for economics outcomes are discussed in Chapter 20 , and for patient-reported outcomes in Chapter 18 .

5.3.5.1 Adverse effects

Collection of information about the harmful effects of an intervention can pose particular difficulties, discussed in detail in Chapter 19. These outcomes may be described using multiple terms, including ‘adverse event’, ‘adverse effect’, ‘adverse drug reaction’, ‘side effect’ and ‘complication’. Many of these terminologies are used interchangeably in the literature, although some are technically different. Harms might additionally be interpreted to include undesirable changes in other outcomes measured during a study, such as a decrease in quality of life where an improvement may have been anticipated.

In clinical trials, adverse events can be collected either systematically or non-systematically. Systematic collection refers to collecting adverse events in the same manner for each participant using defined methods such as a questionnaire or a laboratory test. For systematically collected outcomes representing harm, data can be collected by review authors in the same way as efficacy outcomes (see Section 5.3.5).

Non-systematic collection refers to collection of information on adverse events using methods such as open-ended questions (e.g. ‘Have you noticed any symptoms since your last visit?’), or reported by participants spontaneously. In either case, adverse events may be selectively reported based on their severity, and whether the participant suspected that the effect may have been caused by the intervention, which could lead to bias in the available data. Unfortunately, most adverse events are collected non-systematically rather than systematically, creating a challenge for review authors. The following pieces of information are useful and worth collecting (Nicole Fusco, personal communication):

  • any coding system or standard medical terminology used (e.g. COSTART, MedDRA), including version number;
  • name of the adverse events (e.g. dizziness);
  • reported intensity of the adverse event (e.g. mild, moderate, severe);
  • whether the trial investigators categorized the adverse event as ‘serious’;
  • whether the trial investigators identified the adverse event as being related to the intervention;
  • time point (most commonly measured as a count over the duration of the study);
  • any reported methods for how adverse events were selected for inclusion in the publication (e.g. ‘We reported all adverse events that occurred in at least 5% of participants’); and
  • associated results.

Different collection methods lead to very different accounting of adverse events (Safer 2002, Bent et al 2006, Ioannidis et al 2006, Carvajal et al 2011, Allen et al 2013). Non-systematic collection methods tend to underestimate how frequently an adverse event occurs. It is particularly problematic when the adverse event of interest to the review is collected systematically in some studies but non-systematically in other studies. Different collection methods introduce an important source of heterogeneity. In addition, when non-systematic adverse events are reported based on quantitative selection criteria (e.g. only adverse events that occurred in at least 5% of participants were included in the publication), use of reported data alone may bias the results of meta-analyses. Review authors should be cautious of (or refrain from) synthesizing adverse events that are collected differently.

Regardless of the collection methods, precise definitions of adverse effect outcomes and their intensity should be recorded, since they may vary between studies. For example, in a review of aspirin and gastrointestinal haemorrhage, some trials simply reported gastrointestinal bleeds, while others reported specific categories of bleeding, such as haematemesis, melaena, and proctorrhagia (Derry and Loke 2000). The definition and reporting of severity of the haemorrhages (e.g. major, severe, requiring hospital admission) also varied considerably among the trials (Zanchetti and Hansson 1999). Moreover, a particular adverse effect may be described or measured in different ways among the studies. For example, the terms ‘tiredness’, ‘fatigue’ or ‘lethargy’ may all be used in reporting of adverse effects. Study authors also may use different thresholds for ‘abnormal’ results (e.g. hypokalaemia diagnosed at a serum potassium concentration of 3.0 mmol/L or 3.5 mmol/L).

The absence of any mention of adverse events in a trial report does not necessarily mean that none occurred; it is usually safest to assume that they were simply not reported. Quality of life measures are sometimes used as a measure of the participants’ experience during the study, but these are usually general measures that do not look specifically at particular adverse effects of the intervention. While quality of life measures are important and can be used to gauge overall participant well-being, they should not be regarded as substitutes for a detailed evaluation of safety and tolerability.

5.3.6 Results

Results data arise from the measurement or ascertainment of outcomes for individual participants in an intervention study. Results data may be available for each individual in a study (i.e. individual participant data; see Chapter 26 ), or summarized at arm level, or summarized at study level into an intervention effect by comparing two intervention arms. Results data should be collected only for the intervention groups and outcomes specified to be of interest in the protocol (see MECIR Box 5.3.b ). Results for other outcomes should not be collected unless the protocol is modified to add them. Any modification should be reported in the review. However, review authors should be alert to the possibility of important, unexpected findings, particularly serious adverse effects.

MECIR Box 5.3.b Relevant expectations for conduct of intervention reviews

Reports of studies often include several results for the same outcome. For example, different measurement scales might be used, results may be presented separately for different subgroups, and outcomes may have been measured at different follow-up time points. Variation in the results can be very large, depending on which data are selected (Gøtzsche et al 2007, Mayo-Wilson et al 2017a). Review protocols should be as specific as possible about which outcome domains, measurement tools, time points, and summary statistics (e.g. final values versus change from baseline) are to be collected (Mayo-Wilson et al 2017b). A framework should be pre-specified in the protocol to facilitate making choices between multiple eligible measures or results. For example, a hierarchy of preferred measures might be created, or plans articulated to select the result with the median effect size, or to average across all eligible results for a particular outcome domain (see also Chapter 9, Section 9.3.3 ). Any additional decisions or changes to this framework made once the data are collected should be reported in the review as changes to the protocol.
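Such a pre-specified selection framework can be sketched in code. The following is a minimal illustration only, not a Cochrane tool: the measurement scales, field names, and hierarchy are hypothetical, and a real protocol would specify its own hierarchy and fallback rule (here, falling back to the result with the median effect size).

```python
# Illustrative sketch of a pre-specified framework for choosing among
# multiple eligible results for one outcome domain. Scale names, field
# names, and the hierarchy are hypothetical examples.

def select_result(results, hierarchy):
    """Pick one result per outcome domain.

    results: list of dicts with keys 'measure' and 'effect_size'.
    hierarchy: measurement tools in order of preference.
    Falls back to the result with the median effect size when no
    preferred measure is reported.
    """
    for preferred in hierarchy:
        matches = [r for r in results if r["measure"] == preferred]
        if matches:
            return matches[0]
    # Fallback rule: take the result whose effect size is the median
    ordered = sorted(results, key=lambda r: r["effect_size"])
    return ordered[len(ordered) // 2]

reported = [
    {"measure": "HADS-D", "effect_size": -0.42},
    {"measure": "BDI-II", "effect_size": -0.30},
    {"measure": "PHQ-9", "effect_size": -0.55},
]
chosen = select_result(reported, hierarchy=["BDI-II", "PHQ-9", "HADS-D"])
print(chosen["measure"])  # BDI-II: first preferred measure that was reported
```

The point of writing the rule down this explicitly (whether in code or in prose) is that the choice is then reproducible and cannot drift once the data are seen.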

Section 5.6 describes the numbers that will be required to perform meta-analysis, if appropriate. The unit of analysis (e.g. participant, cluster, body part, treatment period) should be recorded for each result when it is not obvious (see Chapter 6, Section 6.2 ). The type of outcome data determines the nature of the numbers that will be sought for each outcome. For example, for a dichotomous (‘yes’ or ‘no’) outcome, the number of participants and the number who experienced the outcome will be sought for each group. It is important to collect the sample size relevant to each result, although this is not always obvious. A flow diagram as recommended in the CONSORT Statement (Moher et al 2001) can help to determine the flow of participants through a study. If one is not available in a published report, review authors can consider drawing one using the template available from www.consort-statement.org .
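For illustration, the four numbers collected for a dichotomous outcome (events and totals in each group) are sufficient to compute an effect estimate for meta-analysis. A minimal sketch using the standard large-sample formula for a risk ratio and its 95% confidence interval (the function name and numbers are illustrative, not from any particular study):

```python
import math

def risk_ratio(events_1, total_1, events_2, total_2):
    """Risk ratio with a 95% CI computed on the log scale."""
    rr = (events_1 / total_1) / (events_2 / total_2)
    # Standard error of log(RR) from the usual large-sample formula
    se = math.sqrt(1 / events_1 - 1 / total_1 + 1 / events_2 - 1 / total_2)
    lower = math.exp(math.log(rr) - 1.96 * se)
    upper = math.exp(math.log(rr) + 1.96 * se)
    return rr, (lower, upper)

# 12/100 events in the intervention group vs. 24/100 in the control group
rr, ci = risk_ratio(12, 100, 24, 100)
print(round(rr, 2))  # 0.5
```

In practice such calculations are handled by meta-analysis software; the sketch only shows why these particular four numbers are the ones worth collecting.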

The numbers required for meta-analysis are not always available. Often, other statistics can be collected and converted into the required format. For example, for a continuous outcome, it is usually most convenient to seek the number of participants, the mean and the standard deviation for each intervention group. These are often not available directly, especially the standard deviation. Alternative statistics enable calculation or estimation of the missing standard deviation (such as a standard error, a confidence interval, a test statistic (e.g. from a t-test or F-test) or a P value). These should be extracted if they provide potentially useful information (see MECIR Box 5.3.c ). Details of recalculation are provided in Section 5.6 . Further considerations for dealing with missing data are discussed in Chapter 10, Section 10.12 .
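For instance, the missing standard deviation of a group mean can often be recovered from a reported standard error or 95% confidence interval. A minimal sketch of these standard conversions (valid for large samples; with small samples a t-distribution multiplier should replace 1.96):

```python
import math

def sd_from_se(se, n):
    # For a single group mean: SD = SE * sqrt(n)
    return se * math.sqrt(n)

def sd_from_ci(lower, upper, n, z=1.96):
    # For a 95% CI around a single group mean:
    # SD = sqrt(n) * (upper - lower) / (2 * z)
    return math.sqrt(n) * (upper - lower) / (2 * z)

print(sd_from_se(0.5, 100))                     # 5.0
print(round(sd_from_ci(9.02, 10.98, 100), 1))   # CI width 1.96 -> SD 5.0
```

Section 5.6 and Chapter 6 give the full set of conversions; the key message for data collection is to extract whichever of these statistics is reported, since the standard deviation can usually be derived later.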

MECIR Box 5.3.c Relevant expectations for conduct of intervention reviews

5.3.7 Other information to collect

We recommend that review authors collect the key conclusions of the included study as reported by its authors. It is not necessary to report these conclusions in the review, but they should be used to verify the results of analyses undertaken by the review authors, particularly in relation to the direction of effect. Further comments by the study authors, for example any explanations they provide for unexpected findings, may be noted. References to other studies that are cited in the study report may be useful, although review authors should be aware of the possibility of citation bias (see Chapter 7, Section 7.2.3.2 ). Documentation of any correspondence with the study authors is important for review transparency.

5.4 Data collection tools

5.4.1 Rationale for data collection forms

Data collection for systematic reviews should be performed using structured data collection forms (see MECIR Box 5.4.a ). These can be paper forms, electronic forms (e.g. Google Forms), or commercially or custom-built data systems (e.g. Covidence, EPPI-Reviewer, Systematic Review Data Repository (SRDR)) that allow online form building, data entry by several users, data sharing, and efficient data management (Li et al 2015). Whichever medium is used, data collection forms are required.

MECIR Box 5.4.a Relevant expectations for conduct of intervention reviews

The data collection form is a bridge between what is reported by the original investigators (e.g. in journal articles, abstracts, personal correspondence) and what is ultimately reported by the review authors. The data collection form serves several important functions (Meade and Richardson 1997). First, the form is linked directly to the review question and criteria for assessing eligibility of studies, and provides a clear summary of these that can be used to identify and structure the data to be extracted from study reports. Second, the data collection form is the historical record of the provenance of the data used in the review, as well as the multitude of decisions (and changes to decisions) that occur throughout the review process. Third, the form is the source of data for inclusion in an analysis.

Given the important functions of data collection forms, ample time and thought should be invested in their design. Because each review is different, data collection forms will vary across reviews. However, there are many similarities in the types of information that are important. Thus, forms can be adapted from one review to the next. Although we use the term ‘data collection form’ in the singular, in practice it may be a series of forms used for different purposes: for example, a separate form could be used to assess the eligibility of studies for inclusion in the review to assist in the quick identification of studies to be excluded from or included in the review.

5.4.2 Considerations in selecting data collection tools

The choice of data collection tool is largely dependent on review authors’ preferences, the size of the review, and resources available to the author team. Potential advantages and considerations of selecting one data collection tool over another are outlined in Table 5.4.a (Li et al 2015). A significant advantage that data systems have is in data management ( Chapter 1, Section 1.6 ) and re-use. They make review updates more efficient, and also facilitate methodological research across reviews. Numerous ‘meta-epidemiological’ studies have been carried out using Cochrane Review data, resulting in methodological advances which would not have been possible if thousands of studies had not all been described using the same data structures in the same system.

Some data collection tools, such as Covidence, facilitate automatic import of extracted data into RevMan (Cochrane’s authoring tool), as do CSV (Excel) files. Details are available at https://documentation.cochrane.org/revman-kb/populate-study-data-260702462.html

Table 5.4.a Considerations in selecting data collection tools

5.4.3 Design of a data collection form

Regardless of whether data are collected using a paper or electronic form, or a data system, the key to successful data collection is to construct easy-to-use forms and collect sufficient and unambiguous data that faithfully represent the source in a structured and organized manner (Li et al 2015). In most cases, a document format should be developed for the form before building an electronic form or a data system. This can be distributed to others, including programmers and data analysts, and as a guide for creating an electronic form and any guidance or codebook to be used by data extractors. Review authors also should consider compatibility of any electronic form or data system with analytical software, as well as mechanisms for recording, assessing and correcting data entry errors.

Data described in multiple reports (or even within a single report) of a study may not be consistent. Review authors will need to describe how they work with multiple reports in the protocol, for example, by pre-specifying which report will be used when sources contain conflicting data that cannot be resolved by contacting the investigators. Likewise, when there is only one report identified for a study, review authors should specify the section within the report (e.g. abstract, methods, results, tables, and figures) for use in case of inconsistent information.

If review authors wish to import their extracted data into RevMan automatically, it is advised that their data collection forms match the data extraction templates available via the RevMan Knowledge Base. Details are available at https://documentation.cochrane.org/revman-kb/data-extraction-templates-260702375.html.

A good data collection form should minimize the need to go back to the source documents. When designing a data collection form, review authors should involve all members of the team, that is, content area experts, authors with experience in systematic review methods and data collection form design, statisticians, and persons who will perform data extraction. Here are suggested steps and some tips for designing a data collection form, based on the informal collation of experiences from numerous review authors (Li et al 2015).

Step 1. Develop outlines of tables and figures expected to appear in the systematic review, considering the comparisons to be made between different interventions within the review, and the various outcomes to be measured. This step will help review authors decide the right amount of data to collect (not too much or too little). Collecting too much information can lead to forms that are longer than original study reports, and can be very wasteful of time. Collection of too little information, or omission of key data, can lead to the need to return to study reports later in the review process.

Step 2. Assemble and group data elements to facilitate form development. Review authors should consult Table 5.3.a , in which the data elements are grouped to facilitate form development and data collection. Note that it may be more efficient to group data elements in the order in which they are usually found in study reports (e.g. starting with reference information, followed by eligibility criteria, intervention description, statistical methods, baseline characteristics and results).

Step 3. Identify the optimal way of framing the data items. Much has been written about how to frame data items for developing robust data collection forms in primary research studies. We summarize a few key points and highlight issues that are pertinent to systematic reviews.

  • Ask closed-ended questions (i.e. questions that define a list of permissible responses) as much as possible. Closed-ended questions do not require post hoc coding and provide better control over data quality than open-ended questions. When setting up a closed-ended question, one must anticipate and structure possible responses and include an ‘other, specify’ category because the anticipated list may not be exhaustive. Avoid asking data extractors to summarize data into uncoded text, no matter how short it is.
  • Avoid asking a question in a way that the response may be left blank. Include ‘not applicable’, ‘not reported’ and ‘cannot tell’ options as needed. The ‘cannot tell’ option tags uncertain items that may prompt review authors to contact study authors for clarification, especially on data items critical to reaching conclusions.
  • Remember that the form will focus on what is reported in the article rather than what was done in the study. The study report may not fully reflect how the study was actually conducted. For example, a question ‘Did the article report that the participants were masked to the intervention?’ is more appropriate than ‘Were participants masked to the intervention?’
  • Where a judgement is required, record the raw data (i.e. quote directly from the source document) used to make the judgement. It is also important to record the source of information collected, including where it was found in a report or whether information was obtained from unpublished sources or personal communications. As much as possible, questions should be asked in a way that minimizes subjective interpretation and judgement to facilitate data comparison and adjudication.
  • Incorporate flexibility to allow for variation in how data are reported. It is strongly recommended that outcome data be collected in the format in which they were reported and transformed in a subsequent step if required. Review authors also should consider the software they will use for analysis and for publishing the review (e.g. RevMan).

Step 4. Develop and pilot-test data collection forms, ensuring that they provide data in the right format and structure for subsequent analysis. In addition to data items described in Step 2, data collection forms should record the title of the review as well as the person who is completing the form and the date of completion. Forms occasionally need revision; forms should therefore include the version number and version date to reduce the chances of using an outdated form by mistake. Because a study may be associated with multiple reports, it is important to record the study ID as well as the report ID. Definitions and instructions helpful for answering a question should appear next to the question to improve quality and consistency across data extractors (Stock 1994). Provide space for notes, regardless of whether paper or electronic forms are used.

All data collection forms and data systems should be thoroughly pilot-tested before launch (see MECIR Box 5.4.a ). Testing should involve several people extracting data from at least a few articles. The initial testing focuses on the clarity and completeness of questions. Users of the form may provide feedback that certain coding instructions are confusing or incomplete (e.g. a list of options may not cover all situations). The testing may identify data that are missing from the form, or likely to be superfluous. After initial testing, accuracy of the extracted data should be checked against the source document or verified data to identify problematic areas. It is wise to draft entries for the table of ‘Characteristics of included studies’ and complete a risk of bias assessment ( Chapter 8 ) using these pilot reports to ensure all necessary information is collected. A consensus between review authors may be required before the form is modified to avoid any misunderstandings or later disagreements. It may be necessary to repeat the pilot testing on a new set of reports if major changes are needed after the first pilot test.

Problems with the data collection form may surface after pilot testing has been completed, and the form may need to be revised after data extraction has started. When changes are made to the form or coding instructions, it may be necessary to return to reports that have already undergone data extraction. In some situations, it may be necessary to clarify only coding instructions without modifying the actual data collection form.

5.5 Extracting data from reports

5.5.1 Introduction

In most systematic reviews, the primary source of information about each study is published reports of studies, usually in the form of journal articles. Despite recent developments in machine learning models to automate data extraction in systematic reviews (see Section 5.5.9 ), data extraction is still largely a manual process. Electronic searches for text can provide a useful aid to locating information within a report. Examples include using search facilities in PDF viewers, internet browsers and word processing software. However, text searching should not be considered a replacement for reading the report, since information may be presented using variable terminology and presented in multiple formats.

5.5.2 Who should extract data?

Data extractors should have at least a basic understanding of the topic, and have knowledge of study design, data analysis and statistics. They should pay attention to detail while following instructions on the forms. Because errors that occur at the data extraction stage are rarely detected by peer reviewers, editors, or users of systematic reviews, it is recommended that more than one person extract data from every report to minimize errors and reduce introduction of potential biases by review authors (see MECIR Box 5.5.a ). As a minimum, information that involves subjective interpretation and information that is critical to the interpretation of results (e.g. outcome data) should be extracted independently by at least two people (see MECIR Box 5.5.a ). In common with implementation of the selection process ( Chapter 4, Section 4.6 ), it is preferable that data extractors are from complementary disciplines, for example a methodologist and a topic area specialist. It is important that everyone involved in data extraction has practice using the form and, if the form was designed by someone else, receives appropriate training.

Evidence in support of duplicate data extraction comes from several indirect sources. One study observed that independent data extraction by two authors resulted in fewer errors than data extraction by a single author followed by verification by a second (Buscemi et al 2006). A high prevalence of data extraction errors (errors in 20 out of 34 reviews) has been observed (Jones et al 2005). A further study of data extraction to compute standardized mean differences found that a minimum of seven out of 27 reviews had substantial errors (Gøtzsche et al 2007).

MECIR Box 5.5.a Relevant expectations for conduct of intervention reviews

5.5.3 Training data extractors

Training of data extractors is intended to familiarize them with the review topic and methods, the data collection form or data system, and issues that may arise during data extraction. Results of the pilot testing of the form should prompt discussion among review authors and extractors of ambiguous questions or responses to establish consistency. Training should take place at the onset of the data extraction process and periodically over the course of the project (Li et al 2015). For example, when data related to a single item on the form are present in multiple locations within a report (e.g. abstract, main body of text, tables, and figures) or in several sources (e.g. publications, ClinicalTrials.gov, or CSRs), the development and documentation of instructions to follow an agreed algorithm are critical and should be reinforced during the training sessions.

Some have proposed that some information in a report, such as its authors, be blinded to the review author prior to data extraction and assessment of risk of bias (Jadad et al 1996). However, blinding of review authors to aspects of study reports generally is not recommended for Cochrane Reviews as there is little evidence that it alters the decisions made (Berlin 1997).

5.5.4 Extracting data from multiple reports of the same study

Studies frequently are reported in more than one publication or in more than one source (Tramèr et al 1997, von Elm et al 2004). A single source rarely provides complete information about a study; on the other hand, multiple sources may contain conflicting information about the same study (Mayo-Wilson et al 2017a, Mayo-Wilson et al 2017b, Mayo-Wilson et al 2018). Because the unit of interest in a systematic review is the study and not the report, information from multiple reports often needs to be collated and reconciled. It is not appropriate to discard any report of an included study without careful examination, since it may contain valuable information not included in the primary report. Review authors will need to decide between two strategies:

  • Extract data from each report separately, then combine information across multiple data collection forms.
  • Extract data from all reports directly into a single data collection form.

The choice of which strategy to use will depend on the nature of the reports and may vary across studies and across reports. For example, when a full journal article and multiple conference abstracts are available, it is likely that the majority of information will be obtained from the journal article; completing a new data collection form for each conference abstract may be a waste of time. Conversely, when there are two or more detailed journal articles, perhaps relating to different periods of follow-up, then it is likely to be easier to perform data extraction separately for these articles and collate information from the data collection forms afterwards. When data from all reports are extracted into a single data collection form, review authors should identify the ‘main’ data source for each study when sources include conflicting data and these differences cannot be resolved by contacting authors (Mayo-Wilson et al 2018). Flow diagrams such as those modified from the PRISMA statement can be particularly helpful when collating and documenting information from multiple reports (Mayo-Wilson et al 2018).

5.5.5 Reliability and reaching consensus

When more than one author extracts data from the same reports, there is potential for disagreement. After data have been extracted independently by two or more extractors, responses must be compared to assure agreement or to identify discrepancies. An explicit procedure or decision rule should be specified in the protocol for identifying and resolving disagreements. Most often, the source of the disagreement is an error by one of the extractors and is easily resolved. Thus, discussion among the authors is a sensible first step. More rarely, a disagreement may require arbitration by another person. Any disagreement that cannot be resolved should be addressed by contacting the study authors; if this is unsuccessful, the disagreement should be reported in the review.

The presence and resolution of disagreements should be carefully recorded. Maintaining a copy of the data ‘as extracted’ (in addition to the consensus data) allows assessment of reliability of coding. Examples of ways in which this can be achieved include the following:

  • Use one author’s (paper) data collection form and record changes after consensus in a different ink colour.
  • Enter consensus data onto an electronic form.
  • Record original data extracted and consensus data in separate forms (some online tools do this automatically).

Agreement of coded items before reaching consensus can be quantified, for example using kappa statistics (Orwin 1994), although this is not routinely done in Cochrane Reviews. If agreement is assessed, this should be done only for the most important data (e.g. key risk of bias assessments, or availability of key outcomes).
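For illustration, unweighted Cohen’s kappa for two extractors’ codes on a single item can be computed directly from the ‘as extracted’ data. The codes below are hypothetical risk-of-bias judgements; in practice a statistics package would be used.

```python
# Minimal sketch of unweighted Cohen's kappa for agreement between two
# data extractors, before consensus. The codes are hypothetical examples.

def cohens_kappa(codes_a, codes_b):
    """Unweighted Cohen's kappa for two raters' categorical codes."""
    assert len(codes_a) == len(codes_b)
    n = len(codes_a)
    categories = set(codes_a) | set(codes_b)
    # Observed proportion of agreement
    p_observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Agreement expected by chance, from each rater's marginal proportions
    p_expected = sum(
        (codes_a.count(c) / n) * (codes_b.count(c) / n) for c in categories
    )
    return (p_observed - p_expected) / (1 - p_expected)

extractor_1 = ["low", "low", "high", "unclear", "low", "high"]
extractor_2 = ["low", "high", "high", "unclear", "low", "high"]
print(round(cohens_kappa(extractor_1, extractor_2), 2))  # 0.74
```

Kappa corrects the raw percentage agreement for the agreement expected by chance, which is why it is preferred over simple percentage agreement when it is reported.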

Throughout the review process informal consideration should be given to the reliability of data extraction. For example, if after reaching consensus on the first few studies, the authors note a frequent disagreement for specific data, then coding instructions may need modification. Furthermore, an author’s coding strategy may change over time, as the coding rules are forgotten, indicating a need for retraining and, possibly, some recoding.

5.5.6 Extracting data from clinical study reports

Clinical study reports (CSRs) obtained for a systematic review are likely to be in PDF format. Although CSRs can be thousands of pages in length and very time-consuming to review, they typically follow the content and format required by the International Conference on Harmonisation (ICH 1995). Information in CSRs is usually presented in a structured and logical way. For example, numerical data pertaining to important demographic, efficacy, and safety variables are placed within the main text in tables and figures. Because of the clarity and completeness of information provided in CSRs, data extraction from CSRs may be easier and conducted more confidently than from journal articles or other short reports.

To extract data from CSRs efficiently, review authors should familiarize themselves with the structure of the CSRs. In practice, review authors may want to browse or create ‘bookmarks’ within a PDF document that record section headers and subheaders and search key words related to the data extraction (e.g. randomization). In addition, it may be useful to utilize optical character recognition software to convert tables of data in the PDF to an analysable format when additional analyses are required, saving time and minimizing transcription errors.

CSRs may contain many outcomes and present many results for a single outcome (due to different analyses) (Mayo-Wilson et al 2017b). We recommend review authors extract results only for outcomes of interest to the review (Section 5.3.6 ). With regard to different methods of analysis, review authors should have a plan and pre-specify preferred metrics in their protocol for extracting results pertaining to different populations (e.g. ‘all randomized’, ‘all participants taking at least one dose of medication’), methods for handling missing data (e.g. ‘complete case analysis’, ‘multiple imputation’), and adjustment (e.g. unadjusted, adjusted for baseline covariates). It may be important to record the range of analysis options available, even if not all are extracted in detail. In some cases it may be preferable to use metrics that are comparable across multiple included studies, which may not be clear until data collection for all studies is complete.

CSRs are particularly useful for identifying outcomes assessed but not presented to the public. For efficacy outcomes and systematically collected adverse events, review authors can compare what is described in the CSRs with what is reported in published reports to assess the risk of bias due to missing outcome data ( Chapter 8, Section 8.5 ) and in selection of reported result ( Chapter 8, Section 8.7 ). Note that non-systematically collected adverse events are not amenable to such comparisons because these adverse events may not be known ahead of time and thus not pre-specified in the protocol.

5.5.7 Extracting data from regulatory reviews

Data most relevant to systematic reviews can be found in the medical and statistical review sections of a regulatory review. Both of these are substantially longer than journal articles (Turner 2013). A list of all trials on a drug usually can be found in the medical review. Because trials are referenced by a combination of numbers and letters, it may be difficult for the review authors to link the trial with other reports of the same trial (Section 5.2.1 ).

Many of the documents downloaded from the US Food and Drug Administration’s website for older drugs are scanned copies and are not searchable because of redaction of confidential information (Turner 2013). Optical character recognition software can convert most of the text. Reviews for newer drugs have been redacted electronically, so the documents remain searchable.

Compared to CSRs, regulatory reviews contain less information about trial design, execution, and results. They provide limited information for assessing the risk of bias. In terms of extracting outcomes and results, review authors should follow the guidance provided for CSRs (Section 5.5.6 ).

5.5.8 Extracting data from figures with software

Sometimes numerical data needed for systematic reviews are only presented in figures. Review authors may request the data from the study investigators, or alternatively, extract the data from the figures either manually (e.g. with a ruler) or by using software. Numerous tools are available, many of which are free. Those available at the time of writing include Plot Digitizer, WebPlotDigitizer, Engauge, Dexter, ycasd, and GetData Graph Digitizer. The software works by taking an image of a figure and then digitizing the data points off the figure using the axes and scales set by the user. The numbers exported can be used for systematic reviews, although additional calculations may be needed to obtain the summary statistics, such as calculation of means and standard deviations from individual-level data points (or conversion of time-to-event data presented on Kaplan-Meier plots to hazard ratios; see Chapter 6, Section 6.8.2 ).
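Once individual data points have been digitized from a figure, the required summary statistics still need to be computed. A minimal sketch, using the sample standard deviation (denominator n − 1), which is the quantity meta-analysis formulas expect; the digitized values below are invented for illustration:

```python
import math

def mean_and_sd(points):
    """Mean and sample standard deviation (denominator n - 1)."""
    n = len(points)
    mean = sum(points) / n
    variance = sum((x - mean) ** 2 for x in points) / (n - 1)
    return mean, math.sqrt(variance)

digitized = [4.1, 5.0, 4.6, 5.3, 4.8]  # values read off a scatter plot
m, sd = mean_and_sd(digitized)
print(round(m, 2), round(sd, 2))  # 4.76 0.45
```

Digitization error propagates into these summaries, which is one reason requesting the original data from investigators remains preferable when feasible.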

It has been demonstrated that software is more convenient and accurate than visual estimation or use of a ruler (Gross et al 2014, Jelicic Kadic et al 2016). Review authors should consider using software for extracting numerical data from figures when the data are not available elsewhere.

5.5.9 Automating data extraction in systematic reviews

Because data extraction is time-consuming and error-prone, automating or semi-automating this step may make the extraction process more efficient and accurate. The state of the science relevant to automating data extraction is summarized here (Jonnalagadda et al 2015).

  • At least 26 studies have tested various natural language processing and machine learning approaches for facilitating data extraction for systematic reviews.

  • Each tool focuses on only a limited number of data elements (ranging from one to seven). Most of the existing tools focus on PICO information (e.g. number of participants, their age, sex, country, recruiting centres, intervention groups, outcomes, and time points). A few are able to extract study design and results (e.g. objectives, study duration, participant flow), and two extract risk of bias information (Marshall et al 2016, Millard et al 2016). To date, well over half of the data elements needed for systematic reviews have not been explored for automated extraction.

  • Most tools highlight the sentence(s) that may contain the data elements as opposed to directly recording these data elements into a data collection form or a data system.
  • There is no gold standard or common dataset to evaluate the performance of these tools, limiting our ability to interpret the significance of the reported accuracy measures.

At the time of writing, we cannot recommend a specific tool for automating data extraction for routine systematic review production. There is a need for review authors to work with experts in informatics to refine these tools and evaluate them rigorously. Such investigations should address how the tool will fit into existing workflows. For example, the automated or semi-automated data extraction approaches may first act as checks for manual data extraction before they can replace it.

5.5.10 Suspicions of scientific misconduct

Systematic review authors can uncover suspected misconduct in the published literature. Misconduct includes fabrication or falsification of data or results, plagiarism, and research that does not adhere to ethical norms. Review authors need to be aware of scientific misconduct because the inclusion of fraudulent material could undermine the reliability of a review’s findings. Plagiarism of results data in the form of duplicated publication (either by the same or by different authors) may, if undetected, lead to study participants being double counted in a synthesis.

It is preferable to identify potential problems before, rather than after, publication of the systematic review, so that readers are not misled. However, empirical evidence indicates that the extent to which systematic review authors explore misconduct varies widely (Elia et al 2016). Text-matching software and systems such as CrossCheck may be helpful for detecting plagiarism, but they can detect only matching text, so data tables or figures need to be inspected by hand or using other systems (e.g. to detect image manipulation). Lists of data such as in a meta-analysis can be a useful means of detecting duplicated studies. Furthermore, examination of baseline data can lead to suspicions of misconduct for an individual randomized trial (Carlisle et al 2015). For example, Al-Marzouki and colleagues concluded that a trial report was fabricated or falsified on the basis of highly unlikely baseline differences between two randomized groups (Al-Marzouki et al 2005).
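The idea behind baseline checks of this kind can be illustrated, in much-simplified form, by computing a test statistic for a baseline difference from the summary statistics a trial report provides. This sketch is not the method of Carlisle et al (2015); the function name and the numbers are illustrative only.

```python
import math

def baseline_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch t statistic for a difference in baseline means, computed
    from the summary statistics a trial report typically provides."""
    return (mean1 - mean2) / math.sqrt(sd1**2 / n1 + sd2**2 / n2)

# Illustrative numbers only: baseline systolic BP in two randomized arms.
t = baseline_t(mean1=132.0, sd1=10.0, n1=100,
               mean2=140.5, sd2=10.5, n2=100)
print(round(t, 2))  # a large |t| at baseline is unexpected under randomization
```

Under proper randomization, large baseline differences should be rare; a pattern of implausible values across many variables or trials would prompt closer scrutiny, not a conclusion of misconduct on its own.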

Cochrane Review authors are advised to consult with Cochrane editors if cases of suspected misconduct are identified. Searching for comments, letters or retractions may uncover additional information. Sensitivity analyses can be used to determine whether the studies arousing suspicion are influential in the conclusions of the review. Guidance for editors for addressing suspected misconduct will be available from Cochrane’s Editorial and Publishing Policy Resource (see community.cochrane.org ). Further information is available from the Committee on Publication Ethics (COPE; publicationethics.org ), including a series of flowcharts on how to proceed if various types of misconduct are suspected. Cases should be followed up, typically including an approach to the editors of the journals in which suspect reports were published. It may be useful to write first to the primary investigators to request clarification of apparent inconsistencies or unusual observations.

Because investigations may take time, and institutions may not always be responsive (Wager 2011), articles suspected of being fraudulent should be classified as ‘awaiting assessment’. If a misconduct investigation indicates that the publication is unreliable, or if a publication is retracted, it should not be included in the systematic review, and the reason should be noted in the ‘excluded studies’ section.

5.5.11 Key points in planning and reporting data extraction

In summary, the methods section of both the protocol and the review should detail:

  • the data categories that are to be extracted;
  • how extracted data from each report will be verified (e.g. extraction by two review authors, independently);
  • whether data extraction is undertaken by content area experts, methodologists, or both;
  • pilot testing, training and existence of coding instructions for the data collection form;
  • how data are extracted from multiple reports from the same study; and
  • how disagreements are handled when more than one author extracts data from each report.

5.6 Extracting study results and converting to the desired format

In most cases, it is desirable to collect summary data separately for each intervention group of interest and to enter these into software in which effect estimates can be calculated, such as RevMan. Sometimes the required data may be obtained only indirectly, and the relevant results may not be obvious. Chapter 6 provides many useful tips and techniques to deal with common situations. When summary data cannot be obtained from each intervention group, or where it is important to use results of adjusted analyses (for example, to account for correlations in crossover or cluster-randomized trials), effect estimates may be available directly.

5.7 Managing and sharing data

When data have been collected for each individual study, it is helpful to organize them into a comprehensive electronic format, such as a database or spreadsheet, before entering data into a meta-analysis or other synthesis. When data are collated electronically, all or a subset of them can easily be exported for cleaning, consistency checks and analysis.

Tabulation of collected information about studies can facilitate classification of studies into appropriate comparisons and subgroups. It also allows identification of comparable outcome measures and statistics across studies. It will often be necessary to perform calculations to obtain the required statistics for presentation or synthesis. It is important through this process to retain clear information on the provenance of the data, with a clear distinction between data from a source document and data obtained through calculations. Statistical conversions, for example from standard errors to standard deviations, ideally should be undertaken with a computer rather than using a hand calculator to maintain a permanent record of the original and calculated numbers as well as the actual calculations used.
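For example, the conversion from a standard error of the mean to a standard deviation (SD = SE × √n) can be scripted so that the reported value, the calculation used, and the result are stored together, preserving provenance. The study label and values below are illustrative.

```python
import math

def sd_from_se(se, n):
    """Convert a standard error of the mean to a standard deviation:
    SD = SE * sqrt(n)."""
    return se * math.sqrt(n)

# Keep the reported value, the calculation used, and the result together
# so the provenance of every number in the analysis is traceable.
record = {
    "study": "Example 2020",          # illustrative study label
    "reported_se": 1.2,               # as given in the source document
    "n": 50,
    "conversion": "SD = SE * sqrt(n)",
}
record["calculated_sd"] = round(sd_from_se(record["reported_se"], record["n"]), 3)
print(record["calculated_sd"])  # 1.2 * sqrt(50) ≈ 8.485
```

Storing such records in a script or spreadsheet formula, rather than transcribing hand-calculator output, leaves a permanent audit trail of which numbers came from the source document and which were derived.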

Ideally, data only need to be extracted once and should be stored in a secure and stable location for future updates of the review, regardless of whether the original review authors or a different group of authors update the review (Ip et al 2012). Standardizing and sharing data collection tools as well as data management systems among review authors working in similar topic areas can streamline systematic review production. Review authors have the opportunity to work with trialists, journal editors, funders, regulators, and other stakeholders to make study data (e.g. CSRs, IPD, and any other form of study data) publicly available, increasing the transparency of research. When legal and ethical to do so, we encourage review authors to share the data used in their systematic reviews to reduce waste and to allow verification and reanalysis because data will not have to be extracted again for future use (Mayo-Wilson et al 2018).

5.8 Chapter information

Editors: Tianjing Li, Julian PT Higgins, Jonathan J Deeks

Acknowledgements: This chapter builds on earlier versions of the Handbook . For details of previous authors and editors of the Handbook , see Preface. Andrew Herxheimer, Nicki Jackson, Yoon Loke, Deirdre Price and Helen Thomas contributed text. Stephanie Taylor and Sonja Hood contributed suggestions for designing data collection forms. We are grateful to Judith Anzures, Mike Clarke, Miranda Cumpston and Peter Gøtzsche for helpful comments.

Funding: JPTH is a member of the National Institute for Health Research (NIHR) Biomedical Research Centre at University Hospitals Bristol NHS Foundation Trust and the University of Bristol. JJD received support from the NIHR Birmingham Biomedical Research Centre at the University Hospitals Birmingham NHS Foundation Trust and the University of Birmingham. JPTH received funding from National Institute for Health Research Senior Investigator award NF-SI-0617-10145. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.

5.9 References

Al-Marzouki S, Evans S, Marshall T, Roberts I. Are these data real? Statistical methods for the detection of data fabrication in clinical trials. BMJ 2005; 331 : 267-270.

Allen EN, Mushi AK, Massawe IS, Vestergaard LS, Lemnge M, Staedke SG, Mehta U, Barnes KI, Chandler CI. How experiences become data: the process of eliciting adverse event, medical history and concomitant medication reports in antimalarial and antiretroviral interaction trials. BMC Medical Research Methodology 2013; 13 : 140.

Baudard M, Yavchitz A, Ravaud P, Perrodeau E, Boutron I. Impact of searching clinical trial registries in systematic reviews of pharmaceutical treatments: methodological systematic review and reanalysis of meta-analyses. BMJ 2017; 356 : j448.

Bent S, Padula A, Avins AL. Better ways to question patients about adverse medical events: a randomized, controlled trial. Annals of Internal Medicine 2006; 144 : 257-261.

Berlin JA. Does blinding of readers affect the results of meta-analyses? University of Pennsylvania Meta-analysis Blinding Study Group. Lancet 1997; 350 : 185-186.

Buscemi N, Hartling L, Vandermeer B, Tjosvold L, Klassen TP. Single data extraction generated more errors than double data extraction in systematic reviews. Journal of Clinical Epidemiology 2006; 59 : 697-703.

Carlisle JB, Dexter F, Pandit JJ, Shafer SL, Yentis SM. Calculating the probability of random sampling for continuous variables in submitted or published randomised controlled trials. Anaesthesia 2015; 70 : 848-858.

Carroll C, Patterson M, Wood S, Booth A, Rick J, Balain S. A conceptual framework for implementation fidelity. Implementation Science 2007; 2 : 40.

Carvajal A, Ortega PG, Sainz M, Velasco V, Salado I, Arias LHM, Eiros JM, Rubio AP, Castrodeza J. Adverse events associated with pandemic influenza vaccines: Comparison of the results of a follow-up study with those coming from spontaneous reporting. Vaccine 2011; 29 : 519-522.

Chamberlain C, O'Mara-Eves A, Porter J, Coleman T, Perlen SM, Thomas J, McKenzie JE. Psychosocial interventions for supporting women to stop smoking in pregnancy. Cochrane Database of Systematic Reviews 2017; 2 : CD001055.

Damschroder LJ, Aron DC, Keith RE, Kirsh SR, Alexander JA, Lowery JC. Fostering implementation of health services research findings into practice: a consolidated framework for advancing implementation science. Implementation Science 2009; 4 : 50.

Davis AL, Miller JD. The European Medicines Agency and publication of clinical study reports: a challenge for the US FDA. JAMA 2017; 317 : 905-906.

Denniston AK, Holland GN, Kidess A, Nussenblatt RB, Okada AA, Rosenbaum JT, Dick AD. Heterogeneity of primary outcome measures used in clinical trials of treatments for intermediate, posterior, and panuveitis. Orphanet Journal of Rare Diseases 2015; 10 : 97.

Derry S, Loke YK. Risk of gastrointestinal haemorrhage with long term use of aspirin: meta-analysis. BMJ 2000; 321 : 1183-1187.

Doshi P, Dickersin K, Healy D, Vedula SS, Jefferson T. Restoring invisible and abandoned trials: a call for people to publish the findings. BMJ 2013; 346 : f2865.

Dusenbury L, Brannigan R, Falco M, Hansen WB. A review of research on fidelity of implementation: implications for drug abuse prevention in school settings. Health Education Research 2003; 18 : 237-256.

Dwan K, Altman DG, Clarke M, Gamble C, Higgins JPT, Sterne JAC, Williamson PR, Kirkham JJ. Evidence for the selective reporting of analyses and discrepancies in clinical trials: a systematic review of cohort studies of clinical trials. PLoS Medicine 2014; 11 : e1001666.

Elia N, von Elm E, Chatagner A, Popping DM, Tramèr MR. How do authors of systematic reviews deal with research malpractice and misconduct in original studies? A cross-sectional analysis of systematic reviews and survey of their authors. BMJ Open 2016; 6 : e010442.

Gøtzsche PC. Multiple publication of reports of drug trials. European Journal of Clinical Pharmacology 1989; 36 : 429-432.

Gøtzsche PC, Hróbjartsson A, Maric K, Tendal B. Data extraction errors in meta-analyses that use standardized mean differences. JAMA 2007; 298 : 430-437.

Gross A, Schirm S, Scholz M. Ycasd - a tool for capturing and scaling data from graphical representations. BMC Bioinformatics 2014; 15 : 219.

Hoffmann TC, Glasziou PP, Boutron I, Milne R, Perera R, Moher D, Altman DG, Barbour V, Macdonald H, Johnston M, Lamb SE, Dixon-Woods M, McCulloch P, Wyatt JC, Chan AW, Michie S. Better reporting of interventions: template for intervention description and replication (TIDieR) checklist and guide. BMJ 2014; 348 : g1687.

ICH. ICH Harmonised Tripartite Guideline: Structure and Content of Clinical Study Reports E3. 1995. www.ich.org/fileadmin/Public_Web_Site/ICH_Products/Guidelines/Efficacy/E3/E3_Guideline.pdf .

Ioannidis JPA, Mulrow CD, Goodman SN. Adverse events: The more you search, the more you find. Annals of Internal Medicine 2006; 144 : 298-300.

Ip S, Hadar N, Keefe S, Parkin C, Iovin R, Balk EM, Lau J. A web-based archive of systematic review data. Systematic Reviews 2012; 1 : 15.

Ismail R, Azuara-Blanco A, Ramsay CR. Variation of clinical outcomes used in glaucoma randomised controlled trials: a systematic review. British Journal of Ophthalmology 2014; 98 : 464-468.

Jadad AR, Moore RA, Carroll D, Jenkinson C, Reynolds DJM, Gavaghan DJ, McQuay H. Assessing the quality of reports of randomized clinical trials: Is blinding necessary? Controlled Clinical Trials 1996; 17 : 1-12.

Jelicic Kadic A, Vucic K, Dosenovic S, Sapunar D, Puljak L. Extracting data from figures with software was faster, with higher interrater reliability than manual extraction. Journal of Clinical Epidemiology 2016; 74 : 119-123.

Jones AP, Remmington T, Williamson PR, Ashby D, Smyth RL. High prevalence but low impact of data extraction and reporting errors were found in Cochrane systematic reviews. Journal of Clinical Epidemiology 2005; 58 : 741-742.

Jones CW, Keil LG, Holland WC, Caughey MC, Platts-Mills TF. Comparison of registered and published outcomes in randomized controlled trials: a systematic review. BMC Medicine 2015; 13 : 282.

Jonnalagadda SR, Goyal P, Huffman MD. Automating data extraction in systematic reviews: a systematic review. Systematic Reviews 2015; 4 : 78.

Lewin S, Hendry M, Chandler J, Oxman AD, Michie S, Shepperd S, Reeves BC, Tugwell P, Hannes K, Rehfuess EA, Welch V, McKenzie JE, Burford B, Petkovic J, Anderson LM, Harris J, Noyes J. Assessing the complexity of interventions within systematic reviews: development, content and use of a new tool (iCAT_SR). BMC Medical Research Methodology 2017; 17 : 76.

Li G, Abbade LPF, Nwosu I, Jin Y, Leenus A, Maaz M, Wang M, Bhatt M, Zielinski L, Sanger N, Bantoto B, Luo C, Shams I, Shahid H, Chang Y, Sun G, Mbuagbaw L, Samaan Z, Levine MAH, Adachi JD, Thabane L. A scoping review of comparisons between abstracts and full reports in primary biomedical research. BMC Medical Research Methodology 2017; 17 : 181.

Li TJ, Vedula SS, Hadar N, Parkin C, Lau J, Dickersin K. Innovations in data collection, management, and archiving for systematic reviews. Annals of Internal Medicine 2015; 162 : 287-294.

Liberati A, Altman DG, Tetzlaff J, Mulrow C, Gøtzsche PC, Ioannidis JPA, Clarke M, Devereaux PJ, Kleijnen J, Moher D. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration. PLoS Medicine 2009; 6 : e1000100.

Liu ZM, Saldanha IJ, Margolis D, Dumville JC, Cullum NA. Outcomes in Cochrane systematic reviews related to wound care: an investigation into prespecification. Wound Repair and Regeneration 2017; 25 : 292-308.

Marshall IJ, Kuiper J, Wallace BC. RobotReviewer: evaluation of a system for automatically assessing bias in clinical trials. Journal of the American Medical Informatics Association 2016; 23 : 193-201.

Mayo-Wilson E, Doshi P, Dickersin K. Are manufacturers sharing data as promised? BMJ 2015; 351 : h4169.

Mayo-Wilson E, Li TJ, Fusco N, Bertizzolo L, Canner JK, Cowley T, Doshi P, Ehmsen J, Gresham G, Guo N, Haythomthwaite JA, Heyward J, Hong H, Pham D, Payne JL, Rosman L, Stuart EA, Suarez-Cuervo C, Tolbert E, Twose C, Vedula S, Dickersin K. Cherry-picking by trialists and meta-analysts can drive conclusions about intervention efficacy. Journal of Clinical Epidemiology 2017a; 91 : 95-110.

Mayo-Wilson E, Fusco N, Li TJ, Hong H, Canner JK, Dickersin K, MUDS Investigators. Multiple outcomes and analyses in clinical trials create challenges for interpretation and research synthesis. Journal of Clinical Epidemiology 2017b; 86 : 39-50.

Mayo-Wilson E, Li T, Fusco N, Dickersin K. Practical guidance for using multiple data sources in systematic reviews and meta-analyses (with examples from the MUDS study). Research Synthesis Methods 2018; 9 : 2-12.

Meade MO, Richardson WS. Selecting and appraising studies for a systematic review. Annals of Internal Medicine 1997; 127 : 531-537.

Meinert CL. Clinical trials dictionary: Terminology and usage recommendations . Hoboken (NJ): Wiley; 2012.

Millard LAC, Flach PA, Higgins JPT. Machine learning to assist risk-of-bias assessments in systematic reviews. International Journal of Epidemiology 2016; 45 : 266-277.

Moher D, Schulz KF, Altman DG. The CONSORT Statement: revised recommendations for improving the quality of reports of parallel-group randomised trials. Lancet 2001; 357 : 1191-1194.

Moher D, Hopewell S, Schulz KF, Montori V, Gøtzsche PC, Devereaux PJ, Elbourne D, Egger M, Altman DG. CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. BMJ 2010; 340 : c869.

Moore GF, Audrey S, Barker M, Bond L, Bonell C, Hardeman W, Moore L, O'Cathain A, Tinati T, Wight D, Baird J. Process evaluation of complex interventions: Medical Research Council guidance. BMJ 2015; 350 : h1258.

Orwin RG. Evaluating coding decisions. In: Cooper H, Hedges LV, editors. The Handbook of Research Synthesis . New York (NY): Russell Sage Foundation; 1994. p. 139-162.

Page MJ, McKenzie JE, Kirkham J, Dwan K, Kramer S, Green S, Forbes A. Bias due to selective inclusion and reporting of outcomes and analyses in systematic reviews of randomised trials of healthcare interventions. Cochrane Database of Systematic Reviews 2014; 10 : MR000035.

Ross JS, Mulvey GK, Hines EM, Nissen SE, Krumholz HM. Trial publication after registration in ClinicalTrials.Gov: a cross-sectional analysis. PLoS Medicine 2009; 6 .

Safer DJ. Design and reporting modifications in industry-sponsored comparative psychopharmacology trials. Journal of Nervous and Mental Disease 2002; 190 : 583-592.

Saldanha IJ, Dickersin K, Wang X, Li TJ. Outcomes in Cochrane systematic reviews addressing four common eye conditions: an evaluation of completeness and comparability. PloS One 2014; 9 : e109400.

Saldanha IJ, Li T, Yang C, Ugarte-Gil C, Rutherford GW, Dickersin K. Social network analysis identified central outcomes for core outcome sets using systematic reviews of HIV/AIDS. Journal of Clinical Epidemiology 2016; 70 : 164-175.

Saldanha IJ, Lindsley K, Do DV, Chuck RS, Meyerle C, Jones LS, Coleman AL, Jampel HD, Dickersin K, Virgili G. Comparison of clinical trial and systematic review outcomes for the 4 most prevalent eye diseases. JAMA Ophthalmology 2017a; 135 : 933-940.

Saldanha IJ, Li TJ, Yang C, Owczarzak J, Williamson PR, Dickersin K. Clinical trials and systematic reviews addressing similar interventions for the same condition do not consider similar outcomes to be important: a case study in HIV/AIDS. Journal of Clinical Epidemiology 2017b; 84 : 85-94.

Stewart LA, Clarke M, Rovers M, Riley RD, Simmonds M, Stewart G, Tierney JF, PRISMA-IPD Development Group. Preferred reporting items for a systematic review and meta-analysis of individual participant data: the PRISMA-IPD statement. JAMA 2015; 313 : 1657-1665.

Stock WA. Systematic coding for research synthesis. In: Cooper H, Hedges LV, editors. The Handbook of Research Synthesis . New York (NY): Russell Sage Foundation; 1994. p. 125-138.

Tramèr MR, Reynolds DJ, Moore RA, McQuay HJ. Impact of covert duplicate publication on meta-analysis: a case study. BMJ 1997; 315 : 635-640.

Turner EH. How to access and process FDA drug approval packages for use in research. BMJ 2013; 347 .

von Elm E, Poglia G, Walder B, Tramèr MR. Different patterns of duplicate publication: an analysis of articles used in systematic reviews. JAMA 2004; 291 : 974-980.

Wager E. Coping with scientific misconduct. BMJ 2011; 343 : d6586.

Wieland LS, Rutkow L, Vedula SS, Kaufmann CN, Rosman LM, Twose C, Mahendraratnam N, Dickersin K. Who has used internal company documents for biomedical and public health research and where did they find them? PloS One 2014; 9 .

Zanchetti A, Hansson L. Risk of major gastrointestinal bleeding with aspirin (Authors' reply). Lancet 1999; 353 : 149-150.

Zarin DA, Tse T, Williams RJ, Califf RM, Ide NC. The ClinicalTrials.gov results database: update and key issues. New England Journal of Medicine 2011; 364 : 852-860.

Zwarenstein M, Treweek S, Gagnier JJ, Altman DG, Tunis S, Haynes B, Oxman AD, Moher D. Improving the reporting of pragmatic trials: an extension of the CONSORT statement. BMJ 2008; 337 : a2390.

For permission to re-use material from the Handbook (either academic or commercial), please see here for full details.

Literature Review  is a comprehensive survey of the works published in a particular field of study or line of research, usually over a specific period of time, in the form of an in-depth, critical bibliographic essay or annotated list in which attention is drawn to the most significant works.

Also, we can define a literature review as the collected body of scholarly works related to a topic:

  • Summarizes and analyzes previous research relevant to a topic
  • Includes scholarly books and articles published in academic journals
  • Can be a specific scholarly paper or a section in a research paper

The objective of a literature review is to find previously published scholarly works relevant to a specific topic in order to:

  • Help gather ideas or information
  • Keep up to date with current trends and findings
  • Help develop new questions

A literature review is important because it:

  • Explains the background of research on a topic.
  • Demonstrates why a topic is significant to a subject area.
  • Helps focus your own research questions or problems.
  • Discovers relationships between research studies/ideas.
  • Suggests unexplored ideas or populations.
  • Identifies major themes, concepts, and researchers on a topic.
  • Tests assumptions; may help counter preconceived ideas and remove unconscious bias.
  • Identifies critical gaps, points of disagreement, or potentially flawed methodology or theoretical approaches.
  • Indicates potential directions for future research.

All content in this section is from Literature Review Research from Old Dominion University 

Keep in mind that a literature review is NOT:

Not an essay 

Not an annotated bibliography  in which you summarize each article that you have reviewed.  A literature review goes beyond basic summarizing to focus on the critical analysis of the reviewed works and their relationship to your research question.

Not a research paper   where you select resources to support one side of an issue versus another.  A lit review should explain and consider all sides of an argument in order to avoid bias, and areas of agreement and disagreement should be highlighted.

A literature review serves several purposes. For example, it

  • provides thorough knowledge of previous studies; introduces seminal works.
  • helps focus one’s own research topic.
  • identifies a conceptual framework for one’s own research questions or problems; indicates potential directions for future research.
  • suggests previously unused or underused methodologies, designs, quantitative and qualitative strategies.
  • identifies gaps in previous studies; identifies flawed methodologies and/or theoretical approaches; avoids replication of mistakes.
  • helps the researcher avoid repetition of earlier research.
  • suggests unexplored populations.
  • determines whether past studies agree or disagree; identifies controversy in the literature.
  • tests assumptions; may help counter preconceived ideas and remove unconscious bias.

As Kennedy (2007) notes*, it is important to think of knowledge in a given field as consisting of three layers. First, there are the primary studies that researchers conduct and publish. Second are the reviews of those studies that summarize and offer new interpretations built from, and often extending beyond, the original studies. Third, there are the perceptions, conclusions, opinions, and interpretations that are shared informally and become part of the lore of the field. In composing a literature review, it is important to note that it is often this third layer of knowledge that is cited as "true" even though it often has only a loose relationship to the primary studies and secondary literature reviews.

Given this, while literature reviews are designed to provide an overview and synthesis of pertinent sources you have explored, there are several approaches to how they can be done, depending upon the type of analysis underpinning your study. Listed below are definitions of types of literature reviews:

Argumentative Review      This form examines literature selectively in order to support or refute an argument, deeply embedded assumption, or philosophical problem already established in the literature. The purpose is to develop a body of literature that establishes a contrarian viewpoint. Given the value-laden nature of some social science research [e.g., educational reform; immigration control], argumentative approaches to analyzing the literature can be a legitimate and important form of discourse. However, note that they can also introduce problems of bias when they are used to make summary claims of the sort found in systematic reviews.

Integrative Review      Considered a form of research that reviews, critiques, and synthesizes representative literature on a topic in an integrated way such that new frameworks and perspectives on the topic are generated. The body of literature includes all studies that address related or identical hypotheses. A well-done integrative review meets the same standards as primary research in regard to clarity, rigor, and replication.

Historical Review      Few things rest in isolation from historical precedent. Historical reviews are focused on examining research throughout a period of time, often starting with the first time an issue, concept, theory, or phenomenon emerged in the literature, then tracing its evolution within the scholarship of a discipline. The purpose is to place research in a historical context to show familiarity with state-of-the-art developments and to identify the likely directions for future research.

Methodological Review      A review does not always focus on what someone said [content], but on how they said it [method of analysis]. This approach provides a framework of understanding at different levels (i.e. theory, substantive fields, research approaches, and data collection and analysis techniques). It enables researchers to draw on a wide range of knowledge, from the conceptual level to practical documents for use in fieldwork, in the areas of ontological and epistemological considerations, quantitative and qualitative integration, sampling, interviewing, data collection, and data analysis. It also helps highlight many ethical issues that we should be aware of and consider throughout a study.

Systematic Review      This form consists of an overview of existing evidence pertinent to a clearly formulated research question, which uses pre-specified and standardized methods to identify and critically appraise relevant research, and to collect, report, and analyse data from the studies that are included in the review. Typically it focuses on a very specific empirical question, often posed in a cause-and-effect form, such as "To what extent does A contribute to B?"

Theoretical Review      The purpose of this form is to concretely examine the corpus of theory that has accumulated in regard to an issue, concept, theory, or phenomenon. The theoretical literature review helps establish what theories already exist, the relationships between them, and to what degree the existing theories have been investigated, and it helps develop new hypotheses to be tested. Often this form is used to help establish a lack of appropriate theories or reveal that current theories are inadequate for explaining new or emerging research problems. The unit of analysis can focus on a theoretical concept or a whole theory or framework.

* Kennedy, Mary M. "Defining a Literature."  Educational Researcher  36 (April 2007): 139-147.

All content in this section is from The Literature Review created by Dr. Robert Larabee USC

Robinson, P. and Lowe, J. (2015),  Literature reviews vs systematic reviews.  Australian and New Zealand Journal of Public Health, 39: 103-103. doi: 10.1111/1753-6405.12393

What's in the name? The difference between a Systematic Review and a Literature Review, and why it matters . By Lynn Kysh from University of Southern California

Systematic review or meta-analysis?

A  systematic review  answers a defined research question by collecting and summarizing all empirical evidence that fits pre-specified eligibility criteria.

A  meta-analysis  is the use of statistical methods to summarize the results of these studies.

Systematic reviews, just like other research articles, can be of varying quality. They are a significant piece of work (the Centre for Reviews and Dissemination at York estimates that a team will take 9-24 months), and to be useful to other researchers and practitioners they should have:

  • clearly stated objectives with pre-defined eligibility criteria for studies
  • explicit, reproducible methodology
  • a systematic search that attempts to identify all studies
  • assessment of the validity of the findings of the included studies (e.g. risk of bias)
  • systematic presentation, and synthesis, of the characteristics and findings of the included studies

Not all systematic reviews contain meta-analysis. 

Meta-analysis is the use of statistical methods to summarize the results of independent studies. By combining information from all relevant studies, meta-analysis can provide more precise estimates of the effects of health care than those derived from the individual studies included within a review.  More information on meta-analyses can be found in  Cochrane Handbook, Chapter 9 .

A meta-analysis goes beyond critique and integration and conducts secondary statistical analysis on the outcomes of similar studies.  It is a systematic review that uses quantitative methods to synthesize and summarize the results.
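A minimal sketch of the statistical summarizing involved, assuming a fixed-effect inverse-variance model and invented effect estimates, might look like this (real meta-analyses are typically done in dedicated software such as RevMan):

```python
import math

def fixed_effect_pool(effects, ses):
    """Fixed-effect inverse-variance meta-analysis: each study's effect
    estimate is weighted by 1 / SE^2, so more precise studies count more."""
    weights = [1.0 / se**2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Illustrative effect estimates (e.g. mean differences) and standard errors
effects = [0.30, 0.10, 0.25]
ses = [0.10, 0.15, 0.20]
pooled, pooled_se = fixed_effect_pool(effects, ses)
ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
print(round(pooled, 3), round(pooled_se, 3))
```

The pooled estimate sits closest to the most precise study, which is what gives a meta-analysis its gain in precision over any single included study.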

An advantage of a meta-analysis is that it evaluates research findings on an explicit, quantitative basis, reducing subjectivity in the synthesis.  Not all topics, however, have sufficient research evidence to allow a meta-analysis to be conducted.  In that case, an integrative review is an appropriate strategy. 

Some of the content in this section is from Systematic reviews and meta-analyses: step by step guide created by Kate McAllister.

  • Last Updated: Aug 21, 2023 4:07 PM
  • URL: https://guides.lib.udel.edu/researchmethods
Organizing Your Social Sciences Research Paper

A literature review surveys prior research published in books, scholarly articles, and any other sources relevant to a particular issue, area of research, or theory, and by so doing, provides a description, summary, and critical evaluation of these works in relation to the research problem being investigated. Literature reviews are designed to provide an overview of sources you have used in researching a particular topic and to demonstrate to your readers how your research fits within existing scholarship about the topic.

Fink, Arlene. Conducting Research Literature Reviews: From the Internet to Paper . Fourth edition. Thousand Oaks, CA: SAGE, 2014.

Importance of a Good Literature Review

A literature review may consist of simply a summary of key sources, but in the social sciences, a literature review usually has an organizational pattern and combines both summary and synthesis, often within specific conceptual categories . A summary is a recap of the important information of the source, but a synthesis is a re-organization, or a reshuffling, of that information in a way that informs how you are planning to investigate a research problem. The analytical features of a literature review might:

  • Give a new interpretation of old material or combine new with old interpretations,
  • Trace the intellectual progression of the field, including major debates,
  • Depending on the situation, evaluate the sources and advise the reader on the most pertinent or relevant research, or
  • Usually in the conclusion of a literature review, identify where gaps exist in how a problem has been researched to date.

Given this, the purpose of a literature review is to:

  • Place each work in the context of its contribution to understanding the research problem being studied.
  • Describe the relationship of each work to the others under consideration.
  • Identify new ways to interpret prior research.
  • Reveal any gaps that exist in the literature.
  • Resolve conflicts amongst seemingly contradictory previous studies.
  • Identify areas of prior scholarship to prevent duplication of effort.
  • Point the way in fulfilling a need for additional research.
  • Locate your own research within the context of existing literature [very important].

Fink, Arlene. Conducting Research Literature Reviews: From the Internet to Paper. 2nd ed. Thousand Oaks, CA: Sage, 2005; Hart, Chris. Doing a Literature Review: Releasing the Social Science Research Imagination . Thousand Oaks, CA: Sage Publications, 1998; Jesson, Jill. Doing Your Literature Review: Traditional and Systematic Techniques . Los Angeles, CA: SAGE, 2011; Knopf, Jeffrey W. "Doing a Literature Review." PS: Political Science and Politics 39 (January 2006): 127-132; Ridley, Diana. The Literature Review: A Step-by-Step Guide for Students . 2nd ed. Los Angeles, CA: SAGE, 2012.

Types of Literature Reviews

It is important to think of knowledge in a given field as consisting of three layers. First, there are the primary studies that researchers conduct and publish. Second are the reviews of those studies that summarize and offer new interpretations built from and often extending beyond the primary studies. Third, there are the perceptions, conclusions, opinions, and interpretations that are shared informally among scholars and become part of the body of epistemological traditions within the field.

In composing a literature review, it is important to note that it is often this third layer of knowledge that is cited as "true" even though it often has only a loose relationship to the primary studies and secondary literature reviews. Given this, while literature reviews are designed to provide an overview and synthesis of pertinent sources you have explored, there are a number of approaches you could adopt depending upon the type of analysis underpinning your study.

Argumentative Review This form examines literature selectively in order to support or refute an argument, deeply embedded assumption, or philosophical problem already established in the literature. The purpose is to develop a body of literature that establishes a contrarian viewpoint. Given the value-laden nature of some social science research [e.g., educational reform; immigration control], argumentative approaches to analyzing the literature can be a legitimate and important form of discourse. However, note that they can also introduce problems of bias when they are used to make summary claims of the sort found in systematic reviews [see below].

Integrative Review Considered a form of research that reviews, critiques, and synthesizes representative literature on a topic in an integrated way such that new frameworks and perspectives on the topic are generated. The body of literature includes all studies that address related or identical hypotheses or research problems. A well-done integrative review meets the same standards as primary research in regard to clarity, rigor, and replication. This is the most common form of review in the social sciences.

Historical Review Few things rest in isolation from historical precedent. Historical literature reviews focus on examining research throughout a period of time, often starting with the first time an issue, concept, theory, or phenomenon emerged in the literature, then tracing its evolution within the scholarship of a discipline. The purpose is to place research in a historical context to show familiarity with state-of-the-art developments and to identify the likely directions for future research.

Methodological Review A review does not always focus on what someone said [findings], but how they came about saying what they say [method of analysis]. Reviewing methods of analysis provides a framework of understanding at different levels [i.e. those of theory, substantive fields, research approaches, and data collection and analysis techniques], how researchers draw upon a wide variety of knowledge ranging from the conceptual level to practical documents for use in fieldwork in the areas of ontological and epistemological consideration, quantitative and qualitative integration, sampling, interviewing, data collection, and data analysis. This approach helps highlight ethical issues which you should be aware of and consider as you go through your own study.

Systematic Review This form consists of an overview of existing evidence pertinent to a clearly formulated research question, which uses pre-specified and standardized methods to identify and critically appraise relevant research, and to collect, report, and analyze data from the studies that are included in the review. The goal is to deliberately document, critically evaluate, and summarize scientifically all of the research about a clearly defined research problem . Typically it focuses on a very specific empirical question, often posed in a cause-and-effect form, such as "To what extent does A contribute to B?" This type of literature review is primarily applied to examining prior research studies in clinical medicine and allied health fields, but it is increasingly being used in the social sciences.

Theoretical Review The purpose of this form is to examine the corpus of theory that has accumulated in regard to an issue, concept, theory, or phenomenon. The theoretical literature review helps to establish what theories already exist, the relationships between them, and to what degree the existing theories have been investigated, and to develop new hypotheses to be tested. Often this form is used to help establish a lack of appropriate theories or reveal that current theories are inadequate for explaining new or emerging research problems. The unit of analysis can focus on a theoretical concept or a whole theory or framework.

NOTE : Most often the literature review will incorporate some combination of types. For example, a review that examines literature supporting or refuting an argument, assumption, or philosophical problem related to the research problem will also need to include writing supported by sources that establish the history of these arguments in the literature.

Baumeister, Roy F. and Mark R. Leary. "Writing Narrative Literature Reviews." Review of General Psychology 1 (September 1997): 311-320; Fink, Arlene. Conducting Research Literature Reviews: From the Internet to Paper. 2nd ed. Thousand Oaks, CA: Sage, 2005; Hart, Chris. Doing a Literature Review: Releasing the Social Science Research Imagination. Thousand Oaks, CA: Sage Publications, 1998; Kennedy, Mary M. "Defining a Literature." Educational Researcher 36 (April 2007): 139-147; Petticrew, Mark and Helen Roberts. Systematic Reviews in the Social Sciences: A Practical Guide. Malden, MA: Blackwell Publishers, 2006; Torraco, Richard. "Writing Integrative Literature Reviews: Guidelines and Examples." Human Resource Development Review 4 (September 2005): 356-367; Rocco, Tonette S. and Maria S. Plakhotnik. "Literature Reviews, Conceptual Frameworks, and Theoretical Frameworks: Terms, Functions, and Distinctions." Human Resource Development Review 8 (March 2009): 120-130; Sutton, Anthea. Systematic Approaches to a Successful Literature Review. Los Angeles, CA: Sage Publications, 2016.

Structure and Writing Style

I.  Thinking About Your Literature Review

The structure of a literature review should include the following in support of understanding the research problem :

  • An overview of the subject, issue, or theory under consideration, along with the objectives of the literature review,
  • Division of works under review into themes or categories [e.g. works that support a particular position, those against, and those offering alternative approaches entirely],
  • An explanation of how each work is similar to and how it varies from the others,
  • Conclusions as to which pieces are best considered in their argument, are most convincing of their opinions, and make the greatest contribution to the understanding and development of their area of research.

The critical evaluation of each work should consider :

  • Provenance -- what are the author's credentials? Are the author's arguments supported by evidence [e.g. primary historical material, case studies, narratives, statistics, recent scientific findings]?
  • Methodology -- were the techniques used to identify, gather, and analyze the data appropriate to addressing the research problem? Was the sample size appropriate? Were the results effectively interpreted and reported?
  • Objectivity -- is the author's perspective even-handed or prejudicial? Is contrary data considered or is certain pertinent information ignored to prove the author's point?
  • Persuasiveness -- which of the author's theses are most convincing or least convincing?
  • Validity -- are the author's arguments and conclusions convincing? Does the work ultimately contribute in any significant way to an understanding of the subject?

II.  Development of the Literature Review

Four Basic Stages of Writing

1. Problem formulation -- which topic or field is being examined and what are its component issues?
2. Literature search -- finding materials relevant to the subject being explored.
3. Data evaluation -- determining which literature makes a significant contribution to the understanding of the topic.
4. Analysis and interpretation -- discussing the findings and conclusions of pertinent literature.

Consider the following issues before writing the literature review:

Clarify

If your assignment is not specific about what form your literature review should take, seek clarification from your professor by asking these questions:

1. Roughly how many sources would be appropriate to include?
2. What types of sources should I review (books, journal articles, websites; scholarly versus popular sources)?
3. Should I summarize, synthesize, or critique sources by discussing a common theme or issue?
4. Should I evaluate the sources in any way beyond evaluating how they relate to understanding the research problem?
5. Should I provide subheadings and other background information, such as definitions and/or a history?

Find Models

Use the exercise of reviewing the literature to examine how authors in your discipline or area of interest have composed their literature review sections. Read them to get a sense of the types of themes you might want to look for in your own research or to identify ways to organize your final review. The bibliography or reference section of sources you've already read, such as required readings in the course syllabus, are also excellent entry points into your own research.

Narrow the Topic

The narrower your topic, the easier it will be to limit the number of sources you need to read in order to obtain a good survey of relevant resources. Your professor will probably not expect you to read everything that's available about the topic, but you'll make the act of reviewing easier if you first limit the scope of the research problem. A good strategy is to begin by searching the USC Libraries Catalog for recent books about the topic and review the table of contents for chapters that focus on specific issues. You can also review the indexes of books to find references to specific issues that can serve as the focus of your research. For example, a book surveying the history of the Israeli-Palestinian conflict may include a chapter on the role Egypt has played in mediating the conflict, or look in the index for the pages where Egypt is mentioned in the text.

Consider Whether Your Sources are Current

Some disciplines require that you use information that is as current as possible. This is particularly true in medicine and the sciences, where research becomes obsolete very quickly as new discoveries are made. However, when writing a review in the social sciences, a survey of the history of the literature may be required. In other words, a complete understanding of the research problem requires you to deliberately examine how knowledge and perspectives have changed over time. Sort through other current bibliographies or literature reviews in the field to get a sense of what your discipline expects. You can also use this method to explore what is considered by scholars to be a "hot topic" and what is not.

III.  Ways to Organize Your Literature Review

Chronology of Events

If your review follows the chronological method, you could write about the materials according to when they were published. This approach should only be followed if a clear path of research building on previous research can be identified and these trends follow a clear chronological order of development. An example would be a literature review tracing continuing research on the emergence of German economic power after the fall of the Soviet Union.

By Publication

Order your sources by publication chronology only if the order demonstrates a more important trend. For instance, you could order a review of literature on environmental studies of brownfields this way if the progression revealed, for example, a change in the soil collection practices of the researchers who wrote and/or conducted the studies.

Thematic ["conceptual categories"]

A thematic literature review is the most common approach to summarizing prior research in the social and behavioral sciences. Thematic reviews are organized around a topic or issue, rather than the progression of time, although the progression of time may still be incorporated into a thematic review. For example, a review of the Internet's impact on American presidential politics could focus on the development of online political satire. While the study focuses on one topic, the Internet's impact on American presidential politics, it could still be organized chronologically, reflecting technological developments in media. The difference in this example between a "chronological" and a "thematic" approach is what is emphasized the most: themes related to the role of the Internet in presidential politics. Note that more authentic thematic reviews tend to break away from chronological order. A review organized in this manner would shift between time periods within each section according to the point being made.

Methodological

A methodological approach focuses on the methods utilized by the researcher. For the Internet in American presidential politics project, one methodological approach would be to look at cultural differences between the portrayal of American presidents on American, British, and French websites. Or the review might focus on the fundraising impact of the Internet on a particular political party. A methodological scope will influence either the types of documents in the review or the way in which these documents are discussed.

Other Sections of Your Literature Review Once you've decided on the organizational method for your literature review, the sections you need to include in the paper should be easy to figure out because they arise from your organizational strategy. In other words, a chronological review would have subsections for each vital time period; a thematic review would have subtopics based upon factors that relate to the theme or issue. However, sometimes you may need to add additional sections that are necessary for your study, but do not fit in the organizational strategy of the body. What other sections you include in the body is up to you. However, only include what is necessary for the reader to locate your study within the larger scholarship about the research problem.

Here are examples of other sections, usually in the form of a single paragraph, you may need to include depending on the type of review you write:

  • Current Situation : Information necessary to understand the current topic or focus of the literature review.
  • Sources Used : Describes the methods and resources [e.g., databases] you used to identify the literature you reviewed.
  • History : The chronological progression of the field, the research literature, or an idea that is necessary to understand the literature review, if the body of the literature review is not already a chronology.
  • Selection Methods : Criteria you used to select (and perhaps exclude) sources in your literature review. For instance, you might explain that your review includes only peer-reviewed [i.e., scholarly] sources.
  • Standards : Description of the way in which you present your information.
  • Questions for Further Research : What questions about the field has the review sparked? How will you further your research as a result of the review?

IV.  Writing Your Literature Review

Once you've settled on how to organize your literature review, you're ready to write each section. When writing your review, keep in mind these issues.

Use Evidence

A literature review section is, in this sense, just like any other academic research paper. Your interpretation of the available sources must be backed up with evidence [citations] that demonstrates that what you are saying is valid.

Be Selective

Select only the most important points in each source to highlight in the review. The type of information you choose to mention should relate directly to the research problem, whether it is thematic, methodological, or chronological. Related items that provide additional information, but that are not key to understanding the research problem, can be included in a list of further readings.

Use Quotes Sparingly

Some short quotes are appropriate if you want to emphasize a point, or if what an author stated cannot be easily paraphrased. Sometimes you may need to quote certain terminology that was coined by the author, is not common knowledge, or was taken directly from the study. Do not use extensive quotes as a substitute for using your own words in reviewing the literature.

Summarize and Synthesize

Remember to summarize and synthesize your sources within each thematic paragraph as well as throughout the review. Recapitulate important features of a research study, but then synthesize it by rephrasing the study's significance and relating it to your own work and the work of others.

Keep Your Own Voice

While the literature review presents others' ideas, your voice [the writer's] should remain front and center. For example, weave references to other sources into what you are writing, but maintain your own voice by starting and ending the paragraph with your own ideas and wording.

Use Caution When Paraphrasing

When paraphrasing a source that is not your own, be sure to represent the author's information or opinions accurately and in your own words. Even when paraphrasing an author's work, you still must provide a citation to that work.

V.  Common Mistakes to Avoid

These are the most common mistakes made in reviewing social science research literature.

  • Sources in your literature review do not clearly relate to the research problem;
  • You do not take sufficient time to define and identify the most relevant sources to use in the literature review related to the research problem;
  • You rely exclusively on secondary analytical sources rather than including relevant primary research studies or data;
  • You uncritically accept another researcher's findings and interpretations as valid, rather than critically examining all aspects of the research design and analysis;
  • You do not describe the search procedures that were used in identifying the literature to review;
  • You report isolated statistical results rather than synthesizing them with chi-squared or meta-analytic methods; and,
  • You only include research that validates your assumptions and do not consider contrary findings and alternative interpretations found in the literature.

Cook, Kathleen E. and Elise Murowchick. “Do Literature Review Skills Transfer from One Course to Another?” Psychology Learning and Teaching 13 (March 2014): 3-11; Fink, Arlene. Conducting Research Literature Reviews: From the Internet to Paper . 2nd ed. Thousand Oaks, CA: Sage, 2005; Hart, Chris. Doing a Literature Review: Releasing the Social Science Research Imagination . Thousand Oaks, CA: Sage Publications, 1998; Jesson, Jill. Doing Your Literature Review: Traditional and Systematic Techniques . London: SAGE, 2011; Literature Review Handout. Online Writing Center. Liberty University; Literature Reviews. The Writing Center. University of North Carolina; Onwuegbuzie, Anthony J. and Rebecca Frels. Seven Steps to a Comprehensive Literature Review: A Multimodal and Cultural Approach . Los Angeles, CA: SAGE, 2016; Ridley, Diana. The Literature Review: A Step-by-Step Guide for Students . 2nd ed. Los Angeles, CA: SAGE, 2012; Randolph, Justus J. “A Guide to Writing the Dissertation Literature Review." Practical Assessment, Research, and Evaluation. vol. 14, June 2009; Sutton, Anthea. Systematic Approaches to a Successful Literature Review . Los Angeles, CA: Sage Publications, 2016; Taylor, Dena. The Literature Review: A Few Tips On Conducting It. University College Writing Centre. University of Toronto; Writing a Literature Review. Academic Skills Centre. University of Canberra.

Writing Tip

Break Out of Your Disciplinary Box!

Thinking interdisciplinarily about a research problem can be a rewarding exercise in applying new ideas, theories, or concepts to an old problem. For example, what might cultural anthropologists say about the continuing conflict in the Middle East? In what ways might geographers view the need for better distribution of social service agencies in large cities differently from how social workers might study the issue? You don't want to substitute studies conducted in other fields for a thorough review of core research literature in your discipline. However, particularly in the social sciences, thinking about research problems from multiple vectors is a key strategy for finding new solutions to a problem or gaining a new perspective. Consult with a librarian about identifying research databases in other disciplines; almost every field of study has at least one comprehensive database devoted to indexing its research literature.

Frodeman, Robert. The Oxford Handbook of Interdisciplinarity . New York: Oxford University Press, 2010.

Another Writing Tip

Don't Just Review for Content!

While conducting a review of the literature, maximize the time you devote to writing this part of your paper by thinking broadly about what you should be looking for and evaluating. Review not just what scholars are saying, but how they are saying it. Some questions to ask:

  • How are they organizing their ideas?
  • What methods have they used to study the problem?
  • What theories have been used to explain, predict, or understand their research problem?
  • What sources have they cited to support their conclusions?
  • How have they used non-textual elements [e.g., charts, graphs, figures, etc.] to illustrate key points?

When you begin to write your literature review section, you'll be glad you dug deeper into how the research was designed and constructed because it establishes a means for developing more substantial analysis and interpretation of the research problem.

Hart, Chris. Doing a Literature Review: Releasing the Social Science Research Imagination . Thousand Oaks, CA: Sage Publications, 1998.

Yet Another Writing Tip

When Do I Know I Can Stop Looking and Move On?

Here are several strategies you can utilize to assess whether you've thoroughly reviewed the literature:

  • Look for repeating patterns in the research findings . If the same thing is being said, just by different people, then this likely demonstrates that the research problem has hit a conceptual dead end. At this point consider: Does your study extend current research? Does it forge a new path? Or does it merely add more of the same thing being said?
  • Look at the sources the authors cite in their work . If you begin to see the same researchers cited again and again, then this is often an indication that no new ideas have been generated to address the research problem.
  • Search Google Scholar to identify who has subsequently cited leading scholars already identified in your literature review [see next sub-tab]. This is called citation tracking and there are a number of sources that can help you identify who has cited whom, particularly scholars from outside of your discipline. Here again, if the same authors are being cited again and again, this may indicate no new literature has been written on the topic.

Onwuegbuzie, Anthony J. and Rebecca Frels. Seven Steps to a Comprehensive Literature Review: A Multimodal and Cultural Approach . Los Angeles, CA: Sage, 2016; Sutton, Anthea. Systematic Approaches to a Successful Literature Review . Los Angeles, CA: Sage Publications, 2016.

Source: USC Libraries guide, Organizing Your Social Sciences Research Paper, https://libguides.usc.edu/writingguide (last updated Apr 9, 2024).

Systematic Reviews: Step 7: Extract Data from Included Studies

Created by health science librarians.



About Step 7: Extract Data from Included Studies


In Step 7, you will skim the full text of included articles to collect information about the studies in a table format (extract data), to summarize the studies and make them easier to compare. You will: 

  • Make sure you have collected the full text of any included articles.
  • Choose the pieces of information you want to collect from each study.
  • Choose a method for collecting the data.
  • Create the data extraction table.
  • Test the data collection table (optional). 
  • Collect (extract) the data. 
  • Review the data collected for any errors. 

For accuracy, two or more people should extract data from each study. This process can be done by hand or by using a computer program. 
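
The "create the data extraction table" step above can be sketched as a simple CSV template, one blank row per included study for reviewers to fill in. The field names here are illustrative (loosely PICO-based) and the study IDs are hypothetical, not a standard.

```python
import csv

# Illustrative extraction fields; real reviews tailor these to the protocol.
FIELDS = [
    "study_id", "first_author", "year", "country",
    "population", "intervention", "comparator",
    "outcome", "sample_size", "effect_estimate", "notes",
]

def create_extraction_template(path, study_ids):
    """Write a CSV with one blank row per included study."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for study_id in study_ids:
            writer.writerow({"study_id": study_id})

create_extraction_template("extraction_template.csv", ["S01", "S02", "S03"])
```

Each reviewer can then work from a copy of the same template, which keeps the two extractions directly comparable.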

The following notes show how reporting guidelines and review-management tools apply to Step 7: Extract Data from Included Studies.

Reporting your review with PRISMA

If you reach the data extraction step and choose to exclude articles for any reason, update the number of included and excluded studies in your PRISMA flow diagram.

Managing your review with Covidence

Covidence allows you to assemble a custom data extraction template, have two reviewers conduct extraction, then send their extractions for consensus.

How a librarian can help with Step 7

A librarian can advise you on data extraction for your systematic review, including: 

  • What the data extraction stage of the review entails
  • Finding examples in the literature of similar reviews and their completed data tables
  • How to choose what data to extract from your included articles 
  • How to create a randomized sample of citations for a pilot test
  • Best practices for reporting your included studies and their important data in your review
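
The "randomized sample of citations for a pilot test" mentioned above can be sketched with the standard library; the citation IDs, sample size, and seed are hypothetical, and the fixed seed simply makes the pilot sample reproducible.

```python
import random

def pilot_sample(citation_ids, n, seed=42):
    """Draw n citation IDs at random, without replacement."""
    rng = random.Random(seed)  # fixed seed -> reproducible sample
    return rng.sample(citation_ids, n)

# Hypothetical pool of 50 included studies
included = [f"study_{i:03d}" for i in range(1, 51)]
pilot = pilot_sample(included, 5)
```

Running the extraction form on this small sample first helps reveal ambiguous fields before the full extraction begins.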

In this step of the systematic review, you will develop your evidence tables, which give detailed information for each study (perhaps using a PICO framework as a guide), and summary tables, which give a high-level overview of the findings of your review. You can create evidence and summary tables to describe study characteristics, results, or both. These tables will help you determine which studies, if any, are eligible for quantitative synthesis.

Data extraction requires a lot of planning.  We will review some of the tools you can use for data extraction, the types of information you will want to extract, and the options available in the systematic review software used here at UNC, Covidence .

How many people should extract data?

The Cochrane Handbook and other studies strongly suggest that at least two reviewers extract data independently, to reduce the number of errors.

  • Chapter 5: Collecting Data (Cochrane Handbook)
  • A Practical Guide to Data Extraction for Intervention Systematic Reviews (Covidence)

Each type of data extraction tool below is described along with what UNC has to offer.

Systematic Review Software (Covidence)

Most systematic review software tools have data extraction functionality that can save you time and effort.  Here at UNC, we use a systematic review software called Covidence. You can see a more complete list of options in the Systematic Review Toolbox .

Covidence allows you to create and publish a data extraction template with text fields, single-choice items, section headings, and section subheadings; perform dual or single reviewer data extraction; review extractions for consensus; and export data extraction and quality assessment to a CSV with each item in a column and each study in a row.

  • Covidence@UNC Guide
  • Covidence for Data Extraction (Covidence)
  • A Practical Guide to Data Extraction for Intervention Systematic Reviews (Covidence)

Spreadsheet or Database Software (Excel, Google Sheets)

You can also use spreadsheet or database software to create custom extraction forms. Spreadsheet software (such as Microsoft Excel) offers features such as drop-down menus and range checks that can speed up the process and help prevent data entry errors. Relational databases (such as Microsoft Access) can help you organize extracted information into different categories such as citation details, demographics, participant selection, intervention, and outcomes.
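As a rough illustration of what those spreadsheet validations do, here is a minimal Python sketch of drop-down and range checks applied to one extraction record. The field names, allowed values, and ranges are invented examples, not a standard:

```python
# Sketch of the validation a spreadsheet drop-down or range check performs.
# ALLOWED mimics a drop-down menu; RANGES mimics a numeric range check.
# All field names and bounds here are hypothetical examples.

ALLOWED = {
    "study_design": {"RCT", "cohort", "case-control", "cross-sectional"},
}
RANGES = {
    "publication_year": (1990, 2024),
    "sample_size": (1, 1_000_000),
}

def validate(record):
    """Return a list of validation errors for one extraction record."""
    errors = []
    for field, allowed in ALLOWED.items():
        if record.get(field) not in allowed:
            errors.append(f"{field}: {record.get(field)!r} not in {sorted(allowed)}")
    for field, (lo, hi) in RANGES.items():
        value = record.get(field)
        if not isinstance(value, (int, float)) or not lo <= value <= hi:
            errors.append(f"{field}: {value!r} outside {lo}-{hi}")
    return errors

print(validate({"study_design": "RCT", "publication_year": 2019, "sample_size": 120}))  # → []
print(validate({"study_design": "survey", "publication_year": 1875}))  # three errors
```

A spreadsheet applies the same logic cell by cell as you type; the point of either approach is to catch entry errors at extraction time rather than at analysis time.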

  • Microsoft Products (UNC Information Technology Services)

Cochrane RevMan

RevMan offers collection forms for descriptive information on population, interventions, and outcomes, and quality assessments, as well as for data for analysis and forest plots. The form elements may not be changed, and data must be entered manually. RevMan is a free software download.

  • Cochrane RevMan 5.0 Download
  • RevMan for Non-Cochrane Reviews (Cochrane Training)

Survey or Form Software (Qualtrics, Poll Everywhere)

Survey or form tools can help you create custom forms with many different question types, such as multiple choice, drop-downs, ranking, and more. Content from these tools can often be exported to spreadsheet or database software as well. Here at UNC we have access to the survey/form software Qualtrics and Poll Everywhere.

  • Qualtrics (UNC Information Technology Services)
  • Poll Everywhere (UNC Information Technology Services)

Electronic Documents or Paper & Pencil (Word, Google Docs)

In the past, people often used paper and pencil to record the data they extracted from articles. Handwritten extraction is less popular now that electronic tools are widespread. You can record extracted data in electronic tables or forms created in Microsoft Word or other word processing programs, but this process may take longer than many of the methods listed above. If chosen, electronic-document or paper-and-pencil extraction should only be used for small reviews, as larger sets of articles may become unwieldy. These methods may also be more prone to data entry errors than more automated methods.

There are benefits and limitations to each method of data extraction.  You will want to consider:

  • The cost of the software / tool
  • Shareability / versioning
  • Existing versus custom data extraction forms
  • The data entry process
  • Interrater reliability

For example, in Covidence you may spend more time building your data extraction form, but save time later in the extraction process as Covidence can automatically highlight discrepancies for review and resolution between different extractors. Excel may require less time investment to create an extraction form, but it may take longer for you to match and compare data between extractors. More in-depth comparison of the benefits and limitations of each extraction tool can be found in the table below.
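To make that matching-and-comparing step concrete, here is a small standard-library Python sketch that pairs records from two extractors by study ID and flags every cell where they disagree. The column names and values are invented for illustration:

```python
# Compare two extractors' CSV exports and list disagreements cell by cell.
# This is the manual step that Covidence automates; data here are invented.
import csv, io

extractor_a = """study_id,sample_size,outcome
Smith2020,120,improved
Jones2021,85,no change
"""
extractor_b = """study_id,sample_size,outcome
Smith2020,120,improved
Jones2021,58,no change
"""

def load(text):
    """Index rows by study_id for easy matching."""
    return {row["study_id"]: row for row in csv.DictReader(io.StringIO(text))}

def discrepancies(a_text, b_text):
    """Return (study_id, field, value_a, value_b) for every disagreement."""
    a, b = load(a_text), load(b_text)
    diffs = []
    for study_id in sorted(set(a) & set(b)):
        for field in a[study_id]:
            if a[study_id][field] != b[study_id][field]:
                diffs.append((study_id, field, a[study_id][field], b[study_id][field]))
    return diffs

print(discrepancies(extractor_a, extractor_b))
# → [('Jones2021', 'sample_size', '85', '58')]
```

Each flagged tuple is a candidate data-entry error (here, a likely transposition of 85 and 58) to resolve by consensus or by returning to the source article.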

Sample information to include in an extraction table

It may help to consult other similar systematic reviews to identify what data to collect, or to think about your question in a framework such as PICO.

Helpful data for an intervention question may include:

  • Information about the article (author(s), year of publication, title, DOI)
  • Information about the study (study type, participant recruitment / selection / allocation, level of evidence, study quality)
  • Patient demographics (age, sex, ethnicity, diseases / conditions, other characteristics related to the intervention / outcome)
  • Intervention (quantity, dosage, route of administration, format, duration, time frame, setting)
  • Outcomes (quantitative and / or qualitative)

If you plan to synthesize data, you will want to collect additional information such as sample sizes, effect sizes, dependent variables, reliability measures, pre-test data, post-test data, follow-up data, and statistical tests used.
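One way to keep these fields consistent across extractors is to define a structured record per study. The sketch below uses a Python dataclass; the field names simply mirror the bullets above and are our own invention, not a required schema:

```python
# A structured per-study extraction record. Field names echo the bullets
# in this section; adapt them to your own review's needs.
from dataclasses import dataclass, field, asdict

@dataclass
class ExtractionRecord:
    # information about the article
    authors: str
    year: int
    title: str
    doi: str
    # information about the study
    study_type: str
    # patient demographics
    sample_size: int
    # intervention and outcomes
    intervention: str
    outcomes: dict = field(default_factory=dict)  # e.g. {"pain score": -1.2}

record = ExtractionRecord(
    authors="Smith et al.", year=2020, title="Example trial", doi="10.0000/xyz",
    study_type="RCT", sample_size=120, intervention="exercise, 8 weeks",
    outcomes={"pain score (mean change)": -1.2},
)
print(asdict(record)["sample_size"])  # → 120
```

Records like this convert directly to rows of an evidence table (one attribute per column), which is exactly the shape most tools export.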

Extraction templates and approaches should be determined by the needs of the specific review. For example, if you are extracting qualitative data, you will want to capture information such as the theoretical framework, data collection methods, and the researcher's role and potential bias.

  • Supplementary Guidance for Inclusion of Qualitative Research in Cochrane Systematic Reviews of Interventions (Cochrane Collaboration Qualitative Methods Group)
  • Look for an existing extraction form or tool to help guide you.  Use existing systematic reviews on your topic to identify what information to collect if you are not sure what to do.
  • Train the review team on the extraction categories and what type of data would be expected.  A manual or guide may help your team establish standards.
  • Pilot the extraction / coding form to ensure data extractors are recording similar data. Revise the extraction form if needed.
  • Discuss any discrepancies in coding throughout the process.
  • Document any changes to the process or the form.  Keep track of the decisions the team makes and the reasoning behind them.
Eugene McDermott Library

Literature Review


Sources for Literature Review Items

Sources for a Literature Review will come from a variety of places, including:

• Books: Use the Library Catalog to see what items McDermott Library has on your topic or whether it has a specific source you need. The WorldCat database allows you to search the catalogs of many libraries and is a good place to find out what books exist on your topic.

• Reference Materials: Reference materials such as encyclopedias and dictionaries provide good overall views of topics and provide keyword hints for searching. Many include lists of sources to consider for your literature review.

• Journals via Electronic Databases: Journals are a major source of material for a literature review. With the library's databases, you can search thousands of journals going back a century or more.

• Conference Papers: At conferences, professionals and scholars explore the latest trends, share new ideas, and present new research. Searching conference papers allows you to see research before it is published and get a feel for what is going on in a particular organization or group. Many electronic databases include conference proceedings, but the Conference Proceedings Citation Index database lets you search proceedings alone.

• Dissertations & Theses: McDermott Library licenses databases with full-text access to dissertations and theses. Some are specific to Texas or UTD-produced studies; choose the Global option to search more broadly.

• Internet: The general internet can be a valuable resource for information, but it is largely unregulated, so be sure to evaluate internet sources critically. The Evaluating Websites LibGuide offers suggestions for evaluating websites.

• Government Publications: The U.S. government produces a wide variety of information sources, from consumer brochures and congressional reports to large datasets and longitudinal studies. For the United States, USA.gov is a good place to start; official state websites can be helpful for individual state statistics and information.

Literature review

Chapter 3: Research Methodology

3.5 Data Collection Methods

3.5.1 Literature Review

A literature review is often undertaken prior to empirical research as it provides a synthesis of the extant knowledge on a given topic. The scope of a literature review can vary. The emphasis may be on a review of research methods to determine which approach to adopt or examination of current knowledge to inform policy decisions. An essay style review was criticised by Hakim (1992, pp.18-19) for its subjective approach and partial coverage. The preferred style is a meta-analysis which introduces more rigour into the process. Meta-analysis involves statistical analysis to highlight significance in reported study findings. It is a useful tool for reviews of quantitative studies but is not believed to be as appropriate for reviews of qualitative studies (Hakim 1992, pp.19-20). An alternative approach is to carry out a systematic review where explicit procedures are followed making bias less likely to occur (Bryman 2008, p.85). Systematic reviews involve a series of defined steps:

• purpose statement;

• criteria for selection of published works;

• all in-scope works are included in the review;

• study features recorded against a defined protocol (location, sample size, data collection methods and key findings); and

• results summarised and synthesised, possibly presented in a table (Millar 2004, p.145).

One limitation of a systematic review is that differences between studies are not highlighted, resulting in a loss of important detail (Millar 2004, p.146).

A narrative or descriptive literature review is useful for gaining an insight into a topic which is further understood by empirical research. This form of review is more wide ranging, exploratory and not as clearly defined as other types of literature review (Bryman 2008, pp.92-93). Prior studies are compared for trends or patterns in their results (Millar 2004, p.142).

Literature reviews are advantageous because they can be conducted relatively quickly with little cost. They are, however, limited to published literature which may not adequately cover areas under investigation (Hakim 1992, p.24).

3.5.2 Questionnaires

The criteria for research questionnaires are that they should:

• collect information for analysis;

• comprise a set list of questions which is presented to all respondents; and

• gather information directly from subjects (Denscombe 2007, pp.153-154).

They are ideal tools to use where the researcher wishes to gather information from a large number of individuals who are geographically dispersed, where standard data are required and respondents have the ability to understand the questions being asked. Questionnaires tend to gather information around ‘facts’ or ‘opinions’ and the researcher must have no ambiguities regarding the focus of their investigation (Denscombe 2007, pp.154-155).

The length and complexity of the questionnaire is a matter of judgement for the researcher. The decision needs to be made by taking into account the audience and time required to complete the questionnaire, however, a major deterrent to completion is its size. Therefore, key research issues should be addressed by the questionnaire (Denscombe 2007, pp.161-162). In addition, when compared with interviews, self-completion questionnaires need to be easy to follow, short to minimise the risk of survey fatigue, and have a limited number of open questions as closed questions are easier to answer in the absence of an interviewer to guide the process (Bryman 2008, p.217).

Prior to releasing a questionnaire to its intended audience it needs to be tested and refined. This pilot process ensures optimal wording and question ordering, tests letters of introduction and analysis of pilot data assists in developing a plan for final data analysis (Oppenheim 1992, pp.47-64).

One of the weaknesses of structured questionnaires is that they provide less depth of information than interviews (Hakim 1992, p.49). To be effective the researcher needs to ensure that questionnaire respondents mirror the wider target population. Failure to do so can introduce bias into the results. Responses also need to be an accurate measure of respondent characteristics (Fowler 2009, pp.12-14).

3.5.3 Interviews

Interviews are a useful source of preliminary information for the researcher and they can help to frame the research to follow (Blakeslee & Fleischer 2007, pp.30-31). In this respect they provide a mechanism for identifying issues and themes. They are also used to obtain in-depth data when “information based on insider experience, privileged insights and experiences” are required (Wisker 2001, p.165). Interviews can take a variety of formats from formal structured, through semi-structured to informal or opportunistic. Formal interviews follow a set structure and question list; for the researcher they are a way of gathering a standard set of data which is consistent across all interviewees (Blakeslee & Fleischer 2007, p.133). Semi-structured interviews have a defined list of questions but provide scope for discussion (Wisker 2001, pp.168-169).

Interviews are conducted from the perspective of the interviewer; their views will have a bearing on the interview process and subsequent analysis of the transcript. It is therefore important to follow ethical practices, to avoid bias and to be open to the views of the interviewee (Wisker 2001, pp.142-143).

One of the drawbacks of adopting interviews as a research method is that they are time consuming (Gillham 2000, pp.65-66; Wisker 2001, p.165). Thus, it is advisable to maintain a focus on the research topic (Blakeslee & Fleischer 2007, pp.138-139; Gillham 2000, pp.65-66).

3.5.4 Document analysis

Document analysis draws on written, visual and audio files from a range of sources. Written documents include Government publications, newspapers, meeting notes, letters, diaries or webpages. Particularly attractive sources of data for researchers are those which are freely available and accessible. Documents that are not freely available require the researcher to negotiate access or undertake undercover activities to source. Researchers need to assess the validity of the documents they examine; for a website this involves consideration of the authority of the source, trustworthiness of the website, whether information is up-to-date and the popularity of the website (Denscombe 2007, pp.227-234).

When conducting research based on documents, the context within which these artefacts were created and the intended audience should be considered. Bryman (2008, p.527) offered the example of an organisation’s meeting minutes, which may have been crafted to exclude certain discussions because they could be accessed by members of the public. Background information to meeting minutes might also be available internally, thus connecting them to wider internal events. Researchers may have to probe into this broader context to interpret such documents fully.

3.6 Data analysis

In quantitative data analysis facts expressed in numerical form are used to test hypotheses (Neuman 2007, p.329). Raw data are processed by software and charts or graphs representing these data produced. Summaries of the data are explained and given meaning by the researcher (Merriam 1998, p.178; Neuman 2007, p.248). Qualitative data consists of words, photographs and other materials which require a different treatment for analysis. Researchers begin data analysis early in their research by looking for patterns and relationships in the data (Neuman 2007, p.329). Data analysis is achieved through a series of steps which involve preparing, coding, identifying themes and presentation (Creswell 2007, p.148). These activities are broken down into six stages: data managing, reading/memoing, describing, classifying, interpreting, and representing/visualising. The following activities are carried out during the process of collating and comparing these data:

• data managing: creating and organising files for the data;

• reading/memoing: reading, note taking in the margins and initial coding;

• describing, classifying and interpreting: describing the data and its context; analysing to identify themes and patterns; making sense of the data and bringing meaning to its interpretation; and

• representing/visualising: findings are presented by narration and visual representations (models, tables, figures or sketches) (Creswell 2007, pp.156-157).

Data analysis is designed to aid the understanding of an event; therefore, core elements of complex events are identified. Data are studied for themes, common issues, words or phrases. These are coded (tagged) into broad categories to develop an understanding of a phenomenon. Codes are not fixed; they change and develop as the research progresses. Thus, initial coding is descriptive and applied to broad chunks of text (open coding). Relationships between codes aids identification of key (axial) components and this leads on to a more focused effort on the core codes (selective coding) which are essential in explaining phenomena (Denscombe 2007, pp.97-98).
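The descriptive (open) coding stage described above can be illustrated with a small sketch: text segments are tagged with codes, and code frequencies point to candidate themes. The segments and codes here are invented examples, not data from any study:

```python
# Open coding sketch: each text segment carries one or more codes;
# counting code frequencies suggests which themes recur.
from collections import Counter

coded_segments = [
    ("Staff said the new system saved them time", ["efficiency"]),
    ("Some users could not find the search box", ["usability", "navigation"]),
    ("Training sessions were felt to be too short", ["training"]),
    ("Menus were confusing on mobile devices", ["usability", "navigation"]),
]

code_counts = Counter(code for _, codes in coded_segments for code in codes)
print(code_counts.most_common(2))
# → [('usability', 2), ('navigation', 2)]
```

In practice codes are revised as analysis progresses (the axial and selective stages above); the frequency count is only a starting point for identifying relationships between codes.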

This approach is mirrored in the analysis of case study research data. Where data are interpreted and analysed for patterns in order to gain an understanding of the case and surrounding influences and conditions. The researcher questions the data, reading it over and again; taking the time to reflect on the data, their assumptions and analysis. In this way meaning and significance can be better understood and through coding and triangulation the process is enhanced (Stake 1995, pp.78-79).

Stake (1995, p.108) noted that “All researchers recognize the need not only for being accurate in measuring things but logical in interpreting the meaning of those measurements.” The protocol by which this validation is achieved is triangulation. There are four methods of triangulation:

1. data source triangulation: identifies whether a phenomenon occurs or carries the same meaning under different circumstances;

2. investigator triangulation: is achieved by having an independent observer of proceedings, or to present research observations and discuss appropriate interpretations with colleagues;

3. theory triangulation: data are compared by researchers with different theoretical perspectives and where agreement is reached triangulation is achieved. When different meanings are derived from the data, there is an opportunity to enhance understanding of the case; and

4. methodological triangulation: findings are confirmed by following a sequence of methods. In case study the most commonly used methods are observation, interview and document review. Adopting a range of methods can confirm events but it may also uncover an alternative perspective or reading of a situation (Stake 1995, pp.112-115).

3.7 Research ethics

Research involving human subjects needs to be conducted in an ethical manner to ensure individuals are not adversely affected by the research (Fowler 2009, p.163). The standards for ethical research practice involve ensuring informed consent, data protection and privacy (Pauwels 2007). Gaining informed consent from subjects willing to be involved in a research project necessitates that the following points are explained by the researcher and understood by the participant:

• research goals are clearly stated;

• side effects or potentially detrimental factors are transparent;

• gratuities do not act as an inducement to participate in the research; and

• participants can withdraw at any time without prejudice (Pauwels 2007, p.20).

To this list Fowler (2009, p.164) added further guiding principles for research surveys involving general populations including:

• making participants aware of the name of the organisation under which the research is being conducted and providing the interviewer’s name;

• notifying subjects of any sponsoring body involved in the research;

• stipulating terms of confidentiality; and

• ensuring there are no negative consequences for non-participation.

Data protection and privacy exist to ensure that data sharing does not infringe an individual’s right to privacy. Therefore, researchers are bound to protect identity by coding data during processing and anonymising it to ensure that the connection between an individual and data stored on them are not associated in any traceable way (Pauwels 2007, pp.27-28). Care should be taken when reporting data from small categories of respondents as they might be identifiable. In addition, completed responses should not be available to individuals beyond the project team. It is a researcher’s responsibility to ensure that the completed survey instrument is destroyed, or its continued storage is secure, once the research is completed (Fowler 2009, p.166).
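The coding and anonymisation step described above can be sketched as follows: direct identifiers are replaced with participant codes, and the linking key is kept separately (and securely) from the working dataset. The names and fields are invented examples:

```python
# Pseudonymisation sketch: strip the identifier, assign a code, and keep
# the code-to-name key apart from the anonymised working data.
import itertools

responses = [
    {"name": "Alice Example", "age_band": "30-39", "answer": "agree"},
    {"name": "Bob Example", "age_band": "40-49", "answer": "disagree"},
]

def pseudonymise(rows):
    key, anonymised = {}, []
    counter = itertools.count(1)
    for row in rows:
        code = f"P{next(counter):03d}"
        key[code] = row["name"]  # linking key, to be stored securely elsewhere
        cleaned = {k: v for k, v in row.items() if k != "name"}
        cleaned["id"] = code
        anonymised.append(cleaned)
    return anonymised, key

anon, key = pseudonymise(responses)
print(anon[0])  # → {'age_band': '30-39', 'answer': 'agree', 'id': 'P001'}
```

Note the caution above about small categories: even without names, a rare combination of attributes (such as a sparsely populated age band) can still identify a respondent, so category sizes should be checked before reporting.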

Benefits to participating in research are usually altruistic and inducements should not be excessive so that the principle of voluntary participation is upheld. Researchers should not overstate any benefits and any promises made should be met (Fowler 2009, p.167).


A Systematic Literature Review and Future Perspectives for Handling Big Data Analytics in COVID-19 Diagnosis

  • Published: 16 March 2023
  • Volume 41, pages 243–280 (2023)


  • Nagamani Tenali and Gatram Rama Mohan Babu


In today’s digital world, information is growing along with the expansion of Internet usage worldwide. As a consequence, a large volume of data, known as “big data”, is generated constantly. Big data analytics is one of the fastest-evolving technologies of the twenty-first century: a promising field for extracting knowledge from very large datasets while enhancing benefits and lowering costs. Owing to the enormous success of big data analytics, the healthcare sector is increasingly adopting these approaches to diagnose diseases. With the recent boom in medical big data and the development of computational methods, researchers and practitioners have gained the ability to mine and visualize medical big data on a larger scale. Integrating big data analytics into the healthcare sector thus makes precise medical data analysis feasible, enabling early disease detection, health status monitoring, patient treatment, and community services. Against this background, this comprehensive review considers the deadly COVID-19 disease with the intention of offering remedies that utilize big data analytics. Big data applications are vital to managing pandemic conditions, such as predicting outbreaks of COVID-19 and identifying cases and patterns of its spread. Research on leveraging big data analytics to forecast COVID-19 is ongoing, but precise and early identification of the disease is still hampered by the volume of medical records and dissimilar medical imaging modalities. Meanwhile, digital imaging has become essential to COVID-19 diagnosis, but the main challenge is the storage of massive volumes of data. Taking these limitations into account, this systematic literature review (SLR) presents a comprehensive analysis to provide a deeper understanding of big data in the field of COVID-19.


1 Introduction

Due to advances in information technology, modern society utilizes an enormous amount of data, making big data analytics an essential data management tool in every sector. In the twenty-first century, big data is ingrained in every aspect of contemporary life [ 2 ]. A key goal of big data technology is to predict future trends using observed patterns [ 8 ]. In order to produce large datasets, a considerable amount of data must be collected and advanced monitoring and analysis tools must be used. Conventional data processing systems have difficulty storing, managing, or analysing big data collections [ 1 ]. To address these issues, big data [ 3 ] has been characterized by dimensions such as veracity, velocity, variety, volume, validity, variability, venue, vocabulary, vagueness, and value [ 4 ]. Big data analytics is the methodology of analysing massive volumes of data to uncover pertinent information using cutting-edge techniques [ 11 ]. Massive amounts of data are now widely obtainable, creating countless new opportunities for heterogeneous datasets, facilitating better medical decisions, and enhancing efficiency in the healthcare system [ 12 ].

Academics may examine this vast amount of data using healthcare data analytics, spot patterns and trends in the data, and propose solutions to improve healthcare, thereby reducing costs, modernizing access to healthcare, and saving irreplaceable human lives [ 13 ]. Big data technologies serve several aspects of healthcare, including patient monitoring, clinical decision support, and healthcare management [ 14 ]. Through the discovery of data correlations and the understanding of patterns and trends, big data techniques offer a great ability to improve healthcare [ 7 ] and to reduce the healthcare system's financial burden [ 15 ]. Clinicians, epidemiologists, and health experts have great potential to use big data to make choices based on the best evidence available, enhancing patient care [ 16 ]. Biomedical researchers must correctly comprehend and use big data, which is not just a current reality but also a requirement in the search for new information [ 5 ]. In order to effectively battle the COVID-19 epidemic in the current digital era, clear criteria for effective data collection and analysis on a global basis are crucial [ 17 ].

The COVID-19 outbreak has killed a huge number of people while wreaking havoc on society, the economy, and the health of the whole world. To effectively control an epidemic, it is important to understand its characteristics and behaviour [ 18 ]. This knowledge may be achieved by collecting and analysing the pertinent big data. Considering the vast quantity of COVID-19 [ 9 ] data that is currently accessible from several sources [ 19 ], it is vital to assess the roles played by big data analysis in limiting the spread of COVID-19, as well as to outline the key challenges and potential paths for COVID-19 data studies going forward. Additionally, combining big data with image processing techniques will enable improved treatments for COVID-19 [ 6 ] that are both economical and well received by patients [ 20 ]. Therefore, the analysis of COVID-19 requires a framework based on existing applications and studies to support early decision-making [ 10 ]. The contributions discussed in this review are as follows:

A systematic literature review of big data analytics in cancer disease is offered, acting as a road map for experts in the area to spot and deal with problems caused by new developments.

A comprehensive analysis of the issues and challenges posed by deep learning-based healthcare big data analytics is given, along with a look ahead.

A summary of the potential uses for deep learning methods in the context of evidence-based big data analytics is given in line with the discussion of open research challenges.

This study enhances understanding and directs academics and experts toward recognising current developments and future directions in big data analytics by utilising deep learning techniques.

Finally, the presented work provides the possible research guidelines to the upcoming research works related to COVID epidemic.

The manuscript is structured as follows: Section 2 discusses background knowledge on big data analytics and COVID-19 disease diagnosis, along with a comprehensive study of the literature. Section 3 presents the analysis, and the subsequent section explains future implications. The final Section 4 elaborates the conclusions of the study.

2 Background and Systematic Literature Review

The idea of “Big Data” has emerged due to the abundance of data usage in everyday life. The complexity of big data has a noteworthy impact on how well typical warehouses can collect, manage, monitor, and analyse it. Later on, big data, as defined by IBM’s big data analytics division, is a phrase used to describe datasets that are larger than those found in conventional databases. The dataset is created at enormous sizes and has significant variety, velocity, and volume. Big data aids analysts, researchers, and businesses in the decision-making process by applying a range of methodologies, including statistics, predictive analytics, machine learning, data mining, deep learning analytics, and text analytics. It is possible to use big data analytics in a wide variety of ways like Managing financial crises, medical research, Education, Banking, Natural language processing, and Data administration. Owing to the compensations of big data analytics, academia currently applies big data analytics specially to address the intractable issue like COVID disease identification. The comprehensive review of big data related to healthcare domain, big data characteristics, and issues associated with processing contemporary data are covered in the parts that follow (Fig.  1 ).

Figure 1: Volume of data transferred from 2010 to 2025, per IDC

2.1 An Overview of Big Data: Definitions and Characteristics

Over the past few decades, data has grown at an unforeseen rate. The volume of data is predicted to expand, as illustrated in Fig. 1, from a few zettabytes in 2010 to 163 zettabytes in 2025, according to a survey from the International Data Corporation (IDC). As a result, the capacity of data storage has expanded from megabytes to exabytes, and it is anticipated that it will approach zettabytes annually in the coming years. Previously, relational databases with rows, columns, and ordered formats were used to store and display data, which was often obtained from internal operations. Many scholars now claim that the overwhelming majority of this data is unstructured, and handling it requires the use of non-relational (NoSQL) databases [ 37 , 39 ]. This data may be categorized as machine-to-machine data, human-generated web data, social media data, biometric data, and transaction data.

Figure 2: Characteristics of big data

Because there are so many uses, academics have added additional elements to the big data characteristics: the initial 3Vs concept has been extended to 4Vs, 5Vs, 6Vs, 7Vs, and even 10Vs. Figure 2 lists the most often employed Vs.

This part discusses the elements that served as the foundation for this article's survey. To conduct the survey methodically and delimit its subject, inclusion and exclusion criteria were developed to indicate which studies would be considered and which would not. A thorough search was conducted using relevant keywords such as “Big Data,” “Machine learning,” “Data mining,” “disease diagnosis,” “image processing,” “Covid disease,” “deep learning,” “prediction,” and “detection” in popular scientific bibliographic databases, including PubMed, Clarivate Web of Science (WoS), and Google Scholar. Applying the inclusion and exclusion criteria yielded a substantial number of studies for analysis. To determine whether a study was appropriate for inclusion, the abstracts of the retrieved publications were carefully examined. Studies related to big data analytics and its application across several domains were considered suitable for inclusion in this comprehensive study. The full texts of the selected studies were then downloaded. This systematic literature analysis does not take into account secondary reports such as short communications, non-peer-reviewed letters, editorials, and news articles. The selected works were published between 2015 and 2022, either in peer-reviewed journals or at renowned, highly cited conferences in the field.

2.2 Class Imbalance Problem Associated with Big Data

Databases are expanding in size and complexity at a rate never previously witnessed, and the resulting ever-larger datasets are what define Big Data. The number of samples available for the various classes is often not balanced, which is a typical difficulty for classification, especially with Big Data. The resulting bias in favour of the majority class, and neglect of the minority one, led to imbalanced classification being established as a research problem decades ago. Although there are more imbalanced-classification algorithms than ever before, they are still primarily concerned with small datasets rather than the new reality of big data. Johnson et al. [ 34 ] addressed the class imbalance problem in training predictive models: to assess data sampling techniques for high class imbalance, the authors examined three datasets of varied complexity using deep-neural-network-based big data analytics. Although this issue has been extensively researched, the big data era is bringing more extreme degrees of imbalance that are becoming harder to handle. In 2021, Juez et al. [ 35 ] carried out an experimental analysis of ensemble classifiers and resampling models on imbalanced Big Data. The entire implementation runs on Spark clusters and is compared with a Bayesian model in terms of classifier performance and time consumed. By combining a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) in an attack detection system, Al et al. [ 36 ] created a Hybrid Deep Learning (HDL) technique. To lessen the impact of data imbalance, Synthetic Minority Oversampling Technique (SMOTE) sampling combined with Tomek-Links was employed. The implementation used PySpark, which provides Python support on the Apache Spark platform, in the Google Colab environment. The CIDDS-001 dataset served as the basis for the model's multiclass assessment, while the UNSW-NB15 dataset served as the basis for its binary classification assessment. An automated method for detecting financial fraud in credit card transactions was demonstrated by Gupta et al. [ 38 ]: to predict fraud, the authors applied machine learning models such as random forests, support vector machines, logistic regression, and naive Bayes to big data.
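The core of the SMOTE technique referenced above can be illustrated with a minimal sketch in plain Python (toy 2-D points; all names and data here are invented for illustration, not taken from the cited papers): each synthetic minority sample is an interpolation between a minority point and one of its k nearest minority-class neighbours.

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)

def smote_oversample(minority, n_new, k=3, seed=0):
    """Minimal SMOTE sketch: create n_new synthetic minority samples by
    interpolating between a randomly chosen minority point and one of
    its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself)
        neighbours = sorted((p for p in minority if p != x),
                            key=lambda p: dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

# toy minority class in 2-D feature space
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_oversample(minority, n_new=4)
print(len(new_points))  # 4 synthetic samples inside the minority region
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the minority region rather than being arbitrary noise; Tomek-Links cleaning (not shown) would then remove borderline majority samples.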

A heterogeneous bagging-based stacked ensemble model was demonstrated by Sobanadevi et al. [ 40 ] for fraud detection in credit card transactions. The approach combines a bagging strategy, which handles heterogeneous base learners and the data imbalance problem, with decision-rule extraction methods on complicated data. Even though banks continually work to increase security and prevent fraud, criminals are sometimes able to get past these measures. Johnson et al. [ 41 ] developed threshold strategies for deep neural networks trained on imbalanced datasets, based on the Bayesian a posteriori probability. The optimal-threshold method uses training or validation data to discover the classification threshold that maximizes the geometric mean, while the prior-threshold technique uses the prior probability of the positive class as the classification threshold, eliminating the need for optimization. To account for random error, several designs were investigated and all tests were performed 30 times. The optimal threshold and the positive-class prior exhibit a substantial correlation, according to linear models and visualizations. An adaptive synthesis method relying on big data analytics was developed by Javaid et al. [ 42 ] for electricity theft detection (ETD). The authors built a deep Siamese network (DSN) combining LSTM and CNN to distinguish fraudulent from genuine customers. In particular, the CNN component mines parameters from weekly energy usage patterns, whereas the LSTM component handles the sequential knowledge. The DSN then makes the final decision after considering the shared characteristics offered by the CNN-LSTM.
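The two thresholding strategies described by Johnson et al. [ 41 ] are simple enough to sketch: use the positive-class prior directly as the decision threshold, or search a validation set for the threshold that maximizes the geometric mean of sensitivity and specificity. A toy illustration (invented scores and labels, not data from the paper):

```python
from math import sqrt

def gmean_threshold(scores, labels, candidates):
    """Pick the decision threshold that maximizes the geometric mean of
    sensitivity (TPR) and specificity (TNR) on a validation set."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_g = None, -1.0
    for t in candidates:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        g = sqrt((tp / pos) * (tn / neg))
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g

# toy validation scores for an imbalanced set (1 = minority/positive class)
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05, 0.05]

prior_threshold = sum(labels) / len(labels)   # prior-threshold: no tuning
opt_t, g = gmean_threshold(scores, labels, [0.1, 0.3, 0.5, 0.7])
print(prior_threshold, opt_t)
```

The prior threshold (0.2 here) needs no optimization, while the geometric-mean search trades a validation pass for a threshold tuned to the observed score distribution.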

A data source from the State Grid Corporation of China (SGCC), containing power-usage information over 1035 days for two classes, regular and fraudulent, was used by Arif et al. [ 43 ]. This study proposes a methodology for detecting electricity theft that involves four steps: smoothing, data balancing, feature extraction, and classification. Missing values are replaced during the smoothing process, after which a resampling approach is executed. To classify normal and theft data, a residual network extracts the relevant features, which are then fed to the learning algorithm presented. Arif et al. [ 44 ] created the Tomek Link Borderline Synthetic Minority Oversampling Technique with Support Vector Machine (TBSSVM) and a Temporal CNN with improved Multi-Layer Perceptron (TCN-EMLP) for energy theft detection. The former evenly distributes the test dataset's occurrences of the majority and minority classes; the latter distinguishes between genuine and deceptive consumers. Furthermore, the assignment of various weights causes deep learning models to show large volatility in their final outcomes, so an average ensemble method was used to lower the variance on the Pakistan Residential Electricity Consumption (PRECON) and State Grid Corporation of China (SGCC) datasets. Hou et al. [ 45 ] developed a unified model to lessen the data imbalance issue. The authors suggested a two-stage training method using a self-attention-based time-varying forecast approach. To mine frequent patterns from time series, they utilized an encoder-decoder component with a multi-self-attention technique, and then proposed a time-varying optimization approach to improve the outcomes of specific periods and address the imbalance problem. Furthermore, to emphasize the significance of similar historical values in foreseeing outcomes, the authors suggest inverse spatial attention in place of conventional attention.

2.3 Big Data Characteristics in Sentiment Analysis

Numerous types of programs employ big data to extract important information that is used to make business choices, monitor certain behaviours, or identify potential security threats. Sentiment analysis (SA) is among the most active research disciplines that depend on big data. To perform sentiment analysis on a gathered StockTwits dataset, Sohangir et al. [ 28 ] developed a deep learning technique. In classifying opinions on stock market data, doc2vec was utilized together with LSTM and CNN. Hassib et al. [ 29 ] focused on the data imbalance problem in handling financial big data. An optimization approach was developed using data mining techniques, and the local optima problem was resolved: the pre-processing step was carried out with the LSH-SMOTE algorithm, after which grey wolf optimization was applied to a bidirectional recurrent neural network to enhance the global optimum solution. Liu et al. [ 30 ] initiated emotional analysis of comment texts in the network big data environment in 2020. The model demonstrated was created using the Continuous Bag of Words (CBOW) language model and deep learning. Feedforward neural networks were used to produce the vector representation of text; a Convolutional Neural Network (CNN), trained on the labelled training set, then captures the text's semantic characteristics. Finally, the Dropout approach was added to the Softmax classifier of the traditional CNN, which effectively prevents the model from over-fitting and improves classification performance. Zhai et al. [ 31 ] developed Multi-Attention Fusion Modeling (Multi-AFM), which combines global and local attention to produce a believable contextual representation via gating-unit control. A Hybrid Lexicon-Naive Bayesian Classifier (HL-NBC) technique for sentiment analysis was explored by Rodrigues et al. [ 32 ]. Additionally, topic categorization, which divides tweets into several groups and filters out unnecessary comments, occurred before the sentiment analysis. Lau et al. [ 33 ] created a parallel aspect-oriented sentiment analysis technique to glean customer insights from a substantial number of online product reviews. The suggested architecture implements a broad empirical assessment of a co-evolving extreme-learning-machine-supported, sentiment-enhanced sales forecasting method.
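As a rough illustration of the Bayesian sentiment classifiers surveyed above, the following is a minimal multinomial naive Bayes with Laplace smoothing on toy stock-message data (all examples invented; this is a generic sketch, not the HL-NBC method itself):

```python
from collections import Counter
from math import log

def train_nb(docs):
    """Train a minimal multinomial naive Bayes sentiment model.
    docs: list of (token_list, label) pairs."""
    class_counts = Counter(label for _, label in docs)
    word_counts = {c: Counter() for c in class_counts}
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for c, n_c in class_counts.items():
        total_words = sum(word_counts[c].values())
        lp = log(n_c / total_docs)        # class log-prior
        for w in tokens:
            # Laplace smoothing so unseen words do not zero out a class
            lp += log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

docs = [("great stock buy".split(), "pos"),
        ("profits up strong buy".split(), "pos"),
        ("terrible loss sell".split(), "neg"),
        ("weak sell avoid".split(), "neg")]
model = train_nb(docs)
print(predict_nb(model, "strong buy".split()))
```

Lexicon-hybrid variants like HL-NBC additionally consult a sentiment dictionary before or alongside the Bayesian scoring.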

2.4 Big Data Analytics in Handling Financial Crisis

Credit rating is one of the most severe risks financial institutions must deal with; with today's diverse datasets, however, typical predictive algorithms and classifiers may not be fair or reliable enough to assess credit risk, owing to the enormously growing volume of financial information from various big data sources. Big data mining has been studied by Zhou et al. [ 21 ] in the financial and banking industries for risk management and prevention. Particle Swarm Optimization (PSO) was used to enhance a Backpropagation (BP) neural network in the presented model in order to categorize financial risks. On a dataset of on-balance-sheet and off-balance-sheet items, Hadoop HDFS and Apache Spark are combined, and the entire system operates in a nonlinear parallel optimization approach. A data-driven operation framework was introduced by Zhang et al. [ 22 ], which combines a time-series data-streaming big data infrastructure with the extraction of relationships among complicated significant components. This integrated system, named the QuantCloud platform, was implemented on New York Stock Exchange (NYSE) data. In essence, QuantCloud manages massive volumes of diverse market information in a data-parallel manner while carrying out extensive event processing in a data-driven manner. Data cleaning took place, and data modelling was carried out using the autoregressive-moving-average (ARMA) method. The research framework for a system providing regional financial data services in a smart financial context was designed by Wensheng et al. [ 23 ]. From three perspectives (functional scope, service model, and operating principle), it explores the underlying conditions for building a rural financial information service platform in China and draws conclusions from the current state of the platform's development and deployment there. To apply the graph embedding technique, an intelligent and distributed Big Data system was developed by Zhou et al. [ 24 ] for identifying financial fraud on the Internet. Node2Vec was used to capture and express the spatial information of the commercial graph structure in low-dimensional sparse matrices, enabling effective identification and prediction over data samples of the large-scale dataset with a deep neural network.

To handle big datasets in parallel, the method was distributed and executed on Apache Spark GraphX and Hadoop clusters. For Big Data analytics, Dos et al. [ 25 ] presented a data distribution strategy made up of hybridized Cloud Computing and Volunteer Computing infrastructures. A model for resource allocation in hybrid infrastructures and services was developed, as well as an HR Alloc method for choosing where to place data in big data applications. In order to study three underlying patterns, namely investment pedigrees, investment firms, and structural gaps, Yang et al. [ 26 ] presented rapid networking methodologies for financial big data. They first described a pedigree classification system to recognize financial pedigrees, which draws inspiration from disjoint sets and path compression. Second, by offering a pruning approach and a data format known as the “2-tuple list,” the researchers established a linear-time structure mining algorithm in the network (SMAN) for analysing investment businesses and hierarchical gaps in the investment pedigree. Ruan et al. [ 27 ] developed three fuzzy correlation measurement techniques: the centroid-based measure, the integral-based measure, and the cut-based ratio. Utilizing data from the Stock Price Index SSI and exchange rates of foreign currencies against the Chinese Yuan from 22 January 2013 to 17 May 2018, the authors evaluated the performance of their algorithms. More significantly, the SSI and these significant exchange rates were found to be directly correlated. The use of maximum, minimum, or closing values in daily exchange rates and stock prices affects the significant Granger causality of exchange rates over the SSI, but there was no opposite causality from the SSI to exchange rates. While the Euro against the Chinese Yuan was negatively associated with the SSI and was documented as a Granger source to the SSI at a significance level of 1 percent, the Hong Kong Dollar and the U.S. Dollar against the Chinese Yuan are positively correlated with the SSI. A comparison of a wide range of applications related to the big data paradigm is illustrated in Table 1 .

2.5 Data Management Using Spark for Big Data Streaming

Relying on the Apache Spark open-source Big Data processing platform, installed in the cloud and focused on applying machine learning models to streaming Big Data, a real-time remote health status prediction system was created by Nair et al. [ 68 ]. When someone tweets about their health, the system instantaneously receives the tweet, extracts and analyzes its features to estimate the user's health state, and then immediately contacts the user to take the necessary action. Kılınç et al. [ 70 ] offered an automated approach for fake news detection. The machine learning library of Apache Spark employed the Naive Bayes technique for training and testing both sentiment classification and fraudulent account detection systems. Ramírez et al. [ 71 ] developed an incremental model based on the nearest-neighbour algorithm for demanding scenarios. The presented distributed classifier was executed on the Apache Spark platform to stream big data. The authors also provided a method of progressive instance selection for large data sources that continually updates and deletes out-of-date instances from the instance set. This lessens the original classifier's high processing demands, making it appropriate for the situation under consideration. Park et al. [ 72 ] designed big data analytics in the Spark framework, where streaming was performed on a memory-based cluster-computing standard. Spark is offered by the open-source ASF (Apache Software Foundation) community.

Carcillo et al. [ 73 ] developed a Scalable Real-time Fraud Finder (SCARFF) using a substantial dataset of real credit card transactions. SCARFF integrates Big Data technologies (Kafka, Spark, and Cassandra) with a machine learning technique that handles class imbalance and feedback delay. Rathore et al. [ 74 ] proposed a real-time Big Data stream computing method by implementing Hadoop MapReduce-like models on graphics processing units (GPUs). Spark with GPU was paired with a parallel and distributed Hadoop ecosystem to increase the system's power and its ability to handle enormous volumes of high-speed streaming data. By breaking large Big Data records into fixed-size blocks, a MapReduce-like method for GPUs was created to calculate statistical parameters. A Spark Streaming-based system for monitoring online Internet traffic was suggested by Zhou et al. [ 75 ]. The collector, message system, and stream processor are the three components that make up the system. TCP performance monitoring was chosen by the authors as a use case to demonstrate how their suggested method might be applied to network monitoring. A distributed technique called SWEclat was suggested by Xiao et al. [ 76 ] for mining frequent item sets across enormous streaming data. The program stores the dataset in a vertical data structure and processes streaming data using a sliding window. The method was created on Apache Spark, storing streaming data and datasets in vertical data format using Spark RDDs and splitting these RDDs for distributed processing.
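The sliding-window idea behind stream miners like SWEclat can be sketched without Spark: keep only the last N transactions and report items whose support within the window meets a threshold. The following is a simplified single-item version of frequent-pattern counting (invented data; the real algorithm mines full itemsets over distributed RDDs):

```python
from collections import Counter, deque

class SlidingWindowCounter:
    """Sketch of sliding-window frequent-item mining over a stream:
    retain the last `window` transactions and report items whose
    support in the window reaches `min_support`."""
    def __init__(self, window, min_support):
        self.window, self.min_support = window, min_support
        self.buffer = deque()
        self.counts = Counter()

    def add(self, transaction):
        self.buffer.append(transaction)
        self.counts.update(transaction)
        if len(self.buffer) > self.window:
            # evict the oldest transaction so counts reflect the window only
            self.counts.subtract(self.buffer.popleft())

    def frequent(self):
        return {item for item, c in self.counts.items()
                if c >= self.min_support}

sw = SlidingWindowCounter(window=3, min_support=2)
for t in [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"c", "d"}]:
    sw.add(t)
print(sorted(sw.frequent()))
```

After the fourth transaction the first one has been evicted, so only items frequent in the last three transactions are reported; this incremental evict-and-update step is what keeps per-event processing cost constant on a stream.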

2.6 MapReduce-Based Methods in Different Streams of the Big Data Environment

A weighted distributed long short-term memory model called WND-LSTM was published by Xia et al. [ 46 ], relying on a MapReduce parallel computing environment. More specifically, a decentralized forecasting method for anticipating traffic flow was developed on the Hadoop distributed computing platform, using MapReduce to address the storage and computation challenges of managing massive amounts of traffic flow data with an independent learning approach. Bawankule et al. [ 47 ] introduced the Historical Data based Reduce Task Scheduling (HDRTS) method to decrease the operational skew caused by the Reduce phase of the MapReduce process. To dynamically manage intrusion detection, Asif et al. [ 48 ] demonstrated the MapReduce-Based Intelligence Mechanism for Intrusion Detection (MR-IMID). The suggested MR-IMID reliably processes large data sets using readily available technology. In-depth research on task scheduling in heterogeneous environments was provided by Wang et al. [ 49 ], who also suggest the task scheduling algorithm HTD. Pandey et al. [ 50 ] developed a deadline-aware, energy-efficient MapReduce scheduling approach for the Hadoop YARN architecture. The scheduling issue was analysed as an integer program with time-indexed binary decision variables. Then, using the knowledge that tasks have varied energy usage on various machines, a heuristic technique was developed to schedule map and reduce tasks on the heterogeneous cluster machines.
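The map/shuffle/reduce pattern these systems build on can be sketched locally in a few lines (a toy single-machine analogue with invented sensor data, not any cited paper's implementation):

```python
from collections import defaultdict
from itertools import chain

def map_reduce(records, mapper, reducer):
    """Minimal local MapReduce sketch: map each record to (key, value)
    pairs, shuffle by key, then reduce each key's value list."""
    shuffled = defaultdict(list)
    for key, value in chain.from_iterable(mapper(r) for r in records):
        shuffled[key].append(value)          # shuffle phase: group by key
    return {key: reducer(key, values) for key, values in shuffled.items()}

# toy traffic-flow style aggregation: average speed per road sensor
readings = [("s1", 60), ("s2", 45), ("s1", 70), ("s2", 55), ("s1", 50)]
mapper = lambda r: [(r[0], r[1])]            # emit (sensor_id, speed)
reducer = lambda key, speeds: sum(speeds) / len(speeds)
print(map_reduce(readings, mapper, reducer))
```

In Hadoop the same three phases run across machines, with the shuffle performed over the network and HDFS holding the inputs and outputs; the scheduling papers above are concerned with when and where the map and reduce tasks of this pattern execute.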

The study by Baruah et al. [ 51 ] focused on a deep neuro-fuzzy network technique for forecasting student performance, based on fractional competitive multi-verse optimization. The mapper and reducer stages of the MapReduce framework were developed to execute the student performance prediction approach utilizing the deep learning classifier. Narayana et al. [ 54 ] developed a classification solution using an Ant Cat Swarm Optimization-enabled Deep Recurrent Neural Network (ACSO-enabled Deep RNN) via the MapReduce framework, which integrates the Ant Lion Optimization strategy with the Cat Swarm Optimization technique. Massive data categorization and feature selection are carried out using the MapReduce framework. The features are chosen via fuzzy clustering based on black-hole entropy and Pearson correlation. In the reduce part, categorization is done with the aid of a Deep RNN trained using the custom ACSO algorithm. The Gaussian Relevance Vector MapReduce-based Annealed Glowworm Optimization Scheduling (GRVM-AGS) hybrid model, developed by Patan et al. [ 52 ], was intended to improve the distribution of big medical data files across several clinicians while needing less time and more efficient scheduling. A GRVM approach was initially established for the predictive analysis of incoming medical data.

Arunadevi et al. [ 58 ] described a Cuckoo Search Augmented MapReduce for Predictive Scheduling (CSA-MRPS) system. The foresight research applied MapReduce processes on discretized data using Multi-Objective Ranked Cuckoo Search Optimization (MRCSA). The pure spectra and perturbed/mixed spectra of each type of area are qualitatively recorded using the improved extreme learning machine (IELM) technique created by Roy et al. [ 56 ]. The Jahazpur mineralized belt has been extensively mapped using a MapReduce model that combines the IELM technique and AVIRIS-NG (Airborne Visible-Infrared Imaging Spectrometer-Next Generation) observations. A MapReduce (MR) model has been applied to the huge data of maritime navigation by Pham et al. [ 57 ]. Specifically, the authors use a popular clustering method, K-means, based on the MR model, to analyze marine traffic data in the South Vietnam Sea region. Ramsingh et al. [ 55 ] propose a MapReduce-based hybrid NBC-TFIDF (Naive Bayes Classifier with Term Frequency-Inverse Document Frequency) technique. Using the MapReduce-based hybrid NBC, the data are categorized according to the polarity score assigned to each sentence in the social media data. Chawla et al. [ 53 ] presented MuSe, an efficient distributed RDF storage solution for loading and querying RDF data using Hadoop MapReduce.

Using the ACO-GA method with HDFS MapReduce, Kumar et al. [ 59 ] proposed an enhanced query optimization procedure for big data (BD). The dataframe is first pre-processed by utilizing the SHA-512 method to obtain the hash value (HV) and the HDFS MapReduce function to eliminate redundant data; the query is then optimized using the ACO-GA algorithm. For MapReduce-framework-based training of deep belief networks (DBNs), Agarwal et al. [ 60 ] proposed two parallel models. For the layer-by-layer training of DBNs in both models, which followed the positive and negative phases, the authors employed several machines. Studies using the Toronto Emotional Speech Set and the Ryerson Audio-Visual Database of Emotional Speech and Song have shown that the first proposed model, the First Parallel MapReduce-Based Deep Belief Network (FParMRBDBN), is effective. The hybrid Database-MapReduce system proposed by Pang et al. [ 61 ] combines the benefits of both systems; the query optimizer AQUA+ that the authors proposed was designed specifically for this hybrid system. Maheswari et al. [ 62 ] developed a Kernelized Spectral Clustering based Conditional Maximum Entropy MapReduce (KSC-CMEMR) method for reducing dimensionality.

2.7 Big Data in Healthcare Diagnosis

Big data analytics in healthcare follows a structure that is largely the same as traditional healthcare informatics. Independent systems with installed advanced analytics were frequently utilized in healthcare initiatives. Distributed processing is used when computation needs to be implemented across a number of locations owing to “big data.”

2.7.1 Big Data Application for Heart Disease Diagnosis

Thanga et al. [ 63 ] offer a big health data implementation model based on an optimal artificial neural network (OANN) for the diagnosis of heart disease, which is regarded as among the deadliest diseases worldwide. The two main parts of the proposed OANN are the distance-based misclassified instance removal (DBMIR) approach and the teaching and learning-based optimization (TLBO) method for ANN (TLBO-ANN). The conceptual system is developed using a Big Data framework, Apache Spark, and the presented OANN model operates in an offline forecasting stage and an online forecasting stage. Ed et al. [ 64 ] established a real-time cardiac disease prediction framework based on Apache Spark, a powerful large-scale distributed computing platform that can process streaming data events with in-memory machine learning operations. The technology is made up of two basic components: streaming processing, and data storage and visualisation. The first predicts cardiac illness by applying a classification model to data events using Spark MLlib and Spark Streaming; the significant quantity of generated data is then stored in Apache Cassandra.

According to Vaishali et al. [ 65 ], a significant collection of medical records is employed as input. The required data are extracted from the records of heart disease patients in this dataset using the MapReduce approach. Heart disease [ 115 , 116 ] is a severe health problem and the leading cause of death worldwide, so early diagnosis has become very important in medical research. The RR interval, QRS interval, and QT interval are among the features examined to help diagnose heart illness. In the testing phase, after the classification algorithm determines whether the patient is normal or abnormal, the MapReduce methodology is utilized to identify the disease and reduce the dataset. According to Rastogi et al. [ 66 ], to predict a patient's risk of getting heart disease, input factors such as gender, cholesterol, blood pressure, TTH, and stress can be taken into account. Data mining (DM) techniques such as naive Bayes, decision trees, support vector machines, and logistic regression are studied using the heart disease database. Nayak et al. [ 67 ] emphasize the importance of early diagnosis of cardiac abnormalities in order to treat these illnesses successfully. Their paper employs a number of data mining classification techniques, such as decision tree, naive Bayes, support vector machine, and k-NN classification, to identify diseases and take precautions at an early stage, so they can be treated and prevented.
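The k-NN classification step used in several of the studies above can be sketched as follows (toy, invented feature values for illustration only, not clinical guidance or the cited papers' actual features):

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train, query, k=3):
    """Minimal k-NN classifier sketch over numeric feature vectors.
    train: list of (features, label); returns the majority label among
    the k nearest training points by Euclidean distance."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# hypothetical records: (cholesterol, systolic blood pressure) -> label
train = [((180, 110), "normal"), ((190, 115), "normal"),
         ((250, 150), "abnormal"), ((260, 145), "abnormal"),
         ((200, 120), "normal")]
print(knn_predict(train, (255, 148)))
```

In practice features would be normalised first (raw cholesterol and blood pressure are on different scales, so the larger-magnitude feature dominates the distance), and k chosen by cross-validation.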

2.7.2 Kidney Disease Classification with the Help of Big Data

Sisodia et al. [ 82 ] propose a healthcare multi-phase architecture (HCMP) for predicting chronic renal illness. Data collection, data storage, management, processing, analysis, and report preparation are the six layers of the HCMP architecture. The data-storage and data-management layers were implemented on a heterogeneous Hadoop cluster, and profiling approaches were applied across three scenarios to ascertain the capacity ratio of each DataNode in the cluster. The MySymptom algorithm and MapReduce, used for parallel data processing, filter the renal dataset of patients according to their symptoms at the data processing layer. Diez et al. [ 83 ] focus on kidney transplant therapy for end-stage renal illness, an area of medicine where machine learning is starting to be utilised as a supplementary resource for estimation and decision making. Abdelaziz et al. [ 84 ] evaluated CKD prediction accuracy in a cloud setting. Healthcare stakeholders in smart cities consider cloud-IoT-based prediction of hazardous illnesses like CKD a significant challenge. Chronic kidney disease (CKD) prediction is given as an example of a health service offered using cloud computing, which enables residents of smart cities to estimate the presence of CKD anytime and anywhere. By combining two intelligent techniques, namely linear regression (LR) and a neural network, this research provides a hybrid intelligent model for predicting CKD based on cloud-IoT.

2.7.3 Brain Tumour in Big Data Applications

Mansour et al. [ 85 ] introduce an artificial intelligence and big data analytics-based ICH e-diagnosis (AIBDA-ICH) model using CT images. IoMT devices are used in the model's data collection process. In the presented AIBDA-ICH technique, the damaged areas in the CT scans are located using a graph-cut-based segmentation model, while the Hadoop ecosystem and its components manage the massive amounts of data. The capsule network (CapsNet) model is employed as a feature extractor to obtain a useful collection of feature vectors, and finally a fuzzy deep neural network (FDNN) model is used for classification. The improved results of the AIBDA-ICH technique were validated through a large number of simulations, and the outcomes were examined from a variety of perspectives. Big data and image processing have been combined for the segmentation and classification of brain tumours by HS et al. [ 86 ]. Using the MATLAB Hadoop system, a big data analysis is performed on brain tumour images. The BraTS dataset is delivered to the Hadoop and MATLAB Distributed Computing Server (MDCS) system for processing, performed by one master node and four slave nodes (multinode) in an MDCS configuration. The images are decomposed into component features using the dual-tree complex Gabor wavelet transform (DTCGWT), and malignant and benign brain tumours are distinguished from each other by a CNN model using the generated feature vectors. If a malignant brain tumour is found, the image is segmented using the fuzzy level set method based on the manta ray foraging algorithm (FLSM-MRF).

2.7.4 Diabetic Disease Identification with the Help of Big Data Analytics

To assist the healthcare system in making speedy and informed judgments, Saluja et al. [ 69 ] used MapReduce. Every day, a tremendous amount of data is generated in the life sciences. Using MapReduce, the vast amount of clinical data is divided into a variety of categories, making it easy to understand for future use. The major goal is to filter different aspects of diabetes and heart disease from clinical data and build healthcare information for designing a physician-oriented health service. Subramaniyan et al. [ 77 ] used big data with large-scale computing and machine learning to provide predictive analytics for extracting inherent information. Cloud computing, a prerequisite for Big Data computing, has arisen as a service-oriented computing architecture for processing enormous amounts of rapidly expanding data at greater speed, and big data frameworks like Hadoop and Spark can be utilized in conjunction with machine learning approaches. AlZubi et al. [ 78 ] use big data and classification techniques, such as efficient MapReduce technologies, to detect diabetes. The MapReduce concept was utilized to generate small pieces of data from the initially composed large dataset. The acquired dataset is then normalised to remove any noise present. An ant-bee colony strategy, which takes advantage of ant traits such as wandering, was then used to choose the statistical attributes, and a support vector machine with a multilayer neural network was employed to train the chosen features.

In the Hadoop framework, Hatua et al. [ 79 ] developed a quick and accurate Diabetic Retinopathy (DR) detection approach that can detect the first indications of diabetes from retinal images of the eye. In the suggested method, there are five levels of diabetic retinopathy: Proliferative DR, Mild DR, Moderate DR, Severe DR, and No DR. For categorizing diabetic retinopathy images, the suggested method separates feature extraction, feature reduction, and image classification into three distinct procedures. Each diabetic retinopathy image is first represented by a Histogram of Oriented Gradients (HOG); Principal Component Analysis is then employed to reduce the dimension of the HOG features; and the method's final step is a K-Nearest Neighbours (KNN) classifier. Sivakumar et al. [ 80 ] presented the Equidistant Heuristic and Duplex Deep Neural Network (EH-DDNN) technique for the early detection and prognosis of diabetic illness. First, the feature selection method, Equidistant Heuristic Pruning (EHP), is applied using the Big Data dataset as input. EHP separately divides the incoming data matrix into rows and columns, and partitions neighbourhood assessments while reducing communication overhead and time, considerably boosting computation correlation. It does this by utilizing conditional non-alignment assessment and heuristic methodologies. The result is a smaller set of features better suited for early prediction, with redundant and unneeded attributes eliminated. A Duplex Deep Neural Network (DDNN) is then developed for early assessments by fusing the intrinsic characteristics with nonlinear processing features and a linear response. The efficiency of classification and clustering methods for diabetic medical data was evaluated by Mamatha and colleagues [ 81 ], as detailed in Table 2 .

2.8 Data Mining Approaches in Dealing with COVID Crisis

Big data presents a huge opportunity for physicians, epidemiologists, and health policy experts to make decisions based on the best available knowledge, thereby improving patient care. Big data is not just a contemporary reality for researchers and scientists; it is also a resource that must be correctly understood and used in the search for new knowledge. In today's digital environment, big data is essential to managing the COVID-19 epidemic. The generic progression of COVID detection and prediction techniques is shown in the figure below.

Figure  3 depicts the fundamental architecture of COVID diagnosis using big data analytics. In this section, we conduct a literature review to highlight the significance of various research in the area of COVID-19-based big data analysis.

Figure 3

General architecture of COVID diagnosis using big data analytics

2.8.1 COVID Diagnosis Using Machine Learning

Big data applications are essential for managing pandemic circumstances, such as predicting COVID-19 outbreaks and locating COVID-19 cases and patterns of transmission. Ramanathan et al. [ 94 ] categorized textual clinical reports with a machine learning (ML)-based document mining approach. An ensemble ML classifier technique was used to extract features via term frequency-inverse document frequency (TF-IDF), a productive means of information retrieval, from the Corona dataset. The COVID-19 data are divided into three categories based on the features extracted with this ML-based information retrieval approach. TF-IDF for coronavirus classification and prediction is used to quantify and statistically analyse the mined text data of the COVID-19 patient record list. The deep learning neural network-based method nCOVnet, developed by Panwar et al. [ 96 ], can be used to swiftly screen for COVID-19 by evaluating patients' X-rays for the visual indicators seen in the chest radiography of COVID-19 patients.
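
The TF-IDF weighting at the heart of this information-retrieval step can be sketched with the standard library alone; the tiny symptom corpus below is invented for illustration and is not the Corona dataset.

```python
import math
from collections import Counter

# Toy clinical report corpus (illustrative only).
docs = [
    "fever dry cough fatigue",
    "fever sore throat",
    "dry cough loss of smell",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf_idf(term, doc):
    """Term frequency in the document times inverse document frequency."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(term in d for d in tokenized)   # documents containing the term
    idf = math.log(n_docs / df)
    return tf * idf

# "fever" appears in 2 of 3 documents, "smell" in only 1,
# so "smell" is weighted as the more distinctive term.
print(tf_idf("fever", tokenized[0]))
print(tf_idf("smell", tokenized[2]))
```

In the surveyed work these per-term weights form the feature vectors that the ensemble classifier consumes.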

Kaur et al. [ 97 ] analysed Twitter data using the R programming language. They gathered Twitter information based on hashtag terms such as COVID-19, coronavirus, fatalities, new case, and recovered. A Hybrid Heterogeneous Support Vector Machine (H-SVM) algorithm was developed in this study to categorize sentiment scores into three groups: neutral, negative, and positive. Wang et al. [ 98 ] examine a feature selection algorithm for big data samples in order to select typical economic indicators from among the numerous economic statistics of the sports industry and to show its growth trend. A deep learning approach based on big data feature selection is proposed: a framework for big data feature selection is first developed, after which deep learning and data fusion are applied. This strategy incorporates a number of forward-looking elements as well as a particular reference point for data on the evolution of the sports industry.

2.8.2 COVID Diagnosis Using Deep Learning

Big data technologies are a vital tool in the war against COVID-19 in a range of applications, including pandemic monitoring, viral sensing and therapy, and diagnostic support. To model the increase rate in case numbers, Chew et al. [ 87 ] built a hybrid deep learning model known as ODANN that handles daily COVID-19 time-series records and massive quantities of COVID-19-related Twitter data at the same time. The model combines neural networks (NN) with an analytical framework and feature extraction techniques. Elghamrawy et al. [ 88 ] offer a comprehensive analysis of the part Deep Learning and Big Data Analytics play in disease prevention. Additionally, a model (DLBD-COV) based on big data analytics and H2O's Deep Learning is recommended for early patient detection using X-ray and CT images. The proposed diagnostic model is built on the scalable H2O machine learning framework, and convolutional neural networks (CNNs) and generative adversarial networks (GANs) are compared as classifiers. Jamshidi et al. [ 89 ] propose using artificial intelligence (AI) to combat the pathogen [ 117 ].

Several Deep Learning (DL) methods have been shown to accomplish this goal, including the Extreme Learning Machine (ELM), Long Short-Term Memory (LSTM), and Generative Adversarial Networks (GANs). The work describes an integrated bioinformatics approach in which different informational elements from a variety of structured and unstructured data sources are combined to develop platforms that are practical for academics and medical professionals to use. Oh et al. [ 90 ] recommend employing a patch-based convolutional neural network technique with a manageable number of trainable parameters for the diagnosis of COVID-19; the methodology was based on a statistical analysis of the potential imaging biomarkers identified in CXR radiographs. Using deep learning and laboratory data, Alakus et al.'s clinical prediction models [ 91 ] identify the patients who are most likely to experience a COVID-19 illness. The precision, F1-score, recall, AUC, and accuracy scores of the models were used to evaluate their predictive ability. In Awan et al.'s [ 99 ] big data framework, a Deep Transfer Learning (DTL) technique using three Convolutional Neural Network (CNN) architectures (InceptionV3, ResNet50, and VGG19) was deployed on COVID-19 chest X-ray images. The three models were evaluated on the COVID-19 and normal X-ray image classes [ 118 ].

2.8.3 Influence of COVID-19 Epidemic Situation Based on Big Data

Luo et al. [ 92 ] applied deep learning methods to assess diners' evaluations of restaurant characteristics and to locate reviews with discrepant ratings. Prasanth et al. [ 93 ] offer a method for foreseeing future occurrences of infection based on the analysis of search phrases used by people in the affected region. The study uses COVID-19 spread data from the European Centre for Disease Prevention and Control, together with Google Trends data for specific search phrases connected to the pandemic, to estimate upcoming trends in daily new cases, cumulative cases, and deaths for India, the USA, and the UK. To do this, the network parameters of a Long Short-Term Memory (LSTM) network are optimized using a hybrid Grey Wolf Optimizer (GWO)-LSTM model. Ghosh et al. [ 95 ] investigated Twitter microblog data to ascertain how COVID-19 affects people's mental health, specifically depression. Their pipeline uses recurrent neural networks (in the form of Long Short-Term Memory, or LSTM) to identify depressive tweets with a 99.42% accuracy rate. The detailed view and impact of the COVID-19 epidemic across the different streams collected since 2021 is illustrated in Table 3 .
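
The Grey Wolf Optimizer used by Prasanth et al. [ 93 ] to tune LSTM parameters can be sketched on a toy problem. Here a simple sphere function stands in for the LSTM validation loss (an assumption made purely for illustration); the three best wolves (alpha, beta, delta) pull the rest of the pack toward promising regions as the control parameter `a` decays from 2 to 0.

```python
import numpy as np

rng = np.random.default_rng(1)

def sphere(x):
    """Toy objective standing in for the LSTM validation loss."""
    return float(np.sum(x ** 2))

def gwo(objective, dim=2, wolves=10, iters=100, bounds=(-5.0, 5.0)):
    """Minimal Grey Wolf Optimizer: each wolf moves toward a blend of the
    three current leaders, with shrinking exploration as `a` decays."""
    X = rng.uniform(*bounds, size=(wolves, dim))
    for t in range(iters):
        fitness = np.array([objective(x) for x in X])
        alpha, beta, delta = X[np.argsort(fitness)[:3]]
        a = 2.0 - 2.0 * t / iters          # control parameter, 2 -> 0
        for i in range(wolves):
            new = np.zeros(dim)
            for leader in (alpha, beta, delta):
                r1, r2 = rng.random(dim), rng.random(dim)
                A, C = 2 * a * r1 - a, 2 * r2
                D = np.abs(C * leader - X[i])  # distance to the leader
                new += leader - A * D
            X[i] = np.clip(new / 3.0, *bounds)
    fitness = np.array([objective(x) for x in X])
    return X[np.argmin(fitness)]

best = gwo(sphere)
print(sphere(best))  # close to 0
```

In the hybrid GWO-LSTM model each wolf's position would encode LSTM hyperparameters rather than a point in the plane, but the update rule is the same.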

2.8.4 Impact of Imaging Modalities in COVID Diagnosis Using Big Data Analytics

Wang et al. [ 100 ] developed COVID-Net, an openly available deep convolutional neural network architecture, to recognize COVID-19 instances from chest X-rays (CXR). The authors also created COVIDx, an open-access benchmark dataset consisting of 13,975 CXR images from 13,870 patient cases, which contains the largest number of COVID-19-positive instances. Yang et al. [ 101 ] used four powerful pre-trained CNN architectures, VGG16, DenseNet121, ResNet50, and ResNet152, for the COVID-19 CT-scan binary classification task. The Fast.AI ResNet framework was used to establish the networks' ideal design, pre-processing, and training settings almost immediately, and transfer learning techniques were employed to overcome the shortage of data and cut training time. A modified VGG16 deep transfer learning architecture carried out binary and multi-class classification of the X-ray imaging tasks. Ohata et al. [ 102 ] created an automated COVID-19 infection detection system based on chest X-rays. The datasets produced for this experiment consist of 194 X-ray images of patients with coronavirus and 194 X-ray images of healthy people. The authors adopted transfer learning because not enough images of patients with COVID-19 are publicly available; they took several ImageNet-trained convolutional neural network (CNN) architectures and modified them to work as feature extractors for the X-ray images.

Chowdhury et al. [ 103 ] proposed PDCOVIDNet, a parallel-dilated convolutional neural network (CNN)-based COVID-19 identification system for chest X-ray images. The freely accessible chest X-ray collection is first pre-processed and enhanced before the suggested method is applied to categorize it. Variation of the convolution dilation rate is used as the proof of concept for deriving radiological features for COVID-19 identification with PDCOVIDNet. Farooq et al. [ 107 ] offered a precise convolutional neural network framework to differentiate COVID-19 patients from other pneumonia cases. Awasthi et al. [ 108 ] developed a compact, mobile-friendly, and successful deep learning algorithm for the detection of COVID-19 using lung ultrasound data; three distinct classes were featured in this task: COVID-19, pneumonia, and healthy. The developed network, known as Mini-COVIDNet, was compared with both lightweight and more sophisticated heavyweight neural network models. Hasan et al. [ 109 ] discuss the possibility of using convolutional neural networks to predict COVID-19 patients from CT scans; their strategy is based on a recent change to the CNN architecture (DenseNet-121). Liu et al. [ 110 ] proposed a 2D sparse matrix profile DenseNet for the diagnosis of COVID-19 from CT scans. According to Xiao et al. [ 111 ], a CNN with a parallel attention module (PAM-DenseNet) may perform well on coarse labels without manually specified infection zones. The dense connectivity structure achieves feature-map reuse by adding direct connections from all previous layers to all following layers, and it may be able to extract representative features from fewer CT slices. The impact of imaging techniques in COVID diagnosis using big data analytics is described in Table 4 .
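
The dilated convolutions that PDCOVIDNet varies can be illustrated in one dimension: spacing the kernel taps `dilation` samples apart enlarges the receptive field without adding parameters. This is a generic sketch of the operation, not the authors' implementation.

```python
import numpy as np

def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1-D convolution whose kernel taps are `dilation` samples apart."""
    span = (len(kernel) - 1) * dilation + 1   # effective receptive field
    out = []
    for start in range(len(signal) - span + 1):
        taps = signal[start : start + span : dilation]
        out.append(float(np.dot(taps, kernel)))
    return np.array(out), span

x = np.arange(10, dtype=float)
k = np.array([1.0, 1.0, 1.0])      # 3-tap kernel in both cases

out1, field1 = dilated_conv1d(x, k, dilation=1)   # receptive field 3
out2, field2 = dilated_conv1d(x, k, dilation=2)   # receptive field 5
print(field1, field2)
```

A parallel-dilated block runs several such convolutions with different dilation rates on the same input and concatenates the results, letting the network see both fine and coarse radiological patterns at once.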

3 Analysis from the State-of-the-Art

Data mining techniques were used to examine a huge number of publications on big data analytics, and 25 studies connected to a specific set of COVID-related issues were selected. Performance was assessed using four basic metrics: F-measure, specificity, accuracy, and sensitivity. Accuracy represents the ratio of correct predictions to the total number of test samples. The evaluations undertaken by previous studies of COVID diagnosis, with and without big data analytics, using data mining algorithms are graphically displayed below together with the reported performance measures.
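
The four metrics named above follow directly from confusion-matrix counts; the sketch below uses invented counts for a hypothetical COVID classifier evaluated on 100 test samples.

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, and F-measure from
    confusion-matrix counts (true/false positives and negatives)."""
    accuracy    = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)            # recall / true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    precision   = tp / (tp + fp)
    f_measure   = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, f_measure

# Illustrative counts, not taken from any of the surveyed studies.
acc, sens, spec, f1 = metrics(tp=40, fp=5, tn=45, fn=10)
print(acc, sens, spec, f1)  # 0.85 0.8 0.9 0.8421...
```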

The distribution of classification models covered in this survey is shown pictorially in the pie chart.

The analysis shows which algorithms are used most heavily in current big data analytics research. Around 114 research articles covering learning models such as DT, KNN, RF, NB, LR, ANN, DNN, CNN, and SVM were considered in the context of big data analytics. The vast majority of research methodologies used between 2015 and 2022 are based on deep learning techniques.

3.1 Challenges and Future Implications

The invention and performance analysis of innovative data mining-based algorithms for COVID detection, prediction, and classification, as well as survey and review studies, have all contributed to the field of COVID diagnosis and prediction, as can be seen in [ 87 , 88 , 89 , 90 , 91 , 92 , 93 , 94 , 95 , 96 , 97 , 98 , 99 , 100 , 101 , 102 , 103 , 104 , 105 , 106 , 107 , 108 , 109 , 110 , 111 , 112 , 113 , 114 ]. We have provided an overview of cutting-edge strategies for combating the COVID-19 pandemic in this publication. Big data applications for the COVID-19 illness, such as outbreak forecasting, tracking viral dissemination, diagnosis and treatment, and vaccine and medication discovery, have also been discussed. We have also discussed the difficulties that must be solved if big data and AI are to effectively combat the COVID-19 epidemic. Finally, we have emphasized key takeaways and suggestions for the legal and scientific communities. On the basis of this analysis, it is determined that big data analytics has a significant impact and a promising future for research in the area of COVID-19 illness diagnosis models. Big data analytics is used to extract useful information from COVID data, helping to determine the severity of the condition before it seriously affects a patient.

The novel coronavirus (COVID-19) pandemic has caused unprecedented disruption around the world, and it is clear that Artificial Intelligence (AI) will be critical in helping us to address this global challenge. AI can play a major role in identifying potential treatments for COVID-19, developing better diagnostic tests and predicting future outbreaks. It can also help healthcare organizations more effectively manage their resources by providing real-time data on patient health status and enabling them to make decisions quickly based on the latest available information.

AI technologies such as machine learning algorithms have demonstrated great promise in aiding with disease diagnosis and prognosis. For example, AI systems are being used to identify patterns from medical images that might otherwise go unnoticed by human doctors or nurses when diagnosing patients with COVID-19 symptoms; this could potentially reduce misdiagnoses while speeding up treatment times significantly. In addition, predictive analytics models powered by AI are being used to forecast future outbreaks of the virus so governments can take steps early enough to contain them before they spread too far beyond control measures already put into place.

Finally, natural language processing techniques enable automated chatbots that provide accurate answers about COVID prevention methods, as well as other useful information such as social distancing guidelines or local testing sites, without requiring any direct contact between individuals at a time when physical contact should be avoided. This gives people access to reliable advice while minimizing the further risk of infection from face-to-face interactions.

Overall, it is clear how powerful a tool Artificial Intelligence is proving to be: its use not only helps healthcare professionals diagnose illnesses faster but also helps predict future outbreaks accurately, so that we may combat them more efficiently than ever before, ultimately making our collective fight against COVID much easier going forward (Figs. 4 , 5 ).

Figure 4

Comparison of data mining technique in COVID diagnosis using big data analytics

Figure 5

Number of Published research articles taken for analysis

4 Conclusion

Our thorough analysis of the literature led us to the conclusion that most recent research either assesses the effectiveness of prevalent data mining-based COVID detection, prediction, and classification algorithms or provides succinct explanations of a small number of these tactics. To our knowledge, however, none of them provides a thorough overview and comparison of the methods now in use and the pertinent difficult problems in big data analytics. Given the difficulties in the field of COVID identification, prediction, and classification based on big data analytics, we present a complete classification and comparison of existing methodologies using essential factors. Additionally, we provided a thorough state-of-the-art analysis of the creation of global systems for COVID diagnosis and prediction. We break down the machine learning and deep learning algorithms used with big data analytics to forecast accurately when the COVID illness will manifest. The publications included in this study were acquired from a number of sources, including Wiley, IEEE Xplore, Nature, and ScienceDirect. To the best of our knowledge, imaging-based clinical diagnosis in the context of COVID diagnosis has not been directly addressed by the field of medical big data analytics. The relevant concerns are also examined, and a few unresolved problems are identified for further study in the area of big data analytics for the detection of communicable disease outbreaks. Additionally, it has been revealed that applying big data analytics in the case of the COVID pandemic can address the current issues with illness diagnosis.

Data Availability

Data sharing is not applicable to this article as no datasets were generated.

References

Ranjan, J., Foropon, C.: Big data analytics in building the competitive intelligence of organizations. Int. J. Inf. Manage. 56 , 102231 (2021)

Mohamed, A., Najafabadi, M.K., Wah, Y.B., Zaman, E.A.K., Maskat, R.: The state of the art and taxonomy of big data analytics: view from new big data framework. Artif. Intell. Rev. 53 (2), 989–1037 (2020)

Mariani, M.M., Wamba, S.F.: Exploring how consumer goods companies innovate in the digital age: the role of big data analytics companies. J. Bus. Res. 121 , 338–352 (2020)

Mikalef, P., Krogstie, J.: Examining the interplay between big data analytics and contextual factors in driving process innovation capabilities. Eur. J. Inf. Syst. 29 (3), 260–287 (2020)

Holmlund, M., Van Vaerenbergh, Y., Ciuchita, R., Ravald, A., Sarantopoulos, P., Ordenes, F.V., Zaki, M.: Customer experience management in the age of big data analytics: a strategic framework. J. Bus. Res. 116 , 356–365 (2020)

Wong, Z.S., Zhou, J., Zhang, Q.: Artificial intelligence for infectious disease big data analytics. Infect. Dis. Health 24 (1), 44–48 (2019)

Manogaran, G., Shakeel, P.M., Baskar, S., Hsu, C.H., Kadry, S.N., Sundarasekar, R., Kumar, P.M., Muthu, B.A.: FDM: fuzzy-optimized data management technique for improving big data analytics. IEEE Trans. Fuzzy Syst. 29 (1), 177–185 (2020)

Li, W., Chai, Y., Khan, F., Jan, S.R.U., Verma, S., Menon, V.G., Li, X.: A comprehensive survey on machine learning-based big data analytics for IoT-enabled smart healthcare system. Mobile Netw. Appl. 26 (1), 234–252 (2021)

Yasmin, M., Tatoglu, E., Kilic, H.S., Zaim, S., Delen, D.: Big data analytics capabilities and firm performance: an integrated MCDM approach. J. Bus. Res. 114 , 1–15 (2020)

Ghasemaghaei, M.: The role of positive and negative valence factors on the impact of bigness of data on big data analytics usage. Int. J. Inf. Manage. 50 , 395–404 (2020)

Sousa, M.J., Pesqueira, A.M., Lemos, C., Sousa, M., Rocha, Á.: Decision-making based on big data analytics for people management in healthcare organizations. J. Med. Syst. 43 (9), 1–10 (2019)

Aljumah, A.I., Nuseir, M.T., Alam, M.M.: Traditional marketing analytics, big data analytics and big data system quality and the success of new product development. Business Process Manag. J. 27 , 1108 (2021)

Peters, E., Kliestik, T., Musa, H., Durana, P.: Product decision-making information systems, real-time big data analytics, and deep learning-enabled smart process planning in sustainable industry 4.0. J. Self-Governance Manag. Econ. 8 (3), 16–22 (2020)

Mishra, S., Mishra, B.K., Tripathy, H.K. and Dutta, A.: Analysis of the role and scope of big data analytics with IoT in health care domain. In: Handbook of data science approaches for biomedical engineering, pp. 1–23. Academic Press. (2020)

Rehman, A., Naz, S. and Razzak, I.: Leveraging big data analytics in healthcare enhancement: trends, challenges and opportunities. Multimedia Syst 1–33. (2021)

Jia, Q., Guo, Y., Wang, G., Barnes, S.J.: Big data analytics in the fight against major public health incidents (Including COVID-19): a conceptual framework. Int. J. Environ. Res. Public Health 17 (17), 6161 (2020)

Ahn, P.D., Wickramasinghe, D.: Pushing the limits of accountability: big data analytics containing and controlling COVID-19 in South Korea. Account. Audit. Account. J (2021)

Awotunde, J.B., Ogundokun, R.O., Misra, S.: Cloud and IoMT-based big data analytics system during COVID-19 pandemic. In Efficient data handling for massive internet of medical things (pp. 181–201). Springer, Cham. (2021)

Verma, S. and Gazara, R.K.: Big data analytics for understanding and fighting COVID-19. In Computational intelligence methods in COVID-19: Surveillance, prevention, prediction and diagnosis (pp. 333–348). Springer, Singapore. (2021)

Shinde, P.P., Desai, V.P., Katkar, S.V., Oza, K.S., Kamat, R.K., Thakar, C.M.: Big data analytics for mask prominence in COVID pandemic. Mater. Today 51 , 2471–2475 (2022)

Zhou, H., Sun, G., Fu, S., Liu, J., Zhou, X., Zhou, J.: A big data mining approach of PSO-based BP neural network for financial risk management with IoT. IEEE Access 7 , 154035–154043 (2019)

Zhang, P., Shi, X., Khan, S.U.: QuantCloud: enabling big data complex event processing for quantitative finance through a data-driven execution. IEEE Transact. Big Data 5 (4), 564–575 (2018)

Wensheng, D.: Rural financial information service platform under smart financial environment. IEEE Access 8 , 199944–199952 (2020)

Zhou, H., Sun, G., Fu, S., Wang, L., Hu, J., Gao, Y.: Internet financial fraud detection based on a distributed big data approach with node2vec. IEEE Access 9 , 43378–43386 (2021)

Dos Anjos, J.C., Matteussi, K.J., De Souza, P.R., Grabher, G.J., Borges, G.A., Barbosa, J.L., Gonzalez, G.V., Leithardt, V.R., Geyer, C.F.: Data processing model to perform big data analytics in hybrid infrastructures. IEEE Access 8 , 170281–170294 (2020)

Yang, L., Yang, Y., Mgaya, G.B., Zhang, B., Chen, L., Liu, H.: Novel fast networking approaches mining underlying structures from investment big data. IEEE Transact Syst Man Cybern. 51 (10), 6319–6329 (2020)

Ruan, J., Jiang, H., Yuan, J., Shi, Y., Zhu, Y., Chan, F.T., Rao, W.: Fuzzy correlation measurement algorithms for big data and application to exchange rates and stock prices. IEEE Trans. Industr. Inf. 16 (2), 1296–1309 (2019)

Sohangir, S., Wang, D., Pomeranets, A., Khoshgoftaar, T.M.: Big data: deep learning for financial sentiment analysis. J. Big Data 5 (1), 1–25 (2018)

Hassib, E.M., El-Desouky, A.I., El-Kenawy, E.S.M., El-Ghamrawy, S.M.: An imbalanced big data mining framework for improving optimization algorithms performance. IEEE Access 7 , 170774–170795 (2019)

Liu, B.: Text sentiment analysis based on CBOW model and deep learning in big data environment. J. Ambient. Intell. Humaniz. Comput. 11 (2), 451–458 (2020)

Zhai, G., Yang, Y., Wang, H., Du, S.: Multi-attention fusion modeling for sentiment analysis of educational big data. Big Data Mining Anal. 3 (4), 311–319 (2020)

Rodrigues, A.P. and Chiplunkar, N.N.: A new big data approach for topic classification and sentiment analysis of Twitter data. Evolut. Intell. 1–11 (2019)

Lau, R.Y.K., Zhang, W., Xu, W.: Parallel aspect-oriented sentiment analysis for sales forecasting with big data. Prod. Oper. Manag. 27 (10), 1775–1794 (2018)

Johnson, J.M., Khoshgoftaar, T.M.: The effects of data sampling with deep learning and highly imbalanced big data. Inf. Syst. Front. 22 (5), 1113–1131 (2020)

Juez-Gil, M., Arnaiz-González, Á., Rodríguez, J.J., García-Osorio, C.: Experimental evaluation of ensemble classifiers for imbalance in Big Data. Appl. Soft Comput. 108 , 107447 (2021)

Al, S., Dener, M.: STL-HDL: A new hybrid network intrusion detection system for imbalanced dataset on big data environment. Comput. Secur. 110 , 102435 (2021)

Juez-Gil, M., Arnaiz-González, Á., Rodríguez, J.J., López-Nozal, C., García-Osorio, C.: Approx-SMOTE: fast SMOTE for big data on apache spark. Neurocomputing 464 , 432–437 (2021)

Gupta, A., Lohani, M.C., Manchanda, M.: Financial fraud detection using naive bayes algorithm in highly imbalance data set. J. Discrete Math. Sci. Cryptogr. 24 (5), 1559–1572 (2021)

Kwon, J.M., Jung, M.S., Kim, K.H., Jo, Y.Y., Shin, J.H., Cho, Y.H., Lee, Y.J., Ban, J.H., Jeon, K.H., Lee, S.Y., Park, J.: Artificial intelligence for detecting electrolyte imbalance using electrocardiography. Ann. Noninvasive Electrocardiol. 26 (3), e12839 (2021)

Sobanadevi, V. and Ravi, G.: Handling data imbalance using a heterogeneous bagging-based stacked ensemble (HBSE) for credit card fraud detection. In: Intelligence in Big Data Technologies—Beyond the Hype, pp. 517–525. Springer, Singapore. (2021)

Johnson, J.M. and Khoshgoftaar, T.M.: Thresholding strategies for deep learning with highly imbalanced big data. In: Deep Learning Applications, vol 2. Springer, Singapore, pp. 199–227 (2021)

Javaid, N., Jan, N., Javed, M.U.: An adaptive synthesis to handle imbalanced big data with deep siamese network for electricity theft detection in smart grids. J. Parallel Distributed Comput. 153 , 44–52 (2021)

Arif, A., Javaid, N., Aldegheishem, A., Alrajeh, N.: Big data analytics for identifying electricity theft using machine learning approaches in microgrids for smart communities. Concurr. Comput. 33 (17), e6316 (2021)

Arif, A., Alghamdi, T.A., Khan, Z.A., Javaid, N.: Towards efficient energy utilization using big data analytics in smart cities for electricity theft detection. Big Data Res. 27 , 100285 (2022)

Hou, C., Wu, J., Cao, B., Fan, J.: A deep-learning prediction model for imbalanced time series data forecasting. Big Data Mining and Analytics 4 (4), 266–278 (2021)

Xia, D., Zhang, M., Yan, X., Bai, Y., Zheng, Y., Li, Y., Li, H.: A distributed WND-LSTM model on MapReduce for short-term traffic flow prediction. Neural Comput. Appl. 33 (7), 2393–2410 (2021)

Bawankule, K.L., Dewang, R.K. and Singh, A.K.: Historical data based approach to mitigate stragglers from the Reduce phase of MapReduce in a heterogeneous Hadoop cluster. Cluster Comput. 1–19 (2022)

Asif, M., Abbas, S., Khan, M.A., Fatima, A., Khan, M.A., Lee, S.W.: MapReduce based intelligent model for intrusion detection using machine learning technique. J. King Saud Univ.-Comput. Inform. Sci. 34 , 9723 (2021)

Wang, X., Wang, C., Bai, M., Ma, Q., Li, G.: HTD: heterogeneous throughput-driven task scheduling algorithm in MapReduce. Distributed Parallel Databases 40 (1), 135–163 (2022)

Pandey, V., Saini, P.: A heuristic method towards deadline-aware energy-efficient mapreduce scheduling problem in Hadoop YARN. Clust. Comput. 24 (2), 683–699 (2021)

Baruah, A.J., Baruah, S.: Data augmentation and Deep Neuro-Fuzzy network for student performance prediction with MapReduce framework. Int. J. Autom. Comput. 18 (6), 981–992 (2021)

Patan, R., Kallam, S., Gandomi, A.H., Hanne, T., Ramachandran, M.: Gaussian relevance vector MapReduce-based annealed Glowworm optimization for big medical data scheduling. J. Operat. Res. Soc. 1–12. (2021)

Chawla, T., Singh, G., Pilli, E.S.: MuSe: a multi-level storage scheme for big RDF data using MapReduce. J. Big Data 8 (1), 1–26 (2021)

Narayana, S., Chandanapalli, S.B., Rao, M.S., Srinivas, K.: Ant cat swarm optimization-enabled deep recurrent neural network for big data classification based on map reduce framework. Comput. J. 65 , 3167 (2021)

Ramsingh, J., Bhuvaneswari, V.: An efficient map reduce-based hybrid NBC-TFIDF algorithm to mine the public sentiment on diabetes mellitus–a big data approach. J. King Saud University-Comput. Inform. Sci. 33 (8), 1018–1029 (2021)

Roy, S., Bhattacharya, S., Omkar, S.N.: Automated Large-Scale Mapping of the Jahazpur Mineralised Belt by a MapReduce Model with an Integrated ELM method. PFG J. Photogr. Remote Sens. Geoinform. Sci. 90 (2), 191–209 (2022)

Pham, T.A., Dang, X.K., Vo, N.S.: Optimising Maritime Big Data by K-means Clustering with Mapreduce Model. In International Conference on Industrial Networks and Intelligent Systems (pp. 136–151). Springer, Cham, (2022)

Arunadevi, N., Thulasiraaman, V.: Cuckoo search augmented mapreduce for predictive scheduling with big stream data. I. J. Sociotechnol. Knowledge Develop. 14 (1), 1–18 (2022)

Kumar, D., Jha, V.K.: An improved query optimization process in big data using ACO-GA algorithm and HDFS map reduce technique. Distributed Parallel Databases 39 (1), 79–96 (2021)

Agarwal, G. and Om, H.: Parallel training models of deep belief network using MapReduce for the classifications of emotions. Int. J. Syst. Assurance Eng. Manag. 1–16. (2021)

Pang, Z., Wu, S., Huang, H., Hong, Z., Xie, Y.: AQUA+: Query Optimization for Hybrid Database-MapReduce System. Knowl. Inf. Syst. 63 (4), 905–938 (2021)

Maheswari, K., Ramakrishnan, M.: Kernelized Spectral Clustering based Conditional MapReduce function with big data. Int. J. Comput. Appl. 43 (7), 601–611 (2021)

Thanga Selvi, R., Muthulakshmi, I.: An optimal artificial neural network based big data application for heart disease diagnosis and classification model. J. Ambient. Intell. Humaniz. Comput. 12 (6), 6129–6139 (2021)

Ed-Daoudy, A. and Maalmi, K.: Real-time machine learning for early detection of heart disease using big data approach. In 2019 international conference on wireless technologies, embedded and intelligent systems (WITS) (pp. 1–5). IEEE (2019)

Vaishali, G. and Kalaivani, V.: Big data analysis for heart disease detection system using map reduce technique. In 2016 International Conference on Computing Technologies and Intelligent Data Engineering (ICCTIDE'16) (pp. 1–6). IEEE (2016)


Author information

Authors and Affiliations

Department of CSE, Dr.Y.S. Rajasekhar Reddy University College of Engineering & Technology, Acharya Nagarjuna University, Nagarjuna Nagar, Guntur, India

Nagamani Tenali

Computer Science and Engineering (AI&ML), RVR & JC College of Engineering, Chowdavaram, Guntur, India

Gatram Rama Mohan Babu


Corresponding author

Correspondence to Nagamani Tenali .

Ethics declarations

Conflict of Interest

The authors declare that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Tenali, N., Babu, G.R.M. A Systematic Literature Review and Future Perspectives for Handling Big Data Analytics in COVID-19 Diagnosis. New Gener. Comput. 41 , 243–280 (2023). https://doi.org/10.1007/s00354-023-00211-8


Received : 05 October 2022

Accepted : 23 February 2023

Published : 16 March 2023

Issue Date : June 2023

DOI : https://doi.org/10.1007/s00354-023-00211-8


Keywords

  • Machine learning
  • Big data analytics
  • Deep learning
  • Decision making



Can J Hosp Pharm. 2015 May–Jun; 68(3)


Qualitative Research: Data Collection, Analysis, and Management

INTRODUCTION

In an earlier paper, 1 we presented an introduction to using qualitative research methods in pharmacy practice. In this article, we review some principles of the collection, analysis, and management of qualitative data to help pharmacists interested in doing research in their practice to continue their learning in this area. Qualitative research can help researchers to access the thoughts and feelings of research participants, which can enable development of an understanding of the meaning that people ascribe to their experiences. Whereas quantitative research methods can be used to determine how many people undertake particular behaviours, qualitative methods can help researchers to understand how and why such behaviours take place. Within the context of pharmacy practice research, qualitative approaches have been used to examine a diverse array of topics, including the perceptions of key stakeholders regarding prescribing by pharmacists and the postgraduation employment experiences of young pharmacists (see “Further Reading” section at the end of this article).

In the previous paper, 1 we outlined 3 commonly used methodologies: ethnography 2 , grounded theory 3 , and phenomenology. 4 Briefly, ethnography involves researchers using direct observation to study participants in their “real life” environment, sometimes over extended periods. Grounded theory and its later modified versions (e.g., Strauss and Corbin 5 ) use face-to-face interviews and interactions such as focus groups to explore a particular research phenomenon and may help in clarifying a less-well-understood problem, situation, or context. Phenomenology shares some features with grounded theory (such as an exploration of participants’ behaviour) and uses similar techniques to collect data, but it focuses on understanding how human beings experience their world. It gives researchers the opportunity to put themselves in another person’s shoes and to understand the subjective experiences of participants. 6 Some researchers use qualitative methodologies but adopt a different standpoint, and an example of this appears in the work of Thurston and others, 7 discussed later in this paper.

Qualitative work requires reflection on the part of researchers, both before and during the research process, as a way of providing context and understanding for readers. When being reflexive, researchers should not try to simply ignore or avoid their own biases (as this would likely be impossible); instead, reflexivity requires researchers to reflect upon and clearly articulate their position and subjectivities (world view, perspectives, biases), so that readers can better understand the filters through which questions were asked, data were gathered and analyzed, and findings were reported. From this perspective, bias and subjectivity are not inherently negative but they are unavoidable; as a result, it is best that they be articulated up-front in a manner that is clear and coherent for readers.

THE PARTICIPANT’S VIEWPOINT

What qualitative study seeks to convey is why people have thoughts and feelings that might affect the way they behave. Such study may occur in any number of contexts, but here, we focus on pharmacy practice and the way people behave with regard to medicines use (e.g., to understand patients’ reasons for nonadherence with medication therapy or to explore physicians’ resistance to pharmacists’ clinical suggestions). As we suggested in our earlier article, 1 an important point about qualitative research is that there is no attempt to generalize the findings to a wider population. Qualitative research is used to gain insights into people’s feelings and thoughts, which may provide the basis for a future stand-alone qualitative study or may help researchers to map out survey instruments for use in a quantitative study. It is also possible to use different types of research in the same study, an approach known as “mixed methods” research, and further reading on this topic may be found at the end of this paper.

The role of the researcher in qualitative research is to attempt to access the thoughts and feelings of study participants. This is not an easy task, as it involves asking people to talk about things that may be very personal to them. Sometimes the experiences being explored are fresh in the participant’s mind, whereas on other occasions reliving past experiences may be difficult. However the data are being collected, a primary responsibility of the researcher is to safeguard participants and their data. Mechanisms for such safeguarding must be clearly articulated to participants and must be approved by a relevant research ethics review board before the research begins. Researchers and practitioners new to qualitative research should seek advice from an experienced qualitative researcher before embarking on their project.

DATA COLLECTION

Whatever philosophical standpoint the researcher is taking and whatever the data collection method (e.g., focus group, one-to-one interviews), the process will involve the generation of large amounts of data. In addition to the variety of study methodologies available, there are also different ways of making a record of what is said and done during an interview or focus group, such as taking handwritten notes or video-recording. If the researcher is audio- or video-recording data collection, then the recordings must be transcribed verbatim before data analysis can begin. As a rough guide, it can take an experienced researcher/transcriber 8 hours to transcribe one 45-minute audio-recorded interview, a process that will generate 20–30 pages of written dialogue.

Many researchers will also maintain a folder of “field notes” to complement audio-taped interviews. Field notes allow the researcher to maintain and comment upon impressions, environmental contexts, behaviours, and nonverbal cues that may not be adequately captured through the audio-recording; they are typically handwritten in a small notebook at the same time the interview takes place. Field notes can provide important context to the interpretation of audio-taped data and can help remind the researcher of situational factors that may be important during data analysis. Such notes need not be formal, but they should be maintained and secured in a similar manner to audio tapes and transcripts, as they contain sensitive information and are relevant to the research. For more information about collecting qualitative data, please see the “Further Reading” section at the end of this paper.

DATA ANALYSIS AND MANAGEMENT

If, as suggested earlier, doing qualitative research is about putting oneself in another person’s shoes and seeing the world from that person’s perspective, the most important part of data analysis and management is to be true to the participants. It is their voices that the researcher is trying to hear, so that they can be interpreted and reported on for others to read and learn from. To illustrate this point, consider the anonymized transcript excerpt presented in Appendix 1 , which is taken from a research interview conducted by one of the authors (J.S.). We refer to this excerpt throughout the remainder of this paper to illustrate how data can be managed, analyzed, and presented.

Interpretation of Data

Interpretation of the data will depend on the theoretical standpoint taken by researchers. For example, the title of the research report by Thurston and others, 7 “Discordant indigenous and provider frames explain challenges in improving access to arthritis care: a qualitative study using constructivist grounded theory,” indicates at least 2 theoretical standpoints. The first is the culture of the indigenous population of Canada and the place of this population in society, and the second is the social constructivist theory used in the constructivist grounded theory method. With regard to the first standpoint, it can be surmised that, to have decided to conduct the research, the researchers must have felt that there was anecdotal evidence of differences in access to arthritis care for patients from indigenous and non-indigenous backgrounds. With regard to the second standpoint, it can be surmised that the researchers used social constructivist theory because it assumes that behaviour is socially constructed; in other words, people do things because of the expectations of those in their personal world or in the wider society in which they live. (Please see the “Further Reading” section for resources providing more information about social constructivist theory and reflexivity.) Thus, these 2 standpoints (and there may have been others relevant to the research of Thurston and others 7 ) will have affected the way in which these researchers interpreted the experiences of the indigenous population participants and those providing their care. Another standpoint is feminist standpoint theory which, among other things, focuses on marginalized groups in society. Such theories are helpful to researchers, as they enable us to think about things from a different perspective. Being aware of the standpoints you are taking in your own research is one of the foundations of qualitative work. Without such awareness, it is easy to slip into interpreting other people’s narratives from your own viewpoint, rather than that of the participants.

To analyze the example in Appendix 1, we will adopt a phenomenological approach because we want to understand how the participant experienced the illness and we want to try to see the experience from that person’s perspective. It is important for the researcher to reflect upon and articulate his or her starting point for such analysis; in this case, the coder could reflect upon her own experience as a female of a majority ethnocultural group who has lived within middle class and upper middle class settings. This personal history therefore forms the filter through which the data will be examined. This filter does not diminish the quality or significance of the analysis, since every researcher has his or her own filters; however, by explicitly stating and acknowledging what these filters are, the researcher makes it easier for readers to contextualize the work.

Transcribing and Checking

For the purposes of this paper it is assumed that interviews or focus groups have been audio-recorded. As mentioned above, transcribing is an arduous process, even for the most experienced transcribers, but it must be done to convert the spoken word to the written word to facilitate analysis. For anyone new to conducting qualitative research, it is beneficial to transcribe at least one interview and one focus group. It is only by doing this that researchers realize how difficult the task is, and this realization affects their expectations when asking others to transcribe. If the research project has sufficient funding, then a professional transcriber can be hired to do the work. If this is the case, then it is a good idea to sit down with the transcriber, if possible, and talk through the research and what the participants were talking about. This background knowledge for the transcriber is especially important in research in which people are using jargon or medical terms (as in pharmacy practice). Involving your transcriber in this way makes the work both easier and more rewarding, as he or she will feel part of the team. Transcription editing software is also available, but it is expensive. For example, ELAN (more formally known as EUDICO Linguistic Annotator, developed at the Max Planck Institute for Psycholinguistics) 8 is a tool that can help keep data organized by linking media and data files (particularly valuable if, for example, video-taping of interviews is complemented by transcriptions). It can also be helpful in searching complex data sets. Products such as ELAN do not actually automatically transcribe interviews or complete analyses, and they do require some time and effort to learn; nonetheless, for some research applications, it may be valuable to consider such software tools.

All audio recordings should be transcribed verbatim, regardless of how intelligible the transcript may be when it is read back. Lines of text should be numbered. Once the transcription is complete, the researcher should read it while listening to the recording and do the following: correct any spelling or other errors; anonymize the transcript so that the participant cannot be identified from anything that is said (e.g., names, places, significant events); insert notations for pauses, laughter, looks of discomfort; insert any punctuation, such as commas and full stops (periods) (see Appendix 1 for examples of inserted punctuation), and include any other contextual information that might have affected the participant (e.g., temperature or comfort of the room).
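For researchers comfortable with light scripting, the numbering and anonymization steps above can be partially automated. The sketch below is a Python illustration only, not part of the article's method: the placeholder format, function name, and sample dialogue are invented, and a real project would still require the manual checking described above.

```python
import re

def prepare_transcript(raw_text, names_to_redact):
    """Replace identifying names with neutral placeholders, then number each line."""
    text = raw_text
    for i, name in enumerate(names_to_redact, start=1):
        # Anonymize so the participant cannot be identified from the transcript.
        text = re.sub(re.escape(name), f"[Person {i}]", text)
    # Number the lines so codes and quotations can point to exact locations.
    return "\n".join(f"{n:4d}  {line}"
                     for n, line in enumerate(text.splitlines(), start=1))

raw = ("Interviewer: How did that feel?\n"
       "Smith: Nobody asked me any questions about my life. [pause]")
print(prepare_transcript(raw, ["Smith"]))
```

A script like this handles only the mechanical part; notations for pauses, laughter, and contextual information still come from careful listening.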

Dealing with the transcription of a focus group is slightly more difficult, as multiple voices are involved. One way of transcribing such data is to “tag” each voice (e.g., Voice A, Voice B). In addition, the focus group will usually have 2 facilitators, whose respective roles will help in making sense of the data. While one facilitator guides participants through the topic, the other can make notes about context and group dynamics. More information about group dynamics and focus groups can be found in resources listed in the “Further Reading” section.

Reading between the Lines

During the process outlined above, the researcher can begin to get a feel for the participant’s experience of the phenomenon in question and can start to think about things that could be pursued in subsequent interviews or focus groups (if appropriate). In this way, one participant’s narrative informs the next, and the researcher can continue to interview until nothing new is being heard or, as it says in the textbooks, “saturation is reached”. While continuing with the processes of coding and theming (described in the next 2 sections), it is important to consider not just what the person is saying but also what they are not saying. For example, is a lengthy pause an indication that the participant is finding the subject difficult, or is the person simply deciding what to say? The aim of the whole process from data collection to presentation is to tell the participants’ stories using exemplars from their own narratives, thus grounding the research findings in the participants’ lived experiences.

Smith 9 suggested a qualitative research method known as interpretative phenomenological analysis, which has 2 basic tenets: first, that it is rooted in phenomenology, attempting to understand the meaning that individuals ascribe to their lived experiences, and second, that the researcher must attempt to interpret this meaning in the context of the research. That the researcher has some knowledge and expertise in the subject of the research means that he or she can have considerable scope in interpreting the participant’s experiences. Larkin and others 10 discussed the importance of not just providing a description of what participants say. Rather, interpretative phenomenological analysis is about getting underneath what a person is saying to try to truly understand the world from his or her perspective.

Once all of the research interviews have been transcribed and checked, it is time to begin coding. Field notes compiled during an interview can be a useful complementary source of information to facilitate this process, as the gap in time between an interview, transcribing, and coding can result in memory bias regarding nonverbal or environmental context issues that may affect interpretation of data.

Coding refers to the identification of topics, issues, similarities, and differences that are revealed through the participants’ narratives and interpreted by the researcher. This process enables the researcher to begin to understand the world from each participant’s perspective. Coding can be done by hand on a hard copy of the transcript, by making notes in the margin or by highlighting and naming sections of text. More commonly, researchers use qualitative research software (e.g., NVivo, QSR International Pty Ltd; www.qsrinternational.com/products_nvivo.aspx ) to help manage their transcriptions. It is advised that researchers undertake a formal course in the use of such software or seek supervision from a researcher experienced in these tools.
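For those who prefer scripting to dedicated software, hand coding can be mimicked with a simple data structure that attaches code labels to line ranges of a numbered transcript. This Python sketch is purely illustrative: the code labels echo the discussion of Appendix 1, but the line ranges and helper functions are hypothetical, not drawn from NVivo or any real coding scheme.

```python
from collections import defaultdict

# Each code label maps to the transcript line ranges (inclusive) it covers.
codes = defaultdict(list)

def apply_code(codes, label, start_line, end_line):
    codes[label].append((start_line, end_line))

# Hypothetical codes for an excerpt like the one discussed in the text.
apply_code(codes, "diagnosis of mental health condition", 8, 11)
apply_code(codes, "health care professionals' consultation skills", 19, 19)

def lines_for(codes, label):
    """All line numbers a code covers, for retrieving quotations later."""
    return sorted({n for s, e in codes[label] for n in range(s, e + 1)})

print(lines_for(codes, "diagnosis of mental health condition"))  # → [8, 9, 10, 11]
```

Keeping codes tied to line numbers makes it straightforward to pull exact quotations when themes are assembled later.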

Returning to Appendix 1 and reading from lines 8–11, a code for this section might be “diagnosis of mental health condition”, but this would just be a description of what the participant is talking about at that point. If we read a little more deeply, we can ask ourselves how the participant might have come to feel that the doctor assumed he or she was aware of the diagnosis or indeed that they had only just been told the diagnosis. There are a number of pauses in the narrative that might suggest the participant is finding it difficult to recall that experience. Later in the text, the participant says “nobody asked me any questions about my life” (line 19). This could be coded simply as “health care professionals’ consultation skills”, but that would not reflect how the participant must have felt never to be asked anything about his or her personal life, about the participant as a human being. At the end of this excerpt, the participant just trails off, recalling that no-one showed any interest, which makes for very moving reading. For practitioners in pharmacy, it might also be pertinent to explore the participant’s experience of akathisia and why this was left untreated for 20 years.

One of the questions that arises about qualitative research relates to the reliability of the interpretation and representation of the participants’ narratives. There are no statistical tests that can be used to check reliability and validity as there are in quantitative research. However, work by Lincoln and Guba 11 suggests that there are other ways to “establish confidence in the ‘truth’ of the findings” (p. 218). They call this confidence “trustworthiness” and suggest that there are 4 criteria of trustworthiness: credibility (confidence in the “truth” of the findings), transferability (showing that the findings have applicability in other contexts), dependability (showing that the findings are consistent and could be repeated), and confirmability (the extent to which the findings of a study are shaped by the respondents and not researcher bias, motivation, or interest).

One way of establishing the “credibility” of the coding is to ask another researcher to code the same transcript and then to discuss any similarities and differences in the 2 resulting sets of codes. This simple act can result in revisions to the codes and can help to clarify and confirm the research findings.
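The comparison between two coders can also be quantified informally. As a hedged illustration (the coder data below are invented, and Lincoln and Guba do not prescribe any such statistic), this sketch computes the Jaccard overlap between two coders' code sets for the same transcript; the resulting number is only a prompt for the discussion of similarities and differences, not a substitute for it.

```python
def jaccard(codes_a, codes_b):
    """Share of codes applied by both coders, out of all codes either applied."""
    a, b = set(codes_a), set(codes_b)
    return 1.0 if not (a | b) else len(a & b) / len(a | b)

coder_1 = {"diagnosis", "consultation skills", "untreated side effects"}
coder_2 = {"diagnosis", "consultation skills", "loss of interest by staff"}
print(jaccard(coder_1, coder_2))  # 2 shared of 4 distinct codes → 0.5
```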

Theming refers to the drawing together of codes from one or more transcripts to present the findings of qualitative research in a coherent and meaningful way. For example, there may be examples across participants’ narratives of the way in which they were treated in hospital, such as “not being listened to” or “lack of interest in personal experiences” (see Appendix 1 ). These may be drawn together as a theme running through the narratives that could be named “the patient’s experience of hospital care”. The importance of going through this process is that at its conclusion, it will be possible to present the data from the interviews using quotations from the individual transcripts to illustrate the source of the researchers’ interpretations. Thus, when the findings are organized for presentation, each theme can become the heading of a section in the report or presentation. Underneath each theme will be the codes, examples from the transcripts, and the researcher’s own interpretation of what the themes mean. Implications for real life (e.g., the treatment of people with chronic mental health problems) should also be given.
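The drawing together of codes under themes can likewise be sketched as a small mapping from themes to their codes. The theme and code names below are the illustrative ones from this section; the outline function is a hypothetical convenience for generating report section headings, not part of any qualitative analysis package.

```python
# Hypothetical theme -> codes mapping drawn from the discussion above.
themes = {
    "the patient's experience of hospital care": [
        "not being listened to",
        "lack of interest in personal experiences",
    ],
}

def report_outline(themes):
    """Turn the theme -> codes mapping into headings for the findings section."""
    lines = []
    for theme, code_list in themes.items():
        lines.append(f"Theme: {theme}")
        lines.extend(f"  - {code}" for code in code_list)
    return "\n".join(lines)

print(report_outline(themes))
```

Each theme heading then becomes a section of the report, with the codes, supporting quotations, and the researcher's interpretation written underneath.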

DATA SYNTHESIS

In this final section of this paper, we describe some ways of drawing together or “synthesizing” research findings to represent, as faithfully as possible, the meaning that participants ascribe to their life experiences. This synthesis is the aim of the final stage of qualitative research. For most readers, the synthesis of data presented by the researcher is of crucial significance—this is usually where “the story” of the participants can be distilled, summarized, and told in a manner that is both respectful to those participants and meaningful to readers. There are a number of ways in which researchers can synthesize and present their findings, but any conclusions drawn by the researchers must be supported by direct quotations from the participants. In this way, it is made clear to the reader that the themes under discussion have emerged from the participants’ interviews and not the mind of the researcher. The work of Latif and others 12 gives an example of how qualitative research findings might be presented.

Planning and Writing the Report

As has been suggested above, if researchers code and theme their material appropriately, they will naturally find the headings for sections of their report. Qualitative researchers tend to report “findings” rather than “results”, as the latter term typically implies that the data have come from a quantitative source. The final presentation of the research will usually be in the form of a report or a paper and so should follow accepted academic guidelines. In particular, the article should begin with an introduction, including a literature review and rationale for the research. There should be a section on the chosen methodology and a brief discussion about why qualitative methodology was most appropriate for the study question and why one particular methodology (e.g., interpretative phenomenological analysis rather than grounded theory) was selected to guide the research. The method itself should then be described, including ethics approval, choice of participants, mode of recruitment, and method of data collection (e.g., semistructured interviews or focus groups), followed by the research findings, which will be the main body of the report or paper. The findings should be written as if a story is being told; as such, it is not necessary to have a lengthy discussion section at the end. This is because much of the discussion will take place around the participants’ quotes, such that all that is needed to close the report or paper is a summary, limitations of the research, and the implications that the research has for practice. As stated earlier, it is not the intention of qualitative research to allow the findings to be generalized, and therefore this is not, in itself, a limitation.

Planning out the way that findings are to be presented is helpful. It is useful to insert the headings of the sections (the themes) and then make a note of the codes that exemplify the thoughts and feelings of your participants. It is generally advisable to put in the quotations that you want to use for each theme, using each quotation only once. After all this is done, the telling of the story can begin as you give your voice to the experiences of the participants, writing around their quotations. Do not be afraid to draw assumptions from the participants’ narratives, as this is necessary to give an in-depth account of the phenomena in question. Discuss these assumptions, drawing on your participants’ words to support you as you move from one code to another and from one theme to the next. Finally, as appropriate, it is possible to include examples from literature or policy documents that add support for your findings. As an exercise, you may wish to code and theme the sample excerpt in Appendix 1 and tell the participant’s story in your own way. Further reading about “doing” qualitative research can be found at the end of this paper.
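The planning step above, with theme headings, codes beneath them, and each quotation used only once, can be sketched as a small routine. This is a hypothetical illustration (the function name and data layout are ours, not the authors'), assuming themes are stored as a mapping from theme name to (code, quotation) pairs as in the earlier grouping step.

```python
def build_outline(themes):
    """Turn a {theme: [(code, quote), ...]} mapping into a report outline,
    ensuring each quotation appears only once across the whole report."""
    used_quotes = set()
    outline = []
    for theme, entries in themes.items():
        section = {"heading": theme, "codes": []}
        for code, quote in entries:
            if quote in used_quotes:
                continue  # each quotation should be used only once
            used_quotes.add(quote)
            section["codes"].append({"code": code, "quote": quote})
        outline.append(section)
    return outline
```

Given such an outline, the writing itself then happens around the quotations: each section heading is a theme, and the researcher's interpretation is written between the quoted passages.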

CONCLUSIONS

Qualitative research can help researchers to access the thoughts and feelings of research participants, which can enable development of an understanding of the meaning that people ascribe to their experiences. It can be used in pharmacy practice research to explore how patients feel about their health and their treatment. Qualitative research has been used by pharmacists to explore a variety of questions and problems (see the “Further Reading” section for examples). An understanding of these issues can help pharmacists and other health care professionals to tailor health care to match the individual needs of patients and to develop a concordant relationship. Doing qualitative research is not easy and may require a complete rethink of how research is conducted, particularly for researchers who are more familiar with quantitative approaches. There are many ways of conducting qualitative research, and this paper has covered some of the practical issues regarding data collection, analysis, and management. Further reading around the subject will be essential to truly understand this method of accessing people’s thoughts and feelings to enable researchers to tell participants’ stories.

Appendix 1. Excerpt from a sample transcript

The participant (age late 50s) had suffered from a chronic mental health illness for 30 years. The participant had become a “revolving door patient,” someone who is frequently in and out of hospital. As the participant talked about past experiences, the researcher asked:

Interviewer: What was treatment like 30 years ago?

Participant: Umm—well it was pretty much they could do what they wanted with you because I was put into the er, the er kind of system er, I was just on endless section threes.

Interviewer: Really…

Participant: But what I didn’t realize until later was that if you haven’t actually posed a threat to someone or yourself they can’t really do that but I didn’t know that. So wh-when I first went into hospital they put me on the forensic ward ’cause they said, “We don’t think you’ll stay here we think you’ll just run-run away.” So they put me then onto the acute admissions ward and – er – I can remember one of the first things I recall when I got onto that ward was sitting down with a er a Dr XXX. He had a book this thick [gestures] and on each page it was like three questions and he went through all these questions and I answered all these questions. So we’re there for I don’t maybe two hours doing all that and he asked me he said “well when did somebody tell you then that you have schizophrenia” I said “well nobody’s told me that” so he seemed very surprised but nobody had actually [pause] whe-when I first went up there under police escort erm the senior kind of consultants people I’d been to where I was staying and ermm so er [pause] I . . . the, I can remember the very first night that I was there and given this injection in this muscle here [gestures] and just having dreadful side effects the next day I woke up [pause] . . . and I suffered that akathesia I swear to you, every minute of every day for about 20 years.

Interviewer: Oh how awful.

Participant: And that side of it just makes life impossible so the care on the wards [pause] umm I don’t know it’s kind of, it’s kind of hard to put into words [pause]. Because I’m not saying they were sort of like not friendly or interested but then nobody ever seemed to want to talk about your life [pause] nobody asked me any questions about my life. The only questions that came into was they asked me if I’d be a volunteer for these student exams and things and I said “yeah” so all the questions were like “oh what jobs have you done,” er about your relationships and things and er but nobody actually sat down and had a talk and showed some interest in you as a person you were just there basically [pause] um labelled and you know there was there was [pause] but umm [pause] yeah . . .

This article is the 10th in the CJHP Research Primer Series, an initiative of the CJHP Editorial Board and the CSHP Research Committee. The planned 2-year series is intended to appeal to relatively inexperienced researchers, with the goal of building research capacity among practising pharmacists. The articles, presenting simple but rigorous guidance to encourage and support novice researchers, are being solicited from authors with appropriate expertise.

Previous articles in this series:

Bond CM. The research jigsaw: how to get started. Can J Hosp Pharm. 2014;67(1):28–30.

Tully MP. Research: articulating questions, generating hypotheses, and choosing study designs. Can J Hosp Pharm. 2014;67(1):31–4.

Loewen P. Ethical issues in pharmacy practice research: an introductory guide. Can J Hosp Pharm. 2014;67(2):133–7.

Tsuyuki RT. Designing pharmacy practice research trials. Can J Hosp Pharm. 2014;67(3):226–9.

Bresee LC. An introduction to developing surveys for pharmacy practice research. Can J Hosp Pharm. 2014;67(4):286–91.

Gamble JM. An introduction to the fundamentals of cohort and case–control studies. Can J Hosp Pharm. 2014;67(5):366–72.

Austin Z, Sutton J. Qualitative research: getting started. Can J Hosp Pharm. 2014;67(6):436–40.

Houle S. An introduction to the fundamentals of randomized controlled trials in pharmacy research. Can J Hosp Pharm. 2015;68(1):28–32.

Charrois TL. Systematic reviews: what do you need to know to get started? Can J Hosp Pharm. 2015;68(2):144–8.

Competing interests: None declared.

Further Reading

Examples of qualitative research in pharmacy practice.

  • Farrell B, Pottie K, Woodend K, Yao V, Dolovich L, Kennie N, et al. Shifts in expectations: evaluating physicians’ perceptions as pharmacists integrated into family practice. J Interprof Care. 2010;24(1):80–9.
  • Gregory P, Austin Z. Postgraduation employment experiences of new pharmacists in Ontario in 2012–2013. Can Pharm J. 2014;147(5):290–9.
  • Marks PZ, Jennings B, Farrell B, Kennie-Kaulbach N, Jorgenson D, Pearson-Sharpe J, et al. “I gained a skill and a change in attitude”: a case study describing how an online continuing professional education course for pharmacists supported achievement of its transfer to practice outcomes. Can J Univ Contin Educ. 2014;40(2):1–18.
  • Nair KM, Dolovich L, Brazil K, Raina P. It’s all about relationships: a qualitative study of health researchers’ perspectives on interdisciplinary research. BMC Health Serv Res. 2008;8:110.
  • Pojskic N, MacKeigan L, Boon H, Austin Z. Initial perceptions of key stakeholders in Ontario regarding independent prescriptive authority for pharmacists. Res Soc Adm Pharm. 2014;10(2):341–54.

Qualitative Research in General

  • Breakwell GM, Hammond S, Fife-Schaw C. Research methods in psychology. Thousand Oaks (CA): Sage Publications; 1995.
  • Given LM. 100 questions (and answers) about qualitative research. Thousand Oaks (CA): Sage Publications; 2015.
  • Miles MB, Huberman AM. Qualitative data analysis. Thousand Oaks (CA): Sage Publications; 2009.
  • Patton M. Qualitative research and evaluation methods. Thousand Oaks (CA): Sage Publications; 2002.
  • Willig C. Introducing qualitative research in psychology. Buckingham (UK): Open University Press; 2001.

Group Dynamics in Focus Groups

  • Farnsworth J, Boon B. Analysing group dynamics within the focus group. Qual Res. 2010;10(5):605–24.

Social Constructivism

  • Social constructivism. Berkeley (CA): University of California, Berkeley, Berkeley Graduate Division, Graduate Student Instruction Teaching & Resource Center; [cited 2015 June 4]. Available from: http://gsi.berkeley.edu/gsi-guide-contents/learning-theory-research/social-constructivism/

Mixed Methods

  • Creswell J. Research design: qualitative, quantitative, and mixed methods approaches. Thousand Oaks (CA): Sage Publications; 2009.

Collecting Qualitative Data

  • Arksey H, Knight P. Interviewing for social scientists: an introductory resource with examples. Thousand Oaks (CA): Sage Publications; 1999.
  • Guest G, Namey EE, Mitchell ML. Collecting qualitative data: a field manual for applied research. Thousand Oaks (CA): Sage Publications; 2013.

Constructivist Grounded Theory

  • Charmaz K. Grounded theory: objectivist and constructivist methods. In: Denzin N, Lincoln Y, editors. Handbook of qualitative research. 2nd ed. Thousand Oaks (CA): Sage Publications; 2000. pp. 509–35.
