Bioinformatics, Big Data, and Cancer


Cancer researchers mine "big data" to answer complex biological questions.

The volume of biological data collected during the course of biomedical research has exploded, thanks in large part to powerful new research technologies.

The availability of these data, and the insights they may provide into the biology of disease, has many in the research community excited about the possibility of expediting progress toward precision medicine—that is, tailoring prevention, diagnosis, and treatment based on the molecular characteristics of a patient’s disease.


NCI’s Rare Cancer Clinics Fostering Collaboration

Clinics bring together clinicians, patients, and advocates.

The return on investment from aggregating and sharing research data is particularly high for rare cancers, such as those that occur in children.

Mining the sheer volume of "big data" to answer the complex biological questions that will bring precision medicine into the mainstream of clinical care, however, remains a challenge. Nowhere is this challenge more evident than in oncology, as many of these data will come from studies of patients with cancer.

Seeking Answers from Big Data in the Era of Precision Medicine

Cancer data can be fragmented and compartmentalized, and many stakeholders are working to overcome the challenges this poses for advancing research. To accelerate progress, cancer researchers need access to curated data from across many different institutions. Establishing an infrastructure to help researchers store, analyze, integrate, access, and visualize large amounts of biological data and related information is the focus of bioinformatics.

Bioinformatics uses advanced computing, mathematics, and diverse technological platforms to store, manage, analyze, and understand these data.

Currently, researchers use many different tools and platforms to store and analyze biological data, including data from whole genome sequencing, advanced imaging studies, comprehensive analyses of the proteins in biological samples, and clinical annotations.

Integrating and analyzing data from these various platforms is often difficult, however. Researchers frequently lack access to the raw or primary data generated by other studies, or lack the computational tools and infrastructure necessary to integrate and analyze them.

In recent years, there has been a boom in the use of virtual repositories—or "data clouds"—to integrate and improve access to research data. Many of these efforts are still in their early stages, and questions remain about the optimal way to organize and coordinate clouds and their use.

NCI's Role in Cancer Bioinformatics


NCI Data Initiatives and the National Cancer Plan

As a federal agency, NCI is uniquely positioned to improve data sharing, analysis, and visualization. These efforts align with the National Cancer Plan’s goal to maximize data utility.

NCI has played a leading role in advancing the science of genomics, proteomics, imaging, and metabolomics, among other areas, to increase our understanding of the molecular basis of cancer.

The NCI Center for Biomedical Informatics and Information Technology (CBIIT) oversees the institute’s bioinformatics-related initiatives.

The National Cancer Informatics Program (NCIP) is involved in numerous research areas, including genomic, clinical, and translational studies, and is exploring how to improve data sharing, analysis, and visualization. For instance, NCIP operates NCIP Hub, a centralized resource designed to create a community space that promotes learning and the sharing of data and bioinformatics tools among cancer researchers. NCIP Hub is itself an experiment to see whether the cancer research community finds the social and community aspects of the program useful for team science and multi-investigator research teams.

The Cancer Data Science Laboratory (CDSL), in NCI's Center for Cancer Research, develops and uses computational approaches to analyze and integrate laboratory and patient data from cancer genomics and other "omics" research. These computational approaches and algorithms can then be used to address fundamental research questions about the origin, evolution, progression, and treatment of cancer.

Under The Cancer Genome Atlas (TCGA), a research program that was supported by NCI and the National Human Genome Research Institute, researchers conducted comprehensive molecular analyses of tumor and healthy tissue samples from more than 11,000 patients. More than 1,000 studies have been published based on TCGA-collected data.


Fueling Progress against Childhood Leukemia

TARGET initiative leads to clinical trials of targeted therapies.

Similarly, under NCI's Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program, researchers have identified genetic alterations in pediatric cancers, largely using samples from children enrolled in clinical trials conducted by the Children's Oncology Group.

NCI’s Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a collaborative consortium of institutions and investigators that perform genomic and proteomic analyses to better understand the molecular basis of cancer. Proteomics data generated by CPTAC research projects are made publicly available in a repository that is accessible by the global research community.

Data from these initiatives and other NCI-supported studies have helped researchers better understand the biology of different cancers and identify potential new targets for therapies.

In some respects, however, these studies have only scratched the surface of what can be learned from the vast amount of data collected as part of this research. As a result, there has been a new push in the research community to find ways to make these data, and the tools to analyze them, more widely accessible.

Democratizing Big Data for Cancer Research

As a federal agency, NCI is uniquely positioned to democratize access to cancer research data. NCI's Office of Data Sharing (ODS) coordinates data sharing policies across NCI and the cancer research community. ODS manages NCI data submissions and access to online databases, provides education and outreach for NCI data sharing policies, and examines the uptake and use of NCI data. 

NCI has launched several initiatives to provide researchers with easier access to data from TCGA, TARGET, and other NCI-funded research, and the resources to analyze the data.

The NCI Cancer Research Data Commons (CRDC) is a data science infrastructure that connects cancer research data collections with analytical tools. The CRDC can be used to store, analyze, share, and visualize many cancer research data types, including data from proteomics studies, animal models, and epidemiological cohorts. The CRDC includes the NCI Genomic Data Commons, the NCI Cloud Resources, the Data Commons Framework, and other projects.

  • The NCI Genomic Data Commons (GDC) provides a single source for data from NCI-funded initiatives and cancer research projects, as well as the analytical tools needed to mine them. The GDC includes data from TCGA, TARGET, and the Genomics Evidence Neoplasia Information Exchange (GENIE). The GDC will continue to grow as NCI and individual researchers and research teams contribute high-quality, harmonized data from their cancer research projects. The GDC provides the cancer genomics repository for projects falling under the NIH Genomic Data Sharing Policy. (A brief sketch of programmatic access to the GDC appears after this list.)

New DAVE Tools Released for Genomic Data Commons

Online, open-access resource provides broad access to genomic analysis tools.

  • NCI provides resources that use cloud technology to give researchers access to genomic and other data from NCI-funded studies. These NCI Cloud Resources are used to explore innovative methods for accessing, sharing, and analyzing molecular data. Each resource, implemented through commercial cloud providers, operates under common standards but has a distinct design and means of sharing data and analytical tools, with the goal of identifying the most effective means for using cloud technology to advance cancer research.

The Data Commons Framework provides the core components for building and expanding the CRDC, including services for securing, finding, and annotating data, as well as user workspaces for analyzing data and sharing results.

To enable integration of data from CRDC repositories, a Cancer Data Aggregator (CDA) will support search and analysis across distinct data types. The CDA will allow researchers to combine data from diverse scientific domains and perform integrated analyses that can be shared with collaborators.
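For researchers new to these resources, the GDC also exposes a public REST API. The sketch below is a minimal, illustrative example (Python, using the requests library) of listing a few open-access TCGA-BRCA transcriptome profiling files; the endpoint and field names reflect the public GDC API documentation and should be verified against the current version before use.

```python
# Illustrative sketch: query the public GDC REST API for open-access TCGA files.
# Endpoint and field names follow the GDC API documentation at the time of
# writing; check the current docs before relying on them.
import json
import requests

FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"

filters = {
    "op": "and",
    "content": [
        {"op": "in", "content": {"field": "cases.project.project_id", "value": ["TCGA-BRCA"]}},
        {"op": "in", "content": {"field": "data_category", "value": ["Transcriptome Profiling"]}},
        {"op": "in", "content": {"field": "access", "value": ["open"]}},
    ],
}

params = {
    "filters": json.dumps(filters),
    "fields": "file_id,file_name,cases.submitter_id",
    "format": "JSON",
    "size": "5",
}

response = requests.get(FILES_ENDPOINT, params=params)
response.raise_for_status()
for hit in response.json()["data"]["hits"]:
    print(hit["file_id"], hit["file_name"])
```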

Protecting Patient Privacy

An important aspect of data sharing is the ability to link data at the patient level across disparate data sources without exposing identifiable information. NCI is evaluating approaches and creating software for generating unique patient identifiers that can be used to link patient data from different sources without sharing identifiable information beyond the organizations authorized to hold such information. The software-generated identifiers will preserve the privacy of cancer patients who share their data with the cancer research community.
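As a simplified illustration of the general idea — not NCI's actual software — the sketch below derives a linkage token by applying a keyed hash to normalized patient identifiers. Only organizations holding the shared key can regenerate the same token, so records can be matched across sources without exchanging the underlying identifiers; all names and values in the example are hypothetical.

```python
# Illustrative sketch only: keyed hashing for privacy-preserving record linkage.
# A production system would add governance, key management, and safeguards
# against dictionary attacks on the hashed identifiers.
import hashlib
import hmac

def linkage_token(first_name: str, last_name: str, dob: str, secret_key: bytes) -> str:
    """Derive a stable, non-identifying token from normalized identifiers."""
    normalized = "|".join(part.strip().lower() for part in (first_name, last_name, dob))
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()

# Two sites holding the same secret key derive the same token for the same patient,
# so their records can be linked without sharing names or dates of birth.
key = b"shared-secret-held-only-by-authorized-data-brokers"
print(linkage_token("Ada", "Lovelace", "1815-12-10", key))
```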

NCI Cancer Research Data Ecosystem Infographic


Review Article | Published: 05 September 2022

Big data in basic and translational cancer research

Peng Jiang, Sanju Sinha, Kenneth Aldape, Sridhar Hannenhalli, Cenk Sahinalp & Eytan Ruppin

Nature Reviews Cancer 22, 625–639 (2022)


Subjects: Cancer epigenetics, Cancer genomics, Cancer therapy, Computational biology and bioinformatics

Historically, the primary focus of cancer research has been molecular and clinical studies of a few essential pathways and genes. Recent years have seen the rapid accumulation of large-scale cancer omics data catalysed by breakthroughs in high-throughput technologies. This fast data growth has given rise to an evolving concept of ‘big data’ in cancer, whose analysis demands large computational resources and can potentially bring novel insights into essential questions. Indeed, the combination of big data, bioinformatics and artificial intelligence has led to notable advances in our basic understanding of cancer biology and to translational advancements. Further advances will require a concerted effort among data scientists, clinicians, biologists and policymakers. Here, we review the current state of the art and future challenges for harnessing big data to advance cancer research and treatment.


Introduction

Cancer is a complex disease, and its progression involves diverse processes in the patient’s body 1. Consequently, the cancer research community generates massive amounts of molecular and phenotypic data to study cancer hallmarks as comprehensively as possible. The rapid accumulation of omics data catalysed by breakthroughs in high-throughput technologies has given rise to the notion of ‘big data’ in cancer, which we define as a dataset with two basic properties: first, it contains abundant information that can give novel insights into essential questions, and second, its analysis demands a large computer infrastructure beyond equipment available to an individual researcher — an evolving concept as computational resources evolve exponentially following Moore’s law. A model example of such big data is the dataset collected by The Cancer Genome Atlas (TCGA) 2. TCGA contains 2.5 petabytes of raw data — an amount 2,500 times greater than modern laptop storage in 2022 — and requires specialized computers for storage and analysis. Further, between its initial release in 2008 and March 2022, at least 10,242 articles and 11,054 NIH grants cited TCGA according to a PubMed search, demonstrating its transformative value as a community resource that has markedly driven cancer research forward.

Big data are not unique to the cancer field and play an essential role in many scientific disciplines, notably cosmology, weather forecasting and image recognition. However, datasets in the cancer field differ from those in other fields in several key aspects. First, the size of cancer datasets is typically markedly smaller. For example, in March 2022, the US National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database 3 — the largest genomics data repository to our knowledge — contained approximately 1.1 million samples with ‘cancer’ as a keyword. By contrast, ImageNet, the largest public repository for computer vision, contains 15 million images 4. Second, cancer research data are typically heterogeneous and may contain many dimensions measuring distinct aspects of cellular systems and biological processes. Modern multi-omics workflows may generate genome-wide mRNA expression, chromatin accessibility and protein expression data on single cells 5, together with a spatial molecular readout 6. The comparatively limited data size in each modality and the high heterogeneity among them necessitate the development of innovative computational approaches for integrating data from different dimensions and cohorts.

The subject of big data in cancer is of immense scope, and it is impossible to cover everything in one review. We therefore focus on key big-data analyses that have led to conceptual advances in our understanding of cancer biology and have impacted disease diagnosis and treatment decisions. Further, we cite reviews in the relevant sections to direct interested readers to additional resources. We acknowledge that our limited selection of topics and examples may omit important work, for which we sincerely apologize.

In this Review, we begin by describing major data sources. Next, we review and discuss data analysis approaches designed to leverage big datasets for cancer discoveries. We then introduce ongoing efforts to harness big data in clinically oriented, translational studies, the primary focus of this Review. Finally, we discuss current challenges and future steps to push forward big data use in cancer.

Common data types

There are five basic data types in cancer research: molecular omics data, perturbation phenotypic data, molecular interaction data, imaging data, and textual data. Molecular omics data describe the abundance or status of molecules in cellular systems and tissue samples. Such data are the most abundant type generated in cancer research from patient or preclinical samples, and include information on DNA mutations (genomics), chromatin or DNA states (epigenomics), protein abundance (proteomics), transcript abundance (transcriptomics) and metabolite abundance (metabolomics) (Table 1). Early studies relied on data from bulk samples, combined with well-designed computational approaches, to provide insights into cancer progression, tumour heterogeneity and tumour evolution 7, 8, 9, 10. Following the development of single-cell technologies and decreases in sequencing costs, current molecular data can be generated at multisample and single-cell levels 11, 12 and reveal tumour heterogeneity and evolution at a much higher resolution. Furthermore, genomic and transcriptomic readouts can include spatial information 13, revealing cancer clonal evolution within distinct regions and gene expression changes associated with clone-specific aberrations. Although more limited in resolution, conventional bulk analyses are still useful for analysing large patient cohorts, as the generation of single-cell and spatial data is costly and often feasible for only a few tumours per study.

Perturbation phenotypic data describe how cell phenotypes, such as cell proliferation or the abundance of marker proteins, are altered following the suppression or amplification of gene levels 14 or drug treatments 15 , 16 . Common phenotyping experiments include perturbation screens using CRISPR knockout 17 , interference or activation 18 ; RNA interference 19 ; overexpression of open reading frames 20 ; or treatment with a library of drugs 15 , 16 . As a limitation, the generation of perturbation phenotypic data from clinical samples is still challenging due to the requirement of genetically manipulable live cells.

Molecular interaction data describe the potential function of molecules through their interactions with diverse partners. Common molecular interaction data types include data on protein–DNA interactions 21, protein–RNA interactions 22, protein–protein interactions 23 and 3D chromosomal interactions 24. Similar to perturbation phenotypic data, molecular interaction datasets are typically generated using cell lines, as their generation requires a large quantity of material that often exceeds that available from clinical samples.

Clinical data such as health records 25 , histopathology images 26 and radiology images 27 , 28 can also be of considerable value. The boundary between molecular omics and image data is not absolute as both can include information of the other type, for example in datasets that contain imaging scans and information on protein expression from a tumour sample (Table  1 ).

Data repositories and analytic platforms

We provide an overview of key data resources for cancer research organized in three categories. The first category comprises resources from projects that systematically generate data (Table  2 ); for example, TCGA generated transcriptomic, proteomic, genomic and epigenomic data for more than 10,000 cancer genomes and matched normal samples, spanning 33 cancer types. The second category describes repositories presenting processed data from the aforementioned projects (Table  3 ), such as the Genomic Data Commons, which hosts TCGA data for downloading. The third category includes Web applications that systematically integrate data across diverse projects and provide interactive analysis modules (Table  4 ). For example, the TIDE framework systematically collected public data from immuno-oncology studies and provided interactive modules to study pathways and regulation mechanisms underlying tumour immune evasion and immunotherapy response 29 .

In addition to cancer-focused large-scale projects enumerated in Table  2 , many individual groups have deposited genomic datasets that are useful for cancer research in general databases such as GEO 3 and ArrayExpress 30 . Curation of these datasets could lead to new resources for cancer biology studies. For example, the PRECOG database contains 166 transcriptomic studies collected from GEO and ArrayExpress with patient survival information for querying the association between gene expression and prognostic outcome 31 .

Integrative analysis

Although data-intensive studies may generate omics data on hundreds of patients, the data scale in cancer research is still far behind that in other fields, such as computer vision. Cross-cohort aggregation and cross-modality integration can markedly enhance the robustness and depth of big data analysis (Fig.  1 ). We discuss these strategies in the following subsections.

Fig. 1 | Clinical decisions, basic research and the development of new therapies should consider two orthogonal dimensions when leveraging big-data resources: integrating data across many data modalities and integrating data from different cohorts, which may include the transfer of knowledge from pre-existing datasets.

Cross-cohort data aggregation

Integration of datasets from multiple centres or studies can achieve more robust results and potentially new findings, especially where individual datasets are noisy, incomplete or biased by certain artefacts. A landmark of cross-cohort data aggregation is the discovery of the TMPRSS2–ERG fusion and a less frequent TMPRSS2–ETV1 fusion as oncogenic drivers in prostate cancer. A compendium analysis across 132 gene-expression datasets representing 10,486 microarray experiments first identified ERG and ETV1 as highly expressed genes in six independent prostate cancer cohorts 32; further studies then identified their fusions with TMPRSS2 as the cause of ERG and ETV1 overexpression. Another example is an integrative study of tumour immune evasion across many clinical datasets, which revealed that SERPINB9 expression consistently correlates with intratumoural T cell dysfunction and resistance to immune checkpoint blockade 29. Further studies found SERPINB9 activation to be an immune checkpoint blockade resistance mechanism in cancer cells 29 and immunosuppressive cells 33.

A general approach for cross-cohort aggregation is to obtain public datasets that are related to a new research topic or have similar study designs to a new dataset. However, use of public data for a new analysis is challenging because the experimental design behind each published dataset is unique, requiring labour-intensive expert interpretation and manual standardization. A recent framework for data curation provides natural language processing and semi-automatic functions to unify datasets with heterogeneous meta-information into a format usable for algorithmic analysis 34 (Framework for Data Curation in Table  3 ).

Although data aggregation may generate robust hypotheses, batch effects caused by differences in laboratories, individual researchers' techniques, platforms or other non-biological factors may mask or reduce the strength of signals uncovered 35, and correcting for these effects is therefore a critical step in cross-cohort aggregation 36, 37. Popular batch effect correction approaches include the ComBat package, which uses empirical Bayes estimators to compute corrected data 36, and the Seurat package, which creates integrated single-cell clusters anchored on similar cells between batches 38. Despite the availability of batch correction methods, analysis of both original and corrected data is essential to draw reliable conclusions, as batch correction can introduce false discoveries 39.
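As a minimal sketch of this step, the example below applies the scanpy implementation of ComBat to an aggregated single-cell dataset, assuming an AnnData object with a 'batch' column in its metadata (the file name is a placeholder); keeping the uncorrected matrix alongside the corrected one supports the comparison recommended above.

```python
# Minimal sketch: ComBat-style batch correction with scanpy, assuming an AnnData
# object whose .obs table has a "batch" column identifying the cohort of origin.
import scanpy as sc

adata = sc.read_h5ad("aggregated_cohorts.h5ad")   # hypothetical combined dataset
adata.layers["uncorrected"] = adata.X.copy()      # retain original values for comparison

sc.pp.combat(adata, key="batch")                  # empirical Bayes correction in place

# Downstream clustering on the corrected matrix; repeating the analysis on the
# "uncorrected" layer helps flag findings that depend on the correction itself.
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.leiden(adata)
```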

Cross-modality data integration

Cross-modality integration of different data types is a promising and productive approach for maximizing the information gained from data, as the information embedded in each data type is often complementary and synergistic 40. Cross-modality data integration is exemplified by projects such as TCGA, which provides genomic, transcriptomic, epigenomic and proteomic data on the same set of tumours (Table 2). Cross-modality integration has led to many novel insights regarding factors associated with cancer progression. For example, the phosphorylation status of proteins in the EGFR signalling pathway — an indicator of EGFR signalling activity — is highly correlated with the expression of genes encoding EGFR ligands in head and neck cancers, but not with receptor expression, copy number alterations, protein levels or phosphorylation 41, suggesting that patients should be stratified to receive anti-EGFR therapies on the basis of ligand abundance instead of receptor status.

A recent example of cross-modality data integration used single-cell multi-omics technologies that allowed genome-wide transcriptomics and chromatin accessibility data to be measured together with a handful of proteins of interest 42 . The advantages of using cross-modality data were clear as during cell lineage clustering, CD8 + T cell and CD4 + T cell populations could be clearly separated in the protein data but were blended when the transcriptome was analysed 42 . Conversely, dendritic cells formed distinct clusters when assessed on the basis of transcriptomic data, whereas they mixed with other cell types when assessed on the basis of cell-surface protein levels. Chromatin accessibility measured by assay for transposase-accessible chromatin using sequencing (ATAC-seq) further revealed T cell sublineages by capturing lineage-specific regulatory regions. For each cell, the study first identified neighbouring cells through similarities in each data modality. Then, the study defined the weights of the different data modalities in the lineage classification as their accuracy for predicting molecular profiles of the target cell from the profiles of neighbouring cells. The resulting cell clustering, using the weighted distance averaged across single-cell RNA, protein and chromatin accessibility data, was then shown to improve cell lineage separation 42 .
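The weighting idea can be sketched conceptually as follows. This is not the published algorithm (which learns per-cell weights from how well each modality predicts a cell's profile from its neighbours); it simply shows how per-modality distances can be combined into one weighted distance for clustering, using random placeholder data and fixed illustrative weights.

```python
# Conceptual sketch: combine per-modality cell-cell distances into a single
# weighted distance for clustering. Data and weights are placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_cells = 100
rna = rng.normal(size=(n_cells, 50))      # e.g. PCA of the transcriptome
protein = rng.normal(size=(n_cells, 10))  # e.g. surface-protein panel
atac = rng.normal(size=(n_cells, 30))     # e.g. reduced chromatin-accessibility matrix

def pairwise_dist(x):
    """Euclidean distance between every pair of cells."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical modality weights; the published method derives per-cell weights
# from each modality's ability to predict a cell's profile from its neighbours.
weights = {"rna": 0.5, "protein": 0.3, "atac": 0.2}

combined = (weights["rna"] * pairwise_dist(rna)
            + weights["protein"] * pairwise_dist(protein)
            + weights["atac"] * pairwise_dist(atac))

# `combined` can now be fed to any distance-based clustering method.
```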

Another common type of multimodal data analysis involves integrating molecular omics data and data on physical interaction networks (typically those involving protein–protein or protein–DNA interactions) to understand how individual genes interact with each other to drive oncogenesis and metastasis 43 , 44 , 45 , 46 . For example, an integrative pan-cancer analysis of TCGA detected 407 master regulators organized into 24 modules, partly shared across cancer types, that appear to canalize heterogeneous sets of mutations 47 . In another study, an analysis of 2,583 whole-tumour genomes across 27 cancers by the Pan-Cancer Analysis of Whole Genomes Consortium revealed rare mutations in the promoters of genes with many interactions (such as TP53 , TLE4 and TCF4 ), and these mutations correlated with low downstream gene expression 45 . These examples of integrating networks and genomics data demonstrate a promising way to identify rare somatic mutations with a causal role in oncogenesis.

Knowledge transfer through data reuse

Existing data can be leveraged to make new discoveries. For example, cell-fraction deconvolution techniques can infer the composition of individual cell types in bulk-tumour transcriptomics profiles 48 . Such methods typically assemble gene expression profiles of diverse cell types from many existing datasets and perform regression or signature-enrichment analysis to deconvolve cell fractions 49 or lineage-specific expression 50 , 51 in a bulk-tumour expression profile.
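A minimal sketch of the regression flavour of deconvolution is shown below: a bulk expression profile is modelled as a non-negative mixture of reference cell-type profiles and solved with non-negative least squares. The reference matrix and bulk profile are synthetic placeholders; published methods add signature-gene selection, normalization and noise models.

```python
# Minimal sketch of regression-based cell-fraction deconvolution.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
n_genes, n_cell_types = 500, 4
reference = rng.gamma(shape=2.0, scale=1.0, size=(n_genes, n_cell_types))  # synthetic signatures

true_fractions = np.array([0.6, 0.2, 0.15, 0.05])
bulk = reference @ true_fractions + rng.normal(scale=0.05, size=n_genes)   # synthetic bulk profile

coef, _ = nnls(reference, bulk)          # non-negative least squares fit
fractions = coef / coef.sum()            # normalize to proportions
print(np.round(fractions, 3))            # approximately recovers true_fractions
```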

Other data reuse examples come from single-cell transcriptomics data analysis. As single-cell RNA sequencing (scRNA-seq) has a high number of zero counts (dropout) 52 , analyses based on a limited number of genes may lead to unreliable results 53 , and genome-wide signatures from bulk data can therefore complement such analyses. For example, the transcriptomic data atlas collected from cytokine treatments in bulk cell cultures has enabled the reliable inference of signalling activities in scRNA-seq data 34 . Further, single-cell signalling activities inferred through bulk data have been used to reveal therapeutic targets, such as FIBP , to potentiate cellular therapies in solid tumours and molecular programmes of T cells that are resilient to immunosuppression in cancer 54 . In another example, the analysis of more than 50,000 scRNA-seq profiles from 35 pancreatic adenocarcinomas and control samples revealed edge cells among non-neoplastic acinar cells, whose transcriptomes have drifted towards malignant pancreatic adenocarcinoma cells 55 ; TCGA bulk pancreatic adenocarcinoma data were then used to validate the edge-cell signatures inferred from the single-cell data.

Data reuse can assist the development of new experimental tests. For example, existing tumour whole-exome sequencing data were used to optimize a circulating tumour DNA assay by maximizing the number of alterations detected per patient, while minimizing gene and region selection size 56 . The resulting circulating tumour DNA assay can provide a comprehensive view of therapy resistance and cancer relapse and metastasis by detecting alterations in DNA released from multiple tumour regions or different tumour sites 57 .
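The panel-design idea can be illustrated with a simple greedy heuristic (not the published assay-design pipeline): iteratively add the genomic region that helps the most patients who do not yet have enough trackable mutations, until every patient is covered or the panel-size budget is reached. Patient and region names below are hypothetical.

```python
# Illustrative greedy heuristic for selecting a compact tracking panel.
def select_panel(patient_mutations, min_per_patient=2, max_regions=50):
    """patient_mutations: dict mapping patient -> set of mutated regions."""
    covered = {p: 0 for p in patient_mutations}
    panel = []
    all_regions = set().union(*patient_mutations.values())
    while len(panel) < max_regions and any(c < min_per_patient for c in covered.values()):
        candidates = all_regions - set(panel)
        if not candidates:
            break
        # pick the region that helps the most under-covered patients
        def gain(region):
            return sum(1 for p, muts in patient_mutations.items()
                       if region in muts and covered[p] < min_per_patient)
        best = max(candidates, key=gain)
        if gain(best) == 0:
            break
        panel.append(best)
        for p, muts in patient_mutations.items():
            if best in muts:
                covered[p] += 1
    return panel

patients = {"pt1": {"TP53_ex5", "KRAS_G12", "chr8_amp"},
            "pt2": {"KRAS_G12", "EGFR_L858R"},
            "pt3": {"TP53_ex5", "EGFR_L858R", "PIK3CA_H1047"}}
print(select_panel(patients))
```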

Although the data scale in cancer research is typically much smaller than in other fields, the number of input features, such as genes or imaging pixels, can be extremely high. Training a machine learning model with a high number of input dimensions (a large number of features) and small data size (a small number of training samples) is likely to lead to overfitting, in which the model learns noise from training data and cannot generalize on new data 58 . Transfer learning approaches are a promising way of addressing this disparity related to data reuse. These approaches involve training a neural network model on a large, related dataset, and then fine-tuning the model on the smaller, target dataset. For example, most cancer histopathology artificial intelligence (AI) frameworks start from pretrained architectures from ImageNet — an image database containing 15 million images with detailed hierarchical annotations 4 — and then fine-tune the framework on new imaging datasets of smaller sizes. As a further example of this approach, a few-shot learning framework enabled the prediction of drug response using data from only several patient-derived samples and a model pretrained using in vitro data from cell lines 59 . Despite these successful applications, transfer learning should be used with caution as it may produce mostly false predictions when data properties are markedly different between the pretraining set and the new dataset. Training a lightweight model 60 or augmenting the new dataset 61 are alternative solutions.
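A minimal transfer-learning sketch in this spirit is shown below: an ImageNet-pretrained backbone is loaded from torchvision (weights API of torchvision ≥0.13), the backbone is frozen, and only a new classification head is fine-tuned on a small, hypothetical set of histopathology tiles.

```python
# Minimal transfer-learning sketch: fine-tune only a new head on a pretrained backbone.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("tiles/train", transform=transform)  # hypothetical folder
loader = DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for param in model.parameters():                   # freeze the pretrained backbone
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)      # new head: e.g. tumour vs normal

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:                      # one illustrative epoch
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```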

Data-rich translational studies

Many clinical diagnoses and decisions, such as histopathology interpretations, are inherently subjective and rely on interpreters’ experience or the availability of standardized diagnostic nomenclature and taxonomy. Such subjective factors may introduce interpretive error 62, 63, 64 and diagnostic discrepancies, for example when a senior colleague's stature has an undue influence on diagnostic decisions — the so-called big-dog effect 65. Big-data approaches can provide complementary options that are systematic and objective to guide diagnosis and clinical decisions.

Diagnostic biomarkers trained from data cohorts

A major focus of translational big-data studies in cancer has been the development of genomics tests for predicting disease risk, some of which have already been approved by the US Food and Drug Administration (FDA) and commercialized for clinical use 66 . Distinct from biomarker discoveries through biological mechanisms and empirical observations, big data-derived tests analyse genome-scale genomics data from many patients and cohorts to generate a gene signature for clinical assays 67 . Such predictors mainly help clinicians determine the minimal therapy aggressiveness needed to minimize unnecessary treatment and side effects. The success of such tests depends on their high negative predictive value — the proportion of negative tests that reflect true negative results — so as not to miss patients who need aggressive therapy options 66 .
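For concreteness, negative predictive value is simply the fraction of test-negative calls that are truly negative; a short worked example with hypothetical counts:

```python
# Worked example of negative predictive value (NPV): of all patients the test
# calls low risk, what fraction are truly low risk? Counts are hypothetical.
true_negatives = 470    # called low risk, truly low risk
false_negatives = 10    # called low risk, but actually needed aggressive therapy

npv = true_negatives / (true_negatives + false_negatives)
print(f"NPV = {npv:.3f}")   # 0.979: about 2% of 'low risk' calls would miss high-risk disease
```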

Some early examples of diagnostic biomarker tests trained from big data include prognosis assays for patients with oestrogen receptor (ER)- or progesterone receptor (PR)-positive breast cancer, such as Oncotype DX 68 , 69 , MammaPrint 67 , 70 , EndoPredict 71 and Prosigna 72 . These tests are particularly useful as adjuvant endocrine therapy alone can bring sufficient clinical benefit to ER/PR-positive, HER2-negative patients with early-stage breast cancer 73 . Thus, patients stratified as being at low risk can avoid unnecessary additional chemotherapy. Predictors for other cancer types include Oncotype DX biomarkers for colon cancer 74 and prostate cancer 75 and Pervenio for early-stage lung cancer 76 .

In the early applications discussed above, large-scale data from genome-scale experiments served in the biomarker discovery stage but not in their clinical implementation. Owing to the high cost of genome-wide experiments and patent issues, the biomarker tests themselves still need to be performed through quantitative PCR or NanoString gene panels. However, the rapid decline of DNA sequencing costs in recent years could allow therapy decisions to be informed directly by genomics data and bring notable advantages over conventional approaches 77 . Gene alterations relevant to therapy decisions could involve diverse forms, including single-nucleotide mutations, DNA insertions, DNA deletions, copy number alterations, gene rearrangements, microsatellite instability and tumour mutational burden 78 , 79 , 80 . These alterations can be detected by combining hybridization-based capture and high-throughput sequencing. The MSK-IMPACT 81 and FoundationOne CDx 82 tests profile 300–500 genes and can use DNA from formalin-fixed, paraffin-embedded tumour specimens to detect oncogenic alterations and identify patients who may benefit from various therapies.

Variant interpretation in clinical decisions is still challenging as the oncogenic impact of each mutation depends on its clonality 83 , zygosity 84 and co-occurrences with other mutations 85 . Sequencing data can uncover tumorigenic processes (such as DNA repair defects, exogenous mutagen exposure and prior therapy histories 81 ) by identifying underlying mutational signatures, such as DNA substitution classes and sequence contexts 86 . Future computational frameworks for therapy decisions should therefore consider many dimensions of variants and inferred biological processes, together with other clinical data, such as histopathology data, radiology images and health records.
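As a conceptual sketch of signature analysis (not a validated pipeline), the example below factorizes a matrix of per-tumour counts over the 96 trinucleotide substitution classes into a small number of signatures and per-tumour exposures with non-negative matrix factorization, the core operation behind catalogued mutational signatures; the counts are random placeholders.

```python
# Conceptual sketch of mutational-signature extraction via NMF.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(2)
n_tumours, n_contexts = 40, 96
counts = rng.poisson(lam=5, size=(n_tumours, n_contexts)).astype(float)  # placeholder counts

model = NMF(n_components=3, init="nndsvda", max_iter=1000, random_state=0)
exposures = model.fit_transform(counts)   # tumours x signatures
signatures = model.components_            # signatures x 96 substitution contexts

# Normalize each signature to a probability distribution over contexts, as is
# conventional when comparing with catalogued reference signatures.
signatures /= signatures.sum(axis=1, keepdims=True)
print(exposures.shape, signatures.shape)
```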

Data-rich assays that complement precision therapies currently focus on specific genomic aberrations. However, epigenetic therapies, such as inhibitors that target histone deacetylases 87 , have a genome-wide effect and are typically combined with other treatments, and therefore current genomics assays may not readily evaluate their therapeutic efficacy. We could not find any clinical datasets of histone deacetylase inhibitors deposited in the NCBI GEO database when writing this Review, indicating there are many unexplored territories of data-driven predictions for this broad category of anticancer therapies.

Clinical trials guided by molecular data

Genome-wide and multimodal data have begun to play a role in matching patients in prospective multi-arm clinical trials, particularly those investigating precision therapies. For example, the WINTHER trial prospectively matched patients with advanced cancer to therapy on the basis of DNA sequencing (arm A, through Foundation One assays) or RNA expression (arm B, comparing tumour tissue with normal tissue through Agilent oligonucleotide arrays) data from solid tumour biopsies 88 . Such therapy matches by omics data typically lead to off-label drug use. The WINTHER study concluded that both data types were of value for improving therapy recommendations and patient outcomes. Furthermore, there were no significant differences between DNA sequencing and RNA expression with regard to providing therapies with clinical benefits 88 , which was corroborated by a later study 89 .

Other, similar trials have demonstrated the utility of matching patients for off-label use of targeted therapies on the basis of genome-wide genomics or transcriptomics data 89 , 90 , 91 , 92 (Fig.  2 ). In these studies, the fraction of enrolled patients who had therapies matched by omics data ranged from 19% to 37% (WINTHER, 35% 88 ; POG, 37% 89 ; MASTER, 31.8% 92 ; MOSCATO 01, 19.2%  90 ; CoPPO, 20% 91 ). Among these matched patients, about one third demonstrated clinical benefits (WINTHER, 25% 88 ; POG, 46% 89 ; MASTER, 35.7% 92 ; MOSCATO 01, 33% 90 ; CoPPO, 32% 91 ). Except for the POG study, all studies used the end point defined by the Von Hoff model, which compares progression-free survival (PFS) for the trial (PFS2) with the PFS recorded for the therapy preceding enrolment (PFS1) and defines clinical benefit as a PFS2/PFS1 ratio of more than 1.3 (ref. 93 ).
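A short worked example of this end point, with hypothetical survival times:

```python
# Worked example of the Von Hoff end point: progression-free survival on the
# omics-matched therapy (PFS2) versus on the therapy immediately before
# enrolment (PFS1). Values are hypothetical.
pfs1_months = 4.0   # prior therapy
pfs2_months = 6.0   # omics data-guided therapy

ratio = pfs2_months / pfs1_months
print(f"PFS2/PFS1 = {ratio:.2f}")                                  # 1.50
print("clinical benefit" if ratio > 1.3 else "no clinical benefit")
```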

Fig. 2 | Recent umbrella clinical trials 88, 89, 90, 91, 92 have focused on multi-omics profiling of the tumours of enrolled patients by generating and analysing genome-wide data — including data from DNA sequencing, gene expression profiling and copy number profiling — to prioritize treatments. After multi-omics profiling, a multidisciplinary molecular tumour board led by clinicians selects the best therapies on the basis of the currently known relationships between drugs, genes and tumour vulnerabilities. For each therapy, the relevant altered vulnerabilities could include direct drug targets, genes in the same pathway, indirect drug targets upregulated or downregulated by drug treatment, or other genes interacting with the drug targets through physical or genetic interactions. This process then results in patients being treated with off-label targeted therapies. The end points for evaluating clinical efficacy include the ratio of the progression-free survival (PFS) associated with omics data-guided therapies (PFS2) to the PFS associated with the previous therapy (PFS1), or differences in survival between patients treated with omics data-guided therapies and patients treated with therapies guided by physician's choice alone.

A recent study demonstrated the feasibility and value of an N-of-one strategy that collected multimodal data, including immunohistochemistry data for multiple protein markers, RNA levels and genomic alterations in cell-free DNA from liquid biopsies 94 (Fig. 2). A broad multidisciplinary molecular tumour board (MTB) then made personalized decisions using these multimodal omics data. Overall, patients who received MTB-recommended treatments had significantly longer PFS and overall survival than those treated by independent physician choice. Similarly, another study also demonstrated overall survival benefits brought by MTB recommendations 95.

With these initial successes, emerging clinical studies aim to collect additional data beyond bulk-sample sequencing — such as tumour cell death responses following various drug treatments 96 or scRNA-seq data collected on longitudinal patient samples — to study therapy response and resistance mechanisms 97. Besides omics data generated from tumour samples, cross-modality data integration is a potential strategy to improve therapy recommendations. One promising direction involves the study and application of synthetic lethal interactions 98, 99, 100, 101, 102, 103, 104, which, once integrated with tumour transcriptomic profiles, can accurately score drug target importance and predict clinical outcomes for many anticancer treatments, including targeted therapies and immunotherapies 98. We foresee that new data modalities and assays will provide additional ways to design clinical trials.

Artificial intelligence for data-driven cancer diagnosis

Genomics datasets, such as gene expression levels or mutation status, can typically be aligned to each other on gene dimensions. However, data types in clinical diagnoses, such as imaging data or text reports, may not directly align across samples in any obvious way. AI approaches based on deep neural networks (Fig.  3a ) are an emerging method for integrating these data types for clinical applications 105 .

Fig. 3 | a | A common artificial intelligence (AI) framework in cancer detection uses a convolutional neural network (CNN) to detect the presence of cancer cells in a diagnostic image. CNNs use convolution (a weighted sum over an image patch) and pooling (summarizing the values in a region as one value) to encode image regions into low-dimensional numerical vectors that can be analysed by machine learning models. The CNN architecture is typically pretrained with ImageNet data, which is much larger than any cancer biology imaging dataset. To increase the reliability of the AI framework, the input data can be augmented through rotation or blurring of tissue images to increase data size. The data are separated into non-overlapping training, tuning and test sets to train the AI model, tune hyperparameters and estimate the prediction accuracy on new inputs, respectively. False-positive predictions are typically essential data points for retraining the AI model. b | An example of the application of AI in informing clinical decisions, as per the US Food and Drug Administration-approved AI test Paige Prostate. From one needle biopsy sample, the pathologist can decide whether cancer cells are present. If the results are negative (‘no cancer’) or if the physician cannot make a firm diagnosis (‘defer’), the Paige Prostate AI can analyse the image and prompt the pathologist with regard to potential cancer locations if any are detected. The alternative procedure involves evaluating multiple biopsy samples and performing immunohistochemistry tests on prostate cancer markers, independently of the AI test 185.
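The data-handling steps named in the caption — augmentation by rotation and blurring, followed by non-overlapping training, tuning and test splits — can be sketched as follows (paths and split fractions are placeholders; in practice, splits should be made at the patient level to avoid leakage between sets).

```python
# Sketch of image augmentation and a three-way split for training an imaging model.
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=90),   # rotation augmentation
    transforms.GaussianBlur(kernel_size=3),  # blurring augmentation
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

tiles = datasets.ImageFolder("tiles/all", transform=augment)  # hypothetical folder

n = len(tiles)
n_train, n_tune = int(0.7 * n), int(0.15 * n)
n_test = n - n_train - n_tune
train_set, tune_set, test_set = random_split(
    tiles, [n_train, n_tune, n_test], generator=torch.Generator().manual_seed(0)
)
# train_set fits the CNN, tune_set guides hyperparameter choices, and test_set
# is used only once, to estimate accuracy on unseen images.
```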

The most popular application of AI for analysing imaging data involves clinical outcome prediction and tumour detection and grading from tissue stained with haematoxylin and eosin (H&E) 26 . In September 2021, the FDA approved the use of the AI software Paige Prostate 106 to assist pathologists in detecting cancer regions from prostate needle biopsy samples 107 (Fig.  3b ). This approval reflects the accelerating momentum of AI applications on histopathology images 108 to complement conventional pathologist practices and increase analysis throughput, particularly for less experienced pathologists. The CAMELYON challenge for identifying tumour regions provided 1,399 manually annotated whole-slide H&E-stained tissue images of sentinel lymph nodes from patients with breast cancer for training AI algorithms 109 . The top performers in the challenge used deep learning approaches, which achieved similar performance in detecting lymph node metastasis as expert pathologists 110 . Other studies have trained deep neural networks to predict patient survival outcomes 111 , gene mutations 112 or genomic alterations 113 , on the basis of analysing a large body of H&E-stained tissue images with clinical outcome labels or genomics profiles.

Besides histopathology, radiology is another application of AI imaging analysis. Deep convolutional neural networks that use 3D computed tomography volumes have been shown to predict the risk of lung cancer with an accuracy comparable to that of predictions by experienced radiologists 114 . Similarly, convolutional neural networks can use computed tomography data to stratify the survival duration of patients with lung cancer and highlight the importance of tumour-surrounding tissues in risk stratification 115 .

AI frameworks have started to play an important role in analysing electronic health records. A recent study evaluating the effect of different eligibility criteria on cancer trial outcomes using electronic health records of more than 60,000 patients with non-small-cell lung cancer revealed that many patient exclusion criteria commonly used in clinical trials had a minimal effect on trial hazard ratios 25 . Dropping these exclusion criteria would only marginally decrease the overall survival and result in more inclusive trials without compromising patient safety and overall trial success rates 25 . Besides images and health records, AI trained on other data types also has broad clinical applications, such as early cancer detection through liquid biopsies capturing cell-free DNA 116 , 117 or T cell receptor sequences 118 , or genomics-based cancer risk predictions 119 , 120 . Additional examples of AI applications in cancer are available in other reviews 40 , 121 .

New AI approaches have started to play a role in biological knowledge discovery. The saliency map 122 and class activation map 123 can highlight essential portions of input images that drive predicted outcomes. Also, in a multisample cohort, clustering data slices on the basis of deep learning-embedded similarities can reveal human-interpretable features associated with a clinical outcome. For example, clustering similar image patches related to colorectal cancer survival prediction revealed that high-risk survival predictions are associated with a tumour–adipose feature, characterized by poorly differentiated tumour cells adjacent to adipose tissue 124 . Although the molecular mechanisms underlying this association are unclear, this study provided an example of finding imaging features that could help cancer biologists pinpoint new disease mechanisms.
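A minimal sketch of the gradient-based saliency idea is shown below: the gradient of the predicted class score with respect to the input pixels highlights the regions that most influence the prediction. It assumes any trained PyTorch classification network (for example, the fine-tuned CNN sketched earlier) and omits the smoothing and normalization used in practice.

```python
# Minimal saliency-map sketch: gradients of the top class score w.r.t. input pixels.
import torch

def saliency_map(model, image):
    """image: tensor of shape (3, H, W); returns an (H, W) saliency map."""
    model.eval()
    x = image.unsqueeze(0).requires_grad_(True)   # add batch dimension, track gradients
    scores = model(x)
    top_class = scores.argmax(dim=1).item()
    scores[0, top_class].backward()               # gradient of the predicted class score
    return x.grad[0].abs().max(dim=0).values      # max over colour channels -> (H, W)
```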

Despite the promising results described above, few AI-based algorithms have reached clinical deployment due to several limitations 26 . First, the performance of most AI predictors deteriorates when they are applied to test data generated in a setting different from that in which their training data are generated. For example, the performance of top algorithms from the CAMELYON challenge dropped by about 20% when they were evaluated on the basis of data from other centres 108 . Such a gap may arise from differences in image scanners (if imaging data are being evaluated), sample collection protocols or study design, emphasizing the need for reliable data homogenization. Second, supervised AI training requires a large amount of annotated data, and acquiring sufficient human-annotated data can be challenging. In imaging data, if a feature for a particular diagnosis is present in only a fraction of image regions, an algorithm will need many samples to learn the task. Furthermore, if features are not present in the training data, the AI will not make meaningful predictions; for example, the AI framework of AlphaFold2 can predict wild type protein structures with high accuracy, but it cannot predict the impact of cancer missense mutations on protein structures because the training data for AlphaFold2 do not contain altered structures of these mutated proteins 125 .

Many studies of AI applications that claim improvements lack comparisons with conventional clinical procedures. For example, the performance study of Paige Prostate evaluated cancer detection using an H&E-stained tissue image from one needle biopsy sample 126 . However, the pathologist may make decisions on the basis of multiple needle biopsy samples and immunohistochemistry stains for suspicious samples instead of relying on one H&E-stained tissue image (Fig.  3b ). Therefore, rigorous comparison with conventional clinical workflows is necessary for each application before the advantage of any AI framework is claimed.

New therapy development aided by big-data analysis

Developing a new drug is costly, is time-intensive and suffers from a high failure rate 127 . The development of new therapies is a promising direction for big-data applications. To our knowledge, no FDA-approved cancer drugs have been developed primarily through big-data approaches; however, some big data-driven preclinical studies have attracted the attention of the pharmaceutical industry for further development and may soon make impactful contributions to clinics 128 .

Big data have been used to aid the repurposing of existing drugs to treat new diseases 129, 130 and the design of synergistic combinations 131, 132, 133, 134. By creating a network of 1.2 billion edges among diseases, tissues, genes, pathways and drugs through mining more than 40 million documents, one study suggested that everolimus could enhance the activity of vandetanib, an ACVR1 inhibitor, by blocking a drug efflux transporter, nominating the combination as a potential therapy for diffuse intrinsic pontine glioma 135.

Recent studies have combined pharmacological data and AI to design new drugs (Fig.  4 ). A deep generative model was used to design new small molecules inhibiting the receptor tyrosine kinase DDR1 on the basis of information on existing DDR1 inhibitors and compound libraries, with the lead candidate demonstrating favourable pharmacokinetics in mice 136 . Deep generative models are neural networks with many layers that learn complex characteristics of specific datasets (such as high-dimensional probability distributions) and can use them to generate new data similar to the training data 137 . For each specific drug design application, such a framework can encode distinct data into the neural network parameters and thus naturally incorporate many data types. A network aiming to find novel kinase inhibitors, for example, may include data on the structure of existing kinase inhibitors, non-kinase inhibitors and patent-protected molecules that are to be avoided 136 .

Fig. 4 | The variational autoencoder, trained with the structures of many compounds, can encode a molecular structure into a latent space of numerical vectors and decode this latent space back into the compound structure. For each target, such as the receptor tyrosine kinase DDR1, the variational autoencoder can create embeddings of compound categories, such as existing kinase inhibitors, patented compounds and non-kinase inhibitors. Sampling the latent space for compounds that are similar to existing on-target inhibitors but dissimilar to patented compounds or non-kinase inhibitors can generate new candidate kinase inhibitors for downstream experimental validation. Adapted from ref. 136, Springer Nature Limited.
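The caption's encode–sample–decode loop can be illustrated with a deliberately simplified variational autoencoder that operates on fixed-length binary molecular fingerprints rather than full chemical structures (real generative-chemistry models work on SMILES strings or molecular graphs); every dimension and value below is a placeholder.

```python
# Simplified variational autoencoder on binary molecular fingerprints.
import torch
import torch.nn as nn

class FingerprintVAE(nn.Module):
    def __init__(self, n_bits=1024, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_bits, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, n_bits))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(logits, x, mu, logvar):
    recon = nn.functional.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

model = FingerprintVAE()
x = torch.randint(0, 2, (8, 1024)).float()       # placeholder fingerprints
logits, mu, logvar = model(x)
loss = vae_loss(logits, x, mu, logvar)
loss.backward()                                   # one illustrative training step

# Sampling the latent space near embeddings of known inhibitors would propose
# new candidate structures for experimental follow-up.
with torch.no_grad():
    new_candidates = torch.sigmoid(model.decoder(torch.randn(5, 32)))
```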

AI can also be used for the virtual screening of bioactive ligands on target protein structures. Under the assumption that biochemical interactions are local among chemical groups, convolutional neural networks can comprehensively integrate training data from previous virtual screening studies to outperform previous docking methods based on minimizing empirical scores 138 . Similarly, a systematic evaluation revealed that deep neural networks trained using large and diverse datasets composed of molecular descriptors and drug biological activities could predict the activity of test-set molecules better than other approaches 139 .

Big data in front of narrow therapeutic bottlenecks

During dynamic tumour evolution, cancers generally become more heterogeneous and harbour a more diverse population of cells with different treatment sensitivities. Drug resistance can eventually evolve from a narrow bottleneck of a few cells 140. Furthermore, the window between a treatment dose that produces antitumour effects and one that causes toxicity leading to either clinical trial failure or treatment cessation is small 66. These two challenges are common reasons for anticancer therapy failure, as expanding drug combinations to target rare cancer cells will quickly lead to unacceptable toxic effects. An essential question is whether big data can bring solutions to overcome heterogeneous tumour evolution towards drug resistance while avoiding intolerable toxic effects.

Ideally, well-designed drug combinations should target various subsets of drug-tolerant cells in tumours and induce robust responses. Computational methods have been developed to design synergistic drug pairs 131 , 141 ; however, drug synergy may not be predictable for certain combinations even with comprehensive training data. A recent community effort assessed drug synergy prediction methods trained on AstraZeneca’s large drug combination dataset, consisting of 11,576 experiments from 910 combinations across 85 molecularly characterized cancer cell lines 134 . The results showed that none of the methods evaluated could make reliable predictions for approximately 20% of the drug pairs whose targets independently regulate downstream pathways.
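Synergy itself is usually quantified against a null model of independent drug action; one widely used baseline (not the specific metric of the challenge above) is Bliss independence, sketched below with hypothetical inhibition fractions.

```python
# Bliss independence baseline for drug-pair synergy: if two drugs act
# independently, expected fractional inhibition is fa + fb - fa*fb; an observed
# combination effect above that expectation suggests synergy.
def bliss_excess(fa, fb, f_combo):
    expected = fa + fb - fa * fb
    return f_combo - expected          # > 0 suggests synergy, < 0 antagonism

# Hypothetical single-agent and combination inhibition fractions (0 to 1).
print(round(bliss_excess(fa=0.40, fb=0.30, f_combo=0.70), 2))   # 0.70 - 0.58 = 0.12
```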

There may be a theoretical limit to the power of drug combinations in killing heterogeneous tumour cells while avoiding toxic effects on normal tissues. A recent study mining 15 single-cell transcriptomics datasets revealed that inhibition of four cell-surface targets is necessary to kill at least 80% of tumour cells while sparing at least 90% of normal cells in tumours 142. However, a feasible drug-target combination may not exist to kill a higher fraction of tumour cells while sparing normal cells.

An important challenge accompanying therapy design efforts is the identification of genomic biomarkers that could predict toxicity. A community evaluation demonstrated that computational methods could predict the cytotoxicity of environmental chemicals on the basis of the genotype data of lymphoblastoid cell lines 143 . Further, a computational framework has been used to predict drug toxicity by integrating information on drug-target expression in tissues, gene network connectivity, chemical structures and toxicity annotations from clinical trials 144 . However, these studies were not explicitly designed for anticancer drugs, which are challenging with regard to toxicity prediction due to their extended cytotoxicity profiles.

Challenges and future perspectives

While many big-data advancements are encouraging and impressive, considerable challenges remain regarding big-data applications in cancer research and the clinic. Omics data often suffer from measurement inconsistencies between cohorts, marked batch effects and dependencies on specific experimental platforms. Such a lack of consistency is a major hurdle towards clinical translation. Consensus on the measurement, alignment and normalization of tumour omics data will be critical for each data type 35 . Besides these technical challenges, structural and societal challenges also exist and may impede the progress of the entire cancer data science field. We discuss these in the following subsections.

Less-than-desirable data availability

A key challenge of cancer data science is the insufficient availability of data and code. A recent study found that machine learning-based studies in the biomedical domain compare poorly with those in other areas regarding public data and source code availability 145. Sometimes, the clinical information accompanying published cancer genomics data is incomplete or not provided at all, even when security and privacy issues have been resolved. One possible reason for this bottleneck relates to data release policies and data stewardship costs. Although many journals require the public release of data, such requirements are often met by deposition of data into repositories that require author and institutional approval of access requests owing to intellectual property and various other considerations. Furthermore, deposited data may be missing critical information, such as cell barcodes for single-cell sequencing data, or may include only low-resolution images in the case of histopathology data.

In our opinion, the mitigation of these issues will require the enforcement of policies regarding public data availability by funding agencies and additional community efforts to examine the fulfilment of open data access. For example, a funding agency may suspend a project if the community readers report any violations of data release agreements upon publication of articles. The allocation of budgets in grants for patient de-identification upon manuscript submission and financial incentives for checking data through independent data stewardship services upon paper acceptance could markedly help facilitate data and code availability. One notable advance in data availability through industry–academia alliances has come in the form of data-sharing initiatives; specifically, making large repositories of patient tumour sequencing and clinical data available for online queries to researchers in partner institutions 146 . Such initiatives typically involve query-only access (that is, without allowing downloads), but are an encouraging way to expand the collaborative network between academia and industry entities that generate massive amounts of data.

Data-scale gaps

As mentioned earlier, the datasets available for cancer therapeutics are substantially smaller than those available in other fields. One reason for such a gap is that the generation of medical data depends on professionally trained scientists. To close the data-scale gap, more investments will be required to automate the generation of at least some types of annotated medical data and patient omics data. Rare cancers especially suffer from a lack of preclinical models, clinical samples and dedicated funding 147 . Moreover, the usability of biomedical data is typically constrained by the genetic background of the population. For example, the frequency of actionable mutations may differ among East Asian, European and American populations 148 .

A further reason for the data-scale gap is the lack of data generation standards in cancer clinical and biology studies. For example, most clinical trials do not yet collect omics data from patients. With the exponential decrease in sequencing cost, the collection of omics data in clinical trials should, in our opinion, be markedly expanded, and possibly be made mandatory as a standard requirement. Further, current data repositories, such as ClinicalTrials.gov and NCBI GEO, do not share common metalanguage standards, whose incorporation would markedly improve the development of algorithms applied to their analysis. Although semi-automated frameworks are becoming available to homogenize metadata [34], the foundational solution should be to establish common vocabularies and systematic meta-information standards in critical fields.
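As a hedged sketch of what metadata homogenization can look like in practice, consider mapping heterogeneous field names and values onto a small controlled vocabulary. The synonym tables below are hand-written for illustration; real efforts would rely on community ontologies rather than ad hoc dictionaries.

```python
# Illustrative synonym tables (invented): map raw field names and values
# onto a small controlled vocabulary.
FIELD_SYNONYMS = {
    "tumor_site": "primary_site", "site": "primary_site", "primary site": "primary_site",
    "gender": "sex", "Sex": "sex",
    "age_at_dx": "age_at_diagnosis", "AgeAtDiagnosis": "age_at_diagnosis",
}
VALUE_SYNONYMS = {
    "sex": {"m": "male", "M": "male", "f": "female", "F": "female"},
}

def harmonize(record: dict) -> dict:
    """Map one sample's raw metadata onto the controlled vocabulary."""
    clean = {}
    for key, value in record.items():
        std_key = FIELD_SYNONYMS.get(key, key.lower())
        std_value = VALUE_SYNONYMS.get(std_key, {}).get(str(value), value)
        clean[std_key] = std_value
    return clean

raw = {"gender": "F", "tumor_site": "Lung", "age_at_dx": 62}
print(harmonize(raw))
# {'sex': 'female', 'primary_site': 'Lung', 'age_at_diagnosis': 62}
```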

Data science and AI are transforming our world through applications as diverse as self-driving cars, facial recognition and language translation, and, in the medical world, the interpretation of images in radiology and pathology. Tumour data are already available to facilitate biomedical breakthroughs in cancer through cross-modality integration, cross-cohort aggregation and data reuse, and extraordinary advancements are being made in generating and analysing such data. However, the state of big data in the field is complex, and in our view, we should acknowledge that 'big data' in cancer are not yet so big. Future investments from the global research community to expand cancer datasets will be critical to allow better computational models to drive basic research, cancer diagnostics and the development of new therapies.

Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: the next generation. Cell 144 , 646–674 (2011).


Weinstein, J. N. et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 45, 1113–1120 (2013).


Edgar, R., Domrachev, M. & Lash, A. E. Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 30 , 207–210 (2002).


Deng, J. et al. ImageNet: a large-scale hierarchical image database. 2009 IEEE Conf. Computer Vis. Pattern Recognit. https://doi.org/10.1109/cvprw.2009.5206848 (2009).


Stuart, T. & Satija, R. Integrative single-cell analysis. Nat. Rev. Genet. 20 , 257–272 (2019).

Ji, A. L. et al. Multimodal analysis of composition and spatial architecture in human squamous cell carcinoma. Cell 182 , 1661–1662 (2020).

Deshwar, A. G. et al. PhyloWGS: reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biol. 16 , 35 (2015).

Roth, A. et al. PyClone: statistical inference of clonal population structure in cancer. Nat. Methods 11 , 396–398 (2014).

Miller, C. A. et al. SciClone: inferring clonal architecture and tracking the spatial and temporal patterns of tumor evolution. PLoS Comput. Biol. 10 , e1003665 (2014).

Carter, S. L. et al. Absolute quantification of somatic DNA alterations in human cancer. Nat. Biotechnol. 30 , 413–421 (2012).

Minussi, D. C. et al. Breast tumours maintain a reservoir of subclonal diversity during expansion. Nature 592 , 302–308 (2021).

Laks, E. et al. Clonal decomposition and DNA replication states defined by scaled single-cell genome sequencing. Cell 179 , 1207–1221.e22 (2019).

Zhao, T. et al. Spatial genomics enables multi-modal study of clonal heterogeneity in tissues. Nature 601 , 85–91 (2022).

Przybyla, L. & Gilbert, L. A. A new era in functional genomics screens. Nat. Rev. Genet. 23 , 89–103 (2022).

Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483 , 603–607 (2012).

Subramanian, A. et al. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell 171 , 1437–1452.e17 (2017).

Shalem, O., Sanjana, N. E. & Zhang, F. High-throughput functional genomics using CRISPR-Cas9. Nat. Rev. Genet. 16 , 299–311 (2015).

Gilbert, L. A. et al. Genome-scale CRISPR-mediated control of gene repression and activation. Cell 159 , 647–661 (2014).

Tsherniak, A. et al. Defining a cancer dependency map. Cell 170 , 564–576.e16 (2017).

Johannessen, C. M. et al. A melanocyte lineage program confers resistance to MAP kinase pathway inhibition. Nature 504 , 138–142 (2013).

Robertson, G. et al. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat. Methods 4 , 651–657 (2007).

Hafner, M. et al. CLIP and complementary methods. Nat. Rev. Methods Prim. 1 , 20 (2021).


Vidal, M., Cusick, M. E. & Barabási, A.-L. Interactome networks and human disease. Cell 144 , 986–998 (2011).

Kempfer, R. & Pombo, A. Methods for mapping 3D chromosome architecture. Nat. Rev. Genet. 21 , 207–226 (2020).

Liu, R. et al. Evaluating eligibility criteria of oncology trials using real-world data and AI. Nature 592 , 629–633 (2021).

van der Laak, J., Litjens, G. & Ciompi, F. Deep learning in histopathology: the path to the clinic. Nat. Med. 27 , 775–784 (2021).


Hosny, A., Parmar, C., Quackenbush, J., Schwartz, L. H. & Aerts, H. J. W. L. Artificial intelligence in radiology. Nat. Rev. Cancer 18, 500–510 (2018).

Gillies, R. J., Kinahan, P. E. & Hricak, H. Radiomics: images are more than pictures, they are data. Radiology 278 , 563–577 (2016).

Jiang, P. et al. Signatures of T cell dysfunction and exclusion predict cancer immunotherapy response. Nat. Med. 24 , 1550–1558 (2018). This integrative study of tumour immune evasion across many clinical datasets reveals that SERPINB9 expression consistently correlates with intratumoural T cell dysfunction and resistance to immune checkpoint blockade .

Parkinson, H. et al. ArrayExpress — a public database of microarray experiments and gene expression profiles. Nucleic Acids Res. 35 , D747–D750 (2007).

Gentles, A. J. et al. The prognostic landscape of genes and infiltrating immune cells across human cancers. Nat. Med. 21 , 938–945 (2015).

Tomlins, S. A. et al. Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 310 , 644–648 (2005). This compendium analysis across 132 gene expression datasets representing 10,486 microarray experiments identifies ERG and ETV1 fused with TMPRSS2 as highly expressed genes in six independent prostate cancer cohorts .

Jiang, L. et al. Direct tumor killing and immunotherapy through anti-serpinB9 therapy. Cell 183 , 1219–1233.e18 (2020).

Jiang, P. et al. Systematic investigation of cytokine signaling activity at the tissue and single-cell levels. Nat. Methods 18 , 1181–1191 (2021). This study describes a transcriptomic data atlas collected from cytokine treatments in bulk cell cultures, which enables the inference of signalling activities in bulk and single-cell transcriptomics data to study human inflammatory diseases .

Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11 , 733–739 (2010).

Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8 , 118–127 (2007).

Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36 , 411–420 (2018).

Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177 , 1888–1902.e21 (2019).

Nygaard, V., Rødland, E. A. & Hovig, E. Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses. Biostatistics 17 , 29–39 (2016).

Boehm, K. M., Khosravi, P., Vanguri, R., Gao, J. & Shah, S. P. Harnessing multimodal data integration to advance precision oncology. Nat. Rev. Cancer 22 , 114–126 (2022).

Huang, C. et al. Proteogenomic insights into the biology and treatment of HPV-negative head and neck squamous cell carcinoma. Cancer Cell 39 , 361–379.e16 (2021).

Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184 , 3573–3587.e29 (2021). This study integrates multiple single-cell data modalities, such as gene expression, cell-surface protein levels and chromatin accessibilities, to increase the accuracy of cell lineage clustering .

Klein, M. I. et al. Identifying modules of cooperating cancer drivers. Mol. Syst. Biol. 17 , e9810 (2021).

Hofree, M., Shen, J. P., Carter, H., Gross, A. & Ideker, T. Network-based stratification of tumor mutations. Nat. Methods 10 , 1108–1115 (2013).

Reyna, M. A. et al. Pathway and network analysis of more than 2500 whole cancer genomes. Nat. Commun. 11 , 729 (2020).

Zheng, F. et al. Interpretation of cancer mutations using a multiscale map of protein systems. Science 374 , eabf3067 (2021).

Paull, E. O. et al. A modular master regulator landscape controls cancer transcriptional identity. Cell 184 , 334–351 (2021).

Avila Cobos, F., Alquicira-Hernandez, J., Powell, J. E., Mestdagh, P. & De Preter, K. Benchmarking of cell type deconvolution pipelines for transcriptomics data. Nat. Commun. 11 , 5650 (2020).

Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods 12 , 453–457 (2015).

Newman, A. M. et al. Determining cell type abundance and expression from bulk tissues with digital cytometry. Nat. Biotechnol. 37 , 773–782 (2019).

Wang, K. et al. Deconvolving clinically relevant cellular immune cross-talk from bulk gene expression using CODEFACS and LIRICS stratifies patients with melanoma to anti-PD-1 therapy. Cancer Discov. 12 , 1088–1105 (2022). Together with Newman et al. (2019), this study demonstrates that assembling gene expression profiles of diverse cell types from existing datasets can enable deconvolution of cell fractions and lineage-specific expression in a bulk-tumour expression profile .

Kharchenko, P. V., Silberstein, L. & Scadden, D. T. Bayesian approach to single-cell differential expression analysis. Nat. Methods 11 , 740–742 (2014).

Suvà, M. L. & Tirosh, I. Single-cell RNA sequencing in cancer: lessons learned and emerging challenges. Mol. Cell 75 , 7–12 (2019).

Zhang, Y. et al. A T cell resilience model associated with response to immunotherapy in multiple tumor types. Nat. Med. https://doi.org/10.1038/s41591-022-01799-y (2022). This study uses a computational model to repurpose a vast amount of single-cell transcriptomics data and identify biomarkers of tumour-resilient T cells and new therapeutic targets, such as FIBP , to potentiate cellular immunotherapies .

Gopalan, V. et al. A transcriptionally distinct subpopulation of healthy acinar cells exhibit features of pancreatic progenitors and PDAC. Cancer Res. 81 , 3958–3970 (2021).

Newman, A. M. et al. An ultrasensitive method for quantitating circulating tumor DNA with broad patient coverage. Nat. Med. 20 , 548–554 (2014).

Heitzer, E., Haque, I. S., Roberts, C. E. S. & Speicher, M. R. Current and future perspectives of liquid biopsies in genomics-driven oncology. Nat. Rev. Genet. 20 , 71–88 (2019).

Hastie, T., Friedman, J. & Tibshirani, R. The Elements of Statistical Learning (Springer, 2001).

Ma, J. et al. Few-shot learning creates predictive models of drug response that translate from high-throughput screens to individual patients. Nat. Cancer 2 , 233–244 (2021).

Raghu, M., Zhang, C., Kleinberg, J. & Bengio, S. Transfusion: understanding transfer learning for medical imaging. Adv. Neural Inf. Process. Syst . 33 , 3347–3357 (2019).


Zoph, B. et al. Rethinking pre-training and self-training. Adv. Neural Inf. Process. Syst . 34 , 3833–3845 (2020).

Meier, F. A., Varney, R. C. & Zarbo, R. J. Study of amended reports to evaluate and improve surgical pathology processes. Adv. Anat. Pathol. 18 , 406–413 (2011).

Nakhleh, R. E. Error reduction in surgical pathology. Arch. Pathol. Lab. Med. 130 , 630–632 (2006).

Nakhleh, R. E. et al. Interpretive diagnostic error reduction in surgical pathology and cytology: guideline from the College of American Pathologists Pathology and Laboratory Quality Center and the Association of Directors of Anatomic and Surgical Pathology. Arch. Pathol. Lab. Med. 140 , 29–40 (2016).

Raab, S. S. et al. The ‘Big Dog’ effect: variability assessing the causes of error in diagnoses of patients with lung cancer. J. Clin. Oncol. 24 , 2808–2814 (2006).

Jiang, P., Sellers, W. R. & Liu, X. S. Big data approaches for modeling response and resistance to cancer drugs. Annu. Rev. Biomed. Data Sci. 1 , 1–27 (2018).

van’t Veer, L. J. et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 415 , 530–536 (2002).

Sparano, J. A. et al. Adjuvant chemotherapy guided by a 21-gene expression assay in breast cancer. N. Engl. J. Med. 379 , 111–121 (2018).

Kalinsky, K. et al. 21-gene assay to inform chemotherapy benefit in node-positive breast cancer. N. Engl. J. Med. 385 , 2336–2347 (2021).

Cardoso, F. et al. 70-gene signature as an aid to treatment decisions in early-stage breast cancer. N. Engl. J. Med. 375 , 717–729 (2016).

Filipits, M. et al. A new molecular predictor of distant recurrence in ER-positive, HER2-negative breast cancer adds independent information to conventional clinical risk factors. Clin. Cancer Res. 17 , 6012–6020 (2011).

Parker, J. S. et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J. Clin. Oncol. 27 , 1160–1167 (2009).

Early Breast Cancer Trialists’ Collaborative Group (EBCTCG). Effects of chemotherapy and hormonal therapy for early breast cancer on recurrence and 15-year survival: an overview of the randomised trials. Lancet 365 , 1687–1717 (2005).

You, Y. N., Rustin, R. B. & Sullivan, J. D. Oncotype DX® colon cancer assay for prediction of recurrence risk in patients with stage II and III colon cancer: a review of the evidence. Surg. Oncol. 24, 61–66 (2015).

Klein, E. A. et al. A 17-gene assay to predict prostate cancer aggressiveness in the context of Gleason grade heterogeneity, tumor multifocality, and biopsy undersampling. Eur. Urol. 66 , 550–560 (2014).

Kratz, J. R. et al. A practical molecular assay to predict survival in resected non-squamous, non-small-cell lung cancer: development and international validation studies. Lancet 379 , 823–832 (2012).

Beaubier, N. et al. Integrated genomic profiling expands clinical options for patients with cancer. Nat. Biotechnol. 37 , 1351–1360 (2019).

Snyder, A. et al. Genetic basis for clinical response to CTLA-4 blockade in melanoma. N. Engl. J. Med. 371 , 2189–2199 (2014).

Van Allen, E. M. et al. Genomic correlates of response to CTLA-4 blockade in metastatic melanoma. Science 350 , 207–211 (2015).

Rizvi, N. A. et al. Cancer immunology. Mutational landscape determines sensitivity to PD-1 blockade in non-small cell lung cancer. Science 348 , 124–128 (2015).

Zehir, A. et al. Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients. Nat. Med. 23 , 703–713 (2017).

Li, M. Statistical methods for clinical validation of follow-on companion diagnostic devices via an external concordance study. Stat. Biopharm. Res. 8 , 355–363 (2016).

Litchfield, K. et al. Meta-analysis of tumor- and T cell-intrinsic mechanisms of sensitization to checkpoint inhibition. Cell 184 , 596–614.e14 (2021).

Bielski, C. M. et al. Widespread selection for oncogenic mutant allele imbalance in cancer. Cancer Cell 34 , 852–862.e4 (2018).

El Tekle, G. et al. Co-occurrence and mutual exclusivity: what cross-cancer mutation patterns can tell us. Trends Cancer Res. 7 , 823–836 (2021).

Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500 , 415–421 (2013).

Cheng, Y. et al. Targeting epigenetic regulators for cancer therapy: mechanisms and advances in clinical trials. Signal Transduct. Target. Ther. 4 , 62 (2019).

Rodon, J. et al. Genomic and transcriptomic profiling expands precision cancer medicine: the WINTHER trial. Nat. Med. 25 , 751–758 (2019). This study describes the WINTHER trial, which prospectively matched patients with advanced cancer to therapy on the basis of DNA sequencing or RNA expression data from tumour biopsies and concluded that both data types were of value for improving therapy recommendations .

Pleasance, E. et al. Whole genome and transcriptome analysis enhances precision cancer treatment options. Ann. Oncol. https://doi.org/10.1016/j.annonc.2022.05.522 (2022).

Massard, C. et al. High-throughput genomics and clinical outcome in hard-to-treat advanced cancers: results of the MOSCATO 01 trial. Cancer Discov. 7 , 586–595 (2017).

Tuxen, I. V. et al. Copenhagen Prospective Personalized Oncology (CoPPO) — clinical utility of using molecular profiling to select patients to phase I trials. Clin. Cancer Res. 25 , 1239–1247 (2019).

Horak, P. et al. Comprehensive genomic and transcriptomic analysis for guiding therapeutic decisions in patients with rare cancers. Cancer Discov. 11 , 2780–2795 (2021).

Von Hoff, D. D. et al. Pilot study using molecular profiling of patients’ tumors to find potential targets and select treatments for their refractory cancers. J. Clin. Oncol. 28 , 4877–4883 (2010).

Kato, S. et al. Real-world data from a molecular tumor board demonstrates improved outcomes with a precision N-of-one strategy. Nat. Commun. 11 , 4965 (2020).

Hoefflin, R. et al. Personalized clinical decision making through implementation of a molecular tumor board: a German single-center experience. JCO Precis. Oncol . 1–16 https://doi.org/10.1200/po.18.00105 (2018).

Irmisch, A. et al. The Tumor Profiler Study: integrated, multi-omic, functional tumor profiling for clinical decision support. Cancer Cell 39 , 288–293 (2021).

Cohen, Y. C. et al. Identification of resistance pathways and therapeutic targets in relapsed multiple myeloma patients through single-cell sequencing. Nat. Med. 27 , 491–503 (2021).

Lee, J. S. et al. Synthetic lethality-mediated precision oncology via the tumor transcriptome. Cell 184 , 2487–2502.e13 (2021). This study demonstrates that integrating information regarding synthetic lethal interactions with tumour transcriptomics profiles can accurately score drug-target importance and predict clinical outcomes for a broad category of anticancer treatments .

Zhang, B. et al. The tumor therapy landscape of synthetic lethality. Nat. Commun. 12 , 1275 (2021).

Pathria, G. et al. Translational reprogramming marks adaptation to asparagine restriction in cancer. Nat. Cell Biol. 21 , 1590–1603 (2019).

Feng, X. et al. A platform of synthetic lethal gene interaction networks reveals that the GNAQ uveal melanoma oncogene controls the Hippo pathway through FAK. Cancer Cell 35 , (2019).

Lee, J. S. et al. Harnessing synthetic lethality to predict the response to cancer treatment. Nat. Commun. 9 , 2546 (2018).

Cheng, K., Nair, N. U., Lee, J. S. & Ruppin, E. Synthetic lethality across normal tissues is strongly associated with cancer risk, onset, and tumor suppressor specificity. Sci. Adv. 7 , eabc2100 (2021).

Sahu, A. D. et al. Genome-wide prediction of synthetic rescue mediators of resistance to targeted and immunotherapy. Mol. Syst. Biol. 15 , e8323 (2019).

Elemento, O., Leslie, C., Lundin, J. & Tourassi, G. Artificial intelligence in cancer research, diagnosis and therapy. Nat. Rev. Cancer 21 , 747–752 (2021).

Raciti, P. et al. Novel artificial intelligence system increases the detection of prostate cancer in whole slide images of core needle biopsies. Mod. Pathol. 33 , 2058–2066 (2020).

Office of the Commissioner. FDA authorizes software that can help identify prostate cancer. https://www.fda.gov/news-events/press-announcements/fda-authorizes-software-can-help-identify-prostate-cancer (2021).

Campanella, G. et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat. Med. 25 , 1301–1309 (2019).

Litjens, G. et al. 1399 H&E-stained sentinel lymph node sections of breast cancer patients: the CAMELYON dataset. GigaScience 7 , giy065 (2018).


Ehteshami Bejnordi, B. et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA 318 , 2199–2210 (2017).

Wulczyn, E. et al. Deep learning-based survival prediction for multiple cancer types using histopathology images. PLoS ONE 15 , e0233678 (2020).

Coudray, N. et al. Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nat. Med. 24 , 1559–1567 (2018).

Kather, J. N. et al. Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer. Nat. Med. 25 , 1054–1056 (2019).

Ardila, D. et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat. Med. 25 , 954–961 (2019).

Hosny, A. et al. Deep learning for lung cancer prognostication: a retrospective multi-cohort radiomics study. PLoS Med. 15 , e1002711 (2018).

Zviran, A. et al. Genome-wide cell-free DNA mutational integration enables ultra-sensitive cancer monitoring. Nat. Med. 26 , 1114–1124 (2020).

Mathios, D. et al. Detection and characterization of lung cancer using cell-free DNA fragmentomes. Nat. Commun. 12 , 5060 (2021).

Beshnova, D. et al. De novo prediction of cancer-associated T cell receptors for noninvasive cancer detection. Sci. Transl. Med. 12 , eaaz3738 (2020).

Katzman, J. L. et al. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Med. Res. Methodol. 18 , 24 (2018).

Ching, T., Zhu, X. & Garmire, L. X. Cox-nnet: an artificial neural network method for prognosis prediction of high-throughput omics data. PLoS Comput. Biol. 14 , e1006076 (2018).

Kann, B. H., Hosny, A. & Aerts, H. J. W. L. Artificial intelligence for clinical oncology. Cancer Cell 39, 916–927 (2021).

Kadir, T. & Brady, M. Saliency, scale and image description. Int. J. Comput. Vis. 45 , 83–105 (2001).

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. & Torralba, A. Learning deep features for discriminative localization. 2016 IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR) https://doi.org/10.1109/cvpr.2016.319 (2016).

Wulczyn, E. et al. Interpretable survival prediction for colorectal cancer using deep learning. NPJ Digit. Med. 4 , 71 (2020). This study clusters similar image patches related to colorectal cancer survival prediction to reveal that high-risk survival predictions are associated with a tumour–adipose feature, characterized by poorly differentiated tumour cells adjacent to adipose tissue .

Buel, G. R. & Walters, K. J. Can AlphaFold2 predict the impact of missense mutations on structure? Nat. Struct. Mol. Biol. 29 , 1–2 (2022).

US Food and Drug Administration. Evaluation of automatic class III designation for Paige Prostate. https://www.accessdata.fda.gov/cdrh_docs/reviews/DEN200080.pdf (2021).

Calcoen, D., Elias, L. & Yu, X. What does it take to produce a breakthrough drug? Nat. Rev. Drug Discov. 14 , 161–162 (2015).

Jayatunga, M. K. P., Xie, W., Ruder, L., Schulze, U. & Meier, C. AI in small-molecule drug discovery: a coming wave? Nat. Rev. Drug Discov. 21 , 175–176 (2022).

Pushpakom, S. et al. Drug repurposing: progress, challenges and recommendations. Nat. Rev. Drug Discov. 18 , 41–58 (2019).

Jahchan, N. S. et al. A drug repositioning approach identifies tricyclic antidepressants as inhibitors of small cell lung cancer and other neuroendocrine tumors. Cancer Discov. 3 , 1364–1377 (2013).

Kuenzi, B. M. et al. Predicting drug response and synergy using a deep learning model of human cancer cells. Cancer Cell 38 , 672–684.e6 (2020).

Ling, A. & Huang, R. S. Computationally predicting clinical drug combination efficacy with cancer cell line screens and independent drug action. Nat. Commun. 11 , 5848 (2020).

Aissa, A. F. et al. Single-cell transcriptional changes associated with drug tolerance and response to combination therapies in cancer. Nat. Commun. 12 , 1628 (2021).

Menden, M. P. et al. Community assessment to advance computational prediction of cancer drug combinations in a pharmacogenomic screen. Nat. Commun. 10 , 2674 (2019).

Carvalho, D. M. et al. Repurposing vandetanib plus everolimus for the treatment of ACVR1-mutant diffuse intrinsic pontine glioma. Cancer Discov. https://doi.org/10.1158/2159-8290.CD-20-1201 (2021).

Zhavoronkov, A. et al. Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nat. Biotechnol. 37 , 1038–1040 (2019). This study describes a deep generative AI model, which enabled the design of new inhibitors of the receptor tyrosine kinase DDR1 by modelling molecule structures from a compound library, existing DDR1 inhibitors, non-kinase inhibitors and patented drugs .

Ruthotto, L. & Haber, E. An introduction to deep generative modeling. GAMM-Mitteilungen 44 , e202100008 (2021).

Wallach, I., Dzamba, M. & Heifets, A. AtomNet: a deep convolutional neural network for bioactivity prediction in structure-based drug discovery. Preprint at https://arxiv.org/abs/1510.02855 (2015).

Ma, J., Sheridan, R. P., Liaw, A., Dahl, G. E. & Svetnik, V. Deep neural nets as a method for quantitative structure–activity relationships. J. Chem. Inf. Model. 55 , 263–274 (2015).

Dagogo-Jack, I. & Shaw, A. T. Tumour heterogeneity and resistance to cancer therapies. Nat. Rev. Clin. Oncol. 15 , 81–94 (2018).

Bansal, M. et al. A community computational challenge to predict the activity of pairs of compounds. Nat. Biotechnol. 32 , 1213–1222 (2014).

Ahmadi, S. et al. The landscape of receptor-mediated precision cancer combination therapy via a single-cell perspective. Nat. Commun. 13 , 1613 (2022).

Eduati, F. et al. Prediction of human population responses to toxic compounds by a collaborative competition. Nat. Biotechnol. 33 , 933–940 (2015).

Gayvert, K. M., Madhukar, N. S. & Elemento, O. A data-driven approach to predicting successes and failures of clinical trials. Cell Chem. Biol. 23 , 1294–1301 (2016).

McDermott, M. B. A. et al. Reproducibility in machine learning for health research: still a ways to go. Sci. Transl. Med. 13 , eabb1655 (2021).

AP News. Caris Precision Oncology Alliance partners with the National Cancer Institute, part of the National Institutes of Health, to expand collaborative clinical research efforts. Associated Press https://apnews.com/press-release/pr-newswire/technology-science-business-health-cancer-221e9238956a7a4835be75cb65832573 (2021).

Alvi, M. A., Wilson, R. H. & Salto-Tellez, M. Rare cancers: the greatest inequality in cancer research and oncology treatment. Br. J. Cancer 117 , 1255–1257 (2017).

Park, K. H. et al. Genomic landscape and clinical utility in Korean advanced pan-cancer patients from prospective clinical sequencing: K-MASTER program. Cancer Discov. 12 , 938–948 (2022).

Bailey, M. H. et al. Retrospective evaluation of whole exome and genome mutation calls in 746 cancer samples. Nat. Commun. 11 , 4748 (2020).

Ellrott, K. et al. Scalable open science approach for mutation calling of tumor exomes using multiple genomic pipelines. Cell Syst. 6 , 271–281.e7 (2018).

Zare, F., Dow, M., Monteleone, N., Hosny, A. & Nabavi, S. An evaluation of copy number variation detection tools for cancer using whole exome sequencing data. BMC Bioinforma. 18 , 286 (2017).

The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Pan-cancer analysis of whole genomes. Nature 578, 82–93 (2020).

Gawad, C., Koh, W. & Quake, S. R. Single-cell genome sequencing: current state of the science. Nat. Rev. Genet. 17 , 175–188 (2016).

Corces, M. R. et al. The chromatin accessibility landscape of primary human cancers. Science 362 , eaav1898 (2018).

Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523 , 486–490 (2015).

Furey, T. S. ChIP–seq and beyond: new and improved methodologies to detect and characterize protein–DNA interactions. Nat. Rev. Genet. 13 , 840–852 (2012).

Rotem, A. et al. Single-cell ChIP-seq reveals cell subpopulations defined by chromatin state. Nat. Biotechnol. 33 , 1165–1172 (2015).

Papanicolau-Sengos, A. & Aldape, K. DNA methylation profiling: an emerging paradigm for cancer diagnosis. Annu. Rev. Pathol. 17 , 295–321 (2022).

Smallwood, S. A. et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat. Methods 11 , 817–820 (2014).

Cieślik, M. & Chinnaiyan, A. M. Cancer transcriptome profiling at the juncture of clinical translation. Nat. Rev. Genet. 19 , 93–109 (2018).

Macosko, E. Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161 , 1202–1214 (2015).

Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8 , 14049 (2017).

Ramsköld, D. et al. Full-length mRNA-Seq from single-cell levels of RNA and individual circulating tumor cells. Nat. Biotechnol. 30 , 777–782 (2012).

Gierahn, T. M. et al. Seq-Well: portable, low-cost RNA sequencing of single cells at high throughput. Nat. Methods 14 , 395–398 (2017).

Rao, A., Barkley, D., França, G. S. & Yanai, I. Exploring tissue architecture using spatial transcriptomics. Nature 596 , 211–220 (2021).

Ståhl, P. L. et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science 353 , 78–82 (2016).

Rodriques, S. G. et al. Slide-seq: a scalable technology for measuring genome-wide expression at high spatial resolution. Science 363 , 1463–1467 (2019).

Lee, J. H. et al. Fluorescent in situ sequencing (FISSEQ) of RNA for gene expression profiling in intact cells and tissues. Nat. Protoc. 10 , 442–458 (2015).

Ellis, M. J. et al. Connecting genomic alterations to cancer biology with proteomics: the NCI Clinical Proteomic Tumor Analysis Consortium. Cancer Discov. 3 , 1108–1112 (2013).

Li, J. et al. TCPA: a resource for cancer functional proteomics data. Nat. Methods 10 , 1046–1047 (2013).

Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14 , 865–868 (2017).

Bendall, S. C. et al. Single-cell mass cytometry of differential immune and drug responses across a human hematopoietic continuum. Science 332 , 687–696 (2011).

Jackson, H. W. et al. The single-cell pathology landscape of breast cancer. Nature 578 , 615–620 (2020).

Keren, L. et al. A structured tumor-immune microenvironment in triple negative breast cancer revealed by multiplexed ion beam imaging. Cell 174 , 1373–1387.e19 (2018).

Schürch, C. M. et al. Coordinated cellular neighborhoods orchestrate antitumoral immunity at the colorectal cancer invasive front. Cell 183 , 838 (2020).

Beckonert, O. et al. Metabolic profiling, metabolomic and metabonomic procedures for NMR spectroscopy of urine, plasma, serum and tissue extracts. Nat. Protoc. 2 , 2692–2703 (2007).

Jang, C., Chen, L. & Rabinowitz, J. D. Metabolomics and isotope tracing. Cell 173 , 822–837 (2018).

Ghandi, M. et al. Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature 569 , 503–508 (2019).

Uhlén, M. et al. Tissue-based map of the human proteome. Science 347 , 1260419 (2015).

Fedorov, A. et al. NCI Imaging Data Commons. Cancer Res 81 , 4188–4193 (2021).

Cerami, E. et al. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov. 2 , 401–404 (2012).

Goldman, M. J. et al. Visualizing and interpreting cancer genomics data via the Xena platform. Nat. Biotechnol. 38 , 675–678 (2020).

Jiang, P., Freedman, M. L., Liu, J. S. & Liu, X. S. Inference of transcriptional regulation in cancers. Proc. Natl Acad. Sci. USA 112 , 7731–7736 (2015).

Sun, D. et al. TISCH: a comprehensive web resource enabling interactive single-cell transcriptome visualization of tumor microenvironment. Nucleic Acids Res. 49 , D1420–D1430 (2021).

Kristiansen, G. Markers of clinical utility in the differential diagnosis and prognosis of prostate cancer. Mod. Pathol. 31 , S143–S155 (2018).


Acknowledgements

The authors are supported by the intramural research budget of the US National Cancer Institute.

Author information

Authors and affiliations.

Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA

Peng Jiang, Sanju Sinha, Sridhar Hannenhalli, Cenk Sahinalp & Eytan Ruppin

Laboratory of Pathology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA

Kenneth Aldape


Contributions

P.J. and E.R. designed the scope and structure of the Review, assembled write-up components and finalized the manuscript. C.S. wrote the text on tumour evolution and heterogeneity. S.H. wrote the text on transcriptional dysregulation. P.J. wrote the sections related to spatial genomics and artificial intelligence. P.J., E.R. and K.A. wrote the section on cancer diagnosis and treatment decisions. S.S. and P.J. prepared Tables 1 – 4 .

Corresponding authors

Correspondence to Peng Jiang or Eytan Ruppin .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Peer review

Peer review information.

Nature Reviews Cancer thanks Itai Yanai, Anjali Rao and the other, anonymous, reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Related links

Array Express: https://www.ebi.ac.uk/arrayexpress/

CAMELYON: https://camelyon17.grand-challenge.org/

cBioportal: https://www.cbioportal.org/

CCLE: https://depmap.org/portal/ccle/

CPTAC: https://proteomics.cancer.gov/data-portal

CytoSig: https://cytosig.ccr.cancer.gov/

DepMap: https://depmap.org/portal

DNA sequencing costs: https://www.genome.gov/about-genomics/fact-sheets/DNA-Sequencing-Costs-Data

DrugCombDB: http://drugcombdb.denglab.org/

FDC: https://curate.ccr.cancer.gov/

GDC: https://gdc.cancer.gov/

GENIE: https://www.aacr.org/professionals/research/aacr-project-genie

GEO: https://www.ncbi.nlm.nih.gov/geo

Human Protein Atlas: https://www.proteinatlas.org/humanproteome/pathology

ICGC: https://dcc.icgc.org/

IDC: https://datacommons.cancer.gov/repository/imaging-data-commons

LINCS: https://clue.io/

PCAWG: https://dcc.icgc.org/pcawg

PRECOG: https://precog.stanford.edu/

RABIT: http://rabit.dfci.harvard.edu/

TARGET: https://ocg.cancer.gov/programs/target/data-matrix

TCIA: https://www.cancerimagingarchive.net/

TCGA: https://gdc.cancer.gov/

TIDE: http://tide.dfci.harvard.edu/

TISCH: http://tisch.comp-genomics.org/

Tres: https://resilience.ccr.cancer.gov/

UCSC Xena: https://xena.ucsc.edu/

Glossary

Few-shot learning: a machine learning method that classifies new data using only a few training samples by transferring knowledge from large, related datasets.

Saliency map: a map of important image locations that support machine learning outputs.

Class activation map: a coarse-resolution map of important image regions for predicting a specific class using activations and gradients in the final convolutional layer.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article.

Jiang, P., Sinha, S., Aldape, K. et al. Big data in basic and translational cancer research. Nat Rev Cancer 22 , 625–639 (2022). https://doi.org/10.1038/s41568-022-00502-0


Accepted: 26 July 2022

Published: 05 September 2022

Issue Date: November 2022

DOI: https://doi.org/10.1038/s41568-022-00502-0




big data in bioinformatics research

Bionl.ai

Bioinformatics 101: How Big Data is Solving Big Biological Questions

Unlock the mysteries of bioinformatics and discover how it's revolutionizing science and healthcare. Learn how Bionl.ai is making this intricate field accessible to all, from seasoned researchers to aspiring scientists.

What Exactly is Bioinformatics?

Bioinformatics is a multidisciplinary field that unifies biology, computer science, and mathematics to decode the complex language of biological data. Imagine being a detective with an arsenal of advanced computational tools, delving into mysteries that range from genetic codes to evolutionary patterns. This is not merely a subject for academic discussion; it's a practical field with real-world applications. Platforms like Bionl.ai are pioneering ways to make bioinformatics approachable, even for those without a background in programming or data science.

Why Should This Matter to You?

You might be asking, "Why is bioinformatics relevant to me?" The answer lies in the profound impact this field has on our daily lives. Whether it's the development of personalized medicine, the identification of genetic predispositions to certain diseases, or even contributions to sustainable agriculture, bioinformatics is everywhere. Companies like Bionl are at the forefront of democratizing this science, empowering healthcare professionals and scientists to conduct advanced research without the need for specialized programming skills.

The Biology-Data Science Intersection

How Big Data Comes into Play

The term "Big Data" often evokes images of vast server farms and complex algorithms. In bioinformatics, Big Data refers to the enormous volumes of biological data that scientists must sift through to find actionable insights. Platforms like Bionl are instrumental in this process, offering intuitive, no-code solutions that transform what could be a daunting task into a manageable, even straightforward, endeavor.

Traditional Biology vs. Computational Biology

The landscape of biological research has undergone a seismic shift thanks to computational biology. In the traditional paradigm, research was laborious, time-consuming, and often limited in scope. Computational biology, facilitated by bioinformatics, leverages computational power to conduct analyses on a scale and at speeds previously unimaginable. Bionl epitomizes this transition by offering a single workspace where scientists can perform a wide range of functionalities without the constraints of traditional methods.

Crucial Applications of Bioinformatics

Genomic Sequencing

The mapping of the human genome was a watershed moment in science, but it was also a project that required over a decade and billions of dollars to complete. Fast forward to today, and bioinformatics has made genomic sequencing faster, cheaper, and more accessible than ever. Bionl significantly contributes to this accessibility by allowing even non-specialists to conduct complex genomic analyses through natural language prompts.

The Revolution in Drug Discovery

The process of discovering new drugs is incredibly complex and expensive. Bioinformatics has been a game-changer, enabling in-silico (computer-based) experiments that can rapidly identify promising drug candidates. Bionl takes this a step further by providing a platform where healthcare professionals can conduct intricate drug discovery research without having to write a single line of code.


Evolutionary Biology and Conservation

The study of evolution and biodiversity has been greatly enhanced by bioinformatics, allowing for more nuanced understanding and better conservation efforts. Bionl aligns with these scientific goals by providing an intuitive platform that can handle complex evolutionary biology analyses, thereby contributing to both academic research and practical conservation efforts.

The Bioinformatics Workflow

Collecting the Data

The first stage in any bioinformatics project is the gathering of biological data, which can range from genomic sequences to protein structures. Platforms like Bionl make this step easier by offering tools that simplify the collection and organization of large data sets, even for those who may not have a background in data science.

Crunching the Numbers

Once the data is collected, the computational heavy lifting begins. This involves the use of specialized algorithms to analyze and interpret the data. Bionl excels in this area by offering a user-friendly interface that employs natural language processing to translate user queries into actionable analyses.

Interpreting the Data

The final step involves making sense of the results. The raw data and the computational analyses must be translated into meaningful conclusions that can drive further research or be applied in practical applications. With Bionl , this interpretation is made simpler through the use of intuitive prompts, which guide the user in understanding the outcomes of their analyses.

Challenges and the Road Ahead

Ethical Conundrums

Bioinformatics, despite its many advantages, raises ethical questions concerning data privacy and the potential misuse of genetic information. Bionl is cognizant of these challenges and is committed to fostering ethical practices in bioinformatics research.

Technological Hurdles

As biological data continues to grow exponentially, so do the challenges associated with storing and interpreting this information. Bionl is addressing these challenges head-on by offering scalable solutions that adapt to the ever-increasing demands of bioinformatics research.

Bioinformatics is a rapidly evolving field that holds the promise of revolutionizing our understanding of biology, medicine, and even the very fabric of life itself. Platforms like Bionl.ai are pivotal in this revolution, offering innovative platforms that make bioinformatics accessible to a wider audience. Their mission of empowering scientists and healthcare professionals through natural language prompts is not just an incremental improvement; it's a paradigm shift that holds the promise of accelerating research and discovery in unprecedented ways.

big data in bioinformatics research



Big Data in Bioinformatics

Biology is becoming increasingly data-intensive as high-throughput genomic assays become more accessible to greater numbers of biologists. Working with large-scale data sets requires user-friendly yet powerful software tools that stimulate users' intuition, reveal outliers, detect deeper structures embedded in the data, and trigger insights and ideas for new experiments. We are interested in developing tools and techniques that help biologists manage, integrate, visualize, analyze, and interpret large-scale data sets from genomics.

Tools developed by our Faculty:

  • Alternative splicing database
  • Bioviz – Integrated Genome Browser, integrative visual analytics for next-generation genomics
  • CressExpress – gene networks analysis in Arabidopsis
  • GleClubs – Global Ensemble Clusters of Binding Sites, algorithm for predicting cis-regulatory binding sites in prokaryotes
  • PlantGDB – Resource for comparative plant genomics
  • Supramap – Web application for integrating genetic, evolutionary, geospatial, and temporal data


Faculty in this Research Area:

  • Dr. Daniel Janies , Carol Grotnes Belk Distinguished Professor
  • Dr. Ann Loraine , Associate Professor
  • Dr. Mindy Shi , Assistant Professor
  • Dr. Weijun Luo , Associate Professor
  • Dr. ZhengChang Su , Assistant Professor

Big Data in Bioinformatics and Computational Biology: Basic Insights

Affiliation.

  • 1 University Institute of Biotechnology, Chandigarh University, Mohali, Punjab, India.
  • PMID: 37803117
  • DOI: 10.1007/978-1-0716-3461-5_9

The human genome was first sequenced in 1994. It took 10 years of cooperation between numerous international research organizations to reveal a preliminary human DNA sequence. Genomics labs can now sequence an entire genome in only a few days. Here, we discuss how the advent of high-performance sequencing platforms has paved the way for Big Data in biology and contributed to the development of modern bioinformatics, which in turn has helped to expand the scope of biology and allied sciences. New technologies and methodologies for the storage, management, analysis, and visualization of big data have proven necessary. Modern bioinformatics must deal not only with the challenge of processing massive amounts of heterogeneous data but also with different ways of interpreting and presenting results, as well as with the use of different software programs and file formats. This chapter attempts to present solutions to these problems. To store massive amounts of data and return search queries within a reasonable time, new database management systems beyond relational ones will be necessary. Emerging advanced programming approaches, such as machine learning, Hadoop, and MapReduce, aim to make it easy to construct one's own data-processing scripts and to address the diversity of genomic and proteomic data formats in bioinformatics.
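To make the MapReduce idea concrete without a Hadoop cluster, here is a self-contained toy version of the pattern the chapter refers to: a map step emits (k-mer, 1) pairs from each sequencing read and a reduce step sums them. The reads are invented; on a real cluster the two functions would run on many distributed workers.

```python
# Toy MapReduce pattern for counting k-mers in sequencing reads.
from collections import defaultdict
from itertools import chain

reads = ["ATGCGT", "GCGTAA", "TTATGC"]  # invented toy reads
K = 3

def map_read(read):
    """Map step: emit (k-mer, 1) for every k-mer in one read."""
    return [(read[i:i + K], 1) for i in range(len(read) - K + 1)]

def reduce_counts(pairs):
    """Reduce step: sum the counts for each k-mer key."""
    totals = defaultdict(int)
    for kmer, count in pairs:
        totals[kmer] += count
    return dict(totals)

kmer_counts = reduce_counts(chain.from_iterable(map_read(r) for r in reads))
print(sorted(kmer_counts.items(), key=lambda kv: -kv[1])[:5])
```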

Keywords: Big data; Bioinformatics; Genome sequencing; Hadoop; Mapreduce; NGS.

© 2024. The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature.

  • Computational Biology / methods
  • Genomics / methods
  • Proteomics*


Advances in Bioinformatics, pp 271–277

Role of Bioinformatics in Data Mining and Big Data Analysis

  • Santosh Kumar Mishra
  • Avinash Singh
  • Krishna Bihari Dubey
  • Prabir Kumar Paul
  • Vijai Singh
  • First Online: 06 February 2024


In the past few decades, tremendous growth in biological data has been reported owing to developments in genomics, proteomics, microarrays, and biomedical imaging. These biological data are increasing rapidly, but because of the limited tools and techniques available, the scientific community has been able to extract relevant information from them only to a limited extent. With advancements in information technology, data mining and big data analysis tools are being used to generate significant results from biological databases and to enrich bioinformatics knowledge for storing, analyzing, and utilizing these data. With the help of data mining techniques and models, it has become possible to identify novel patterns in large-scale biological data, shifting the focus of the research community towards data-driven discovery. In this chapter, we give a brief insight into different processes for exploring biological data, aiming to establish a bridge between data mining techniques and bioinformatics.
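As a small, hedged example of the pattern discovery described here (the expression matrix below is synthetic and not from the chapter), unsupervised clustering can separate samples by their expression profiles without any prior labels:

```python
# Cluster synthetic samples by expression profile with k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# 10 samples x 50 genes: two hidden groups with shifted expression.
group_a = rng.normal(0.0, 1.0, size=(5, 50))
group_b = rng.normal(2.0, 1.0, size=(5, 50))
expression = np.vstack([group_a, group_b])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(expression)
print(labels)  # samples from the two hidden groups fall into separate clusters
```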




Author information

Authors and affiliations.

Department of Life Sciences, Sharda School of Basic Sciences and Research, Sharda University, Greater Noida, Uttar Pradesh, India

Santosh Kumar Mishra

Department of Biotechnology, Meerut Institute of Engineering & Technology, Meerut, Uttar Pradesh, India

Avinash Singh

Department of Computer Science and Engineering, ABES Institute of Technology, Ghaziabad, Uttar Pradesh, India

Krishna Bihari Dubey

Department of Biotechnology, Manav Rachna International Institute of Research and Studies, Faridabad, Haryana, India

Prabir Kumar Paul

Department of Biosciences, School of Science, Indrashil University, Mehsana, Gujarat, India

Vijai Singh


Editor information

Editors and affiliations.

Department of Biosciences, Indrashil University, Mehsana, Gujarat, India

Biotechnology, Rama University, Kanpur, Uttar Pradesh, India


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter.

Mishra, S.K., Singh, A., Dubey, K.B., Paul, P.K., Singh, V. (2024). Role of Bioinformatics in Data Mining and Big Data Analysis. In: Singh, V., Kumar, A. (eds) Advances in Bioinformatics. Springer, Singapore. https://doi.org/10.1007/978-981-99-8401-5_14


DOI: https://doi.org/10.1007/978-981-99-8401-5_14

Published: 06 February 2024

Publisher Name: Springer, Singapore

Print ISBN: 978-981-99-8400-8

Online ISBN: 978-981-99-8401-5





Can Big Data Have a Role in Treating Dementia? That’s What This Northeastern Student Is Hoping to Help Solve

Dementia is a devastating condition that impacts  more than 55 million people  globally, according to the World Health Organization. In the United States alone, it’s estimated that one in nine people over the age of 65 has Alzheimer’s.

Conditions like Alzheimer’s and other diseases that cause cognitive impairments can be difficult to treat. Early symptoms are often subtle and may go undetected by medical professionals. And as these diseases progress, they become ever harder to manage. Individuals begin to lose the ability to speak, think, and move before eventually succumbing to the disease.

Ethan Wong, a fourth-year student at Northeastern University, will use the power of “big data” biology and neuroscience to help develop better early intervention models for those suffering from cognitive impairments when he starts his studies at Churchill College at Cambridge University this fall.

Wong is  one of just 16 individuals  around the globe this year to be honored with Cambridge University’s Churchill Scholarship. The illustrious honor was created by Churchill College at the request of Sir Winston Churchill when the college was founded in 1960, according to its website.

This isn’t Wong’s first award. He was  also a 2023 recipient  of the Barry Goldwater Scholarship, which recognizes students pursuing research in math, natural science and engineering.

“If I can do research that gives people one or two extra years to be a father, a mother or a grandparent, I think that’s super worth fighting for,” he says.

Wong, who is set to graduate in May with a major in biology and a minor in data science, has spent his college career at Northeastern University doing research in the neuroscience domain.

He started at Northeastern in 2020 and quickly began doing research at the university’s Laboratory for Movement Neurosciences, learning under professors Gene Tunik and Matthew Yarossi.

Wong focused his studies on the Trail Making Test (TMT), which clinicians use to assess cognitive function in patients.

“It’s a connect-the-dots test,” Wong says. “But the unique thing about connecting the dots, is that it is both a cognitive task and a motor task. You have to not only see what the next number is and remember what it is, but you also have to move your hand.”

One of Wong’s projects involved developing a variation of the TMT that involves physical objects.

“We actually set up two shelves with cans on them, and they were labeled one through 10,” he said. “We also put grocery items on the shelf and the task was for people to take the items off the shelf as quickly as possible in the correct order.”

The project  earned him a PEAK award  from Northeastern University in 2021.

Wong has also completed a co-op at Beth Israel Deaconess Medical Center, working as a patient care tech.

Comput Struct Biotechnol J (PMC10582761)

Metadata integrity in bioinformatics: Bridging the gap between data and knowledge

Aylin Caliskan

a Department of Bioinformatics, Biocenter, University of Würzburg, 97074 Würzburg, Germany

Seema Dangwal

b Stanford Cardiovascular Institute, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305-5101, United States

Thomas Dandekar

Associated data.

All data of this manuscript are contained in the paper and its supplementary material. Moreover, the case study example used for illustration has already been deposited in the bioRxiv archives (BIORXIV-2021-473021v1-Dandekar.pdf). The discussed RNA-sequencing data were made publicly available by the respective research groups and can be accessed via the GEO database: GSE147507 (available online at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE147507) for the data published by Blanco-Melo et al. (2020) [50] and GSE155241 (available online at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE155241) for the data published by Han et al. (2021) [51].

In the fast-evolving landscape of biomedical research, the emergence of big data has presented researchers with extraordinary opportunities to explore biological complexities. In biomedical research, big data also imply a big responsibility, not only because genomics data are sensitive information but also because they are shared and re-analysed within the scientific community. Such reuse saves valuable resources and can even yield new insights in silico. To fully exploit these opportunities, detailed and correct metadata are imperative: metadata must not only be readily available but also be meticulously curated and ideally error-free. Metadata integrity is a fundamental determinant of research credibility, supporting the reliability and reproducibility of data-driven findings. Ensuring metadata availability, curation, and accuracy is therefore essential for bioinformatic research. Motivated by the accidental discovery of a critical metadata error in patient data published in two high-impact journals, we aim to raise awareness of the need for correct, complete, and curated metadata. We describe how the metadata error was found and addressed, and we present examples of metadata-related challenges in omics research, along with supporting measures, including tools for checking metadata and software that facilitates the various steps from data analysis to published research.

Graphical Abstract


  • Data awareness and data integrity underpin the trustworthiness of results and of any subsequent analysis.
  • Big data and bioinformatics enable efficient resource use by repurposing publicly available RNA-sequencing data.
  • Manual checks of data quality and integrity are insufficient given the overwhelming volume of rapidly growing data.
  • Automation and artificial intelligence provide cost-effective and efficient solutions for data integrity and quality checks.
  • FAIR data management and various software solutions and analysis tools assist metadata maintenance.

1. The effect of reusing data on science

Reusing data can accelerate scientific progress, but it depends on sound data, accessible raw data and correct metadata. Of course, science does not progress solely by reusing data: many novel discoveries, findings and insights are made every year, and technological progress keeps opening new opportunities. Yet the data generated by these technologies, which accumulate rapidly and increasingly become publicly available, can contribute to and perhaps even accelerate scientific advancement, because researchers can reuse existing data for preliminary analyses and thereby save valuable time. The recent COVID-19 pandemic demonstrated the benefits of next-generation sequencing methods [1], of working together [2] and of sharing data [3]. The decision to make the initial genome sequence of SARS-CoV-2 accessible to others by uploading it to an open-access database as early as January 2020 set a data-sharing precedent that significantly contributed to subsequent research [3].

However, research on global pandemics is not the only area profiting from publicly available research data. In omics research, scientists can save time and resources and even cross-check or validate their results by reusing and reanalysing available data. This can speed up research, since it can replace part of the laboratory work: analysing existing data can, for instance, lead to a new research idea that is already backed by the reanalysis results and might therefore be more likely to prove successful in subsequent laboratory experiments. In 2012, Kodama et al. demonstrated the advantages of this approach by first performing expression-based genome-wide association studies (eGWAS) on existing microarray data to find a suitable target for treating diabetes: CD44 [4]. In their subsequent laboratory research, Kodama et al. (2012) proved the importance of the immune-cell receptor CD44 in the pathogenesis of diabetes in both mice and humans [4]. The value of data reuse is also illustrated by publicly available GEO datasets such as the “Predicting age from the transcriptome of human dermal fibroblasts” dataset (GSE113957) by Fleischer et al. (2018) [5]. The dataset contains RNA-sequencing data of human fibroblasts donated by 113 “apparently healthy” individuals of all ages between one and 96 years and by ten progeria patients. It was published in November 2018 and had been reused and cited in 13 PubMed-indexed studies by May 2022 [6]. About a year later (March 2023), the number of studies listed on PubMed that reported the use of the GSE113957 dataset had increased by six. Thus, the dataset has been reused almost 20 times in less than five years, and that is only counting studies listed in PubMed. The number of studies reusing or repurposing the dataset might be considerably larger, since it might have been reused in studies that are either not indexed in PubMed or not yet published.

This demonstrates the growing importance of reliable, publicly available datasets in databases such as the Gene Expression Omnibus (GEO) [7], [8], whose value was brought to the forefront of attention a decade ago [9]. In this era of available and re-analysable data, data integrity has become ever more critical, not only for the original researchers’ own analyses and results but also for every subsequent analysis performed with the data. An essential aspect of this is the metadata: this additional information can increase the value of the data by adding further details. Incorrect metadata, however, can have the opposite effect, since wrong information in the metadata can lead to inaccurate or even false research results. Therefore, in this review, we aim to raise awareness of the importance of metadata, of possible metadata errors and their potential consequences, and of the great responsibility researchers bear in ensuring the fidelity of their published data.

2. Metadata and their importance for research

The growing importance of metadata and the need for metadata management in research were already recognised twenty years ago, as a 2004 review by Sen on the evolution, current state and future of metadata indicates [10]. Additionally, metadata are an integral part of the Semantic Web [11], which was described in a 2001 Scientific American article by Berners-Lee et al. [11], [12] and was envisioned to enhance the World Wide Web by providing machine-understandable information via metadata [13] across the different layers of the Semantic Web [11], [12], [14]. Uniform Resource Identifiers (URIs) are metadata and a significant base-layer component of the Semantic Web; they function similarly to International Standard Book Numbers (ISBNs) [11], with Uniform Resource Locators (URLs) being the most common type of URI [12]. The subsequent layers of the Semantic Web employ technologies that were already available [12]: the eXtensible Markup Language (XML), which allows adding tags or hidden labels [12], and, in the layer above, the Resource Description Framework (RDF), which uses URIs to encode information in triples [12]. These triples are comparable to elementary sentences consisting of a subject, a verb, and an object [12]. This directed, labelled graph data format represents information and metadata, and its specifications define the syntax and semantics of the SPARQL Query Language for RDF [15], which was first introduced via a W3C Candidate Recommendation in 2004 and has subsequently been updated several times [16]. The other layers of the Semantic Web also require metadata to enable the envisioned agents, which were meant to work much like a personal assistant [12], to function [11]. They include the ontology vocabulary, which has been described in detail by Berners-Lee et al. (2001) [11], [12] and allows agents to interpret and use the data. Two decades later, digital assistants such as Alexa and Siri rely on Semantic Web resources to provide structured content [17]. The core principles of the Semantic Web, such as standardised metadata and ontologies, are also crucial for research. In research data management (RDM), metadata are the foundation for making data findable, accessible, interoperable and reusable (FAIR) [18]. These criteria, also referred to as the FAIR Data Principles or FAIRness, were designed, described and introduced by Wilkinson et al. (2016) as guidelines to enhance the reusability of scholarly data [19] and to facilitate sharing, exploring and reusing existing research data [18].
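Because subject-predicate-object triples can be hard to picture from prose alone, the following minimal Python sketch, built with the rdflib library and a hypothetical example.org vocabulary, shows how a GEO sample could be described as RDF triples; the predicate names and the use of the accession URL are assumptions made purely for illustration.

```python
# A minimal sketch of RDF triples (subject, predicate, object) describing a
# GEO sample, using the rdflib library; the predicate vocabulary is
# illustrative, not an established ontology.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/terms/")   # hypothetical vocabulary
sample = URIRef("https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4462413")

g = Graph()
g.add((sample, EX.organism, Literal("Homo sapiens")))  # triple: subject, predicate, object
g.add((sample, EX.tissue, Literal("lung")))

# Serialise the graph as Turtle, one common machine-readable RDF format.
print(g.serialize(format="turtle"))
```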

After the original draft, each of the four principles has been refined in the 2016 article introducing the FAIR Guiding Principles by Wilkinson et al. [19]; the refined principles are also part of the introduction of the FAIR Cookbook by Rocca-Serra et al. (2023) [20] and are summarized in Table 1.

The refined FAIR Guiding Principles, as published by Wilkinson et al. (2016) [19] and in the FAIR Cookbook by Rocca-Serra et al. (2023) [20] . Slightly adapted from Wilkinson et al. (2016) [19] .

The importance of machine-readable (meta)data, which is emphasised by the FAIR Guidelines, had also been recognised by earlier concepts, for instance the Semantic Web [21], and is also an aspect of the 5 Star Linked Data Principles [22], [23] for Linked (Open) Data, which will be elaborated below.

However, even in 2022, the term “metadata” is not clearly defined; instead, a variety of definitions, standards, contexts and formats exist [24]. In fact, according to Furner (2019), there are 96 separate ISO standards and 46 different definitions for the term “metadata” [25]. Additionally, the term is used both for “data about data content”, also termed “descriptive metadata”, and for “data about data containers”, so-called “structural metadata” [25]. Furthermore, it has also been suggested to expand the definition of metadata to the structured and standard part of documentation, to relate the creation of metadata to the spiral model used in software development, and to take into account the importance of structured and standard documentation during the extended data life cycle [26].

A general description could be “data about data” [10], [25]. Sen (2004) [10] explains this using the example of measuring the length of a 5 ft stick: in this case, the data is the number 5, while the information on the measurement (what was measured and in which unit?) is regarded as metadata [10]. This perfectly elucidates the importance of metadata: the values “stick” and “5” are of little use without the additional information that the stick’s length was measured in feet. The same is true for omics data such as RNA-sequencing data. Knowing the sample name and the counts of the expressed genes is often not sufficient; most analyses require more information, such as the species or the condition of the sample. Additionally, further information can be of great interest, including age, sex, and, especially for samples derived from human donors, the general health status of the donor and possible comorbidities. Regarding the reuse of omics data, there is a plethora of information that might not be of interest to the researchers who created the dataset but could be included to facilitate further research.

The type of experimental data to be curated also allows the required metadata about the experiment to be ranked, so that irrelevant information can be dropped and critical information can be collected systematically via workflows, input masks or data fields.

In order to fully use the potential of publicly available omics data for secondary research, an appropriate annotation is recommended [27] . According to Rajesh et al. (2021), this includes a complete description of the sample type, details on the sample preparation, such as the collection procedure and extraction and assay methods, as well as relevant clinical phenotypes [27] . Additionally, summarized or processed data should be accompanied by metadata containing details about the computational pipeline, for instance the annotation, including which genome build, which gene annotation provenance and which release have been used with which software arguments and versions [27] . The authors also point out that a lack of complete annotations might have a negative impact on follow-up studies intending to reuse the data [27] .

Furthermore, improper annotation and incomplete metadata compromise the reproducibility of the original results [27] . Despite the efforts of the biomedical community to share omics data, these efforts are hindered by the lack of consistency among researchers in ensuring the completeness and complete availability of accompanying metadata for raw omics data [27] . Therefore, Rajesh et al. (2021) highlight the need for proper annotation and applying the FAIR principles (Findable, Accessible, Interoperable, Reusable), which have been introduced by Wilkinson et al. in 2016 [19] . They also emphasise the importance of accurate, complete and consistent metadata and a standardised format for both raw data and metadata, which also implies submitting at least a predetermined minimum of clinical phenotypes, including tissue type, age, sex and ancestry [27] .

In their assessment of open transcriptomics data across 29 studies, Rajesh et al. (2021) found that on average only 65% of the nine clinical phenotypes they examined were shared publicly, with completeness ranging from 83.3% down to 38.9%; the roughly 35% loss of information between publication and repository accounted for a loss of about 45.7% of the total data between the publication and the corresponding publicly available repository entry [27]. They also stress the importance of rigorous standards for sharing metadata in public repositories to prevent errors caused by the laborious and error-prone approach of scraping metadata from the publication [27]. In summary, metadata need to be open, complete, freely accessible, standardised and stored in an easy-to-use format to allow other scientists to reproduce the findings of the original publication, enable data reuse, and maximize the utility of the shared data [27].
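To make the notion of metadata completeness concrete, the following sketch computes the percentage of clinical phenotype fields that actually carry a value across a set of sample records; the sample records and field names are invented for illustration and are not Rajesh et al.’s data.

```python
# Sketch: measure metadata completeness across samples for a set of
# clinical phenotype fields. Sample records and field names are invented.
PHENOTYPE_FIELDS = ["tissue_type", "age", "sex", "ancestry", "disease"]

samples = [
    {"tissue_type": "lung", "age": 67, "sex": "male", "ancestry": None, "disease": "COVID-19"},
    {"tissue_type": "lung", "age": None, "sex": None, "ancestry": None, "disease": "healthy"},
]

def completeness(samples, fields):
    """Fraction of (sample, field) slots that carry a non-empty value."""
    filled = sum(1 for s in samples for f in fields if s.get(f) not in (None, ""))
    return filled / (len(samples) * len(fields))

print(f"Metadata completeness: {completeness(samples, PHENOTYPE_FIELDS):.0%}")  # 60% here
```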

An important aspect is labelling or sorting the information in the metadata. Ideally, these labels should be standardised to make reusing the data easier. For instance, “tissue type” could be written in several ways, such as “tissue type”, “tissue_type”, “TissueType”, “tissue-type” or just “tissue”, and these differences might complicate automatic data processing. This becomes even more important for terms that are often used interchangeably although their meanings differ, such as “race”, “ethnicity”, and “ancestry” [27]. Owing to its negative connotation and frequent misuse, it might be best to avoid the term “race” altogether, as Rajesh et al. do in their 2021 publication on omics metadata [27].
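As a rough illustration of how such label variants could be harmonised automatically, the following sketch maps differently written field names onto one canonical key; the canonical vocabulary is an assumption made up for this example, not a community standard.

```python
import re

# Hypothetical canonical vocabulary for metadata field names.
CANONICAL_FIELDS = {"tissuetype": "tissue_type", "tissue": "tissue_type",
                    "sex": "sex", "age": "age", "ancestry": "ancestry"}

def normalise_key(raw_key: str) -> str:
    """Map 'Tissue Type', 'tissue-type', 'TissueType', ... onto one canonical name."""
    squashed = re.sub(r"[\s_\-]+", "", raw_key.strip().lower())
    return CANONICAL_FIELDS.get(squashed, squashed)  # fall back to the squashed form

for raw in ["tissue type", "tissue_type", "TissueType", "tissue-type", "tissue"]:
    print(raw, "->", normalise_key(raw))   # all map to 'tissue_type'
```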

This could be implemented by offering predefined metadata categories with defined names, together with a definition and an explanation of the information expected in the respective category, much as Rajesh et al. (2021) define and explain their use of the term “ancestry”. Additional strategies might be the use of URIs to embed metadata and the use of WordNet synsets containing cognitive synonyms. The Princeton WordNet, which was started in the 1980s [28], links English words (nouns, verbs, adverbs, and adjectives) to sets of synonyms, which are linked via semantic relations that determine the word definitions [29]; it has become a widely used tool in natural language processing (NLP) [28]. Additionally, WordNets have been created for various other languages [28], and WordNet synsets can even be generated automatically for different languages, including languages with poor resources or endangered languages [30].

Another example of rapidly growing data in need of maintenance and curation are ontologies. By defining concepts, objects, their properties and their relations, ontologies map different types of knowledge and knowledge categories onto the data. Ontologies are used to model scientific fields in order to facilitate computational processing of free text and to define a vocabulary for standard data formats [31]. As formal representations of ideas, concepts or objects and their relationships, ontologies are often used as controlled vocabularies in requirements engineering during software development [32]. Controlled vocabularies are defined as organized collections of terms with well-known and determined meanings, without duplicates; they facilitate classifying, querying and retrieving data, and their usefulness in requirements engineering has been demonstrated [32]. The use of ontologies in the bioinformatics area of proteomics has been described in detail by Mayer et al. (2014), covering the standardised formats and ontologies used in proteomics, the ontology formats and appropriate software, and the use of controlled vocabularies in the Human Proteome Organisation-Proteomics Standards Initiative (HUPO-PSI) [31].

However, researchers face challenges when their work requires combining ontologies [33] . This is due to the multitude of different overlapping ontologies, which are used to annotate, organise and analyse data generated by biological experiments and harmonize information of biological knowledge bases but vary in completeness and quality [33] . Hence, the Open Biological and Biomedical Ontologies (OBO) project was founded for organising and guiding the development of ontologies based on shared standards and principles [33] . The Ontology Metadata Vocabulary was created to set metadata standards and to increase the FAIRness of ontology databases, thereby enabling access and reuse of ontologies [33] . All of the ontologies within the OBO Foundry have to fulfil certain requirements and principles, which include shared standards for the interrelation of terms [33] . These principles are stewarded by a team of volunteers that also takes care of various other duties, including metadata curation and maintaining the site [33] . Recently, the OBO Foundry principles were operationalized, and the huge task and the significant community effort involved in re-curating the ontology metadata have been described in detail by Jackson et al. (2021) [33] .

Additionally, there are other organizations aiming to provide access to multiple ontologies, such as the National Center for Biomedical Ontology (NCBO), which offers users a uniform mechanism to access a variety of ontologies in different formats, including the Open Biological and Biomedical Ontologies (OBO) format and the Web Ontology Language (OWL) format [34].

Other organisations, such as the National Research Data Infrastructure (NFDI, for “ Nationale Forschungsdaten Infrastruktur ”), aim to standardise and harmonise terminologies and identifiers within their infrastructure, for instance by funding initiatives such as the Persistent Identifier Services for the German National Research Data Infrastructure (PID4NFDI) and the Terminology Services 4 NFDI (TS4NFDI) [35] . Additionally, they actively engage with and provide feedback on EU data legislation, such as the EU Data Act, with the objective of refining legal parameters regarding data access, management, and usage to further scientific research and innovation [36] . This aims to increase FAIRness and data accessibility since 80% of industrial data is currently not being used due to various barriers, such as technical, legal and economic barriers [36] .

The European life sciences infrastructure ELIXIR even offers a FAIR Cookbook (available at https://faircookbook.elixir-europe.org/content/home.html ), an online resource for the Life Sciences offering help and assistance for making and keeping data FAIR [37] . The FAIR Cookbook includes information about the FAIR principles, and various recipes to achieve and optimise Findability, Accessibility, Interoperability, Reusability, Infrastructure, and Assessment [37] .

Ensuring that the additional information is attached to the correct sample is important not only for the original research but also for future research, both for studies reusing the data and for studies citing the results obtained with the data. The possible impact of wrong conclusions and the subsequent multiplication of errors, as well as the reluctance to publish results seen as “negative” or results that might challenge established practices, have already been eloquently described by Ioannidis (2010) [38].

3. Different types of metadata

There are not only numerous definitions of metadata but also various types of metadata. The different types of metadata have recently been described by Ulrich et al. (2022), who found 23,233 records for the keyword “metadata” and selected 551 of these records by using suitable keywords and removing duplicates, which were subsequently screened [24]. This resulted in a total of 81 records that were subsequently analysed by the researchers [24]. Considering that their selection was possibly biased, with the majority of the analysed papers coming from the field of bioinformatics [24], defining the term “metadata” across disciplines is probably even more complicated.

To help researchers decide which information they should or could add as metadata to their experimental data, the FAIR Cookbook contains a recipe for a metadata profile for different types of research data (chapter 11.5.1 of the current FAIR Cookbook, September 2023) [37]. The FAIR Cookbook provides several extensive but non-exhaustive lists of metadata suggestions for various analyses and differentiates between required and recommended metadata [37]. Table 2 and Table 3 summarise the suggestions for required and recommended metadata, respectively [37]; the complete recipe is available at https://faircookbook.elixir-europe.org/content/recipes/interoperability/transcriptomics-metadata.html#assay-metadata .

Summary of the FAIR Cookbook suggestions for required metadata (modified after chapter “11.5.1 Metadata profile for transcriptomics” of the current FAIR Cookbook, September 2023) [37].

Summary of the FAIR Cookbook suggestions for recommended metadata (modified after chapter “11.5.1 Metadata profile for transcriptomics” of the current FAIR Cookbook, September 2023) [37].

Among the required metadata are unique identifiers or short URIs (Uniform Resource Identifiers) [37], which are also part of the concept of the Semantic Web. Other required metadata fields include not only the more immediate considerations such as sample type, species and disease but also less intuitive parameters. These can include information on whether the sample was a biological or technical replicate (for assay metadata) or which computational method or algorithm was employed in the analysis [37].

Although the information in the recommended metadata fields is not strictly necessary for a re-analysis of the data, including as many of the recommended metadata fields as possible can facilitate the re-use of a dataset and help other researchers gain more insights when re-analysing the provided data.
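As a concrete illustration of how required and recommended fields might look for a transcriptomics sample, the following sketch assembles a metadata record as a plain dictionary and flags missing fields; the field names are an assumption loosely based on the categories mentioned above, not the FAIR Cookbook’s authoritative schema.

```python
# Sketch: a transcriptomics sample metadata record with required and
# recommended fields; the field lists are illustrative only.
REQUIRED = ["sample_id", "uri", "species", "sample_type", "disease",
            "replicate_type", "analysis_method"]
RECOMMENDED = ["age", "sex", "tissue", "collection_procedure",
               "genome_build", "software_versions"]

record = {
    "sample_id": "GSM4462413",
    "uri": "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4462413",
    "species": "Homo sapiens",
    "sample_type": "lung tissue",
    "disease": "healthy control",
    "replicate_type": "biological",
    "analysis_method": "RNA-Seq",
    "sex": "male",
    "genome_build": "GRCh38",
}

missing_required = [f for f in REQUIRED if not record.get(f)]
missing_recommended = [f for f in RECOMMENDED if not record.get(f)]
print("Missing required fields:   ", missing_required)      # ideally empty
print("Missing recommended fields:", missing_recommended)   # nice to fill in as well
```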

4. Data governance

Another important aspect of handling research data and metadata stewardship is data governance. In their 2019 review on data governance, Abraham et al. point out that the amount of data being created is increasing rapidly [39]: it was projected to grow from 4.4 zettabytes in 2013 to 44 zettabytes in 2020 [39], which equals 44 trillion (10^12) gigabytes (GB). This is enough storage space for about 6.3 trillion high-definition movies of about 7 GB each or roughly 9.4 trillion DVDs (4.7 GB each). Assuming each song needs about 5 megabytes of storage, this would be enough for roughly 8.8 quadrillion songs; played back to back at a few minutes per song, that corresponds to tens of billions of years of music, several times the estimated age of the universe of around 13.8 billion years [40].
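These back-of-the-envelope figures can be checked with a few lines of arithmetic; the song length of four minutes is an assumption added here for the playback estimate.

```python
# Rough arithmetic behind the 44-zettabyte comparison (all values approximate).
ZB = 10**21                      # bytes in a zettabyte
GB = 10**9
MB = 10**6

total_bytes = 44 * ZB
print(total_bytes / GB)                      # 4.4e13 -> 44 trillion GB
print(total_bytes / (7 * GB))                # ~6.3e12 HD movies of ~7 GB
print(total_bytes / (4.7 * GB))              # ~9.4e12 DVDs of 4.7 GB

songs = total_bytes / (5 * MB)               # ~8.8e15 songs of ~5 MB
minutes_per_song = 4                         # assumption for the playback estimate
years_of_music = songs * minutes_per_song / (60 * 24 * 365.25)
print(f"{songs:.1e} songs ~ {years_of_music:.1e} years of continuous playback")
```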

Abraham et al. (2019) also provide a working definition of the term “data governance” as a “cross-functional framework for managing data as a strategic enterprise asset”, which also specifies “decision rights and accountabilities for an organization’s decision making about its data” and “formalizes data policies, standards, and procedures and monitors compliance” [39]. Thus, data need to be managed in a way that maximises their value and controls data-related risks; challenges such as inaccurate and incomplete data and compliance issues need to be overcome. The corresponding conceptual framework has been described in detail by Abraham et al. (2019) [39]. The consequences or outcomes of data governance include a positive effect on data utilisation, increased data quality, and better management of data-related risks, owing to improved oversight of data quality and to risk-mitigating policies that reduce the likelihood of privacy or security breaches [39]. Additionally, it has been shown that organisations able to use their data effectively, for instance by tagging it with metadata, have advantages over their competitors [41], which demonstrates the importance of having and handling metadata correctly.

In the healthcare sector, the most important challenges are reliability and integrity, as they are related to life and death [42]. Especially sensitive data such as patient data have to be kept in secure places and should only be accessible to authorized parties, and criminal acts, such as the theft of personal medical histories, have to be prevented [42]. Thus, a conflict arises between collecting and using the data on the one hand and legitimate privacy concerns on the other. In healthcare, data collection, sharing and collaboration face challenges whenever patient consent is necessary [42]. Therefore, data governance policies need to address privacy, security, and accuracy as well as storage, usage and preservation inside the organization, data access and the data lifecycle [42]. Additionally, it is crucial to consider data standards and automation strategies in order to manage data effectively [42]. Data governance legislation, such as the new EU Data Governance Act, addresses such topics, including the creation and regulation of so-called “secure spaces” for sharing and reusing sensitive data such as health data for commercial and altruistic purposes, which also includes scientific research [43]. An important aspect of sharing biomedical data is access barriers, for instance the data protection principle of purpose limitation, which states that data can only be used for specific purposes [43]. This hinders the use of the data for multiple research purposes, as explicit consent is required for each downstream use [43]. Thus, the data-sharing infrastructure, secure data-sharing platforms and data governance need to be adapted to allow “further processing” and reuse of the data by other scientists [43]. At the same time, they need to ensure data protection and privacy, which can be achieved by ensuring that the data are only accessible to authorised users for authorised purposes [43]. A regulatory data governance framework for data-sharing infrastructure can facilitate the sharing of data and thus research [43], and the recent COVID-19 pandemic demonstrated the need for a robust data governance framework [43].

A possible approach to handling big data while it is being generated is described by Zimmerman et al. (2014) [44]. They describe how structural genomics centres use mechanisms to connect results into a unified system by employing laboratory information management system (LIMS) tools and central databases, for instance UniTrack, which unifies and curates data obtained by different laboratories [44]. Other tools, such as LabDB, can automatically or semi-automatically harvest data from laboratory equipment [44]. The reagent-tracking module of LabDB can track the use of reagents via unique barcodes [44]: when the barcodes are scanned during the preparation of a stock solution, a new unique barcode for the stock solution is created [44]. This barcode allows the origins of the chemicals to be traced and detailed information to be carried along the pipeline, providing much more detailed information about the contents of the stock solution than a hand-written label would [44]. Additionally, the data are linked to later steps, which allows determining whether unsuccessful experiments can be traced back to a certain reagent [44]. Systems such as these could also be used to connect metadata about the origin of a sample, e.g., detailed information about the donor, such as their health condition and possible comorbidities, their age, and other information that might be relevant for research. This would allow a sample’s history to be traced back to its origins and this information to be connected to further data such as sequencing results. Automatically uploading sequencing results and their metadata to a database would then become much easier, which might help keep the metadata correct and complete without adding more effort for the researcher.
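To make the barcode-based provenance idea tangible, here is a minimal sketch of a reagent and stock-solution registry in which every new item records the barcodes it was derived from; it is an invented toy model, not the LabDB or UniTrack data model.

```python
import uuid

# Toy provenance registry: each barcode maps to a record that lists the
# barcodes of the items it was prepared from (invented model, not LabDB/UniTrack).
registry = {}

def register(name, parent_barcodes=()):
    barcode = uuid.uuid4().hex[:8]          # stand-in for a printed barcode
    registry[barcode] = {"name": name, "parents": list(parent_barcodes)}
    return barcode

def trace(barcode, depth=0):
    """Recursively print the full provenance of an item."""
    record = registry[barcode]
    print("  " * depth + f"{barcode}: {record['name']}")
    for parent in record["parents"]:
        trace(parent, depth + 1)

nacl = register("NaCl, lot A123")
water = register("Ultrapure water, batch 7")
stock = register("0.9% NaCl stock solution", [nacl, water])
sample = register("RNA sample, donor X (donor metadata attached here)", [stock])

trace(sample)   # walks from the sample back to the original reagents
```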

5. Good data governance

Another example of data quality criteria are the AHIMA characteristics of data quality by the American Health Information Management Association [45], which coincide with the FAIR principles and show how data-handling principles such as the FAIR principles might be implemented in the clinic (Fig. 1). The convergence between the FAIR principles and the AHIMA guidelines underscores the widespread recognition of data quality challenges in the field. The AHIMA guidelines can be seen as an essential checklist for practitioners, highlighting the specific qualities imperative for achieving optimal data quality. According to the AHIMA criteria, data need to be:

  • (1) accurate, i.e., correct and free of errors;
  • (2) accessible, so that the data are available when required;
  • (3) comprehensive, containing all required elements;
  • (4) consistent, meaning the data are “reliable and the same across the patient encounter”; for sequencing data, this could also mean that every sample of a dataset was prepared according to the same protocol, or it could advise a consistent use of categories for the accompanying metadata;
  • (5) current, which in a clinical setting means that every piece of information is up to date and which, adapted to research, emphasises that every step and every piece of additional information should be documented;
  • (6) clearly defined;
  • (7) granular, i.e., containing the appropriate level of detail;
  • (8) precise;
  • (9) relevant, defined as relevant to the purpose it was collected for, although additional, seemingly irrelevant information might be useful for other researchers reusing the data; and
  • (10) timely, which the AHIMA defines as entered promptly as well as “up-to-date and available within specified and required time frames” [45]; this is good laboratory practice while generating data and might prevent confusion or even labelling errors.

Fig. 1

Schematic representation of the FAIR principles juxtaposed with comparable guidelines such as the AHIMA characteristics and the 5 Star Linked (Open) Data Principles. FAIR principles = Findability, Accessibility, Interoperability and Reuse of data; AHIMA (American Health Information Management Association) guidelines for optimal data provision in the clinic. The 5 Star Linked (Open) Data Principles for the step-wise deployment of open data were suggested by Tim Berners-Lee, the inventor of the World Wide Web. Own figure.

While unique IDs/URIs are among the required metadata fields and are crucial for enhancing the findability of data, it is also important to consider other metadata fields that serve the specific needs of researchers trying to find datasets for particular analyses. During preliminary analyses, researchers might be interested in already available datasets for a certain tissue type, a specific disease or a defined age group. Since these search criteria are among the required or recommended metadata fields, an effective search should include them, and datasets containing the respective information will be easier to find. As rich metadata can enhance the findability of the data, metadata are an important aspect of data sharing; researchers can help others find their data by adding suitable metadata. Thus, metadata affect several aspects of research: in the original study, correct metadata underpin valid results; in subsequent research, correct (and ideally rich) metadata affect (1) the findability of the data and the results of database searches, and thus the reuse of the data, and (2) the work of others reanalysing the data.

During our data retrieval for a bioinformatics analysis project, we searched for human lung samples from individuals who were either healthy or infected with SARS-CoV-2. For this example, keywords such as “human lung”, “lung tissue”, “human”, “healthy”, “infected with SARS-CoV-2” and “SARS-CoV-2” would have made the data findable. However, samples tagged only with “COVID-19” or “Corona” would most likely not turn up among the search results when using only the aforementioned keywords. Keeping this in mind, researchers sharing their data should include as many suitable keywords as possible (e.g., “SARS-CoV-2”, “COVID-19”, “Corona”, “novel Corona Virus”, …). Additionally, in an ideal world, search algorithms should be able to find data related to these keywords correctly, ideally even if the keywords in the metadata are written slightly differently, contain a typo or are not the actual keyword but a synonym.
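One low-tech way to approximate such tolerant searching is to expand the query with a curated synonym list before matching it against dataset keywords; the synonym table and the abbreviated dataset tags below are hand-made assumptions for illustration, not a feature of GEO’s own search.

```python
# Sketch: keyword search over dataset metadata with a hand-curated synonym
# expansion (invented synonym table; real repositories use richer ontologies).
SYNONYMS = {
    "sars-cov-2": {"sars-cov-2", "covid-19", "corona", "novel corona virus"},
    "lung": {"lung", "human lung", "lung tissue", "pulmonary"},
}

# Dataset tags abbreviated for illustration.
datasets = {
    "GSE147507": {"human", "lung tissue", "COVID-19"},
    "GSE155241": {"human", "lung organoid", "SARS-CoV-2"},
    "GSE113957": {"human", "dermal fibroblasts", "aging"},
}

def expand(query_terms):
    expanded = set()
    for term in query_terms:
        expanded |= SYNONYMS.get(term.lower(), {term.lower()})
    return expanded

def search(query_terms):
    wanted = expand(query_terms)
    return [acc for acc, tags in datasets.items()
            if wanted & {t.lower() for t in tags}]

print(search(["SARS-CoV-2", "lung"]))   # finds both COVID-19- and SARS-CoV-2-tagged sets
```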

This example highlights an important challenge in generating FAIR data: the difficulty of aligning the FAIR principles with the human-centric aspects of data discovery. Musen et al. (2022) underscored that an important aspect of FAIR data is to ensure that metadata encompasses adequate descriptors enabling researchers to find datasets with satisfactory recall and precision [48] . They emphasised the need for machine-processable metadata templates guiding both researchers and data stewards in how community-specific metadata standards should be applied [48] . These templates should contain all necessary community-based details and standards that are required for consistent research metadata. By using such templates, researchers could streamline the process of adding metadata to the respective research data and, at the same time, ensure that the data remains compliant with the FAIR principles [48] . This also highlights the importance of data FAIRification and the effect of metadata on findability.
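A machine-processable template of this kind can be approximated with a JSON Schema that a submission pipeline validates records against; the schema and the sample record below are minimal, invented examples rather than any community standard.

```python
# Sketch: validating a metadata record against a machine-processable template
# using the jsonschema package; schema and record are invented examples.
from jsonschema import validate, ValidationError

TEMPLATE = {
    "type": "object",
    "required": ["sample_id", "species", "tissue", "disease", "sex"],
    "properties": {
        "sample_id": {"type": "string", "pattern": "^GSM[0-9]+$"},
        "species": {"type": "string"},
        "tissue": {"type": "string"},
        "disease": {"type": "string"},
        "sex": {"type": "string", "enum": ["female", "male", "unknown"]},
        "age": {"type": "integer", "minimum": 0},
    },
}

record = {"sample_id": "GSM0000001", "species": "Homo sapiens",
          "tissue": "lung", "disease": "healthy control", "sex": "female", "age": 62}

try:
    validate(instance=record, schema=TEMPLATE)
    print("Metadata record conforms to the template.")
except ValidationError as err:
    print("Metadata error:", err.message)
```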

6. The open data guidelines

The importance of data availability is not only highlighted by the FAIR Principles but also by other concepts, for instance the ‘Open Data’ principle, which refers to non-confidential and non-private data being made available via public means. Although the FAIR principles and open data share similarities, the concepts are not identical, as Jati et al. (2022) elaborated using Kingdon’s multiple streams model [21] .

According to the definition by Geiger and von Lucke, Open Data means making accessible “all stored data of the public sector which could be made accessible by the government in the public interest without any restrictions on usage and distribution” [21], [49]. The main goals of Open Data are that anyone can freely use, reuse and redistribute the data, and the maximisation of interoperability [21]. While Open Data aims at providing the public with access to data considered to be in the public interest and excludes confidential, private and classified data, the FAIR Principles were developed in the research environment and focus on challenges in data collection for research [21].

The FAIR aspects Findability and Accessibility are comparable to the aspect of availability in Open Data [21] . Findable data is defined as data that can easily be found by humans and machines, and these data should also be accessible to users, although FAIR does not specify the type of user accessing the data or the type of data being accessed [21] . In Open Data, the data being accessed is required to be non-confidential, as the focus is on making data available to the public. The FAIR Principles, on the other hand, also consider the need for data protection and access requirements. Therefore, FAIR data can be either public or confidential, while Open Data is required to be free from usage restrictions, non-private and non-confidential [21] .

Additionally, the FAIR Principles promote interoperability, for instance via machine-readable ontologies and metadata, which can be stored in formats also used in the Semantic Web (e.g., RDF, the Resource Description Framework for representing interconnected data on the web) [21]. Open Data does not specifically focus on interoperability, although the concept also emphasizes that “anyone should be able to use, reuse and redistribute the data” [21]. Additionally, the 5 Star Linked (Open) Data Principles emphasise enabling other users, both humans and machines, to utilise the data by advocating the use of machine-readable, non-proprietary data formats [22], [23]. The focus is on the data being available to everyone and reusable for any purpose [21], which can be summarized as redistribution neutrality; however, the structure or format of the data is not explicitly defined in Open Data [21]. This is comparable to the Reusability aspect of the FAIR Principles, which additionally requires data and metadata to be described for reuse and to be usable or replicable in different environments [21]. Note that findability of data via metadata is the basis for data retrieval but is, in practice, distinct from precision and recall: enriching the metadata improves the discoverability of the respective data in a database, but it does not necessarily change the precision and recall of the database search itself, whereas the results of a database search depend both on the available information (e.g., in the metadata) and on the search engine. One therefore has to work both on improving the metadata and on achieving fast recall and high precision, which depend on the curation and structure of the database.

While it is possible to achieve the goal of Open Data by applying the first three of the FAIR Principles, FAIR data is not necessarily “open” [21] . FAIR does not aim to make data accessible to the general public but invites data ownership by also considering possible access restrictions due to the nature of the data (e.g., sensitive data) and therefore includes that data users might need to be authorised and verified before accessing the data [21] .

6.1. Metadata errors are often only accidentally spotted and difficult to correct

Using two highly cited studies as an example, we intend to raise awareness of the importance of correct metadata and illustrate to the reader why measures to improve metadata (and the software handling it) and to check the correctness of the data in a timely manner are imperative for good scientific data.

6.2. Finding mistakes in metadata preservation by accident

Mistakes in metadata preservation happen, and they are often detected only by accident, when datasets are systematically compared. In our first example of incorrect metadata, we originally intended to study a scientific question, not the metadata: to find out whether or not alveolar epithelial cells react to SARS-CoV-2, we reanalysed suitable publicly available datasets, including the GEO datasets GSE147507 [50] and GSE155241 [51]. These datasets appeared in flagship publications:

The {"type":"entrez-geo","attrs":{"text":"GSE147507","term_id":"147507"}} GSE147507 dataset was generated during the research for the Cell publication “ Imbalanced Host Response to SARS-CoV-2 Drives Development of COVID-19 ” by Blanco-Melo et al. (2020) [50] and has been publicly available in GEO [7] , [8] since March 2020. The authors revealed a unique and inappropriate inflammatory response, defined by low levels of type I and III interferons juxtaposed to elevated chemokines and high expression of IL-6 compared to controls [50] . The other dataset, {"type":"entrez-geo","attrs":{"text":"GSE155241","term_id":"155241"}} GSE155241 , was generated during the research for the Nature publication “ Identification of SARS-CoV-2 inhibitors using lung and colonic organoids ” by Han et al. (2021) [51] and has been publicly available in GEO [7] , [8] since August 2020. Using alveolar type-II-like cells permissive to SARS-CoV-2 infection, the research group performed a high-throughput screen of approved drugs to identify entry inhibitors of SARS-CoV-2 compared to uninfected controls, such as imatinib, mycophenolic acid, and quinacrine dihydrochloride [51] .

6.3. Consequences of metadata errors in high-impact publications

If there is a metadata error, it spreads particularly rapidly if it occurs in high-impact publications. In our example, both high-impact publications have been widely cited: the Cell article by Blanco-Melo et al. [50], which was published online on May 15th, 2020, has been cited 2435 times (2425 times in CrossRef, 2417 times in Scopus, and 53 times in PubMed Central, as of April 2023) according to Cell’s PlumX Metrics, and the Nature article by Han et al. [51], which was published online on October 28th, 2020 and has been accessed 57k times according to Nature’s own metrics, has been cited 246 times in Web of Science and 267 times in CrossRef (as of April 2023).

6.4. Metadata errors can be identified by systematic comparison

For instance, during our analysis of the data from our examples (data derived from Blanco-Melo et al. (2020) [50] and Han et al. (2021) [51]), we found irregularities in the RNA-sequencing data of both publications: two of Han et al.’s human lung tissue samples (GSM4697983 and GSM4697984 from GSE155241) [51] appear to be precisely the same as two other, unrelated samples of human lung tissue (GSM4462413 and GSM4462414 from GSE147507) that were generated by Blanco-Melo et al. (2020) during research for their Cell publication [50]. This became obvious during the first steps of the analysis workflow, which was originally performed in 2020 and has been repeated with newer software and GENCODE versions, still yielding the same results. After preparing the human lung tissue RNA-sequencing data of both publications (GSE147507 [50] and GSE155241 [51]) with STAR alignment (version 2.7.10a) [52], using the comprehensive gene annotation (PRI) and the primary assembly genome sequence (GRCh38, PRI) of GENCODE version 39 [53], [54], and quantifying transcripts with RSEM (version 1.3.1) [55], the resulting files were analysed in RStudio. After importing the data via tximport (version 1.24.0) [56], we performed a DESeq2 analysis (version 1.36.0 [57], with apeglm version 1.18.0 [58]). In the resulting heatmap (Fig. 2A, generated using the R package pheatmap, version 1.0.12 [59]) and the principal component analysis (Fig. 2B, using DESeq2 [57]), the samples “control_1” and “control_3” as well as “control_2” and “control_4” appear to express precisely the same genes.

Fig. 2

Heatmap and principal component analysis visualizing the samples of both studies. (A) The similarity between “control_healthy_1” and “control_healthy_3”, as well as between “control_healthy_2” and “control_healthy_4”, became apparent while visualizing the gene expression of all samples as a heatmap. (B) The PCA indicates differences and similarities between the samples. GSM4462413 and GSM4697983, as well as GSM4462414 and GSM4697984, are superimposed. Three of the COVID-19-infected samples of the data generated by Han et al. (2021) [51] also cluster closely but are not superimposed. Red dots symbolize control samples (“healthy”, i.e. not COVID-19 infected, according to the metadata), and blue dots denote COVID-19-positive samples (according to the respective metadata). Own figure.

The principal component analysis (PCA) in Fig. 2B confirms and visualizes the similarity between the samples of “Publication_1” (the samples by Blanco-Melo et al. (2020) [50]) and “Publication_2” (the samples published by Han et al. (2021) [51]). GSM4462413 and GSM4697983, as well as GSM4462414 and GSM4697984, are superimposed because the respective samples have the same principal component values (see Supplementary Table 1). GSM4462413 and GSM4697983 cluster together and are recognised as one branch of the dendrogram, and GSM4462414 and GSM4697984 cluster together as another branch, resulting in two dendrogram branches for four samples. The similarity of the results was especially confusing as the respective metadata indicated that sample GSM4462413 (by Blanco-Melo et al. (2020) [50]) was obtained from a male donor, while the remarkably similar sample GSM4697983 (by Han et al. (2021) [51]) was obtained from a female donor (according to the metadata and to personal communication). Additionally, both research groups obtained their samples from different institutions: according to their publication, the Blanco-Melo group obtained their healthy lung tissue samples at Mount Sinai and their SARS-CoV-2-infected lung tissue samples as fixed samples from Weill Cornell Medicine [50], whereas Han and colleagues obtained all of their tissue samples (control and COVID-19 samples) from the Weill Cornell Medicine Department of Pathology [51]. Analysing the samples revealed that GSM4462413 and GSM4697983 as well as GSM4462414 and GSM4697984 show an identical count and sequencing read distribution, which can also be seen in the heatmap and the PCA (Fig. 2) and in the visualisation of the sex-specific gene expression (Fig. 3). Additionally, we checked each analysis step as well as the complete pipeline internally (two independent people from our US/German team) to verify that the results were precise and reproducible. We then directly compared the gene expression of the samples in question and contacted the respective authors and journals regarding the striking similarity between the individual samples. Our results indicated that the officially different samples were identical, which has since been confirmed and corrected by the respective authors, who updated the GEO entries: since January 5th, 2022, GSM4697983 is labelled as a reanalysis of GSM4462413 and GSM4697984 as a reanalysis of GSM4462414, respectively.
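Duplicated samples of this kind can be flagged programmatically by comparing count columns across studies before any downstream analysis; the sketch below uses pandas on a toy count matrix with invented values and is not the pipeline described above.

```python
import pandas as pd

# Toy gene-count matrix: rows = genes, columns = samples from two studies.
# In a real check, these would be the quantification outputs of both datasets.
counts = pd.DataFrame({
    "GSM4462413": [3, 412, 87, 1204],
    "GSM4462414": [5, 388, 95, 998],
    "GSM4697983": [3, 412, 87, 1204],   # identical column to GSM4462413
    "GSM4697984": [5, 388, 95, 998],    # identical column to GSM4462414
}, index=["XIST", "RPS4Y1", "ACE2", "ACTB"])

# Flag every pair of samples whose count vectors are exactly identical.
samples = counts.columns
duplicates = [(a, b) for i, a in enumerate(samples) for b in samples[i + 1:]
              if counts[a].equals(counts[b])]
print("Identical sample pairs:", duplicates)
```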

Fig. 3

Sex-specific gene expression in the samples of Blanco-Melo et al. (2020) [50] (accessions starting with GSM446…) and the samples of Han et al. (2021) [51] (accessions starting with GSM469…). (A) Control samples published by Blanco-Melo et al. (2020) [50] and Han et al. (2021) [51]. XIST is usually expressed in females, as it is responsible for the inactivation of the second X chromosome in females [60] (e.g., in GSM4697987, a correctly labelled female sample, indicated by the pink bar colour). The other genes are located in the male-specific region of the Y chromosome. Correctly labelled male samples are indicated by blue bars and express Y-specific genes. GSM4697982 and GSM4697983 (obtained from female donors according to the metadata) are wrongly labelled as female (indicated by the red bar colour and the yellow background). The identical gene expression of GSM4462413 and GSM4697983, and of GSM4462414 and GSM4697984, respectively, is highlighted by rectangular frames grouping the respective samples together. (B) COVID-19 samples published by Blanco-Melo et al. (2020) [50] and Han et al. (2021) [51]. Own figure.

6.5. Objectively identifying metadata on sex

Metadata on sex can be objectively verified in the raw data by looking at XIST and Y-chromosome-specific genes. To demonstrate this with our example (the approach works for any gene expression data in which the annotated sex is to be verified), we compare the sex-specific gene expression of the samples (Fig. 3). Since, according to the metadata, the samples in question were derived from three men and one woman, we compared the expression of the X-inactive specific transcript (XIST), which is responsible for the dosage equivalence of X-linked genes in both sexes and for the inactivation of the second X chromosome in females, and which is thus typically expressed in females [60].

Due to the striking similarity in XIST expression between the two officially male samples, GSM4462414 and GSM4697984, and the complete absence of XIST expression in the officially female sample GSM4697983, we analysed and compared the expression of XIST and of several genes located in the male-specific region of the Y chromosome (Ubiquitously Transcribed Tetratricopeptide Repeat Containing, Y-Linked (UTY), Ubiquitin Specific Peptidase 9 Y-Linked (USP9Y), Lysine Demethylase 5D (KDM5D), Eukaryotic Translation Initiation Factor 1A Y-Linked (EIF1AY), DEAD-Box Helicase 3 Y-Linked (DDX3Y), and Ribosomal Protein S4 Y-Linked 1 (RPS4Y1)) [61], [62]. For this second analysis, which again works for any gene expression data as an independent verification of male sex, we decided to include all human samples of both studies (Fig. 3). The samples in Fig. 3 are grouped by research group (rows) and labelled according to the sex given in the respective metadata, and they are further grouped into control and COVID-19 samples. The colour of the bars indicates the sex (where metadata and sex-specific gene expression correspond, blue indicates a correctly labelled male donor and pink a correctly labelled female donor) or an error (red bars where the information on the donor’s sex does not fit the sex-specific gene expression). Additionally, all samples that appear to be affected by a metadata error are highlighted by a yellow background, and the duplicated samples and the sample with the wrongly annotated sex are further emphasised by rectangular frames and labels indicating the error.
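This kind of consistency check is easy to automate: given a count matrix, one can call the expression-based sex per sample from XIST and Y-linked genes and compare it with the annotated sex. The sketch below uses invented toy counts, hypothetical sample names and an arbitrary expression threshold purely for illustration.

```python
import pandas as pd

# Toy counts (rows = genes, columns = samples); all values are invented.
counts = pd.DataFrame({
    "sample_male_ok":    {"XIST": 4,    "RPS4Y1": 350, "DDX3Y": 210},
    "sample_female_ok":  {"XIST": 1250, "RPS4Y1": 2,   "DDX3Y": 0},
    "sample_mislabeled": {"XIST": 3,    "RPS4Y1": 400, "DDX3Y": 180},
})
annotated_sex = {"sample_male_ok": "male",
                 "sample_female_ok": "female",
                 "sample_mislabeled": "female"}   # metadata claims female

Y_GENES = ["RPS4Y1", "DDX3Y"]
THRESHOLD = 50   # arbitrary toy cutoff separating "expressed" from "not expressed"

def infer_sex(sample: pd.Series) -> str:
    y_expressed = (sample[Y_GENES] > THRESHOLD).any()
    xist_expressed = sample["XIST"] > THRESHOLD
    if y_expressed and not xist_expressed:
        return "male"
    if xist_expressed and not y_expressed:
        return "female"
    return "ambiguous"

for name, column in counts.items():
    inferred = infer_sex(column)
    flag = "OK" if inferred == annotated_sex[name] else "CHECK METADATA"
    print(f"{name}: annotated={annotated_sex[name]}, inferred={inferred} -> {flag}")
```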

The samples with accessions starting with GSM469… are part of the dataset by Han et al. (2021) [51], and the samples starting with GSM446… are part of the dataset by Blanco-Melo et al. (2020) [50]; the sex is indicated according to the data available via the respective publications and the expression of sex-specific genes. The gene expression of the samples in question (indicated by black frames) is identical for all of the analysed genes, which is highly unlikely if the samples had been obtained from different individuals and sequenced on different sequencing platforms. After our inquiry regarding the similar samples, the authors demonstrated responsibility and rectified the situation: as of January 5th, 2022, GSM4697983 and GSM4697984 are labelled as reanalyses of GSM4462413 and GSM4462414, respectively, in the GEO database (indicated by black rectangles in Fig. 3). Thus, the striking similarity of the samples and of the raw data was indeed due to the samples in question being the same samples, which is now (since January 5th, 2022) indicated in the individual samples’ GEO entries. For our second analysis (depicted here), we used all human samples (COVID-19-infected and healthy controls) of both publications, instead of analysing only the similar samples, and compared the groups COVID-19-infected vs. healthy (as indicated in Fig. 2). The bar plots in Fig. 3 show the expression of XIST and the above-mentioned Y-specific genes. The third healthy sample of Han et al. (2021) [51], GSM4697982, which was obtained from a healthy female according to their metadata, shows a significantly lower XIST expression than GSM4697987, a sample obtained from a female COVID-19 patient according to Han et al.’s data [51]. Additionally, GSM4697982 expresses several Y-specific genes. Hence, we contacted the authors again. They responded immediately and asked for further information regarding our analyses, so we provided detailed information, including the bar plot shown in Fig. 3 and a list of the Y-specific genes. Since there has been no further email exchange (September 2023), we assume they are still checking their data. Recognising the authors’ diligent approach to the GEO database entries, we are confident that they will address and update the information on the reanalysed samples and the inaccurately annotated sex of their third control sample in their publication.

6.6. Identifying incorrect metadata and remaining doubts such as incorrect labels

Though the expression of XIST and Y-chromosome-specific genes allows reconstructing whether the sex is correctly labelled in the metadata, extensive analysis of the raw data is often required to reconstruct more complex attributes such as age or pathology. Hence, we advocate (i) the prevention of metadata errors through checks and input routines, (ii) the labelling of incorrect metadata, and, where necessary, (iii) the correction of the incorrect metadata. Additionally, (iv) independently audited metadata might receive a special annotation or seal of approval so that researchers can easily recognise reliable data.

The cited authors took corrective measures and updated two of the GEO database entries (on January 5th, 2022; no further changes, e.g., in the article itself, as of September 2023). The example illustrates this point: very often, no individual can be blamed for such an error, as the information needed to prevent it was lost long before. However, we should now discuss how often such errors occur and what difficulties await researchers trying to correct them.

6.7. Strategies for detecting and reducing metadata errors

Metadata errors happen more often than one might expect and should be reduced by comparison routines. Is an error in the metadata a rare incident? No: mistakes with metadata happen very often. Each time data are stored, links between data and metadata can be lost, or central metadata (experimental conditions, samples, numbers, clinical features) are either not entered or mislabelled. However, we can pinpoint mistakes like the one above only rarely, as this requires close comparisons of the control datasets of unrelated publications or, more generally, basic and more refined quality checks on transcriptome and other omics datasets across all public repositories. In the following, we provide a brief overview of such errors to stress the necessity of such checks.

6.8. Detecting and reducing nucleotide annotation errors

The notion that mistakes in data and metadata abound is correct, as can already be seen from basic mistakes such as nucleotide annotation errors [63], [64]:

Park et al. (2021) developed a semi-automated screening tool, Seek & Blastn, to detect nucleotide annotation errors. Alarmingly, they found 712 papers with wrongly identified sequences in five literature corpora: two journal corpora (7399 articles published in Gene (2007–2018) and 3778 published in Oncology Reports (2014–2018)) and three targeted, topic-specific corpora assembled using specific keywords (single-gene knockdown of 17 specific genes (174 articles across 83 journals), articles related to miR-145 (163 articles), and articles related to cisplatin or gemcitabine treatment (258 articles)), amounting to 11,772 articles in total [63]. According to Google Scholar, the 712 problematic articles had been cited 17,183 times by March 2021, including by clinical trials [63].

At the time of publication, up to 4% of the problematic papers in each corpus had already been cited at least once by clinical trials [63]. Park and colleagues also analysed the articles further and, based on the concepts contained within them, predicted that 15–35% of these problematic publications are likely to be cited in future clinical research [63]. Hence, there is serious concern that roughly a quarter of these publications will impair clinical research by misinforming or distracting the development of potential cures, especially as the majority of the problematic articles have remained uncorrected [63].

6.9. Detecting and reducing annotation errors regarding sex

Toker et al. (2016) [65] used human transcriptomics studies to compare the sex of the subjects annotated in the metadata with what they termed the "gene-sex", i.e., the sex of the subject determined by analysing the expression of the female-specific gene X-inactive specific transcript (XIST) and the male-specific genes Ribosomal Protein S4, Y-Linked 1 (RPS4Y1) and Lysine (K)-Specific Demethylase 5D (KDM5D) [65]. They analysed 70 human gene expression studies (containing a total of 4160 samples) in which the sample donors' sex was annotated [65]. Their analysis revealed that 46% (32 datasets) of the 70 datasets contained mislabelled samples, i.e., samples expressing genes they should not be able to express according to the information in the metadata [65].

As 29 of these datasets were associated with a publication, the authors took a closer look at the respective original studies to find out whether the incorrect annotation was solely due to a miscommunication while uploading the data to the GEO database. Twelve of the 29 studies provided enough information in the publication to show, alarmingly, that the discrepant sex labels had already been present in the publication itself and were therefore not introduced during the upload to the GEO database [65].

Finally, Toker and colleagues compared four datasets that used samples from the same collection of subjects. Although not all four studies analysed all of the available samples, Toker et al. (2016) reasoned that if the collection itself contained incorrect metadata, the resulting error should affect all four studies. However, while two of the analyses contained mismatched samples, the mismatched samples differed between the two datasets [65]. Additionally, the respective samples were correctly annotated in the other two studies, indicating that the samples had been mislabelled within the individual studies rather than during the recording of the subjects' sex in the collection [65].

Checking for gender-labelling errors might be relatively easy and cost-effective, as indicated by a method for predicting gender-labelling errors using X-chromosome SNPs described by Qu et al. (2011) [66]. Their method simultaneously accounts for the heterozygosity and the relative intensity of X-chromosome SNPs in candidate genes, requires no Y-chromosome data, and needs no additional space for gender-prediction SNPs in the genotyping set [66]. Using only nine X-chromosome SNPs in two candidate genes, they were able to accurately detect several sample switches [66]. Additionally, their prediction step requires no additional laboratory experiments and can be performed on various sample types [66].
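The underlying idea can be illustrated with a minimal sketch; note that this is not Qu et al.'s actual method, as it uses only the heterozygosity of X-chromosome genotype calls and ignores the intensity component, and the threshold is an arbitrary placeholder. Because males carry a single X chromosome, heterozygous calls at X-linked SNPs should be essentially absent in male samples.

from typing import Dict

def predict_sex_from_x_snps(genotypes: Dict[str, str], het_threshold: float = 0.2) -> str:
    """Predict donor sex from a handful of X-chromosome SNP genotype calls.

    genotypes: mapping of SNP identifier -> genotype call such as "AA", "AG", "GG".
    A high fraction of heterozygous calls suggests a female donor; the threshold
    is illustrative and not taken from Qu et al. (2011).
    """
    calls = [g for g in genotypes.values() if len(g) == 2 and "N" not in g.upper()]
    if not calls:
        return "unknown"
    het_rate = sum(g[0] != g[1] for g in calls) / len(calls)
    return "female" if het_rate >= het_threshold else "male"

# Example with mostly heterozygous calls, which would be predicted as "female":
print(predict_sex_from_x_snps({"rs1": "AG", "rs2": "CT", "rs3": "GG", "rs4": "AC"}))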

6.10. Detecting and reducing transcriptome metadata errors

Mishandling of metadata is always possible for transcriptome datasets, as well as for all other omics datasets (proteome, phosphoproteome, metabolome, genome). The problem is that wrong metadata, wrong labels, wrong conditions, and mislabelled controls are very difficult to spot and correct in retrospect. Such checks are therefore best performed at submission, when all the relevant information is still available.

The big wave of incorrect annotations and significant data errors is steadily rising: the sheer amount of data and its rapid growth further complicate finding such errors.

Around 2013, the amount of publicly accessible gene-expression datasets was about to hit the one-million milestone [9]. Reusing these data was becoming an established way of conducting research: 20% of the datasets deposited in 2005 had been cited by 2010, and 17% of the datasets deposited in 2007 had been cited by the end of 2010 [9], underlining the increasing importance of the GEO database. By the end of 2020, GEO entries reached 4 million; only a year later (November 2021) there were 4,713,471 samples, and by now (September 2023) the count has increased to 6,670,188 samples. Such growth in data volume without improved quality controls for metadata reporting will inevitably lead to ever more errors. The steadily rising number of publications citing the same publicly available dataset (e.g., the comprehensive dataset by Fleischer et al. (2018) [5]) [6] further demonstrates the urgent need for correct metadata in every dataset and, as resources for checking are scarce, especially in highly cited master datasets on which so much research depends.

6.11. Detecting and reducing gene name autocorrect errors

It has been known since 2004 that some gene names, such as MARCH3, SEPT8 and DEC1, can be autocorrected into dates in spreadsheets [67]. In response to this issue, a 2016 publication by Ziemann et al. raised awareness and prompted the HUGO Gene Nomenclature Committee (HGNC) to rename the affected genes with symbols less prone to autocorrection [67], [68]. However, almost twenty years after the problem was first recognised and five years after that article, Abeysooriya et al. (2021) reported that, despite awareness of the problem and measures taken by both the HGNC and software developers, the number of Excel files containing gene name errors had even increased [67]. In addition to giving tips for preventing such errors, the authors also set up an automated reporting system, which is available at http://ziemann-lab.net/public/gene_name_errors/ [67].

Another approach, besides raising awareness and renaming the respective genes, would be to forego the use of Excel in favour of the CSV format and software without autoformatting. Additionally, because CSV is a non-proprietary format, this approach aligns with Tim Berners-Lee's 5 Star Linked Data Principles [22], [23] for Linked (Open) Data. The cumulative criteria require that the data be available on the Web (the first star) and, to qualify as Open Data, carry an open licence. Furthermore, the data should be in a machine-readable format (the second star), ideally a non-proprietary one (the third star) [22], [23]. A fourth star is awarded for the use of RDF and SPARQL (the query language for RDF), the open standards from the W3C, so that the data can be referenced by others [22], [23]. Ideally, the data should also be linked to other data to provide context, resulting in a five-star rating [22], [23]. A combination of both approaches, i.e., awareness of the potential side effects of autoformat and autocorrect features together with the use of non-proprietary formats such as CSV that are less prone to them, could help prevent such errors. However, researchers creating and reading CSV data still need to be aware of this potential error source if they prepare or open CSV files in spreadsheet applications that offer autoformat and autocorrect options.
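As a concrete illustration, a short script can scan the gene-symbol column of a results table for values that look like spreadsheet-converted dates. The file name, column name and date patterns below are examples only and would need to be adapted to the table at hand.

import csv
import re

# Patterns typical of gene symbols that spreadsheets convert to dates,
# e.g., SEPT8 becoming "8-Sep" or "2019-09-08"; this list is illustrative, not exhaustive.
DATE_LIKE = re.compile(
    r"^(\d{1,2}-[A-Za-z]{3}|[A-Za-z]{3}-\d{2,4}|\d{4}-\d{2}-\d{2})$"
)

def find_date_like_gene_names(path: str, gene_column: str = "gene") -> list:
    """Return (row number, value) pairs whose gene column looks like a date."""
    hits = []
    with open(path, newline="") as handle:
        for row_number, row in enumerate(csv.DictReader(handle), start=2):  # row 1 is the header
            value = (row.get(gene_column) or "").strip()
            if DATE_LIKE.match(value):
                hits.append((row_number, value))
    return hits

# Example usage with a hypothetical differential-expression table:
# print(find_date_like_gene_names("deg_table.csv"))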

6.12. Detecting and reducing citation errors

Digital object identifiers (DOIs) are part of the bibliographic metadata in Crossref, which is provided by the publishers and not double-checked by Crossref [69]. Thus, Crossref faces similar challenges to other databases containing metadata that have not been double-checked. Additionally, DOI mistakes have been analysed and reported in other databases such as Scopus, Web of Science and PubMed [69], suggesting that this citation-related metadata problem is widespread across databases and not specific to Crossref. In addition to analysing the taxonomy of DOI errors, Cioffi et al. (2022) also developed a cleaning mechanism that can correct mistakes in DOIs automatically [69], which gives reason to hope that their tool and similar approaches might help cope with the flood of data and the concomitant wave of errors.

A 2018 article by Brembs indicates that prestigious journals often struggle to achieve the same reliability regarding data and metadata as other journals [70]. One reason is that these journals also receive the highest number of submissions and are thus faced with the monumental task of checking data, metadata and consistency [70]. Moreover, time is often critical, competition is fierce, and the latest methods used are frequently only just becoming reliable, charting new and unknown territory. As journals are expected to carefully examine each manuscript, including all attached supplementary data and metadata, there is an urgent need for tools that help both scientists and journals cope with the ever-growing flood of research data. These tools should assist scientists in documenting and archiving their data and allow journals to easily, and ultimately automatically, double-check and verify these data. Additionally, such tools should allow large-scale data, including supplements, to be updated over time as more data become available, with a clearly marked version history. This would ensure the reproducibility of the reported research as well as the documentation of confirmatory data gathered later, as is already standard in large-scale databases. Ideally, these tools should be standardised, enabling easier and more reliable data review and analysis for all involved parties.

6.13. The battle for improving metadata quality has just started and must not be lost

6.13.1. Hope for transcriptome metadata: comparison and consistency checks

Our review and examples of transcriptome data and errors show that metadata must be rechecked carefully. Moreover, as a general rule, the repositories themselves should run automatic checking routines for such mistakes. This is easy to achieve, at least for transcriptome data: if every entry is checked for the novelty of its raw data, or even for partial overlap with already stored data, this type of mistake is easy to identify. Sharing omics data such as RNA-sequencing data as publicly available datasets offers excellent value to the research community, as computational analysis of existing data can save time and resources. Additionally, analysing already available datasets with other methods, new tools, or a different focus is slowly becoming standard practice and can reveal further insights [6].

Based on the studies above, it is reasonable to assume that individual metadata errors are normally distributed and that a fatal binary error, such as wrong sex or mix-up of control and treatment, occurs for a low percentage of publications (at least 1–3%). At the same time, a substantially larger fraction (roughly estimated about 5x more) has minor quality issues.

6.13.2. The importance of spotting metadata errors in all data types

Spotting metadata errors in all data types requires effort but is essential. Unfortunately, not all labelling errors can be found as easily as a wrong sex in transcriptome metadata. Genetic data are in principle comparable by similar techniques (most easily by mapping genomes against each other), but genetic variation is precisely what is sought when sequencing new DNA; hence, wrong labelling of samples, treatments and specific sample conditions may go unnoticed, as the variation it causes is hidden within the "expected" natural variation. Even with machine-learning techniques, detecting metadata errors can be quite challenging due to this natural variation.

Therefore, machine-learning models may struggle to differentiate between true errors and inherent variability, leading to unsatisfactory predictions [71] .

Most errors in the metadata of other omics data types are complicated to spot in retrospect. For instance, the exact conditions of proteomics experiments, such as time points, sample preparation and handling, are challenging to verify. Metabolomics samples are even more challenging, as sophisticated techniques and different protocols are used, and sometimes the information critical for cross-comparisons across datasets is simply not available, rendering such comparisons impossible. Samples erroneously labelled as controls are harder to identify and might cause even more damage to research, especially if only a limited number of samples with the particular condition are available.

Finally, metadata errors in imaging data are comparatively easy to spot if the error concerns a mismatch between the annotation and what is visible in the image. This should improve further as computer-assisted or even fully automatic annotation becomes more powerful. Tools such as the metadata editor for microscopy, MDEmic, which allows detailed metadata to be created and edited, can improve the interoperability of imaging data [72] and thus help to apply the FAIR (findability, accessibility, interoperability and reusability) principles [19] in research [72]. However, all metadata errors not directly visible from the image, such as sample preparation, harvesting conditions, pharmacological treatment and time points, are again difficult to spot and require extensive cross-comparisons. The same considerations apply to other, more functional data and the typical "individual molecular biology experiment". Thus, there is reason for cautious optimism as long as more cross-data checks are purposely, or even systematically, applied.

7. Coping with the Big Data Wave

A possible approach to coping with the big data wave is to keep metadata quality high and to make systematic comparisons. Big data are constantly increasing, which inevitably increases the workload of the people handling them. Unless there are automatic quality controls, cross-comparisons to validate metadata, and checks and counter-checks ensuring that the entered information is correct, we will drown in errors, as the number of personnel involved in databanks is certainly not increasing at the same pace as data generation.

The well-known reproducibility crisis [73] is driven by a difficult-to-avoid bias towards publishing positive results, not showing negative results, or even omitting most of the confounding data. Statisticians have also raised legitimate reservations regarding statistical biases, samples that are too small, extravagant claims and missing controls [74], [75]. However, from the start of any scientific study, good data need good curation. Without an exponential increase in automatic data and metadata quality controls, we will experience a steady decline in data quality, inversely proportional to data growth.

8. Strategies and obstacles on the road to data integrity and reusability

To be of value, data need to be correct, complete, readily available, accessible, and compatible; otherwise, the data will not easily be used [76]. This also includes the comparability of data, which is necessary for analysing and comparing multiple datasets created by different researchers. Substantial heterogeneity, as reported by Perumal and colleagues regarding data quality in anthropometric studies [77], can impair studies based on several datasets and limit research.

Several concepts have been created and are continuously being developed to meet these requirements. An international data quality (DQ) research collaboration developed a harmonised intrinsic data quality framework (HIDQF) with two contexts for DQ assessment (verification and validation) and three DQ categories (conformance, completeness and plausibility) to assess the intrinsic quality of a dataset [78]. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines have been established, and are continuously being developed, to evaluate data quality in research publications and serve as a possible resource for assessing the quality of the published literature [78], [79].

Additionally, the need for a sufficient infrastructure allowing the reuse of scientific data has led to the design of the FAIR (Findability, Accessibility, Interoperability, and Reusability) Data Principles, which take both human researchers and so-called “computational stakeholders” (computational agents and applications for data retrieval and analysis) into account [19] . As Wilkinson et al. (2016) emphasised, FAIRness is vital for proper data management, which is a precondition for other researchers’ reuse of scientific data [19] .

In addition to these guidelines and good practices, several computational methods are being implemented to solve the data dilemma. Various DQ assessment tools, both commercial and open-source [78] , are available and often shared by the developers, e.g., via GitLab [80] . However, Liaw et al. (2021) observed that only a few studies reported their datasets’ quality [78] .

In clinical practice, informatics is being used to tackle a similar problem: reporting and analysing patient safety events (PSE) and the measures taken to prevent them is often impeded by missing or incomplete data [81]. In their assessment of narrative medication error reports, Yao and colleagues observed that the narrative parts of PSE reports contained extensive and valuable information, while the structured fields were often ignored [81]. A possible solution to this dilemma is the use of natural language processing (NLP) tools, since a proof-of-concept study has already demonstrated that existing NLP systems, such as the Averbis Health Discovery tool, can extract medication information from narrative texts, e.g., from unstructured medical discharge letters [82]. Tools like these could compare the text of a publication to its metadata and help annotate the metadata correctly or find discrepancies between metadata and publication.

Although NLP tools show promising results in automated text extraction [82], relying solely on machine learning adds potential uncertainty. Additionally, a recent evaluation of the large language models (LLMs) (Chat)GPT-3.5 and GPT-4 has shown that their performance and behaviour can change substantially [83]. Chen et al. (2023) evaluated different GPT versions and reported that some tasks were solved substantially worse after a relatively short amount of time (between March and June 2023), which indicates a need to continuously monitor the behaviour of LLMs [83].

Furthermore, before the long-term behaviour of an LLM or another ML approach can even be monitored, the performance of new models has to be evaluated and usually also compared to existing models performing the same or a similar task. Objectively comparing different ML models can be challenging, as different studies might use different metrics [84]. Moreover, ML models intended for different uses might have different requirements, which can in turn affect the evaluation of a tool [84]. An example is the balance between precision and recall, the two most commonly used measures for evaluating the performance of applications for pattern recognition and information retrieval [46].

Precision is defined as the ratio of the number of correct results (true positives, the overlap of the two circles in Fig. 4) to the number of all returned results [46]. In a database search, this equals the number of relevant documents that were retrieved divided by the total number of documents that were retrieved [47]. Recall is defined as the ratio of the number of correct results to the number of expected results [46]. In a database search, this equals the number of relevant documents that were retrieved divided by the total number of relevant documents [47]. Data that fulfil the criteria of findability and interoperability are persistently identifiable, re-findable and machine-actionable, and their metadata are ideally syntactically parseable as well as semantically machine-accessible [37]. This results in high recall and high precision, meaning that all (or almost all) relevant data have been found and that all (or almost all) of the returned data were correctly identified as relevant.

Fig. 4

Visualisation of True Positives, True Negatives, False Positives and False Negatives. Own figure.

Maximising recall corresponds to a low number of false negatives and to assigning more instances as positive, which increases the number of "false alarms" [84]. Conversely, high precision requires a low rate of false positives, which can result in missing some positive events, as only very strong positive predictions are returned [84]. While some false alarms are tolerable in cancer detection, a less severe and more prevalent disease might require higher precision [84]. Moreover, the combination of low precision and high recall, with the resulting false alarms, unnecessarily increases the manual workload and wastes time [84]. Incomplete metrics can also lead to confusion or even give a false impression of a model's performance, as Hicks et al. elegantly elaborate in their 2022 publication in Scientific Reports [84].

In database searches, precision and recall are also of interest: researchers investigating the impact of a specific disease on gene expression in a particular tissue type should be able to effortlessly find (almost) all relevant datasets for their analyses, for instance datasets that include omics data from both healthy and disease-affected samples derived from that tissue type. While precision and recall are used to evaluate the accuracy and relevance of the retrieved information, the aforementioned aspect of findability primarily concerns the discoverability of data, which can be enhanced by adding sufficient metadata.
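Both measures translate directly into a few lines of code; the sketch below simply restates the definitions above and applies them to an invented database-search example.

def precision(true_positives: int, false_positives: int) -> float:
    """Correct results divided by all returned results."""
    retrieved = true_positives + false_positives
    return true_positives / retrieved if retrieved else 0.0

def recall(true_positives: int, false_negatives: int) -> float:
    """Correct results divided by all expected (relevant) results."""
    relevant = true_positives + false_negatives
    return true_positives / relevant if relevant else 0.0

# Invented example: a search returns 50 datasets, 40 of which are relevant (TP),
# 10 are irrelevant (FP), and 20 relevant datasets were missed (FN).
print(precision(40, 10))  # 0.8   -> 80% of the retrieved datasets are relevant
print(recall(40, 20))     # ~0.67 -> two thirds of all relevant datasets were found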

8.1. Domain-specific simple annotation tools

These can be a pragmatic solution and cover diverse areas:

Bacterial genomes: Since the verification of supporting data and the identification of errors and inconsistencies are challenging tasks, Schmedes et al. (2015) developed an automated, easy-to-use, Excel-based tool for the curation of local bacterial genome databases, which can be used as a quality check before downstream analyses are performed [85]. They also emphasise the importance of, and the urgent need for, additional tools and quality control practices, and suggest that upfront quality control by public database managers would save downstream resources and provide the end user with better-quality data and metadata [85].

Sequence read archives: Crandall et al. (2023) report missing spatial and temporal metadata in genome-scale genetic diversity data, which hinders the reuse of these data for monitoring programmes and other purposes. They report that in 2021, only about 13% of the over 300,000 samples in the Sequence Read Archive (SRA) that might be relevant to global genetic biodiversity contained information about the precise location and time at which they were obtained [86]. Additionally, they observed a rapid decrease in the availability of the metadata necessary to restore the missing information [86]. Because of this rapidly declining metadata availability, which they found mirrored in other kinds of biological data, they draw attention to the need for updated data-sharing policies and researcher practices, as metadata contain valuable context that should not be lost to science forever [86]. Besides this (potential) data loss, the absence of appropriate spatiotemporal metadata also represents a loss of research effort, which could range from tens to hundreds of millions of dollars, and affects the Indigenous peoples who might otherwise have benefitted from genomic information originating within their territories [86].

To tackle this problem, the group (12 professional researchers and 13 graduate students) spent about 2300 hours trying to restore the missing information during their "datathon" (data restoration competition), which is described in detail in their publication [86]. Their effort to retrieve the missing metadata rescued over US$2.1 million worth of genomic sequence data [86], which indicates that trying to restore and correct missing metadata is worth the effort. However, this is not an ideal long-term solution; it would be much better if metadata were shared as diligently as primary data already are, because only the added metadata make primary data FAIR [86]. Hence, the authors also provide a detailed list of required and recommended metadata of interest for monitoring genetic diversity, including definitions of the terms [86].

Sequence database cross-check: Missing (meta)data is not the only challenge for bioinformatics. Increasing evidence suggests that sequence databases harbour significant amounts of erroneous information, including spelling errors in protein descriptions, contamination of sequence data, duplication, and inaccurate annotation of protein function [87]. Goudey et al. (2022) therefore analyse and describe the interconnectedness and interdependency of different databases and how the relationships between records can be employed to understand and improve the quality of sequence records and of the data in sequence databases [87]. They propose regarding the various sequence databases as parts of a greater whole instead of seeing them as loosely linked, independent entities [87]. This sequence database network, the relationships between records, and machine-learning methods such as trust propagation techniques can be exploited to detect and correct annotation errors as well as to verify connected records [87]. Additionally, they highlight the need for new metrics for quantifying the quality of records and their metadata and the importance of propagating updates and corrections [87]. Machine-learning models, such as random forests and artificial neural networks, can further be employed to find sequencing errors [71].

Another important aspect of using big data is data cleaning (or data cleansing), which aims to improve data quality via the identification and subsequent removal of errors [88]. "Dirty data", defined as inaccurate, inconsistent and incomplete data, and poor-quality data in general can affect analyses [88]. Data scientists spend a great amount of time and effort on cleaning and organising data, taking up 50–80% of their work time [89]. While some errors are relatively easy to spot, such as missing values encoded as an unrealistic number (e.g., 99 years of education because missing values were encoded as 99) [89], other errors, such as the one we happened to find by accident, are more difficult to identify. Since identical data were labelled with different metadata, the only way to identify this error would have been to compare the newly uploaded sequences to all other sequences in the database; the duplicate sequences would then have been identified, and a subsequent comparison of the respective metadata would have revealed the errors. Such an all-against-all comparison requires an enormous amount of computing power and is therefore impractical. Routinely comparing the annotated sex with the sex-specific gene expression can indicate errors in the metadata, but only the researchers who created the data may be able to find out whether the respective samples carry a simple error in their metadata or were tagged with the metadata of another sample. This demonstrates the huge responsibility of researchers sharing their data.
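One inexpensive safeguard against byte-identical duplicate uploads would be to compare file checksums at submission time instead of aligning every sequence against every other. The sketch below, with a hypothetical directory and file pattern, groups raw files by their MD5 digests; it would not catch re-processed versions of the same sample, only identical files.

import hashlib
from collections import defaultdict
from pathlib import Path

def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 digest of a (possibly large) file in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicate_files(directory: str, pattern: str = "*.fastq.gz") -> dict:
    """Group files with identical content; any group larger than one is suspicious."""
    by_digest = defaultdict(list)
    for path in Path(directory).glob(pattern):
        by_digest[md5sum(path)].append(path.name)
    return {digest: names for digest, names in by_digest.items() if len(names) > 1}

# Example usage with a hypothetical submission directory:
# print(find_duplicate_files("incoming_submissions"))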

Enhancing metadata from lab experiments: A possible method to handle newly generated laboratory data and the relevant metadata, such as who created the data, using which samples, following which protocols, has been described by Panse et al. (2022), who explain how the life sciences community of the Functional Genomics Center Zurich (FGCZ) is able to “glue together” data, including metadata, computing infrastructures, such as clouds and clusters, and visualisation software using their self-developed B-Fabric system, allowing instant data exploration and ad-hoc visualisations [90] . They also present their lessons learned, which is not only valuable information for researchers facing similar data organisation tasks but also showcases the qualities and advantages of their software solution [90] .

Ontology and terminology corrections: Another important aspect is highlighted by Beretta et al. (2021): they point out that interdisciplinary data sharing and the discovery and reuse of data face additional challenges due to discipline-specific formats, different metadata standards, and semantic heterogeneity [91]. They therefore introduce a user-centric and flexible metadata model built around a common paradigm, the observation concept [91]. In accordance with the FAIR principles, they aim to reuse existing ontological and terminological resources and to specify the semantics of the model's elements with ontologies and vocabularies [91]. By adding semantics to metadata, the model enhances discoverability and semantic interoperability, which enables interdisciplinary research projects [91]. Additionally, the model can utilise reasoning capabilities: for instance, a search query for a certain fish in the ocean can be expanded to find datasets on the various species of that fish as well as datasets related to its habitat [91]. There are of course further solutions; for instance, the tool Protégé ( https://protege.stanford.edu ) is an ontology editor that can be integrated with a reasoner (e.g., HermiT) for ontology consistency testing.

Preserving metadata for heterogeneous data repositories: Discovering, querying and integrating heterogeneous data and knowledge from different sources has been analysed by Kamdar and Musen (2021) [92]. They meta-analysed more than 80 biomedical linked open data sources within the Life Sciences Linked Open Data (LSLOD) cloud, which has been created using Semantic Web and linked data technologies [92].

In linked data, information from different sources is linked together [21]. This approach can even connect databases that are maintained by separate organisations, or heterogeneous systems that previously did not interoperate at the data level [21]. Using an LSLOD schema graph, Kamdar and Musen observed that there is still need for improvement: the LSLOD cloud was not as densely connected as assumed, and several databases were poorly connected to others or not interconnected at all [92]. This demonstrates the adverse effect of the heterogeneity and quality discrepancies of the LSLOD sources and the lack of common vocabularies [92], and it highlights the need to transform non-FAIR data into linkable data [21] as well as the importance of the Linked Data Principles. These principles have been described in detail by Tim Berners-Lee, who highlighted and explained the importance of URIs and the information that can be provided via them. Additionally, he introduced the five-star rating scheme, 5 Star Linked Data, described above, to point out the important aspects of Linked (Open) Data [22], [23]. These principles can also be adapted to specific requirements, for instance those of the Linked Open Data Cloud, which include a resolvable http:// or https:// address and connection via RDF links to data that are already part of the diagram [93]. These strict criteria might also explain missing entries, since the respective sources may have lacked some of the required criteria.
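To make the linked-data idea more tangible, the following minimal sketch uses the Python rdflib library to parse a tiny RDF snippet and run a SPARQL query over it; the URIs and properties are invented for illustration and do not correspond to any real LSLOD source.

from rdflib import Graph

# A tiny Turtle snippet (invented URIs) standing in for a linked-data source.
TTL = """
@prefix ex: <https://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
ex:sample1 rdfs:label "lung biopsy, COVID-19 patient" ;
           ex:donorSex "male" .
ex:sample2 rdfs:label "lung biopsy, healthy control" ;
           ex:donorSex "female" .
"""

graph = Graph()
graph.parse(data=TTL, format="turtle")

# SPARQL query over the graph: retrieve every sample and its annotated sex.
QUERY = """
SELECT ?sample ?sex WHERE {
  ?sample <https://example.org/donorSex> ?sex .
}
"""
for row in graph.query(QUERY):
    print(row.sample, row.sex)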

Another approach, besides employing Semantic Web techniques, is to decrease the semantic heterogeneity, e.g., by increasing vocabulary reuse. Using a defined vocabulary will enhance the effectiveness of querying and enable the integration of diverse biomedical sources in the future [92].

This might be aided by tools similar to METAGENOTE (METAdata of Genomics studies on a web-based NOTEbook), which has been developed to help researchers describe their genomics samples with standardised metadata when submitting to the Sequence Read Archive [27], [94].

GitHub-supported open tools for repositories and data maintenance: Ideally, these tools for checking the data should be available as open-source software. For various analysis tools, this is already common practice, and many developers also maintain GitHub repositories, which offer the opportunity to interact with the developers and other users. This allows users to draw attention to problems and to ask for support when needed. Additionally, all questions, discussions and solutions are archived and accessible for future reference. How well this system works can be seen in the GitHub repositories of open-source R tools such as DESeq2 [57], Seurat [95], [96], [97], [98], [99] and OmniPath [100], [101].

This practice is not only convenient but often also a requirement upon publication of an article introducing the tool, and should be implemented for publications introducing new datasets and/or using these analysis tools as well.

An example is the Nature publication “ A transcriptomic atlas of mouse cerebellar cortex comprehensively defines cell types ” by Kozareva et al. (2021) [102] . The authors made their data available via the GEO database [8] and the Single Cell Portal, which is available at https://singlecell.broadinstitute.org/single_cell , and additionally created a GitHub Repository (available at https://github.com/MacoskoLab/cerebellum-atlas-analysis ) where they describe in detail how others can recreate their figures [102] .

Another possible approach for addressing deficiencies in data quality or the absence of corrections might be community moderation, which could give users the opportunity to discuss articles or research findings or even to highlight discrepancies. However, relying solely on community input without moderation options, or even adopting an "anti-censorship" philosophy such as that of the late-1970s online Bulletin Board System CommuniTree [103], is a risky strategy. The example of CommuniTree, described in detail by Seering (2020), demonstrated that the dream of the internet as a "marketplace of ideas", in which better perspectives would naturally rise to the top, was a utopia [103]. While the consequences of a more heterogeneous user group might have surprised the first CommuniTree users, today online conflicts are a well-known and unresolved problem [103]. Therefore, some form of moderation needs to be implemented. The moderation itself can be performed either by the platform or by the community, for instance by volunteer moderators [103]. Both approaches require additional resources, from the platform, from volunteers, or from both, since platforms might want to employ administrators who are in charge of final content moderation decisions [103].

An example of the usefulness of an option to comment on scientific articles is bioRxiv, the preprint server for biology. The platform links preprints with discussions about the respective articles in the media and in tweets, and even provides links to online discussions of the preprints that occur elsewhere (Community Reviews). Additionally, bioRxiv offers a comment option, where readers can discuss an article or ask questions, which are sometimes answered by the authors or other readers. An example is the bioRxiv preprint of the Cell publication by Blanco-Melo et al. (2020) [50], [104], which can be found at https://www.biorxiv.org/content/10.1101/2020.03.24.004655v1#comments . Published preprints, such as the preprint by Blanco-Melo et al. (2020) [104], are linked to the final version of the article [50] upon publication.

8.1.1. Tools for data management and version control

Besides the above-mentioned tools and strategies, several software solutions are available to enhance data management and facilitate generating FAIR data. A selection of these tools is presented in Table 4; a more detailed description of these tools, including advantages, disadvantages, a FAIRness rating and the respective links, can be found in the Supplementary Data.

Selected software solutions for data management and version control; detailed descriptions are available in the Supplementary Data.

8.1.2. Selected tools for data handling, error detection, bioinformatic analyses and publications

Additionally, several tools have been designed and developed to aid researchers in every step of the publication process, from citation management to workflow management, quality control and error detection. A small selection of these tools is summarised in Table 5; more detailed information, including descriptions, links and the date of the last update, is available in the Supplementary Data.

Selected software solutions for data handling, error detection, bioinformatic analyses and publications.

Notable examples regarding big data and cloud use are, for instance, the open-source projects Apache Hadoop and Apache Spark as well as the Databricks platform built around Spark.

Apache Hadoop, which was developed by Doug Cutting and Mike Cafarella, is one of the best-known solutions for working with big data [105]. Big data is commonly described by five characteristics, often referred to as the 5 Vs: volume, velocity, value, veracity, and variety [105]. Volume is the most obvious and immediate challenge, as the data must not only be stored but also analysed [105]. Furthermore, big data is created at ever-increasing speed (velocity) and comprises a variety of data, structured or unstructured, which affects both storage and analysis [105]. The potential value is the most important aspect of big data [105]; although this potential value is huge, the data first need to be analysed and turned into value [105]. However, the quality of the data can vary greatly: due to the high volume, velocity and variety, not all of the data can be 100% correct [105]. The accuracy of any analysis therefore depends on the source data's veracity [105], which can be defined as the "truthfulness, accuracy or precision, correctness" of the data [106].

Hadoop originally started as part of the scalable open-source web crawler project Apache Nutch but soon emerged as an independent, top-level Apache project [107]. The first version of Hadoop consisted of two main components: the Hadoop Distributed File System (HDFS), an abstraction layer responsible for data storage, and the distributed programming paradigm MapReduce, which was used for managing job resources [107]. When Doug Cutting joined Yahoo! in 2006, a dedicated development team was created, and the tool, named after a yellow stuffed elephant Doug Cutting's child used to play with, has been used extensively by the company [107]. By 2009, Yahoo! could sort 1 TB of data in 62 seconds by using a Hadoop cluster to index its search engine, and by 2010 an ecosystem of tools, such as Hive, Apache HBase and Pig, had developed around Hadoop [107]. As long as the number of functioning computers in the cluster is sufficient, the relatively fault-tolerant Hadoop MapReduce handles hardware failures well and offers a cost-effective way of processing large amounts of data [108], which is one of the reasons why it is still being used almost two decades after its inception. During the COVID-19 pandemic, Apache Hadoop and MapReduce were proposed as an inexpensive and flexible solution for the unprecedented data analysis challenge arising from the extraordinary and trailblazing sharing of COVID-19-related data [108]. In 2012, Hadoop 2 and Yet Another Resource Negotiator (YARN) were introduced; by separating Hadoop's resource management from the processing layer, YARN allowed Apache Spark and other processing models to be used as processing engines [107]. While the big data analysis platform Apache Spark is commonly used on powerful computer clusters, Andrešić et al. (2017) demonstrated that even a single standard computer can suffice for data analysis with Apache Spark [109]. Using a standard computer with 8 GB RAM and Apache Spark in single-cluster mode, they confirmed that their approach of combining self-organising map software libraries with Apache Spark was still efficient and fast enough, demonstrating that Spark can also be employed by researchers with limited resources [109].
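To illustrate this low entry barrier, the following sketch runs Spark in local mode on a single machine via the pyspark package and computes a simple consistency report over a hypothetical sample sheet; the file name and column names are placeholders.

from pyspark.sql import SparkSession

# Local-mode Spark: uses the cores of a single machine, no cluster required.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("metadata-checks")
    .getOrCreate()
)

# Hypothetical sample sheet with one row per sequencing sample.
samples = spark.read.csv("sample_sheet.csv", header=True, inferSchema=True)

# How many samples exist per annotated sex and condition,
# and how many rows lack the sex annotation altogether?
samples.groupBy("sex", "condition").count().show()
print("missing sex annotation:", samples.filter(samples.sex.isNull()).count())

spark.stop()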

Additionally, Spark is also suitable for cloud computing and is part of the Databricks Lakehouse Platform, where a Spark compute layer is used for querying, processing, and transforming the data stored in the storage layer, decoupling cloud storage from cloud computing [110]. The Databricks Lakehouse Platform is compatible with Microsoft Azure, AWS and Google Cloud; an example of its use in data analysis is a phishing detection tool that combines Microsoft Azure, a Spark cluster and Azure Databricks [111].

9. Rethinking the correction of errors in scientific publications

Both avoiding errors and finding errors are essential for data integrity. However, it is equally important to correct these errors once they have been found. Metadata and data errors are natural and will happen; they become more numerous as data accumulate, and we can only do our best to lower the a priori error probability per dataset but can never reduce it to zero. As seen in the rising number of retractions, more publications and better technical means lead to more reasons for retraction being found [112]. Due to inconsistencies in how different journals handle retractions, the Committee on Publication Ethics (COPE) published retraction guidelines in 2009, and the blog "Retraction Watch", founded in 2010, covered over 200 retractions and logged more than a million page views in its first year of existence [112].

Nevertheless, correcting errors in scientific publications is challenging due to the current public stigma connected to post-publication updates or corrections, especially when such updates are mistakenly perceived as punitive measures or confessions of wrongdoing [113] .

Additionally, the self-correction process and the correction of mistakes face considerable obstacles, which were highlighted by Allison et al. in their 2016 Nature article [114] . They identified several challenges associated with the correction process, including disincentives against correction (e.g., fees imposed on authors who request the withdrawal of their publication) and barriers, such as journals requiring publication fees for articles bringing attention to previously published works within the same journal [114] . Furthermore, addressing errors officially in a timely manner may be challenging, as editors may be unable or hesitant to take swift action [114] . This can also be attributed to the conflicting priorities of ensuring fairness to the authors during an ongoing investigation and the need to preserve the integrity of the literature [113] .

Fortunately, large data infrastructure projects are improving data governance, such as the NFDI (German National Research Data Infrastructure) initiative, which was implemented by the German Research Foundation (DFG), fosters research data management (RDM) [115], and comprises several consortia, including NFDI4Microbiota ( https://nfdi4microbiota.de ). International efforts are also becoming increasingly aware of this escalating problem of data and metadata integrity, such as Elixir ( https://elixir-europe.org ), the European life sciences infrastructure bringing together EMBL-EBI and more than 220 institutes in 22 countries [116]. The goal of achieving reproducible results requires ever-new solutions for scientific data management, taking advantage of the scientific community's willingness to achieve the highest data standards and overcoming the existing barriers through the systematic development of standards, tools, and infrastructure, the provision of training, education, and support, and additional resources for research data management [115], [116].

In light of these considerations, we would like to commend the authors for their prompt response in rectifying the issue within the GEO database, as such swift corrections are not a given. In this sense, our chosen introductory example is a best-practice example. However, it is important to note that the metadata within the article remain inaccurate (as of September 2023), which might mislead researchers who only consider the information in the article, the supplementary data, and the metadata provided via the SRA Run Selector without reading all GEO entries for the samples they have chosen to repurpose. At the same time, we would like to draw attention to the optimal practice of considering every available piece of information, even though this is a time-consuming step, especially for big datasets. In this digital age, various technical solutions can address such inaccuracies, for instance by applying the concept of "living articles" described by Barbour et al. in 2017 [113]. Ideally, this leads to a transparent and comprehensive history of changes, which is, in accordance with the key principles for amendments postulated by Barbour and colleagues, accessible to both human and machine readers [113]. Embracing these principles and adopting approaches like the "amendment" system proposed by Barbour et al. (2017), which uses a more neutral term for describing post-publication changes [113], would contribute to a more robust and accurate scientific literature and to the enhanced reusability of research data. As resources are always scarce and the data avalanche is constantly growing, at least the highly cited key datasets should be systematically supported by such a regularly updated curation history. Flagship databanks such as the EMBL database and GenBank are good examples of continuous curation, with new bimonthly releases and daily updates [117].

Another aspect of big data to keep in mind is that every online resource, every tool, and every workflow built on these tools is vulnerable to updates, as updates might affect their functionality. Thus, to guarantee the usability of their tools, developers have to keep checking and updating their tools and software packages. For workflows that describe or document how the results of a publication were generated, this might not be feasible; in these cases, documenting the software versions is of utmost importance. If users encounter a workflow that no longer functions, they can check the software versions and may try to recreate the workflow by installing older versions of the software or by adapting the workflow to the current software requirements.
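A lightweight way to document such versions is to write an environment snapshot next to the workflow results; the sketch below uses Python's standard importlib.metadata, and the package list and output file name are illustrative.

import importlib.metadata
import json
import platform

# Packages whose versions should be recorded for the workflow (illustrative list).
PACKAGES = ["pandas", "numpy", "pyspark"]

versions = {"python": platform.python_version()}
for name in PACKAGES:
    try:
        versions[name] = importlib.metadata.version(name)
    except importlib.metadata.PackageNotFoundError:
        versions[name] = "not installed"

# Store the snapshot alongside the workflow results for later reproduction.
with open("environment_versions.json", "w") as handle:
    json.dump(versions, handle, indent=2)
print(json.dumps(versions, indent=2))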

9.1. Review limitations and implications

This is not a systematic review of data, metadata and the current best practices for data governance, nor a systematic review of all existing errors in current databases. We want to raise awareness of errors and mistakes in metadata annotation and point out typical errors and helpful metadata maintenance tools. Metadata errors may arise from problems in sample tracking and might be avoidable by using appropriate laboratory information management systems and by thoroughly documenting sample metadata. Some errors could have been avoided, as they can be attributed to factors such as an overly complicated or time-consuming procedure for uploading metadata, or perhaps a lack of awareness of how to generate easily reusable data and what needs to be considered. However, most metadata errors simply happen with non-zero probability and in this sense cannot be avoided, and without countermeasures data quality can easily deteriorate as data volume increases exponentially. We hope that this review brings to attention that in big data even little mistakes might end up having huge consequences, especially in biomedical research, and that the errors that have been found may well be only the tip of the iceberg.

While our review acknowledges necessary scope limitations and the vastness of the topic, it is clear that more studies and systematic evaluations are needed to fully grasp the extent of the issues at hand. One key aspect that emerged from our analysis is the necessity for individuals to possess fundamental knowledge and a sense of responsibility when dealing with data. With big data comes big responsibility, and it is crucial that we foster a culture that promotes accountability and encourages the timely identification and rectification of errors. Thus, to encourage this, our perception of errors in scientific research needs to evolve towards a mindset that acknowledges the occurrence of errors as normal instead of catastrophic events, and focuses on rectifying them promptly and effectively. By removing the stigma associated with mistakes, we create an environment that encourages open communication, timely error identification, and effective remediation, enabling continuous improvement. Overall, the spirit of science lies in the exchange and sharing of information to expand our collective knowledge. Digital data sets in bioinformatics are prime examples of resources that can be easily shared, and they provide opportunities for diverse perspectives and fresh insights.

By embracing responsible practices, encouraging data sharing, and fostering a supportive scientific community, we can navigate the challenges posed by big data and metadata management, paving the way for reliable research. We hope that we have raised awareness of the importance of metadata for future research and provided a brief overview of the efforts being made to help avoid such mistakes in the future.

10. Conclusion

Big data and the accompanying metadata create both chances and challenges for scientific research. The exponentially growing amount of data and the possibly drastic, indirect consequences of mismanaging metadata, akin to the hidden depths of an iceberg, emphasise the need for a comprehensive understanding of the importance of data integrity and for a responsible maintenance approach. Automatic consistency checks on metadata integrity (sex, age, experimental conditions) should be further improved and generally applied. Three-month update and error-correction cycles are routine in large public databases; this practice should extend to all published large-scale datasets, with authors praised for such efforts rather than blamed. Data and metadata integrity are a continuous effort for all scientists, and indeed a battle for data quality that we must not lose.

Author contributions

AC performed all data processing and data comparisons, supervised by TD. AC, SD, and TD analysed the resulting processed data and data comparisons together. AC drafted the paper; AC, SD, and TD edited and polished the manuscript. All authors agree to the final version. Contributor roles: AC: conceptualisation, data curation, formal analysis, investigation, methodology, visualisation, writing – original draft, writing – review and editing; SD: conceptualisation, formal analysis, investigation, validation, writing – review and editing; TD: conceptualisation, formal analysis, investigation, supervision, writing – review and editing.

We thankfully acknowledge funding by Stanford Translational Research and Applied Medicine (TRAM) grant, 2020–21 (SD), Stanford Diabetes Research Centre Pilot and Feasibility Grant, 2021 (SD). We are indebted to Deutsche Forschungsgemeinschaft (DFG), grant 492620490 – SFB 1583 /INF (TD, AC) and Land Bavaria (contribution to DFG grant 324392634 - TRR 221/INF) (TD) for their constant support of the respective data infrastructure which stimulated this article. The funders had no role in study design, data collection and analysis, the decision to publish, or the preparation of the manuscript.

Declaration of Generative AI and AI-assisted technologies in the writing process

The authors state clearly that they did not use any generative artificial intelligence (AI) and AI-assisted technologies in the writing process.

Conflict of interest statement

All authors declare that they have no conflict of interest. There are no financial, personal or other conflicts of interest by any of the authors (TD, AC, SD).

Appendix A Supplementary data associated with this article can be found in the online version at doi:10.1016/j.csbj.2023.10.006 .

Appendix A. Supplementary material

Supplementary material

Data Availability


    The most important initiatives for the usage of Big Data techniques in medical bioinformatics are related to scientific research efforts, as described in the paper. Nevertheless, some commercial initiatives are available to cope with the huge quantity of data produced nowadays in the field of molecular biology exploiting high-throughput omics ...

  9. Big Data Analysis in Computational Biology and Bioinformatics

    Nowadays, big data analysis in computational biology and bioinformatics is an important area of research and development. With high-throughput technology accelerating the development of biological data, it has become important to create new methods, algorithms, and tools to store, process, retrieve, statistical analysis, and visualize these ...

  10. Big Data Analytics in Bioinformatics: A Machine Learning Perspective

    Abstract Bioinformatics research is characterized by voluminous and incremental datasets and complex data analytics methods. The machine learning methods used in bioinformatics are iterative and parallel. These methods can be scaled to handle big data using the distributed and parallel computing technologies.

  11. Big Data Analysis in Computational Biology and Bioinformatics

    In this chapter, we provide an overview of the current status of big data analysis in computational biology and bioinformatics. We discuss the various aspects of big data analysis, including data acquisition, storage, processing, and analysis. We also highlight some of the challenges and opportunities of big data analysis in this area of research.

  12. Bioinformatics 101: Big Data in Biological Research

    The term "Big Data" often evokes images of vast server farms and complex algorithms. In bioinformatics, Big Data refers to the enormous volumes of biological data that scientists must sift through to find actionable insights. Platforms like Bionl are instrumental in this process, offering intuitive, no-code solutions that transform what could ...

  13. A scoping review of 'big data', 'informatics', and 'bioinformatics' in

    Research in big data, informatics, and bioinformatics has grown dramatically (Andreu-Perez J, et al., 2015, IEEE Journal of Biomedical and Health Informatics 19, 1193-1208). Advances in gene sequencing technologies, surveillance systems, and electronic medical records have increased the amount of he …

  14. Big Data in Bioinformatics and Computational Biology ...

    This paper addresses the issues and challenges posed by several big data problems in bioinformatics, and gives an overview of the state of the art and the future research opportunities. View Show ...

  15. Genomics in Big Data Bioinformatics

    With the advent of big data the volume of data in bioinformatics research is exponentially growing. The big data sources have vast information as they have exceeded the particle physics experiments and search—engine blogs and indexes. ... In the current trend bioinformatics in big data the technologies which capture the bio data has become ...

  16. Big Data in Bioinformatics

    Big Data in Bioinformatics. Biology is becoming increasingly data-intensive as high-throughput genomic assays become more accessible to greater numbers of biologists. Working with large-scale data sets requires user-friendly yet powerful software tools that stimulate user's intuition, reveal outliers, detect deeper structures embedded in the ...

  17. Bioinformatics and Big Data Analytics in Genomic Research

    Bioinformatics and Big Data Analyt ics in Genomic Research. Qaiser asad. Department of health science, university of Public Health, Gujrat, India. Abstract: The field of genomics has witnessed a ...

  18. Big Data in Bioinformatics and Computational Biology: Basic Insights

    Here, we talk about how the advent of high-performance sequencing platforms has paved the way for Big Data in biology and contributed to the development of modern bioinformatics, which in turn has helped to expand the scope of biology and allied sciences. New technologies and methodologies for the storage, management, analysis, and ...

  19. (PDF) Big Data in Bioinformatics

    PDF | On Apr 3, 2018, N.N. Nazipova and others published Big Data in Bioinformatics | Find, read and cite all the research you need on ResearchGate

  20. Bioinformatics clouds for big data manipulation

    Reviewer 2: Dr. Igor Zhulin (University of Tennessee, United States of America) The review summarizes advantages of using cloud computing for "big data" storage and analysis issues in bioinformatics. In general, it does a fair job on this front. However, disadvantages of clouds are not discussed in this review at all.

  21. Role of Bioinformatics in Data Mining and Big Data Analysis

    Through data mining and big data analysis, bioinformatics is unlocking a deeper understanding of biology and paving the way for future breakthroughs in healthcare and medicine. While challenges exist, the potential of bioinformatics in leveraging big data is immense and largely unexplored, offering exciting opportunities for future research and ...

  22. (PDF) Big Data in Bioinformatics

    At the. same time, keeping in mind the onset of the "big data" t erm in 2008, we can remember that. initially it was primarily concerned with the scientific sphere and, to a large extent, with ...

  23. Can Big Data Have a Role in Treating Dementia? That's What This

    Wong, who is set to graduate in May with a major in biology and a minor in data science, has spent his college career at Northeastern University doing research in the neuroscience domain. He started at Northeastern in 2020 and quickly began doing research at the university's Laboratory for Movement Neurosciences, learning under professors ...

  24. Metadata integrity in bioinformatics: Bridging the gap between data and

    In biomedical research, big data imply also a big responsibility. This is not only due to genomics data being sensitive information but also due to genomics data being shared and re-analysed among the scientific community. ... During our data retrieval research for a bioinformatics analysis project, we searched for human lung samples of ...