Datasets: arxiv_dataset

The dataset viewer is disabled because the authors forbid processing this dataset automatically and require users to download the dataset files manually.

Dataset Card for arXiv Dataset

Dataset summary.

A dataset of 1.7 million arXiv articles for applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.

Supported Tasks and Leaderboards

[More Information Needed]

The supported language is English.

Dataset Structure

Data instances.

This dataset is a mirror of the original ArXiv data. Because the full dataset is rather large (1.1 TB and growing), this dataset provides only a metadata file in JSON format. An example is given below.

Data Fields

  • id : ArXiv ID (can be used to access the paper)
  • submitter : Who submitted the paper
  • authors : Authors of the paper
  • title : Title of the paper
  • comments : Additional info, such as number of pages and figures
  • journal-ref : Information about the journal the paper was published in
  • doi : Digital Object Identifier
  • report-no : Report Number
  • abstract : The abstract of the paper
  • categories : Categories / tags in the ArXiv system
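The field list above can be read with a few lines of Python. This is a minimal sketch, not the maintainers' loader: it assumes the metadata file stores one JSON object per line and that `categories` is a space-separated string, and the sample record is hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator, List

# Field names taken from the "Data Fields" list above.
FIELDS = ("id", "submitter", "authors", "title", "comments",
          "journal-ref", "doi", "report-no", "abstract", "categories")


def iter_records(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one dict per non-empty line (assumption: one JSON object per line)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Fields missing from a record are returned as None.
        yield {name: record.get(name) for name in FIELDS}


def category_list(record: Dict) -> List[str]:
    """Split the categories field (assumption: space-separated tags)."""
    return (record.get("categories") or "").split()


# Hypothetical record, for illustration only.
sample_line = json.dumps({
    "id": "0000.00000",
    "submitter": "A. Author",
    "authors": "A. Author and B. Author",
    "title": "An example title",
    "abstract": "An example abstract.",
    "categories": "cs.CL cs.LG",
})
records = list(iter_records([sample_line]))
print(records[0]["title"], category_list(records[0]))
```

To read a downloaded file, pass an open file handle to `iter_records` so that records are processed one at a time rather than loaded into memory all at once.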

Data Splits

The data is not split.

Dataset Creation

Curation rationale.

For nearly 30 years, ArXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science to everything in between, including math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming, depth. In these times of unique global challenges, efficient extraction of insights from data is essential. To help make the arXiv more accessible, a free, open pipeline on Kaggle to the machine-readable arXiv dataset is presented: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full-text PDFs, and more. It is intended to empower new use cases that can lead to the exploration of richer machine learning techniques combining multi-modal features, for applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.

Source Data

This data is based on arXiv papers. [More Information Needed]

Initial Data Collection and Normalization

Who are the source language producers, annotations.

This dataset contains no annotations.

Annotation process

Who are the annotators, personal and sensitive information, considerations for using the data, social impact of dataset, discussion of biases, other known limitations, additional information, dataset curators.

The original data is maintained by ArXiv.

Licensing Information

The data is under the Creative Commons CC0 1.0 Universal Public Domain Dedication.

Citation Information

Contributions.

Thanks to @tanmoyio for adding this dataset.

Models trained or fine-tuned on arxiv_dataset

  • Callidior/bert2bert-base-arxiv-titlegen
  • wi/arxiv-distilbert-base-cased
  • chega/distill-scibert_scivocab_uncased
  • TromeroResearch/SciMistral-V1
  • jordyvl/test_implementation
  • Jordyvl/baseline_bert_50k_steps


scientific_papers

  • Description :

The scientific papers dataset contains two sets of long and structured documents. The documents are obtained from the ArXiv and PubMed OpenAccess repositories.

Both "arxiv" and "pubmed" have three features:

  • article: the body of the document, paragraphs separated by "\n".
  • abstract: the abstract of the document, paragraphs separated by "\n".
  • section_names: titles of sections, separated by "\n".

Additional Documentation : Explore on Papers With Code

Homepage : https://github.com/armancohan/long-summarization

Source code : tfds.datasets.scientific_papers.Builder

Versions :

  • 1.1.0 : No release notes.
  • 1.1.1 (default): No release notes.

Download size : 4.20 GiB

Auto-cached ( documentation ): No

Feature structure :

  • Feature documentation :

Supervised keys (See as_supervised doc ): ('article', 'abstract')
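As a rough illustration of the supervised keys, the sketch below loads (article, abstract) pairs with `tfds.load` and splits a decoded field into paragraphs. The helper names are hypothetical, and the newline separator is an assumption about how the features are stored. The TFDS import is kept inside the function so that the paragraph helper can be used without TensorFlow installed. The first call downloads and prepares the dataset, which is several GiB.

```python
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Split a decoded feature into paragraphs (assumption: newline-separated)."""
    return [part for part in text.split("\n") if part.strip()]


def load_arxiv_pairs(split: str = "train"):
    """Return a tf.data.Dataset of (article, abstract) string tensors."""
    # Imported lazily: requires the tensorflow-datasets package.
    import tensorflow_datasets as tfds

    return tfds.load("scientific_papers/arxiv", split=split, as_supervised=True)


def preview(n: int = 1) -> None:
    """Print paragraph counts for the first n examples (triggers the download)."""
    for article, abstract in load_arxiv_pairs().take(n):
        body = article.numpy().decode("utf-8")
        summary = abstract.numpy().decode("utf-8")
        print(len(split_paragraphs(body)), len(split_paragraphs(summary)))
```

Passing `"scientific_papers/pubmed"` instead of the arXiv config name should select the PubMed documents in the same way.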

Figure ( tfds.show_examples ): Not supported.

scientific_papers/arxiv (default config)

Config description : Documents from ArXiv repository.

Dataset size : 7.07 GiB

  • Examples ( tfds.as_dataframe ):

scientific_papers/pubmed

Config description : Documents from PubMed repository.

Dataset size : 2.34 GiB

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2022-12-23 UTC.

  PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts

We present PubMed 200k RCT, a new dataset based on PubMed for sequential sentence classification. The dataset is freely available at https://github.com/Franck-Dernoncourt/pubmed-rct. It consists of approximately 200,000 abstracts of randomized controlled trials, totaling 2.3 million sentences. Each sentence of each abstract is labeled with its role in the abstract using one of the following classes: background, objective, method, result, or conclusion. The purpose of releasing this dataset is twofold. First, the majority of datasets for sequential short-text classification (i.e., classification of short texts that appear in sequences) are small: we hope that releasing a new large dataset will help develop more accurate algorithms for this task. Second, from an application perspective, researchers need better tools to efficiently skim through the literature. Automatically classifying each sentence in an abstract would help researchers read abstracts more efficiently, especially in fields where abstracts may be long, such as the medical field.

1 Introduction

Short-text classification is an important task in many areas of natural language processing, such as sentiment analysis, question answering, or dialog management. For example, in a dialog management system, one might want to classify each utterance into dialog acts (Stolcke et al., 2000).

In the dataset we present in this paper, PubMed 200k RCT, each short text we consider is one sentence. We focus on classifying sentences in medical abstracts, and particularly in randomized controlled trials (RCTs), as they are commonly considered to be the best source of medical evidence (Tianjing Li, 2015). Since sentences in an abstract appear in a sequence, we call this task the sequential sentence classification task, in order to distinguish it from general text or sentence classification that does not have any context.

The number of RCTs published every year is steadily increasing, as Figure 1 illustrates. Over 1 million RCTs have been published so far and around half of them are in PubMed (Mavergames, 2013), which makes it challenging for medical investigators to pinpoint the information they are looking for. When researchers search for previous literature, e.g., to write systematic reviews, they often skim through abstracts in order to quickly check whether the papers match the criteria of interest. This process is easier when abstracts are structured, i.e., the text in an abstract is divided into semantic headings such as objective, method, result, and conclusion. However, over half of published RCT abstracts are unstructured, as shown in Figure 2, which makes it more difficult to quickly access the information of interest.


Consequently, classifying each sentence of an abstract to an appropriate heading can significantly reduce time to locate the desired information, as Figure  3 illustrates. Besides assisting humans, this task may also be useful for a variety of downstream applications such as automatic text summarization, information extraction, and information retrieval. In addition to the medical applications, we hope that the release of this dataset will help the development of algorithms for sequential sentence classification.

2 Related Work

Existing datasets for classifying sentences in medical abstracts are either small, not publicly available, or do not focus on RCTs. Table  1 presents an overview of existing datasets.

The most studied dataset to our knowledge is the NICTA-PIBOSO corpus published by Kim et al. (2011). This dataset was the basis of the ALTA 2012 Shared Task (Amini et al., 2012), in which 8 competing research teams participated.

Only the dataset published in Davis-Desmond and Mollá (2012) is publicly available: two datasets can only be obtained via email inquiries, and the other datasets are not accessible (unanswered email requests or negative replies). The only public dataset is also the smallest one.

3 Dataset Construction

3.1 Abstract Selection

Our dataset is constructed upon the MEDLINE/PubMed Baseline Database published in 2016, which we will refer to as PubMed in this paper. PubMed can be accessed online by anyone, free of charge and without having to go through any registration. It contains 24,358,442 records. A record typically consists of metadata on one article, as well as the article’s title and in many cases its abstract.

We use the following information from each PubMed record of an article to build our dataset: the PubMed ID (PMID), the abstract and its structure if available, and the Medical Subject Headings (MeSH) terms. MeSH is the NLM controlled vocabulary thesaurus used for indexing articles for PubMed.

We select abstracts from PubMed based on the following two criteria:

  • The abstract must belong to an RCT. We rely only on the article’s MeSH terms to select RCTs. Specifically, only the articles with the MeSH term D016449, which corresponds to an RCT, are included in our dataset. 399,254 abstracts fit this criterion.
  • The abstract must be structured. To qualify as structured, it has to contain between 3 and 9 sections (inclusive), and it should not contain any section labeled as “None”, “Unassigned”, or “” (empty string). Only 0.5% of abstracts have fewer than 3 sections or more than 9 sections: we chose to discard these outliers. The label of each section was originally given by the authors of the articles, typically following the guidelines given by journals. As many labels exist, PubMed maps them into a smaller set of standardized labels: background, objective, methods, results, conclusions, “None”, “Unassigned”, or “” (empty string).

195,654 abstracts fit these two criteria, i.e., belong to RCTs and are structured.
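The two selection criteria can be expressed as a small filter. This is only a sketch: the record layout (a dict with `pmid`, `mesh_ids` and `section_labels` keys) is hypothetical and does not reflect how the authors parsed the PubMed baseline files.

```python
from typing import Dict, Iterable, List

RCT_MESH_ID = "D016449"                    # MeSH term for RCTs, as stated above
EXCLUDED_LABELS = {"None", "Unassigned", ""}
MIN_SECTIONS, MAX_SECTIONS = 3, 9          # inclusive bounds


def is_rct(record: Dict) -> bool:
    """Criterion 1: the article carries the RCT MeSH term."""
    return RCT_MESH_ID in record["mesh_ids"]


def is_structured(record: Dict) -> bool:
    """Criterion 2: 3 to 9 sections, none with an excluded label."""
    labels = record["section_labels"]
    if not MIN_SECTIONS <= len(labels) <= MAX_SECTIONS:
        return False
    return not any(label in EXCLUDED_LABELS for label in labels)


def select_abstracts(records: Iterable[Dict]) -> List[Dict]:
    """Keep only the records that satisfy both criteria."""
    return [r for r in records if is_rct(r) and is_structured(r)]


# Hypothetical records, for illustration only.
examples = [
    {"pmid": "1", "mesh_ids": ["D016449"],
     "section_labels": ["background", "methods", "results", "conclusions"]},
    {"pmid": "2", "mesh_ids": ["D016449"],
     "section_labels": ["background", "Unassigned", "results"]},
    {"pmid": "3", "mesh_ids": [],
     "section_labels": ["objective", "methods", "results"]},
    {"pmid": "4", "mesh_ids": ["D016449"],
     "section_labels": ["methods", "results"]},
]
print([r["pmid"] for r in select_abstracts(examples)])
```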

3.2 Dataset Split

The dataset contains 195,654 abstracts and is randomly split into three sets: a validation set containing 2,500 abstracts, a test set containing 2,500 abstracts, and a training set containing the remaining 190,654 abstracts. Since 200k abstracts may be too many for some applications, we also provide a smaller dataset, PubMed 20k RCT, which contains 15,000 abstracts for the training set, 2,500 abstracts for the validation set, and 2,500 abstracts for the test set. The 20k abstracts were chosen from the 200k abstracts by taking the most recently published ones. Table 2 presents the number of abstracts and sentences for both PubMed 20k RCT and PubMed 200k RCT, for each split of the dataset.
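A random split with these sizes might look like the following sketch. The seed and the use of integer indices in place of real PMIDs are assumptions made for illustration. The sketch covers only the 200k split, not the selection of the most recent abstracts for PubMed 20k RCT.

```python
import random
from typing import Iterable, List, Tuple


def split_ids(ids: Iterable, n_valid: int = 2500, n_test: int = 2500,
              seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle the ids, then carve off validation and test sets; the rest is training."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)   # local RNG, reproducible for a given seed
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test


# Integer indices stand in for the 195,654 selected abstracts.
train, valid, test = split_ids(range(195654))
print(len(train), len(valid), len(test))
```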

3.3 Dataset Format

The dataset is provided as three text files: one for the training set, one for the validation set, and one for the test set. Each file has the same format: each line corresponds to either a PMID or a sentence with its capitalized label at the beginning. Each token is separated by a space. Listing  1 shows an excerpt from these files.

For each abstract, sentence and token boundaries are detected using the Stanford CoreNLP toolkit (Manning et al., 2014). We provide two versions of the dataset: one with the original text, and one where digits are replaced by the character @ (at sign).
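The file format and the digit replacement can be sketched as follows. Several details are assumptions rather than facts stated in the text: that PMID lines start with `###`, that a tab separates the label from the sentence, and that the labels are the capitalized standardized section names. The sample lines are made up.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed label set: capitalized versions of the standardized section labels.
LABELS = {"BACKGROUND", "OBJECTIVE", "METHODS", "RESULTS", "CONCLUSIONS"}


def mask_digits(text: str) -> str:
    """Replace every digit with the character @."""
    return re.sub(r"\d", "@", text)


def parse_abstracts(lines: Iterable[str]) -> Dict[str, List[Tuple[str, str]]]:
    """Group (label, sentence) pairs under the PMID line that precedes them."""
    abstracts: Dict[str, List[Tuple[str, str]]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("###"):            # assumed PMID marker
            current = line[3:]
            abstracts[current] = []
            continue
        label, _, sentence = line.partition("\t")   # assumed tab separator
        if current is None or label not in LABELS:
            raise ValueError(f"unexpected line: {line!r}")
        abstracts[current].append((label, sentence))
    return abstracts


# Made-up excerpt, for illustration only.
sample = [
    "###00000000",
    "OBJECTIVE\tTo assess the treatment after 12 weeks .",
    "RESULTS\tPain decreased in 40 patients .",
    "",
]
parsed = parse_abstracts(sample)
print(parsed["00000000"][0])
print(mask_digits(parsed["00000000"][1][1]))
```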

4 Dataset Analysis

Figure 4 counts the number of sentences per label: the least common label (objective) is approximately four times less frequent than the most common label (results), which indicates that the dataset is not excessively unbalanced. Figure 5 shows the distribution of the number of tokens per sentence. Figure 6 shows the distribution of the number of sentences per abstract. Figures 4, 5 and 6 are based on PubMed 200k RCT.


5 Performance Benchmarks

We report the performance of several systems to characterize our dataset. The first baseline is a classifier based on logistic regression (LR) using n-gram features extracted from the current sentence: it does not use any information from the surrounding sentences. This baseline was implemented with scikit-learn (Pedregosa et al., 2011).
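To make "n-gram features extracted from the current sentence" concrete, here is a pure-Python sketch of n-gram counting. It is not the authors' scikit-learn pipeline; the function name and the choice of `max_n` are illustrative assumptions, and the resulting counts would still need to be fed to a classifier.

```python
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], max_n: int = 2) -> Counter:
    """Count all n-grams of length 1..max_n in a single tokenized sentence."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return counts


# The dataset separates tokens with spaces, so str.split is enough here.
features = ngram_counts("we conducted a randomized trial".split(), max_n=2)
print(sorted(features))
```

Each sentence is featurized on its own, which mirrors the point that this baseline ignores the surrounding sentences.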

The second baseline (Forward ANN) uses the artificial neural network (ANN) model presented in Lee and Dernoncourt (2016): it computes sentence embeddings for each sentence, then classifies the current sentence given a few preceding sentence embeddings as well as the current sentence embedding.

The third baseline is a conditional random field (CRF) that uses n-grams as features: each output variable of the CRF corresponds to a label for a sentence, and the sequence the CRF considers is the entire abstract. The CRF baseline therefore uses both preceding and succeeding sentences when classifying the current sentence. CRFs have been shown to give strong performance for sequential sentence classification (Amini et al., 2012). This baseline was implemented with CRFsuite (Okazaki, 2007).

The fourth baseline (bi-ANN) is an ANN consisting of three components: a token embedding layer (bi-LSTM), a sentence label prediction layer (bi-LSTM), and a label sequence optimization layer (CRF). The architecture is described in Dernoncourt et al. (2016) and has been demonstrated to yield state-of-the-art results for sequential sentence classification.

Table 3 compares the four baselines. As expected, LR performs the worst, followed by the Forward ANN. The bi-ANN outperforms the CRF, but the performance gap narrows as the dataset becomes larger.

Table  4 presents the precision, recall, F1-score and support for each class with the bi-ANN. Accurately classifying the background and objective classes is the most challenging. The confusion matrix in Table  5 shows that background sentences are often confused with objective sentences, and vice versa.
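Per-class precision, recall, F1-score and support can all be derived from a confusion matrix. The sketch below shows the arithmetic on a made-up two-class matrix; the numbers are not taken from Table 4 or Table 5, and the convention that rows are true labels and columns are predictions is an assumption.

```python
from typing import Dict, List


def per_class_metrics(confusion: List[List[int]],
                      labels: List[str]) -> Dict[str, Dict[str, float]]:
    """confusion[i][j] = sentences with true label i that were predicted as label j."""
    metrics: Dict[str, Dict[str, float]] = {}
    for k, label in enumerate(labels):
        true_positive = confusion[k][k]
        support = sum(confusion[k])                    # all sentences whose true label is k
        predicted = sum(row[k] for row in confusion)   # all sentences predicted as k
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall,
                          "f1": f1, "support": support}
    return metrics


# Made-up counts, for illustration only.
toy_labels = ["background", "objective"]
toy_confusion = [[8, 2],
                 [4, 6]]
toy_metrics = per_class_metrics(toy_confusion, toy_labels)
print(toy_metrics["background"])
```

The off-diagonal cells are where mutual confusion between two classes, such as background and objective, would show up.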

Table  6 gives more details on the LR baseline, and illustrates the impact of the choice of the n-gram size on the performance. By the same token, Table  7 shows the impact of the choice of the window size on the performance of the CRF.

6 Conclusion

In this article we have presented PubMed 200k RCT, a dataset for sequential sentence classification. It is the largest such dataset that we are aware of. We have evaluated the performance of several baselines so that researchers may directly compare their algorithms against them without having to develop their own baselines. We hope that the release of this dataset will accelerate the development of algorithms for sequential sentence classification and increase the interest of the text mining community in the study of RCTs.

  • Amini et al. (2012) Iman Amini, David Martinez, and Diego Molla. 2012. Overview of the ALTA 2012 Shared Task. In Australasian Language Technology Association Workshop 2012, volume 7, page 124.
  • Boudin et al. (2010) Florian Boudin, Jian-Yun Nie, Joan C Bartlett, Roland Grad, Pierre Pluye, and Martin Dawes. 2010. Combining classifiers for robust PICO element detection. BMC medical informatics and decision making 10(1):29.
  • Chung (2009) Grace Yuet-Chee Chung. 2009. Towards identifying intervention arms in randomized controlled trials: extracting coordinating constructions. Journal of biomedical informatics 42(5):790–800.
  • Davis-Desmond and Mollá (2012) Patrick Davis-Desmond and Diego Mollá. 2012. Detection of evidence in clinical research papers. In Proceedings of the Fifth Australasian Workshop on Health Informatics and Knowledge Management-Volume 129. Australian Computer Society, Inc., pages 13–20.
  • Dernoncourt et al. (2016) Franck Dernoncourt, Ji Young Lee, and Peter Szolovits. 2016. Neural networks for joint sentence classification in medical paper abstracts. European Chapter of the Association for Computational Linguistics (EACL) 2017.
  • Dunn (1997) Peter M Dunn. 1997. James lind (1716-94) of edinburgh and the treatment of scurvy. Archives of Disease in Childhood-Fetal and Neonatal Edition 76(1):F64–F65.
  • Hara and Matsumoto (2007) Kazuo Hara and Yuji Matsumoto. 2007. Extracting clinical trial design information from medline abstracts. New Generation Computing 25(3):263–275.
  • Hirohata et al. (2008) Kenji Hirohata, Naoaki Okazaki, Sophia Ananiadou, and Mitsuru Ishizuka. 2008. Identifying sections in scientific abstracts using conditional random fields. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I.
  • Huang et al. (2013) Ke-Chun Huang, I-Jen Chiang, Furen Xiao, Chun-Chih Liao, Charles Chih-Ho Liu, and Jau-Min Wong. 2013. Pico element detection in medical text without metadata: Are first sentences enough? Journal of biomedical informatics 46(5):940–946.
  • Huang et al. (2011) Ke-Chun Huang, Charles Chih-Ho Liu, Shung-Shiang Yang, Furen Xiao, Jau-Min Wong, Chun-Chih Liao, and I-Jen Chiang. 2011. Classification of pico elements by text features systematically extracted from pubmed abstracts. In Granular Computing (GrC), 2011 IEEE International Conference on. IEEE, pages 279–283.
  • Kim et al. (2011) Su Nam Kim, David Martinez, Lawrence Cavedon, and Lars Yencken. 2011. Automatic classification of sentences to support evidence based medicine. BMC bioinformatics 12(2):S5.
  • Krogh et al. (2016) Thøger P Krogh, Torkell Ellingsen, Robin Christensen, Pia Jensen, and Ulrich Fredberg. 2016. Ultrasound-guided injection therapy of achilles tendinopathy with platelet-rich plasma or saline a randomized, blinded, placebo-controlled trial. The American journal of sports medicine.
  • Lee and Dernoncourt (2016) Ji Young Lee and Franck Dernoncourt. 2016. Sequential short-text classification with recurrent and convolutional neural networks. In Human Language Technologies 2016: The Conference of the North American Chapter of the Association for Computational Linguistics, NAACL HLT.
  • Manning et al. (2014) Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.
  • Mavergames (2013) Chris Mavergames. 2013. The future of knowledge: Cochranetech to 2020 (and beyond). 21st Cochrane Colloquium. http://mavergames.info/
  • Meldrum (2000) Marcia L Meldrum. 2000. A brief history of the randomized controlled trial: From oranges and lemons to the gold standard. Hematology/oncology clinics of North America 14(4):745–760.
  • Okazaki (2007) Naoaki Okazaki. 2007. Crfsuite: a fast implementation of conditional random fields (CRFs). http://www.chokkan.org/software/crfsuite/
  • Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in python. Journal of Machine Learning Research 12(Oct):2825–2830.
  • Robinson (2012) David Alexander Robinson. 2012. Finding patient-oriented evidence in pubmed abstracts. Athens: University of Georgia.
  • Stolcke et al. (2000) Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational linguistics 26(3):339–373.
  • Tianjing Li (2015) Kay Dickersin Tianjing Li. 2015. Introduction to systematic review and meta-analysis. Coursera.
  • Zhao et al. (2012) Jin Zhao, Praveen Bysani, and Min-Yen Kan. 2012. Exploiting classification correlations for the extraction of evidence-based practice information. In AMIA.


BIGPATENT : A Large-Scale Dataset for Abstractive and Coherent Summarization

Eva Sharma, Chen Li, Lu Wang

  • Eva Sharma, Chen Li, and Lu Wang. 2019. BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2204–2213, Florence, Italy. Association for Computational Linguistics.

Abstractive text summarization and new large-scale datasets for agglutinative languages Turkish and Hungarian

  • Original Paper
  • Published: 10 January 2022
  • Volume 56, pages 973–1007 (2022)

  • Batuhan Baykara (ORCID: orcid.org/0000-0001-8549-6380)
  • Tunga Güngör

Due to the exponential growth in the number of documents on the Web, accessing the salient information relevant to a user's need is gaining importance, which increases the popularity of text summarization. Recent progress in deep learning has shifted research in text summarization from extractive methods towards more abstractive approaches. The research and the available resources remain mostly limited to the English language, which prevents progress in other languages. There is a need in low-resourced languages for gathering large-scale resources suitable for such tasks. In this study, we release two large-scale datasets (TR-News and HU-News) that can serve as benchmarks in the abstractive summarization task for Turkish and Hungarian. The datasets are primarily compiled for text summarization, but are also suitable for other tasks such as topic classification, title generation, and key phrase extraction. Morphology is important for these agglutinative languages since meaning is carried mostly within the morphemes of the words. We utilize these morphological properties for tokenization to retain the semantic information and reduce the vocabulary sparsity introduced by the agglutinative nature of these languages. Using the datasets compiled, we propose linguistically-oriented tokenization methods (SeperateSuffix and CombinedSuffix) and evaluate them on the state-of-the-art abstractive summarization models. The SeperateSuffix method achieves the highest ROUGE-1 score on the TR-News dataset and provides promising results on the HU-News dataset. In another experiment, we show that the multilingual cased BERT model outperforms monolingual BERT models for both languages and reaches the highest ROUGE-1 score on the HU-News dataset. Lastly, we provide qualitative analysis of the generated summaries on the TR-News dataset.


https://github.com/batubayk/datasets

https://github.com/batubayk/MorphologicalTokenizers

https://github.com/ahmetaa/zemberek-nlp

https://github.com/dlt-rilmta/purepospy

https://github.com/dlt-rilmta/emmorphpy

https://duc.nist.gov/duc2003

https://duc.nist.gov/duc2004

https://scrapy.org/

https://www.mongodb.com/

https://github.com/batubayk/newscrawler

https://github.com/nlpyang/PreSumm

https://www.nltk.org/

https://nlp.stanford.edu/software/

Akın, A. A., & Akın, M. D. (2007). Zemberek, an open source NLP framework for Turkic languages. Structure, 10 , 1–5.

Google Scholar  

Anand, D., & Wagh, R. (2019). Effective deep learning approaches for summarization of legal texts. Journal of King Saud University-Computer and Information Sciences . https://doi.org/10.1016/j.jksuci.2019.11.015

Article   Google Scholar  

Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In: Proceedings of the international conference on learning representations (ICLR) , 2015.

Beke, A., & Szaszák, G. (2016). Automatic summarization of highly spontaneous speech. In: Speech and Computer—18th International Conference, SPECOM 2016, Budapest, Hungary, August 23-27, 2016, Proceedings, volume 9811 of Lecture Notes in Computer Science , pp. 140–147. Springer.

Bostrom, K., & Durrett, G. (2020). Byte pair encoding is suboptimal for language model pretraining. In: Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020 , pp. 4617–4624, Online, Association for Computational Linguistics.

Çığır, C., Kutlu, M., & Çiçekli, İ. (2009). Generic text summarization for Turkish. In: ISCIS , pp. 224–229. IEEE

Çelikyılmaz, A., Bosselut, A., He, X., & Choi, Y. (2018). Deep communicating agents for abstractive summarization. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , vol. 1 (Long Papers), pp. 1662–1675, Association for Computational Linguistics.

Cheng, J., Dong, L., & Lapata, M. (2016). Long Short-Term Memory-Networks for machine reading. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing , pp. 551–561. Association for Computational Linguistics.

Cho, K., van Merriënboer, B., Gülçehre, c., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio,Y. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pp. 1724–1734. Association for Computational Linguistics.

Chopra, S., Auli, M., & Rush, A. M. (2016). Abstractive sentence summarization with attentive recurrent neural networks. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pp. 93–98, June 2016. Association for Computational Linguistics.

Creutz, M., & Lagus, K. (2005). Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. In: Helsinki University of Technology

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics.

Döbrössy, B., Makrai, M., Tarján, B., & Szaszák, G. (2019). Investigating sub-word embedding strategies for the morphologically rich and free phrase-order Hungarian. In: Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019) , pp. 187–193

Duchi, J. C., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12 , 2121–2159.

Edmundson, H. P. (1969). New methods in automatic extracting. Journal of ACM, 16 (2), 264–285.

Erguvanlı, E. E., & Taylan, E. E. (1984). The function of word order in Turkish grammar (Vol. 106, p. 1984). University of California Press.

Eşref, Y., & Can, B. (2019). Using morpheme-level attention mechanism for Turkish sequence labelling. In: 2019 27th Signal Processing and Communications Applications Conference (SIU) , pp. 1–4. IEEE.

Gehrmann, S., Deng, Y., & Rush, A. (2018). Bottom-up abstractive summarization. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pp. 4098–4109. Association for Computational Linguistics.

Gehrmann, S., Ziegler, Z., & Rush, A. (2019). Generating abstractive summaries with finetuned language models. In: Proceedings of the 12th International Conference on Natural Language Generation , pp. 516–522. Association for Computational Linguistics.

Güngör, O., Güngör, T., & Üsküdarlı, S. (2019). The effect of morphology in named entity recognition with sequence tagging. Natural Language Engineering, 25 (1), 147–169.

Güran, A., Bayazit, N. G., & Bekar, E. (2011). Automatic summarization of Turkish documents using non-negative matrix factorization. In: 2011 International Symposium on Innovations in Intelligent Systems and Applications , pp. 480–484. IEEE

Güran, A., Bayazit, N. G., & Gürbüz, M. Z. (2013). Efficient feature integration with Wikipedia-based semantic feature extraction for Turkish text summarization. Turkish Journal of Electrical Engineering & Computer Sciences, 21 (5), 1411–1425.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9 (8), 1735–1780.

Huck, M., Riess, S., & Fraser, A. (2017). Target-side word segmentation strategies for neural machine translation. In: Proceedings of the Second Conference on Machine Translation , pp. 56–67.

Indig, B., Sass, B., Simon, E., Mittelholcz, I., Kundráth, P., & Vadász, N. (2019a). emtsv—egy formátum mind felett. In: XV. Magyar Számítógépes Nyelvészeti Konferencia (MSZNY 2019) , pp. 235–247. Szegedi Tudományegyetem Informatikai Tanszékcsoport.

Indig, B., Sass, B., Simon, E., Mittelholcz, I., Vadász, N., & Makrai, M. (2019b). One format to rule them all—the emtsv pipeline for Hungarian. In: Proceedings of the 13th Linguistic Annotation Workshop , pp. 155–165. Association for Computational Linguistics.

Kettunen, K. (2014). Can type-token ratio be used to show morphological complexity of languages? Journal of Quantitative Linguistics, 21 (3), 223–245. https://doi.org/10.1080/09296174.2014.911506

Kiefer, F. (1997). On emphasis and word order in Hungarian (Vol. 76). Psychology Press.

Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings

Körtvélyessy, L. (2017). Essentials of language typology . UPJŠ.

Kryściński, W., Paulus, R., Xiong, C., & Socher, R. (2018). Improving abstraction in text summarization. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pp. 1808–1817. Association for Computational Linguistics.

Kudo, T. (2018). Subword regularization: Improving neural network translation models with multiple subword candidates. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pp. 66–75. Association for Computational Linguistics.

Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. Text summarization branches out (pp. 74–81). Association for Computational Linguistics.

Liu, Y., & Lapata, M. (2019). Text summarization with pretrained encoders. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , pp. 3730–3740, Nov. 2019. Association for Computational Linguistics.

Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2 (2), 159–165.

Luong, T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing , pp. 1412–1421, Sept. 2015. Association for Computational Linguistics.

McKeown, K. R., Barzilay, R., Evans, D., Hatzivassiloglou, V., Klavans, J. L., Nenkova, A., Sable, C., Schiffman, B., & Sigelman, S. (2002). Tracking and summarizing news on a daily basis with Columbia's Newsblaster. In: Proceedings of the Human Language Technology Conference , pp. 280–285, 2002.

Nallapati, R., Zhou, B., dos Santos, C., Gülçehre, Ç., & Xiang, B. (2016). Abstractive text summarization using sequence-to-sequence RNNs and beyond. In: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning , pp. 280–290. Association for Computational Linguistics.

Narayan, S., Cohen, S. B., & Lapata, M. (2018). Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pp. 1797–1807. Association for Computational Linguistics.

Nemeskey, D. M. (2017a). Egy emBERT próbáló feladat. In: XVI. Magyar Számítógépes Nyelvészeti Konferencia (MSZNY2020) , pp. 409–418, 2020.

Nemeskey, D. M. (2017b). emMorph a Hungarian language modeling baseline. In: XIII. Magyar Számítógépes Nyelvészeti Konferencia (MSZNY2017) , pp. 91–102.

Nemeskey, D. M. (2017c). Natural Language Processing Methods for Language Modeling. PhD thesis, Eötvös Loránd University, 2020.

Nenkova, A., & McKeown, K. R. (2012). A survey of text summarization techniques. In: Mining Text Data , pp. 43–76. Springer, 2012.

Oflazer, K. (2014). Turkish and its challenges for language processing. Language Resources and Evaluation, 48 (4), 639–653.

Özsoy, M. G., Çiçekli, İ, & Alpaslan, F. N. (2010). Text summarization of Turkish texts using latent semantic analysis. In: Proceedings of the 23rd International Conference on Computational Linguistics, COLING ’10 , pp. 869-876, 2010. Association for Computational Linguistics.

Pan, Y., Li, X., Yang, Y., & Dong, R. (2020). Morphological word segmentation on agglutinative languages for neural machine translation. CoRR, arxiv:2001.01589

Paulus, R., Xiong, C., & Socher, R. (2018). A deep reinforced model for abstractive summarization. In: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings . OpenReview.net, 2018.

Pembe, F. C., & Güngör, T. (2008). Towards a new summarization approach for search engine results: An application for Turkish. In: Proceedings of the 2008 23rd International Symposium on Computer and Information Sciences , pp. 1–6. IEEE, 2008.

Porter, M. F. (2006). An algorithm for suffix stripping. Program, 40 (3), 211–218.

Rush, A. M., Chopra, S., & Weston, J. (2015). A neural attention model for abstractive sentence summarization. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing , pp. 379–389, Sept. 2015. Association for Computational Linguistics.

Sandhaus, E. (2008). The New York Times annotated corpus. Linguistic Data Consortium, Philadelphia, 6 (12), 2008.

Schweter, S. (2020). BERTurk—BERT models for Turkish. https://doi.org/10.5281/zenodo.3770924

Scialom, T., Dray, P.-A., Lamprier, S., Piwowarski, B., & Staiano, J. (2020). MLSUM: The multilingual summarization corpus. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pp. 8051–8067, Online, Nov. 2020. Association for Computational Linguistics.

See, A., Liu, P. J., & Manning, C.D. (2017). Get to the point: Summarization with pointer-generator networks. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pp. 1073–1083. Association for Computational Linguistics.

Sennrich, R., Haddow, B., & Birch, A. (2016). Neural machine translation of rare words with subword units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pp. 1715–1725, Aug. 2016. Association for Computational Linguistics.

Simon, E., Indig, B., Kalivoda, A., Mittelholcz, I., Sass, B., & Vadasz, N. (2020). Újabb fejlemények az e-magyar háza táján. In: XVI. Magyar Számítógépes Nyelvészeti Konferencia , pp. 29–42. Szegedi Tudományegyetem Informatikai Tanszékcsoport.

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In: Proceedings of the Advances in Neural Information Processing Systems , vol. 27. Curran Associates, Inc., 2014.

Tawfik, A., Emam, M., Essam, K., Nabil, R., & Hassan, H. (2019). Morphology-aware word-segmentation in dialectal Arabic adaptation of neural machine translation. In: Proceedings of the Fourth Arabic Natural Language Processing Workshop , pp. 11–17, Aug. 2019. Association for Computational Linguistics.

Tündik, M. Á., Kaszás, V., & Szaszák, G. (2019). Assessing the semantic space bias caused by ASR error propagation and its effect on spoken document summarization. In: Proceedings of the Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019 , pp. 1333–1337. ISCA, 2019.

Turpin, A., Tsegay, Y., Hawking, D., & Williams, H. E. (2007). Fast generation of result snippets in web search. In: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval , pp. 127–134.

Üstün, A., Kurfalı, M., & Can, B. (2018). Characters or morphemes: How to represent words? In: Proceedings of The Third Workshop on Representation Learning for NLP , pp. 144–153. Association for Computational Linguistics.

Váradi, T., Simon, E., Sass, B., Gerőcs, M., Mittelholtz, I., Novák, A., Indig, B., Prószéky, G., & Vincze, V. (2017). Az e-magyar digitális nyelvfeldolgozó rendszer. In: XIII. Magyar Számítógépes Nyelvészeti Konferencia (MSZNY 2017) , pp. 49–60, 2017. Szegedi Tudományegyetem Informatikai Tanszékcsoport.

Váradi, T., Simon, E., Sass, B., Mittelholcz, I., Novák, A., Indig, B., Farkas, R., & Vincze, V. (2018). E-magyar – A Digital Language Processing System. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) , May 7–12. European Language Resources Association (ELRA). ISBN 979-10-95546-00-9.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In: Proceedings of the Advances in Neural Information Processing Systems , vol. 30. Curran Associates, Inc.

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., & Dean, J. (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, arxiv:1609.08144

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., & Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In: Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research , pp. 2048–2057, 07–09 Jul 2015. PMLR.

Author information

Authors and affiliations.

Department of Computer Engineering, Boğaziçi University, Bebek, 34342, Istanbul, Turkey

Batuhan Baykara & Tunga Güngör

Corresponding author

Correspondence to Batuhan Baykara .

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

See Tables 14 and 15 .

About this article

Baykara, B., Güngör, T. Abstractive text summarization and new large-scale datasets for agglutinative languages Turkish and Hungarian. Lang Resources & Evaluation 56 , 973–1007 (2022). https://doi.org/10.1007/s10579-021-09568-y

Accepted : 12 November 2021

Published : 10 January 2022

Issue Date : September 2022

DOI : https://doi.org/10.1007/s10579-021-09568-y

  • Abstractive text summarization
  • Morphological tokenization
  • Agglutinative languages
  • Turkish and Hungarian datasets

Support for data sets associated with arXiv articles

arXiv is primarily an archive and distribution service for research articles . arXiv provides support for data sets and other ancillary materials only in direct connection with research articles submitted.

arXiv supports the inclusion of ancillary files of modest size with articles. If you are including multiple page datasets or code with your submission please use the ancillary file option rather than embed them in the full text. The ancillary files are stored in the source package on arXiv and facilities are available to download either the entire source package or individual files. The ability to add ancillary files is available as part of the normal arXiv submission process .

  • Data Descriptor
  • Open access
  • Published: 04 April 2024

Globe-LFMC 2.0, an enhanced and updated dataset for live fuel moisture content research

  • Marta Yebra   ORCID: orcid.org/0000-0002-4049-9315 1 , 2 ,
  • Gianluca Scortechini   ORCID: orcid.org/0000-0002-0149-4028 1 ,
  • Karine Adeline 3 ,
  • Nursema Aktepe 4 ,
  • Turkia Almoustafa 5 , 6 ,
  • Avi Bar-Massada   ORCID: orcid.org/0000-0002-8331-0391 7 ,
  • María Eugenia Beget 8 ,
  • Matthias Boer   ORCID: orcid.org/0000-0001-6362-4572 9 ,
  • Ross Bradstock 10 ,
  • Tegan Brown 11 ,
  • Francesc Xavier Castro 12 ,
  • Rui Chen 13 ,
  • Emilio Chuvieco   ORCID: orcid.org/0000-0001-5618-4759 14 ,
  • Mark Danson 5 ,
  • Cihan Ünal Değirmenci 15 ,
  • Ruth Delgado-Dávila 16 , 17 ,
  • Philip Dennison   ORCID: orcid.org/0000-0002-0241-1917 18 ,
  • Carlos Di Bella   ORCID: orcid.org/0000-0001-7044-0931 19 ,
  • Oriol Domenech 20 ,
  • Jean-Baptiste Féret 21 ,
  • Greg Forsyth 22 ,
  • Eva Gabriel 12 ,
  • Zisis Gagkas   ORCID: orcid.org/0000-0002-9477-4407 23 ,
  • Fatma Gharbi 24 ,
  • Elena Granda 25 ,
  • Anne Griebel 9 , 26 ,
  • Binbin He 13 ,
  • Matt Jolly 27 ,
  • Ivan Kotzur 9 ,
  • Tineke Kraaij   ORCID: orcid.org/0000-0002-8891-2869 28 ,
  • Agnes Kristina 29 ,
  • Pınar Kütküt 15 ,
  • Jean-Marc Limousin 30 ,
  • M. Pilar Martín 31 ,
  • Antonio T. Monteiro 32 , 33 ,
  • Marco Morais 34 ,
  • Bruno Moreira 35 ,
  • Florent Mouillot 36 ,
  • Samukelisiwe Msweli   ORCID: orcid.org/0000-0002-3396-2316 37 ,
  • Rachael H. Nolan   ORCID: orcid.org/0000-0001-9277-5142 9 ,
  • Grazia Pellizzaro 38 ,
  • Yi Qi 39 , 40 ,
  • Xingwen Quan   ORCID: orcid.org/0000-0001-5344-1801 13 ,
  • Victor Resco de Dios   ORCID: orcid.org/0000-0002-5721-1656 41 ,
  • Dar Roberts   ORCID: orcid.org/0000-0002-3555-4842 34 ,
  • Çağatay Tavşanoğlu   ORCID: orcid.org/0000-0003-4447-6492 15 ,
  • Andy F. S. Taylor 42 ,
  • Jackson Taylor 1 ,
  • İrem Tüfekcioğlu 15 ,
  • Andrea Ventura 38 &
  • Nicolas Younes Cardenas 1  

Scientific Data volume 11, Article number: 332 (2024)

  • Ecophysiology
  • Natural variation in plants
  • Plant physiology

Globe-LFMC 2.0, an updated version of Globe-LFMC, is a comprehensive dataset of over 280,000 Live Fuel Moisture Content (LFMC) measurements. These measurements were gathered through field campaigns conducted in 15 countries spanning 47 years. In contrast to its prior version, Globe-LFMC 2.0 incorporates over 120,000 additional data entries, introduces more than 800 new sampling sites, and comprises LFMC values obtained from samples collected until the calendar year 2023. Each entry within the dataset provides essential information, including date, geographical coordinates, plant species, functional type, and, where available, topographical details. Moreover, the dataset encompasses insights into the sampling and weighing procedures, as well as information about land cover type and meteorological conditions at the time and location of each sampling event. Globe-LFMC 2.0 can facilitate advanced LFMC research, supporting studies on wildfire behaviour, physiological traits, ecological dynamics, and land surface modelling, whether remote sensing-based or otherwise. This dataset represents a valuable resource for researchers exploring the diverse LFMC aspects, contributing to the broader field of environmental and ecological research.

Background & Summary

Live Fuel Moisture Content (LFMC), a critical parameter in fire-related research, quantifies the vegetation water content. It is computed as:

$$\mathrm{LFMC}\,(\%) = \frac{W_f - W_d}{W_d} \times 100$$

where $W_f$ represents the weight of fresh plant material, measured post-sample collection, and $W_d$ indicates the weight of the same sample after thorough drying, often in an oven.
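As a small illustration, the calculation can be sketched in Python, assuming the standard gravimetric definition in which LFMC is the water mass expressed as a percentage of the dry mass, that is, (fresh weight minus dry weight) divided by dry weight, times 100. The function name `lfmc_percent` and the input checks are illustrative choices, not part of the dataset or its tooling.

```python
def lfmc_percent(fresh_weight: float, dry_weight: float) -> float:
    """Return LFMC as a percentage of dry weight.

    Both weights must be in the same unit (e.g. grams).
    """
    if dry_weight <= 0:
        raise ValueError("dry_weight must be positive")
    if fresh_weight < dry_weight:
        # A sample cannot weigh less fresh than after drying.
        raise ValueError("fresh_weight must not be smaller than dry_weight")
    return (fresh_weight - dry_weight) / dry_weight * 100.0
```

For example, a sample weighing 25 g fresh and 10 g after drying has an LFMC of 150%, which shows that values above 100% are normal for live vegetation.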

Numerous studies have demonstrated LFMC’s influence on various wildfire metrics, including flammability, rate of spread, fire occurrence and cumulative burnt area 1 , 2 , 3 , 4 , 5 . Growing interest surrounds the exploration of LFMC dynamics in relation to ecological, meteorological and ecophysiological parameters 6 , 7 , 8 , 9 , 10 , especially within the context of a changing climate 11 .

However, conducting fieldwork, collecting measurements, and recording data can be costly, time-consuming, and resource-intensive. Ready access to an existing LFMC dataset therefore helps advance research. As a result, several LFMC datasets 12 , 13 , including the 2019 version of Globe-LFMC 14 , have emerged online.

Globe-LFMC 2.0 15 , presented herein and accessible at the figshare repository, represents an updated version of the 2019 release. It incorporates previously published datasets and adds more than 120,000 additional measurements hitherto unavailable to the research community.

This extensive dataset comprises over 280,000 LFMC values derived from samples gathered at more than 2,000 locations across 15 countries. It includes data from more than 500 different species or combinations of species. The timeframe of the data spans from 1977 to 2023 (Tables  1 , 2 , Fig.  1 ).

figure 1

Locations of sampling sites. The sampling sites are represented as coloured points on the map, with the colour intensity indicating the abundance of LFMC values collected at each location. To enhance clarity, points have been ordered on the z-axis based on the number of LFMC samples, with sites having fewer samples placed beneath those with a higher data count. Predominantly, the sampling sites and LFMC measurements are concentrated in the USA, France, and Spain. The base map for this figure is derived from NASA’s Visible Earth ‘Explorer Base Map’ 30 .

The compilation process included formatting source data, performing rigorous and recursive quality checks, merging data from co-authors, and introducing supplementary information. Notably, each data point now includes land cover type and meteorological variables, aligned with the sampling date and location. An outlier detection analysis was executed, and its findings are presented (Fig.  2 ).

figure 2

Workflow followed to compile Globe-LFMC 2.0 15 .

Globe-LFMC 2.0 15 presents two significant enhancements over its predecessor. First, it incorporates a large number of LFMC measurements from individual samples, broadening its coverage across various geographic and climatic conditions. Second, it includes additional descriptor variables per sample (Tables 3 , 4 ) and rectifies inaccuracies and typos that may have been present in the previous version. These improvements not only increase the comprehensiveness of the dataset but also enhance its adaptability for end-users, allowing them to process the data and aggregate the samples as they see fit.

Globe-LFMC 2.0 15 applications are manifold. Researchers can employ it to develop and validate models for LFMC estimation from remote sensing data 16 , 17 , or for other types of land surface modelling, such as those derived from climate variables 3 . It is equally valuable in investigating the relationship between LFMC and wildfire occurrence and behaviour, as well as its associations with other plant water status metrics, meteorological parameters and ecological drivers.

In conclusion, as we plan to keep the dataset updated and publish future versions, we invite researchers and other interested parties to contact us if they wish to contribute.

Compilation of Live Fuel Moisture Content measurements

Globe-LFMC 2.0 15 is the result of collaborative efforts involving international researchers and agencies, incorporating data from multiple sources, including publicly available datasets 12 , 13 , 14 , 18 , 19 .

The authors meticulously adapted their datasets to conform to the template spreadsheets, aligning with the structured format of Globe-LFMC 2.0 15 (a comprehensive breakdown of the dataset fields is available in the Data Records section). These refined spreadsheets were subsequently integrated into a unified dataset, following a rigorous visual quality check. This check was essential to verify data integrity, and rectify any typographical errors, formatting inconsistencies and obsolete information to ensure the dataset’s reliability and accuracy.

LFMC values in this dataset were derived from destructive measurements of plant materials obtained during field sampling. While sampling and weighing protocols varied among contributors, the common procedure involved weighing fresh plant material, typically leaves, either in the field or a laboratory after secure transportation in a sealed bag or container. Subsequently, the samples were oven-dried for several hours at a minimum of 60 °C and re-weighed. Sampling details, including location, date, and sometimes the time of sampling, as well as specific sampling protocols, were meticulously recorded.

Unlike the previous version of Globe-LFMC 14 , efforts were made to avoid data aggregation and preserve individual sample measurements wherever possible. This means that values corresponding to the same combination of species, sampling location, and date were not averaged together. In cases where data from the 2019 version of the dataset were included, averages were replaced with the original individual measurements, when available.

Entries that remained as mean LFMC values for multiple measurements were flagged in a dedicated dataset column.

A comprehensive review of the 2019 dataset was undertaken to rectify typos and inaccuracies, encompassing species names, protocol details, and, in a limited number of instances, sampling dates (a list of the dates changed is available at the figshare repository 15 ).

The US National Fuel Moisture Database (NFMD) 19 was redownloaded from the original source, leading to differences from the previous Globe-LFMC 14 version. Some data entries were added, others were removed. Dead Fuel Moisture measurements were excluded, while all LFMC values were retained, irrespective of whether they were later identified as outliers during the quality check. The decision not to delete these values was due to impracticality in contacting the original data providers for further investigation.

After compiling all data sources, extensive efforts were made to harmonize the diverse datasets, ensuring uniformity and consistency across Globe-LFMC 2.0 15 . In cases where the same site name was associated with different coordinates, we introduced unique identifiers at the end of the name to distinguish them. Conversely, when identical coordinates were linked to multiple sites, their names were merged. This meticulous process culminated in a dataset where each site name corresponded exclusively to one set of coordinates, and vice versa, fostering data integrity and precision.
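The two harmonisation rules described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `harmonise_sites`, the numeric `_1`, `_2` suffixes and the `" / "` separator used to merge names are hypothetical, since the text does not specify the format of the identifiers or merged names. The sketch also does not guard against a suffixed name colliding with a name that already exists.

```python
from collections import defaultdict


def harmonise_sites(records):
    """Return harmonised site names for (site_name, lat, lon) records.

    The output list has the same order and length as the input.
    """
    # Rule 1: one name used with several coordinates -> add a suffix.
    coords_by_name = defaultdict(list)
    for name, lat, lon in records:
        if (lat, lon) not in coords_by_name[name]:
            coords_by_name[name].append((lat, lon))

    renamed = []
    for name, lat, lon in records:
        coords = coords_by_name[name]
        if len(coords) > 1:
            # Suffix is the order in which the coordinates first appeared.
            name = f"{name}_{coords.index((lat, lon)) + 1}"
        renamed.append((name, lat, lon))

    # Rule 2: several names sharing one coordinate pair -> merge the names.
    names_by_coord = defaultdict(list)
    for name, lat, lon in renamed:
        if name not in names_by_coord[(lat, lon)]:
            names_by_coord[(lat, lon)].append(name)

    return [" / ".join(names_by_coord[(lat, lon)]) for _, lat, lon in renamed]
```

After both passes, every name maps to one coordinate pair and every coordinate pair maps to one name, which is the one-to-one property the text describes.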

Land cover data

Land cover type information was also added to the final dataset following the IGBP classification from LP DAAC MCD12Q1.061 (MODIS/Terra + Aqua Land Cover Type Yearly L3 Global 500 m SIN Grid) 20 .

The process started by downloading the complete set of MCD12Q1.061 sinusoidal tiles products spanning the years 2001 to 2022. Subsequently, these tiles were mosaicked into yearly raster images at a spatial resolution of 500 m within the WGS84 reference system.

For each LFMC value, the mosaic corresponding to the respective calendar year was employed to retrieve the land cover ID by selecting the pixel value at the precise sampling location. Additionally, the descriptive land cover name (e.g., “Grasslands”) was incorporated into the dataset.

Given that the available land cover time series extended from 2001 to 2022, the land cover type of 2001 was attributed to all samples collected before 2001, as it most closely represented the respective sampling date. Similarly, for samples collected after 2022, the land cover type of the year 2022 was assigned. This method ensured consistent land cover information across all samples.
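The year-matching rule amounts to clamping the sampling year to the range covered by the land cover series. A minimal sketch, in which the function and constant names are illustrative only:

```python
# First and last years of the land cover time series, as stated in the text.
FIRST_LAND_COVER_YEAR = 2001
LAST_LAND_COVER_YEAR = 2022


def land_cover_year(sampling_year: int) -> int:
    """Return the land cover mosaic year to use for a given sampling year.

    Years before the series are mapped to its first year and years after it
    to its last year; years inside the series are used unchanged.
    """
    return min(max(sampling_year, FIRST_LAND_COVER_YEAR), LAST_LAND_COVER_YEAR)
```

A sample from 1977 would therefore be matched to the 2001 mosaic and a sample from 2023 to the 2022 mosaic.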

Meteorological data

Meteorological data were sourced from AgERA5 (Agrometeorological indicators from 1979 to present derived from reanalysis). AgERA5 is a high-level product built upon ERA5 data, which were aggregated to obtain daily values and downscaled to 0.1° × 0.1° spatial resolution 21 .

The initial step involved downloading NetCDF files containing specific meteorological variables: total daily precipitation, relative humidity at 2 m above surface at four distinct times (6 am, 9 am, 12 pm and 3 pm), maximum daily air temperature at 2 m above surface, mean daily air temperature, mean daily vapour pressure, mean daily wind speed at 10 m above surface and mean daily dewpoint temperature at 2 m above surface.

Subsequently, the values for each meteorological variable were extracted from the downloaded files at the date and location of each entry in the dataset.

Additionally, cumulative precipitation data for the preceding 3 days, 1 week, 4 weeks, and 12 weeks before the sampling date was included in the final dataset.
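The cumulative precipitation windows can be illustrated with a short sketch. It assumes that each window covers the N days strictly before the sampling date (excluding the sampling day itself) and that daily totals are already available as a mapping from date to millimetres. Neither detail is stated in the text, and the names `WINDOWS_DAYS` and `cumulative_precipitation` are hypothetical.

```python
from datetime import date, timedelta

# Window lengths in days: 3 days, 1 week, 4 weeks and 12 weeks.
WINDOWS_DAYS = {"3_days": 3, "1_week": 7, "4_weeks": 28, "12_weeks": 84}


def cumulative_precipitation(daily_precip, sampling_date):
    """Sum daily precipitation over each window preceding sampling_date.

    daily_precip maps datetime.date to the daily total (e.g. in mm).
    A KeyError is raised if a required day is missing, rather than
    silently treating it as zero.
    """
    totals = {}
    for label, n_days in WINDOWS_DAYS.items():
        # Assumption: the sampling day itself is not part of the window.
        days = [sampling_date - timedelta(days=k) for k in range(1, n_days + 1)]
        totals[label] = sum(daily_precip[d] for d in days)
    return totals
```

If the sampling day should be included in the window, the range would start at 0 instead of 1.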

Detection of possible outliers

The process of identifying potential outliers within LFMC values consisted of a two-step strategy, combining both manual inspection and the application of two distinct statistical models.

We define outliers as values that deviate notably from the norm, being either anomalously high or low. Such deviations may arise from measurement inaccuracies due to instrument or human errors. Additional context regarding the interpretation of outlier detection is available in the “Technical Validation” section.

Step 1: Manual Inspection and Data Provider-Specific Methods

In the initial phase, when possible, data providers meticulously examined each dataset comprising Globe-LFMC 2.0 15 . Since these datasets varied significantly in structure, the authors customized outlier detection methods for each. The outcomes of this initial assessment were documented in the “Extra information/Quality Flag” column of the dataset. The methods used in this step were tailored to the specific dataset’s characteristics, involving visual inspections, percentile-based, or standard deviation-based approaches to identify outliers.

Step 2: Statistical Model-Based Outlier Detection

The second approach leveraged the Isolation Forest algorithm 22 , a tool that utilises binary trees to identify data points as outliers via random splits in the dataset. Fewer splits required to isolate a data point indicate a higher likelihood that it is an outlier. The implementation of this method was conducted through Python’s Scikit-Learn 1.3.0 library 23 as illustrated in Fig.  3 . Isolation Forest analysis was executed on separate data subsets categorised by species. The variables integrated into the models were time, latitude, longitude, and LFMC to account for both variations among local populations of the same species and fluctuations in time series data.

figure 3

Decision diagram explaining the outlier detection method based on the Isolation Forest algorithm.

Due to the unsupervised nature of this model, hyperparameters were predefined with a theoretical approach as the true anomalies were unknown. Key hyperparameters included the number of trees (“n_estimators”) set at 10,000, which was considered sufficient for building a precise model without excessive computational demands. Additionally, “max_samples” was set at 75% of each subset total data points to facilitate the detection of outlier clusters. The inclusion of bootstrap, “max_features” set at 4, and a contamination ratio of 0.05 was determined based on a conservative assessment of the data following preliminary visual examination.
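For orientation, the hyperparameters listed above map onto the constructor arguments of scikit-learn's `IsolationForest` roughly as in the sketch below. This is not the authors' script: the helper name `build_isolation_forest` is hypothetical, and passing the 75% subset fraction as `max_samples=0.75` is an assumption about how that setting was expressed.

```python
from sklearn.ensemble import IsolationForest


def build_isolation_forest(n_features, random_state, n_estimators=10_000):
    """Create an Isolation Forest configured with the settings in the text.

    n_features is 4 for the model using time, latitude, longitude and LFMC.
    n_estimators can be lowered for quick experiments.
    """
    return IsolationForest(
        n_estimators=n_estimators,  # number of trees
        max_samples=0.75,           # fraction of the subset drawn per tree (assumed form)
        bootstrap=True,             # sample with replacement
        max_features=n_features,    # use all supplied variables in each tree
        contamination=0.05,         # expected share of anomalies
        random_state=random_state,  # fixes the random splits for reproducibility
    )
```

Calling `fit_predict(X)` on the returned model gives -1 for points the model treats as anomalies and 1 for the rest, where `X` would hold one row per measurement of a given species.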

A potential limitation of this approach is its propensity to identify data points as anomalies when they are isolated in time or space, even if their LFMC values are within the expected range. To minimize this risk, a complementary model was simultaneously applied to the same subsets. This model specifically focused on time, latitude, and longitude, with a “max_features” setting adjusted to 3. It aimed to detect data points isolated independently of their LFMC values. The anomalies identified by this secondary model were then subtracted from those found by the LFMC-inclusive model, producing a refined list of anomalous LFMC values.

Given the stochastic nature of the Isolation Forest algorithm, five model versions were created (both including and excluding LFMC as a variable), each employing different random states. A data point was designated as a possible outlier only if all five LFMC-inclusive models identified it as isolated, and at least one of the five models without LFMC did not isolate it, as depicted in Fig.  3 . Data points isolated by all models with and without LFMC were not classified as outliers, as their isolation could be attributed to time or spatial factors unrelated to their LFMC values.
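The decision rule in this paragraph can be written compactly. In the sketch below, each argument is a sequence of booleans (one per random state) indicating whether the corresponding model isolated the data point; the function name is illustrative.

```python
def is_possible_outlier(flags_with_lfmc, flags_without_lfmc):
    """Apply the consensus rule to one data point.

    flags_with_lfmc: isolation flags from the models that include LFMC.
    flags_without_lfmc: isolation flags from the models using only time,
    latitude and longitude.
    """
    isolated_by_all_lfmc_models = all(flags_with_lfmc)
    # If every model without LFMC also isolates the point, its isolation is
    # explained by time or location alone, so it is not flagged.
    isolated_by_all_space_time_models = all(flags_without_lfmc)
    return isolated_by_all_lfmc_models and not isolated_by_all_space_time_models
```

A point is flagged only when all LFMC-inclusive models agree and at least one model without LFMC does not isolate it.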

The results of all ten models, along with their respective scores, are provided in the figshare repository 15 .

Moreover, the repository contains results from an alternative outlier detection method: Cook’s Distance 24 , which gauges the influence of a data point on a regression line. This analysis was conducted using the Python library statsmodels 0.14.0 25 . It involved grouping data points by species and sampling location, calculating ordinary least squares regression, and comparing Cook’s Distance scores to the “4/n” threshold (where “n” stands for the number of observations within a group of samples), commonly used to identify influential data points 26 , 27 . An additional criterion was considered, flagging data points with Cook’s Distance values more than three times the mean Cook’s Distance of data points in the same group.

In cases where Cook’s Distance resulted in NaN (not a number) or infinite values, “NA” (not available) was assigned to all data points within the same group.
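To make the two flagging criteria concrete, the sketch below computes Cook's Distance by hand for a simple linear regression with one predictor. It is a dependency-free illustration, not the statsmodels-based analysis described above, and the choice of predictor (for example, time expressed as a number) is an assumption, since the text does not name the regression variables. Returning `None` for a degenerate group stands in for the "NA" assignment.

```python
import math


def cooks_distance_flags(x, y):
    """Flag influential points in one group with Cook's Distance.

    Returns a list of (above_4_over_n, above_3_times_mean) tuples, one per
    observation, or None when the distances are undefined for the group.
    """
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    if n <= p:
        return None
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0:
        return None  # all x identical: slope undefined

    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in residuals) / (n - p)
    # Leverage of each observation in simple linear regression.
    leverages = [1 / n + (xi - x_mean) ** 2 / sxx for xi in x]
    if mse == 0 or any(h >= 1 for h in leverages):
        return None  # would give NaN or infinite distances

    distances = [
        e * e / (p * mse) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverages)
    ]
    if not all(math.isfinite(d) for d in distances):
        return None

    threshold = 4 / n
    mean_distance = sum(distances) / n
    return [(d > threshold, d > 3 * mean_distance) for d in distances]
```

With statsmodels, the same distances are available from a fitted OLS model through `get_influence().cooks_distance`.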

Data Records

The Globe-LFMC 2.0 dataset 15 is available in an MS Excel file containing three sheets: “Contact” (Table  5 ), “LFMC Data” (Tables  3 , 4 ) and “Protocol” (Table  6 ). The primary dataset is located within the “LFMC Data” sheet, which contains the core LFMC values along with associated information. The “Contact” sheet offers supplementary details regarding the contact person responsible for each sub dataset, facilitating direct communication and inquiries related to the data. In the “Protocol” sheet, a comprehensive description of the sampling and weighing procedures employed to obtain the LFMC measurements is presented, providing essential context for data interpretation.

Accompanying the dataset, additional files are provided for reference and extra data. In these files it is possible to retrieve all the outcomes generated from the outlier detection procedures, offering transparency and insight into data quality control, as well as the references to the original sources and datasets incorporated into Globe-LFMC 2.0 15 . The files are equipped with column descriptions where needed, enhancing the accessibility and usability of the dataset.

Technical Validation

A rigorous quality check of the LFMC data was conducted individually by each contributing author, as outlined in the Methods section. Furthermore, to ensure data integrity, two outlier detection methods, the Isolation Forest and the Cook’s Distance, were applied across the entire dataset (see Usage Notes for details).

Upon removal of data points flagged as potential anomalies by the Isolation Forest method, the LFMC values generally fell within expected ranges, as demonstrated in Fig.  4 and detailed in Table  7 , which provides example LFMC distributions and descriptive statistics for some of the most common species in the dataset.

figure 4

Box plots and violin plots illustrating the seasonal variability and statistical distribution of LFMC for eight common species found in Globe-LFMC 2.0 15 . The species include Quercus gambelii, a deciduous oak (sampled in USA); Quercus coccifera, an evergreen oak (sampled in France, Spain, Türkiye); Pinus edulis, a medium sized pine (sampled in USA); Pinus taeda, a tall pine (sampled in USA); Cistus monspeliensis, an evergreen shrub with narrow leaves (sampled in France, Italy, Spain, Tunisia); Arctostaphylos patula, an evergreen shrub with round leaves (sampled in USA); Rosmarinus officinalis, an evergreen shrub with narrow leaves (sampled in France, Italy, Spain, Tunisia); and unidentified grass encompassing various unidentified grass species collected in grasslands (sampled in Argentina, Australia, China, Portugal, Senegal, Spain). The seasons were defined based on time ranges between astronomical equinoxes and solstices. (Figure created using seaborn 31 ).

Notably low LFMC values may be attributed to samples that contain a combination of live and dead plant material or, in some cases, exclusively dead material from living vegetation. Similarly, very high LFMC values not identified as potential outliers could originate from juvenile leaves, fleshy plant species, or samples influenced by waterlogged soil conditions. Whenever available, this contextual information was included in the dataset.

It is important to acknowledge that certain data points may not have been identified as anomalies by the method depicted in Fig.  3 , potentially due to isolation in time or space, irrespective of their LFMC value.

Moreover, although efforts were made to detect outliers, it is possible that a small number of very high values remain unidentified due to the stochastic nature of the method applied (Isolation Forest).

The correctness of the land cover and meteorological values added to the dataset was verified visually by comparing the output of the Python scripts with the source raster images in Geographic Information System (GIS) software. This validation was conducted on a small, randomly selected subset of 15 data points (one per country).

Usage Notes

The “LFMC data” sheet contains various attributes that can be utilized for data filtering and categorization as per research requirements. Additionally, it offers valuable meteorological and land cover data that can support the study of LFMC dynamics. Tables 3 and 4 provide detailed explanations for each column, but further guidance on how to effectively use some of the more intricate attributes is provided below.

The “Species functional type” column provides a generic classification of the sampled species. It is particularly valuable for understanding the vertical structure of the collected species within the plot, especially when different species are sampled from the same location. The functional types were assigned by data providers based on their expertise. Hence, intermediate-size plants were occasionally categorised using different terms depending upon each author’s judgement (e.g., “small tree” and “large shrub” could refer to plants having analogous size).

Functional type information is especially useful for optical remote sensing studies, particularly in closed forests, where the canopy may obstruct visibility of lower vegetation layers. In such cases, it is advisable to select only measurements from trees.

For remote sensing applications, it is recommended to average the LFMC measurements taken on the same date and located within the same pixel of the product employed in the study. The choice of which functional type to include in the average can be guided by the land cover type of that pixel. For example, in open canopy forests, both trees and shrubs (or grass) could be included.
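The averaging step described above can be sketched as follows. This is a minimal illustration only: the record layout, the sample values, and the regular latitude/longitude grid used to stand in for a product's pixels are all assumptions, not part of the dataset (real satellite products usually define pixels in their own projection).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (date, latitude, longitude, functional_type, lfmc_percent).
# Field order and values are illustrative, not the dataset's actual columns.
samples = [
    ("2020-07-01", 40.1012, -3.2011, "tree", 95.0),
    ("2020-07-01", 40.1013, -3.2012, "shrub", 85.0),
    ("2020-07-01", 40.1500, -3.2500, "tree", 110.0),
    ("2020-07-15", 40.1012, -3.2011, "tree", 80.0),
]


def pixel_id(lat, lon, pixel_deg=0.005):
    """Map a coordinate to a grid cell (assumed regular lat/lon grid)."""
    return (int(lat // pixel_deg), int(lon // pixel_deg))


def average_by_date_and_pixel(records, keep_types, pixel_deg=0.005):
    """Average LFMC over samples sharing the same date and pixel."""
    groups = defaultdict(list)
    for date, lat, lon, ftype, lfmc in records:
        if ftype in keep_types:  # choose functional types by land cover
            groups[(date, pixel_id(lat, lon, pixel_deg))].append(lfmc)
    return {key: mean(values) for key, values in groups.items()}


# Open canopy: trees and shrubs both contribute to the pixel average.
open_forest = average_by_date_and_pixel(samples, keep_types={"tree", "shrub"})
# Closed canopy: only trees are kept.
closed_forest = average_by_date_and_pixel(samples, keep_types={"tree"})
```

Choosing `keep_types` per pixel from its land cover class would follow the same pattern, with one call per land cover group.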

However, caution is advised when utilising land cover information, given the 500 m spatial resolution and inherent uncertainties in the satellite-based product, which may compromise the accuracy of land cover classification.

The “protocol” column and accompanying protocol sheet can be used to filter the data based on specific research requirements. For instance, users can select only LFMC values retrieved following specific sampling and weighing criteria, or exclude samples that might have included flowers or buds.

Land cover type and meteorological data are provided to support preliminary studies and hypothesis testing on LFMC dynamics, to help investigate the reasons behind anomalous LFMC values, and to retrieve information about the plant type sampled.

The “Extra information/Quality Flag” column contains additional miscellaneous information provided by data providers to enhance the understanding of the data. It may include markers for suspected anomalies, explanations for unusual LFMC values, or information about the plant type sampled.

“Isolated data point” reports the output of the Isolation Forest models (Fig.  3 ). Users can employ this column to filter the dataset by removing “isolated” data points, which could be potential outliers (by only selecting the “FALSE” values; i.e., not isolated).
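A minimal sketch of this filter is shown below. Only the column name “Isolated data point” and its “TRUE”/“FALSE” values come from the text; the row layout, the other column names, and the sample values are hypothetical.

```python
# Hypothetical rows; only the "Isolated data point" column name is from the text.
rows = [
    {"Species": "Quercus coccifera", "LFMC": 92.0, "Isolated data point": "FALSE"},
    {"Species": "Quercus coccifera", "LFMC": 310.0, "Isolated data point": "TRUE"},
    {"Species": "Pinus edulis", "LFMC": 101.0, "Isolated data point": "FALSE"},
]


def is_isolated(row):
    """Return True when the row is flagged as isolated (potential outlier)."""
    return str(row["Isolated data point"]).strip().upper() == "TRUE"


# Keep the flag in the data and split, rather than deleting rows outright.
kept = [row for row in rows if not is_isolated(row)]
flagged = [row for row in rows if is_isolated(row)]
```

Keeping `flagged` as a separate list makes it easy to inspect the potential outliers before deciding whether to discard them.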

Instead of removing potential outliers from the dataset, adding flags enables each user to employ the data in the way that best suits their research needs.

It is important to note that if a data point is identified as isolated (value “TRUE”), it may not necessarily be a true outlier, as the algorithm compares it only to other data points in the same subset without prior knowledge of LFMC variability of a given species.

Moreover, it is possible that anomalous LFMC values were not flagged as outliers because the models that exclude LFMC (see Methods for details) all identified those data points as isolated in time or space, and such points were subtracted from the potential outliers.

Lastly, due to the random nature of this method, both false positives and false negatives are possible.

Further outlier detection criteria are provided in the figshare repository 15 , including columns reporting the results from the Cook’s Distance method. The columns “Above 4/n Cook Distance” and “Above 3xMean Cook Distance” are two ready-to-use quality flags that can help identify influential data points. The Cook’s Distance criteria tended to flag a much higher number of potential outliers; hence they appear to be more conservative than the Isolation Forest. However, they can also sometimes fail to identify possible outliers with suspiciously high (or low) LFMC values if there are other values in the same subset that are even higher (or lower).
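The two Cook’s Distance flags can be reproduced from a list of distances as in the sketch below. It assumes that n is the number of observations in the subset the regression was fitted on and that the mean is taken over the same subset; the distances themselves are made-up values, and their computation (for example with a regression library) is not shown.

```python
from statistics import mean


def cook_flags(distances):
    """Flag each Cook's distance against the 4/n and 3 x mean thresholds."""
    n = len(distances)                 # assumed: size of the fitted subset
    threshold_4n = 4 / n
    threshold_3mean = 3 * mean(distances)
    return [
        {
            "above_4_over_n": d > threshold_4n,
            "above_3x_mean": d > threshold_3mean,
        }
        for d in distances
    ]


# Illustrative Cook's distances for one subset of eight observations.
cook_d = [0.01, 0.02, 0.015, 0.9, 0.03, 0.02, 0.01, 0.52]
flags = cook_flags(cook_d)
```

In this example the two thresholds disagree on the last value, which shows why the dataset reports both flags separately.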

Moreover, additional output data from both outlier detection methods are also shared, providing dataset users with the flexibility to create customized filters to suit their specific requirements. For instance, users can employ the algorithms’ scores to establish custom thresholds. Alternatively, in the context of the Isolation Forest method, they can flag a data entry as a potential outlier even if it does not meet the consensus of five models.
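A custom filter of this kind might look like the following sketch. The layout of the shared outputs is assumed here (one boolean vote per model and one score per entry), “consensus” is taken to mean agreement of all five models, and lower scores are assumed to be more anomalous, as in scikit-learn's Isolation Forest; the repository files should be checked for the actual conventions.

```python
def flag_by_votes(model_votes, min_votes=5):
    """Flag an entry when at least `min_votes` models marked it as isolated."""
    return [sum(votes) >= min_votes for votes in model_votes]


def flag_by_score(scores, threshold):
    """Flag entries whose anomaly score falls below a user-chosen threshold.

    Assumes lower scores mean more anomalous.
    """
    return [score < threshold for score in scores]


# Hypothetical outputs of five models for three data entries.
votes = [
    [True, True, True, True, True],
    [True, True, True, False, False],
    [False, False, False, False, False],
]
scores = [-0.21, -0.03, 0.12]  # made-up anomaly scores

strict = flag_by_votes(votes, min_votes=5)    # full consensus required
relaxed = flag_by_votes(votes, min_votes=3)   # simple majority is enough
by_score = flag_by_score(scores, threshold=-0.05)
```

Lowering `min_votes` trades fewer missed outliers for more false positives.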

Furthermore, it is possible to employ a combination of different methods. For example, the Cook’s Distance metrics can be used to cross-verify LFMC values of data points that were not classified as anomalies by the Isolation Forest method only because they were detected as isolated in time and space.

Finally, it is strongly recommended to use the most recent version of the dataset, as it incorporates corrections for occasional inaccuracies and typos. Continued use of the 2019 version is discouraged.

Code availability

The code for the detection of potential outliers and the extraction of land cover and meteorological data was developed using Python 3.9.7. The corresponding Jupyter Notebooks are available at the figshare repository 15 .

The outlier detection code uses the Globe-LFMC-2.0 15 file as input.

When executing the land cover and meteorological data extraction code, it is essential to have downloaded the required input data first.

Dennison, P. E. & Moritz, M. A. Critical live fuel moisture in chaparral ecosystems: a threshold for fire activity and its relationship to antecedent precipitation. International Journal of Wildland Fire 18 , 1021 (2009).

Dimitrakopoulos, A. & Papaioannou, K. Flammability Assessment of Mediterranean Forest Fuels. Fire Technology; Norwell 37 , 143 (2001).

Park, I., Fauss, K. & Moritz, M. A. Forecasting Live Fuel Moisture of Adenostema fasciculatum and Its Relationship to Regional Wildfire Dynamics across Southern California Shrublands. Fire 5 , 110 (2022).

Pimont, F., Ruffault, J., Martin-StPaul, N. K. & Dupuy, J.-L. A Cautionary Note Regarding the Use of Cumulative Burnt Areas for the Determination of Fire Danger Index Breakpoints. Int. J. Wildland Fire 28 , 254 (2019).

Rossa, C. G., Veloso, R. & Fernandes, P. M. A laboratory-based quantification of the effect of live fuel moisture content on fire spread rate. Int. J. Wildland Fire 25 , 569 (2016).

Bar-Massada, A. & Lebrija-Trejos, E. Spatial and temporal dynamics of live fuel moisture content in eastern Mediterranean woodlands are driven by an interaction between climate and community structure. Int. J. Wildland Fire 30 , 190 (2021).

Boving, I. et al . Live fuel moisture and water potential exhibit differing relationships with leaf-level flammability thresholds. Functional Ecology , https://doi.org/10.1111/1365-2435.14423 (2023).

Griebel, A. et al . Specific leaf area and vapour pressure deficit control live fuel moisture content. Functional Ecology 37 , 719–731 (2023).

Nolan, R. H. et al . Drought-related leaf functional traits control spatial and temporal dynamics of live fuel moisture content. Agricultural and Forest Meteorology 319 , 108941 (2022).

Pivovaroff, A. L. et al . The Effect of Ecophysiological Traits on Live Fuel Moisture Content. Fire 2 , 12 (2019).

Ma, W. et al . Assessing climate change impacts on live fuel moisture and wildfire risk using a hydrodynamic vegetation model. Biogeosciences 18 , 4005–4020 (2021).

Gabriel, E. et al . Live fuel moisture content time series in Catalonia since 1998. Annals of Forest Science 78 , 44 (2021).

Martin-StPaul, N. et al . Live fuel moisture content (LFMC) time series for multiple sites and species in the French Mediterranean area since 1996. Annals of Forest Science 75 , 57 (2018).

Yebra, M. et al . Globe-LFMC, a global plant water status database for vegetation ecophysiology and wildfire applications. Sci Data 6 , 155 (2019).

Yebra, M. et al . Globe-LFMC 2.0, An enhanced and updated dataset for Live Fuel Moisture Content research, figshare , https://doi.org/10.6084/m9.figshare.c.6980418 (2024).

Cunill Camprubí, À., González-Moreno, P. & Resco De Dios, V. Live Fuel Moisture Content Mapping in the Mediterranean Basin Using Random Forests and Combining MODIS Spectral and Thermal Data. Remote Sensing 14 , 3162 (2022).

Miller, L. et al . Projecting live fuel moisture content via deep learning. Int. J. Wildland Fire 32 , 709–727 (2023).

République Française - Conservatoire de la Forêt Méditerranéenne, Office National des Forêts. Réseau Hydrique http://www.reseauhydrique.dpfm.fr .

United States Government. National Fuel Moisture Database https://www.wfas.net/nfmd/public/about.php .

Friedl, M. & Sulla-Menashe, D. MCD12Q1.061 MODIS/Terra+Aqua Land Cover Type Yearly L3 Global 500m SIN Grid V061 [Data set]. NASA EOSDIS Land Processes Distributed Active Archive Center. https://doi.org/10.5067/MODIS/MCD12Q1.061 (2022).

Boogaard, H. et al . Agrometeorological indicators from 1979 to present derived from reanalysis. Copernicus Climate Change Service (C3S) Climate Data Store (CDS). https://doi.org/10.24381/cds.6c68c9bb (2020).

Liu, F. T., Ting, K. M. & Zhou, Z.-H. Isolation-Based Anomaly Detection. ACM Transactions on Knowledge Discovery from Data 6 , 1–39 (2012).

Pedregosa, F. et al . Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 , 2825–2830 (2011).

Cook, R. D. Detection of Influential Observation in Linear Regression. Technometrics 19 , 15–18 (1977).

Seabold, S. & Perktold, J. statsmodels: Econometric and statistical modeling with python. in Proceedings of the 9th Python in Science Conference 92–96, https://doi.org/10.25080/Majora-92bf1922-011 (Austin, Texas, 2010).

Altman, N. & Krzywinski, M. Analyzing outliers: influential or nuisance? Nature Methods 13 , 281–282 (2016).

Van der Meer, T., Te Grotenhuis, M. & Pelzer, B. Influential Cases in Multilevel Modeling: A Methodological Comment. American Sociological Review 75 , 173–178 (2010).

Briottet, X. et al . BIODIVERSITY – A new space mission to monitor Earth ecosystems at fine scale. RFPT 224 , 33–58 (2022).

Adeline, K. et al . Multi-scale datasets for monitoring Mediterranean oak forests from optical remote sensing during the SENTHYMED/MEDOAK experiment in the north of Montpellier (France). Data in Brief 53 , 110185 (2024).

Stevens, J. Explorer Base Map, NASA Earth Observatory map by Joshua Stevens using data from NASA’s MODIS Land Cover, the Shuttle Radar Topography Mission (SRTM), the General Bathymetric Chart of the Oceans (GEBCO), and Natural Earth boundaries. (2020).

Waskom, M. L. seaborn: statistical data visualization. Journal of Open Source Software 6 , 3021 (2021).

Acknowledgements

The early stages of dataset compilation were made possible through the Australian Government Research Training Program Domestic Scholarship. Subsequent efforts received financial support from the Bushfire Research Centre of Excellence, supported by the Australian National University and Optus. The data provided by the Hawkesbury Institute for the Environment was funded by the NSW Department of Planning and Environment via the NSW Bushfire and Risk Management Research Hub. CNES, through the APR project SentHyMED 28 focused on the BIODIVERSITY space mission, supported the work (with A. Karine, J.B. Féret and J.M. Limousin as contacts) and the raw dataset 29 . Portuguese FCT – Fundação para a Ciência e a Tecnologia, in the framework of the researcher contract DL57/2016/CP1442/CP0005, supported the work of A. Monteiro. The data provided by the James Hutton Institute was funded by the Scottish Government via NatureScot. We acknowledge the incorporation of sections of the datasets at zenodo.org/records/162978 (CC BY 4.0) 13 and zenodo.org/records/4694854 (CC BY 4.0) 12 into Globe-LFMC 2.0 15 . The meteorological data was generated using Copernicus Climate Change Service information 2023 (AgERA5) ( https://doi.org/10.24381/cds.6c68c9bb , Accessed in May, June and July 2023). The land cover type data was retrieved from the online Data Pool, courtesy of the NASA EOSDIS Land Processes Distributed Active Archive Center (LP DAAC), USGS Earth Resources Observation and Science (EROS) Center, Sioux Falls, South Dakota, https://lpdaac.usgs.gov/tools/data-pool/ (Accessed in Aug and Sep 2023).

Author information

Authors and affiliations.

Fenner School of Environment & Society, Australian National University, Canberra, ACT, Australia

Marta Yebra, Gianluca Scortechini, Jackson Taylor & Nicolas Younes Cardenas

School of Engineering, Australian National University, Canberra, ACT, Australia

Marta Yebra

ONERA / DOTA, Université de Toulouse, F-31055, Toulouse, France

Karine Adeline

Department of Biology, Kastamonu University, Kastamonu, Türkiye

Nursema Aktepe

School of Environment and Life Sciences, University of Salford, Salford, UK

Turkia Almoustafa & Mark Danson

Faculty of Arts and Humanities, Geography Department, Tishreen University, Tishreen, Syria

Turkia Almoustafa

Department of Biology and Environment, University of Haifa at Oranim, Kiryat Tivon, 36066, Israel

Avi Bar-Massada

Instituto Nacional de Tecnología Agropecuaria, Buenos Aires, Argentina

María Eugenia Beget

Hawkesbury Institute for the Environment, Western Sydney University, Penrith, NSW, Australia

Matthias Boer, Anne Griebel, Ivan Kotzur & Rachael H. Nolan

University of Wollongong, Wollongong, NSW, Australia

Ross Bradstock

US Forest Service, Rocky Mountain Research Station, Fire Sciences Laboratory, 5775 Highway 10 West, Missoula, 59803, MT, USA

Tegan Brown

Servei de Prevenció d’Incendis Forestals (Generalitat de Catalunya), Santa Perpètua de Mogoda, Barcelona, Spain

Francesc Xavier Castro & Eva Gabriel

School of Resources and Environment, University of Electronic Science and Technology of China, Sichuan, China

Rui Chen, Binbin He & Xingwen Quan

Department of Geology, Geography and the Environment, University of Alcalá, Colegios 2, 28801, Alcalá de Henares, Spain

Emilio Chuvieco

Division of Ecology, Department of Biology, Hacettepe University, Beytepe, Ankara, Türkiye

Cihan Ünal Değirmenci, Pınar Kütküt, Çağatay Tavşanoğlu & İrem Tüfekcioğlu

Joint Research Unit CTFC - AGROTECNIO, Crta. de St. Llorenç de Morunys, km 2, E, 25280, Solsona, Spain

Ruth Delgado-Dávila

Department of Evolutionary and Environmental Biology, University of Haifa, Haifa, Israel

Department of Geography, University of Utah, Salt Lake City, Utah, USA

Philip Dennison

IFEVA-CONICET, Faculty of Agronomy, University of Buenos Aires, Buenos Aires, Argentina

Carlos Di Bella

Centre Forestal de les Illes Balears (CEFOR-Menut), Forest Management Service (Government of the Balearic Islands), Palma de Mallorca, Spain

Oriol Domenech

INRAE, UMR TETIS, 500 rue Jean-François Breton, 34093, Montpellier, France

Jean-Baptiste Féret

CSIR, NRE, Stellenbosch, South Africa

Greg Forsyth

Environmental and Biochemical Sciences Department, The James Hutton Institute, Aberdeen, UK

Zisis Gagkas

Faculty of Sciences of Tunis, University of Tunis El Manar, Tunis, Tunisia

Fatma Gharbi

Departamento de Ciencias de la Vida, Universidad de Alcalá, Alcalá de Henares, Spain

Elena Granda

School of Life Sciences, University of Technology Sydney, PO Box 123 Broadway, Ultimo, NSW, 2007, Australia

Anne Griebel

RMRS, Missoula Fire Sciences Laboratory, USFS, Rocky Mountain Research Station, 5775 Hwy 10 W Missoula, Missoula, MT, 59808, USA

Nelson Mandela University, School of Natural Resource Management, George, South Africa

Tineke Kraaij

Bushfire Technical Services, DFES WA, Perth, Australia

Agnes Kristina

CEFE, Univ Montpellier, CNRS, EPHE, IRD, Montpellier, France

Jean-Marc Limousin

Environmental Remote Sensing and Spectroscopy Laboratory (SpecLab), IEGD, Spanish National Research Council (CSIC), Madrid, Spain

M. Pilar Martín

Centro de Estudos Geográficos (CEG) and Laboratório Associado TERRA, Instituto de Geografia e Ordenamento do Território (IGOT), Universidade de Lisboa, Rua Edmée Marques, 1600-276, Lisboa, Portugal

Antonio T. Monteiro

Istituto di Geoscienze e Georisorse, Consiglio Nazionale delle Ricerche (CNR-IGG), Via Moruzzi 2, 56124, Pisa, Italy

Department of Geography, University of California, Santa Barbara, USA

Marco Morais & Dar Roberts

Department of Ecology and Global Change. Centro de Investigaciones sobre Desertificación (CIDE-CSIC/UV/GV). Carretera Moncada-Náquera km 4, 5 s/n, E-46113, Moncada, Valencia, Spain

Bruno Moreira

IRD, CEFE/CNRS, 1919 Route de Mende, 34293, Montpellier, Cedex 5, France

Florent Mouillot

Natural Resource Science and Management Cluster, Nelson Mandela University, George, South Africa

Samukelisiwe Msweli

Istituto per la Bioeconomia, Consiglio Nazionale delle Ricerche, (CNR-IBE), Traversa La Crucca 3, 07100, Sassari, Italy

Grazia Pellizzaro & Andrea Ventura

University of Nebraska-Lincoln, Lincoln, Nebraska, USA

University of Southern California, Los Angeles, California, USA

Universitat de Lleida, Lleida, Spain

Victor Resco de Dios

Ecological Sciences Department. The James Hutton Institute, Aberdeen, UK

Andy F. S. Taylor

Contributions

M.Y. conceived the idea, supervised the data collection and the dataset compilation, provided data included in the dataset and feedback on the dataset format, wrote the first version of the manuscript, and contributed to the final version of the article. G.S. coordinated the data collection, harmonised the dataset, provided feedback on the dataset format, wrote the first version of the manuscript, produced the figures and tables, and contributed to the final version of the article. K.A., A.B.M., M.E.B., E.C., R.D.D., P.D., C.D.B., F.G., E.Gr., A.G., I.K., M.P.M., A.T.M., R.H.N., G.P., N.Y.C. provided data included in the dataset and contributed to the final versions of the dataset and the article. All remaining authors provided data included in the dataset and contributed to the final version of the dataset.

Corresponding author

Correspondence to Marta Yebra .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Yebra, M., Scortechini, G., Adeline, K. et al. Globe-LFMC 2.0, an enhanced and updated dataset for live fuel moisture content research. Sci Data 11 , 332 (2024). https://doi.org/10.1038/s41597-024-03159-6

Received : 15 December 2023

Accepted : 18 March 2024

Published : 04 April 2024

DOI : https://doi.org/10.1038/s41597-024-03159-6

research abstracts dataset

Data loader: EagleW/ACL_titles_abstracts_dataset

ACL title and abstract dataset

This dataset gathers 10,874 title and abstract pairs from the ACL Anthology Network (until 2016).

The structure of the data is as follows:
  • title
  • abstract
  • \newline
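The structure above can be read with a short parser such as the sketch below. It assumes that “\newline” denotes an empty line separating consecutive title and abstract pairs; the function name and the sample text are hypothetical and not taken from the dataset.

```python
def parse_title_abstract_pairs(text):
    """Parse 'title / abstract / blank line' groups into (title, abstract) tuples."""
    pairs, block = [], []
    for line in text.splitlines():
        if line.strip():
            block.append(line.strip())
            continue
        # A blank line closes the current group.
        if len(block) >= 2:
            pairs.append((block[0], " ".join(block[1:])))
        block = []
    if len(block) >= 2:  # final pair with no trailing blank line
        pairs.append((block[0], " ".join(block[1:])))
    return pairs


# Made-up sample in the assumed layout.
sample = (
    "A hypothetical paper title\n"
    "A hypothetical abstract sentence.\n"
    "\n"
    "Another hypothetical title\n"
    "Another hypothetical abstract.\n"
)
pairs = parse_title_abstract_pairs(sample)
```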

This dataset is used in our published paper: Paper Abstract Writing through Editing Mechanism

License: MIT License

  • Open access
  • Published: 27 November 2023

Novel research and future prospects of artificial intelligence in cancer diagnosis and treatment

  • Chaoyi Zhang,
  • Jin Xu,
  • Rong Tang,
  • Jianhui Yang,
  • Wei Wang,
  • Xianjun Yu &
  • Si Shi

Journal of Hematology & Oncology volume 16, Article number: 114 (2023)

Research into the potential benefits of artificial intelligence for comprehending the intricate biology of cancer has grown as a result of the widespread use of deep learning and machine learning in the healthcare sector and the availability of highly specialized cancer datasets. Here, we review new artificial intelligence approaches and how they are being used in oncology. We describe how artificial intelligence might be used in the detection, prognosis, and administration of cancer treatments and introduce the use of the latest large language models such as ChatGPT in oncology clinics. We highlight artificial intelligence applications for omics data types, and we offer perspectives on how the various data types might be combined to create decision-support tools. We also evaluate the present constraints and challenges to applying artificial intelligence in precision oncology. Finally, we discuss how current challenges may be surmounted to make artificial intelligence useful in clinical settings in the future.

Introduction

In the upcoming decades, cancer is anticipated to surpass other illnesses as one of the main global causes of morbidity and mortality [ 1 ]. A recent study in The Lancet [ 2 ] demonstrated that for many low-income and middle-income nations, noncommunicable diseases (NCDs) pose an ever-greater health threat, with cancer becoming an NCD of greater importance. Therefore, it is imperative to focus on cancer treatment, enhance the rate of early detection and cure, and boost cancer screening.

Due to technical advancements in statistics and computer software, computer professionals and health scientists may now collaborate closely to improve prognoses. As a result of the adoption of artificial intelligence (AI) strategies, researchers have increasingly concentrated on creating models using AI algorithms to detect and diagnose cancer. AI is the process of teaching a computer to mimic human intelligence by showing it how to study, evaluate, comprehend, deduce, interact, and make decisions [ 3 ]. Tremendous success has been achieved with AI in the last ten years in the fields of speech synthesis, natural language processing, and computer vision. This review focuses on the latest AI techniques for tumor diagnosis, treatment, and prognosis. We highlight artificial intelligence applications for omics data types, offer perspectives on how the various data types might be combined to create decision-support tools, and discuss how current challenges may be surmounted to make artificial intelligence useful in clinical settings in the future.

We searched three databases from their creation until November 10, 2023: MEDLINE (PubMed), CENTRAL (Cochrane Central Register of Controlled Trials), and Embase to assess the published literature pertaining to the application of artificial intelligence in cancer. Due to the rapid pace of AI updates, we have focused on the last two years of relevant research. The following keywords were used in this scoping review: (neoplasms OR cancer) AND (artificial intelligence OR deep learning OR machine learning). With a focus on the application and usage of artificial intelligence in cancer treatment, we incorporated a total of 254 publications in the construction of this narrative review, including pertinent prospective, retrospective, and review studies.

Specific meaning of artificial intelligence

AI is an area of computer technology comprising numerous techniques and subfields aimed at performing activities that could previously be completed only by humans. To enhance the interpretation of medical data relevant to medical administration, diagnostics, and predictive outcomes, AI technologies and their subdomains are being implemented in healthcare delivery. The two main techniques for implementing AI are machine learning (ML) and deep learning (DL). The two terms are frequently used interchangeably, although deep learning is a branch of machine learning. ML generates predictions by spotting patterns in data by means of mathematical algorithms. DL produces predictions using multiple layers of artificial neural network algorithms that are modeled after the brain’s neural network architecture. In the past ten years, with advancements in big data, algorithms, computing power, and Internet technology, AI has excelled in numerous tasks across a wide range of industries, including face recognition, image classification, speech recognition, automatic translation, and healthcare [ 4 ]. The main ML techniques are support vector machines (SVMs), decision trees, and K unsupervised algorithms, while the most commonly used for DL today are convolutional neural networks (CNNs) [ 5 ]. Figure 1 presents a few of the most basic ML and DL approaches.

Fig. 1 Network structure of DL. (a) A model of an SVM; (b) a model of a random forest that is composed of several decision trees; (c) KNN, characterized by the fact that it is composed of many random features rather than a linear feature; (d) components of CNNs [ 6 ]; and (e) components of graphical CNNs

ML fundamentally seeks to replicate or mimic humans’ capacity for pattern recognition. Traditional ML approaches take far longer to train and test for a specific problem than DL approaches. SVMs, decision trees, random forests, and gradient boosting (such as XGBoost) are examples of traditional ML techniques. Decision trees have a significant flaw: dividing samples too precisely causes overfitting of the training set, whereas dividing samples too coarsely results in a decision tree that does not fit the samples properly. Random forests are ensembles of decision trees based on the combined concepts of ensemble learning and bagging. The randomness of a random forest is reflected mainly in two factors: the random selection of the dataset and the random selection of the features used in each tree. The XGBoost technique iteratively constructs an ensemble of decision trees. Its key benefits over conventional logistic regression-based risk models are its capacity to manage missing data, capture nonlinear relationships between the model features and the outcome, and model higher-order interactions between variables [ 7 ].
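The two sources of randomness in a random forest can be illustrated with the sketch below. It is not a full implementation: tree fitting is omitted, all function names and the toy data are hypothetical, and the feature subset is drawn once per tree as the text describes (common implementations instead redraw the subset at every split).

```python
import random


def bootstrap_sample(rows, rng):
    """Bagging: draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices for one tree."""
    return sorted(rng.sample(range(n_features), k))


def plan_forest(rows, n_features, n_trees, k, seed=0):
    """Return the per-tree training rows and feature indices."""
    rng = random.Random(seed)
    return [
        {
            "rows": bootstrap_sample(rows, rng),
            "features": random_feature_subset(n_features, k, rng),
        }
        for _ in range(n_trees)
    ]


def majority_vote(predictions):
    """Combine per-tree class predictions into one ensemble prediction."""
    return max(set(predictions), key=predictions.count)


# Toy dataset: 6 samples, 4 features each (values are made up).
data = [[i, i * 2, i % 3, 10 - i] for i in range(6)]
forest_plan = plan_forest(data, n_features=4, n_trees=3, k=2)
ensemble_label = majority_vote(["benign", "malignant", "malignant"])
```

Each entry of `forest_plan` would be passed to a decision tree learner, and `majority_vote` would then aggregate the trees' outputs.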

Training artificial intelligence models

Several processes are necessary for training an AI model, including data gathering and preparation, model selection, model training, and hyperparameter tuning.

Data collection and preprocessing

With the rapid development of modern medicine, various types of data are emerging. A large amount of imaging data has been generated as represented by X-ray, CT, and MRI, and the development of pathology has made sectioning the gold standard for tumor diagnosis. In addition to the traditional clinical information data, with the remarkable advances in sequencing technology over the past two decades, how to deal with the large amount of molecular data brought about by genomics, transcriptomics, proteomics, etc., has also become a matter of close attention for clinicians. Later, we will describe how to deal with a single type of data. However, a patient usually does not have only one type of test and one type of data, so we will also introduce how to integrate different types of data to enhance computational models.

To facilitate subsequent model training, these data need to be preprocessed. For numerical data, we need to remove outliers, deal with missing values, and normalize the data. Using EHR as an example, the most frequently used AI algorithms are deep learning, decision tree algorithms, and regression algorithms. While completing regression tasks for disease risk prediction, researchers also use classification tasks to extract lesion characteristics from illnesses and categorize them [ 8 ]. After the initial preparation steps described above, the data are normalized. Features are then extracted through further processing, although numerical data may also be used directly as raw input.
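The three preprocessing steps for numerical data can be sketched as follows. The specific choices here (median imputation, the 1.5 × IQR outlier rule, and min-max scaling) are common defaults chosen for illustration and are not prescribed by the text; the sample values are made up.

```python
from statistics import median, quantiles


def impute_missing(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def remove_outliers(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical laboratory measurements with one gap and one extreme value.
raw = [4.0, 5.0, None, 6.0, 5.0, 40.0, 4.0, 6.0]
imputed = impute_missing(raw)
cleaned = remove_outliers(imputed)
scaled = min_max_normalize(cleaned)
```

The order of the steps matters: imputing before outlier removal keeps the quartiles well defined, and normalizing last prevents an extreme value from compressing the rest of the range.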

To enhance the diversity of the dataset, we may need to adopt techniques such as pitch shifting [ 9 ], time stretching [ 10 ], and adding background noise for sound data. Sound data can have features extracted using methods such as Mel Frequency Cepstrum Coefficients (MFCCs). Deep learning approaches are already being used in numerous creative photoacoustic tomography projects for a range of goals, such as enhanced quantification, inadequate sampling modification, resolution enhancement, and reconstruction artifact removal [ 11 ]. The type of cancer for which sound data are most frequently employed is skin cancer. In a previous study [ 12 ], vibrational optical coherence tomography (VOCT) and machine learning were utilized to assess the specificity and sensitivity of employing light and audible sound to distinguish between skin malignancies and normal skin. An OQ LabScope 2.0 was used to measure the resonance frequency. Various machine learning techniques, including logistic regression, support vector machine, and decision-making models, were then used and compared to determine which model produced the best reliability. A recent study [ 13 ] imaged breast cells, specifically malignant MDA-MB-231 cells and normal MCF10a cells, using phonon microscopy. A shallow convolutional neural network was trained to differentiate signals coming from healthy cells, malignant cells, and background using the raw phonon data as inputs. The authors used the Gramian angular summation field approach to transform the signals into a format appropriate for the network, which produced visual representations of the time-resolved signals. The final model achieved a 93% accuracy rate.
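The Gramian angular summation field mentioned above turns a one-dimensional signal into an image-like matrix. The sketch below follows the standard definition (rescale the signal to [-1, 1], take φ = arccos(x), and set each entry to cos(φ_i + φ_j)); it is an illustration of the general technique, not the cited study's code, and the short signal is made up.

```python
import math


def gramian_angular_summation_field(signal):
    """Return the GASF matrix with entries cos(phi_i + phi_j)."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        raise ValueError("signal must contain at least two distinct values")
    # Rescale to [-1, 1] so that arccos is defined for every sample.
    scaled = [2 * (x - lo) / (hi - lo) - 1 for x in signal]
    # Clamp to guard against floating-point overshoot.
    scaled = [max(-1.0, min(1.0, s)) for s in scaled]
    phi = [math.acos(s) for s in scaled]
    return [[math.cos(a + b) for b in phi] for a in phi]


# A short made-up time-resolved signal.
gasf = gramian_angular_summation_field([0.0, 0.5, 1.0])
```

The resulting square matrix can be fed to a convolutional network in the same way as a single-channel image.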

For image data, preprocessing may involve techniques such as rotating, inverting, scaling, and adding noise. Deep learning is the most widely used AI approach for processing images, and its most representative models are convolutional neural networks (CNNs) [ 14 , 15 , 16 ]. A CNN typically includes the following layers: an input layer, a convolutional layer, an activation layer, a pooling layer, and a fully connected layer. The core of a CNN’s efficient image processing lies in the convolutional layer (Fig. 1C), which slides small filters over the pixel values to produce numerical feature maps. In this way, an image is digitized. Transformer neural networks have recently replaced CNNs in many nonclinical and clinical image processing tasks because of their improved reliability and efficiency in computer vision tasks [ 17 , 18 ]. According to a previous study [ 19 ], transformer-based methods outperformed attention-based multiple instance learning (MIL) techniques in terms of data efficiency, since they were better at learning from small amounts of data.
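As a rough illustration of the convolutional, activation, and pooling layers listed above, the sketch below applies one hand-written filter to a tiny grayscale image. All names, the image, and the filter values are assumptions made for the example. As in most deep learning libraries, the "convolution" is computed as a cross-correlation (the filter is not flipped), and in a real CNN the filter weights are learned rather than fixed by hand.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the image patch under the kernel.
            total = 0.0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        out.append(row)
    return out

def relu(feature_map):
    """Activation layer: keep positive responses, zero out the rest."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def max_pool2x2(feature_map):
    """Pooling layer: maximum over non-overlapping 2 x 2 windows."""
    return [
        [max(feature_map[i][j], feature_map[i][j + 1],
             feature_map[i + 1][j], feature_map[i + 1][j + 1])
         for j in range(0, len(feature_map[0]) - 1, 2)]
        for i in range(0, len(feature_map) - 1, 2)
    ]

# A 4 x 4 image with a vertical edge between the dark and bright halves.
image = [[0, 0, 1, 1] for _ in range(4)]
# A 2 x 2 filter that responds to dark-to-bright transitions.
kernel = [[-1, 1], [-1, 1]]

features = relu(conv2d_valid(image, kernel))   # 3 x 3 feature map
pooled = max_pool2x2(features)                 # 1 x 1 summary
```

The feature map is nonzero only in the column where the edge lies, which is the sense in which the convolutional layer turns raw pixels into numerical features.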

Different types of data can also be converted into one another. In addition to being converted to numerical values, image data can be converted to sound data to acoustically differentiate between malignant and benign lesions [ 20 ]. After training, fine-tuning, and data replenishment, the weighted activations of all 1024 nodes in the last layer of the DL classifier were sonified, that is, represented as nonspeech sounds [ 21 ].

Indeed, multimodality is inherent in health data. Our current state of health is described by a multitude of data, ranging from the broad macro level (the presence or absence of disease) to the detailed micro level (biomarkers, proteomics, and genomes). To improve prediction performance, a subfield of machine learning called “multimodal machine learning” seeks to create and train models that can use a variety of data sources and learn how to link or integrate distinct modalities [ 22 ]. The majority of multimodal clinical decision-support systems in use today rely on an uncoordinated method of combining data from several sources [ 23 , 24 , 25 ]. IRENE was the first transformer-based medical diagnostic model to perform holistic representation learning on multimodal clinical data using a single, unified AI model [ 26 ]. In contrast to earlier nonunified approaches, IRENE avoids separate pathways for learning modality-specific characteristics and instead progressively learns holistic representations of multimodal clinical data. Recently developed large language models may improve this method [ 27 ].

Model selection

Depending on the kind of data and the problem to be addressed, we must select the most suitable ML or DL architecture. When the dataset holds numeric data, we can use traditional regression models (e.g., linear regression) for prediction and traditional classification algorithms (e.g., support vector machines (SVMs)) for classification. When the data are sound or image data, we need neural networks (NNs), such as CNNs and RNNs, to help us mine their deeper features. If we also need to capture sequence information in the data, we can use long short-term memory (LSTM) networks.

Model training

Conventional model training is divided into two steps: training and validation. We can divide the existing dataset into a training dataset and a validation dataset at a ratio of 7:3 or 8:2. We first use the training dataset to train the model so that the model automatically optimizes its parameters to achieve better recognition and prediction results. We then use the validation dataset to verify the training effect of our model.
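The 7:3 or 8:2 division described above can be sketched as a shuffled split. The function name `split_dataset` and the fixed random seed are assumptions for illustration. In practice, the split is often also stratified by class or grouped by patient, which this minimal version does not do.

```python
import random

def split_dataset(samples, train_fraction=0.7, seed=0):
    """Shuffle the samples reproducibly and cut them into two parts."""
    shuffled = list(samples)               # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Ten hypothetical sample identifiers split at 7:3 and at 8:2.
samples = list(range(10))
train_73, held_out_73 = split_dataset(samples, train_fraction=0.7)
train_82, held_out_82 = split_dataset(samples, train_fraction=0.8)
```

Every sample ends up in exactly one of the two parts, so the model is never evaluated on data it was trained on.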

Hyperparameter tuning

In an AI model, parameters are often divided into two categories: hyperparameters and model parameters. Model parameters are optimized automatically through continuous training and iteration, while hyperparameters are fixed before training and need to be set manually. The number of convolutional layers in a CNN is one example of a hyperparameter. The setting of hyperparameters directly affects the performance of a model. When the classification or prediction performance of a model is poor, we can modify the hyperparameters to improve it. Hyperparameter tuning is complicated work that requires sufficient professional knowledge and experience accumulated from long-term practice.
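One simple and widely used way to tune a hyperparameter is a grid search: train with each candidate value and keep the one that scores best on held-out data. The sketch below is a toy example under stated assumptions. It uses a one-dimensional k-nearest-neighbor classifier, where the number of neighbors `k` is the hyperparameter, and the data points, labels, and function names are all invented for illustration.

```python
def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (labels are 0 or 1)."""
    neighbors = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in neighbors)
    return 1 if votes * 2 > k else 0

def accuracy(train, held_out, k):
    """Fraction of held-out points classified correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in held_out)
    return correct / len(held_out)

def grid_search(train, held_out, candidate_ks):
    """Score every candidate k and return the best one with all scores."""
    scores = {k: accuracy(train, held_out, k) for k in candidate_ks}
    best_k = max(scores, key=scores.get)  # first candidate with the top score
    return best_k, scores

# Toy data: (feature, label). The point (2.0, 1) is a deliberately noisy label.
train = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
held_out = [(1.8, 0), (2.4, 0), (6.5, 1), (7.5, 1)]

best_k, scores = grid_search(train, held_out, candidate_ks=[1, 3, 5])
```

With `k = 1` the classifier copies the noisy label and misclassifies the two nearby held-out points, whereas larger values of `k` average the noise away. This is the kind of effect that tuning is meant to uncover.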

With the development of AI, an increasing number of models have been built, and Table 1 describes the latest FDA-approved AI models related to cancer.

ChatGPT, a public and open research preview that was released in November 2022, quickly popularized OpenAI’s work with autoregressive LLMs based on generative pretrained transformers (GPT). Tiffany H. Kung et al . [ 35 ] evaluated ChatGPT on the United States Medical Licensing Examination (USMLE) and found that ChatGPT performed at or near the passing threshold of 60% accuracy. According to their study, LLMs such as ChatGPT may be able to help human students in a medical education context as a step toward eventual inclusion in clinical decision-making.

However, before clinical decisions are made, physicians often perform an additional and crucial step wherein they ask patients a series of questions to further clarify issues and schedule relevant tests to obtain more accurate information to support a diagnosis. This step is currently difficult for ChatGPT to accomplish proactively. We must recognize that AI’s purpose is not to eclipse or take the place of humans but rather to offer decision-support tools that aid in the clinical management of cancer patients by medical professionals and researchers studying the illness.

Increasingly significant role of AI in tumor diagnosis, staging, and grading

Tumor screening and early detection.

An important way to reduce cancer incidence and mortality is through screening in a population. With the increasing awareness of health screening, an increasing number of smart detectors are being invented to improve the early detection of cancer. For the purpose of early cancer diagnosis, traditional machine learning (ML) approaches including random forest (RF), naïve Bayes, k-nearest neighbor, support vector machines (SVM), and related methods have been applied. Convolutional neural networks are the most commonly used model in image-based screening, and SVM algorithm-based and mass spectrum-based feature selection are commonly used in molecular diagnostics.

Digital breast tomosynthesis (DBT) can improve breast cancer screening by decreasing recall rates and increasing overall and incremental cancer detection rates [ 36 , 37 , 38 ]. However, DBT images take longer to interpret [ 39 ]. An AI model [ 40 ] consisting of an ensemble of 50 different classifiers was built. The clinical data and data from the Digital Imaging and Communications in Medicine (DICOM) tags were analyzed by five machine learning (ML) classifiers, and the four DBT views were processed by 45 deep learning (DL) classifiers. In a simulated clinical workflow, the ability of the AI model to recognize common DBT screening techniques reduced the number of examinations that required interpretation by a doctor.

For lung cancer screening, X-rays and low-dose CT (LDCT) are the most routine screening methods. DL algorithms have made good progress [ 41 , 42 , 43 ] in improving X-ray screening of lung nodules. However, low-dose CT is more accurate than X-rays, and the use of low-dose spiral computed tomography (CT) scans has been shown to significantly reduce lung cancer mortality [ 44 ]. A CNN (CXR-LC) was created using information that is commonly available in electronic records (chest X-ray (CXR) image, age, sex, and smoking status) and was validated in two large lung cancer screening trials (PLCO, NLST) as being able to identify smokers at high risk of developing incident lung cancer [ 45 , 46 ]. A DL system [ 47 ] was created that can correctly identify the presence of lung cancer within three years and account for all pertinent nodule and nonnodule markers on screening chest CTs. That research was the first to create a DL prediction method that assesses a person’s 3-year probability of developing lung cancer and the related lung cancer-specific mortality without the use of computer-aided diagnostic tools. Kiran Vaidhya Venkadesh et al. [ 48 ] created and externally validated a CNN-based DL algorithm for estimating the malignancy risk of lung nodules found by low-dose screening CT, which performed on par with thoracic radiologists (AUC = 0.93). However, these studies had several limitations. First, the algorithm in [ 48 ] used a single CT scan and did not take prior CT images into account. Second, members of the cohort in [ 47 ] underwent screening LDCT every year on average, which may bias the measurement results.

To address the above issues, a deep learning system was developed that can forecast the probability of developing lung cancer over the following six years. The newly developed Sybil [ 49 ] can accurately estimate a person’s future risk of lung cancer from a single LDCT scan, enabling more individualized screening. When CNNs are used for lung nodule classification, data imbalance is a crucial issue. To address this, MLSL-Net [ 50 ] was established, which employs multilabel softmax loss (MLSL) as the performance index. Recently, Xiangde Luo et al. [ 51 ] proposed a centroid matching detection network (SCPM-Net) based on a 3D sphere representation to address two limitations of CNN-based detectors: they have limited flexibility when dealing with pulmonary nodules that vary widely in size, and they require predefined anchor parameters, such as the size, number, and aspect ratio of anchors. According to experimental findings on the LUNA16 dataset, the SCPM-Net framework has an average sensitivity of 89.2% across 7 predefined false-positive (FP) rates per scan.

In addition to imaging, molecular tests are an important part of early screening. Nine lipids have been identified [ 52 ] as the features most crucial for early-stage cancer detection using SVM algorithm-based and mass spectrum-based feature selection. The chosen lipids were found to be differentially expressed in in situ early-stage lung cancer tissues according to matrix-assisted laser desorption/ionization MS imaging. A diagnostic screening approach for gliomas called DeepGlioma [ 53 ] uses deep neural networks and stimulated Raman histology (SRH) to quickly screen for molecular changes in newly collected glioma specimens.

Adenomas of the colorectum have been shown [ 54 ] to be highly correlated with colorectal cancer. Several studies have recently developed different AI models for improving adenoma detection rates [ 32 , 55 , 56 , 57 , 58 , 59 , 60 , 61 , 62 ]. To predict the polyp class, two DL models, SEG and noSEG, were trained using 3D CT colonography image subvolumes. Model SEG was also trained using polyp segmentation masks [ 56 ]. Joel Troya et al . [ 58 ] combined side optics with AI. Hong Xu et al . [ 60 ] invented an AI polyp detection system (Eagle-Eye) with real-time notification on the same monitor of the endoscopy system. All of these models have been shown to enable CT colonography to noninvasively distinguish benign and premalignant colon polyps. In addition, AI has been shown to save on the cost of colonoscopies [ 62 , 63 ]. Along with colonoscopy, noninvasive diagnostics, including plasma fluorescence [ 64 ], tests for intestinal microbiota [ 65 ], and spatial light interference microscopy [ 59 ], can be used in conjunction with AI to enhance the early detection of colorectal cancer.

Furthermore, cervical cancer [ 66 ], skin cancer [ 67 , 68 ], oral cancer [ 69 , 70 ], esophageal squamous cell carcinoma and adenocarcinoma of the esophagogastric junction [ 71 ] can also be detected and distinguished early using AI models. The above studies greatly demonstrate the potential of AI models in detecting early cancers. Figure  2 describes the function of AI in cancer.

Figure 2. AI in oncology, including early screening, diagnosis, treatment, prognosis, and clinical decision-making. (Created with BioRender.com)

Tumor diagnosis

When assessing a patient’s signs and symptoms, clinicians typically draw on their own knowledge and professional expertise. Given the enormous amount of clinical data, it can be challenging for them to make a diagnosis quickly. In addition, there are issues with differences among individual patients, atypical test results, and false negatives. Doctors with heavy clinical workloads frequently run the risk of missed diagnoses or misdiagnoses. However, AI can process a large amount of data in a short period and can improve the accuracy and speed of disease diagnosis, which has allowed AI to be widely used in cancer diagnosis.

The current approaches of AI for cancer diagnosis can be routinely divided into two main types: microscopy-based and image-based AI. Microscopy-based AI mainly explores models to improve the correct diagnosis from a histopathological point of view. Image-based AI involves algorithms that reduce the incorrect diagnosis rate from images such as X-ray and CT scans.

Here, we focus on the latest advances in current conventional microscopy-based and image-based AI for cancer diagnosis. Weakly supervised learning models and generative adversarial networks are the most relevant models for histopathology. Various models in deep learning play an important role in assisting imaging to diagnose tumors.

Microscopy-based research

Pathological diagnosis was once considered the gold standard for cancer diagnosis, but errors inevitably exist [ 72 ]. In the past, the majority of techniques relied on morphological traits or hand-crafted features to identify malignant and noncancerous cells in histopathological images [ 73 , 74 ]. The power of AI is not limited to image classification, where the goal is to predict a certain condition consistent with the image. Generative models replicate the original images and provide new possibilities, such as quick and safe model training. Whole-slide imaging (WSI) of tissues, affordable storage, and rapid network data transfer have made it practical to curate enormous databases of digitized tissue sections [ 75 ]. Annotation-free techniques have gained popularity in recent high-profile papers [ 27 , 76 , 77 ]. These approaches do not rely on annotations for individual structures such as nuclei, cells, or tissues; instead, they need only one label per complete WSI, such as malignant/benign, which characterizes the WSI as a whole. AI techniques leverage such weak labels through the multiple instance learning (MIL) framework [ 78 ].
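The core idea of MIL with slide-level labels can be sketched as follows: a slide is treated as a "bag" of tiles, a model scores each tile, and the tile scores are aggregated into one slide-level prediction that is compared with the single slide label. The sketch assumes the classic MIL rule (a slide is positive if at least one tile is positive), implemented as max pooling. The tile scores, the threshold, and the function names are hypothetical, and published systems often use learned attention-based aggregation instead.

```python
def slide_score(tile_scores):
    """Max pooling: the slide is as suspicious as its most suspicious tile."""
    return max(tile_scores)

def classify_slide(tile_scores, threshold=0.5):
    """Turn per-tile malignancy scores into one slide-level label."""
    return "malignant" if slide_score(tile_scores) >= threshold else "benign"

# Hypothetical per-tile scores produced by some tile-level model.
slide_a = [0.05, 0.10, 0.92, 0.20]   # one highly suspicious tile
slide_b = [0.05, 0.10, 0.20]         # no suspicious tile

label_a = classify_slide(slide_a)    # "malignant"
label_b = classify_slide(slide_b)    # "benign"
```

Because only the aggregated score is compared with the slide label during training, no tile-level annotation is required.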

Histological staining, a vital step in the pathology workflow, provides tissue contrast and color by permitting chromatic discrimination between different tissue components. The most popular stain, hematoxylin and eosin (H&E), sometimes known as the “standard stain,” is used in almost all clinical settings [ 79 ]. Aman Rana et al. [ 80 ] trained conditional generative adversarial networks (cGANs) that automatically convert native nonstained RGB WSIs to computationally H&E-stained images. Binglu Huang et al. [ 81 ] collected 2333 H&E-stained pathology images from 1037 gastric cancer (GC) patients to develop GastroMIL, which achieved an accuracy of 0.920 in an external validation set, superior to that of junior pathologists and comparable to that of expert pathologists. Identifying mitosis in H&E-stained sections is challenging because few datasets are available and because mitotic and nonmitotic cells look similar. In a comparison with other classifiers such as AdaBoost and random forest, a multi-CNN combination of three pretrained CNNs and a linear SVM achieved 93.81% accuracy and a 92.41% F1 score for detecting mitosis [ 82 ].

DL can also predict biomarkers from cancer pathology slides with high performance. Malignant cancer cells arise when normal cells acquire oncogenic driver mutations, which completely alter the behavior of the cells by rewiring their internal systems [ 83 ]. When genetic mutations are present, the genotype established by an enzyme-mediated biological research assay or other gold standard test during the traditional diagnostic workup is used as the ground truth to label the image. The term “ground truth” describes the reference test used to label the training images. By examining histological image data, the DL classifier can therefore be trained to reproduce the “ground truth.” In contrast to basic DL applications, these advanced applications can give doctors extra information that is not gleaned from routine material in current medical workflows. They represent a novel category of biomarkers with prognostic and/or predictive utility. Microsatellite instability (MSI) due to mismatch repair (MMR) defects accounts for 15–20% of colon cancer (CC) cases, and many DL algorithms [ 84 , 85 , 86 ] have been established to detect MSI. To predict additional biomarkers for colorectal cancer (CRC) from pathology slides, Jan Moritz Niehues et al. [ 14 ] comprehensively assessed six distinct state-of-the-art DL architectures. They found that while MSI and BRAF mutation prediction reached a clinical-grade level, PIK3CA, KRAS, and NRAS mutation prediction did not. To enable automated Ki-67 labeling index (LI) assessment in routine clinical practice, an algorithm for cell-distance analysis of multiplex fluorescence immunohistochemistry (mfIHC) staining and a framework for automated Ki-67 LI quantification were created and validated in a cohort of 12,475 prostate cancer samples [ 87 ]. AI-assisted analysis of biomarkers in thyroid cancer [ 88 ] and breast cancer [ 89 ] also helps in accurate diagnosis.

Overall, these models may be useful for diagnosing and categorizing malignancies if their performance is supported by prospective studies. This is especially true given that their performance is on par with or even superior to that of experts in the area.

Image-based research

AI has much potential for helping radiologists with their work and for image information mining. Clusters of graphics processing units are integrated into high-performance computers, which have powerful computational power. In addition to the AI we mentioned above, which can assist in early screening for breast, lung and colorectal cancers, some other promising imaging tests can be combined with AI to improve diagnostic accuracy.

Numerous modalities have been used to acquire vast numbers of high-quality skin photos, exploiting the exceptional advancements in optical imaging methods. Therefore, AI has made promising progress in the detection of skin cancer through dermoscopy. Inception V3 models that have already been trained have been used [ 90 ] to classify skin lesions and to present dermatologist-level prediction outcomes. The knowledge distillation approach is also often used to help diagnose melanoma [ 91 , 92 ]. In addition to the simple teacher–student model, the SSD-KD approach [ 93 ], a unique self-supervised diversified knowledge distillation technique, has been used for the lightweight multiclass categorization of skin diseases utilizing dermoscopy images. In that study, the conventional single relational modeling block was substituted with dual relational blocks in terms of technological innovation. Multi-Site Cross-Organ Calibrated Deep Learning (MuSClD), a novel approach to cross-organ calibration between two sites of digitalized histopathology images, was validated in nonmelanoma skin cancer. 3D images [ 94 , 95 ], EfficientNet [ 96 , 97 ], genetic programming (GP) [ 98 ], and new AI algorithms on smartphones [ 99 , 100 ] have also been developed for skin cancer diagnosis.

To supplement human visual inspection, AI can assist in the detection of undetectable tumor lesions on PET scans. Ga-PSMA-11 PET-based radiomics features have been used to generate random forest models that accurately predicted invisible intraprostatic lesions [ 101 ]. Biopsy and magnetic resonance imaging (MRI) are frequently used to diagnose intracranial tumors. Due to the similar phenotypes of various tumor classes on MRI scans, it has been difficult to identify tumor types, especially rare types, from MRI data. A DL method for segmenting and classifying 18 distinct types of intracranial tumors was developed [ 102 ] using T1- and T2-weighted images and T2 contrast MRI sequences and evaluated with an AUC of 0.92.

AI may easily be applied to medical imaging, and major advancements in this area have been made in recent years. AI eliminates the uncertainty that people contribute to decisions and delivers objective measurements for each choice. However, the limits are also readily apparent. The molecular causes of illnesses are not revealed by morphological evidence. By using this method, disease states with the same morphological appearance cannot be discriminated.

Tumor staging and grading

Important factors for tumor T-staging include the size and degree of invasiveness of primary tumors, which comprise descriptions of their shapes. Convolutional neural networks are most used in this task. The T stage of Barrett’s carcinoma is a crucial consideration when choosing a course of therapy. Endoscopic ultrasonography is still the norm for preoperative staging, but its usefulness is under question. To help with staging and to improve outcomes, new tools are needed. With a high accuracy of 73% in diagnosing esophageal cancer, an AI system built around endoscopic images has been developed [ 103 ]. Tumor sizes and forms vary, making individual slice-by-slice screening for T-staging time intensive. Consequently, a multi-perspective aggregation network (TSD Net) has been created with ideas from oncological diagnostics that included different diagnosis-oriented knowledge and enabled automatic nasopharyngeal carcinoma T-staging identification [ 104 ].

Advances in radiomics have greatly contributed to the TNM staging of tumors. Separate iterations of machine learning models have been created using both the entire set of extracted features (full model) and only a selection of previously identified robust metrics (robust models), confirming that CT-based radiomics signatures are effective tools for determining the grade and stage of clear cell renal cell carcinoma (ccRCC) [ 105 ]. Delineation of the tumor is also important in the early phases of decision-making but is time-consuming. To predict the grade of a tumor while also segmenting it, a single multitask convolutional neural network has been created using whole 3D structural preoperative MRI data [ 106 ].

Accurate assessment of lymph node metastasis (LNM) is essential for the staging and grading of tumors. In addition to offering a straightforward “yes” or “no” response on the likelihood of having cancer, AI models can also identify the disease site from a test image. One of the most common applications is localizing metastatic tumors. Using whole-body PET/CT scans, convolutional neural networks (CNNs) based on UNet [ 1 ] were trained to detect and segment metastatic prostate cancer lesions fully automatically [ 107 ]. The localization of tumor metastasis in whole-slide images has also been studied extensively in recent years [ 107 , 108 , 109 , 110 ]. The preoperative status of the lymph nodes (LNs) is crucial for the management of colorectal cancer (CRC). A deep learning (DL) model [ 111 ] with features extracted from enhanced venous-phase CT images of CRC has been proposed to identify LNM in CRC, with areas under the curve (AUCs) of 0.79, 0.73, and 0.70 in the training, testing, and verification sets, respectively. AI also plays a significant role in helping pathologists find lymph node metastases in slide images. Shaoxu Wu et al. [ 112 ] created an AI-based diagnostic model called LNMDM that successfully identified lymph node metastases, especially micrometastases, on whole-slide images and was validated not only in bladder cancer (0.983 [95% CI 0.941–0.998]) but also in breast cancer (0.943 [95% CI 0.918–0.969]) and prostate cancer (0.922 [95% CI 0.884–0.960]). The VIS AI algorithm demonstrated comparable accuracy and negative predictive value (NPV) in identifying LN metastases in breast cancer. In summary, the implementation of AI in tumor staging and grading has significantly improved tumor prognoses and increased the general survival rate of cancer patients.
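Many of the results in this section are reported as an area under the ROC curve (AUC). As a reminder of what that number means, the sketch below computes it directly from its probabilistic definition: the chance that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name and the toy labels and scores are assumptions for illustration and are unrelated to the cited studies.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    `labels` contains 1 for positive and 0 for negative cases.
    Assumes at least one case of each class.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))

# Four hypothetical lymph nodes: two without and two with metastasis.
auc = roc_auc(labels=[0, 0, 1, 1], scores=[0.1, 0.4, 0.35, 0.8])  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic, so it is intended for understanding rather than for large datasets.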

Tumor therapy

AI for exploring tumor therapeutic targets.

In recent years, the development of multiomics technologies in cancer research [ 113 , 114 ] has greatly facilitated the discovery of anticancer targets [ 115 , 116 , 117 ]. The advancement of precision medicine and translational medicine will be significantly aided by the use of ML and DL to mine multiomics data to investigate complicated disease causation processes and treatment response mechanisms. In the following, we describe in detail the advances in genomics, epigenetics, transcriptomics, proteomics, metabolomics, and multiomics in cancer target discovery. Figure  3 describes the main sources of these six components and the advanced methods currently comprising them.

Figure 3. Components of multiomics and the main techniques. The combination of AI and multiomics has led to the discovery of new targets for cancer therapy. (Created with BioRender.com)

The genome contains inherited information that controls gene expression to shape the structure and working machinery of the cell [ 118 ]. Genomics focuses on understanding the composition, organization, visualization, and modification of an organism’s whole genome [ 119 ]. The rise of the genomic era has also boosted precision medicine in cancer [ 120 ]. A meta-learning model [ 121 ] allows users to discover significant pathways in cancer and to prioritize genes based on their contribution to survival prediction. Accurate somatic mutation detection is difficult yet essential for fully understanding how cancer develops, progresses, and responds to treatment. NeuSomatic [ 122 ] is the first method for detecting somatic mutations based on deep CNNs. However, matched normal specimens are not frequently acquired in clinical practice, which is a major barrier to genetic testing in cancer. SGZ [ 123 ], which does not need a patient-matched normal control, can predict the somatic vs. germline status of each detected alteration by modeling its allele frequency (AF) while accounting for the tumor content, tumor ploidy, and local copy number. Similarly, a recently created DL-based method, Continuous Representation of Codon Switches (CRCS) [ 124 ], can aid in the identification and investigation of driver genes as well as the detection of cancer-related somatic mutations in the absence of matched normal samples.

Taking colon cancer as an example, numerous studies [ 125 , 126 , 127 , 128 ] have subtyped colorectal cancer based on similar and different biological traits and pathways, and they have identified the relationships between these pathways and patient prognosis, overall survival, and responsiveness to various treatments—particularly targeted therapy and immunotherapy. Using 499 primary colorectal neoplasm diagnostic images from 502 individuals in The Cancer Genome Atlas Colon and Rectal Cancer (TCGA-CRC-DX) cohort, a retrospective study established a weakly supervised DL framework incorporating three separate CNN models [ 85 ]. After comprehensive validation, the method was shown to be helpful for patient classification for targeted medicines, with possible cost savings and quicker turnaround times compared to sequencing- or immunohistochemistry-based techniques. The research, however, examined each individual image tile without considering the significance of the spatial relationship between tiles. In a recent study [ 129 ], a method for forecasting cross-level molecular profiles involving gene mutations, copy number variations, and functional protein expression from whole-slide pictures was proposed. This method focuses on the spatialization of cancer tiles. In the training dataset, the model performed exceptionally well in predicting a variety of genetic alterations and then identifying targeted therapies for colon cancer patients.

Epigenetics

Epigenetic modification is a heritable change in the way genes function and are expressed that does not alter the DNA sequence. DNA methylation, histone modification, and chromatin structure remodeling are the three primary epigenetic modifications currently understood [ 130 ]. Although high-quality DNA methylation data are available, few samples have RNA-seq data owing to numerous experimental difficulties. Therefore, a technique called TDimpute [ 131 ] was created to impute missing gene expression data from DNA methylation data using a transfer learning-based neural network. Understanding how epigenetics regulates gene expression to govern the functional heterogeneity of cells depends on the ability to predict differentially expressed genes (DEGs) from epigenetic signals. A multiple self-attention model (Epi-MSA) [ 132 ] was proposed to predict DEGs on the basis of epigenetic data. To determine which gene locations are crucial for predicting DEGs, Epi-MSA first applies CNNs to embed neighborhood bin information and then applies several self-attention encoders to the various input epigenetic factors.

Transcriptomics

Transcriptomics is a useful tool for understanding the physiology of cancer and locating biomarkers. It includes analyses of alternative transcription and alternative polyadenylation, detection of integration transcripts, investigations of noncoding RNAs, transcript annotation, and the discovery of novel transcripts [ 133 ]. One study using DL algorithms to interpret common cancer transcriptome markers [ 134 ] showed that, across a wide range of solid tumor types, dysregulation of RNA-processing genes and aberrant splicing are widespread traits on which fundamental cancer pathways may converge. Molecular pathology plays an important role in cancer, but whether gene expression levels can be estimated from visual inspection of H&E-stained WSIs has not been thoroughly explored. Several studies have sought to predict cancer gene expression across the transcriptome from histopathological images, including in prostate [ 135 ] and breast [ 136 ] cancers. A DL model called HE2RNA [ 137 ], based on a multitask weakly supervised approach, was trained on matched WSIs and RNA-Seq profiles from TCGA data, which included 8725 patients and 28 distinct cancer types. This increases the likelihood of discovering novel gene targets. Patients’ responses to treatment are significantly influenced by the quantity, composition, and spatial distribution of the cell populations in the tumor microenvironment (TME) [ 138 ]. Recent developments in spatial transcriptomics (ST) have made it possible to thoroughly characterize gene regulation in the TME [ 139 , 140 ]. Three new approaches have recently been developed: Kassandra [ 141 ], XFuse [ 142 ], and TESLA [ 143 ]. Kassandra is a decision tree ML algorithm trained to precisely reconstruct the TME using a large database of > 9,400 tissue- and blood-sorted cell RNA profiles combined into millions of artificial transcriptomes. Kassandra’s deconvolution of TME components showed that these cell populations play a part in tumor etiology and other biological processes. Using data from H&E-stained histological images, XFuse predicts super-resolution gene expression per pixel. TESLA is an ML framework that integrates gene expression and histological image data in ST to study the TME. The innovative aspect of TESLA is that it annotates diverse immune and tumor cells directly on histological images.

In addition, the identification of lncRNAs [ 144 , 145 , 146 ] and microRNAs [ 147 , 148 ] by ML can assist in the precise treatment of cancer. In the fight against cancer, therapeutic decisions are increasingly based on molecular tumor features, and molecular profiling of cancer tissue is becoming an essential component of standard diagnosis [ 149 ]. To account for differences between individual patients, scGeneRAI [ 150 ] uses layerwise relevance propagation (LRP), an explainable AI technique, to infer cell-specific gene regulatory networks from single-cell RNA sequencing data. Predicting drug response is a major challenge in cancer treatment. With an average Matthews correlation coefficient (MCC) and AUC of 0.56 and 0.80, respectively, the classification and regression tree (CART) model was the best of the interpretable ML models evaluated for predicting how breast cancer would respond to doxorubicin [ 151 ]. ScDEAL is a deep transfer learning system that integrates bulk cell-line data to predict cancer drug response at the single-cell level. Using AI to find drug resistance targets at the level of transcriptional profiles deserves more research in the future.
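The two metrics quoted above for the doxorubicin-response classifier can be made concrete with a small sketch. The code below is illustrative only: the function names and the toy labels are assumptions and do not come from the cited study. MCC is computed from the confusion matrix of hard predictions, and AUC is computed as the fraction of positive/negative pairs that the scores rank correctly.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = responder, 0 = non-responder.
labels = [1, 1, 0, 0]
hard_predictions = [1, 0, 0, 0]
risk_scores = [0.9, 0.4, 0.5, 0.1]
print(matthews_corrcoef(labels, hard_predictions), roc_auc(labels, risk_scores))
```

MCC uses all four cells of the confusion matrix, which is one reason it is often reported alongside AUC when the classes are imbalanced.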

Proteomics is the large-scale study of proteins, identifying and quantifying the proteins present in a biological sample such as cells, tissues, or bodily fluids. In contrast to other forms of omics data, such as genomic data, proteomics data provide quantitative measurements of individual proteins throughout the body and capture dynamic characteristics that change over time and between individual subjects. Mass spectrometry (MS) is a key tool used in proteomics research [ 125 ]. MS-based proteomics has advanced quickly in terms of lower cost and higher throughput, regularly permitting large-cohort studies with tens of thousands of participants and tens of millions of identified proteins in cancer cells and other biological samples. However, most research concentrates on the final proteins identified by algorithms that compare partial MS spectra with a database, leaving the problem of pattern identification and classification of the raw mass-spectrometric data unresolved. Consequently, the publicly available MSpectraAI [ 152 ] platform and the tumor classifier [ 153 ] have been developed to analyze massive MS data using deep neural networks (DNNs); they could expand the use of DL techniques for classifying and predicting proteomics data from multiple cancer types and for distinguishing between tumor and nontumor samples.

Sequential Window Acquisition of all Theoretical Mass Spectra-MS (SWATH-MS) is a cutting-edge MS method that enables the measurement of nearly all peptides as well as proteins present in a single sample, making it valuable in research involving massive sample cohorts [ 154 ]. It can be used to facilitate the categorization of CRC molecular subgroups and promote both diagnostics and the creation of novel medications [ 155 ]. Regarding colorectal cancer, a mechanism-based ML approach [ 156 ] has been proposed to find genes and proteins with substantial correlations to event-free patient survival and predictive potential to account for patient-specific variations in STN activity by building three linear regression models. The development of proteomics has contributed to the discovery of new targets in hematological tumors. Targetable enzyme characteristics have been revealed by proteomics of acute lymphoblastic leukemia that is resistant to Notch1 suppression [ 157 ]. Through the induction of long-lasting immune responses, T cells play critical roles in human defense against hematological tumors. In recent work [ 158 ], ML and nanoscale proteomics were coupled to subtype T cells in peripheral bloodstreams from single individuals with multiple myeloma. To reduce the possibility of overfitting the ML models, differentially expressed proteins (DEPs) were selected according to statistical significance, and only the top 13–15 DEPs were utilized. Thus, this work helped identify new targets for immunotherapy. Another DL network [ 159 ] identified the 20 proteins most strongly associated with FLT3-ITD in acute myeloid leukemia. In addition, DL and ML have been applied to proteomics data for pancreatic cancer [ 160 ] and diffuse large B-cell lymphoma [ 161 ] patients, respectively.
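The feature-selection step described above for the T-cell study (keeping only the most significant DEPs to limit overfitting) can be sketched as a simple filter-and-rank procedure. This is a minimal illustration, not the cited authors' pipeline: the function name, the significance cutoff `alpha`, and the protein identifiers are all hypothetical, and the p-values are assumed to be already corrected for multiple testing.

```python
def select_top_deps(p_values, alpha=0.05, top_k=15):
    """Keep proteins with p < alpha, ranked by ascending p-value, at most top_k.

    p_values: dict mapping a protein name to its (assumed adjusted) p-value.
    """
    significant = [(name, p) for name, p in p_values.items() if p < alpha]
    significant.sort(key=lambda item: item[1])  # most significant first
    return [name for name, _ in significant[:top_k]]


# Hypothetical protein names and p-values, for illustration only.
example = {"P1": 0.001, "P2": 0.2, "P3": 0.01, "P4": 0.04}
print(select_top_deps(example, top_k=2))
```

Restricting the model to a small number of features relative to the number of samples is a common, simple guard against overfitting in small proteomics cohorts.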

Metabolomics

Metabolomics is a burgeoning area of research that utilizes technologically sophisticated analytical chemistry to perform high-throughput characterization of metabolites in cells, organs, tissues, or biological fluids [ 162 ]. New therapeutic targets have been suggested to target metabolic constraints in cancer as a result of metabolomics studies, which have revealed potential medicinal weak points for treating cancer [ 163 ]. Lipidomics is a branch of metabolomics that aims to study and analyze the lipids in the metabolome and the molecules that interact with them [ 164 ]. Metabolomics analysis can be performed using GC‒MS and LC‒MS, and LC‒MS is commonly used for the analysis of lipidomics. The combination of metabolomics and AI has flourished in various areas of cancer, including breast cancer [ 165 , 166 ], head and neck cancer [ 167 ], colorectal cancer [ 168 , 169 ], glioma cancer [ 170 ], esophageal cancer [ 171 , 172 ], lung cancer [ 52 , 173 ], kidney cancer [ 174 ], and neuroendocrine tumors [ 175 ]. With the greatest prediction accuracy (AUC = 0.93) and a deeper understanding of disease biology, a DL technique has been shown to be beneficial for metabolomics-based breast cancer ER status categorization [ 176 ]. By biologically interpreting the first hidden layer, this technique can identify eight frequently enriched crucial metabolomics pathways (adjusted P value 0.05) that cannot be identified by other ML techniques [ 176 ].

Multiomics data, which include genomics, epigenomics, transcriptomics, and proteomics data, can offer profound information on the quantity and/or change in biological molecules across numerous dimensions in different tissues or cells [ 177 ]. Multiomics data have gained interest recently for their potential to offer a complete picture of patients, but their high dimensionality makes them difficult to use [ 178 ]. AI related to cancer multiomics has boomed in the last year and has strong potential for development in cancer therapy. Cancer driver genes are important targets in tumor therapy [ 179 ]. When compared to real tumors, an ML multiomics study [ 180 ] indicated carcinoma driver dysregulation in pancancer lineages of cells. Using graph convolutional networks to identify cancer driver genes is currently a popular research direction. DGMP [ 181 ] and MODIG [ 182 ] were created separately by applying pancancer multiomics data (including DNA methylation, copy number variation, mutation, and gene expression data). DGMP joins a directed graph convolutional network (DGCN) and multilayer perceptron (MLP), and MODIG is based on a graph attention network (GAT). They both have been shown to effectively identify cancer driver genes. Accurate tumor druggable gene discovery advances precision cancer therapy and deepens the comprehension of targeted cancer therapy. To determine the landscape of the genes that are capable of causing cancer, DF-CAGE [ 183 ], a novel ML-based method, combined the data from over 10,000 TCGA profiles on somatic mutations, copy number variations, DNA methylation, and RNA-Seq. DF-CAGE identified 465 putative cancer-druggable genes out of the approximately 20,000 protein-coding genes. These results provide insight into current pharmacological research and development efforts. 
DeepInsight-3D [ 184 ], which depends on the translation of structured data into images and then makes use of CNNs, represents a solution to the issue of the high dimensionality of the datasets combined with the lack of sufficiently large numbers of annotated samples in multiomics data. Future research toward better personalized treatment plans for various malignancies may be aided by the suggested enhancements.

The prognosis for non-small cell lung cancer (NSCLC), a heterogeneous disease, is poor. A recent study [ 185 ] used ML models to develop a classification method and identified five novel NSCLC clusters with different genetic and clinical characteristics. Similarly, an AI algorithm suited to multiomics data [ 186 ] was created to identify new biomarkers in NSCLC, in this case based on a graph convolutional network. Filippo Lococo et al. integrated multiomics data and AI into clinical trials, promoting better care for lung cancer patients [ 187 ]. The clinical significance of IMMT in KIRC has been validated using a combination of supervised learning and multiomics integration [ 188 ]. The majority of prognostic models for colon cancer are based on single-pathway genes. In a recent study [ 189 ], the molecular mechanisms underlying the aggressiveness, recurrence, and progression of colon cancer were explained using an integrative multiomics analysis, and ML methods were used to recognize the subtypes.

AI models can also help locate tumor sites during surgery, which is one of their most common applications. The location, number, and size of tumors are critical factors for precise tumor excision. One study presented a novel double-branch, attention-driven multiscale learning network for MRI-based segmentation of the prostate and prostate cancer [ 190 ]. The Dice similarity coefficients (DSCs) for prostate and prostate cancer MRI segmentation were 91.65% and 84.39%, respectively. Using magnetic resonance imaging, UNet++ can automatically distinguish between liver tumors and normal hepatic tissue [ 191 ].
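The Dice similarity coefficient reported above measures the overlap between a predicted segmentation mask and the reference mask: twice the number of shared foreground pixels divided by the total number of foreground pixels in both masks. The sketch below is a minimal illustration on flattened binary masks; the function name and the toy masks are assumptions, not taken from the cited work.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice similarity coefficient for two flattened binary masks (0 or 1)."""
    pred = list(pred_mask)
    true = list(true_mask)
    if len(pred) != len(true):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, true) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in true if t)
    # Two empty masks agree perfectly, so define the score as 1.0 in that case.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy 4-pixel masks: one pixel overlaps out of two foreground pixels in each mask.
print(dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0]))
```

A DSC of 1.0 means the two masks coincide exactly, and 0.0 means they share no foreground pixels.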

AI models can also help with tumor type classification. Neurosurgical cancer resection is the primary therapeutic method most frequently used for central nervous system (CNS) cancers. The kind of cancer is a crucial determinant in deciding whether the risk of a more vigorous excision is acceptable. A patient-independent transfer-learned neural network called Sturgeon was recently created to allow for the molecular subclassification of tumors of the central nervous system using sparse data [ 192 ]. In another study [ 193 ], after first-level categorization determined whether the aberrant area of the picture was a brain tumor, deep residual network (DRN)-enabled RDTDO was used for brain tumor classification, which was provided via second-level classification.

Tumor prognosis

Clinical oncologists rely heavily on prognosis prediction to guide treatment choices by providing information on the predicted course of the disease and the chance of survival [ 194 ] (Table 2). The Cox proportional hazards regression model is used most frequently to predict survival [ 195 ]. However, due to its linear nature [ 196 ], it struggles to capture complex relationships between features, a limitation that current ML and DL survival models help to overcome [ 197 , 198 , 199 ]. Common ML models include SVMs, logistic regression, random forest, CatBoost, LightGBM, and XGBoost. SVMs are among the most widely used ML algorithms for cancer prognosis. In recent research [ 200 ], 265 surgical resection patients were included (training cohort: 212, internal validation cohort: 43). An SVM model was created using nine clinicopathological characteristics. This SVM-based model may be used to forecast OS and DFS in GC patients as well as the benefit of adjuvant treatment in TNM stage II and III GC patients. Another study [ 201 ] fed each feature set selected by LASSO into three classifiers, namely SVM, histogram-based gradient boosting (HGB), and XGBoost (XGB), to develop predictive models. In a study of breast cancer, SVMs and random forests were used as ML classifiers, while principal component analysis (PCA) and variational autoencoders (VAEs) were employed for dimensionality reduction [ 202 ]. However, the multimodal classifiers were not prospectively evaluated on original data in that study. RSF outperformed Cox regression and SVM by a wide margin in research on GBM [ 203 ].
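The linear nature of the Cox model noted above can be stated explicitly. In its standard form, the hazard for a patient with covariate vector x is a baseline hazard scaled by an exponential of a linear combination of the covariates. The notation below (h_0 for the baseline hazard, β for the coefficients, p for the number of covariates) is chosen here for illustration and does not appear in the source text.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right),
\qquad
\log \frac{h(t \mid x)}{h_0(t)} = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p
```

Because the log hazard ratio is a weighted sum of the covariates, interactions and nonlinear effects are not represented unless they are added by hand as extra terms, which is the gap that tree-based and neural survival models aim to fill.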

Currently, radiomics analysis and CNN-based PET/CT image prognosis techniques are available. However, there are intrinsic restrictions to risk stratification when obtaining radiomics or deep features in grid Euclidean space. To accurately stratify HNC risk, a functional-structural subregion graph convolutional network (FSGCN) has been proposed [ 204 ]. To overcome challenges in predicting the LNM status from primary cancer histology, Siteng Chen et al. [ 205 ] presented an attention-based weakly supervised neural network that relied on self-supervised cancer-invariant features, which might serve as an innovative prognostic marker across different types of cancer.

The combination of ML and DL is gaining increasing attention. Even after curative resection, pancreatic ductal adenocarcinoma (PDAC) has a dismal prognosis. The prognosis may be improved by using a DL-based classification of postoperative survival in the preoperative setting to guide treatment choices. Based on this, ensemble learning was used to merge two models that were separately constructed using clinical data-based ML models and computed tomography (CT) data-based DL models [ 206 ]. The classification of CRC tissues based on anatomical histopathological information, however, may not be possible using DL structures alone. In one study [ 207 ], data were input into a deep SVM based on an ensemble learning technique called DeepSVM after the features were chosen, and the results showed that the hybrid model had an accuracy of between 98.75 and 99.76% on CRC datasets.
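One simple way to merge a clinical-data model with an imaging model, as in the ensemble described above, is to average their predicted probabilities (often called soft voting). The source does not state which ensemble rule was used in [ 206 ], so the weighted average below is an assumption chosen for illustration, and the function names, weight, and threshold are hypothetical.

```python
def soft_vote(prob_clinical, prob_imaging, weight_clinical=0.5):
    """Weighted average of per-patient probabilities from two separately trained models."""
    if not 0.0 <= weight_clinical <= 1.0:
        raise ValueError("weight_clinical must lie in [0, 1]")
    if len(prob_clinical) != len(prob_imaging):
        raise ValueError("both models must score the same patients")
    w = weight_clinical
    return [w * a + (1.0 - w) * b for a, b in zip(prob_clinical, prob_imaging)]


def classify(probs, threshold=0.5):
    """Turn merged probabilities into binary class labels."""
    return [1 if p >= threshold else 0 for p in probs]


# Hypothetical outputs for two patients from an ML model on clinical data
# and a DL model on CT images.
merged = soft_vote([0.8, 0.2], [0.4, 0.6])
print(merged, classify(merged))
```

In practice the weight would be tuned on a validation cohort; stacking (training a second model on the two outputs) is a common alternative to fixed averaging.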

AI in clinical decision-making

The data required by physicians to make medical decisions are dispersed over numerous records, including a patient’s case history, test results, and imaging studies. Clinical prediction models usually use direct physician inputs or structured inputs taken from the electronic health record (EHR). The dependency on formatted inputs adds complexity to data processing as well as to the creation and use of models, resulting in the generation of many AI models. The invention of new drugs for oncology research in the era of precision medicine and the emergence of various treatment modalities, such as radiation therapy and surgery, have made the choice of oncology treatments fraught with various challenges. Given the breakthroughs in ML due to the availability of vast volumes of data, clinical decision-support systems (CDSS) driven by AI have been developed [ 229 ]. The earliest extensively used CDSS, Watson for Oncology (WFO, IBM Corporation, USA), has steadily gained popularity throughout the world in the treatment of thyroid carcinoma [ 230 ], prostate cancer [ 231 ], lung cancer [ 232 ], and breast cancer [ 233 , 234 ]. Medical personnel enter a case’s structured data into the WFO system, and then, the most common treatment technique for the individual situation is quickly output by the system, along with reliable proof.

Beyond skilled medical professionals, AI algorithms can forecast certain medical outcomes to assist clinical decision-making in many ways. For digital data, a radiological repomics-driven model combining medical token cognition (RadioLOGIC) [ 235 ] transforms unstructured electronic health information into repomics (report omics) features and uses transfer learning to evaluate human health and forecast pathological prognosis. The system exhibits superior feature extraction performance compared to cohort models and shows potential for automated clinical diagnosis verification from electronic health information. To predict outcomes and identify prognostic features associated with both favorable and unfavorable outcomes, a multimodal, weakly supervised deep learning system has been developed that integrates whole-slide images and molecular profile data from 14 cancer types [ 236 ].

Although the CDSS can quickly collect and categorize stored information, the current state of application is that the CDSS is dominated by hospital ratings, and there is very little true large-scale application. EHR limitations, such as the inability to conduct efficient interpretation and information retrieval, can be addressed with the use of LLMs. LLMs are one of the most intriguing new advances in contemporary AI research [ 237 ]. They are trained on billions of words taken from books, articles, and other online information. LLMs can perform data compression and encryption to protect data privacy. In cancer and medicine, DL natural language processing (NLP) with free-text analysis is being increasingly employed [ 238 , 239 ]. Transformer models have taken over NLP [ 240 ]. To evaluate the accuracy of LLMs for inferring cancer disease response from free-text radiology reports, a study compiled 10,602 computed tomography reports from cancer patients examined at a single institution [ 241 ]. The results demonstrated that the GatorTron transformer, which had an accuracy of 0.8916 on the test set, outperformed bidirectional long short-term memory (BiLSTM) models, CNN models, and conventional ML techniques. This implies that transformer models may be employed as decision-support tools to offer doctors automatic second opinions on disease response. ChatGPT is the most representative LLM, and numerous cancer studies related to it have emerged since its introduction. It is substantially more accurate than previous LLMs when responding to queries concerning lung cancer [ 242 ], liver cancer [ 243 ], and prostate cancer [ 244 ]. Figure 4 shows a response after sending a patient’s chief complaint to ChatGPT.

Figure 4

Simulation of how a large language model (exemplified by ChatGPT) could assist doctors in diagnosis after a patient with suspected lung cancer arrives at the hospital. When sent the chief complaint, ChatGPT first emphasizes that it is not a doctor, then warns that the symptoms may indicate a serious illness, and lists possible diseases. (Created with BioRender.com)

Another notable example is Med-PaLM [ 245 ], a Google-developed chatbot for medical Q&A. Most evaluations of a model’s clinical knowledge are automated and based on a small number of benchmarks. MultiMedQA, a benchmark incorporating six existing medical question answering datasets, was introduced to address these limitations. The reported model reached 67.6% accuracy on MedQA (questions similar to those on the US Medical Licensing Exam), surpassing the previous state of the art by more than 17%. However, as this work and similar studies demonstrate, the answers provided by the model still leave considerable room for improvement compared with those provided by a physician. The follow-up Med-PaLM 2 scored 86.5% on the MedQA dataset, an improvement of more than 19% over Med-PaLM.

The development of LLMs equips cancer practitioners with tools that may be used to enhance the efficacy of therapy and the accuracy of tumor diagnosis as well as to guide clinical decision-making. In addition, members of the public may use these platforms to identify particular clinical symptoms, which can aid in early detection and raise public awareness of health issues. Additionally, people who use AI may more quickly detect overmedication and improper treatment prescribed by some doctors in exchange for payment.

Challenges and opportunities for the future

The future of AI in cancer research is fraught with both formidable obstacles and bright prospects for advancing cancer detection, diagnosis, therapy, and research.

Availability and reliability of data

A significant quantity of training data is required for DL to be effective and credible. Limited data might lead to overfitting and a subpar performance in an external test cohort. Obtaining enough data is quite difficult when creating AI-based models, especially DL models. Data from medical imaging cannot be immediately entered. Processing and extracting information from the image data are essential. A typical neural network cannot fit all medical images, especially whole-slide images that can easily have billions of pixels per image. One method [ 246 ] is to crop the image before sending it to an AI system, adding a manual step to what may otherwise be a fully automated approach, to isolate a smaller region of interest, such as a portion of a slide image that contains a tumor. Insufficient labeling required for supervised learning can also lead to loss of data reliability. Additionally, issues occur when bias in datasets is caused by technical variables. Single-source bias, for instance, occurs when a single system generates an entire dataset. On the one hand, models can be trained on site-specific data to adapt to the unique characteristics of each location where they are used, and they are additionally developed and verified on datasets gathered from various sources to increase generalization [ 247 ]. On the other hand, biomedical technologies such as CODEX and spatial transcriptomics are one way to combine the picture and molecular data [ 136 , 248 ]. These technologies overlay geographically resolved transcriptomics and proteomics data on images, enabling models to handle omics data in image form.
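Because a whole-slide image with billions of pixels cannot be fed to a typical network in one piece, a common workaround is to split it into fixed-size tiles and process each tile separately. This is related to, but not the same as, the manual region-of-interest cropping attributed to [ 246 ] above; tiling is shown here as an assumption-laden illustration of how a large image can be partitioned automatically. The function name and parameters are hypothetical, and only tile coordinates are generated, so no image library is needed.

```python
def tile_coordinates(width, height, tile_size, stride=None):
    """Yield (x, y, w, h) boxes that together cover a width x height image.

    Tiles at the right and bottom edges are clipped to the image bounds.
    A stride smaller than tile_size produces overlapping tiles.
    """
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    if stride is None:
        stride = tile_size  # non-overlapping tiles by default
    if stride <= 0:
        raise ValueError("stride must be positive")
    for y in range(0, height, stride):
        for x in range(0, width, stride):
            yield (x, y, min(tile_size, width - x), min(tile_size, height - y))


# Toy example: a 5 x 4 "image" split into 2 x 2 tiles.
tiles = list(tile_coordinates(5, 4, 2))
print(len(tiles), tiles[2])
```

In a real pipeline each box would be read from the slide file and tiles that contain mostly background would usually be discarded before training.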

Interpretability

Over the last 5 years, research into explainable AI has accelerated. DL has come under fire for being a “black box” that does not clarify the way the model transforms given inputs into outputs. It is challenging for oncologists to comprehend how DL models assess data and make judgments because of the numerous elements involved. The biological significance of explanatory ability must be thoroughly studied in order for DL to be approved by regulators and used as a diagnostic tool. In genomics, this requires comparing significant genetic traits found by DL to those identified by traditional bioinformatics techniques. Additionally, when a DL model is unsure about its predictions, its capacity to generate the “don’t know” output is crucial. Overconfidence in forecasts, such as forecasting the cancer main site with only 40% accuracy, can lead to erroneous cancer diagnosis or management decisions in crucial situations. Both post hoc and integrated interpretability approaches are viable ways to gather explanations from trained models and help the model learn to provide predictions and explanations concurrently.
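The “don’t know” behavior described above can be approximated by letting a classifier abstain whenever its most probable class falls below a confidence threshold. The sketch below is a minimal illustration under the assumption that the model outputs a probability per class; the function name, the threshold value, and the example tumor sites are hypothetical, and raw network probabilities are often poorly calibrated, so a threshold like this would need validation before any clinical use.

```python
def predict_or_abstain(class_probs, min_confidence=0.7):
    """Return the most probable label, or "don't know" if confidence is too low.

    class_probs: dict mapping a class label to its predicted probability.
    """
    if not class_probs:
        raise ValueError("class_probs must not be empty")
    label, confidence = max(class_probs.items(), key=lambda kv: kv[1])
    return label if confidence >= min_confidence else "don't know"


# A prediction at 40% confidence is withheld rather than reported.
print(predict_or_abstain({"lung": 0.40, "breast": 0.35, "colon": 0.25}))
print(predict_or_abstain({"lung": 0.90, "breast": 0.10}))
```

Cases on which the model abstains can then be routed to a clinician instead of being reported as a confident primary-site prediction.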

Ethics and morality

An increasing number of ethical questions around patient autonomy, bias, and transparency have been raised by the application of artificial intelligence (AI) in medicine [ 249 ]. When patient data from multiple sources are combined, the most vulnerable source determines the overall level of security. Clinical data are frequently the property of particular institutions due to concerns about patient privacy, and there are few mechanisms in place to share data among institutions. Removing personal identifiers and confidential information is frequently inadequate, since an attacker can still draw inferences to recover some of the missing data. The good news is that multicenter data transfer agreements and privacy-preserving distributed DL (DDL) are starting to overcome this roadblock [ 250 , 251 , 252 ]. DDL offers a privacy-protecting mechanism that lets several users collaborate on training a deep model without directly exchanging local datasets. In addition, it is important to determine the level of supervision that doctors must provide and to identify who is accountable for any poor choices made by DL tools. Furthermore, we should educate AI users so that they are knowledgeable consumers of the technology and endeavor to communicate openly and clearly what they should expect in a variety of circumstances. Many of the hazards described above may be reduced by being accessible, having varied demands, and being cautious.
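The idea behind distributed DL described above, in which sites collaborate on one model without exchanging local datasets, can be illustrated with federated averaging: each site trains locally and shares only its model weights, which a coordinator combines in proportion to each site's sample count. Federated averaging is only one such scheme and is used here as an assumption; the source does not say which protocol the cited works use, and the function name and toy numbers are hypothetical.

```python
def federated_average(site_weights, site_sizes):
    """Combine per-site weight vectors into one global vector.

    site_weights: list of weight vectors (lists of floats), one per site.
    site_sizes: number of local training samples at each site.
    Only weights are shared; the patient-level data never leave the sites.
    """
    if not site_weights or len(site_weights) != len(site_sizes):
        raise ValueError("need one sample count per site")
    total = sum(site_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(site_weights[0])
    if any(len(w) != dim for w in site_weights):
        raise ValueError("all weight vectors must have the same length")
    return [
        sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
        for i in range(dim)
    ]


# Two hypothetical hospitals; the second has three times as many patients.
print(federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

Sharing weights instead of records reduces exposure of patient data, although model updates can still leak information, which is why such schemes are often paired with additional safeguards.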

When implementing AI, ethical issues are crucial since unethical data gathering or usage practices might introduce biases into models. These biases can take numerous forms, but they are mostly determined by the data and cohort composition employed by the particular AI systems. Providing and reviewing AI models lacks defined criteria or norms. Identifying the possible biases included in the established systems will be crucial; thus, future studies should fill this knowledge gap to help researchers and physicians.

Clinical integration

As mentioned above, AI has been shown in many studies to improve the correctness of cancer diagnosis. However, a different perspective has been proposed. In one study [ 253 ], the authors systematically evaluated 131 published studies using the QUality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool. They reported that the accuracy of AI in breast cancer screening programs cannot currently be evaluated based on available research, and it is unclear where in the therapeutic pathway AI could be most helpful.

One advantage of LLMs is their capacity to sift through vast volumes of data and provide replies in a conversational and understandable manner. LLMs also have the potential to be used in patient education and consultation, offering patient-friendly information that helps patients understand their medical issues and available treatment choices and facilitates joint decision-making. More crucially, LLMs can contribute to the democratization of medical knowledge by allowing anybody, regardless of location or socioeconomic position, quick access to reliable medical information. However, current LLMs are not yet capable of fully replacing doctors, as their responses may contain errors or omit key points. Although ChatGPT-4.0 was more accurate than the other tools, none of ChatGPT, Google Bard, or the Bing and Google search engines provided 100% accurate answers to all queries [ 242 ]. The much-anticipated Med-PaLM, while promising, is evaluated with multiple-choice questions; however, real life is not multiple choice, and differing clinical symptoms and patient-specific factors make clinical diagnosis more complex. AI such as ChatGPT-4.0 might be helpful for giving broad information and responding to frequently asked questions, but great caution is needed when responding to inquiries from individual patients. It is essential to continuously update AI models to include the most recent medical information.

Currently, almost all relevant AI models have been created to assist in cancer diagnosis using clinical data from the time of development. These clinical data may be derived from patient reports, complaints, or sequencing results. The question is whether an AI model could recommend further tests and treatment modalities, or perhaps aid in prescribing anticancer medication, without relying on clinical data. With the development of multiomics, a variety of data, such as methylation and fragmentomics data [ 254 ], are now being used to train AI models. If the data available to AI models one day grow large enough, will it be possible to predict the probability of cancer occurrence from the data of healthy people alone, or to propose a chemotherapy regimen solely by comparing a cancer patient’s sequencing results with a database? This is an interesting question worth considering. First, the database must be large enough and ethically assembled; second, individuals vary, and it would be irresponsible to treat them by looking only at sequencing data at the genetic or transcriptional level, for example.

However, if it is only in the area of cancer diagnosis, AI models have the potential to identify molecules and biomarkers associated with mutated genes and thus confirm the diagnosis of cancer independently of traditional pathology measurements. Meanwhile, with the advent of wearable and portable medical instruments, AI has shown much potential for the early screening of tumors. Therefore, we think that in the future, AI models have the potential to impact the cancer diagnostic market, but in terms of treatment, they cannot be separated from doctors and clinical data.

What must be realized is that despite the rapid development and promising future of AI, it can never replace clinicians and will only become an important tool to assist them in the future.

In summary, AI has the ability to fundamentally alter cancer treatment and move it closer to the promise of precision oncology. In an era where genomics is being incorporated into health delivery and health data are increasingly digitized, it is anticipated that AI will be used in the construction, verification, and application of decision-support tools to promote precision oncology. We highlighted several promising AI applications in this review, including detection, prognosis, and administration of cancer treatments. Large language models can undeniably assist physicians in their clinical work, but they can never replace them. Important conditions for the general adoption of AI in clinical settings include phenotypically rich data for the development of models and clinical validation of the biological value of AI-generated insights. Finally, clinical validation of AI is required before it may be used in routine patient care.

Availability of data and materials

Not applicable.

Abbreviations

Artificial Intelligence

Area under the curve

Colon cancer

Clear cell renal cell carcinoma

Clinical decision-support systems

Conditional generative adversarial networks

Convolutional neural networks

Colorectal cancer

Computed tomography

Digital breast tomosynthesis

Distributed deep learning

Differentially expressed genes

Differentially expressed proteins

Disease free survival

Directed Graph convolutional network

Deep learning

Deep neural networks

Electronic health record

Glioblastoma multiforme

Graph convolutional network

Generative pretrained transformers

Hematoxylin and eosin

Head and neck cancer

Kidney renal clear cell carcinoma

K-Nearest neighbor

Low-dose computed tomography

Large language models

Lymph node metastasis

Machine learning

Multilabel softmax loss

Magnetic resonance imaging

Mass spectrometry

Microsatellite instability

Noncommunicable diseases

Natural language processing

Non-small cell lung cancer

Overall survival

Positron emission tomography

Random survival forest

Spatial transcriptomics

Support vector machine

The Cancer Genome Atlas

Tumor microenvironment

Tumor node metastasis

United States Medical Licensing Examination

Watson for Oncology

Whole-slide image

Bray F, Jemal A, Grey N, Ferlay J, Forman D. Global cancer transitions according to the human development index (2008–2030): a population-based study. Lancet Oncol. 2012;13(8):790–801.

The Lancet. Global cancer: overcoming the narrative of despondency. Lancet (London, England). 2023;401(10374):319.

Moor J. The Dartmouth College artificial intelligence conference: the next fifty years. AI Mag. 2006;27(4):87–91.

Yu KH, Beam AL, Kohane IS. Artificial intelligence in healthcare. Nat Biomed Eng. 2018;2(10):719–31.

Deo RC. Machine learning in medicine. Circulation. 2015;132(20):1920–30.

Zhou P, Cao Y, Li M, Ma Y, Chen C, Gan X, et al. HCCANet: histopathological image grading of colorectal cancer using CNN based on multichannel fusion attention mechanism. Sci Rep. 2022;12(1):15103.

Gould MK, Huang BZ, Tammemagi MC, Kinar Y, Shiff R. Machine learning for early lung cancer identification using routine clinical and laboratory data. Am J Respir Crit Care Med. 2021;204(4):445–53.

Liang J, He Y, Xie J, Fan X, Liu Y, Wen Q, et al. Mining electronic health records using artificial intelligence: Bibliometric and content analyses for current research status and product conversion. J Biomed Inform. 2023;146: 104480.

Zhang G, Jiang Z, Zhu J, Dai T, He X, Liu X, et al. Innovative integration of augmented reality and optical surface imaging: a coarse-to-precise system for radiotherapy positioning. Med Phys. 2023;50(7):4505–20.

Yao Y, He L, Mei L, Weng Y, Huang J, Wei S, et al. Cell damage evaluation by intelligent imaging flow cytometry. Cytometry Part A: J Int Soci Anal Cytol. 2023;103(8):646–54.

DiSpirito A 3rd, Vu T, Pramanik M, Yao J. Sounding out the hidden data: a concise review of deep learning in photoacoustic imaging. Exp Biol Med (Maywood). 2021;246(12):1355–67.

Silver FH, Mesica A, Gonzalez-Mercedes M, Deshmukh T. Identification of cancerous skin lesions using vibrational optical coherence tomography (VOCT): use of VOCT in conjunction with machine learning to diagnose skin cancer remotely using telemedicine. Cancers. 2022;15(1):156.

Pérez-Cota F, Martínez-Arellano G, La Cavera III S, Hardiman W, Thornton L, Fuentes-Domínguez R, et al. Classification of cancer cells at the sub-cellular level by phonon microscopy using deep learning. Sci Rep. 2023;13(1):16228.

Niehues JM, Quirke P, West NP, Grabsch HI, van Treeck M, Schirris Y, et al. Generalizable biomarker prediction from cancer pathology slides with self-supervised deep learning: a retrospective multi-centric study. Cell reports Medicine. 2023;4(4): 100980.

Rönnau MM, Lepper TW, Amaral LN, Rados PV, Oliveira MM. A CNN-based approach for joint segmentation and quantification of nuclei and NORs in AgNOR-stained images. Comput Methods Programs Biomed. 2023;242: 107788.

Balasubramaniam S, Velmurugan Y, Jaganathan D, Dhanasekaran S. A modified LeNet CNN for breast cancer diagnosis in ultrasound images. Diagnostics (Basel, Switzerland). 2023;13(17):2746.

Tang Z, Li Z, Hou T, Zhang T, Yang B, Su J, et al. SiGra: single-cell spatial elucidation through an image-augmented graph transformer. Nat Commun. 2023;14(1):5618.

Azad R, Kazerouni A, Heidari M, Aghdam EK, Molaei A, Jia Y, et al. Advances in medical image analysis with vision transformers: a comprehensive review. Med Image Anal. 2023;91: 103000.

Li X, Fang X, Yang G, Su S, Zhu L, Yu Z. TransU2-Net: an effective medical image segmentation framework based on transformer and U2-Net. IEEE J Transl Eng Health Med. 2023;11:441–50.

Dascalu A, David EO. Skin cancer detection by deep learning and sound analysis algorithms: a prospective clinical study of an elementary dermoscope. EBioMedicine. 2019;43:107–13.

Walker BN, Rehg JM, Kalra A, Winters RM, Drews P, Dascalu J, et al. Dermoscopy diagnosis of cancerous lesions utilizing dual deep learning algorithms via visual and audio (sonification) outputs: laboratory and prospective observational studies. EBioMedicine. 2019;40:176–83.

Baltrusaitis T, Ahuja C, Morency LP. Multimodal machine learning: a survey and taxonomy. IEEE Trans Pattern Anal Mach Intell. 2019;41(2):423–43.

Mei X, Lee HC, Diao KY, Huang M, Lin B, Liu C, et al. Artificial intelligence-enabled rapid diagnosis of patients with COVID-19. Nat Med. 2020;26(8):1224–8.

Akselrod-Ballin A, Chorev M, Shoshan Y, Spiro A, Hazan A, Melamed R, et al. Predicting breast cancer by applying deep learning to linked health records and mammograms. Radiology. 2019;292(2):331–42.

Zhang K, Liu X, Xu J, Yuan J, Cai W, Chen T, et al. Deep-learning models for the detection and incidence prediction of chronic kidney disease and type 2 diabetes from retinal fundus images. Nat Biomed Eng. 2021;5(6):533–45.

Zhou HY, Yu Y, Wang C, Zhang S, Gao Y, Pan J, et al. A transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics. Nat Biomed Eng. 2023;7(6):743–55.

Moor M, Banerjee O, Abad ZSH, Krumholz HM, Leskovec J, Topol EJ, et al. Foundation models for generalist medical artificial intelligence. Nature. 2023;616(7956):259–65.

Faiella E, Vertulli D, Esperto F, Cordelli E, Soda P, Muraca RM, et al. Quantib prostate compared to an expert radiologist for the diagnosis of prostate cancer on mpMRI: a single-center preliminary study. Tomography (Ann Arbor, Mich). 2022;8(4):2010–9.

Eloy C, Marques A, Pinto J, Pinheiro J, Campelos S, Curado M, et al. Artificial intelligence-assisted cancer diagnosis improves the efficiency of pathologists in prostatic biopsies. Virchows Archiv: Int J Pathol. 2023;482(3):595–604.

Wang JY, Qu V, Hui C, Sandhu N, Mendoza MG, Panjwani N, et al. Stratified assessment of an FDA-cleared deep learning algorithm for automated detection and contouring of metastatic brain tumors in stereotactic radiosurgery. Radiat Oncol (London, England). 2023;18(1):61.

Seager A, Sharp L, Hampton JS, Neilson LJ, Lee TJW, Brand A, et al. Trial protocol for COLO-DETECT: a randomized controlled trial of lesion detection comparing colonoscopy assisted by the GI Genius™ artificial intelligence endoscopy module with standard colonoscopy. Colorectal disease: The Off J Assoc Coloproctol Great Britain and Ireland. 2022;24(10):1227–37.

Glissen Brown JR, Mansour NM, Wang P, Chuchuca MA, Minchenberg SB, Chandnani M, et al. Deep learning computer-aided polyp detection reduces adenoma miss rate: a United States multi-center randomized tandem colonoscopy study (CADeT-CS Trial). Clin Gastroenterol Hepatol: Off Clin Pract J Am Gastroenterol Assoc. 2022;20(7):1499-507.e4.

Eden KB, Ivlev I, Bensching KL, Franta G, Hersh AR, Case J, et al. Use of an online breast cancer risk assessment and patient decision aid in primary care practices. J Women’s Health. 2020;29(6):763–9.

Niehoff JH, Kalaitzidis J, Kroeger JR, Schoenbeck D, Borggrefe J, Michael AE. Evaluation of the clinical performance of an AI-based application for the automated analysis of chest X-rays. Sci Rep. 2023;13(1):3680.

Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS digital health. 2023;2(2): e0000198.

Conant EF, Barlow WE, Herschorn SD, Weaver DL, Beaber EF, Tosteson ANA, et al. Association of digital breast tomosynthesis vs digital mammography with cancer detection and recall rates by age and breast density. JAMA Oncol. 2019;5(5):635–42.

Hofvind S, Holen ÅS, Aase HS, Houssami N, Sebuødegård S, Moger TA, et al. Two-view digital breast tomosynthesis versus digital mammography in a population-based breast cancer screening programme (To-Be): a randomised, controlled trial. Lancet Oncol. 2019;20(6):795–805.

Pattacini P, Nitrosi A, Giorgi Rossi P, Iotti V, Ginocchi V, Ravaioli S, et al. Digital mammography versus digital mammography plus tomosynthesis for breast cancer screening: the Reggio Emilia tomosynthesis randomized trial. Radiology. 2018;288(2):375–85.

Dang PA, Freer PE, Humphrey KL, Halpern EF, Rafferty EA. Addition of tomosynthesis to conventional digital mammography: effect on image interpretation time of screening examinations. Radiology. 2014;270(1):49–56.

Shoshan Y, Bakalo R, Gilboa-Solomon F, Ratner V, Barkan E, Ozery-Flato M, et al. Artificial intelligence for reducing workload in breast cancer screening with digital breast tomosynthesis. Radiology. 2022;303(1):69–77.

Nam JG, Hwang EJ, Kim J, Park N, Lee EH, Kim HJ, et al. AI improves nodule detection on chest radiographs in a health screening population: a randomized controlled trial. Radiology. 2023;307(2): e221894.

Sim Y, Chung MJ, Kotter E, Yune S, Kim M, Do S, et al. Deep convolutional neural network-based software improves radiologist detection of malignant lung nodules on chest radiographs. Radiology. 2020;294(1):199–209.

Yoo H, Kim KH, Singh R, Digumarthy SR, Kalra MK. Validation of a deep learning algorithm for the detection of malignant pulmonary nodules in chest radiographs. JAMA Netw Open. 2020;3(9): e2017135.

Aberle DR, Adams AM, Berg CD, Black WC, Clapp JD, Fagerstrom RM, et al. Reduced lung-cancer mortality with low-dose computed tomographic screening. N Engl J Med. 2011;365(5):395–409.

Lu MT, Raghu VK, Mayrhofer T, Aerts H, Hoffmann U. Deep learning using chest radiographs to identify high-risk smokers for lung cancer screening computed tomography: development and validation of a prediction model. Ann Intern Med. 2020;173(9):704–13.

Raghu VK, Walia AS, Zinzuwadia AN, Goiffon RJ, Shepard JO, Aerts H, et al. Validation of a deep learning-based model to predict lung cancer risk using chest radiographs and electronic medical record data. JAMA Netw Open. 2022;5(12): e2248793.

Huang P, Lin CT, Li Y, Tammemagi MC, Brock MV, Atkar-Khattra S, et al. Prediction of lung cancer risk at follow-up screening with low-dose CT: a training and validation study of a deep learning method. The Lancet Digital health. 2019;1(7):e353–62.

Venkadesh KV, Setio AAA, Schreuder A, Scholten ET, Chung K, Wile MMW, et al. Deep learning for malignancy risk estimation of pulmonary nodules detected at low-dose screening CT. Radiology. 2021;300(2):438–47.

Mikhael PG, Wohlwend J, Yala A, Karstens L, Xiang J, Takigami AK, et al. Sybil: a validated deep learning model to predict future lung cancer risk from a single low-dose chest computed tomography. J Clin Oncol: Off J Am Soci Clin Oncol. 2023;41(12):2191–200.

Yi L, Zhang L, Xu X, Guo J. Multi-label softmax networks for pulmonary nodule classification using unbalanced and dependent categories. IEEE Trans Med Imaging. 2023;42(1):317–28.

Luo X, Song T, Wang G, Chen J, Chen Y, Li K, et al. SCPM-Net: an anchor-free 3D lung nodule detection network using sphere representation and center points matching. Med Image Anal. 2022;75: 102287.

Wang G, Qiu M, Xing X, Zhou J, Yao H, Li M, et al. Lung cancer scRNA-seq and lipidomics reveal aberrant lipid metabolism for early-stage diagnosis. Sci Transl Med. 2022;14(630):eabk2756.

Hollon T, Jiang C, Chowdury A, Nasir-Moin M, Kondepudi A, Aabedi A, et al. Artificial-intelligence-based molecular classification of diffuse gliomas using rapid, label-free optical imaging. Nat Med. 2023;29(4):828–32.

Corley DA, Jensen CD, Marks AR, Zhao WK, Lee JK, Doubeni CA, et al. Adenoma detection rate and risk of colorectal cancer and death. N Engl J Med. 2014;370(14):1298–306.

Sinonquel P, Eelbode T, Hassan C, Antonelli G, Filosofi F, Neumann H, et al. Real-time unblinding for validation of a new CADe tool for colorectal polyp detection. Gut. 2021;70(4):641–3.

Wesp P, Grosu S, Graser A, Maurus S, Schulz C, Knösel T, et al. Deep learning in CT colonography: differentiating premalignant from benign colorectal polyps. Eur Radiol. 2022;32(7):4749–59.

Grosu S, Wesp P, Graser A, Maurus S, Schulz C, Knösel T, et al. Machine learning-based differentiation of benign and premalignant colorectal polyps detected with CT colonography in an asymptomatic screening population: a proof-of-concept study. Radiology. 2021;299(2):326–35.

Troya J, Krenzer A, Flisikowski K, Sudarevic B, Banck M, Hann A, et al. New concept for colonoscopy including side optics and artificial intelligence. Gastrointest Endosc. 2022;95(4):794–8.

Zhang JK, Fanous M, Sobh N, Kajdacsy-Balla A, Popescu G. Automatic colorectal cancer screening using deep learning in spatial light interference microscopy data. Cells. 2022;11(4):716.

Xu H, Tang RSY, Lam TYT, Zhao G, Lau JYW, Liu Y, et al. Artificial intelligence-assisted colonoscopy for colorectal cancer screening: a multicenter randomized controlled trial. Clin Gastroenterol Hepatol: Off Clin Pract J Am Gastroenterol Assoc. 2023;21(2):337-46.e3.

Kudo SE, Ichimasa K, Villard B, Mori Y, Misawa M, Saito S, et al. Artificial intelligence system to determine risk of T1 colorectal cancer metastasis to lymph node. Gastroenterology. 2021;160(4):1075-84.e2.

Areia M, Mori Y, Correale L, Repici A, Bretthauer M, Sharma P, et al. Cost-effectiveness of artificial intelligence for screening colonoscopy: a modelling study. The Lancet Digital health. 2022;4(6):e436–44.

Hassan C, Balsamo G, Lorenzetti R, Zullo A, Antonelli G. Artificial intelligence allows leaving-in-situ colorectal polyps. Clin Gastroenterol Hepatol: Off Clin Pract J Am Gastroenterol Assoc. 2022;20(11):2505-13.e4.

Soares F, Becker K, Anzanello MJ. A hierarchical classifier based on human blood plasma fluorescence for non-invasive colorectal cancer screening. Artif Intell Med. 2017;82:1–10.

Konishi Y, Okumura S, Matsumoto T, Itatani Y, Nishiyama T, Okazaki Y, et al. Development and evaluation of a colorectal cancer screening method using machine learning-based gut microbiota analysis. Cancer Med. 2022;11(16):3194–206.

Ji M, Zhong J, Xue R, Su W, Kong Y, Fei Y, et al. Early detection of cervical cancer by fluorescence lifetime imaging microscopy combined with unsupervised machine learning. Int J Mol Sci. 2022;23(19):11476.

Wang S, Yin Y, Wang D, Wang Y, Jin Y. Interpretability-based multimodal convolutional neural networks for skin lesion diagnosis. IEEE Trans Cybernet. 2022;52(12):12623–37.

Sangers TE, Wakkee M, Kramer-Noels EC, Nijsten T, Lugtenberg M. Views on mobile health apps for skin cancer screening in the general population: an in-depth qualitative exploration of perceived barriers and facilitators. Br J Dermatol. 2021;185(5):961–9.

Alhazmi A, Alhazmi Y, Makrami A, Masmali A, Salawi N, Masmali K, et al. Application of artificial intelligence and machine learning for prediction of oral cancer risk. J Oral Pathol Med: Off Publ Int Assoc Oral Pathol Am Acad Oral Pathol. 2021;50(5):444–50.

Adeoye J, Zheng LW, Thomson P, Choi SW, Su YX. Explainable ensemble learning model improves identification of candidates for oral cancer screening. Oral Oncol. 2023;136: 106278.

Gao Y, Xin L, Lin H, Yao B, Zhang T, Zhou AJ, et al. Machine learning-based automated sponge cytology for screening of oesophageal squamous cell carcinoma and adenocarcinoma of the oesophagogastric junction: a nationwide, multicohort, prospective study. Lancet Gastroenterol Hepatol. 2023;8(5):432–45.

Raab SS, Grzybicki DM. Quality in cancer diagnosis. CA: Cancer J Clin. 2010;60(3):139–65.

Veta M, van Diest PJ, Kornegoor R, Huisman A, Viergever MA, Pluim JP. Automatic nuclei segmentation in H&E stained breast cancer histopathology images. PLoS ONE. 2013;8(7): e70221.

Rezaeilouyeh H, Mahoor MH, Zhang JJ, La Rosa FG, Chang S, Werahera PN. Diagnosis of prostatic carcinoma on multiparametric magnetic resonance imaging using shearlet transform. Annual International Conference of the IEEE Engineering in Medicine and Biology Society. 2014;2014:6442–5.

Kim I, Kang K, Song Y, Kim TJ. Application of artificial intelligence in pathology: trends and challenges. Diagnostics (Basel, Switzerland). 2022;12(11):2794.

Lu MY, Chen TY, Williamson DFK, Zhao M, Shady M, Lipkova J, et al. AI-based pathology predicts origins for cancers of unknown primary. Nature. 2021;594(7861):106–10.

Chen C, Lu MY, Williamson DFK, Chen TY, Schaumberg AJ, Mahmood F. Fast and scalable search of whole-slide images via self-supervised deep learning. Nat Biomed Eng. 2022;6(12):1420–34.

Liu P, Ji L, Ye F, Fu B. AdvMIL: adversarial multiple instance learning for the survival analysis on whole-slide images. Med Image Anal. 2023;91: 103020.

Azevedo Tosta TA, de Faria PR, Neves LA, do Nascimento MZ. Computational normalization of H&E-stained histological images: Progress, challenges and future potential. Artif Intell Med. 2019;95:118–32.

Rana A, Lowe A, Lithgow M, Horback K, Janovitz T, Da Silva A, et al. Use of deep learning to develop and analyze computational hematoxylin and eosin staining of prostate core biopsy images for tumor diagnosis. JAMA Netw Open. 2020;3(5): e205111.

Huang B, Tian S, Zhan N, Ma J, Huang Z, Zhang C, et al. Accurate diagnosis and prognosis prediction of gastric cancer using deep learning on digital pathological images: a retrospective multicentre study. EBioMedicine. 2021;73: 103631.

Shihabuddin AR, Beevi S. Multi CNN based automatic detection of mitotic nuclei in breast histopathological images. Comput Biol Med. 2023;158: 106815.

Schneider G, Schmidt-Supprian M, Rad R, Saur D. Tissue-specific tumorigenesis: context matters. Nat Rev Cancer. 2017;17(4):239–53.

Chang X, Wang J, Zhang G, Yang M, Xi Y, Xi C, et al. Predicting colorectal cancer microsatellite instability with a self-attention-enabled convolutional neural network. Cell reports Medicine. 2023;4(2): 100914.

Bilal M, Raza SEA, Azam A, Graham S, Ilyas M, Cree IA, et al. Development and validation of a weakly supervised deep learning framework to predict the status of molecular pathways and key mutations in colorectal cancer from routine histology images: a retrospective study. The Lancet Digital health. 2021;3(12):e763–72.

Gerwert K, Schörner S, Großerueschkamp F, Kraeft AL, Schuhmacher D, Sternemann C, et al. Fast and label-free automated detection of microsatellite status in early colon cancer using artificial intelligence integrated infrared imaging. European J Canc (Oxford, England: 1990). 2023;182:122–31.

Blessin NC, Yang C, Mandelkow T, Raedler JB, Li W, Bady E, et al. Automated Ki-67 labeling index assessment in prostate cancer using artificial intelligence and multiplex fluorescence immunohistochemistry. J Pathol. 2023;260(1):5–16.

Wang CW, Muzakky H, Lee YC, Lin YJ, Chao TK. Annotation-free deep learning-based prediction of thyroid molecular cancer biomarker BRAF (V600E) from cytological slides. Int J Mol Sci. 2023;24(3):2521.

Abele N, Tiemann K, Krech T, Wellmann A, Schaaf C, Länger F, et al. Noninferiority of artificial intelligence-assisted analysis of Ki-67 and estrogen/progesterone receptor in breast cancer routine diagnostics. Modern Pathol: Off J United States Canad Acad Pathol. 2023;36(3): 100033.

Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115–8.

Khan MS, Alam KN, Dhruba AR, Zunair H, Mohammed N. Knowledge distillation approach towards melanoma detection. Comput Biol Med. 2022;146: 105581.

Adepu AK, Sahayam S, Jayaraman U, Arramraju R. Melanoma classification from dermatoscopy images using knowledge distillation for highly imbalanced data. Comput Biol Med. 2023;154: 106571.

Wang Y, Wang Y, Cai J, Lee TK, Miao C, Wang ZJ. SSD-KD: a self-supervised diverse knowledge distillation method for lightweight skin lesion classification using dermoscopic images. Med Image Anal. 2023;84: 102693.

Marchetti MA, Nazir ZH, Nanda JK, Dusza SW, D’Alessandro BM, DeFazio J, et al. 3D Whole-body skin imaging for automated melanoma detection. J Eur Acad Dermatol Venereol: JEADV. 2023;37(5):945–50.

Ahmedt-Aristizabal D, Nguyen C, Tychsen-Smith L, Stacey A, Li S, Pathikulangara J, et al. Monitoring of pigmented skin lesions using 3D whole body imaging. Comput Methods Programs Biomed. 2023;232: 107451.

Tajerian A, Kazemian M, Tajerian M, Akhavan MA. Design and validation of a new machine-learning-based diagnostic tool for the differentiation of dermatoscopic skin cancer images. PLoS ONE. 2023;18(4): e0284437.

Venugopal V, Joseph J, Vipin Das M, Kumar NM. An EfficientNet-based modified sigmoid transform for enhancing dermatological macro-images of melanoma and nevi skin lesions. Comput Methods Programs Biomed. 2022;222: 106935.

Ain QU, Al-Sahaf H, Xue B, Zhang M. Automatically diagnosing skin cancers from multimodality images using two-stage genetic programming. IEEE Trans Cybernet. 2023;53(5):2727–40.

Kränke T, Tripolt-Droschl K, Röd L, Hofmann-Wellenhof R, Koppitz M, Tripolt M. New AI-algorithms on smartphones to detect skin cancer in a clinical setting-A validation study. PLoS ONE. 2023;18(2): e0280670.

Freeman K, Dinnes J, Chuchu N, Takwoingi Y, Bayliss SE, Matin RN, et al. Algorithm based smartphone apps to assess risk of skin cancer in adults: systematic review of diagnostic accuracy studies. BMJ (Clinical research ed). 2020;368: m127.

Yi Z, Hu S, Lin X, Zou Q, Zou M, Zhang Z, et al. Machine learning-based prediction of invisible intraprostatic prostate cancer lesions on (68) Ga-PSMA-11 PET/CT in patients with primary prostate cancer. Eur J Nucl Med Mol Imaging. 2022;49(5):1523–34.

Gao P, Shan W, Guo Y, Wang Y, Sun R, Cai J, et al. Development and validation of a deep learning model for brain tumor diagnosis and classification using magnetic resonance imaging. JAMA Netw Open. 2022;5(8): e2225608.

Knabe M, Welsch L, Blasberg T, Müller E, Heilani M, Bergen C, et al. Artificial intelligence-assisted staging in Barrett’s carcinoma. Endoscopy. 2022;54(12):1191–7.

Liang S, Dong X, Yang K, Chu Z, Tang F, Ye F, et al. A multi-perspective information aggregation network for automated T-staging detection of nasopharyngeal carcinoma. Phys Med Biol. 2022;67(24): 245007.

Demirjian NL, Varghese BA, Cen SY, Hwang DH, Aron M, Siddiqui I, et al. CT-based radiomics stratification of tumor grade and TNM stage of clear cell renal cell carcinoma. Eur Radiol. 2022;32(4):2552–63.

van der Voort SR, Incekara F, Wijnenga MMJ, Kapsas G, Gahrmann R, Schouten JW, et al. Combined molecular subtyping, grading, and segmentation of glioma using multi-task deep learning. Neuro Oncol. 2023;25(2):279–89.

Xu Y, Klyuzhin I, Harsini S, Ortiz A, Zhang S, Bénard F, et al. Automatic segmentation of prostate cancer metastases in PSMA PET/CT images using deep neural networks with weighted batch-wise dice loss. Comput Biol Med. 2023;158: 106882.

Wang R, Gu Y, Zhang T, Yang J. Fast cancer metastasis location based on dual magnification hard example mining network in whole-slide images. Comput Biol Med. 2023;158: 106880.

Lin H, Chen H, Graham S, Dou Q, Rajpoot N, Heng PA. Fast ScanNet: fast and dense analysis of multi-gigapixel whole-slide images for cancer metastasis detection. IEEE Trans Med Imaging. 2019;38(8):1948–58.

Ehteshami Bejnordi B, Veta M, van Diest PJ, van Ginneken B, Karssemeijer N, Litjens G, et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA. 2017;318(22):2199–210.

Zhao J, Wang H, Zhang Y, Wang R, Liu Q, Li J, et al. Deep learning radiomics model related with genomics phenotypes for lymph node metastasis prediction in colorectal cancer. Radiotherapy Oncol: J Eur Soci Therapeut Radiol Oncol. 2022;167:195–202.

Wu S, Hong G, Xu A, Zeng H, Chen X, Wang Y, et al. Artificial intelligence-based model for lymph node metastases detection on whole slide images in bladder cancer: a retrospective, multicentre, diagnostic study. Lancet Oncol. 2023;24(4):360–70.

Murai H, Kodama T, Maesaka K, Tange S, Motooka D, Suzuki Y, et al. Multiomics identifies the link between intratumor steatosis and the exhausted tumor immune microenvironment in hepatocellular carcinoma. Hepatology (Baltimore, MD). 2023;77(1):77–91.

Sammut SJ, Crispin-Ortuzar M, Chin SF, Provenzano E, Bardwell HA, Ma W, et al. Multi-omic machine learning predictor of breast cancer therapy response. Nature. 2022;601(7894):623–9.

He X, Liu X, Zuo F, Shi H, Jing J. Artificial intelligence-based multi-omics analysis fuels cancer precision medicine. Semin Cancer Biol. 2023;88:187–200.

Srivastava R. Applications of artificial intelligence multiomics in precision oncology. J Cancer Res Clin Oncol. 2023;149(1):503–10.

Zafari N, Bathaei P, Velayati M, Khojasteh-Leylakoohi F, Khazaei M, Fiuji H, et al. Integrated analysis of multi-omics data for the discovery of biomarkers and therapeutic targets for colorectal cancer. Comput Biol Med. 2023;155: 106639.

Stamatoyannopoulos JA. What does our genome encode? Genome Res. 2012;22(9):1602–11.

Saravanan KA, Panigrahi M, Kumar H, Rajawat D, Nayak SS, Bhushan B, et al. Role of genomics in combating COVID-19 pandemic. Gene. 2022;823: 146387.

Chen HZ, Bonneville R, Roychowdhury S. Implementing precision cancer medicine in the genomic era. Semin Cancer Biol. 2019;55:16–27.

Qiu YL, Zheng H, Devos A, Selby H, Gevaert O. A meta-learning approach for genomic survival analysis. Nat Commun. 2020;11(1):6350.

Sahraeian SME, Fang LT, Karagiannis K, Moos M, Smith S, Santana-Quintero L, et al. Achieving robust somatic mutation detection with deep learning models derived from reference data sets of a cancer sample. Genome Biol. 2022;23(1):12.

Sun JX, He Y, Sanford E, Montesion M, Frampton GM, Vignot S, et al. A computational approach to distinguish somatic vs. germline origin of genomic alterations from deep sequencing of cancer specimens without a matched normal. PLoS Computat Biol. 2018;14(2):e1005965.

Gupta P, Jindal A, Ahuja G, Sengupta D. A new deep learning technique reveals the exclusive functional contributions of individual cancer mutations. J Biol Chem. 2022;298(8):102177.

Sengupta A, Naresh G, Mishra A, Parashar D, Narad P. Proteome analysis using machine learning approaches and its applications to diseases. Adv Protein Chem Struct Biol. 2021;127:161–216.

Liu Y, Sethi NS, Hinoue T, Schneider BG, Cherniack AD, Sanchez-Vega F, et al. Comparative molecular analysis of gastrointestinal adenocarcinomas. Cancer Cell. 2018;33(4):721-35.e8.

Singh MP, Rai S, Pandey A, Singh NK, Srivastava S. Molecular subtypes of colorectal cancer: an emerging therapeutic opportunity for personalized medicine. Genes & diseases. 2021;8(2):133–45.

Moreno V, Sanz-Pamplona R. Altered pathways and colorectal cancer prognosis. BMC Med. 2015;13:76.

Ding K, Zhou M, Wang H, Zhang S, Metaxas DN. Spatially aware graph neural networks and cross-level molecular profile prediction in colon cancer histopathology: a retrospective multi-cohort study. The Lancet Digital health. 2022;4(11):e787–95.

Li N, Meng G, Yang C, Li H, Liu L, Wu Y, et al. Changes in epigenetic information during the occurrence and development of gastric cancer. Int J Biochem Cell Biol. 2022;153: 106315.

Zhou X, Chai H, Zhao H, Luo CH, Yang Y. Imputing missing RNA-sequencing data from DNA methylation by using a transfer learning-based neural network. GigaScience. 2020;9(7):giaa076.

Huang Z, Wang J, Yan Z, Guo M. Differentially expressed genes prediction by multiple self-attention on epigenetics data. Brief Bioinform. 2022;23(3):bbac117.

Tsimberidou AM, Fountzilas E, Bleris L, Kurzrock R. Transcriptomics and solid tumors: the next frontier in precision cancer medicine. Semin Cancer Biol. 2022;84:50–9.

Jha A, Quesnel-Vallières M, Wang D, Thomas-Tikhonenko A, Lynch KW, Barash Y. Identifying common transcriptome signatures of cancer by interpreting deep learning models. Genome Biol. 2022;23(1):117.

Weitz P, Wang Y, Kartasalo K, Egevad L, Lindberg J, Grönberg H, et al. Transcriptome-wide prediction of prostate cancer gene expression from histopathology images using co-expression-based convolutional neural networks. Bioinformatics (Oxford, England). 2022;38(13):3462–9.

He B, Bergenstråhle L, Stenbeck L, Abid A, Andersson A, Borg Å, et al. Integrating spatial gene expression and breast tumour morphology via deep learning. Nat Biomed Eng. 2020;4(8):827–34.

Schmauch B, Romagnoni A, Pronier E, Saillard C, Maillé P, Calderaro J, et al. A deep learning model to predict RNA-Seq expression of tumours from whole slide images. Nat Commun. 2020;11(1):3877.

Quail DF, Joyce JA. Microenvironmental regulation of tumor progression and metastasis. Nat Med. 2013;19(11):1423–37.

Ståhl PL, Salmén F, Vickovic S, Lundmark A, Navarro JF, Magnusson J, et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science (New York, NY). 2016;353(6294):78–82.

Lewis SM, Asselin-Labat ML, Nguyen Q, Berthelet J, Tan X, Wimmer VC, et al. Spatial omics and multiplexed imaging to explore cancer biology. Nat Methods. 2021;18(9):997–1012.

Zaitsev A, Chelushkin M, Dyikanov D, Cheremushkin I, Shpak B, Nomie K, et al. Precise reconstruction of the TME using bulk RNA-seq and a machine learning algorithm trained on artificial transcriptomes. Cancer Cell. 2022;40(8):879-94.e16.

Bergenstråhle L, He B, Bergenstråhle J, Abalo X, Mirzazadeh R, Thrane K, et al. Super-resolved spatial transcriptomics by deep data fusion. Nat Biotechnol. 2022;40(4):476–9.

Hu J, Coleman K, Zhang D, Lee EB, Kadara H, Wang L, et al. Deciphering tumor ecosystems at super resolution from spatial transcriptomics with TESLA. Cell Syst. 2023;14(5):404-17.e4.

Zhang H, Zhang N, Wu W, Zhou R, Li S, Wang Z, et al. Machine learning-based tumor-infiltrating immune cell-associated lncRNAs for predicting prognosis and immunotherapy response in patients with glioblastoma. Brief Bioinform. 2022;23(6):bbac386.

Zhou M, Zhang Z, Bao S, Hou P, Yan C, Su J, et al. Computational recognition of lncRNA signature of tumor-infiltrating B lymphocytes with potential implications in prognosis and immunotherapy of bladder cancer. Brief Bioinform. 2021;22(3):bbaa047.

Zhang N, Zhang H, Wu W, Zhou R, Li S, Wang Z, et al. Machine learning-based identification of tumor-infiltrating immune cell-associated lncRNAs for improving outcomes and immunotherapy responses in patients with low-grade glioma. Theranostics. 2022;12(13):5931–48.

Korfiati A, Grafanaki K, Kyriakopoulos GC, Skeparnias I, Georgiou S, Sakellaropoulos G, et al. Revisiting miRNA association with melanoma recurrence and metastasis from a machine learning point of view. Int J Mol Sci. 2022;23(3):1299.

Hosseiniyan Khatibi SM, Ardalan M, Teshnehlab M, Vahed SZ, Pirmoradi S. Panels of mRNAs and miRNAs for decoding molecular mechanisms of renal cell carcinoma (RCC) subtypes utilizing artificial intelligence approaches. Sci Rep. 2022;12(1):16393.

Fenaux P, Mufti GJ, Hellstrom-Lindberg E, Santini V, Finelli C, Giagounidis A, et al. Efficacy of azacitidine compared with that of conventional care regimens in the treatment of higher-risk myelodysplastic syndromes: a randomised, open-label, phase III study. Lancet Oncol. 2009;10(3):223–32.

Keyl P, Bischoff P, Dernbach G, Bockmayr M, Fritz R, Horst D, et al. Single-cell gene regulatory network prediction by explainable AI. Nucleic Acids Res. 2023;51(4): e20.

Ogunleye AZ, Piyawajanusorn C, Gonçalves A, Ghislat G, Ballester PJ. Interpretable machine learning models to predict the resistance of breast cancer patients to doxorubicin from their microRNA profiles. Adv Sci (Weinheim, Baden-Wurttemberg, Germany). 2022;9(24):e2201501.

Wang S, Zhu H, Zhou H, Cheng J, Yang H. MSpectraAI: a powerful platform for deciphering proteome profiling of multi-tumor mass spectrometry data by using deep neural networks. BMC Bioinformatics. 2020;21(1):439.

Dong H, Liu Y, Zeng WF, Shu K, Zhu Y, Chang C. A deep learning-based tumor classifier directly using MS raw data. Proteomics. 2020;20(21–22): e1900344.

Ludwig C, Gillet L, Rosenberger G, Amon S, Collins BC, Aebersold R. Data-independent acquisition-based SWATH-MS for quantitative proteomics: a tutorial. Mol Syst Biol. 2018;14(8): e8126.

López-Sánchez LM, Jiménez-Izquierdo R, Peñarando J, Mena R, Guil-Luna S, Toledano M, et al. SWATH-based proteomics reveals processes associated with immune evasion and metastasis in poor prognosis colorectal tumours. J Cell Mol Med. 2019;23(12):8219–32.

Nwaokorie A, Fey D. Personalised medicine for colorectal cancer using mechanism-based machine learning models. Int J Mol Sci. 2021;22(18):9970.

Franciosa G, Smits JGA, Minuzzo S, Martinez-Val A, Indraccolo S, Olsen JV. Proteomics of resistance to Notch1 inhibition in acute lymphoblastic leukemia reveals targetable kinase signatures. Nat Commun. 2021;12(1):2507.

Ye X, Yang Y, Zhou J, Xu L, Wu L, Huang P, et al. Combinatory strategy using nanoscale proteomics and machine learning for T cell subtyping in peripheral blood of single multiple myeloma patients. Anal Chim Acta. 2021;1173: 338672.

Liang CA, Chen L, Wahed A, Nguyen AND. Proteomics analysis of FLT3-ITD mutation in acute myeloid leukemia using deep learning neural network. Ann Clin Lab Sci. 2019;49(1):119–26.

Kim H, Kim Y, Han B, Jang JY, Kim Y. Clinically applicable deep learning algorithm using quantitative proteomic data. J Proteome Res. 2019;18(8):3195–202.

Deeb SJ, Tyanova S, Hummel M, Schmidt-Supprian M, Cox J, Mann M. Machine learning-based classification of diffuse large B-cell lymphoma patients by their protein expression profiles. Mol Cell Prot: MCP. 2015;14(11):2947–60.

Wishart DS. Metabolomics for investigating physiological and pathophysiological processes. Physiol Rev. 2019;99(4):1819–75.

DePeaux K, Delgoffe GM. Metabolic barriers to cancer immunotherapy. Nat Rev Immunol. 2021;21(12):785–97.

Agarwala PK, Aneja R, Kapoor S. Lipidomic landscape in cancer: actionable insights for membrane-based therapy and diagnoses. Med Res Rev. 2022;42(2):983–1018.

Rodrigues J, Amin A, Raghushaker CR, Chandra S, Joshi MB, Prasad K, et al. Exploring photoacoustic spectroscopy-based machine learning together with metabolomics to assess breast tumor progression in a xenograft model ex vivo. Laboratory Invest: J Tech Meth Pathol. 2021;101(7):952–65.

Murata T, Yanagisawa T, Kurihara T, Kaneko M, Ota S, Enomoto A, et al. Salivary metabolomics with alternative decision tree-based machine learning methods for breast cancer discrimination. Breast Cancer Res Treat. 2019;177(3):591–601.

Ishii H, Saitoh M, Sakamoto K, Sakamoto K, Saigusa D, Kasai H, et al. Lipidome-based rapid diagnosis with machine learning for detection of TGF-β signalling activated area in head and neck cancer. Br J Cancer. 2020;122(7):995–1004.

Tian M, Lin Z, Wang X, Yang J, Zhao W, Lu H, et al. Pure ion chromatograms combined with advanced machine learning methods improve accuracy of discriminant models in LC-MS-based untargeted metabolomics. Molecules (Basel, Switzerland). 2021;26(9):2715.

Ma Y, Zhang P, Wang F, Liu W, Yang J, Qin H. An integrated proteomics and metabolomics approach for defining oncofetal biomarkers in the colorectal cancer. Ann Surg. 2012;255(4):720–30.

Zhou J, Ji N, Wang G, Zhang Y, Song H, Yuan Y, et al. Metabolic detection of malignant brain gliomas through plasma lipidomic analysis and support vector machine-based machine learning. EBioMedicine. 2022;81: 104097.

Yuan Y, Zhao Z, Xue L, Wang G, Song H, Pang R, et al. Identification of diagnostic markers and lipid dysregulation in oesophageal squamous cell carcinoma through lipidomic analysis and machine learning. Br J Cancer. 2021;125(3):351–7.

Wang H, Yin Y, Zhu ZJ. Encoding LC-MS-based untargeted metabolomics data into images toward AI-based clinical diagnosis. Anal Chem. 2023;95(16):6533–41.

Huang L, Wang L, Hu X, Chen S, Tao Y, Su H, et al. Machine learning of serum metabolic patterns encodes early-stage lung adenocarcinoma. Nat Commun. 2020;11(1):3556.

Manzi M, Palazzo M, Knott ME, Beauseroy P, Yankilevich P, Giménez MI, et al. Coupled mass-spectrometry-based lipidomics machine learning approach for early detection of clear cell renal cell carcinoma. J Proteome Res. 2021;20(1):841–57.

Wallace PW, Conrad C, Brückmann S, Pang Y, Caleiras E, Murakami M, et al. Metabolomics, machine learning and immunohistochemistry to predict succinate dehydrogenase mutational status in phaeochromocytomas and paragangliomas. J Pathol. 2020;251(4):378–87.

Alakwaa FM, Chaudhary K, Garmire LX. Deep learning accurately predicts estrogen receptor status in breast cancer metabolomics data. J Proteome Res. 2018;17(1):337–47.

Yang J, Chen Y, Jing Y, Green MR, Han L. Advancing CAR T cell therapy through the use of multidimensional omics data. Nat Rev Clin Oncol. 2023;20(4):211–28.

Choi JM, Chae H. moBRCA-net: a breast cancer subtype classification framework based on multi-omics attention neural networks. BMC Bioinform. 2023;24(1):169.

Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA Jr, Kinzler KW. Cancer genome landscapes. Science (New York, NY). 2013;339(6127):1546–58.

Sanders LM, Chandra R, Zebarjadi N, Beale HC, Lyle AG, Rodriguez A, et al. Machine learning multi-omics analysis reveals cancer driver dysregulation in pan-cancer cell lines compared to primary tumors. Commun Biol. 2022;5(1):1367.

Zhang SW, Xu JY, Zhang T. DGMP: identifying cancer driver genes by jointing DGCN and MLP from multi-omics genomic data. Gen Proteom Bioinform. 2022;20(5):928–38.

Zhao W, Gu X, Chen S, Wu J, Zhou Z. MODIG: integrating multi-omics and multi-dimensional gene network for cancer driver gene identification based on graph attention network model. Bioinformatics (Oxford, England). 2022;38(21):4901–7.

Yang H, Gan L, Chen R, Li D, Zhang J, Wang Z. From multi-omics data to the cancer druggable gene discovery: a novel machine learning-based approach. Brief Bioinform. 2023;24(1):bbca528.

Sharma A, Lysenko A, Boroevich KA, Tsunoda T. DeepInsight-3D architecture for anti-cancer drug response prediction with deep-learning on multi-omics. Sci Rep. 2023;13(1):2483.

Khadirnaikar S, Shukla S, Prasanna SRM. Machine learning based combination of multi-omics data for subgroup identification in non-small cell lung cancer. Sci Rep. 2023;13(1):4636.

Park MK, Lim JM, Jeong J, Jang Y, Lee JW, Lee JC, et al. Deep-learning algorithm and concomitant biomarker identification for NSCLC prediction using multi-omics data integration. Biomolecules. 2022;12(12):1839.

Lococo F, Boldrini L, Diepriye CD, Evangelista J, Nero C, Flamini S, et al. Lung cancer multi-omics digital human avatars for integrating precision medicine into clinical practice: the LANTERN study. BMC Cancer. 2023;23(1):540.

Chen CC, Chu PY, Lin HY. Supervised learning and multi-omics integration reveals clinical significance of inner membrane mitochondrial protein (IMMT) in prognostic prediction, tumor immune microenvironment and precision medicine for kidney renal clear cell carcinoma. Int J Mol Sci. 2023;24(10):8807.

Zhu J, Kong W, Huang L, Bi S, Jiao X, Zhu S. Identification of immunotherapy and chemotherapy-related molecular subtypes in colon cancer by integrated multi-omics data analysis. Front Immunol. 2023;14:1142609.

Li Y, Wu Y, Huang M, Zhang Y, Bai Z. Attention-guided multi-scale learning network for automatic prostate and tumor segmentation on MRI. Comput Biol Med. 2023;165: 107374.

Wang J, Peng Y, Jing S, Han L, Li T, Luo J. A deep-learning approach for segmentation of liver tumors in magnetic resonance imaging using UNet+. BMC Cancer. 2023;23(1):1060.

Vermeulen C, Pagès-Gallego M, Kester L, Kranendonk MEG, Wesseling P, Verburg N, et al. Ultra-fast deep-learned CNS tumour classification during surgery. Nature. 2023;622(7984):842–9.

Raju S, Peddireddy Veera VR. Classification of brain tumours from MRI images using deep learning-enabled hybrid optimization algorithm. Network (Bristol, England). 2023;34(4):408–37.

Wong CC, Li W, Chan B, Yu J. Epigenomic biomarkers for prognostication and diagnosis of gastrointestinal cancers. Semin Cancer Biol. 2019;55:90–105.

Huang HH, Liang Y. A novel cox proportional hazards model for high-dimensional genomic data in cancer prognosis. IEEE/ACM Trans Comput Biol Bioinf. 2021;18(5):1821–30.

Tian T, Sun J. Variable selection for nonparametric additive Cox model with interval-censored data. Biometr J Biometrische Zeitschrift. 2023;65(1): e2100310.

Tong R, Zhu Z, Ling J. Comparison of linear and non-linear machine learning models for time-dependent readmission or mortality prediction among hospitalized heart failure patients. Heliyon. 2023;9(5): e16068.

Baralou V, Kalpourtzi N, Touloumi G. Individual risk prediction: comparing random forests with Cox proportional-hazards model by a simulation study. Biometrical J Biometrische Zeitschrift. 2022;65(6):2100380.

Fanizzi A, Pomarico D, Rizzo A, Bove S, Comes MC, Didonna V, et al. Machine learning survival models trained on clinical data to identify high risk patients with hormone responsive HER2 negative breast cancer. Sci Rep. 2023;13(1):8575.

Li X, Zhai Z, Ding W, Chen L, Zhao Y, Xiong W, et al. An artificial intelligence model to predict survival and chemotherapy benefits for gastric cancer patients after gastrectomy development and validation in international multicenter cohorts. Int J Surg (London, England). 2022;105: 106889.

Afrash MR, Mirbagheri E, Mashoufi M, Kazemi-Arpanahi H. Optimizing prognostic factors of five-year survival in gastric cancer patients using feature selection techniques with machine learning algorithms: a comparative study. BMC Med Inform Decis Mak. 2023;23(1):54.

Arya N, Saha S, Mathur A, Saha S. Improving the robustness and stability of a machine learning model for breast cancer prognosis through the use of multi-modal classifiers. Sci Rep. 2023;13(1):4079.

Kim Y, Kim KH, Park J, Yoon HI, Sung W. Prognosis prediction for glioblastoma multiforme patients using machine learning approaches: development of the clinically applicable model. Radiother Oncol: J Eur Soci Therap Radiol Oncol. 2023;183: 109617.

Lv W, Zhou Z, Peng J, Peng L, Lin G, Wu H, et al. Functional-structural sub-region graph convolutional network (FSGCN): application to the prognosis of head and neck cancer with PET/CT imaging. Comput Methods Progr Biomed. 2023;230: 107341.

Chen S, Xiang J, Wang X, Zhang J, Yang S, Yang W, et al. Deep learning-based pathology signature could reveal lymph node status and act as a novel prognostic marker across multiple cancer types. Br J Cancer. 2023;129(1):46–53.

Lee W, Park HJ, Lee HJ, Jun E, Song KB, Hwang DW, et al. Preoperative data-based deep learning model for predicting postoperative survival in pancreatic cancer patients. Int J Surg (London, England). 2022;105: 106851.

Khazaee Fadafen M, Rezaee K. Ensemble-based multi-tissue classification approach of colorectal cancer histology images using a novel hybrid deep learning framework. Sci Rep. 2023;13(1):8823.

Li C, Liu M, Zhang Y, Wang Y, Li J, Sun S, et al. Novel models by machine learning to predict prognosis of breast cancer brain metastases. J Transl Med. 2023;21(1):404.

Li J, Liang Y, Zhao X, Wu C. Integrating machine learning algorithms to systematically assess reactive oxygen species levels to aid prognosis and novel treatments for triple -negative breast cancer patients. Front Immunol. 2023;14:1196054.

Verghese G, Li M, Liu F, Lohan A, Kurian NC, Meena S, et al. Multiscale deep learning framework captures systemic immune features in lymph nodes predictive of triple negative breast cancer outcome in large-scale studies. J Pathol. 2023;260(4):376–89.

Li J, Qiao H, Wu F, Sun S, Feng C, Li C, et al. A novel hypoxia- and lactate metabolism-related signature to predict prognosis and immunotherapy responses for breast cancer by integrating machine learning and bioinformatic analyses. Front Immunol. 2022;13: 998140.

Wang Y, Acs B, Robertson S, Liu B, Solorzano L, Wählby C, et al. Improved breast cancer histological grading using deep learning. Ann Oncol: Off J Eur Soci Med Oncol. 2022;33(1):89–98.

Ding H, Feng Y, Huang X, Xu J, Zhang T, Liang Y, et al. Deep learning-based classification and spatial prognosis risk score on whole-slide images of lung adenocarcinoma. Histopathology. 2023;83(2):211–28.

She Y, Jin Z, Wu J, Deng J, Zhang L, Su H, et al. Development and validation of a deep learning model for non-small cell lung cancer survival. JAMA Netw Open. 2020;3(6): e205842.

Hosny A, Parmar C, Coroller TP, Grossmann P, Zeleznik R, Kumar A, et al. Deep learning for lung cancer prognostication: a retrospective multi-cohort radiomics study. PLoS Med. 2018;15(11): e1002711.

Finn CB, Sharpe JE, Tong JK, Kaufman EJ, Wachtel H, Aarons CB, et al. Development of a machine learning model to identify colorectal cancer stage in medicare claims. JCO Clin Cancer Inform. 2023;7: e2300003.

Kleppe A, Skrede OJ, De Raedt S, Hveem TS, Askautrud HA, Jacobsen JE, et al. A clinical decision support system optimising adjuvant chemotherapy for colorectal cancers by integrating deep learning and pathological staging markers: a development and validation study. Lancet Oncol. 2022;23(9):1221–32.

Bertsimas D, Margonis GA, Sujichantararat S, Boerner T, Ma Y, Wang J, et al. Using artificial intelligence to find the optimal margin width in hepatectomy for colorectal cancer liver metastases. JAMA Surg. 2022;157(8): e221819.

Skrede OJ, De Raedt S, Kleppe A, Hveem TS, Liestøl K, Maddison J, et al. Deep learning for prediction of colorectal cancer outcome: a discovery and validation study. Lancet (London, England). 2020;395(10221):350–60.

Deng S, Ding J, Wang H, Mao G, Sun J, Hu J, et al. Deep learning-based radiomic nomograms for predicting Ki67 expression in prostate cancer. BMC Cancer. 2023;23(1):638.

Saito S, Sakamoto S, Higuchi K, Sato K, Zhao X, Wakai K, et al. Machine-learning predicts time-series prognosis factors in metastatic prostate cancer patients treated with androgen deprivation therapy. Sci Rep. 2023;13(1):6325.

Lee C, Light A, Alaa A, Thurtle D, van der Schaar M, Gnanapragasam VJ. Application of a novel machine learning framework for predicting non-metastatic prostate cancer-specific mortality in men using the Surveillance, Epidemiology, and End Results (SEER) database. Lancet Digital health. 2021;3(3):e158–65.

Nimgaonkar V, Krishna V, Krishna V, Tiu E, Joshi A, Vrabac D, et al. Development of an artificial intelligence-derived histologic signature associated with adjuvant gemcitabine treatment outcomes in pancreatic cancer. Cell Report Med. 2023;4(4): 101013.

Li J, Huang L, Liao C, Liu G, Tian Y, Chen S. Two machine learning-based nomogram to predict risk and prognostic factors for liver metastasis from pancreatic neuroendocrine tumors: a multicenter study. BMC Cancer. 2023;23(1):529.

Aung TN, Shafi S, Wilmott JS, Nourmohammadi S, Vathiotis I, Gavrielatou N, et al. Objective assessment of tumor infiltrating lymphocytes as a prognostic marker in melanoma using machine learning algorithms. EBioMedicine. 2022;82: 104143.

Guan X, Lu N, Zhang J. Computed tomography-based deep learning nomogram can accurately predict lymph node metastasis in gastric cancer. Dig Dis Sci. 2023;68(4):1473–81.

Zhang X, Gleber-Netto FO, Wang S, Martins-Chaves RR, Gomez RS, Vigneswaran N, et al. Deep learning-based pathology image analysis predicts cancer progression risk in patients with oral leukoplakia. Cancer Med. 2023;12(6):7508–18.

Singh T, Malik G, Someshwar S, Le HTT, Polavarapu R, Chavali LN, et al. Machine learning heuristics on gingivobuccal cancer gene datasets reveals key candidate attributes for prognosis. Genes. 2022;13(12):2379.

Cricelli I, Marconi E, Lapi F. Clinical decision support system (CDSS) in primary care: from pragmatic use to the best approach to assess their benefit/risk profile in clinical practice. Curr Med Res Opin. 2022;38(5):827–9.

Yun HJ, Kim HJ, Kim SY, Lee YS, Lim CY, Chang HS, et al. Adequacy and effectiveness of watson for oncology in the treatment of thyroid carcinoma. Front Endocrinol. 2021;12: 585364.

Yu SH, Kim MS, Chung HS, Hwang EC, Jung SI, Kang TW, et al. Early experience with Watson for Oncology: a clinical decision-support system for prostate cancer treatment recommendations. World J Urol. 2021;39(2):407–13.

Liu C, Liu X, Wu F, Xie M, Feng Y, Hu C. Using artificial intelligence (watson for oncology) for treatment recommendations amongst chinese patients with lung cancer: feasibility study. J Med Internet Res. 2018;20(9): e11087.

Liu Y, Huo X, Li Q, Li Y, Shen G, Wang M, et al. Watson for oncology decision system for treatment consistency study in breast cancer. Clin Exper Med. 2022;23(5):1649–57.

Somashekhar SP, Sepúlveda MJ, Puglielli S, Norden AD, Shortliffe EH, Rohit Kumar C, et al. Watson for Oncology and breast cancer treatment recommendations: agreement with an expert multidisciplinary tumor board. Ann Oncol: Off J Eur Soci Med Oncol. 2018;29(2):418–23.

Zhang T, Tan T, Wang X, Gao Y, Han L, Balkenende L, et al. RadioLOGIC, a healthcare model for processing electronic health records and decision-making in breast disease. Cell Reports Medicine. 2023;4(8): 101131.

Chen RJ, Lu MY, Williamson DFK, Chen TY, Lipkova J, Noor Z, et al. Pan-cancer integrative histology-genomic analysis via multimodal deep learning. Cancer Cell. 2022;40(8):865-78.e6.

Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nature Med. 2023;29(8):1930–40.

Kehl KL, Xu W, Lepisto E, Elmarakeby H, Hassett MJ, Van Allen EM, et al. Natural language processing to ascertain cancer outcomes from medical oncologist notes. JCO Clin Canc Inform. 2020;4:680–90.

Savova GK, Danciu I, Alamudun F, Miller T, Lin C, Bitterman DS, et al. Use of natural language processing to extract clinical cancer phenotypes from electronic medical records. Can Res. 2019;79(21):5463–70.

Remedios D, Remedios A. Transformers, codes and labels: large language modelling for natural language processing in clinical radiology. Eur Radiol. 2023;33(6):4226–7.

Tan R, Lin Q, Low GH, Lin R, Goh TC, Chang CCE, et al. Inferring cancer disease response from radiology reports using large language models with data augmentation and prompting. J Am Med Inform Assoc: JAMIA. 2023;30(10):1657–64.

Rahsepar AA, Tavakoli N, Kim GHJ, Hassani C, Abtin F, Bedayat A. How AI responds to common lung cancer questions: ChatGPT vs google bard. Radiology. 2023;307(5): e230922.

Yeo YH, Samaan JS, Ng WH, Ting PS, Trivedi H, Vipani A, et al. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin Mol Hepatol. 2023;29(3):721–32.

Zhu L, Mou W, Chen R. Can the ChatGPT and other large language models with internet-connected database solve the questions and concerns of patient with prostate cancer and help democratize medical knowledge? J Transl Med. 2023;21(1):269.

Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172–80.

van der Laak J, Litjens G, Ciompi F. Deep learning in histopathology: the path to the clinic. Nat Med. 2021;27(5):775–84.

Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 2019;17(1):195.

Bae S, Choi H, Lee DS. Discovery of molecular features underlying the morphological landscape by integrating spatial transcriptomic data with deep features of tissue images. Nucleic Acids Res. 2021;49(10): e55.

Weidener L, Fischer M. Teaching AI ethics in medical education: a scoping review of current literature and practices. Perspectives on medical education. 2023;12(1):399–410.

Tian Y, Wang S, Xiong J, Bi R, Zhou Z, Bhuiyan MZA. Robust and privacy-preserving decentralized deep federated learning training: focusing on digital healthcare applications. In: IEEE/ACM Transactions on computational biology and bioinformatics. 2023;pp.

Kumar R, Kumar J, Khan AA, Zakria, Ali H, Bernard CM, et al. Blockchain and homomorphic encryption based privacy-preserving model aggregation for medical images. Comput Med Imag Graph: Off J Comput Med Imag Soci. 2022;102:102139.

Ali A, Almaiah MA, Hajjej F, Pasha MF, Fang OH, Khan R, et al. An Industrial IoT-based blockchain-enabled secure searchable encryption approach for healthcare systems using neural network. Sensors (Basel, Switzerland). 2022;22(2):572.

Freeman K, Geppert J, Stinton C, Todkill D, Johnson S, Clarke A, et al. Use of artificial intelligence for image analysis in breast cancer screening programmes: systematic review of test accuracy. BMJ (Clinical research ed). 2021;374: n1872.

Mathios D, Johansen JS, Cristiano S, Medina JE, Phallen J, Larsen KR, et al. Detection and characterization of lung cancer using cell-free DNA fragmentomes. Nat Commun. 2021;12(1):5060.


Acknowledgements

This study was jointly supported by the National Natural Science Foundation of China (U21A20374 and 82072698), Shanghai Municipal Science and Technology Major Project (21JC1401500), Scientific Innovation Project of Shanghai Education Committee (2019-01-07-00-07-E00057), and Natural Science Foundation of Shanghai (23ZR1479300).

Author information

Chaoyi Zhang and Jin Xu contributed equally to this work.

Authors and Affiliations

Department of Pancreatic Surgery, Fudan University Shanghai Cancer Center, No. 270 Dong’An Road, Shanghai, 200032, People’s Republic of China

Chaoyi Zhang, Jin Xu, Rong Tang, Jianhui Yang, Wei Wang, Xianjun Yu & Si Shi

Department of Oncology, Shanghai Medical College, Fudan University, Shanghai, 200032, People’s Republic of China

Shanghai Pancreatic Cancer Institute, No. 399 Lingling Road, Shanghai, 200032, People’s Republic of China

Pancreatic Cancer Institute, Fudan University, Shanghai, 200032, People’s Republic of China


Contributions

CZ and JX collected the related studies and drafted the manuscript. RT, JY and WW participated in the design of the review. XY and SS initiated the study and revised the manuscript. The authors read and approved the final manuscript.

Corresponding authors

Correspondence to Xianjun Yu or Si Shi.

Ethics declarations

Ethics approval and consent to participate, competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Zhang, C., Xu, J., Tang, R. et al. Novel research and future prospects of artificial intelligence in cancer diagnosis and treatment. J Hematol Oncol 16, 114 (2023). https://doi.org/10.1186/s13045-023-01514-5


Received: 07 October 2023

Accepted: 20 November 2023

Published: 27 November 2023

DOI: https://doi.org/10.1186/s13045-023-01514-5


Journal of Hematology & Oncology

ISSN: 1756-8722

research abstracts dataset

Robotics and Autonomous Systems Lab at Vanderbilt

Stress detection of autistic adults during simulated job interviews using a novel physiological dataset and machine learning.

The interview process has been identified as one of the major barriers to employment of autistic individuals, which contributes to the staggering rate of under- and unemployment of autistic adults. Decreasing stress during the interview has been shown to improve interview performance. However, to provide useful insights on stress to both interviewees and interviewers, stress must first be measured effectively. This work explores physiological stress detection through wearable sensing as a means of obtaining quantitative stress measures from young autistic adults undergoing a virtual simulated interview, using supervised machine learning techniques. Several supervised learning models were explored: Elastic Net Regression performed best among the individual models, with an accuracy of 84.8%, while Support Vector Regression models evaluated with leave-one-out cross validation had a group accuracy of 75.4%. The predictions from the stress model were used with data visualization techniques to provide insights on the interview process from both a group and an individual viewpoint, showing that stress trends can be found and evaluated using the stress model. This work also addresses a major gap in the physiological stress detection literature by presenting a novel dataset of physiological data and ground truth labels for 15 autistic young adults undergoing a simulated interview.
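The leave-one-out cross validation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the nearest-mean classifier stands in for their Elastic Net and Support Vector Regression models, and the function names and toy data are hypothetical, chosen only to show how each sample is held out once while a model is fitted on the rest.

```python
from statistics import mean


def nearest_mean_predict(train, x):
    """Predict the label whose mean training feature is closest to x."""
    by_label = {}
    for feature, label in train:
        by_label.setdefault(label, []).append(feature)
    return min(by_label, key=lambda lab: abs(mean(by_label[lab]) - x))


def leave_one_out_accuracy(samples):
    """Hold out each sample once, fit on the rest, score the held-out prediction."""
    correct = 0
    for i, (feature, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]  # every sample except the i-th
        if nearest_mean_predict(train, feature) == label:
            correct += 1
    return correct / len(samples)


# Hypothetical toy data: (physiological feature, stress label 0 or 1).
toy_samples = [(62.0, 0), (65.0, 0), (70.0, 0), (88.0, 1), (92.0, 1), (95.0, 1)]
accuracy = leave_one_out_accuracy(toy_samples)
```

With one subject per sample, the same loop would give a leave-one-subject-out estimate; whether the study held out samples or subjects is not stated in the abstract.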



COMMENTS

  1. arxiv_dataset · Datasets at Hugging Face

    To help make the arXiv more accessible, a free, open pipeline on Kaggle to the machine-readable arXiv dataset: a repository of 1.7 million articles, with relevant features such as article titles, authors, categories, abstracts, full text PDFs, and more is presented to empower new use cases that can lead to the exploration of richer machine ...

  2. scientific_papers

    Description: The scientific papers datasets contain two sets of long and structured documents. The datasets are obtained from the ArXiv and PubMed OpenAccess repositories. Both "arxiv" and "pubmed" have two features: article: the body of the document, paragraphs separated by "/n". abstract: the abstract of the document, paragraphs separated by "/n".

  3. arXiv Paper Abstracts

    arXiv paper abstract dataset for building multi-label text classifiers.

  4. D3: A Massive Dataset of Scholarly Metadata for Analyzing the State of

    Investigating papers' abstracts reveals that recent topic trends are clearly reflected in D3. Finally, we list further applications of D3 and pose supplemental research questions. The D3 dataset, our findings, and source code are publicly available for research purposes.

  5. AGENDA Dataset

    Abstract GENeration DAtaset (AGENDA) is a dataset of knowledge graphs paired with scientific abstracts. The dataset consists of 40k paper titles and abstracts from the Semantic Scholar Corpus taken from the proceedings of 12 top AI conferences. Browse State-of-the-Art ... research developments, libraries, methods, and datasets.

  6. A dataset for plain language adaptation of biomedical abstracts

    The PLABA dataset includes 75 health-related questions asked by MedlinePlus users, 750 PubMed abstracts from relevant scientific articles, and corresponding human created adaptations of the abstracts.

  7. [1710.06071] PubMed 200k RCT: a Dataset for Sequential Sentence

    In the dataset we present in this paper, PubMed 200k RCT, each short text we consider is one sentence. We focus on classifying sentences in medical abstracts, and particularly in randomized controlled trials (RCTs), as they are commonly considered to be the best source of medical evidence. Since sentences in an abstract appear in a sequence, we call this task the sequential ...

  8. AI-GA: AI-Generated Abstracts dataset

    The AI-GA (Artificial Intelligence Generated Abstracts) dataset is a collection of abstracts and titles, with half of the abstracts being AI-generated and the other half being original. This dataset is designed to be used for research and experimentation in the field of natural language processing, particularly in the context of language generation and machine learning.

  9. [2204.13384] D3: A Massive Dataset of Scholarly Metadata for Analyzing

    Download PDF Abstract: DBLP is the largest open-access repository of scientific articles on computer science and provides metadata associated with publications, authors, and venues. We retrieved more than 6 million publications from DBLP and extracted pertinent metadata (e.g., abstracts, author affiliations, citations) from the publication texts to create the DBLP Discovery Dataset (D3).

  10. Building Multi-Label Text Classifiers for arXiv Paper Abstract Dataset

    The Kaggle arXiv Paper Abstract Dataset provides more than 38000 unique paper titles along with their summaries and subject areas. The dataset was uploaded just a few days ago ...

  11. A Dataset of 1.7 Million ArXiv Articles Available on Kaggle

    The dataset contains relevant features such as article titles, authors, categories, content (both abstract and full text) and citations of 1.7 million scholarly articles available on arXiv. This dataset is an amazing resource for machine learning and deep learning applications. Some of the possible applications are:

  12. BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent

    In this work, we present a novel dataset, BIGPATENT, consisting of 1.3 million records of U.S. patent documents along with human written abstractive summaries. Compared to existing summarization datasets, BIGPATENT has the following properties: i) summaries contain a richer discourse structure with more recurring entities, ii) salient content ...

  13. Abstractive text summarization and new large-scale datasets for

    The research and the available resources remain mostly limited to the English language, which prevents progress in other languages. There is a need to gather large-scale resources suitable for such tasks in low-resourced languages. ... The content and abstract lengths of both datasets are around half the sizes of CNN/Daily Mail and NY ...

  14. arXiv-10 Dataset

    Benchmark dataset for abstracts and titles of 100,000 ArXiv scientific papers. This dataset contains 10 classes and is balanced (exactly 10,000 per class). The classes include subcategories of computer science, physics, and math. • Direct link: Download • Citation: @inproceedings{farhangi2022protoformer, title={Protoformer: Embedding Prototypes for Transformers}, author={Farhangi, Ashkan ...

  15. Datasets from a research project examining the role of politics in

    We present four datasets from a project examining the role of politics in social psychological research. These include thousands of independent raters who coded scientific abstracts for political ...

  16. Datasets

    Support for data sets associated with arXiv articles. arXiv is primarily an archive and distribution service for research articles. arXiv provides support for data sets and other ancillary materials only in direct connection with research articles submitted.. arXiv supports the inclusion of ancillary files of modest size with articles. If you are including multiple page datasets or code with ...

  17. Abstract

    The importance of datasets for machine learning research cannot be overstated. Datasets have been seen as the limiting factor for algorithmic development and scientific progress [Halevy et al., 2009, Sun et al., 2017], and a select few benchmark datasets have shaped some of the most significant developments in the field.

  18. Datasets for Industry

    Our datasets help you answer your specific questions and meet your requirements, from our user tool datasets to custom ones tailored to specific use cases. ... Three datasets of abstracts, authors and affiliations, and evaluation metrics cover 24 research disciplines from 7,000 publishers. Extracted data from peer-reviewed scientific journals ...

  19. Datasets

    ScreenQA Short. The dataset is a modification of the original ScreenQA dataset. It contains the same ~86K questions for ~35K screenshots from Rico, but the ground truth is a list of short answers. It should be used to train and evaluate models capable of screen content understanding via question answering.

  20. arXiv Dataset

    arXiv dataset and metadata of 1.7M+ scholarly papers across STEM.

  21. Dataset Search

    Learn more about Dataset Search.

  22. Globe-LFMC 2.0, an enhanced and updated dataset for live fuel ...

    Globe-LFMC 2.0, an updated version of Globe-LFMC, is a comprehensive dataset of over 280,000 Live Fuel Moisture Content (LFMC) measurements. These measurements were gathered through field ...

  23. ACL Title and Abstract Dataset

    This dataset gathers 10,874 title and abstract pairs from the ACL Anthology Network (until 2016). The structure of the data is as follows: - title - abstract - \newline This dataset is used in our published paper: Paper Abstract Writing through Editing Mechanism Citation @inproceedings{wang-etal-2018-paper, title = "Paper Abstract Writing through Editing Mechanism", author = "Wang, Qingyun and ...

  24. Novel research and future prospects of artificial intelligence in

    Research into the potential benefits of artificial intelligence for comprehending the intricate biology of cancer has grown as a result of the widespread use of deep learning and machine learning in the healthcare sector and the availability of highly specialized cancer datasets. Here, we review new artificial intelligence approaches and how they are being used in oncology.

  25. Robotics and Autonomous Systems Lab

    ABSTRACT. The interview process has been identified as one of the major barriers to employment of autistic individuals, which contributes to the staggering rate of under- and unemployment of autistic adults. Decreasing stress during the interview has been shown to improve interview performance.

  26. Ais: streamlining segmentation of cryo-electron tomography datasets

    Abstract. Segmentation is a critical data processing step in many applications of cryo-electron tomography. Downstream analyses, such as subtomogram averaging, are often based on segmentation results, and are thus critically dependent on the availability of open-source software for accurate as well as high-throughput tomogram segmentation.

  27. Research Paper Abstracts

    Kaggle is the world's largest data science community with powerful tools and resources to help you achieve your data science goals.

  28. Towards System Modelling to Support Diseases Data Extraction from the

    The use of Electronic Health Records (EHRs) has increased dramatically in the past 15 years, as they are considered an important source for managing patient data. EHRs are primary sources of disease diagnosis and demographic data of patients worldwide. Therefore, the data can be utilized for secondary tasks such as research. This paper aims to make such data usable for research activities ...

  29. The Turkish minimum dataset for chronic low back pain research: a cross

    The US National Institutes of Health (NIH) has produced a minimal data set to promote more accurate and consistent reporting of clinical trials, facilitating easier comparison of research on low back pain patients worldwide. The NIH-minimal dataset has not been previously translated into Turkish, and its features are currently unknown.

  30. The Role of Partner Gender: How Sexual Expectations Shape the Pursuit

    Previous research has established that gendered sexual scripts shape sexual behavior. This study seeks to expand prior work on orgasm disparities for women across sexual orientations by exploring the role of partner gender.