data science Recently Published Documents


Assessing the effects of fuel energy consumption, foreign direct investment and GDP on CO2 emission: New data science evidence from Europe & Central Asia

Documentation matters: human-centered AI system to assist data science code documentation in computational notebooks

Computational notebooks allow data scientists to express their ideas through a combination of code and documentation. However, data scientists often pay attention only to the code, and neglect creating or updating their documentation during quick iterations. Inspired by human documentation practices learned from 80 highly-voted Kaggle notebooks, we design and implement Themisto, an automated documentation generation system to explore how human-centered AI systems can support human data scientists in the machine learning code documentation scenario. Themisto facilitates the creation of documentation via three approaches: a deep-learning-based approach to generate documentation for source code, a query-based approach to retrieve online API documentation for source code, and a user prompt approach to nudge users to write documentation. We evaluated Themisto in a within-subjects experiment with 24 data science practitioners, and found that automated documentation generation techniques reduced the time for writing documentation, reminded participants to document code they would have ignored, and improved participants’ satisfaction with their computational notebook.

Data science in the business environment: Insight management for an Executive MBA

Adventures in financial data science

GeCoAgent: a conversational agent for empowering genomic data extraction and analysis

With the availability of reliable and low-cost DNA sequencing, human genomics is relevant to a growing number of end-users, including biologists and clinicians. Typical interactions require applying comparative data analysis to huge repositories of genomic information for building new knowledge, taking advantage of the latest findings in applied genomics for healthcare. Powerful technology for data extraction and analysis is available, but broad use of the technology is hampered by the complexity of accessing such methods and tools. This work presents GeCoAgent, a big-data service for clinicians and biologists. GeCoAgent uses a dialogic interface, animated by a chatbot, for supporting the end-users’ interaction with computational tools accompanied by multi-modal support. While the dialogue progresses, the user is accompanied in extracting the relevant data from repositories and then performing data analysis, which often requires the use of statistical methods or machine learning. Results are returned using simple representations (spreadsheets and graphics), while at the end of a session the dialogue is summarized in textual format. The innovation presented in this article is concerned with not only the delivery of a new tool but also our novel approach to conversational technologies, potentially extensible to other healthcare domains or to general data science.

Differentially Private Medical Texts Generation Using Generative Neural Networks

Technological advancements in data science have offered us affordable storage and efficient algorithms to query a large volume of data. Our health records are a significant part of this data, which is pivotal for healthcare providers and can be utilized in our well-being. The clinical note in electronic health records is one such category that collects a patient’s complete medical information during different timesteps of patient care available in the form of free-texts. Thus, these unstructured textual notes contain events from a patient’s admission to discharge, which can prove to be significant for future medical decisions. However, since these texts also contain sensitive information about the patient and the attending medical professionals, such notes cannot be shared publicly. This privacy issue has thwarted timely discoveries on this plethora of untapped information. Therefore, in this work, we intend to generate synthetic medical texts from a private or sanitized (de-identified) clinical text corpus and analyze their utility rigorously in different metrics and levels. Experimental results promote the applicability of our generated data as it achieves more than 80% accuracy in different pragmatic classification problems and matches (or outperforms) the original text data.

Impact on Stock Market across Covid-19 Outbreak

Abstract: This paper analyses the impact of the pandemic on the global stock exchange. Stock listing values are determined by a variety of factors, including seasonal changes, catastrophic calamities, pandemics, fiscal year changes, and many more. This paper analyses the variation of listing prices over the worldwide outbreak of the novel coronavirus. The key reason to focus on this outbreak was to provide insight into the underlying regulation of stock exchanges. Daily closing prices of the stock indices from January 2017 to January 2022 have been used for the analysis. The main aim of the research is to analyse whether a global economic downturn impacts the financial stock exchange. Keywords: Stock Exchange, Matplotlib, Streamlit, Data Science, Web scraping.

Information Resilience: the nexus of responsible and agile approaches to information use

Abstract: The appetite for effective use of information assets has been steadily rising in both public and private sector organisations. However, whether the information is used for social good or commercial gain, there is a growing recognition of the complex socio-technical challenges associated with balancing the diverse demands of regulatory compliance and data privacy, social expectations and ethical use, business process agility and value creation, and scarcity of data science talent. In this vision paper, we present a series of case studies that highlight these interconnected challenges, across a range of application areas. We use the insights from the case studies to introduce Information Resilience, as a scaffold within which the competing requirements of responsible and agile approaches to information use can be positioned. The aim of this paper is to develop and present a manifesto for Information Resilience that can serve as a reference for future research and development in relevant areas of responsible data management.

qEEG Analysis in the Diagnosis of Alzheimer’s Disease; a Comparison of Functional Connectivity and Spectral Analysis

Alzheimer’s disease (AD) is a brain disorder that is mainly characterized by a progressive degeneration of neurons in the brain, causing a decline in cognitive abilities and difficulties in engaging in day-to-day activities. This study compares an FFT-based spectral analysis against a functional connectivity analysis based on phase synchronization, for finding known differences between AD patients and Healthy Control (HC) subjects. Both of these quantitative analysis methods were applied on a dataset comprising bipolar EEG montage values from 20 diagnosed AD patients and 20 age-matched HC subjects. Additionally, an attempt was made to localize the identified AD-induced brain activity effects in AD patients. The obtained results showed the advantage of the functional connectivity analysis method compared to a simple spectral analysis. Specifically, while spectral analysis could not find any significant differences between the AD and HC groups, the functional connectivity analysis showed statistically higher synchronization levels in the AD group in the lower frequency bands (delta and theta), suggesting that the AD patients’ brains are in a phase-locked state. Further comparison of functional connectivity between the homotopic regions confirmed that the traits of AD were localized in the centro-parietal and centro-temporal areas in the theta frequency band (4-8 Hz). The contribution of this study is that it applies a neural metric for Alzheimer’s detection from a data science perspective rather than from a neuroscience one. The study shows that the combination of bipolar derivations with phase synchronization yields similar results to comparable studies employing alternative analysis methods.

Big Data Analytics for Long-Term Meteorological Observations at Hanford Site

A growing number of physical objects with embedded sensors with typically high volume and frequently updated data sets has accentuated the need to develop methodologies to extract useful information from big data for supporting decision making. This study applies a suite of data analytics and core principles of data science to characterize near real-time meteorological data with a focus on extreme weather events. To highlight the applicability of this work and make it more accessible from a risk management perspective, a foundation for a software platform with an intuitive Graphical User Interface (GUI) was developed to access and analyze data from a decommissioned nuclear production complex operated by the U.S. Department of Energy (DOE, Richland, USA). Exploratory data analysis (EDA), involving classical non-parametric statistics, and machine learning (ML) techniques, were used to develop statistical summaries and learn characteristic features of key weather patterns and signatures. The new approach and GUI provide key insights into using big data and ML to assist site operation related to safety management strategies for extreme weather events. Specifically, this work offers a practical guide to analyzing long-term meteorological data and highlights the integration of ML and classical statistics to applied risk and decision science.


The role of data science in healthcare advancements: applications, benefits, and future prospects

Sri Venkat Gunturi Subrahmanya

1 Department of Electrical and Electronics Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka India

Dasharathraj K. Shetty

2 Department of Humanities and Management, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka India

Vathsala Patil

3 Department of Oral Medicine and Radiology, Manipal College of Dental Sciences, Manipal, Manipal Academy of Higher Education, Manipal Karnataka, India

B. M. Zeeshan Hameed

4 Department of Urology, Father Muller Medical College, Mangalore, Karnataka India

5 Department of Radiation Oncology, Massachusetts General Hospital, Boston, MA USA

Komal Smriti

Nithesh Naik

6 Department of Mechanical and Manufacturing Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka India

Bhaskar K. Somani

7 Department of Urology, University Hospital Southampton NHS Trust, Southampton, UK

Data science is an interdisciplinary field that extracts knowledge and insights from structured and unstructured data using scientific methods, data mining techniques, machine-learning algorithms, and big data. The healthcare industry generates large datasets of useful information on patient demography, treatment plans, results of medical examinations, insurance, etc. The data collected from Internet of Things (IoT) devices also attract the attention of data scientists. Data science helps to process, manage, analyze, and assimilate the large quantities of fragmented, structured, and unstructured data created by healthcare systems. These data require effective management and analysis to yield reliable results. The article reviews and discusses the processes of data cleansing, data mining, data preparation, and data analysis used in healthcare applications. It provides an insight into the status and prospects of big data analytics in healthcare, highlights the advantages, describes the frameworks and techniques used, outlines the challenges currently faced, and discusses viable solutions. Data science and big data analytics can provide practical insights and aid strategic decision-making concerning the health system. They help build a comprehensive view of patients, consumers, and clinicians. Data-driven decision-making opens up new possibilities to boost healthcare quality.

Introduction

The evolution of the digital era has led to the confluence of healthcare and technology, resulting in the emergence of newer data-related applications [ 1 ]. The health care sector generates voluminous clinical data, such as the Electronic Health Records (EHR) of patients, prescriptions, clinical reports, information about the purchase of medicines, medical insurance-related data, investigations, and laboratory reports, so there is an immense opportunity to analyze and study these data using recent technologies [ 2 ]. The huge volume of data can be pooled together and analyzed effectively using machine-learning algorithms. Analyzing the details and understanding the patterns in the data can support better decision-making, resulting in a better quality of patient care. It can help to understand trends and so improve the outcome of medical care, life expectancy, early detection and identification of disease at an initial stage, and the provision of required treatment at an affordable cost [ 3 ]. Health Information Exchange (HIE) can be implemented to extract clinical information from various distinct repositories and merge it into a single person’s health record that all care providers can access securely. Hence, organizations associated with healthcare must attempt to procure the available tools and infrastructure to make use of big data, which can augment revenue and profits, establish better healthcare networks, and yield significant benefits [ 4 , 5 ]. Data mining techniques can create a shift from conventional medical databases to a knowledge-rich, evidence-based healthcare environment in the coming decade.

Big data and its utility in healthcare and medical sciences have become more critical with the dawn of the social media era (platforms such as Facebook and Twitter) and smartphone apps that can monitor personal health parameters using sensors and analyzers [ 6 , 7 ]. The role of data mining is to improve the use of stored user information to provide superior treatment and care. This review article provides an insight into the advantages and methodologies of big data usage in health care systems. It highlights the voluminous data generated in these systems, their qualities, possible security-related problems, data handling, and how analytics supports gaining significant insight into these data sets.

Search strategy

A non-systematic review of the English-language literature on data science and big data in healthcare published in the last decade (2010–2020) was conducted in November 2020 using MEDLINE, Scopus, EMBASE, and Google Scholar. Our search strategy involved creating a search string based on a combination of the following keywords: “Big Data,” “Big Data Analytics,” “Healthcare,” “Artificial Intelligence,” “AI,” “Machine learning,” “ML,” “ANN,” “Convolutional Networks,” “Electronic Health Records,” “EHR,” “EMR,” “Bioinformatics,” and “Data Science.” We included original articles published in English.

Inclusion criteria

  • Articles on big data analytics, data science, and AI.
  • Full-text original articles on all aspects of application of data science in medical sciences.

Exclusion criteria

  • Commentaries, reviews, and articles with no full-text context and book chapters.
  • Animal, laboratory, or cadaveric studies.

The literature review was performed as per the above-mentioned strategy. Titles and abstracts were evaluated and screened, and the full text was reviewed for the chosen articles that satisfied the inclusion criteria. Furthermore, the authors manually reviewed the reference lists of the selected articles to screen for any additional work of interest. Disagreements about eligibility were resolved by consensus after discussion.

Knowing more about “big data”

Big data consists of vast volumes of data that cannot be managed using conventional technologies. Although there are many ways to define big data, we can consider the definition by Douglas Laney [ 8 ], which comprises three dimensions, namely, volume, velocity, and variety (3 Vs). The “big” in big data implies its large volume. Velocity denotes the speed or rate at which data is processed. Variety covers the various forms of structured and raw data obtained by any method or device, such as transaction-level data, videos, audios, texts, emails, and logs. The 3 Vs became the default description of big data, while many other Vs have since been added to the definition [ 9 ]. “Veracity” remains the most widely agreed fourth “V.” Data veracity focuses on the accuracy and reliability of a dataset. It helps to filter what is important from what is not. Data with high veracity has many records that are valuable to analyze and that contribute in a meaningful way to the overall results. This aspect poses the biggest challenge when it comes to big data: with so much data available, ensuring that it is relevant and of high quality is important. Over recent years, big data has become increasingly popular across all parts of the globe.

Big data needs technologically sophisticated applications that use high-end computing resources and Artificial Intelligence (AI)-based algorithms to make sense of such huge volumes of data. Machine learning (ML) approaches for automatic decision-making, applying fuzzy logic and neural networks, are an added advantage. Innovative and efficient strategies for dealing with data, smart cloud-based applications, effective storage, and user-friendly visualization are required to gain practical insights from big data [ 10 ].

Medical care as a repository for big data

Healthcare is a multilayered system developed specifically for preventing, diagnosing, and treating diseases. The key elements of medical care are health practitioners (physicians and nurses), healthcare facilities (which include clinics, drug delivery centers, and other testing or treatment technologies), and a funding agency that funds the former. Health care practitioners belong to different fields of health such as dentistry, pharmacy, medicine, nursing, psychology, allied health sciences, and many more. Depending on the severity of the cases, health care is provided at many levels. At all these levels, health practitioners need different forms of information such as the medical history of the patient (data related to medication and prescriptions), clinical data (such as data from laboratory assessments), and other personal or private medical data. The usual practice for a clinic, hospital, or patient has been to retain these medical documents either as written notes or as printed reports [ 11 ].

The clinical case records preserve the incidence and outcome of disease in a person’s body as a tale in the family, and the doctor plays an integral role in this tale [ 12 ]. With the emergence of electronic systems and their capacity, digitizing medical exams, health records, and investigations is a common procedure today. In 2003, the Institute of Medicine, a division in the National Academies of Sciences and Engineering, coined the term “Electronic Health Records” to denote an electronic portal that saves the records of patients. Electronic health records (EHRs) are automated medical records of patients, related to an individual’s physical/mental health or significant reports, that are saved in an electronic system and used to record, send, receive, store, and retrieve information and to connect medical personnel and patients with medical services [ 13 ].

Open-source big data platforms

Loading big data, or any vast volume of data, into the storage of a single machine is inefficient even on the most powerful computers. Hence, the only logical approach to processing large quantities of big data available in a complex form is to spread it across several parallel connected nodes and process it there. Nevertheless, the volume of data is typically so high that a large number of computing machines is needed to distribute the data and finish processing in a reasonable period. Working with thousands of nodes involves coping with issues related to parallelizing the computation, distributing the data, and managing failures. Table 1 shows a few open-source big data platforms and their utilities for data scientists.

Table 1 Open-source big data platforms and their utilities
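
The split, process, and merge pattern described above can be sketched in a few lines of Python. This is only an illustration on a single machine, assuming a thread pool as a stand-in for cluster nodes; the record layout, the `diagnosis` field, and all function names are hypothetical and do not come from any of the platforms in Table 1.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(records, n_partitions):
    """Spread the records over n_partitions lists in round-robin order."""
    return [records[i::n_partitions] for i in range(n_partitions)]


def map_partition(partition):
    """Map step: count diagnoses inside a single partition."""
    return Counter(record["diagnosis"] for record in partition)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def count_diagnoses(records, n_partitions=3):
    partitions = split_into_partitions(records, n_partitions)
    # Each worker thread stands in for one node of a cluster (assumption).
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_counts = list(pool.map(map_partition, partitions))
    return reduce(reduce_counts, partial_counts, Counter())


# Hypothetical toy records, not real patient data.
records = [
    {"patient": 1, "diagnosis": "diabetes"},
    {"patient": 2, "diagnosis": "asthma"},
    {"patient": 3, "diagnosis": "diabetes"},
    {"patient": 4, "diagnosis": "hypertension"},
    {"patient": 5, "diagnosis": "asthma"},
    {"patient": 6, "diagnosis": "diabetes"},
]

if __name__ == "__main__":
    print(count_diagnoses(records))
```

A real platform additionally handles data placement and node failures, which this sketch leaves out.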

Data mining

Data types can be classified based on their nature, source, and data collection methods [ 14 ]. Data mining techniques include data grouping, data clustering, data correlation, mining of sequential patterns, regression, and data storage. There are several sources from which healthcare-related data can be obtained (Fig. 1 ). The most commonly used type (77%) is data generated by humans (HG data), which includes Electronic Medical Records (EMR), Electronic Health Records (EHR), and Electronic Patient Records (EPR). Online data through Web Service (WS) is the second largest form of data (11%), owing to the daily increase in the number of people using social media and to current digital development in the medical sector [ 15 ]. Recent advances in Natural Language Processing (NLP)-based methodologies are also making WS simpler to use [ 16 ]. The other data forms, such as Sensor Data (SD), Big Transactional Data (BTD), and Biometric Data (BM), make up around 12% of overall data use, but the prominence and market growth of wearable personal health monitoring devices [ 17 ] may increase the need for SD and BM data.

Fig. 1 Sources of big data in healthcare

Applications of analytics in healthcare

There are six areas of applications of analytics in healthcare (Fig.  2 ) including disease surveillance, health care management and administration, privacy protection and fraud detection, mental health, public health, and pharmacovigilance. Researchers have implemented data extraction for data deposition and cloud-based computing, optimizing quality, lowering costs, leveraging resources, handling patients, and other fields.

Fig. 2 Various applications of data science in healthcare

Disease surveillance

It involves the perception of the disease, understanding its condition, etiology (the manner of causation of a disease), and prevention (Fig.  3 ).

Fig. 3 The disease analysis system

Information obtained with the help of EHRs and the Internet has huge potential for disease analysis. The various surveillance methods would aid the planning of services, evaluation of treatments, priority setting, and the development of health policy and practice.

Image processing of healthcare data from the big data point of view

Image processing of healthcare data offers valuable knowledge about anatomy and organ functioning and identifies disease and patient health conditions. The technique has been used for organ delineation, identification of lung tumors, diagnosis of spinal deformity, detection of arterial stenosis, detection of aneurysms, etc. [ 18 ]. Wavelet techniques are commonly used for image processing tasks such as segmentation, enhancement, and noise reduction. The use of artificial intelligence in image processing will enhance aspects of health care including screening, diagnosis, and prognosis, and integrating medical images with other types of data and genomic data will increase accuracy and facilitate early diagnosis of diseases [ 18 , 19 ]. The exponential increase in the number of medical facilities and patients has led to greater use of computer-based healthcare diagnostics and decision-making systems in clinical settings.
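
To make the wavelet-based noise reduction mentioned above concrete, the following is a minimal sketch of one-level Haar wavelet denoising applied to a single row of pixel intensities. The choice of the Haar wavelet, soft thresholding, and the threshold value are assumptions made for illustration; the cited works may use other wavelets and multi-level, two-dimensional transforms.

```python
import math


def haar_forward(signal):
    """One-level Haar transform: pairwise averages and differences."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    root2 = math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Rebuild the signal from approximation and detail coefficients."""
    root2 = math.sqrt(2)
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([(a + d) / root2, (a - d) / root2])
    return signal


def soft_threshold(values, threshold):
    """Shrink each coefficient toward zero by threshold; small ones become zero."""
    return [math.copysign(max(abs(v) - threshold, 0.0), v) for v in values]


def denoise(signal, threshold):
    """Suppress small detail coefficients, which are assumed to be noise."""
    approx, detail = haar_forward(signal)
    return haar_inverse(approx, soft_threshold(detail, threshold))


if __name__ == "__main__":
    # Hypothetical row of pixel intensities with small fluctuations.
    row = [4.0, 4.2, 8.0, 7.8]
    print(denoise(row, threshold=1.0))
```

With a threshold of zero the transform is exactly invertible, so any smoothing comes only from the thresholding step.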

Data from wearable technology

Multinational companies like Apple and Google are working on health-based apps and wearable technology as part of a broader range of electronic sensors, the so-called IoT, and toolkits for healthcare-related apps. The possibility of collecting accurate medical data in real time (e.g., mood, diet followed, exercise, and sleep cycle patterns), linked to physiological indicators (e.g., heart rate, calories burned, level of blood glucose, cortisol levels), is perhaps discrete and omnipresent at minimum cost, unrelated to traditional health care. “True Colors” is a wearable designed to collect continuous patient-centric data with the accessibility and acceptability needed to allow for accurate longitudinal follow-up. More importantly, this system is presently being piloted as a daily health-monitoring substitute.

Medical signal analytics

Telemetry and the devices for the monitoring of physiological parameters generate large amounts of data. The data generated generally are retained for a shorter duration, and thus, extensive research into produced data is neglected. However, advancements in data science in the field of healthcare attempt to ensure better management of data and provide enhanced patient care [ 20 – 23 ].

The use of continuous waveforms in health records, together with information generated through the application of analytical disciplines (e.g., statistical, quantitative, contextual, cognitive, predictive, etc.), can drive comprehensive care decision-making. In addition to data acquisition, an ingestion-streaming platform is needed that can handle a set of waveforms at various fidelity rates. Integrating this waveform data with the EHR’s static data gives the analytics engine both situational and contextual awareness. Enriching the data available to analytics will not just make the method more reliable, but will also help in balancing the sensitivity and specificity of predictive analytics. The choice of signal processing technique must depend mainly on the kind of disease population under observation.
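
The balance between sensitivity and specificity can be illustrated with a short worked example. Sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP); moving the alarm threshold of a predictive model trades one against the other. The labels, risk scores, and thresholds below are hypothetical values chosen for illustration only.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Return (sensitivity, specificity) when scores >= threshold raise an alarm."""
    tp = fn = tn = fp = 0
    for label, score in zip(labels, scores):
        predicted = score >= threshold
        if label and predicted:
            tp += 1
        elif label and not predicted:
            fn += 1
        elif not label and predicted:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


# Hypothetical ground truth (1 = event occurred) and model risk scores.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]

if __name__ == "__main__":
    for threshold in (0.35, 0.5, 0.65):
        print(threshold, sensitivity_specificity(labels, scores, threshold))
```

In this toy data a lower threshold catches every event at the cost of a false alarm, while a higher threshold removes the false alarm but misses one event.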

Various signal-processing techniques can be used to derive a large number of target properties that are later consumed to provide actionable insight by a pre-trained machine-learning model. Such observations may be analytical, prescriptive, or predictive. Such insights can be furthermore built to activate other techniques such as alarms and physician notifications. Maintaining these continuous waveforms–based data along with specific data obtained from the remaining sources in perfect harmony to find the appropriate patient information to improve diagnosis and treatments of the next generation can be a daunting task [ 24 ]. Several technological criteria and specifications at the framework, analytical, and clinical levels need to be planned and implemented for the bedside implementation of these systems into medical setups.

Healthcare administration

Knowledge obtained from big data analysis gives healthcare providers insights not available otherwise (Fig. 4 ). Researchers have applied data mining techniques to data warehousing as well as cloud computing, increasing quality, minimizing costs, handling patients, and several other fields of healthcare.

Fig. 4 Role of big data in accelerating the treatment process

Data storage and cloud computing

Data warehousing and cloud storage are primarily used for storing the increasing amount of electronic patient-centric data [ 25 , 26 ] safely and cost-effectively to enhance medical outcomes. Besides medical purposes, data storage is utilized for purposes of research, training, education, and quality control. Users can also extract files from a repository containing the radiology results by using keywords following the predefined patient privacy policy.

Cost and quality of healthcare and utilization of resources

The migration of imaging reports to electronic medical recording systems offers tremendous potential for advancing research and practice in radiology through the continuous updating, incorporation, and exchange of a large volume of data. However, the heterogeneity in how these data can be formatted still poses major challenges. The overall objective of NLP is to translate natural human language into a structured format with a standardized set of value choices that software can easily manipulate, for example to divide it into subsections or to search for the presence or absence of a finding [ 27 ].
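
As a rough illustration of turning free-text report language into structured values, the sketch below marks each finding as "present", "absent", or "not mentioned" using a small list of negation cues. It is a deliberately simple rule-based stand-in, not the NLP method of the cited work; the cue list, the report text, and the function name are assumptions.

```python
import re

# Hypothetical negation cues; real systems use far richer rules or models.
NEGATION_CUES = ("no", "without", "negative for", "absence of")


def extract_findings(report, findings):
    """Map each finding to 'present', 'absent', or 'not mentioned'."""
    sentences = [s.strip().lower() for s in re.split(r"[.;]", report) if s.strip()]
    result = {}
    for finding in findings:
        term = finding.lower()
        status = "not mentioned"
        for sentence in sentences:
            position = sentence.find(term)
            if position == -1:
                continue
            # Only text before the finding in the same sentence can negate it.
            preceding = sentence[:position]
            negated = any(
                re.search(r"\b" + re.escape(cue) + r"\b", preceding)
                for cue in NEGATION_CUES
            )
            status = "absent" if negated else "present"
            break
        result[finding] = status
    return result


if __name__ == "__main__":
    report = "Small nodule in the right upper lobe. No pleural effusion. Heart size is normal."
    print(extract_findings(report, ["nodule", "pleural effusion", "pneumothorax"]))
```

The structured output can then be searched or aggregated by software, which is the point the paragraph above makes.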

Greaves et al. [ 28 ] analyzed sentiment (computationally dividing them into categories such as optimistic, pessimistic, and neutral) based on the online response of patients stating their overall experience to predict healthcare quality. They found an agreement above 80% between online platform sentiment analysis and conventional paper-based quality prediction surveys (e.g., cleanliness, positive conduct, recommendation). The newer solution can be a cost-effective alternative to conventional healthcare surveys and studies. The physician’s overuse of screening and testing often leads to surplus data and excess costs [ 29 ]. The present practice in pathology is restricted by the emphasis on illness. Zhuang et al. [ 29 ] compared the disease-based approach in conjunction with database reasoning and used the data mining technique to build a decision support system based on evidence to minimize the unnecessary testing to reduce the total expense of patient care.
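
An agreement figure of this kind can be computed as simple percent agreement between two sets of ratings. The text does not state which agreement statistic Greaves et al. used, so the measure below and the toy ratings are assumptions for illustration.

```python
def percent_agreement(ratings_a, ratings_b):
    """Percentage of items on which two rating sources give the same category."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)


# Hypothetical ratings for ten hospitals: sentiment analysis vs. paper survey.
sentiment = ["positive", "positive", "negative", "neutral", "positive",
             "negative", "positive", "neutral", "positive", "negative"]
survey = ["positive", "positive", "negative", "neutral", "positive",
          "negative", "positive", "positive", "positive", "negative"]

if __name__ == "__main__":
    print(percent_agreement(sentiment, survey))
```

Percent agreement does not correct for agreement expected by chance; a chance-corrected statistic may be preferable in practice.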

Patient data management

Patient data management involves effective scheduling and the delivery of patient care during the period of a patient’s stay in a hospital. The framework of patient-centric healthcare is shown in Fig. 5 . Daggy et al. [ 30 ] conducted a study on “no-shows,” or missed appointments, which leave clinical capacity underused. A logistic regression model was developed using electronic medical records to estimate the probability that a patient will not show up, and the estimates were used to create clinical schedules that optimize the use of clinical capacity while limiting waiting times and clinic overtime. A 400-day clinical call-in process was simulated, and two timetables were developed per day: the conventional method, which assigns one patient per appointment slot, and the proposed method, which schedules patients according to no-show likelihood to balance patient waiting time, overtime, and income.

Fig. 5 Elemental structure of patient-centric healthcare and ecosystem

If patient no-show models are combined with advanced programming approaches, more patients can be seen per day, thus enhancing clinical performance. Clinics should consider the advantages of implementing planning software that includes such methodologies in light of no-show costs [ 30 ].
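
The idea of scheduling by no-show likelihood can be sketched as follows: a logistic model turns patient features into a no-show probability, and patients are booked until the expected attendance reaches capacity. The feature names, coefficients, and the greedy booking rule are hypothetical; they are not the model or the scheduling method of Daggy et al.

```python
import math


def no_show_probability(features, coefficients, intercept):
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum(coef * value))))."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def expected_attendance(no_show_probabilities):
    """Expected number of booked patients who actually arrive."""
    return sum(1.0 - p for p in no_show_probabilities)


def patients_to_book(no_show_probabilities, capacity):
    """Greedy rule: keep booking in order until expected attendance meets capacity."""
    booked = []
    for p in no_show_probabilities:
        if expected_attendance(booked) >= capacity:
            break
        booked.append(p)
    return len(booked)


if __name__ == "__main__":
    # Hypothetical coefficients; a real model would be fitted on EMR data.
    coefficients = {"prior_no_shows": 0.5, "lead_time_days": 0.1}
    patient = {"prior_no_shows": 2, "lead_time_days": 10}
    print(no_show_probability(patient, coefficients, intercept=-2.0))
    print(patients_to_book([0.5, 0.5, 0.5, 0.5], capacity=1))
```

Under the conventional method one patient would be booked per slot; here two patients with a 50% no-show probability share one slot's worth of capacity, which is the overbooking trade-off the study simulates.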

A study conducted by Cubillas et al. [ 31 ] pointed out that patients who come for administrative purposes take less time than patients who come for health reasons. They also developed a statistical design for estimating the number of administrative visits. With a time saving of 21.73% (660,538 min), their model enhanced the scheduling system. Unlike patients who visit for administrative reasons, a few patients come very regularly for their medical treatment and account for a significant amount of the medical workload. Koskela et al. [ 32 ] used both supervised and unsupervised learning strategies to identify and cluster records; the supervised strategy performed well in one cluster with 86% accuracy in distinguishing fare documents from the incorrect ones, whereas the unsupervised technique failed. This approach can be applied to semi-automate the EMR entry system [ 32 ].

Privacy of medical data and fraudulency detection

The anonymization of patient data, the privacy of medical data, and fraud detection are crucial in healthcare. This demands efforts from data scientists to protect big data from hackers. Mohammed et al. [ 33 ] introduced an anonymization algorithm that works for both distributed and centralized anonymization and discussed the associated privacy and security problems. To maintain data usefulness without any loss of data privacy, the researchers further proposed a model that performed far better than the traditional k-anonymization model. In addition, their algorithm could also deal with voluminous, multi-dimensional datasets.
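As a point of reference for the k-anonymization baseline mentioned above, the following minimal sketch measures the k-anonymity of a small table and shows how generalizing a quasi-identifier raises it. It is not the algorithm of Mohammed et al. [ 33 ]; the records, column names, and the 10-year age bands are hypothetical.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records that share the
    same values on all quasi-identifier columns (0 for an empty table)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0

def generalize_age(record, width=10):
    """Replace an exact age with a band such as '30-39' (hypothetical rule)."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}

# Hypothetical patient table; "diagnosis" is the sensitive attribute.
patients = [
    {"age": 31, "zip": "560", "diagnosis": "asthma"},
    {"age": 35, "zip": "560", "diagnosis": "diabetes"},
    {"age": 38, "zip": "560", "diagnosis": "asthma"},
    {"age": 42, "zip": "560", "diagnosis": "flu"},
    {"age": 47, "zip": "560", "diagnosis": "diabetes"},
]

print(k_anonymity(patients, ["age", "zip"]))  # every exact age is unique
generalized = [generalize_age(p) for p in patients]
print(k_anonymity(generalized, ["age", "zip"]))  # ages now fall into two bands
```

Generalization trades detail for privacy: the coarser the bands, the larger k becomes and the less useful the age column is for analysis.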

A mobile-based cloud-computing framework [ 34 ] of big data has been introduced to overcome the shortcomings of today’s medical records systems. EHR data systems are constrained due to a lack of interoperability, size of data, and privacy. This unique cloud-based system proposed to store EHR data from multiple healthcare providers within the facility of an internet provider to provide authorized restricted access to healthcare providers and patients. They used algorithms for encryption, One Time Password (OTP), or a 2-factor authentication to ensure data security.
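The source does not say which OTP scheme the framework in [ 34 ] uses. As an illustration of how a one-time password can be derived and checked, the sketch below implements the counter-based HOTP construction from RFC 4226 with Python's standard library; the shared secret is the RFC's published test key, and the `verify` helper is a hypothetical name.

```python
import hashlib
import hmac
import struct

def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password as specified in RFC 4226."""
    message = struct.pack(">Q", counter)             # 8-byte big-endian counter
    digest = hmac.new(secret, message, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                       # dynamic truncation offset
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

def verify(secret: bytes, counter: int, submitted: str) -> bool:
    """Constant-time comparison of the submitted code with the expected one."""
    return hmac.compare_digest(hotp(secret, counter), submitted)

secret = b"12345678901234567890"   # test key from RFC 4226, Appendix D
print(hotp(secret, 0))             # 755224 (first RFC 4226 test vector)
print(verify(secret, 1, "287082"))
```

In a two-factor setup such a code would be checked in addition to a password, so that a stolen password alone does not grant access to the stored EHR data.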

Big data analytics can be performed using Google’s tools such as BigQuery and MapReduce. This approach will reduce costs, improve efficiency, and provide better data protection than conventional anonymization techniques, which generally leave data open to re-identification. Li et al. showed in a case study that attackers can link small pieces of information together and thereby identify patients [ 35 ]. The detection of fraud and abuse (i.e., suspicious care behavior, deliberate misrepresentation of facts, and unwanted repeated visits) makes excellent use of big data analytics [ 36 ].

Using data from gynecology-based reports, Yang et al. framed a system that manually distinguishes characteristics of suspicious specimens from a set of medical care plans that most doctors would adopt [ 37 ]. The technique was implemented on data from Taiwan’s Bureau of National Health Insurance (BNHI), where it detected 69% of the total cases as fraudulent, improving on the existing model, which detected only 63% of fraudulent cases. To sum up, the protection of patient data and the detection of fraud are of significant concern due to the growing usage of social media technology and the propensity of people to place personal information on these platforms. Existing strategies for anonymizing data may become less effective because a significant portion of everyone’s personal details is now accessible through these platforms.

Mental health

According to the National Survey on Drug Use and Health (NSDUH), 52.2% of the total population in the United States (U.S.) was affected by either mental problems or drug addiction/abuse [ 38 ]. In addition, approximately 30 million people suffer from panic attacks and anxiety disorders [ 39 ].

Panagiotakopoulos et al. [ 40 ] developed a data analysis–focused treatment technique to help doctors manage patients with anxiety disorders. The authors constructed user models based on static information (personal information such as age, sex, body and skin types, and family details) and dynamic information (such as the context of stress, climate, and symptoms). For the first three services, relationships between different complex parameters were established, and the remaining one was mainly used to predict stress rates under various scenarios. This model was verified with data collected from twenty-seven volunteers who were selected via an anxiety assessment survey. The application of data analytics to the diagnosis, examination, or treatment of patients with mental health conditions is very different from using analytics to anticipate cancer or diabetes. In this case, the data context (static, dynamic, or non-observable environment) seems to be more important than data volume [ 39 ].

The leading cause of perinatal morbidity and death is premature birth, but its exact mechanism is still unclear. The research carried out by Chen et al. [ 41 ] investigated the risk factors of preterm birth using neural networks and C5.0 decision tree data mining. The original medical data was obtained by a specialist study group at the National University of Taiwan from a prospective pregnancy cohort. A total of 910 mother–child dyads out of 14,551 in the original data were recruited using a nested case–control design. In this data, thousands of variables were studied, including basic features, medical background, the climate and parents’ occupational factors, and variables related to the children. The findings suggest that the main risk factors for preterm birth are multiple births, blood pressure during pregnancy, age, disease, prior preterm history, body weight and height of pregnant women, and paternal life risks associated with drinking and smoking. The results of the study can therefore help parents, healthcare workers, and public health workers to identify high-risk pregnant women and to intervene early to minimize and avoid preterm births [ 41 , 42 ].

Public health

Data analytics has also been applied to the detection of disease during outbreaks. Kostkova et al. [ 43 ] analyzed online records of search behavior and media reporting to identify the factors that affect public and professional search patterns related to disease outbreaks. They found distinct factors affecting the search patterns of public health professionals and laypeople, with implications for targeted communications during emergencies and outbreaks. Rathore et al. [ 44 ] proposed an emergency response unit using an IoT-based wireless network of wearable devices called body area networks (BANs). The system includes an “intelligent construction,” a model that helps in processing and decision making from the data obtained from the sensors. The system was able to process millions of users’ wireless BAN data to provide an emergency response in real time.

Online consultation is becoming increasingly common and is a possible solution to the scarcity of healthcare resources and their inefficient delivery. Numerous online consultation sites, however, struggle to attract and retain customers who are prepared to pay, and health care providers on these sites face the additional challenge of standing out from a large number of doctors who provide similar services [ 45 ]. Jiang et al. [ 45 ] used ML approaches to mine massive service data in order to (1) define the important characteristics related to paid rather than free trial appointments, (2) explore the relative importance of those features, and (3) understand how these attributes relate to payment, whether linearly or not. The dataset comes from the largest online medical consultation platform in China and covers 1,582,564 consultation records between patient–physician pairs from 2009 to 2018. The results showed that, compared with features relating to a physician’s reputation, service-related features such as quality of service (e.g., intensity of consultation dialogue and response rate), the source of patients (e.g., online vs offline patients), and the involvement of patients (e.g., social returns and previous treatments revealed) mattered more for payment. To facilitate payment, it is important to promote multiple timely responses in patient-provider interactions.

Pharmacovigilance

Pharmacovigilance requires tracking and identification of adverse drug reactions (ADRs) after launch to guarantee patient safety. The approximate social cost of ADR events reaches a billion dollars per year, making them a significant aspect of the medical care system [ 46 ]. In a test of the ADR “hypersensitivity” against six anticancer agents, data mining of adverse event reports (AERs) revealed that paclitaxel was associated with mild to lethal reactions and docetaxel with lethal reactions, while the remaining four drugs were not associated with hypersensitivity [ 47 ]. Harpaz et al. [ 46 ] disagreed with the theory that adverse events might be caused not just by a single medication but also by a mixture of synthetic drugs. It was found that in 84% of AER studies there is a correlation between at least one drug and two AEs, or between two drugs and one AE. Harpaz et al. [ 47 ] improved precision in the identification of ADRs by jointly considering several data sources. When using publicly available EHRs in conjunction with the AER studies of the FDA, they achieved a 31% (on average) increase in detection [ 45 ]. The authors identified dose-dependent ADRs with the help of models built from structured as well as unstructured EHR data [ 48 ]. Of the top 5 ADR-related drugs, 4 were observed to be dose-related [ 49 ]. The use of unstructured text data in EHRs for pharmacovigilance was also given priority [ 50 ].

ADRs are uncommon in conventional pharmacovigilance, though false signals may arise when looking for a connection between a drug and potential ADRs. These false alarms can be avoided because a list of potential ADRs already exists, which can be of great help in future pharmacovigilance activities [ 18 ].

Overcoming the language barrier

Sharing electronic health records worldwide can be beneficial for analyzing and comparing disease incidence and treatments in different countries. However, every country records data in its own language. This language barrier can be addressed with multilingual language models, which would open up diverse opportunities for the proliferation of data science and for developing models that personalize services. These models are able to understand both the semantics (the grammatical structure and rules of the language) and the context (the general understanding of words in different settings).

For example: “I’ll meet you at the river bank.”

“I have to deposit some money in my bank account.”

The word “bank” means different things in the two contexts, and a well-trained language model should be able to differentiate between them. A cross-lingual language model is trained on multiple languages simultaneously. Some cross-lingual language models include:

mBERT: the multilingual BERT, developed by the Google Research team.

XLM: a cross-lingual model developed by Facebook AI, which is an improvement over mBERT.

MultiFiT: a QRNN-based model developed by fast.ai that addresses challenges faced by low-resource language models.

Millions of data points are accessible for EHR-based phenotyping involving a large number of clinical elements inside the EHRs. Like sequence data, handling and controlling the complete data of millions of individuals would also become a major challenge [ 51 ]. The key challenges faced include:

  • The data collected is mostly either unorganized or inaccurate, which makes it difficult to gain insights from it.
  • The correct balance between preserving patient-centric information and ensuring the quality and accessibility of this data is difficult to decide.
  • Data standardization, maintaining privacy, efficient storage, and transfers require a lot of manpower to constantly monitor and make sure that the needs are met.
  • Integrating genomic data into medical studies is challenging due to the absence of standards for producing next-generation sequencing (NGS) data, handling bioinformatics, data deposition, and supporting medical decision-making [ 52 ].
  • The language barrier when dealing with data recorded in different languages.

Future directions

Healthcare services are constantly on the lookout for better options for improving the quality of treatment. They have embraced technological innovations with the intention of building a better future. Big data is a revolution in the world of health care. The attitude of patients, doctors, and healthcare providers to care delivery has only just begun to transform. The uses of big data discussed here are just the tip of the iceberg. With the proliferation of data science and the advent of various data-driven applications, the health sector remains a leading provider of data-driven solutions for a better life and of tailored services to its customers. Data scientists can gain meaningful insights into improving the productivity of pharmaceutical and medical services from the broad range of data available in the healthcare sector, including financial, clinical, R&D, administration, and operational details.

Larger patient datasets can be obtained from medical care organizations, including data from surveillance, laboratory, genomics, imaging, and electronic healthcare records. This data requires proper management and analysis to derive meaningful information. Long-term visions for self-management, improved patient care, and treatment can be realized by utilizing big data. Data science can bring in instant predictive analytics that can be used to obtain insights into a variety of disease processes and deliver patient-centric treatment. It will help to improve the capabilities of researchers in science, epidemiological studies, personalized medicine, etc. Predictive accuracy, however, is highly dependent on efficient integration of data obtained from different sources so that the results can be generalized. Modern health organizations can revolutionize medical therapy and personalized medicine by integrating biomedical and health data. Data science can effectively handle, evaluate, and interpret big data by creating new paths in comprehensive medical care.

Open access funding provided by Manipal Academy of Higher Education, Manipal.

Declarations

The authors declare no competing interests.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Data Science and Analytics: An Overview from Data-Driven Smart Computing, Decision-Making and Applications Perspective

  • Review Article
  • Published: 12 July 2021
  • Volume 2, article number 377 (2021)

  • Iqbal H. Sarker   ORCID: orcid.org/0000-0003-1740-5517 1 , 2  

The digital world has a wealth of data, such as internet of things (IoT) data, business data, health data, mobile data, urban data, security data, and many more, in the current age of the Fourth Industrial Revolution (Industry 4.0 or 4IR). Extracting knowledge or useful insights from these data can be used for smart decision-making in various applications domains. In the area of data science, advanced analytics methods including machine learning modeling can provide actionable insights or deeper knowledge about data, which makes the computing process automatic and smart. In this paper, we present a comprehensive view on “Data Science” including various types of advanced analytics methods that can be applied to enhance the intelligence and capabilities of an application through smart decision-making in different scenarios. We also discuss and summarize ten potential real-world application domains including business, healthcare, cybersecurity, urban and rural data science, and so on by taking into account data-driven smart computing and decision making. Based on this, we finally highlight the challenges and potential research directions within the scope of our study. Overall, this paper aims to serve as a reference point on data science and advanced analytics to the researchers and decision-makers as well as application developers, particularly from the data-driven solution point of view for real-world problems.

Introduction

We are living in the age of “data science and advanced analytics”, where almost everything in our daily lives is digitally recorded as data [ 17 ]. Thus the current electronic world holds a wealth of various kinds of data, such as business data, financial data, healthcare data, multimedia data, internet of things (IoT) data, cybersecurity data, social media data, etc. [ 112 ]. The data can be structured, semi-structured, or unstructured, and its volume increases day by day [ 105 ]. Data science is typically a “concept to unify statistics, data analysis, and their related methods” to understand and analyze the actual phenomena with data. According to Cao et al. [ 17 ] “data science is the science of data” or “data science is the study of data”, where a data product is a data deliverable, or data-enabled or guided, which can be a discovery, prediction, service, suggestion, insight into decision-making, thought, model, paradigm, tool, or system. The popularity of “Data science” is increasing day by day, as shown in Fig. 1 according to Google Trends data over the last 5 years [ 36 ]. In addition to data science, the figure also shows the popularity trends of relevant areas such as “Data analytics”, “Data mining”, “Big data”, and “Machine learning”. According to Fig. 1 , the popularity values for these data-driven domains, particularly “Data science” and “Machine learning”, are increasing day by day. This statistical information and the applicability of data-driven smart decision-making in various real-world application areas motivate us to briefly study “Data science” and machine-learning-based “Advanced analytics” in this paper.

Fig. 1 The worldwide popularity score of data science compared with relevant areas, in a range of 0 (min) to 100 (max) over time, where the x-axis represents the timestamp information and the y-axis represents the corresponding score

Usually, data science is the field of applying advanced analytics methods and scientific concepts to derive useful business information from data. The emphasis of advanced analytics is more on anticipating the use of data to detect patterns to determine what is likely to occur in the future. Basic analytics offer a description of data in general, while advanced analytics is a step forward in offering a deeper understanding of data and helping to analyze granular data, which we are interested in. In the field of data science, several types of analytics are popular, such as “Descriptive analytics”, which answers the question of what happened; “Diagnostic analytics”, which answers the question of why it happened; “Predictive analytics”, which predicts what will happen in the future; and “Prescriptive analytics”, which prescribes what action should be taken, discussed briefly in “ Advanced analytics methods and smart computing ”. Such advanced analytics and decision-making based on machine learning techniques [ 105 ], a major part of artificial intelligence (AI) [ 102 ], can also play a significant role in the Fourth Industrial Revolution (Industry 4.0) due to their learning capability for smart computing as well as automation [ 121 ].
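The four types of analytics named above can be contrasted on a toy example. The sketch below is only an illustration under stated assumptions: the weekly sales figures, the promotion flags, and the restocking rule are hypothetical, and each function stands in for a much richer family of methods.

```python
from statistics import mean

sales = [10, 12, 14, 16]              # hypothetical weekly unit sales
promo = [False, False, True, True]    # hypothetical: was a promotion running?

def descriptive(values):
    """What happened? Summarize the past with a simple average."""
    return mean(values)

def diagnostic(values, flags):
    """Why did it happen? Compare average sales with and without promotion."""
    with_flag = mean(v for v, f in zip(values, flags) if f)
    without_flag = mean(v for v, f in zip(values, flags) if not f)
    return with_flag - without_flag

def predictive(values):
    """What will happen? Fit a least-squares line and extrapolate one step."""
    xs = range(len(values))
    x_bar, y_bar = mean(xs), mean(values)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar + slope * (len(values) - x_bar)

def prescriptive(forecast, stock):
    """What should be done? Order enough units to cover the forecast."""
    return max(0, forecast - stock)

print(descriptive(sales), diagnostic(sales, promo))
print(predictive(sales), prescriptive(predictive(sales), stock=5))
```

Each step builds on the previous one: the forecast reuses the historical series that the descriptive step summarizes, and the prescriptive rule turns the forecast into an action.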

Although the area of “data science” is huge, we mainly focus on deriving useful insights through advanced analytics, where the results are used to make smart decisions in various real-world application areas. For this, various advanced analytics methods such as machine learning modeling, natural language processing, sentiment analysis, neural network, or deep learning analysis can provide deeper knowledge about data, and thus can be used to develop data-driven intelligent applications. More specifically, regression analysis, classification, clustering analysis, association rules, time-series analysis, sentiment analysis, behavioral patterns, anomaly detection, factor analysis, log analysis, and deep learning, which originated from the artificial neural network, are taken into account in our study. These machine learning-based advanced analytics methods are discussed briefly in “ Advanced analytics methods and smart computing ”. Thus, it is important to understand the principles of the advanced analytics methods mentioned above and their applicability in various real-world application areas. For instance, in our earlier paper Sarker et al. [ 114 ], we discussed how data science and machine learning modeling can play a significant role in the domain of cybersecurity for making smart decisions and providing data-driven intelligent security services. In this paper, we broadly take into account the data science application areas and real-world problems in ten potential domains including the area of business data science, health data science, IoT data science, behavioral data science, urban data science, and so on, discussed briefly in “ Real-world application domains ”.

Based on the importance of machine learning modeling for extracting useful insights from the data mentioned above and for data-driven smart decision-making, in this paper, we present a comprehensive view on “Data Science” including various types of advanced analytics methods that can be applied to enhance the intelligence and the capabilities of an application. The key contribution of this study is thus understanding data science modeling and explaining different analytic methods from a solution perspective, along with their applicability in the various real-world data-driven application areas mentioned earlier. Overall, the purpose of this paper is, therefore, to provide a basic guide or reference for those in academia and industry who want to study, research, and develop automated and intelligent applications or systems based on smart computing and decision making within the area of data science.

The main contributions of this paper are summarized as follows:

  • To define the scope of our study towards data-driven smart computing and decision-making in our real-world life. We also make a brief discussion on the concept of data science modeling from business problems to data product and automation, to understand its applicability and provide intelligent services in real-world scenarios.
  • To provide a comprehensive view on data science including advanced analytics methods that can be applied to enhance the intelligence and the capabilities of an application.
  • To discuss the applicability and significance of machine learning-based analytics methods in various real-world application areas. We also summarize ten potential real-world application areas, from business to personalized applications in our daily life, where advanced analytics with machine learning modeling can be used to achieve the expected outcome.
  • To highlight and summarize the challenges and potential research directions within the scope of our study.

The rest of the paper is organized as follows. The next section provides the background and related work and defines the scope of our study. The following section presents the concepts of data science modeling for building a data-driven application. After that, we briefly discuss and explain different advanced analytics methods and smart computing. Various real-world application areas are discussed and summarized in the next section. We then highlight and summarize several research issues and potential future directions, and finally, the last section concludes this paper.

Background and Related Work

In this section, we first discuss various data terms and works related to data science and highlight the scope of our study.

Data Terms and Definitions

There is a range of key terms in the field, such as data analysis, data mining, data analytics, big data, data science, advanced analytics, machine learning, and deep learning, which are highly related and easily confused. In the following, we define these terms and differentiate them from the term “Data Science” according to our goal.

The term “Data analysis” refers to the processing of data by conventional (e.g., classic statistical, empirical, or logical) theories, technologies, and tools for extracting useful information and for practical purposes [ 17 ]. The term “Data analytics”, on the other hand, refers to the theories, technologies, instruments, and processes that allow for an in-depth understanding and exploration of actionable data insight [ 17 ]. Statistical and mathematical analysis of the data is the major concern in this process. “Data mining” is another popular term over the last decade, which has a similar meaning with several other terms such as knowledge mining from data, knowledge extraction, knowledge discovery from data (KDD), data/pattern analysis, data archaeology, and data dredging. According to Han et al. [ 38 ], it should have been more appropriately named “knowledge mining from data”. Overall, data mining is defined as the process of discovering interesting patterns and knowledge from large amounts of data [ 38 ]. Data sources may include databases, data centers, the Internet or Web, other repositories of data, or data dynamically streamed through the system. “Big data” is another popular term nowadays, which may change the statistical and data analysis approaches as it has the unique features of “massive, high dimensional, heterogeneous, complex, unstructured, incomplete, noisy, and erroneous” [ 74 ]. Big data can be generated by mobile devices, social networks, the Internet of Things, multimedia, and many other new applications [ 129 ]. Several unique features including volume, velocity, variety, veracity, value (5Vs), and complexity are used to understand and describe big data [ 69 ].

In terms of analytics, basic analytics provides a summary of data, whereas “Advanced Analytics” takes a step forward in offering a deeper understanding of data and helps to analyze granular data. Advanced analytics is characterized or defined as autonomous or semi-autonomous data or content analysis using advanced techniques and methods to discover deeper insights, predict or generate recommendations, typically beyond traditional business intelligence or analytics. “Machine learning”, a branch of artificial intelligence (AI), is one of the major techniques used in advanced analytics and can automate analytical model building [ 112 ]. It is based on the premise that systems can learn from data, recognize trends, and make decisions with minimal human involvement [ 38 , 115 ]. “Deep Learning” is a subfield of machine learning concerned with algorithms inspired by the structure and function of the human brain, called artificial neural networks [ 38 , 139 ].

Unlike the above data-related terms, “Data science” is an umbrella term that encompasses advanced data analytics, data mining, machine and deep learning modeling, and several other related disciplines like statistics, to extract insights or useful knowledge from datasets and transform them into actionable business strategies. In [ 17 ], Cao et al. defined data science from the disciplinary perspective as “data science is a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management, and sociology to study data and its environments (including domains and other contextual aspects, such as organizational and social aspects) to transform data to insights and decisions by following a data-to-knowledge-to-wisdom thinking and methodology”. In “ Understanding data science modeling ”, we briefly discuss data science modeling from a practical perspective, starting from business problems and ending with data products, which can assist data scientists in thinking and working in a particular real-world problem domain within the area of data science and analytics.

Related Work

In the area, several papers have been reviewed by the researchers based on data science and its significance. For example, the authors in [ 19 ] identify the evolving field of data science and its importance in the broader knowledge environment and some issues that differentiate data science and informatics issues from conventional approaches in information sciences. Donoho et al. [ 27 ] present 50 years of data science including recent commentary on data science in mass media, and on how/whether data science varies from statistics. The authors formally conceptualize the theory-guided data science (TGDS) model in [ 53 ] and present a taxonomy of research themes in TGDS. Cao et al. include a detailed survey and tutorial on the fundamental aspects of data science in [ 17 ], which considers the transition from data analysis to data science, the principles of data science, as well as the discipline and competence of data education.

Besides, the authors include a data science analysis in [ 20 ], which aims to provide a realistic overview of the use of statistical features and related data science methods in bioimage informatics. The authors in [ 61 ] study the key streams of data science algorithm use at central banks and show how their popularity has risen over time. This research contributes to the creation of a research vector on the role of data science in central banking. In [ 62 ], the authors provide an overview and tutorial on the data-driven design of intelligent wireless networks. The authors in [ 87 ] provide a thorough understanding of computational optimal transport with application to data science. In [ 97 ], the authors present data science as theoretical contributions in information systems via text analytics.

Unlike the above recent studies, in this paper, we concentrate on the knowledge of data science including advanced analytics methods, machine learning modeling, real-world application domains, and potential research directions within the scope of our study. The advanced analytics methods based on machine learning techniques discussed in this paper can be applied to enhance the capabilities of an application in terms of data-driven intelligent decision making and automation in the final data product or systems.

Understanding Data Science Modeling

In this section, we briefly discuss how data science can play a significant role in the real-world business process. For this, we first categorize various types of data and then discuss the major steps of data science modeling starting from business problems to data product and automation.

Types of Real-World Data

Typically, to build a data-driven real-world system in a particular domain, the availability of data is the key [ 17 , 112 , 114 ]. The data can be in different types such as (i) Structured—that has a well-defined data structure and follows a standard order, examples are names, dates, addresses, credit card numbers, stock information, geolocation, etc.; (ii) Unstructured—has no pre-defined format or organization, examples are sensor data, emails, blog entries, wikis, and word processing documents, PDF files, audio files, videos, images, presentations, web pages, etc.; (iii) Semi-structured—has elements of both the structured and unstructured data containing certain organizational properties, examples are HTML, XML, JSON documents, NoSQL databases, etc.; and (iv) Metadata—that represents data about the data, examples are author, file type, file size, creation date and time, last modification date and time, etc. [ 38 , 105 ].
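The difference between these data types can be made concrete with a short Python sketch. The sample records, field names, and the `flatten` helper are hypothetical; the point is only that structured data maps directly onto rows and columns, semi-structured data carries its own nested organization, and metadata describes the data rather than its content.

```python
import csv
import io
import json

# (i) Structured: fixed columns, one record per row.
structured_text = "name,date,city\nAlice,2021-07-12,Springfield\n"
rows = list(csv.DictReader(io.StringIO(structured_text)))

# (iii) Semi-structured: self-describing keys with nested, variable content.
semi_text = ('{"name": "Alice", "visits": '
             '[{"date": "2021-07-12", "notes": "follow-up"}]}')
document = json.loads(semi_text)

def flatten(doc):
    """Turn one nested document into flat, table-like rows (one per visit)."""
    return [{"name": doc["name"], **visit} for visit in doc.get("visits", [])]

# (iv) Metadata: data about the data, not the data itself.
metadata = {"file_type": "json", "size_bytes": len(semi_text.encode("utf-8"))}

print(rows[0]["city"])
print(flatten(document))
print(metadata["file_type"])
```

Unstructured data such as free text, images, or audio has no such keys or columns, so it usually needs a feature-extraction step before it can be placed in a table at all.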

In the area of data science, researchers use various widely-used datasets for different purposes. These are, for example, cybersecurity datasets such as NSL-KDD [ 127 ], UNSW-NB15 [ 79 ], Bot-IoT [ 59 ], ISCX’12 [ 15 ], CIC-DDoS2019 [ 22 ], etc., smartphone datasets such as phone call logs [ 88 , 110 ], mobile application usages logs [ 124 , 149 ], SMS Log [ 28 ], mobile phone notification logs [ 77 ] etc., IoT data [ 56 , 11 , 64 ], health data such as heart disease [ 99 ], diabetes mellitus [ 86 , 147 ], COVID-19 [ 41 , 78 ], etc., agriculture and e-commerce data [ 128 , 150 ], and many more in various application domains. In “ Real-world application domains ”, we discuss ten potential real-world application domains of data science and analytics by taking into account data-driven smart computing and decision making, which can help the data scientists and application developers to explore more in various real-world issues.

Overall, the data used in data-driven applications can be any of the types mentioned above, and they can differ from one application to another in the real world. Data science modeling, which is briefly discussed below, can be used to analyze such data in a specific problem domain and derive insights or useful information from the data to build a data-driven model or data product.

Steps of Data Science Modeling

Data science is typically an umbrella term that encompasses advanced data analytics, data mining, machine and deep learning modeling, and several other related disciplines like statistics, to extract insights or useful knowledge from datasets and transform them into actionable business strategies, as mentioned earlier in “ Background and related work ”. Figure 2 shows an example of data science modeling starting from real-world data to data-driven product and automation. In the following, we briefly discuss each module of the data science process.

figure 2

An example of data science modeling from real-world data to data-driven system and decision making

Understanding business problems: This involves getting a clear understanding of the problem to be solved, how it impacts the relevant organization or individuals, the ultimate goals for addressing it, and the relevant project plan. Thus, to understand and identify the business problems, data scientists formulate relevant questions while working with the end-users and other stakeholders. For instance, how much/many, which category/group, is the behavior unrealistic/abnormal, which option should be taken, what action, etc. could be relevant questions depending on the nature of the problems. This helps to get a better idea of what the business needs and what should be extracted from the data. Such business knowledge, which enables organizations to enhance their decision-making process, is known as “Business Intelligence” [ 65 ]. Identifying the relevant data sources that can help to answer the formulated questions, and what kinds of actions should be taken based on the trends that the data shows, is another important task associated with this stage. Once the business problem has been clearly stated, the data scientist can define the analytic approach to solve the problem.

Understanding data: Data science is largely driven by the availability of data [ 114 ]. Thus a sound understanding of the data is needed to build a data-driven model or system. The reason is that real-world data sets are often noisy, contain missing values, or have inconsistencies or other data issues, which need to be handled effectively [ 101 ]. To gain actionable insights, appropriate data of sufficient quality must be sourced and cleansed, which is fundamental to any data science engagement. For this, a data assessment that evaluates what data is available and how it aligns with the business problem could be the first step in data understanding. Several aspects such as data type/format, whether the quantity of data is sufficient to extract useful knowledge, data relevance, authorized access to data, feature or attribute importance, combining multiple data sources, important metrics to report the data, etc. need to be taken into account to clearly understand the data for a particular business problem. Overall, the data understanding module involves figuring out what data is needed and the best ways to acquire it.

Data pre-processing and exploration: Exploratory data analysis is defined in data science as an approach to analyzing datasets to summarize their key characteristics, often with visual methods [ 135 ]. It examines a broad data collection to discover initial trends, attributes, points of interest, etc. in an unstructured manner to construct meaningful summaries of the data. Thus data exploration is typically used to figure out the gist of the data and to develop a first assessment of its quality, quantity, and characteristics. A statistical model may or may not be used, but primarily exploration offers tools for creating hypotheses by visualizing and interpreting the data through graphical representations such as charts, plots, histograms, etc. [ 72 , 91 ]. Before the data is ready for modeling, it’s necessary to use data summarization and visualization to audit the quality of the data and provide the information needed to process it. To ensure the quality of the data, data pre-processing, which is typically the process of cleaning and transforming raw data [ 107 ] before processing and analysis, is important. It also involves reformatting information, making data corrections, and merging data sets to enrich data. Thus, several aspects such as expected data, data cleaning, formatting or transforming data, dealing with missing values, handling data imbalance and bias issues, data distribution, searching for outliers or anomalies in data and dealing with them, ensuring data quality, etc. could be the key considerations in this step.
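
A minimal sketch of this step, assuming a single numeric column with missing entries, might summarize the observed values, fill the gaps with the median, and rescale the result. The `raw_ages` data and the helper names are hypothetical, and median imputation and min-max scaling are just two of many possible choices.

```python
from statistics import mean, median

# Hypothetical raw column: None marks a missing value, 120 looks suspicious.
raw_ages = [23, 25, None, 31, 29, None, 27, 120]


def summarize(values):
    """Exploratory summary of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    return {
        "count": len(observed),
        "missing": len(values) - len(observed),
        "mean": mean(observed),
        "median": median(observed),
        "min": min(observed),
        "max": max(observed),
    }


def impute_median(values):
    """Replace missing entries with the median of the observed entries."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


summary = summarize(raw_ages)      # audit the column before changing it
cleaned = impute_median(raw_ages)  # deal with missing values
scaled = min_max_scale(cleaned)    # transform to a common scale
```

The summary already hints at a data quality issue: the maximum (120) sits far from the median (28), which is the kind of finding that exploration is meant to surface before modeling.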

Machine learning modeling and evaluation: Once the data is prepared for building the model, data scientists design a model, algorithm, or set of models to address the business problem. Model building depends on what type of analytics, e.g., predictive analytics, is needed to solve the particular problem, which is discussed briefly in “ Advanced analytics methods and smart computing ”. To best fit the data according to the type of analytics, different types of data-driven or machine learning models, which have been summarized in our earlier paper Sarker et al. [ 105 ], can be built to achieve the goal. Data scientists typically separate the given dataset into training and test subsets, usually in the ratio of 80:20, or use the popular k -fold data splitting method [ 38 ]. This makes it possible to observe whether the model performs well on data it was not trained on, and to maximize the model performance. Various model validation and assessment metrics, such as error rate, accuracy, true positive, false positive, true negative, false negative, precision, recall, f-score, ROC (receiver operating characteristic curve) analysis, applicability analysis, etc. [ 38 , 115 ] are used to measure the model performance, which can guide data scientists in choosing or designing the learning method or model. Besides, machine learning experts or data scientists can take into account several advanced techniques such as feature engineering, feature selection or extraction methods, algorithm tuning, ensemble methods, modifying existing algorithms, or designing new algorithms, etc. to improve the ultimate data-driven model to solve a particular business problem through smart decision making.
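
The splitting and evaluation ideas above can be sketched in plain Python as follows. The function names, the fixed random seed, and the toy label vectors are assumptions made for illustration; libraries such as scikit-learn provide production-grade equivalents.

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=0):
    """Shuffle a copy of the rows and split them, 80:20 by default."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; each index is tested exactly once."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n) if not start <= i < stop]
        yield train_idx, test_idx
        start = stop


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F-score from the confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


train, test = train_test_split(list(range(10)))
folds = list(k_fold_indices(10, 5))

# Hypothetical true labels and model predictions for eight test examples.
metrics = classification_metrics([1, 0, 1, 1, 0, 0, 1, 0],
                                 [1, 0, 0, 1, 0, 1, 1, 0])
```

With k-fold splitting, the model would be trained and scored once per fold and the scores averaged, which gives a less split-dependent estimate than a single 80:20 partition.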

Data product and automation: A data product is typically the output of any data science activity [ 17 ]. A data product, in general terms, is a data deliverable, or a data-enabled or data-guided output, which can be a discovery, prediction, service, suggestion, insight into decision-making, thought, model, paradigm, tool, application, or system that processes data and generates results. Businesses can use the results of such data analysis to obtain useful information, like churn (a measure of how many customers stop using a product) prediction and customer segmentation, and use these results to make smarter business decisions and support automation. Thus, to make better decisions in various business problems, various machine learning pipelines and data products can be developed. To highlight this, we summarize several potential real-world data science application areas in “ Real-world application domains ”, where various data products can play a significant role in relevant business problems to make them smart and automated.

Overall, we can conclude that data science modeling can be used to help drive changes and improvements in business practices. A key part of the data science process is having a deeper understanding of the business problem to solve. Without that, it would be much harder to gather the right data and extract the most useful information from the data for making decisions to solve the problem. In terms of role, “Data Scientists” typically interpret and manage data to uncover the answers to major questions that help organizations to make objective decisions and solve complex problems. In summary, a data scientist proactively gathers and analyzes information from multiple sources to better understand how the business performs, and designs machine learning or data-driven tools, methods, or algorithms focused on advanced analytics, which can make today’s computing process smarter and more intelligent, as discussed briefly in the following section.

Advanced Analytics Methods and Smart Computing

As mentioned earlier in “ Background and related work ”, basic analytics provides a summary of data whereas advanced analytics takes a step forward in offering a deeper understanding of data and helps in granular data analysis. For instance, the predictive capabilities of advanced analytics can be used to forecast trends, events, and behaviors. Thus, “advanced analytics” can be defined as the autonomous or semi-autonomous analysis of data or content using advanced techniques and methods to discover deeper insights, make predictions, or produce recommendations, where machine learning-based analytical modeling is considered the key technology in the area. In the following section, we first summarize the various types of analytics and outcomes that are needed to solve the associated business problems, and then we briefly discuss machine learning-based analytical modeling.

Types of Analytics and Outcome

In the real-world business process, several key questions such as “What happened?”, “Why did it happen?”, “What will happen in the future?”, “What action should be taken?” are common and important. Based on these questions, in this paper, we categorize analytics into four types: descriptive, diagnostic, predictive, and prescriptive, which are discussed below.

Descriptive analytics: It is the interpretation of historical data to better understand the changes that have occurred in a business. Thus descriptive analytics answers the question, “what happened in the past?” by summarizing past data such as statistics on sales and operations or marketing strategies, use of social media, and engagement with Twitter, LinkedIn or Facebook, etc. For instance, by analyzing trends, patterns, and anomalies, etc., customers’ historical shopping data can be summarized to show how purchasing behavior has changed over time. Thus, descriptive analytics can play a significant role in providing an accurate picture of what has occurred in a business and how it relates to previous periods, utilizing a broad range of relevant business data. As a result, managers and decision-makers can pinpoint areas of strength and weakness in their business, and eventually can adopt more effective management strategies and business decisions.

Diagnostic analytics: It is a form of advanced analytics that examines data or content to answer the question, “why did it happen?” The goal of diagnostic analytics is to help find the root cause of the problem. For example, the human resource management department of a business organization may use diagnostic analytics to find the best applicant for a position, select them, and compare them with employees in other similar positions to see how well they perform. In a healthcare example, it might help to figure out whether the patients’ symptoms such as high fever, dry cough, headache, fatigue, etc. are all caused by the same infectious agent. Overall, diagnostic analytics enables one to extract value from the data by posing the right questions and conducting in-depth investigations into the answers. It is characterized by techniques such as drill-down, data discovery, data mining, and correlations.

Predictive analytics: Predictive analytics is an important analytical technique used by many organizations for various purposes such as to assess business risks, anticipate potential market patterns, and decide when maintenance is needed, to enhance their business. It is a form of advanced analytics that examines data or content to answer the question, “what will happen in the future?” Thus, the primary goal of predictive analytics is to identify and typically answer this question with a high degree of probability. Data scientists can use historical data as a source to extract insights for building predictive models using various regression analyses and machine learning techniques, which can be used in various application domains for a better outcome. Companies, for example, can use predictive analytics to minimize costs by better anticipating future demand and changing output and inventory, banks and other financial institutions to reduce fraud and risks by predicting suspicious activity, medical specialists to make effective decisions through predicting patients who are at risk of diseases, retailers to increase sales and customer satisfaction through understanding and predicting customer preferences, manufacturers to optimize production capacity through predicting maintenance requirements, and many more. Thus predictive analytics can be considered as the core analytical method within the area of data science.

Prescriptive analytics: Prescriptive analytics focuses on recommending the best way forward with actionable information to maximize overall returns and profitability, which typically answers the question, “what action should be taken?” In business analytics, prescriptive analytics is considered the final step. For its models, prescriptive analytics collects data from several descriptive and predictive sources and applies it to the decision-making process. Thus, we can say that it is related to both descriptive analytics and predictive analytics, but it emphasizes actionable insights instead of data monitoring. In other words, it can be considered the opposite of descriptive analytics, which examines decisions and outcomes after the fact. By integrating big data, machine learning, and business rules, prescriptive analytics helps organizations to make more informed decisions that lead to better business outcomes.

In summary, to clarify what happened and why it happened, both descriptive analytics and diagnostic analytics look at the past. Historical data is used by predictive analytics and prescriptive analytics to forecast what will happen in the future and what steps should be taken to influence those outcomes. In Table 1 , we have summarized these analytics methods with examples. Forward-thinking organizations in the real world can jointly use these analytical methods to make smart decisions that help drive changes and improvements in business processes. In the following, we discuss how machine learning techniques can play a big role in these analytical methods through their learning capabilities from the data.

Machine Learning Based Analytical Modeling

In this section, we briefly discuss various advanced analytics methods based on machine learning modeling, which can make the computing process smart through intelligent decision-making in a business process. Figure 3 shows a general structure of a machine learning-based predictive model considering both the training and testing phases. In the following, we discuss a wide range of methods such as regression and classification analysis, association rule analysis, time-series analysis, behavioral analysis, log analysis, and so on within the scope of our study.

figure 3

A general structure of a machine learning based predictive model considering both the training and testing phase

Regression Analysis

In data science, regression analysis is one of the most common statistical approaches used for predictive modeling and data mining tasks [ 38 ]. Regression analysis is a form of supervised machine learning that examines the relationship between a dependent variable (target) and one or more independent variables (predictors) to predict continuous-valued output [ 105 , 117 ]. Eqs. 1 , 2 , and 3 [ 85 , 105 ] represent the simple, multiple or multivariate, and polynomial regressions respectively, where x represents the independent variable and y is the predicted/target output mentioned above:
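
In their standard textbook form, the three regression models referred to as Eqs. 1, 2, and 3 can be written as follows. The symbols for the intercept (α), the coefficients (β), and the error term (ε) are notational assumptions here and may differ from those used in [ 85 , 105 ].

```latex
\begin{align}
y &= \alpha + \beta x + \epsilon \tag{1} \\
y &= \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \epsilon \tag{2} \\
y &= \alpha + \beta_1 x + \beta_2 x^2 + \cdots + \beta_n x^n + \epsilon \tag{3}
\end{align}
```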

Regression analysis is typically conducted for one of two purposes: to predict the value of the dependent variable for individuals for whom some knowledge of the explanatory variables is available, or to estimate the effect of some explanatory variable on the dependent variable, i.e., finding the relationship of causal influence between the variables. Linear regression fits non-linear data poorly and may cause an underfitting problem. In that case, polynomial regression performs better, but it increases the model complexity. Regularization techniques such as Ridge, Lasso, Elastic-Net, etc. [ 85 , 105 ] can be used to constrain the coefficients of the linear regression model and reduce overfitting. Besides, support vector regression, decision tree regression, and random forest regression techniques [ 85 , 105 ] can be used for building effective regression models depending on the problem type, e.g., non-linear tasks. Financial forecasting or prediction, cost estimation, trend analysis, marketing, time-series estimation, drug response modeling, etc. are some examples where regression models can be used to solve real-world problems in the domain of data science and analytics.
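
As a minimal sketch of simple linear regression and of the shrinking effect of Ridge regularization, the closed-form least-squares fit for one predictor can be written as below. The `fit_line` helper, the `ridge_lambda` parameter name, and the toy data are assumptions for illustration; the penalty is applied to the slope only, with the intercept left unpenalized.

```python
def fit_line(xs, ys, ridge_lambda=0.0):
    """Fit y = intercept + slope * x by least squares.

    With ridge_lambda > 0 the slope is shrunk towards zero (Ridge penalty
    on the slope only); ridge_lambda = 0 gives ordinary least squares.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / (sxx + ridge_lambda)
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(intercept, slope, x):
    """Predict the continuous-valued target for a new x."""
    return intercept + slope * x


# Hypothetical data generated from y = 1 + 2x without noise.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

ols_intercept, ols_slope = fit_line(xs, ys)
ridge_intercept, ridge_slope = fit_line(xs, ys, ridge_lambda=10.0)
```

On this noise-free data the ordinary fit recovers the generating line, while the Ridge fit trades some accuracy on the training points for a smaller coefficient, which is the mechanism by which regularization limits model complexity.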

Classification Analysis

Classification is one of the most widely used and best-known data science processes. It is a form of supervised machine learning that refers to a predictive modeling problem in which a class label is predicted for a given example [ 38 ]. Spam identification, such as ‘spam’ and ‘not spam’ in email service providers, can be an example of a classification problem. There are several forms of classification analysis available in the area: binary classification, which refers to the prediction of one of two classes; multi-class classification, which involves the prediction of one of more than two classes; and multi-label classification, a generalization of multi-class classification in which an example may be assigned more than one class label at the same time [ 105 ].

figure 4

An example of a random forest structure considering multiple decision trees

Several popular classification techniques, such as k-nearest neighbors [ 5 ], support vector machines [ 55 ], naive Bayes [ 49 ], adaptive boosting [ 32 ], extreme gradient boosting [ 85 ], logistic regression [ 66 ], decision trees such as ID3 [ 92 ] and C4.5 [ 93 ], and random forests [ 13 ] exist to solve classification problems. Tree-based classification techniques, e.g., a random forest considering multiple decision trees, perform better than others on real-world problems in many cases due to their capability of producing logic rules [ 103 , 115 ]. Figure 4 shows an example of a random forest structure considering multiple decision trees. In addition, BehavDT recently proposed by Sarker et al. [ 109 ], and IntrudTree [ 106 ] can be used for building effective classification or prediction models in the relevant tasks within the domain of data science and analytics.
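
To make the idea of predicting a class label concrete, the following is a minimal k-nearest neighbors sketch for a binary spam problem. The two features (number of links and number of exclamation marks per message), the training examples, and the `knn_predict` helper are hypothetical.

```python
import math
from collections import Counter

# Hypothetical training set: (number of links, number of "!") -> label.
training_set = [
    ((0, 0), "not spam"),
    ((1, 0), "not spam"),
    ((0, 1), "not spam"),
    ((5, 4), "spam"),
    ((6, 5), "spam"),
    ((4, 6), "spam"),
]


def knn_predict(training_set, query, k=3):
    """Label a query point by majority vote among its k nearest neighbors."""
    neighbors = sorted(
        training_set,
        key=lambda item: math.dist(item[0], query),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


label_a = knn_predict(training_set, (5, 5))
label_b = knn_predict(training_set, (1, 1))
```

An odd value of k avoids tied votes in the binary case; for multi-class problems a tie-breaking rule would be needed.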

Cluster Analysis

Clustering is a form of unsupervised machine learning and is well-known in many data science application areas for statistical data analysis [ 38 ]. Usually, clustering techniques search for structure inside a dataset and, when class labels are not known in advance, group the cases into homogeneous clusters. This means that data points are similar to each other within a cluster, and different from data points in another cluster. Overall, the purpose of cluster analysis is to sort various data points into groups (or clusters) that are homogeneous internally and heterogeneous externally [ 105 ]. Clustering is often used to gain insight into how data is distributed in a given dataset, or as a preprocessing phase for other algorithms. Data clustering, for example, assists with analyzing customer shopping behavior, sales campaigns, and retention of consumers for retail businesses, anomaly detection, etc.

Many clustering algorithms with the ability to group data have been proposed in the machine learning and data science literature [ 98 , 138 , 141 ]. In our earlier paper Sarker et al. [ 105 ], we have summarized these from several perspectives, such as partitioning methods, density-based methods, hierarchical-based methods, model-based methods, etc. In the literature, the popular K-means [ 75 ], K-Medoids [ 84 ], CLARA [ 54 ], etc. are known as partitioning methods; DBSCAN [ 30 ], OPTICS [ 8 ], etc. are known as density-based methods; single linkage [ 122 ], complete linkage [ 123 ], etc. are known as hierarchical methods. In addition, grid-based clustering methods, such as STING [ 134 ], CLIQUE [ 2 ], etc.; model-based clustering such as neural network learning [ 141 ], GMM [ 94 ], SOM [ 18 , 104 ], etc.; and constraint-based methods such as COP K-means [ 131 ], CMWK-Means [ 25 ], etc. are used in the area. Recently, Sarker et al. [ 111 ] proposed a hierarchical clustering method, BOTS, based on a bottom-up agglomerative technique for capturing users’ similar behavioral characteristics over time. The key benefit of agglomerative hierarchical clustering is that the tree-structured hierarchy it creates is more informative than an unstructured set of flat clusters, which can assist in better decision-making in relevant application areas in data science.
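
As an illustration of a partitioning method, the sketch below implements the basic K-means loop: assign each point to its nearest centroid, then move each centroid to the mean of its members. The points and the fixed initial centroids are hypothetical; practical implementations usually choose the initial centroids randomly or with a seeding heuristic and run several restarts.

```python
import math


def k_means(points, centroids, max_iterations=10):
    """Basic K-means; returns the final centroids and the clusters."""
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments will not change any more
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D data with two visually separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centroids, final_clusters = k_means(points, [(1, 1), (8, 8)])
```

The result is a flat partition: points within a cluster are close to their own centroid and far from the other one, which matches the goal of groups that are homogeneous internally and heterogeneous externally.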

Association Rule Analysis

Association rule learning is a rule-based, unsupervised machine learning method that is typically used to establish relationships among variables. It is a descriptive technique often used to analyze large datasets for discovering interesting relationships or patterns. The association learning technique’s main strength is its comprehensiveness, as it produces all associations that meet user-specified constraints, including minimum support and confidence values [ 138 ].

Association rules allow a data scientist to identify trends, associations, and co-occurrences between items inside large data collections. In a supermarket, for example, associations infer knowledge about the buying behavior of consumers for different items, which helps to adjust the marketing and sales plan. In healthcare, physicians may use association rules to better diagnose patients. Doctors can assess the conditional likelihood of a given illness by comparing symptom associations in the data from previous cases using association rules and machine learning-based data analysis. Similarly, association rules are useful for consumer behavior analysis and prediction, customer market analysis, bioinformatics, weblog mining, recommendation systems, etc.

Several types of association rules have been proposed in the area, such as frequent pattern based [ 4 , 47 , 73 ], logic-based [ 31 ], tree-based [ 39 ], fuzzy-rules [ 126 ], belief rule [ 148 ] etc. The rule learning techniques such as AIS [ 3 ], Apriori [ 4 ], Apriori-TID and Apriori-Hybrid [ 4 ], FP-Tree [ 39 ], Eclat [ 144 ], RARM [ 24 ] exist to solve the relevant business problems. Apriori [ 4 ] is the most commonly used algorithm for discovering association rules from a given dataset among the association rule learning techniques [ 145 ]. The recent association rule-learning technique ABC-RuleMiner proposed in our earlier paper by Sarker et al. [ 113 ] could give significant results in terms of generating non-redundant rules that can be used for smart decision making according to human preferences, within the area of data science applications.
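
The support and confidence constraints mentioned above can be illustrated with a small sketch that enumerates rules between pairs of items. This is a brute-force illustration rather than an implementation of Apriori or any of the cited algorithms; the transactions, thresholds, and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical supermarket baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def pair_rules(transactions, min_support, min_confidence):
    """Return (lhs, rhs, support, confidence) for rules lhs -> rhs."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, transactions)
        if pair_support < min_support:
            continue  # infrequent pair: no rule can be built from it
        for lhs, rhs in ((a, b), (b, a)):
            # confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
            confidence = pair_support / support({lhs}, transactions)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


rules = pair_rules(transactions, min_support=0.5, min_confidence=0.7)
```

Every pair involving eggs is discarded by the minimum support threshold even though a rule such as eggs to bread would have high confidence, which shows why both constraints are used together.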

Time-Series Analysis and Forecasting

A time series is typically a series of data points indexed in time order, particularly by date or timestamp [ 111 ]. Depending on the frequency, time series can be of different types, such as annual, e.g., annual budget; quarterly, e.g., expenditure; monthly, e.g., air traffic; weekly, e.g., sales quantity; daily, e.g., weather; hourly, e.g., stock price; minute-wise, e.g., inbound calls in a call center; and even second-wise, e.g., web traffic, and so on in relevant domains.

A mathematical method dealing with such time-series data, or the procedure of fitting a time series to a proper model, is termed time-series analysis. Many different time series forecasting algorithms and analysis methods can be applied to extract the relevant information. For instance, to do time-series forecasting for future patterns, the autoregressive (AR) model [ 130 ] learns the behavioral trends or patterns of past data. The moving average (MA) model [ 40 ] is another simple and common approach used in time series analysis and forecasting, which uses past forecast errors in a regression-like model to capture an averaged trend across the data. The autoregressive moving average (ARMA) model [ 12 , 120 ] combines these two approaches, where the autoregressive part extracts the momentum and pattern of the trend and the moving average part captures the noise effects. The most popular and frequently used time-series model is the autoregressive integrated moving average (ARIMA) model [ 12 , 120 ]. The ARIMA model, a generalization of the ARMA model, is more flexible than other statistical models such as exponential smoothing or simple linear regression. In terms of data, the ARMA model can only be used for stationary time-series data, while the ARIMA model handles the non-stationary case as well by differencing the series. Similarly, the seasonal autoregressive integrated moving average (SARIMA), autoregressive fractionally integrated moving average (ARFIMA), and autoregressive moving average with exogenous inputs (ARMAX) models are also used in time-series modeling [ 120 ].
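
The relationship between the autoregressive part and the differencing ("integrated") step can be sketched as follows: difference a trending series once, fit a first-order autoregressive model to the differences by least squares, and add the forecast difference back to the last observed level. This is a simplified illustration, not a full ARIMA estimation procedure; the series and the helper names are hypothetical, and there is no moving average term.

```python
def difference(series):
    """First-order differencing: the 'integrated' step of ARIMA."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def fit_ar1(series):
    """Least-squares fit of x_t = c + phi * x_(t-1)."""
    previous, current = series[:-1], series[1:]
    n = len(previous)
    p_bar = sum(previous) / n
    c_bar = sum(current) / n
    sxy = sum((p - p_bar) * (q - c_bar) for p, q in zip(previous, current))
    sxx = sum((p - p_bar) ** 2 for p in previous)
    phi = sxy / sxx
    c = c_bar - phi * p_bar
    return c, phi


def forecast_next_level(series):
    """One-step forecast: model the differences, then undo the differencing."""
    diffs = difference(series)
    c, phi = fit_ar1(diffs)
    next_diff = c + phi * diffs[-1]
    return series[-1] + next_diff


# Hypothetical trending series whose increments oscillate around 3.
weekly_sales = [10, 12, 15.5, 18.25, 21.375, 24.3125]

increments = difference(weekly_sales)
c, phi = fit_ar1(increments)
next_value = forecast_next_level(weekly_sales)
```

The raw series keeps growing, so it is not stationary, whereas its increments fluctuate around a stable level and can be modeled by the autoregressive part.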

figure 5

An example of producing aggregate time segments from initial time slices based on similar behavioral characteristics

In addition to the stochastic methods for time-series modeling and forecasting, machine and deep learning-based approaches can be used for effective time-series analysis and forecasting. For instance, in our earlier paper, Sarker et al. [ 111 ] present a bottom-up clustering-based time-series analysis to capture the mobile usage behavioral patterns of the users. Figure 5 shows an example of producing aggregate time segments Seg_i from initial time slices TS_i based on similar behavioral characteristics that are used in our bottom-up clustering approach, where D represents the dominant behavior BH_i of the users, mentioned above [ 111 ]. The authors in [ 118 ] used a long short-term memory (LSTM) model, a kind of recurrent neural network (RNN) deep learning model, for time-series forecasting and report that it outperforms traditional approaches such as the ARIMA model. Time-series analysis is commonly used these days in various fields such as finance, manufacturing, business, social media, event data (e.g., clickstreams and system events), IoT and smartphone data, and generally in any applied science and engineering domain involving temporal measurements. Thus, it covers a wide range of application areas in data science.

Opinion Mining and Sentiment Analysis

Sentiment analysis or opinion mining is the computational study of the opinions, thoughts, emotions, assessments, and attitudes of people towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes [ 71 ]. Sentiments are commonly grouped into three kinds, positive, negative, and neutral, and can be extended with more specific feelings such as angry, happy, and sad, or interested or not interested, etc. More refined sentiments to evaluate the feelings of individuals in various situations can also be defined according to the problem domain.

Although the task of opinion mining and sentiment analysis is very challenging from a technical point of view, it’s very useful in real-world practice. For instance, a business always aims to obtain opinions from the public or customers about its products and services to refine its business policy and make better business decisions. It can thus benefit a business to understand the social opinion of its brand, product, or service. Besides, potential customers want to know what existing consumers think after using a service or purchasing a product. Document level, sentence level, aspect level, and concept level are the possible levels of opinion mining in the area [ 45 ].

Several popular techniques such as lexicon-based including dictionary-based and corpus-based methods, machine learning including supervised and unsupervised learning, deep learning, and hybrid methods are used in sentiment analysis-related tasks [ 70 ]. To systematically define, extract, measure, and analyze affective states and subjective knowledge, it incorporates the use of statistics, natural language processing (NLP), machine learning as well as deep learning methods. Sentiment analysis is widely used in many applications, such as reviews and survey data, web and social media, and healthcare content, ranging from marketing and customer support to clinical practice. Thus sentiment analysis has a big influence in many data science applications, where public sentiment is involved in various real-world issues.
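
A dictionary-based variant of the lexicon approach can be sketched in a few lines: each word carries a polarity score, the scores are summed over the text, and the sign of the total gives the label. The tiny lexicon, the negation list, and the rule that a negation word flips only the word immediately after it are simplifying assumptions for illustration.

```python
import re

# Hypothetical toy lexicon: word -> polarity score.
LEXICON = {
    "good": 1, "great": 2, "love": 2,
    "bad": -1, "slow": -1, "terrible": -2, "hate": -2,
}
NEGATIONS = {"not", "never", "no"}


def sentiment_score(text):
    """Sum lexicon scores; a negation flips the word that directly follows."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        if token in LEXICON:
            score += -LEXICON[token] if negate else LEXICON[token]
        negate = False  # the negation only reaches the next token
    return score


def sentiment_label(text):
    """Map the numeric score to positive, negative, or neutral."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


review_labels = [
    sentiment_label("I love this phone, the camera is great"),
    sentiment_label("The battery is not good"),
    sentiment_label("The delivery arrived on Monday"),
]
```

Supervised and deep learning methods replace the hand-written lexicon with weights learned from labeled examples, which lets them handle context that a word-level dictionary misses.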

Behavioral Data and Cohort Analysis

Behavioral analytics is a recent trend that typically reveals new insights into e-commerce sites, online gaming, mobile and smartphone applications, IoT user behavior, and many more [ 112 ]. Behavioral analysis aims to understand how and why consumers or users behave as they do, allowing accurate predictions of how they are likely to behave in the future. For instance, it allows advertisers to make the best offers to the right client segments at the right time. Behavioral analytics uses the large quantities of raw user event information gathered during sessions in which people use apps, games, or websites, including traffic data such as navigation paths, clicks, social media interactions, purchase decisions, and marketing responsiveness. In our earlier papers Sarker et al. [ 101 , 111 , 113 ] we have discussed how to extract users’ phone usage behavioral patterns utilizing real-life phone log data for various purposes.

In the real-world scenario, behavioral analytics is often used in e-commerce, social media, call centers, billing systems, IoT systems, political campaigns, and other applications, to find opportunities for optimization to achieve particular outcomes. Cohort analysis is a branch of behavioral analytics that involves studying groups of people over time to see how their behavior changes. For instance, it takes data from a given data set (e.g., an e-commerce website, web application, or online game) and separates it into related groups for analysis. Various machine learning techniques such as behavioral data clustering [ 111 ], behavioral decision tree classification [ 109 ], behavioral association rules [ 113 ], etc. can be used in the area according to the goal. Besides, the concept of RecencyMiner, proposed in our earlier paper Sarker et al. [ 108 ], which takes into account recent behavioral patterns, could be effective while analyzing behavioral data, as behavior may not be static and can change over time in the real world.
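
A minimal cohort analysis sketch, assuming an event log of (user, signup month, active month) records, groups users by signup month and measures what fraction of each group is still active a given number of months later. The event log and the `retention_table` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical event log: (user id, signup month, month of activity).
events = [
    ("u1", 1, 1), ("u1", 1, 2), ("u1", 1, 3),
    ("u2", 1, 1), ("u2", 1, 2),
    ("u3", 2, 2), ("u3", 2, 3),
    ("u4", 2, 2),
]


def retention_table(events):
    """Return {signup month: {months since signup: share of cohort active}}."""
    cohort_users = defaultdict(set)  # signup month -> users in that cohort
    active_users = defaultdict(set)  # (signup month, offset) -> active users
    for user, signup_month, active_month in events:
        cohort_users[signup_month].add(user)
        active_users[(signup_month, active_month - signup_month)].add(user)
    table = {}
    for (signup_month, offset), users in active_users.items():
        cohort_size = len(cohort_users[signup_month])
        table.setdefault(signup_month, {})[offset] = len(users) / cohort_size
    return table


retention = retention_table(events)
```

Reading the table row by row shows how each cohort's behavior changes over time, for example that half of the month-1 cohort is still active two months after signing up.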

Anomaly Detection or Outlier Analysis

Anomaly detection, also known as outlier analysis, is a data mining step that detects data points, events, and/or findings that deviate from the regularities or normal behavior of a dataset. Anomalies are usually referred to as outliers, abnormalities, novelties, noise, inconsistencies, irregularities, and exceptions [ 63 , 114 ]. Anomaly detection techniques may identify new situations or cases as deviant by analyzing the patterns in historical data. For instance, identifying fraud or irregular transactions in finance is an example of anomaly detection.

It is often used in preprocessing to remove anomalous or inconsistent records from real-world data collected from various sources, including user logs, devices, networks, and servers. Several machine learning techniques can be used for anomaly detection, such as k-nearest neighbors, isolation forests, cluster analysis, etc. [ 105 ]. Excluding anomalous data from the dataset can also yield a statistically significant improvement in accuracy during supervised learning [ 101 ]. However, extracting appropriate features, identifying normal behaviors, managing imbalanced data distributions, addressing variations in abnormal behavior, handling the sparse occurrence of abnormal events, coping with environmental variations, etc. can be challenging in the process of anomaly detection. Anomaly detection is applicable in a variety of domains such as cybersecurity analytics, intrusion detection, fraud detection, fault detection, health analytics, identifying irregularities, detecting ecosystem disturbances, and many more. Anomaly detection can thus be considered a significant task for building effective systems with higher accuracy within the area of data science.
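To make the k-nearest-neighbor approach mentioned above concrete, here is a minimal distance-based sketch: each point is scored by its mean distance to its k nearest neighbors, and points whose score is far above the median score are flagged. The sample points, the choice of k, and the threshold `factor` are assumptions made for illustration, not values from the cited works.

```python
import math
import statistics


def knn_outlier_scores(points, k=2):
    """Score each point by its mean Euclidean distance to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def flag_outliers(points, k=2, factor=3.0):
    """Flag points whose score exceeds `factor` times the median score.

    The median-based threshold is an illustrative assumption; in practice the
    cut-off is usually tuned on validation data.
    """
    scores = knn_outlier_scores(points, k)
    median = statistics.median(scores)
    return [score > factor * median for score in scores]


# Hypothetical 2D observations: five close together and one far away.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (1.0, 0.8), (8.0, 8.0)]
flags = flag_outliers(points)
print([p for p, is_outlier in zip(points, flags) if is_outlier])
```

The pairwise distance computation is quadratic in the number of points, so this form is only suitable for small datasets; tree-based indexes or isolation forests are the usual choice at scale.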

Factor Analysis

Factor analysis is a collection of techniques for describing the relationships or correlations between variables in terms of more fundamental entities known as factors [ 23 ]. It is usually used to organize variables into a small number of clusters based on their common variance, using mathematical or statistical procedures. The goals of factor analysis are to determine the number of fundamental influences underlying a set of variables, calculate the degree to which each variable is associated with the factors, and learn more about the nature of the factors by examining which factors contribute to which variables. The broad purpose of factor analysis is to summarize data so that relationships and patterns can be easily interpreted and understood [ 143 ].

Exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) are the two most popular factor analysis techniques. EFA seeks to discover complex trends by analyzing the dataset and testing predictions, while CFA tries to validate hypotheses and uses path analysis diagrams to represent variables and factors [ 143 ]. Factor analysis is one of the unsupervised machine learning algorithms used for reducing dimensionality. The most common methods for factor analysis are principal components analysis (PCA), principal axis factoring (PAF), and maximum likelihood (ML) [ 48 ]. Methods of correlation analysis such as Pearson correlation, canonical correlation, etc. may also be useful in the field, as they quantify the statistical relationship, or association, between variables. Factor analysis is commonly used in finance, marketing, advertising, product management, psychology, and operations research, and thus can be considered another significant analytical method within the area of data science.
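The relationship between correlation analysis and component extraction can be sketched as follows: compute the Pearson correlation matrix of a few variables, then approximate its dominant eigenvector by power iteration, which gives the loadings of the first principal component. This is only an illustration of the PCA step named above; it does not perform rotation or estimate unique variances as a full factor analysis would, and the three score lists are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)  # undefined for constant sequences


def correlation_matrix(columns):
    """Pairwise Pearson correlations between all columns."""
    return [[pearson(a, b) for b in columns] for a in columns]


def first_component(matrix, iterations=200):
    """Approximate the dominant eigenvector of a symmetric matrix by power iteration."""
    v = [1.0] * len(matrix)
    for _ in range(iterations):
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in matrix]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


# Hypothetical scores of five students in three subjects.
algebra = [2, 4, 6, 8, 10]
physics = [1, 3, 5, 9, 10]
art = [5, 3, 6, 2, 4]

corr = correlation_matrix([algebra, physics, art])
loadings = first_component(corr)
print([round(value, 2) for value in loadings])
```

In this toy data the two strongly correlated variables load together on the first component, which is the kind of grouping by common variance that factor analysis is meant to expose.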

Log Analysis

Logs are commonly used in system management, as they are often the only available data that record detailed system runtime activities or behaviors in production [ 44 ]. Log analysis can thus be considered the method of analyzing, interpreting, and understanding computer-generated records or messages, also known as logs. These can be device logs, server logs, system logs, network logs, event logs, audit trails, audit records, etc. The process of creating such records is called data logging.

Logs are generated by a wide variety of programmable technologies, including networking devices, operating systems, software, and more. Phone call logs [ 88 , 110 ], SMS logs [ 28 ], mobile app usage logs [ 124 , 149 ], notification logs [ 77 ], game logs [ 82 ], context logs [ 16 , 149 ], web logs [ 37 ], smartphone life logs [ 95 ], etc. are some examples of log data for smartphone devices. The main characteristic of these log data is that they contain users’ actual behavioral activities with their devices. Other similar log data include search logs [ 50 , 133 ], application logs [ 26 ], server logs [ 33 ], network logs [ 57 ], event logs [ 83 ], network and security logs [ 142 ], etc.

Several techniques such as classification and tagging, correlation analysis, pattern recognition methods, anomaly detection methods, machine learning modeling, etc. [ 105 ] can be used for effective log analysis. Log analysis can assist in compliance with security policies and industry regulations, as well as provide a better user experience by facilitating the troubleshooting of technical problems and identifying areas where efficiency can be improved. For instance, web servers use log files to record data about website visitors. Windows event log analysis can help an investigator draw a timeline based on the logging information and the discovered artifacts. Overall, advanced analytics methods that take machine learning modeling into account can play a significant role in extracting insightful patterns from these log data, which can be used for building automated and smart applications, and thus log analysis can be considered a key working area in data science.
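As an illustration of the web server example above, the sketch below parses a few access-log lines and derives simple tags and counts from them. The sample lines, the assumption that they follow the Common Log Format, and the regular expression are ours and are not taken from the cited works; production logs often need a more tolerant parser.

```python
import re
from collections import Counter

# Hypothetical access-log lines, assumed to follow the Common Log Format.
LOG_LINES = [
    '192.0.2.1 - - [10/Oct/2021:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.2 - - [10/Oct/2021:13:56:01 +0000] "GET /missing HTTP/1.1" 404 153',
    '192.0.2.1 - - [10/Oct/2021:13:57:12 +0000] "POST /login HTTP/1.1" 500 0',
    'malformed line',
    '192.0.2.3 - - [10/Oct/2021:13:58:40 +0000] "GET /index.html HTTP/1.1" 200 2326',
]

PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse(lines):
    """Turn raw lines into dictionaries, skipping lines that do not match."""
    records = []
    for line in lines:
        match = PATTERN.match(line)
        if match:
            records.append(match.groupdict())
    return records


records = parse(LOG_LINES)

# Simple tagging: count responses per status code and list failing paths.
status_counts = Counter(record["status"] for record in records)
error_paths = [r["path"] for r in records if r["status"].startswith(("4", "5"))]

print(status_counts)
print(error_paths)
```

Structured records of this kind are the usual input to the correlation, pattern recognition, and anomaly detection methods listed above.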

Neural Networks and Deep Learning Analysis

Deep learning is a form of machine learning that uses artificial neural networks to create a computational architecture that learns from data by combining multiple processing layers, such as the input, hidden, and output layers [ 38 ]. The key benefit of deep learning over conventional machine learning methods is that it performs better in a variety of situations, particularly when learning from large datasets [ 114 , 140 ].

The most common deep learning algorithms are the multi-layer perceptron (MLP) [ 85 ], the convolutional neural network (CNN or ConvNet) [ 67 ], and the long short-term memory recurrent neural network (LSTM-RNN) [ 34 ]. Figure 6 shows a structure of an artificial neural network modeling with multiple processing layers. The backpropagation technique [ 38 ] is used to adjust the weight values internally while building the model. Convolutional neural networks (CNNs) [ 67 ], which include convolutional layers, pooling layers, and fully connected layers, improve on the design of traditional artificial neural networks (ANNs). They are commonly used in a variety of fields, including natural language processing, speech recognition, image processing, and other autocorrelated data, since they take advantage of the two-dimensional (2D) structure of the input data. AlexNet [ 60 ], Xception [ 21 ], Inception [ 125 ], Visual Geometry Group (VGG) [ 42 ], ResNet [ 43 ], and other advanced deep learning models based on CNN are also used in the field.
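How backpropagation adjusts the weights of a multi-layer perceptron can be sketched with a very small network written in plain Python. The network size, learning rate, number of epochs, random seed, and the XOR toy data are all assumptions chosen for illustration; a practical model would use a numerical library and a validated training setup.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """Two inputs, one hidden layer, one sigmoid output, trained by backpropagation."""

    def __init__(self, n_hidden=3, seed=0):
        rng = random.Random(seed)
        # Each hidden unit holds two input weights followed by a bias.
        self.w_hidden = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(n_hidden)]
        # The output unit holds one weight per hidden unit followed by a bias.
        self.w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in self.w_hidden]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden)) + self.w_out[-1])
        return hidden, out

    def train_step(self, x, target, lr=0.5):
        hidden, out = self.forward(x)
        # Error term at the output for squared error with a sigmoid activation.
        delta_out = (out - target) * out * (1 - out)
        # Propagate the error back to the hidden units using the current weights.
        delta_hidden = [delta_out * self.w_out[i] * h * (1 - h) for i, h in enumerate(hidden)]
        # Gradient descent updates for the output layer.
        for i, h in enumerate(hidden):
            self.w_out[i] -= lr * delta_out * h
        self.w_out[-1] -= lr * delta_out
        # Gradient descent updates for the hidden layer.
        for i, d in enumerate(delta_hidden):
            self.w_hidden[i][0] -= lr * d * x[0]
            self.w_hidden[i][1] -= lr * d * x[1]
            self.w_hidden[i][2] -= lr * d


def mean_squared_error(net, data):
    return sum((net.forward(x)[1] - t) ** 2 for x, t in data) / len(data)


# XOR is a common toy problem that a network without a hidden layer cannot fit.
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

net = TinyMLP()
loss_before = mean_squared_error(net, XOR)
for _ in range(5000):
    for x, t in XOR:
        net.train_step(x, t)
loss_after = mean_squared_error(net, XOR)
print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

The two `delta` terms are the core of the technique: the output error is computed first and then pushed backwards through the output weights to obtain the hidden-layer error, after which both layers are updated.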

In addition to CNN, the recurrent neural network (RNN) architecture is another popular method used in deep learning. Long short-term memory (LSTM) is a popular type of recurrent neural network architecture used broadly in the area of deep learning. Unlike traditional feed-forward neural networks, LSTM has feedback connections. Thus, LSTM networks are well-suited for analyzing and learning sequential data, such as classifying, sorting, and predicting data based on time-series data. Therefore, when the data is in a sequential format, such as time series or sentences, LSTM can be used, and it is widely used in the areas of time-series analysis, natural language processing, speech recognition, and so on.

Figure 6. A structure of an artificial neural network modeling with multiple processing layers

In addition to the most popular deep learning methods mentioned above, several other deep learning approaches [ 104 ] exist in the field for various purposes. The self-organizing map (SOM) [ 58 ], for example, uses unsupervised learning to represent high-dimensional data as a 2D grid map, reducing dimensionality. Another learning technique that is commonly used for dimensionality reduction and feature extraction in unsupervised learning tasks is the autoencoder (AE) [ 10 ]. Restricted Boltzmann machines (RBM) can be used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling, according to [ 46 ]. A deep belief network (DBN) is usually made up of a backpropagation neural network (BPNN) and unsupervised networks like restricted Boltzmann machines (RBMs) or autoencoders [ 136 ]. A generative adversarial network (GAN) [ 35 ] is a deep learning network that can produce data with characteristics that are similar to the input data. Transfer learning, which is usually the re-use of a pre-trained model on a new problem, is currently widely used because it can train deep neural networks with a small amount of data [ 137 ]. These deep learning methods can perform well, particularly when learning from large-scale datasets [ 105 , 140 ]. In our previous article, Sarker et al. [ 104 ], we gave a brief discussion of the various artificial neural network (ANN) and deep learning (DL) models mentioned above, which can be used in a variety of data science and analytics tasks.

Real-World Application Domains

Almost every industry or organization is impacted by data, and thus “Data Science”, including advanced analytics with machine learning modeling, can be used in business, marketing, finance, IoT systems, cybersecurity, urban management, health care, government policies, and every other industry where data is generated. In the following, we discuss the ten most popular application areas based on data science and analytics.

Business or financial data science: In general, business data science can be considered as the study of business or e-commerce data to obtain insights about a business that can typically lead to smart decision-making as well as high-quality actions [ 90 ]. Data scientists can develop algorithms or data-driven models that predict customer behavior and identify patterns and trends based on historical business data, which can help companies to reduce costs, improve service delivery, and generate recommendations for better decision-making. Eventually, business automation, intelligence, and efficiency can be achieved through the data science process discussed earlier, where various advanced analytics methods and machine learning modeling based on the collected data are the keys. Many online retailers, such as Amazon [ 76 ], can improve inventory management, avoid out-of-stock situations, and optimize logistics and warehousing using predictive modeling based on machine learning techniques [ 105 ]. In finance, institutions use historical data to make high-stakes business decisions, mostly for risk management, fraud prevention, credit allocation, customer analytics, personalized services, algorithmic trading, etc. Overall, data science methodologies can play a key role in the next generation of business and finance, particularly in terms of business automation, intelligence, and smart decision-making and systems.

Manufacturing or industrial data science: To compete in global production capability, quality, and cost, manufacturing industries have gone through many industrial revolutions [ 14 ]. The latest, the fourth industrial revolution, also known as Industry 4.0, is the emerging trend of automation and data exchange in manufacturing technology. Thus industrial data science, which is the study of industrial data to obtain insights that can typically lead to optimizing industrial applications, can play a vital role in this revolution. Manufacturing industries generate a large amount of data from various sources such as sensors, devices, networks, systems, and applications [ 6 , 68 ]. The main categories of industrial data include large-scale data devices, life-cycle production data, enterprise operation data, manufacturing value chain sources, and collaboration data from external sources [ 132 ]. The data needs to be processed, analyzed, and secured to help improve the system’s efficiency, safety, and scalability. Data science modeling can thus be used to maximize production, reduce costs, and raise profits in manufacturing industries.

Medical or health data science: Healthcare is one of the most notable fields where data science is making major improvements. Health data science involves the extrapolation of actionable insights from sets of patient data, typically collected from electronic health records. To help organizations improve the quality of treatment, lower the cost of care, and improve the patient experience, data can be obtained from several sources for analysis, e.g., electronic health records, billing claims, cost estimates, and patient satisfaction surveys. In practice, healthcare analytics using machine learning modeling can minimize medical costs, predict infectious outbreaks, help prevent avoidable diseases, and generally improve the quality of life [ 81 , 119 ]. Across the global population, the average human lifespan is growing, presenting new challenges to today’s methods of care delivery. Thus health data science modeling can play a role in analyzing current and historical data to predict trends, improve services, and even better monitor the spread of diseases. Eventually, it may lead to new approaches to improve patient care, clinical expertise, diagnosis, and management.

IoT data science: The Internet of things (IoT) [ 9 ] is a revolutionary technical field that turns every electronic system into a smarter one and is therefore considered to be the big frontier that can enhance almost all activities in our lives. Machine learning has become a key technology for IoT applications because it uses expertise to identify patterns and generate models that help predict future behavior and events [ 112 ]. One of the IoT’s main fields of application is the smart city, which uses technology to improve city services and citizens’ living experiences. For example, given the relevant data, data science methods can be used for traffic prediction in smart cities or to estimate the total energy usage of citizens over a particular period. Deep learning-based models in data science can be built on large-scale IoT datasets [ 7 , 104 ]. Overall, data science and analytics approaches can aid modeling in a variety of IoT and smart city services, including smart governance, smart homes, education, connectivity, transportation, business, agriculture, health care, industry, and many others.

Cybersecurity data science: Cybersecurity, or the practice of defending networks, systems, hardware, and data from digital attacks, is one of the most important fields of Industry 4.0 [ 114 , 121 ]. Data science techniques, particularly machine learning, have become a crucial cybersecurity technology that continually learns to identify trends by analyzing data, better detecting malware in encrypted traffic, finding insider threats, predicting where bad neighborhoods are online, keeping people safe while surfing, or protecting information in the cloud by uncovering suspicious user activity [ 114 ]. For instance, machine learning and deep learning-based security modeling can be used to effectively detect various types of cyberattacks or anomalies [ 103 , 106 ]. To generate security policy rules, association rule learning can play a significant role in building rule-based systems [ 102 ]. Deep learning-based security models can perform better when utilizing large-scale security datasets [ 140 ]. Thus data science modeling can enable cybersecurity professionals to be more proactive in preventing threats and reacting in real time to active attacks, by extracting actionable insights from the security datasets.

Behavioral data science: Behavioral data is information produced as a result of activities, most commonly commercial behavior, performed on a variety of Internet-connected devices, such as PCs, tablets, or smartphones [ 112 ]. Websites, mobile applications, marketing automation systems, call centers, help desks, billing systems, etc. are all common sources of behavioral data. Behavioral data is more than just data, in that it is not static [ 108 ]. Advanced analytics of these data, including machine learning modeling, can help in several areas, such as predicting future sales trends and product recommendations in e-commerce and retail; predicting usage trends, load, and user preferences in future releases in online gaming; determining how users use an application to predict future usage and preferences in application development; breaking users down into similar groups to gain a more focused understanding of their behavior in cohort analysis; detecting compromised credentials and insider threats by locating anomalous behavior, or making suggestions, etc. Overall, behavioral data science modeling typically makes it possible to make the right offers to the right consumers at the right time on various common platforms such as e-commerce platforms, online games, web and mobile applications, and IoT. In a social context, analyzing human behavioral data with advanced analytics methods, and using the insights extracted from social data, can support data-driven intelligent social services, which can be considered social data science.

Mobile data science: Today’s smart mobile phones are considered as “next-generation, multi-functional cell phones that facilitate data processing, as well as enhanced wireless connectivity” [ 146 ]. In our earlier paper [ 112 ], we showed that users’ interest in “Mobile Phones” has increasingly exceeded their interest in other platforms such as “Desktop Computer”, “Laptop Computer”, or “Tablet Computer” in recent years. People use smartphones for a variety of activities, including e-mailing, instant messaging, online shopping, Internet surfing, entertainment, social media such as Facebook, Linkedin, and Twitter, and various IoT services such as smart cities, health, and transportation services, and many others. Intelligent apps are based on the insight extracted from the relevant datasets depending on app characteristics, such as being action-oriented, adaptive in nature, suggestive and decision-oriented, data-driven, context-aware, and capable of cross-platform operation [ 112 ]. As a result, mobile data science, which involves gathering a large amount of mobile data from various sources and analyzing it using machine learning techniques to discover useful insights or data-driven trends, can play an important role in the development of intelligent smartphone applications.

Multimedia data science: Over the last few years, a big data revolution in multimedia management systems has resulted from the rapid and widespread use of multimedia data, such as image, audio, video, and text, as well as the ease of access and availability of multimedia sources. Currently, multimedia sharing websites, such as Yahoo Flickr, iCloud, and YouTube, and social networks such as Facebook, Instagram, and Twitter, are considered as valuable sources of multimedia big data [ 89 ]. People, particularly younger generations, spend a lot of time on the Internet and social networks to connect with others, exchange information, and create multimedia data, thanks to the advent of new technology and the advanced capabilities of smartphones and tablets. Multimedia analytics deals with the problem of effectively and efficiently manipulating, handling, mining, interpreting, and visualizing various forms of data to solve real-world problems. Text analysis, image or video processing, computer vision, audio or speech processing, and database management are among the solutions available for a range of applications including healthcare, education, entertainment, and mobile devices.

Smart cities or urban data science: Today, more than half of the world’s population lives in urban areas or cities [ 80 ], which are considered drivers or hubs of economic growth, wealth creation, well-being, and social activity [ 96 , 116 ]. In addition to cities, “urban area” can refer to the surrounding areas such as towns, conurbations, or suburbs. Thus, a large amount of data documenting the daily events, perceptions, thoughts, and emotions of citizens is recorded. These data are loosely categorized into personal data, e.g., household, education, employment, health, immigration, crime, etc.; proprietary data, e.g., banking, retail, online platforms data, etc.; government data, e.g., citywide crime statistics, or government institutions, etc.; open and public data, e.g., data.gov, ordnance survey; and organic and crowdsourced data, e.g., user-generated web data, social media, Wikipedia, etc. [ 29 ]. The field of urban data science typically focuses on providing more effective solutions from a data-driven perspective, through extracting knowledge and actionable insights from such urban data. Advanced analytics of these data using machine learning techniques [ 105 ] can facilitate the efficient management of urban areas, including real-time management, e.g., traffic flow management; evidence-based planning decisions, which pertain to the longer-term strategic role of forecasting for urban planning, e.g., crime prevention, public safety, and security; or framing the future, e.g., political decision-making [ 29 ]. Overall, it can contribute to government and public planning, as well as relevant sectors including retail, financial services, mobility, health, policing, and utilities within a data-rich urban environment through data-driven smart decision-making and policies, which lead to smart cities and improve the quality of human life.

Smart villages or rural data science: Rural areas, or the countryside, are the opposite of urban areas and include villages, hamlets, and agricultural areas. The field of rural data science typically focuses on making better decisions and providing more effective solutions, including protecting public safety, providing critical health services, supporting agriculture, and fostering economic development from a data-driven perspective, through extracting knowledge and actionable insights from the collected rural data. Advanced analytics of rural data, including machine learning [ 105 ] modeling, can provide new opportunities for rural communities to build insights and capacity to meet current needs and prepare for their futures. For instance, machine learning modeling [ 105 ] can help farmers to improve their decisions to adopt sustainable agriculture by utilizing the increasing amount of data captured by emerging technologies, e.g., the internet of things (IoT), mobile technologies and devices, etc. [ 1 , 51 , 52 ]. Thus, rural data science can play a very important role in the economic and social development of rural areas, through agriculture, business, self-employment, construction, banking, healthcare, governance, or other services that lead to smarter villages.

Overall, we can conclude that data science modeling can be used to help drive changes and improvements in almost every sector in our real-world life, where the relevant data is available to analyze. To gather the right data and extract useful knowledge or actionable insights from the data for making smart decisions is the key to data science modeling in any application domain. Based on our discussion on the above ten potential real-world application domains by taking into account data-driven smart computing and decision making, we can say that the prospects of data science and the role of data scientists are huge for the future world. The “Data Scientists” typically analyze information from multiple sources to better understand the data and business problems, and develop machine learning-based analytical modeling or algorithms, or data-driven tools, or solutions, focused on advanced analytics, which can make today’s computing process smarter, automated, and intelligent.

Challenges and Research Directions

Our study on data science and analytics, particularly data science modeling in “Understanding data science modeling”, advanced analytics methods and smart computing in “Advanced analytics methods and smart computing”, and real-world application areas in “Real-world application domains”, opens several research issues in the area of data-driven business solutions and eventual data products. Thus, in this section, we summarize and discuss the challenges faced and the potential research opportunities and future directions to build data-driven products.

Understanding the real-world business problems and the nature of the associated data, e.g., their forms, types, sizes, labels, etc., is the first challenge in data science modeling, discussed briefly in “Understanding data science modeling”. The task is to identify, specify, represent, and quantify the domain-specific business problems and data according to the requirements. For an effective data-driven business solution, there must be a well-defined workflow before beginning the actual data analysis work. Furthermore, gathering business data is difficult because data sources can be numerous and dynamic. As a result, gathering different forms of real-world data, whether structured or unstructured, related to a specific business issue and with legal access, which varies from application to application, is challenging. Moreover, data annotation, which is typically the process of categorizing, tagging, or labeling raw data for the purpose of building data-driven models, is another challenging issue. Thus, the primary task is to conduct a more in-depth analysis of data collection and dynamic annotation methods. Therefore, understanding the business problem, as well as integrating and managing the raw data gathered for efficient data analysis, may be one of the most challenging aspects of working in the field of data science and analytics.

The next challenge is the extraction of relevant and accurate information from the collected data mentioned above. The main focus of data scientists is typically to disclose, describe, represent, and capture data-driven intelligence for actionable insights from data. However, real-world data may contain many ambiguous values, missing values, outliers, and meaningless data [ 101 ]. The advanced analytics methods, including machine and deep learning modeling, discussed in “Advanced analytics methods and smart computing”, depend heavily on the quality and availability of the data. Thus it is important to understand the real-world business scenario and associated data, including whether, how, and why they are insufficient, missing, or problematic, and then to extend or redevelop existing methods, such as large-scale hypothesis testing, learning inconsistency, and uncertainty, etc., to address the complexities in data and business problems. Therefore, developing new techniques to effectively pre-process the diverse data collected from multiple sources, according to their nature and characteristics, could be another challenging task.

Understanding and selecting the appropriate analytical methods to extract useful insights for smart decision-making for a particular business problem is the main issue in the area of data science. The emphasis of advanced analytics is more on anticipating the use of data to detect patterns and determine what is likely to occur in the future. Basic analytics offers a description of data in general, while advanced analytics is a step forward in offering a deeper understanding of data and supporting granular data analysis. Thus, understanding the advanced analytics methods, especially machine and deep learning-based modeling, is the key. The traditional learning techniques mentioned in “Advanced analytics methods and smart computing” may not be directly applicable for the expected outcome in many cases. For instance, in a rule-based system, the traditional association rule learning technique [ 4 ] may produce redundant rules from the data, which makes the decision-making process complex and ineffective [ 113 ]. Thus, a scientific understanding of the learning algorithms, their mathematical properties, and how robust or fragile the techniques are to input data is needed. Therefore, a deeper understanding of the strengths and drawbacks of the existing machine and deep learning methods [ 38 , 105 ] for solving a particular business problem is needed. Consequently, improving or optimizing the learning algorithms according to the data characteristics, or proposing new algorithms or techniques with higher accuracy, becomes a significant challenge for the next generation of data scientists.
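To illustrate how quickly association rules accumulate, the sketch below enumerates rules by brute force from a handful of transactions, using the standard support and confidence measures. It is a simplified stand-in, not the algorithm of [ 4 ] or the method of [ 113 ]; the transactions and the two thresholds are hypothetical.

```python
from itertools import combinations

# Hypothetical market-basket transactions.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def mine_rules(transactions, min_support=0.4, min_confidence=0.6):
    """Enumerate rules lhs -> rhs that meet both thresholds (brute force)."""
    items = sorted(set().union(*transactions))
    found = []
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            s = support(set(itemset), transactions)
            if s < min_support:
                continue
            # Try every non-empty proper subset of the itemset as the left side.
            for r in range(1, size):
                for lhs in combinations(itemset, r):
                    confidence = s / support(set(lhs), transactions)
                    if confidence >= min_confidence:
                        rhs = tuple(i for i in itemset if i not in lhs)
                        found.append((lhs, rhs, s, confidence))
    return found


rules = mine_rules(TRANSACTIONS)
for lhs, rhs, s, confidence in rules:
    print(f"{lhs} -> {rhs}  support={s:.2f}  confidence={confidence:.2f}")
```

Even with three items and five transactions, several overlapping rules pass the thresholds, which gives a sense of why pruning redundant rules matters for rule-based decision-making.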

The traditional data-driven models or systems typically use a large amount of business data to generate data-driven decisions. In several application fields, however, the new trends are more likely to be interesting and useful for modeling and predicting the future than older ones. This applies, for example, to smartphone user behavior modeling, IoT services, stock market forecasting, health or transport services, job market analysis, and other related areas that involve time series and actual human interests or preferences that evolve over time. Thus, rather than considering traditional data analysis, the concept of RecencyMiner, i.e., insight or knowledge extracted from recent patterns, proposed in our earlier paper Sarker et al. [ 108 ], might be effective. Therefore, proposing new techniques that take into account recent data patterns, and consequently building a recency-based data-driven model for solving real-world problems, is another significant challenge in the area.
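The general idea of favoring recent patterns can be sketched with exponentially decayed counts, where each event contributes less the older it is. This is a generic illustration and not the RecencyMiner algorithm of [ 108 ]; the event list, the `half_life` parameter, and the function name are assumptions.

```python
from collections import defaultdict


def recency_weighted_counts(events, now, half_life):
    """Sum a weight of 0.5 ** (age / half_life) for each (timestamp, label) event."""
    weights = defaultdict(float)
    for timestamp, label in events:
        age = now - timestamp
        weights[label] += 0.5 ** (age / half_life)
    return dict(weights)


# Hypothetical usage events as (day, activity): calls are older, chats are recent.
events = [(0, "call"), (1, "call"), (2, "call"), (8, "chat"), (9, "chat")]

weights = recency_weighted_counts(events, now=10, half_life=2)
print(weights)
```

Although "call" occurs more often in the raw data, the two recent "chat" events receive the larger total weight, so a model built on these weights would reflect the user's current behavior rather than the older pattern.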

The most crucial task for a data-driven smart system is to create a framework that supports the data science modeling discussed in “Understanding data science modeling”. As a result, advanced analytical methods based on machine learning or deep learning techniques can be considered in such a system to make the framework capable of resolving the issues. Besides, contextual information such as temporal context, spatial context, social context, environmental context, etc. [ 100 ] can be incorporated to build an adaptive, context-aware, and dynamic model or framework, depending on the problem domain. As a result, a well-designed data-driven framework, together with experimental evaluation, is a very important direction for effectively solving a business problem in a particular domain, as well as a big challenge for data scientists.

In several important application areas, such as autonomous cars, criminal justice, health care, recruitment, housing, human resource management, and public safety, decisions made by models or AI agents have a direct effect on human lives. As a result, there is growing concern about whether these decisions can be trusted to be right, reasonable, ethical, personalized, accurate, robust, and secure, particularly in the context of adversarial attacks [ 104 ]. If we can explain the result in a meaningful way, then the model can be better trusted by the end-user. For machine-learned models, new trust properties yield new trade-offs, such as privacy versus accuracy, robustness versus efficiency, and fairness versus robustness. Therefore, incorporating trustworthy AI, particularly in data-driven or machine learning modeling, could be another challenging issue in the area.

In the above, we have summarized and discussed several challenges and the potential research opportunities and directions, within the scope of our study in the area of data science and advanced analytics. The data scientists in academia/industry and the researchers in the relevant area have the opportunity to contribute to each issue identified above and build effective data-driven models or systems, to make smart decisions in the corresponding business domains.

Conclusion

In this paper, we have presented a comprehensive view on data science, including various types of advanced analytical methods that can be applied to enhance the intelligence and the capabilities of an application. We have also visualized the current popularity of data science and machine learning-based advanced analytical modeling and differentiated these from the relevant terms used in the area, to position this paper. We have provided a thorough study of data science modeling with its various processing modules that are needed to extract actionable insights from the data for a particular business problem and the eventual data product. Thus, according to our goal, we have briefly discussed how different data modules can play a significant role in a data-driven business solution through the data science process. For this, we have also summarized various types of advanced analytical methods and outcomes, as well as the machine learning modeling needed to solve the associated business problems. Thus, this study’s key contribution is the explanation of different advanced analytical methods and their applicability in various real-world data-driven application areas, including business, healthcare, cybersecurity, urban and rural data science, and so on, by taking into account data-driven smart computing and decision making.

Finally, within the scope of our study, we have outlined and discussed the challenges we faced, as well as possible research opportunities and future directions. The challenges identified provide promising research opportunities in the field, which can be explored with effective solutions to improve data-driven models and systems. Overall, we conclude that our study of advanced analytical solutions based on data science and machine learning methods leads in a positive direction and can serve as a reference guide for future research and applications in data science by both academia and industry professionals.

Adnan N, Nordin SM, Rahman I, Noor A. The effects of knowledge transfer on farmers decision making toward sustainable agriculture practices. World J Sci Technol Sustain Dev. 2018.

Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the 1998 ACM SIGMOD international conference on management of data. 1998. p. 94–105.

Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. In: ACM SIGMOD record, vol 22. ACM. 1993. p. 207–16.

Agrawal R, Srikant R. Fast algorithms for mining association rules. In: Proceedings of the international joint conference on very large data bases, Santiago, Chile, vol 1215. 1994. p. 487–99.

Aha DW, Kibler D, Albert MK. Instance-based learning algorithms. Mach Learn. 1991;6(1):37–66.

Al-Abassi A, Karimipour H, HaddadPajouh H, Dehghantanha A, Parizi RM. Industrial big data analytics: challenges and opportunities. In: Handbook of big data privacy. Springer; 2020. p. 37–61.

Al-Garadi MA, Mohamed A, Al-Ali AK, Du X, Ali I, Guizani M. A survey of machine and deep learning methods for internet of things (iot) security. IEEE Commun Surv Tutor. 2020;22(3):1646–85.

Ankerst M, Breunig MM, Kriegel H-P, Sander J. Optics: ordering points to identify the clustering structure. ACM Sigmod Rec. 1999;28(2):49–60.

Atzori L, Iera A, Morabito G. The internet of things: a survey. Comput Netw. 2010;54(15):2787–805.

Baldi P. Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of ICML workshop on unsupervised and transfer learning. 2012. p. 37–49.

Balducci F, Impedovo D, Pirlo G. Machine learning applications on agricultural datasets for smart farm enhancement. Machines. 2018;6(3):38.

Box GEP, Jenkins GM, Reinsel GC, Ljung GM. Time series analysis: forecasting and control. New York: Wiley; 2015.

Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

Brettel M, Friederichsen N, Keller M, Rosenberg M. How virtualization, decentralization and network building change the manufacturing landscape: an industry 4.0 perspective. FormaMente 2017;12.

Canadian Institute for Cybersecurity, University of New Brunswick. ISCX dataset. http://www.unb.ca/cic/datasets/index.html/. Accessed 20 Oct 2019.

Cao H, Bao T, Yang Q, Chen E, Tian J. An effective approach for mining mobile user habits. In: Proceedings of the international conference on information and knowledge management, Toronto, ON, Canada, 26–30 October. New York: ACM; 2010. p. 1677–80.

Cao L. Data science: a comprehensive overview. ACM Comput Surv (CSUR). 2017;50(3):1–42.

Carpenter GA, Grossberg S. A massively parallel architecture for a self-organizing neural pattern recognition machine. Comput Vis Graph Image Process. 1987;37(1):54–115.

Cervone HF. Informatics and data science: an overview for the information professional. Digital Library Perspectives. 2016.

Chessel A. An overview of data science uses in bioimage informatics. Methods. 2017;115:110–8.

Chollet F. Xception: deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. p. 1251–58.

Cic-ddos2019 [online]. https://www.unb.ca/cic/datasets/ddos-2019.html/ . Accessed 28 Mar 2020.

Cudeck R. Exploratory factor analysis. In: Handbook of applied multivariate statistics and mathematical modeling. Elsevier. p. 265–96. 2000.

Das A, Ng W-K, Woon Y-K. Rapid association rule mining. In: Proceedings of the tenth international conference on Information and knowledge management. ACM; 2001. p. 474–481.

de Amorim V. Constrained clustering with Minkowski weighted k-means. In: 2012 IEEE 13th international symposium on computational intelligence and informatics (CINTI). IEEE. 2012. p. 13–17.

Dev H, Liu Z. Identifying frequent user tasks from application logs. In: Proceedings of the 22nd international conference on intelligent user interfaces. 2017. p. 263–73.

Donoho D. 50 years of data science. J Comput Graph Stat. 2017;26(4):745–66.

Eagle N, Pentland AS. Reality mining: sensing complex social systems. Pers Ubiquitous Comput. 2006;10(4):255–68.

Engin Z, van Dijk J, Lan T, Longley PA, Treleaven P, Batty M, Penn A. Data-driven urban management: mapping the landscape. J Urban Manag. 2020;9(2):140–50.

Ester M, Kriegel H-P, Sander J, Xiaowei X, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd. 1996;96:226–31.

Flach PA, Lachiche N. Confirmation-guided discovery of first-order rules with tertius. Mach Learn. 2001;42(1–2):61–95.

Freund Y, Schapire RE, et al. Experiments with a new boosting algorithm. In: Icml, vol 96. Citeseer; 1996. p. 148–156.

Ghavare P, Ahire P. Big data classification of users navigation and behavior using web server logs. In: 2018 fourth international conference on computing communication control and automation (ICCUBEA). IEEE. 2018. p. 1–6.

Goodfellow I, Bengio Y, Courville A. Deep learning, vol. 1. Cambridge: MIT Press; 2016.

Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: Advances in neural information processing systems. 2014. p. 2672–80.

Google trends. 2019. https://trends.google.com/trends/ .

Halvey M, Keane MT, Smyth B. Time based segmentation of log data for user navigation prediction in personalization. In: Proceedings of the international conference on web intelligence, Compiegne, France, 19–22 September. Washington, DC: IEEE Computer Society; 2005. p. 636–40.

Han J, Pei J, Kamber M. Data mining: concepts and techniques. Amsterdam: Elsevier; 2011.

Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. In: ACM Sigmod Record, vol 29. ACM; 2000. p. 1–12.

Hansun S. A new approach of moving average method in time series analysis. In: 2013 conference on new media studies (CoNMedia). IEEE; 2013. p. 1–4.

Harmon SA, Sanford TH, Xu S, Turkbey EB, Holger R, Ziyue X, Dong Y, Andriy M, Victoria A, Amel A, et al. Artificial intelligence for the detection of covid-19 pneumonia on chest ct using multinational datasets. Nat Commun. 2020;11(1):1–7.

He K, Zhang X, Ren S, Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell. 2015;37(9):1904–16.

He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. p. 770–78.

He P, Zhu J, He S, Li J, Lyu MR. Towards automated log parsing for large-scale log data analysis. IEEE Trans Dependable Secure Comput. 2017;15(6):931–44.

Hemmatian F, Sohrabi MK. A survey on classification techniques for opinion mining and sentiment analysis. In: Artificial intelligence review. 2019. p. 1–51.

Hinton GE. A practical guide to training restricted Boltzmann machines. In: Neural networks: tricks of the trade. Springer; 2012. p. 599–619.

Houtsma M, Swami A. Set-oriented mining for association rules in relational databases. In: Proceedings of the eleventh international conference on data engineering. IEEE; 1995. p. 25–33.

Howard MC. A review of exploratory factor analysis decisions and overview of current practices: what we are doing and how can we improve? Int J Hum Comput Interact. 2016;32(1):51–62.

John GH, Langley P. Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the eleventh conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc.; 1995. p. 338–45.

Kacprzak E, Koesten L, Ibáñez L-D, Blount T, Tennison J, Simperl E. Characterising dataset search-an analysis of search logs and data requests. J Web Semant. 2019;55:37–55.

Kamble SS, Gunasekaran A, Gawankar SA. Sustainable industry 4.0 framework: a systematic literature review identifying the current trends and future perspectives. Process Saf Environ Prot. 2018;117:408–425.

Kamble SS, Gunasekaran A, Gawankar SA. Achieving sustainable performance in a data-driven agriculture supply chain: a review for research and applications. Int J Prod Econ. 2020;219:179–94.

Karpatne A, Atluri G, Faghmous JH, Steinbach M, Banerjee A, Ganguly A, Shekhar S, Samatova N, Kumar V. Theory-guided data science: a new paradigm for scientific discovery from data. IEEE Trans Knowl Data Eng. 2017;29(10):2318–31.

Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis, vol. 344. New York: Wiley; 2009.

Keerthi SS, Shevade SK, Bhattacharyya C, Murthy KRK. Improvements to Platt’s smo algorithm for svm classifier design. Neural Comput. 2001;13(3):637–49.

Khadse V, Mahalle PN, Biraris SV. An empirical comparison of supervised machine learning algorithms for internet of things data. In: 2018 fourth international conference on computing communication control and automation (ICCUBEA). IEEE; 2018. p. 1–6.

Kimura T, Watanabe A, Toyono T, Ishibashi K. Proactive failure detection learning generation patterns of large-scale network logs. IEICE Trans Commun. 2018.

Kohonen T. The self-organizing map. Proc IEEE. 1990;78(9):1464–80.

Koroniotis N, Moustafa N, Sitnikova E, Turnbull B. Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: bot-iot dataset. Future Gener Comput Syst. 2019;100:779–96.

Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. 2012. p. 1097–1105.

Krukovets D, et al. Data science opportunities at central banks: overview. Visnyk Natl Bank Ukr. 2020;249:13–24.

Kulin M, Fortuna C, De Poorter E, Deschrijver D, Moerman I. Data-driven design of intelligent wireless networks: an overview and tutorial. Sensors. 2016;16(6):790.

Kwon D, Kim H, Kim J, Suh SC, Kim I, Kim KJ. A survey of deep learning-based network anomaly detection. Cluster Comput. 2019;22(1):949–61.

Lade P, Ghosh R, Srinivasan S. Manufacturing analytics and industrial internet of things. IEEE Intell Syst. 2017;32(3):74–9.

Larson D, Chang V. A review and future direction of agile, business intelligence, analytics and data science. Int J Inf Manag. 2016;36(5):700–10.

Le Cessie S, Van Houwelingen JC. Ridge estimators in logistic regression. J R Stat Soc Ser C (Applied Statistics). 1992;41(1):191–201.

LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–324.

Lee J, Bagheri B, Kao H-A. Recent advances and trends of cyber-physical systems and big data analytics in industrial informatics. In: International proceeding of int conference on industrial informatics (INDIN). 2014. p. 1–6.

Leevy JL, Khoshgoftaar TM, Bauder RA, Seliya N. A survey on addressing high-class imbalance in big data. J Big Data. 2018;5(1):42.

Li Z, Fan Y, Jiang B, Lei T, Liu W. A survey on sentiment analysis and opinion mining for social multimedia. Multimed Tools Appl. 2019;78(6):6939–67.

Liu B. Sentiment analysis: mining opinions, sentiments, and emotions. Cambridge: Cambridge University Press; 2020.

Liu J, Tang T, Wang W, Bo X, Kong X, Xia F. A survey of scholarly data visualization. IEEE Access. 2018;6:19205–21.

Ma B, Liu W, Hsu Y. Integrating classification and association rule mining. In: Proceedings of the fourth international conference on knowledge discovery and data mining. 1998.

Ma C, Zhang HH, Wang X. Machine learning for big data analytics in plants. Trends Plant Sci. 2014;19(12):798–808.

MacQueen J, et al. Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Oakland, CA, USA, vol 1. 1967. p. 281–297.

Marchand A, Marx P. Automated product recommendations with preference-based explanations. J Retail. 2020;96(3):328–43.

Mehrotra A, Hendley R, Musolesi M. Prefminer: mining user’s preferences for intelligent mobile notification management. In: Proceedings of the international joint conference on pervasive and ubiquitous computing, Heidelberg, 12–16 September, ACM, New York. 2016. p. 1223–1234.

Mohamadou Y, Halidou A, Kapen PT. A review of mathematical modeling, artificial intelligence and datasets used in the study, prediction and management of covid-19. Appl Intell. 2020;50(11):3913–25.

Moustafa N, Slay J. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In: 2015 military communications and information systems conference (MilCIS). IEEE. 2015. p. 1–6.

United Nations. Revision of world urbanization prospects. New York: United Nations; 2018.

Nilashi M, Ibrahim O, Ahmadi H, Shahmoradi L. An analytical method for diseases prediction using machine learning techniques. Comput Chem Eng. 2017;106:212–23.

Paireekreng W, Rapeepisarn K, Wong KW. Time-based personalised mobile game downloading. In: Transactions on edutainment II. 2009. p. 59–69.

Pan Y, Zhang L, Li Z. Mining event logs for knowledge discovery based on adaptive efficient fuzzy Kohonen clustering network. Knowl Based Syst. 2020:209.

Park H-S, Jun C-H. A simple and fast algorithm for k-medoids clustering. Expert Syst Appl. 2009;36(2):3336–41.

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825–30.

Perveen S, Shahbaz M, Keshavjee K, Guergachi A. Metabolic syndrome and development of diabetes mellitus: predictive modeling based on machine learning techniques. IEEE Access. 2018;7:1365–75.

Peyré G, Cuturi M, et al. Computational optimal transport: with applications to data science. Found Trends Mach Learn. 2019;11(5–6):355–607.

Phithakkitnukoon S, Dantu R, Claxton R, Eagle N. Behavior-based adaptive call predictor. ACM Trans Auton Adapt Syst. 2011;6(3):21:1–21:28.

Pouyanfar S, Yang Y, Chen S-C, Shyu M-L, Iyengar SS. Multimedia big data analytics: a survey. ACM Comput Surv (CSUR). 2018;51(1):1–34.

Provost F, Fawcett T. Data science for business: what you need to know about data mining and data-analytic thinking. O’Reilly Media, Inc.; 2013.

Qin X, Luo Y, Tang N, Li G. Making data visualization more efficient and effective: a survey. VLDB J. 2020;29(1):93–117.

Quinlan JR. Induction of decision trees. Mach Learn. 1986;1(1):81–106.

Quinlan JR. C4.5: programs for machine learning. Mach Learn. 1993.

Rasmussen C. The infinite Gaussian mixture model. Adv Neural Inf Process Syst. 1999;12:554–60.

Rawassizadeh R, Tomitsch M, Wac K, Tjoa AM. Ubiqlog: a generic mobile phone-based life-log framework. Pers Ubiquitous Comput. 2013;17(4):621–37.

Resch B, Szell M. Human-centric data science for urban studies. 2019.

Rizk A, Elragal A. Data science: developing theoretical contributions in information systems via text analytics. J Big Data. 2020;7(1):1–26.

Rokach L. A survey of clustering algorithms. In: Data mining and knowledge discovery handbook. Springer; 2010. p. 269–298.

Safdar S, Zafar S, Zafar N, Khan NF. Machine learning based decision support systems (dss) for heart disease diagnosis: a review. Artif Intell Rev. 2018;50(4):597–623.

Sarker IH. Context-aware rule learning from smartphone data: survey, challenges and future directions. J Big Data. 2019;6(1):1–25.

Sarker IH. A machine learning based robust prediction model for real-life mobile phone data. Internet Things. 2019;5:180–93.

Sarker IH. Ai-driven cybersecurity: an overview, security intelligence modeling and research directions. SN Comput Sci. 2021.

Sarker IH. Cyberlearning: effectiveness analysis of machine learning security modeling to detect cyber-anomalies and multi-attacks. Internet Things. 2021:100393.

Sarker IH. Deep cybersecurity: a comprehensive overview from neural network and deep learning perspective. SN Comput Sci. 2021.

Sarker IH. Machine learning: algorithms, real-world applications and research directions. SN Comput Sci. 2021;2(3):1–21.

Sarker IH, Abushark YB, Alsolami F, Khan AI. Intrudtree: a machine learning based cyber security intrusion detection model. Symmetry. 2020;12(5):754.

Sarker IH, Alqahtani H, Alsolami F, Khan AI, Abushark YB, Siddiqui MK. Context pre-modeling: an empirical analysis for classification based user-centric context-aware predictive modeling. J Big Data. 2020;7(1):1–23.

Sarker IH, Colman A, Han J. Recencyminer: mining recency-based personalized behavior from contextual smartphone data. J Big Data. 2019;6(1):1–21.

Sarker IH, Colman A, Han J, Khan AI, Abushark YB, Salah K. Behavdt: a behavioral decision tree learning to build user-centric context-aware predictive model. Mob Netw Appl. 2020;25(3):1151–61.

Sarker IH, Colman A, Kabir MA, Han J. Phone call log as a context source to modeling individual user behavior. In: Proceedings of the 2016 ACM international joint conference on pervasive and ubiquitous computing (Ubicomp): adjunct, Germany. ACM. 2016. p. 630–634.

Sarker IH, Colman A, Kabir MA, Han J. Individualized time-series segmentation for mining mobile phone user behavior. Comput J. 2018;61(3):349–68.

Sarker IH, Hoque MM, Uddin MK, Alsanoosy T. Mobile data science and intelligent apps: Concepts, ai-based modeling and research directions. Mob Netw Appl. 2020:1–19.

Sarker IH, Kayes ASM. Abc-ruleminer: user behavioral rule-based machine learning method for context-aware intelligent services. J Netw Comput Appl. 2020:102762.

Sarker IH, Kayes ASM, Badsha S, Alqahtani H, Watters P, Ng A. Cybersecurity data science: an overview from machine learning perspective. J Big Data. 2020;7(1):1–29.

Sarker IH, Kayes ASM, Watters P. Effectiveness analysis of machine learning classification models for predicting personalized context-aware smartphone usage. J Big Data. 2019;6(1):1–28.

Schläpfer M, Bettencourt LMA, Grauwin S, Raschke M, Claxton R, Smoreda Z, West GB, Ratti C. The scaling of human interactions with city size. J R Soc Interface. 2014;11(98):20130789.

Shukla N, Fricklas K. Machine learning with TensorFlow. Greenwich: Manning; 2018.

Siami-Namini S, Tavakoli N, Namin AS. A comparison of arima and lstm in forecasting time series. In: 2018 17th IEEE international conference on machine learning and applications (ICMLA). IEEE. 2018. p. 1394–1401.

Silahtaroğlu G, Yılmaztürk N. Data analysis in health and big data: a machine learning medical diagnosis model based on patients’ complaints. Commun Stat Theory Methods. 2019;1–10.

Silvestrini A, Veredas D. Temporal aggregation of univariate and multivariate time series models: a survey. J Econ Surv. 2008;22(3):458–97.

Ślusarczyk B. Industry 4.0: are we ready? Pol J Manag Stud. 2018:17.

Sneath PHA. The application of computers to taxonomy. J Gen Microbiol. 1957;17(1).

Sorensen T. Method of establishing groups of equal amplitude in plant sociology based on similarity of species. Biol. Skr. 1948:5.

Srinivasan V, Moghaddam S, Mukherji A. Mobileminer: mining your frequent patterns on your phone. In: Proceedings of the international joint conference on pervasive and ubiquitous computing, Seattle, WA, USA, 13–17 September. New York: ACM; 2014. p. 389–400.

Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015. p. 1–9.

Tajbakhsh A, Rahmati M, Mirzaei A. Intrusion detection using fuzzy association rules. Appl Soft Comput. 2009;9(2):462–9.

Tavallaee M, Bagheri E, Lu W, Ghorbani AA. A detailed analysis of the kdd cup 99 data set. In: 2009 IEEE symposium on computational intelligence for security and defense applications. IEEE. 2009. p. 1–6.

Tsagkias M, Tracy HK, Surya K, Vanessa M, de Rijke M. Challenges and research opportunities in ecommerce search and recommendations. In: ACM SIGIR forum, vol 54. New York: ACM; 2021. p. 1–23.

Tsai C-W, Lai C-F, Chao H-C, Vasilakos AV. Big data analytics: a survey. J Big Data. 2015;2(1):1–32.

Tuncel KS, Baydogan MG. Autoregressive forests for multivariate time series modeling. Pattern Recognit. 2018;73:202–15.

Wagstaff K, Cardie C, Rogers S, Schrödl S, et al. Constrained k-means clustering with background knowledge. ICML. 2001;1:577–84.

Wang J, Zhang W, Shi Y, Duan S, Liu J. Industrial big data analytics: challenges, methodologies, and applications. 2018. arXiv:1807.01016 .

Wang L, Zhang J, Chen G, Qiao D. Identifying comparable entities with indirectly associative relations and word embeddings from web search logs. Decis Support Syst. 2021:141.

Wang W, Yang J, Muntz R, et al. Sting: a statistical information grid approach to spatial data mining. VLDB. 1997;97:186–95.

Waskom ML. Seaborn: statistical data visualization. J Open Source Softw. 2021;6(60):3021.

Wei P, Li Y, Zhang Z, Tao H, Li Z, Liu D. An optimization method for intrusion detection classification model based on deep belief network. IEEE Access. 2019;7:87593–605.

Weiss K, Khoshgoftaar TM, Wang DD. A survey of transfer learning. J Big Data. 2016;3(1):9.

Witten IH, Frank E. Data mining: practical machine learning tools and techniques. Morgan Kaufmann; 2005.

Witten IH, Frank E, Trigg LE, Hall MA, Holmes G, Cunningham SJ. Weka: practical machine learning tools and techniques with java implementations. 1999.

Xin Y, Kong L, Liu Z, Chen Y, Li Y, Zhu H, Gao M, Hou H, Wang C. Machine learning and deep learning methods for cybersecurity. IEEE Access. 2018;6:35365–81.

Xu D, Yingjie T. A comprehensive survey of clustering algorithms. Ann Data Sci. 2015;2(2):165–93.

Ya J, Liu T, Li Q, Shi J, Zhang H, Lv P, Guo L. Mining host behavior patterns from massive network and security logs. Proc Comput Sci. 2017;108:38–47.

Yong AG, Pearce S, et al. A beginner’s guide to factor analysis: Focusing on exploratory factor analysis. Tutor Quant Methods Psychol. 2013;9(2):79–94.

Zaki MJ. Scalable algorithms for association mining. IEEE Trans Knowl Data Eng. 2000;12(3):372–90.

Zhao Q, Bhowmick SS. Association rule mining: a survey. Singapore: Nanyang Technological University; 2003.

Zheng P, Ni LM. Spotlight: the rise of the smart phone. IEEE Distrib Syst Online. 2006;7(3):3.

Zheng T, Xie W, Liling X, He X, Zhang Y, You M, Yang G, Chen Y. A machine learning-based framework to identify type 2 diabetes through electronic health records. Int J Med Inform. 2017;97:120–7.

Zhou Z-J, Hu G-Y, Hu C-H, Wen C-L, Chang L-L. A survey of belief rule-base expert system. IEEE Trans Syst Man Cybern Syst. 2019.

Zhu H, Chen E, Xiong H, Kuifei Y, Cao H, Tian J. Mining mobile user preferences for personalized context-aware recommendation. ACM Trans Intell Syst Technol (TIST). 2014;5(4):58.

Zikang H, Yong Y, Guofeng Y, Xinyu Z. Sentiment analysis of agricultural product ecommerce review data based on deep learning. In: 2020 international conference on internet of things and intelligent applications (ITIA). IEEE. 2020. p. 1–7.

Author information

Authors and affiliations.

Swinburne University of Technology, Melbourne, VIC, 3122, Australia

Department of Computer Science and Engineering, Chittagong University of Engineering & Technology, Chittagong, 4349, Bangladesh

Corresponding author

Correspondence to Iqbal H. Sarker.

Ethics declarations

Conflict of interest.

The author declares no conflict of interest.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K N and M. Shivakumar.

About this article

Sarker, I.H. Data Science and Analytics: An Overview from Data-Driven Smart Computing, Decision-Making and Applications Perspective. SN COMPUT. SCI. 2, 377 (2021). https://doi.org/10.1007/s42979-021-00765-8

Received: 09 August 2019

Accepted: 02 July 2021

Published: 12 July 2021

DOI: https://doi.org/10.1007/s42979-021-00765-8

  • Data science
  • Advanced analytics
  • Machine learning
  • Deep learning
  • Smart computing
  • Decision-making
  • Predictive analytics
  • Data science applications

A Deep Dissertion of Data Science: Related Issues and its Applications

  • Perspective
  • Published: 06 March 2024

Artificial intelligence and illusions of understanding in scientific research

  • Lisa Messeri (ORCID: orcid.org/0000-0002-0964-123X) &
  • M. J. Crockett (ORCID: orcid.org/0000-0001-8800-410X)

Nature volume 627, pages 49–58 (2024)

  • Human behaviour
  • Interdisciplinary studies
  • Research management
  • Social anthropology

Scientists are enthusiastically imagining ways in which artificial intelligence (AI) tools might improve research. Why are AI tools so attractive and what are the risks of implementing them across the research pipeline? Here we develop a taxonomy of scientists’ visions for AI, observing that their appeal comes from promises to improve productivity and objectivity by overcoming human shortcomings. But proposed AI solutions can also exploit our cognitive limitations, making us vulnerable to illusions of understanding in which we believe we understand more about the world than we actually do. Such illusions obscure the scientific community’s ability to see the formation of scientific monocultures, in which some types of methods, questions and viewpoints come to dominate alternative approaches, making science less innovative and more vulnerable to errors. The proliferation of AI tools in science risks introducing a phase of scientific enquiry in which we produce more but understand less. By analysing the appeal of these tools, we provide a framework for advancing discussions of responsible knowledge production in the age of AI.


van Dinter, R., Tekinerdogan, B. & Catal, C. Automation of systematic literature reviews: a systematic literature review. Inf. Softw. Technol. 136 , 106589 (2021).

Aydın, Ö. & Karaarslan, E. OpenAI ChatGPT generated literature review: digital twin in healthcare. In Emerging Computer Technologies 2 (ed. Aydın, Ö.) 22–31 (İzmir Akademi Dernegi, 2022).

AlQuraishi, M. AlphaFold at CASP13. Bioinformatics 35 , 4862–4865 (2019).

Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596 , 583–589 (2021).

Article   CAS   PubMed   PubMed Central   ADS   Google Scholar  

Lee, J. S., Kim, J. & Kim, P. M. Score-based generative modeling for de novo protein design. Nat. Computat. Sci. 3 , 382–392 (2023).

Gómez-Bombarelli, R. et al. Design of efficient molecular organic light-emitting diodes by a high-throughput virtual screening and experimental approach. Nat. Mater. 15 , 1120–1127 (2016).

Krenn, M. et al. On scientific understanding with artificial intelligence. Nat. Rev. Phys. 4 , 761–769 (2022).

Extance, A. How AI technology can tame the scientific literature. Nature 561 , 273–274 (2018).

Hastings, J. AI for Scientific Discovery (CRC Press, 2023). This book reviews current and future incorporation of AI into the scientific research pipeline .

Ahmed, A. et al. The future of academic publishing. Nat. Hum. Behav. 7 , 1021–1026 (2023).

Gray, K., Yam, K. C., Zhen’An, A. E., Wilbanks, D. & Waytz, A. The psychology of robots and artificial intelligence. In The Handbook of Social Psychology (eds Gilbert, D. et al.) (in the press).

Argyle, L. P. et al. Out of one, many: using language models to simulate human samples. Polit. Anal. 31 , 337–351 (2023).

Aher, G., Arriaga, R. I. & Kalai, A. T. Using large language models to simulate multiple humans and replicate human subject studies. In Proc. 40th International Conference on Machine Learning (eds Krause, A. et al.) 337–371 (JMLR.org, 2023).

Binz, M. & Schulz, E. Using cognitive psychology to understand GPT-3. Proc. Natl Acad. Sci. USA 120 , e2218523120 (2023).

Ornstein, J. T., Blasingame, E. N. & Truscott, J. S. How to train your stochastic parrot: large language models for political texts. Github , https://joeornstein.github.io/publications/ornstein-blasingame-truscott.pdf (2023).

He, S. et al. Learning to predict the cosmological structure formation. Proc. Natl Acad. Sci. USA 116 , 13825–13832 (2019).

Article   MathSciNet   CAS   PubMed   PubMed Central   ADS   Google Scholar  

Mahmood, F. et al. Deep adversarial training for multi-organ nuclei segmentation in histopathology images. IEEE Trans. Med. Imaging 39 , 3257–3267 (2020).

Teixeira, B. et al. Generating synthetic X-ray images of a person from the surface geometry. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 9059–9067 (Institute of Electrical and Electronics Engineers, 2018).

Marouf, M. et al. Realistic in silico generation and augmentation of single-cell RNA-seq data using generative adversarial networks. Nat. Commun. 11 , 166 (2020).

Watts, D. J. A twenty-first century science. Nature 445 , 489 (2007).

boyd, d. & Crawford, K. Critical questions for big data. Inf. Commun. Soc. 15 , 662–679 (2012). This article assesses the ethical and epistemic implications of scientific and societal moves towards big data and provides a parallel case study for thinking about the risks of artificial intelligence .

Jolly, E. & Chang, L. J. The Flatland fallacy: moving beyond low–dimensional thinking. Top. Cogn. Sci. 11 , 433–454 (2019).

Yarkoni, T. & Westfall, J. Choosing prediction over explanation in psychology: lessons from machine learning. Perspect. Psychol. Sci. 12 , 1100–1122 (2017).

Radivojac, P. et al. A large-scale evaluation of computational protein function prediction. Nat. Methods 10 , 221–227 (2013).

Bileschi, M. L. et al. Using deep learning to annotate the protein universe. Nat. Biotechnol. 40 , 932–937 (2022).

Barkas, N. et al. Joint analysis of heterogeneous single-cell RNA-seq dataset collections. Nat. Methods 16 , 695–698 (2019).

Demszky, D. et al. Using large language models in psychology. Nat. Rev. Psychol. 2 , 688–701 (2023).

Karjus, A. Machine-assisted mixed methods: augmenting humanities and social sciences with artificial intelligence. Preprint at https://arxiv.org/abs/2309.14379 (2023).

Davies, A. et al. Advancing mathematics by guiding human intuition with AI. Nature 600 , 70–74 (2021).

Peterson, J. C., Bourgin, D. D., Agrawal, M., Reichman, D. & Griffiths, T. L. Using large-scale experiments and machine learning to discover theories of human decision-making. Science 372 , 1209–1214 (2021).

Ilyas, A. et al. Adversarial examples are not bugs, they are features. Preprint at https://doi.org/10.48550/arXiv.1905.02175 (2019)

Semel, B. M. Listening like a computer: attentional tensions and mechanized care in psychiatric digital phenotyping. Sci. Technol. Hum. Values 47 , 266–290 (2022).

Gil, Y. Thoughtful artificial intelligence: forging a new partnership for data science and scientific discovery. Data Sci. 1 , 119–129 (2017).

Checco, A., Bracciale, L., Loreti, P., Pinfield, S. & Bianchi, G. AI-assisted peer review. Humanit. Soc. Sci. Commun. 8 , 25 (2021).

Thelwall, M. Can the quality of published academic journal articles be assessed with machine learning? Quant. Sci. Stud. 3 , 208–226 (2022).

Dhar, P. Peer review of scholarly research gets an AI boost. IEEE Spectrum spectrum.ieee.org/peer-review-of-scholarly-research-gets-an-ai-boost (2020).

Heaven, D. AI peer reviewers unleashed to ease publishing grind. Nature 563 , 609–610 (2018).

Conroy, G. How ChatGPT and other AI tools could disrupt scientific publishing. Nature 622 , 234–236 (2023).

Nosek, B. A. et al. Replicability, robustness, and reproducibility in psychological science. Annu. Rev. Psychol. 73 , 719–748 (2022).

Altmejd, A. et al. Predicting the replicability of social science lab experiments. PLoS ONE 14 , e0225826 (2019).

Yang, Y., Youyou, W. & Uzzi, B. Estimating the deep replicability of scientific findings using human and artificial intelligence. Proc. Natl Acad. Sci. USA 117 , 10762–10768 (2020).

Youyou, W., Yang, Y. & Uzzi, B. A discipline-wide investigation of the replicability of psychology papers over the past two decades. Proc. Natl Acad. Sci. USA 120 , e2208863120 (2023).

Rabb, N., Fernbach, P. M. & Sloman, S. A. Individual representation in a community of knowledge. Trends Cogn. Sci. 23 , 891–902 (2019). This comprehensive review paper documents the empirical evidence for distributed cognition in communities of knowledge and the resultant vulnerabilities to illusions of understanding .

Rozenblit, L. & Keil, F. The misunderstood limits of folk science: an illusion of explanatory depth. Cogn. Sci. 26 , 521–562 (2002). This paper provided an empirical demonstration of the illusion of explanatory depth, and inspired a programme of research in cognitive science on communities of knowledge .

Hutchins, E. Cognition in the Wild (MIT Press, 1995).

Lave, J. & Wenger, E. Situated Learning: Legitimate Peripheral Participation (Cambridge Univ. Press, 1991).

Kitcher, P. The division of cognitive labor. J. Philos. 87 , 5–22 (1990).

Hardwig, J. Epistemic dependence. J. Philos. 82 , 335–349 (1985).

Keil, F. in Oxford Studies In Epistemology (eds Gendler, T. S. & Hawthorne, J.) 143–166 (Oxford Academic, 2005).

Weisberg, M. & Muldoon, R. Epistemic landscapes and the division of cognitive labor. Philos. Sci. 76 , 225–252 (2009).

Sloman, S. A. & Rabb, N. Your understanding is my understanding: evidence for a community of knowledge. Psychol. Sci. 27 , 1451–1460 (2016).

Wilson, R. A. & Keil, F. The shadows and shallows of explanation. Minds Mach. 8 , 137–159 (1998).

Keil, F. C., Stein, C., Webb, L., Billings, V. D. & Rozenblit, L. Discerning the division of cognitive labor: an emerging understanding of how knowledge is clustered in other minds. Cogn. Sci. 32 , 259–300 (2008).

Sperber, D. et al. Epistemic vigilance. Mind Lang. 25 , 359–393 (2010).

Wilkenfeld, D. A., Plunkett, D. & Lombrozo, T. Depth and deference: when and why we attribute understanding. Philos. Stud. 173 , 373–393 (2016).

Sparrow, B., Liu, J. & Wegner, D. M. Google effects on memory: cognitive consequences of having information at our fingertips. Science 333 , 776–778 (2011).

Fisher, M., Goddu, M. K. & Keil, F. C. Searching for explanations: how the internet inflates estimates of internal knowledge. J. Exp. Psychol. Gen. 144 , 674–687 (2015).

De Freitas, J., Agarwal, S., Schmitt, B. & Haslam, N. Psychological factors underlying attitudes toward AI tools. Nat. Hum. Behav. 7 , 1845–1854 (2023).

Castelo, N., Bos, M. W. & Lehmann, D. R. Task-dependent algorithm aversion. J. Mark. Res. 56 , 809–825 (2019).

Cadario, R., Longoni, C. & Morewedge, C. K. Understanding, explaining, and utilizing medical artificial intelligence. Nat. Hum. Behav. 5 , 1636–1642 (2021).

Oktar, K. & Lombrozo, T. Deciding to be authentic: intuition is favored over deliberation when authenticity matters. Cognition 223 , 105021 (2022).

Bigman, Y. E., Yam, K. C., Marciano, D., Reynolds, S. J. & Gray, K. Threat of racial and economic inequality increases preference for algorithm decision-making. Comput. Hum. Behav. 122 , 106859 (2021).

Claudy, M. C., Aquino, K. & Graso, M. Artificial intelligence can’t be charmed: the effects of impartiality on laypeople’s algorithmic preferences. Front. Psychol. 13 , 898027 (2022).

Snyder, C., Keppler, S. & Leider, S. Algorithm reliance under pressure: the effect of customer load on service workers. Preprint at SSRN https://doi.org/10.2139/ssrn.4066823 (2022).

Bogert, E., Schecter, A. & Watson, R. T. Humans rely more on algorithms than social influence as a task becomes more difficult. Sci Rep. 11 , 8028 (2021).

Raviv, A., Bar‐Tal, D., Raviv, A. & Abin, R. Measuring epistemic authority: studies of politicians and professors. Eur. J. Personal. 7 , 119–138 (1993).

Cummings, L. The “trust” heuristic: arguments from authority in public health. Health Commun. 29 , 1043–1056 (2014).

Lee, M. K. Understanding perception of algorithmic decisions: fairness, trust, and emotion in response to algorithmic management. Big Data Soc. 5 , https://doi.org/10.1177/2053951718756684 (2018).

Kissinger, H. A., Schmidt, E. & Huttenlocher, D. The Age of A.I. And Our Human Future (Little, Brown, 2021).

Lombrozo, T. Explanatory preferences shape learning and inference. Trends Cogn. Sci. 20 , 748–759 (2016). This paper provides an overview of philosophical theories of explanatory virtues and reviews empirical evidence on the sorts of explanations people find satisfying .

Vrantsidis, T. H. & Lombrozo, T. Simplicity as a cue to probability: multiple roles for simplicity in evaluating explanations. Cogn. Sci. 46 , e13169 (2022).

Johnson, S. G. B., Johnston, A. M., Toig, A. E. & Keil, F. C. Explanatory scope informs causal strength inferences. In Proc. 36th Annual Meeting of the Cognitive Science Society 2453–2458 (Cognitive Science Society, 2014).

Khemlani, S. S., Sussman, A. B. & Oppenheimer, D. M. Harry Potter and the sorcerer’s scope: latent scope biases in explanatory reasoning. Mem. Cognit. 39 , 527–535 (2011).

Liquin, E. G. & Lombrozo, T. Motivated to learn: an account of explanatory satisfaction. Cogn. Psychol. 132 , 101453 (2022).

Hopkins, E. J., Weisberg, D. S. & Taylor, J. C. V. The seductive allure is a reductive allure: people prefer scientific explanations that contain logically irrelevant reductive information. Cognition 155 , 67–76 (2016).

Weisberg, D. S., Hopkins, E. J. & Taylor, J. C. V. People’s explanatory preferences for scientific phenomena. Cogn. Res. Princ. Implic. 3 , 44 (2018).

Jerez-Fernandez, A., Angulo, A. N. & Oppenheimer, D. M. Show me the numbers: precision as a cue to others’ confidence. Psychol. Sci. 25 , 633–635 (2014).

Kim, J., Giroux, M. & Lee, J. C. When do you trust AI? The effect of number presentation detail on consumer trust and acceptance of AI recommendations. Psychol. Mark. 38 , 1140–1155 (2021).

Nguyen, C. T. The seductions of clarity. R. Inst. Philos. Suppl. 89 , 227–255 (2021). This article describes how reductive and quantitative explanations can generate a sense of understanding that is not necessarily correlated with actual understanding .

Fisher, M., Smiley, A. H. & Grillo, T. L. H. Information without knowledge: the effects of internet search on learning. Memory 30 , 375–387 (2022).

Eliseev, E. D. & Marsh, E. J. Understanding why searching the internet inflates confidence in explanatory ability. Appl. Cogn. Psychol. 37 , 711–720 (2023).

Fisher, M. & Oppenheimer, D. M. Who knows what? Knowledge misattribution in the division of cognitive labor. J. Exp. Psychol. Appl. 27 , 292–306 (2021).

Chromik, M., Eiband, M., Buchner, F., Krüger, A. & Butz, A. I think I get your point, AI! The illusion of explanatory depth in explainable AI. In 26th International Conference on Intelligent User Interfaces (eds Hammond, T. et al.) 307–317 (Association for Computing Machinery, 2021).

Strevens, M. No understanding without explanation. Stud. Hist. Philos. Sci. A 44 , 510–515 (2013).

Ylikoski, P. in Scientific Understanding: Philosophical Perspectives (eds De Regt, H. et al.) 100–119 (Univ. Pittsburgh Press, 2009).

Giudice, M. D. The prediction–explanation fallacy: a pervasive problem in scientific applications of machine learning. Preprint at PsyArXiv https://doi.org/10.31234/osf.io/4vq8f (2021).

Hofman, J. M. et al. Integrating explanation and prediction in computational social science. Nature 595 , 181–188 (2021). This paper highlights the advantages and disadvantages of explanatory versus predictive approaches to modelling, with a focus on applications to computational social science .

Shmueli, G. To explain or to predict? Stat. Sci. 25 , 289–310 (2010).

Article   MathSciNet   Google Scholar  

Hofman, J. M., Sharma, A. & Watts, D. J. Prediction and explanation in social systems. Science 355 , 486–488 (2017).

Logg, J. M., Minson, J. A. & Moore, D. A. Algorithm appreciation: people prefer algorithmic to human judgment. Organ. Behav. Hum. Decis. Process. 151 , 90–103 (2019).

Nguyen, C. T. Cognitive islands and runaway echo chambers: problems for epistemic dependence on experts. Synthese 197 , 2803–2821 (2020).

Breiman, L. Statistical modeling: the two cultures. Stat. Sci. 16 , 199–215 (2001).

Gao, J. & Wang, D. Quantifying the benefit of artificial intelligence for scientific research. Preprint at arxiv.org/abs/2304.10578 (2023).

Hanson, B. et al. Garbage in, garbage out: mitigating risks and maximizing benefits of AI in research. Nature 623 , 28–31 (2023).

Kleinberg, J. & Raghavan, M. Algorithmic monoculture and social welfare. Proc. Natl Acad. Sci. USA 118 , e2018340118 (2021). This paper uses formal modelling methods to demonstrate that when companies all rely on the same algorithm to make decisions (an algorithmic monoculture), the overall quality of those decisions is reduced because valuable options can slip through the cracks, even when the algorithm performs accurately for individual companies .

Article   MathSciNet   CAS   PubMed   PubMed Central   Google Scholar  

Hofstra, B. et al. The diversity–innovation paradox in science. Proc. Natl Acad. Sci. USA 117 , 9284–9291 (2020).

Hong, L. & Page, S. E. Groups of diverse problem solvers can outperform groups of high-ability problem solvers. Proc. Natl Acad. Sci. USA 101 , 16385–16389 (2004).

Page, S. E. Where diversity comes from and why it matters? Eur. J. Soc. Psychol. 44 , 267–279 (2014). This article reviews research demonstrating the benefits of cognitive diversity and diversity in methodological approaches for problem solving and innovation .

Clarke, A. E. & Fujimura, J. H. (eds) The Right Tools for the Job: At Work in Twentieth-Century Life Sciences (Princeton Univ. Press, 2014).

Silva, V. J., Bonacelli, M. B. M. & Pacheco, C. A. Framing the effects of machine learning on science. AI Soc. https://doi.org/10.1007/s00146-022-01515-x (2022).

Sassenberg, K. & Ditrich, L. Research in social psychology changed between 2011 and 2016: larger sample sizes, more self-report measures, and more online studies. Adv. Methods Pract. Psychol. Sci. 2 , 107–114 (2019).

Simon, A. F. & Wilder, D. Methods and measures in social and personality psychology: a comparison of JPSP publications in 1982 and 2016. J. Soc. Psychol. https://doi.org/10.1080/00224545.2022.2135088 (2022).

Anderson, C. A. et al. The MTurkification of social and personality psychology. Pers. Soc. Psychol. Bull. 45 , 842–850 (2019).

Latour, B. in The Social After Gabriel Tarde: Debates and Assessments (ed. Candea, M.) 145–162 (Routledge, 2010).

Porter, T. M. Trust in Numbers: The Pursuit of Objectivity in Science and Public Life (Princeton Univ. Press, 1996).

Lazer, D. et al. Meaningful measures of human society in the twenty-first century. Nature 595 , 189–196 (2021).

Knox, D., Lucas, C. & Cho, W. K. T. Testing causal theories with learned proxies. Annu. Rev. Polit. Sci. 25 , 419–441 (2022).

Barberá, P. Birds of the same feather tweet together: Bayesian ideal point estimation using Twitter data. Polit. Anal. 23 , 76–91 (2015).

Brady, W. J., McLoughlin, K., Doan, T. N. & Crockett, M. J. How social learning amplifies moral outrage expression in online social networks. Sci. Adv. 7 , eabe5641 (2021).

Article   PubMed   PubMed Central   ADS   Google Scholar  

Barnes, J., Klinger, R. & im Walde, S. S. Assessing state-of-the-art sentiment models on state-of-the-art sentiment datasets. In Proc. 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (eds Balahur, A. et al.) 2–12 (Association for Computational Linguistics, 2017).

Gitelman, L. (ed.) “Raw Data” is an Oxymoron (MIT Press, 2013).

Breznau, N. et al. Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty. Proc. Natl Acad. Sci. USA 119 , e2203150119 (2022). This study demonstrates how 73 research teams analysing the same dataset reached different conclusions about the relationship between immigration and public support for social policies, highlighting the subjectivity and uncertainty involved in analysing complex datasets .

Gillespie, T. in Media Technologies: Essays on Communication, Materiality, and Society (eds Gillespie, T. et al.) 167–194 (MIT Press, 2014).

Leonelli, S. Data-Centric Biology: A Philosophical Study (Univ. Chicago Press, 2016).

Wang, A., Kapoor, S., Barocas, S. & Narayanan, A. Against predictive optimization: on the legitimacy of decision-making algorithms that optimize predictive accuracy. ACM J. Responsib. Comput. , https://doi.org/10.1145/3636509 (2023).

Athey, S. Beyond prediction: using big data for policy problems. Science 355 , 483–485 (2017).

del Rosario Martínez-Ordaz, R. Scientific understanding through big data: from ignorance to insights to understanding. Possibility Stud. Soc. 1 , 279–299 (2023).

Nussberger, A.-M., Luo, L., Celis, L. E. & Crockett, M. J. Public attitudes value interpretability but prioritize accuracy in artificial intelligence. Nat. Commun. 13 , 5821 (2022).

Zittrain, J. in The Cambridge Handbook of Responsible Artificial Intelligence: Interdisciplinary Perspectives (eds. Voeneky, S. et al.) 176–184 (Cambridge Univ. Press, 2022). This article articulates the epistemic risks of prioritizing predictive accuracy over explanatory understanding when AI tools are interacting in complex systems.

Shumailov, I. et al. The curse of recursion: training on generated data makes models forget. Preprint at arxiv.org/abs/2305.17493 (2023).

Latour, B. Science In Action: How to Follow Scientists and Engineers Through Society (Harvard Univ. Press, 1987). This book provides strategies and approaches for thinking about science as a social endeavour .

Franklin, S. Science as culture, cultures of science. Annu. Rev. Anthropol. 24 , 163–184 (1995).

Haraway, D. Situated knowledges: the science question in feminism and the privilege of partial perspective. Fem. Stud. 14 , 575–599 (1988). This article acknowledges that the objective ‘view from nowhere’ is unobtainable: knowledge, it argues, is always situated .

Harding, S. Objectivity and Diversity: Another Logic of Scientific Research (Univ. Chicago Press, 2015).

Longino, H. E. Science as Social Knowledge: Values and Objectivity in Scientific Inquiry (Princeton Univ. Press, 1990).

Daston, L. & Galison, P. Objectivity (Princeton Univ. Press, 2007). This book is a historical analysis of the shifting modes of ‘objectivity’ that scientists have pursued, arguing that objectivity is not a universal concept but that it shifts alongside scientific techniques and ambitions .

Prescod-Weinstein, C. Making Black women scientists under white empiricism: the racialization of epistemology in physics. Signs J. Women Cult. Soc. 45 , 421–447 (2020).

Mavhunga, C. What Do Science, Technology, and Innovation Mean From Africa? (MIT Press, 2017).

Schiebinger, L. The Mind Has No Sex? Women in the Origins of Modern Science (Harvard Univ. Press, 1991).

Martin, E. The egg and the sperm: how science has constructed a romance based on stereotypical male–female roles. Signs J. Women Cult. Soc. 16 , 485–501 (1991). This case study shows how assumptions about gender affect scientific theories, sometimes delaying the articulation of what might be considered to be more accurate descriptions of scientific phenomena .

Harding, S. Rethinking standpoint epistemology: What is “strong objectivity”? Centen. Rev. 36 , 437–470 (1992). In this article, Harding outlines her position on ‘strong objectivity’, by which clearly articulating one’s standpoint can lead to more robust knowledge claims .

Oreskes, N. Why Trust Science? (Princeton Univ. Press, 2019). This book introduces the reader to 20 years of scholarship in science and technology studies, arguing that the tools the discipline has for understanding science can help to reinstate public trust in the institution .

Rolin, K., Koskinen, I., Kuorikoski, J. & Reijula, S. Social and cognitive diversity in science: introduction. Synthese 202 , 36 (2023).

Hong, L. & Page, S. E. Problem solving by heterogeneous agents. J. Econ. Theory 97 , 123–163 (2001).

Sulik, J., Bahrami, B. & Deroy, O. The diversity gap: when diversity matters for knowledge. Perspect. Psychol. Sci. 17 , 752–767 (2022).

Lungeanu, A., Whalen, R., Wu, Y. J., DeChurch, L. A. & Contractor, N. S. Diversity, networks, and innovation: a text analytic approach to measuring expertise diversity. Netw. Sci. 11 , 36–64 (2023).

AlShebli, B. K., Rahwan, T. & Woon, W. L. The preeminence of ethnic diversity in scientific collaboration. Nat. Commun. 9 , 5163 (2018).

Campbell, L. G., Mehtani, S., Dozier, M. E. & Rinehart, J. Gender-heterogeneous working groups produce higher quality science. PLoS ONE 8 , e79147 (2013).

Nielsen, M. W., Bloch, C. W. & Schiebinger, L. Making gender diversity work for scientific discovery and innovation. Nat. Hum. Behav. 2 , 726–734 (2018).

Yang, Y., Tian, T. Y., Woodruff, T. K., Jones, B. F. & Uzzi, B. Gender-diverse teams produce more novel and higher-impact scientific ideas. Proc. Natl Acad. Sci. USA 119 , e2200841119 (2022).

Kozlowski, D., Larivière, V., Sugimoto, C. R. & Monroe-White, T. Intersectional inequalities in science. Proc. Natl Acad. Sci. USA 119 , e2113067119 (2022).

Fehr, C. & Jones, J. M. Culture, exploitation, and epistemic approaches to diversity. Synthese 200 , 465 (2022).

Nakadai, R., Nakawake, Y. & Shibasaki, S. AI language tools risk scientific diversity and innovation. Nat. Hum. Behav. 7 , 1804–1805 (2023).

National Academies of Sciences, Engineering, and Medicine et al. Advancing Antiracism, Diversity, Equity, and Inclusion in STEMM Organizations: Beyond Broadening Participation (National Academies Press, 2023).

Winner, L. Do artifacts have politics? Daedalus 109 , 121–136 (1980).

Eubanks, V. Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor (St. Martin’s Press, 2018).

Littmann, M. et al. Validity of machine learning in biology and medicine increased through collaborations across fields of expertise. Nat. Mach. Intell. 2 , 18–24 (2020).

Carusi, A. et al. Medical artificial intelligence is as much social as it is technological. Nat. Mach. Intell. 5 , 98–100 (2023).

Raghu, M. & Schmidt, E. A survey of deep learning for scientific discovery. Preprint at arxiv.org/abs/2003.11755 (2020).

Bishop, C. AI4Science to empower the fifth paradigm of scientific discovery. Microsoft Research Blog www.microsoft.com/en-us/research/blog/ai4science-to-empower-the-fifth-paradigm-of-scientific-discovery/ (2022).

Whittaker, M. The steep cost of capture. Interactions 28 , 50–55 (2021).

Liesenfeld, A., Lopez, A. & Dingemanse, M. Opening up ChatGPT: Tracking openness, transparency, and accountability in instruction-tuned text generators. In Proc. 5th International Conference on Conversational User Interfaces 1–6 (Association for Computing Machinery, 2023).

Chu, J. S. G. & Evans, J. A. Slowed canonical progress in large fields of science. Proc. Natl Acad. Sci. USA 118 , e2021636118 (2021).

Park, M., Leahey, E. & Funk, R. J. Papers and patents are becoming less disruptive over time. Nature 613 , 138–144 (2023).

Frith, U. Fast lane to slow science. Trends Cogn. Sci. 24 , 1–2 (2020). This article explains the epistemic risks of a hyperfocus on scientific productivity and explores possible avenues for incentivizing the production of higher-quality science on a slower timescale .

Stengers, I. Another Science is Possible: A Manifesto for Slow Science (Wiley, 2018).

Lake, B. M. & Baroni, M. Human-like systematic generalization through a meta-learning neural network. Nature 623 , 115–121 (2023).

Feinman, R. & Lake, B. M. Learning task-general representations with generative neuro-symbolic modeling. Preprint at arxiv.org/abs/2006.14448 (2021).

Schölkopf, B. et al. Toward causal representation learning. Proc. IEEE 109 , 612–634 (2021).

Mitchell, M. AI’s challenge of understanding the world. Science 382 , eadm8175 (2023).

Sartori, L. & Bocca, G. Minding the gap(s): public perceptions of AI and socio-technical imaginaries. AI Soc. 38 , 443–458 (2023).

Download references

Acknowledgements

We thank D. S. Bassett, W. J. Brady, S. Helmreich, S. Kapoor, T. Lombrozo, A. Narayanan, M. Salganik and A. J. te Velthuis for comments. We also thank C. Buckner and P. Winter for their feedback and suggestions.

Author information

These authors contributed equally: Lisa Messeri, M. J. Crockett

Authors and Affiliations

Department of Anthropology, Yale University, New Haven, CT, USA

Lisa Messeri

Department of Psychology, Princeton University, Princeton, NJ, USA

M. J. Crockett

University Center for Human Values, Princeton University, Princeton, NJ, USA

Contributions

The authors contributed equally to the research and writing of the paper.

Corresponding authors

Correspondence to Lisa Messeri or M. J. Crockett .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Peer review

Peer review information.

Nature thanks Cameron Buckner, Peter Winter and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article.

Messeri, L., Crockett, M.J. Artificial intelligence and illusions of understanding in scientific research. Nature 627 , 49–58 (2024). https://doi.org/10.1038/s41586-024-07146-0

Received : 31 July 2023

Accepted : 31 January 2024

Published : 06 March 2024

Issue Date : 07 March 2024

DOI : https://doi.org/10.1038/s41586-024-07146-0

This article is cited by

AI is no substitute for having something to say

Nature Reviews Physics (2024)

Perché gli scienziati si fidano troppo dell'intelligenza artificiale - e come rimediare

Nature Italy (2024)

Why scientists trust AI too much — and what to do about it

Nature (2024)

Analytics Insight

Top 10 Must-Read Data Science Research Papers in 2022

Are you a data science enthusiast? If so, this list of data science research papers is for you.

The Research Papers Include

  • Documentation matters: human-centered AI system to assist data science code documentation in computational notebooks
  • Assessing the effects of fuel energy consumption, foreign direct investment and GDP on CO2 emission: New data science evidence from Europe & Central Asia
  • Impact on stock market across COVID-19 outbreak
  • Exploring the political pulse of a country using data science tools
  • Situating data science
  • VeridicalFlow: a Python package for building trustworthy data science pipelines with PCS
  • From AI ethics principles to data science practice: a reflection and a gap analysis based on recent frameworks and practical experience
  • Building an effective data science practice
  • Detection of road traffic anomalies based on computational data science
  • Data science data governance [AI ethics]

ML Seminar: Scientific Innovations in the Age of Generative AI

The UMN Machine Learning Seminar Series brings together faculty, students, and local industrial partners who are interested in the theoretical, computational, and applied aspects of machine learning, to pose problems, exchange ideas, and foster collaborations. The talks take place every Tuesday from 11 a.m. to 12 p.m. during the Spring 2024 semester.

This week's speaker, James Zou (Stanford University), will be giving a talk titled "Scientific Innovations in the Age of Generative AI".

This talk will explore how we can develop and use generative AI to help researchers and clinicians enable scientific innovations. I will first discuss how we use AI to generate recipes for making and validating new drugs. Then I will present how we developed visual-language AI to help clinicians interpret histology images. Finally, I will discuss how we use large language models (LLMs) to help all of us write better papers. I will also discuss perspectives on what’s on the new horizon for generative AI.

James Zou is an associate professor of Biomedical Data Science, CS and EE at Stanford University. He is also the faculty director of Stanford AI4Health. He works both on improving the foundations of ML, by making models more trustworthy and reliable, and on in-depth scientific and clinical applications. Many of his innovations are widely used in the tech and biotech industries. He has received a Sloan Fellowship, an NSF CAREER Award, two Chan-Zuckerberg Investigator Awards, a Top Ten Clinical Achievement Award, several best paper awards, and faculty awards from Google, Amazon, Tencent and Adobe. His research has also been profiled in the popular press, including the NY Times, WSJ, and WIRED.

Keller Hall 3-180 and via Zoom.

Computer Science > Computation and Language

Title: Uni-SMART: Universal Science Multimodal Analysis and Research Transformer

Abstract: In scientific research and its application, scientific literature analysis is crucial as it allows researchers to build on the work of others. However, the fast growth of scientific knowledge has led to a massive increase in scholarly articles, making in-depth literature analysis increasingly challenging and time-consuming. The emergence of Large Language Models (LLMs) has offered a new way to address this challenge. Known for their strong abilities in summarizing texts, LLMs are seen as a potential tool to improve the analysis of scientific literature. However, existing LLMs have their own limits. Scientific literature often includes a wide range of multimodal elements, such as molecular structure, tables, and charts, which are hard for text-focused LLMs to understand and analyze. This issue points to the urgent need for new solutions that can fully understand and analyze multimodal content in scientific literature. To answer this demand, we present Uni-SMART (Universal Science Multimodal Analysis and Research Transformer), an innovative model designed for in-depth understanding of multimodal scientific literature. Through rigorous quantitative evaluation across several domains, Uni-SMART demonstrates superior performance over leading text-focused LLMs. Furthermore, our exploration extends to practical applications, including patent infringement detection and nuanced analysis of charts. These applications not only highlight Uni-SMART's adaptability but also its potential to revolutionize how we interact with scientific literature.

COMMENTS

  1. Harvard Data Science Review

    HDSR is an open access platform of the Harvard Data Science Initiative that publishes foundational thinking, research milestones, educational innovations, and major applications of data science. The current issue features articles on reproducibility, replicability, impact, innovation, and paradigms in data science, as well as special themes on personalized trials, data archives, and data citations.

  2. data science Latest Research Papers

    Data Science. Information Use. Regulatory Compliance. Future Research. Public And Private. Social Good. Public And Private Sector. Effective Use. Abstract: The appetite for effective use of information assets has been steadily rising in both public and private sector organisations.

  3. Data science: a game changer for science and innovation

    This paper shows data science's potential for disruptive innovation in science, industry, policy, and people's lives. We present how data science impacts science and society at large in the coming years, including ethical problems in managing human behavior data and considering the quantitative expectations of data science economic impact. We introduce concepts such as open science and e ...

  4. Home

    Overview. The International Journal of Data Science and Analytics is a pioneering journal in data science and analytics, publishing original and applied research outcomes. Focuses on fundamental and applied research outcomes in data and analytics theories, technologies and applications. Promotes new scientific and technological approaches for ...

  5. Data Science and Analytics: An Overview from Data-Driven Smart

    Challenges and Research Directions. Our study on data science and analytics, ... to make the position of this paper. A thorough study on the data science modeling with its various processing modules that are needed to extract the actionable insights from the data for a particular business problem and the eventual data product. Thus, according ...

  6. Ten Research Challenge Areas in Data Science

    Ten Research Challenge Areas in Data Science. To drive progress in the field of data science, we propose 10 challenge areas for the research community to pursue. Since data science is broad, with methods drawing from computer science, statistics, and other disciplines, and with applications appearing in all sectors, these challenge areas speak ...

  7. 6 Papers Every Modern Data Scientist Must Read

    This paper, released in early 2021 by OpenAI, is probably one of the greatest revolutions in zero-shot classification algorithms, presenting a novel model known as Contrastive Language-Image Pre-Training, or CLIP for short. CLIP was trained over a massive dataset of 400 million pairs of images and their corresponding captions, and has learnt to ...

  8. PDF Data Science Methodologies: Current Challenges and Future Approaches

    Data science has employed great research efforts in developing advanced analytics, improving data models and cultivating new algorithms. However, not many authors have come across the organizational and socio-technical challenges that arise when executing ... Therefore, the aim of this paper is to conduct a critical review of methodologies ...

  9. Data science approaches to confronting the COVID-19 pandemic: a

    1. Introduction. The use of data science methodologies in medicine and public health has been enabled by the wide availability of big data of human mobility, contact tracing, medical imaging, virology, drug screening, bioinformatics, electronic health records and scientific literature along with the ever-growing computing power [1-4]. With these advances, the huge passion of researchers and ...

  10. Education Data Science: Past, Present, Future

    What implications did this rise of data science as a transdisciplinary methodological toolkit have for the field of education? One means of illustrating the salience of data science in education research is to study its emergence in the Education Resources Information Center's (ERIC) publication corpus. In the corpus, the growth of data science in education can be identified by the adoption ...

  11. The role of data science in healthcare advancements: applications

    The data generated generally are retained for a shorter duration, and thus, extensive research into produced data is neglected. However, advancements in data science in the field of healthcare attempt to ensure better management of data and provide enhanced patient care [20-23].

  12. [2007.03606] Data Science: A Comprehensive Overview

    Although it is widely debated whether big data is only hype and buzz, and data science is still in a very early phase, significant challenges and opportunities are emerging or have been inspired by the research, innovation, business, profession, and education of data science. This paper provides a comprehensive survey and tutorial of the ...

  13. Journal of Computational Mathematics and Data Science

    Journal of Computational Mathematics and Data Science welcomes two types of papers: A) Full research papers B) Microarticles: short papers, no more than 6 pages. They may consist of a single, but well-presented piece of information, such as: • Data and/or a plot plus a description • Description of a new numerical/computational method or ...

  14. 69901 PDFs

    Data science combines the power of computer science and applications, modeling, statistics, engineering, economy and analytics. Whereas a... | Explore the latest full-text research PDFs, articles ...

  15. Data Science and Management

    Data Science and Management (DSM) is a peer-reviewed open access journal for original research articles, review articles and technical reports related to all aspects of data science and its application in the field of business, economics, finance, operations, engineering, healthcare, transportation, agriculture, energy, environment, sports, and social management.

  16. (PDF) Data Science: the impact of statistics

    Abstract. In this paper, we substantiate our premise that statistics is one of the most important disciplines to provide tools and methods to find structure in and to give deeper insight into data ...

  17. Data Science and Analytics: An Overview from Data-Driven Smart

    The digital world has a wealth of data, such as internet of things (IoT) data, business data, health data, mobile data, urban data, security data, and many more, in the current age of the Fourth Industrial Revolution (Industry 4.0 or 4IR). Extracting knowledge or useful insights from these data can be used for smart decision-making in various applications domains. In the area of data science ...

  18. Data Science and Artificial Intelligence

    The articles in this special section are dedicated to the application of artificial intelligence (AI), machine learning (ML), and data analytics to address different problems of communication systems, presenting new trends, approaches, methods, frameworks, and systems for efficiently managing and optimizing network-related operations. Even though AI/ML is considered a key technology for next ...

  19. Articles

    The CODATA Data Science Journal is a peer-reviewed, open access, electronic journal, publishing papers on the management, dissemination, use and reuse of research data and databases across all research domains, including science, technology, the humanities and the arts. The scope of the journal includes descriptions of data systems, their implementations and their publication, applications ...

  20. (PDF) What Is Data Science?

    Abstract. Data science, a new discovery paradigm, is potentially one of the most significant advances of the early twenty-first century. Originating in scientific discovery, it is being applied to ...

  21. Ten Research Challenge Areas in Data Science

    Abstract. To drive progress in the field of data science, we propose 10 challenge areas for the research community to pursue. Since data science is broad, with methods drawing from computer science, statistics, and other disciplines, and with applications appearing in all sectors, these challenge areas speak to the breadth of issues spanning ...

  22. A Deep Dissertion of Data Science: Related Issues and its Applications

    This paper explains what data science is, how it works, and its applications. Section II of this paper consists of the different reviews regarding data science. Section III illustrates the complete process of data science. Section IV describes all the related research issues for data science.

  23. Artificial intelligence and illusions of understanding in scientific

    Gil, Y. Thoughtful artificial intelligence: forging a new partnership for data science and scientific discovery. ... The authors contributed equally to the research and writing of the paper.

  24. Top 10 Must-Read Data Science Research Papers in 2022

    VeridicalFlow: a Python package for building trustworthy data science pipelines with PCS. The research paper is written by James Duncan, Rush Kapoor, Abhineet Agarwal, Chandan Singh, and Bin Yu. This research paper is more of a journal of open-source software than a study paper. It deals with the open-source software that is the programs available ...

  25. Big data in Earth science: Emerging practice and promise

    Ideally, papers using big data would formally cite data DOIs, both to enable tracking of data usage and as a way to associate datasets and the researchers responsible for curating and publishing them with their use and citation in traditional peer-reviewed publications. Another way to extract trends is to look at the use of data from large ...

  26. How Much Can Machines Learn Finance from Chinese Text Data?

    Funding: This study was supported by the National Natural Science Foundation of China [Grants 71991471, 71991470, and 72204049], the National Key Research and Development Program [Grant 2020YFA0608604], the Shanghai Pujiang Scholar Project [Grant 21PJC010], the Shanghai Science Project [Grant 23692119300], and the China Postdoctoral Science ...

  27. ML Seminar: Scientific Innovations in the Age of Generative AI

    Finally, I will discuss how we use large language models (LLM) to help all of us to write better papers. I will also discuss perspectives on what's on the new horizon for generative AI. Biography: James Zou is an associate professor of Biomedical Data Science, CS and EE at Stanford University. He is also the faculty director of Stanford AI4Health.

  28. Uni-SMART: Universal Science Multimodal Analysis and Research Transformer

    In scientific research and its application, scientific literature analysis is crucial as it allows researchers to build on the work of others. However, the fast growth of scientific knowledge has led to a massive increase in scholarly articles, making in-depth literature analysis increasingly challenging and time-consuming. The emergence of Large Language Models (LLMs) has offered a new way to ...