100+ Machine Learning Projects with Source Code [2024]

Classification Projects

  • Wine Quality Prediction - Machine Learning
  • ML | Credit Card Fraud Detection
  • Disease Prediction Using Machine Learning
  • Recommendation System in Python
  • Detecting Spam Emails Using Tensorflow in Python
  • SMS Spam Detection using TensorFlow in Python
  • Python | Classify Handwritten Digits with Tensorflow
  • Recognizing HandWritten Digits in Scikit Learn
  • Identifying handwritten digits using Logistic Regression in PyTorch
  • Python | Customer Churn Analysis Prediction
  • Online Payment Fraud Detection using Machine Learning in Python
  • Flipkart Reviews Sentiment Analysis using Python
  • Loan Approval Prediction using Machine Learning
  • Loan Eligibility prediction using Machine Learning Models in Python
  • Stock Price Prediction using Machine Learning in Python
  • Bitcoin Price Prediction using Machine Learning in Python
  • Handwritten Digit Recognition using Neural Network
  • Parkinson Disease Prediction using Machine Learning - Python
  • Spaceship Titanic Project using Machine Learning - Python
  • Rainfall Prediction using Machine Learning - Python
  • Autism Prediction using Machine Learning
  • Predicting Stock Price Direction using Support Vector Machines
  • Fake News Detection Model using TensorFlow in Python
  • CIFAR-10 Image Classification in TensorFlow
  • Black and white image colorization with OpenCV and Deep Learning
  • ML | Kaggle Breast Cancer Wisconsin Diagnosis using Logistic Regression
  • ML | Cancer cell classification using Scikit-learn
  • ML | Kaggle Breast Cancer Wisconsin Diagnosis using KNN and Cross Validation
  • Human Scream Detection and Analysis for Controlling Crime Rate - Project Idea
  • Multiclass image classification using Transfer learning
  • Intrusion Detection System Using Machine Learning Algorithms
  • Heart Disease Prediction using ANN

Regression Projects

  • IPL Score Prediction using Deep Learning
  • Dogecoin Price Prediction with Machine Learning
  • Zillow Home Value (Zestimate) Prediction in ML
  • Calories Burnt Prediction using Machine Learning
  • Vehicle Count Prediction From Sensor Data
  • Analyzing selling price of used cars using Python
  • Box Office Revenue Prediction Using Linear Regression in ML
  • House Price Prediction using Machine Learning in Python
  • ML | Boston Housing Kaggle Challenge with Linear Regression
  • Stock Price Prediction Project using TensorFlow
  • Medical Insurance Price Prediction using Machine Learning - Python
  • Inventory Demand Forecasting using Machine Learning - Python
  • Ola Bike Ride Request Forecast using ML
  • Waiter's Tip Prediction using Machine Learning
  • Predict Fuel Efficiency Using Tensorflow in Python
  • Microsoft Stock Price Prediction with Machine Learning
  • Share Price Forecasting Using Facebook Prophet
  • Python | Implementation of Movie Recommender System
  • How can Tensorflow be used with abalone dataset to build a sequential model?
Computer Vision Projects

  • OCR of Handwritten digits | OpenCV
  • Cartooning an Image using OpenCV - Python
  • Count number of Object using Python-OpenCV
  • Count number of Faces using Python - OpenCV
  • Text Detection and Extraction using OpenCV and OCR
  • FaceMask Detection using TensorFlow in Python
  • Dog Breed Classification using Transfer Learning
  • Flower Recognition Using Convolutional Neural Network
  • Emojify using Face Recognition with Machine Learning
  • Cat & Dog Classification using Convolutional Neural Network in Python
  • Traffic Signs Recognition using CNN and Keras in Python
  • Lung Cancer Detection using Convolutional Neural Network (CNN)
  • Lung Cancer Detection Using Transfer Learning
  • Pneumonia Detection using Deep Learning
  • Detecting Covid-19 with Chest X-ray
  • Skin Cancer Detection using TensorFlow
  • Age Detection using Deep Learning in OpenCV
  • Face and Hand Landmarks Detection using Python - Mediapipe, OpenCV
  • Detecting COVID-19 From Chest X-Ray Images using CNN
  • Image Segmentation Using TensorFlow
  • License Plate Recognition with OpenCV and Tesseract OCR
  • Detect and Recognize Car License Plate from a video in real time
  • Residual Networks (ResNet) - Deep Learning

Natural Language Processing Projects

  • Twitter Sentiment Analysis using Python
  • Facebook Sentiment Analysis using python
  • Next Sentence Prediction using BERT
  • Hate Speech Detection using Deep Learning
  • Image Caption Generator using Deep Learning on Flickr8K dataset
  • Movie recommendation based on emotion in Python
  • Speech Recognition in Python using Google Speech API
  • Voice Assistant using python
  • Human Activity Recognition - Using Deep Learning Model
  • Fine-tuning BERT model for Sentiment Analysis
  • Sentiment Classification Using BERT
  • Sentiment Analysis with Recurrent Neural Networks (RNN)
  • Autocorrector Feature Using NLP In Python
  • Python | NLP analysis of Restaurant reviews
  • Restaurant Review Analysis Using NLP and SQLite

Clustering Projects

  • Customer Segmentation using Unsupervised Machine Learning in Python
  • Music Recommendation System Using Machine Learning
  • K means Clustering - Introduction
  • Image Segmentation using K Means Clustering

Reinforcement Learning Project

  • AI Driven Snake Game using Deep Q Learning

Machine learning has gained a lot of popularity and has become a necessary tool for research as well as for business. It is a revolutionary field that helps us make better decisions and automate tasks. Machine learning is the field of study that gives computers the capability to learn without being explicitly programmed.

In this article, you’ll find the top 100+ latest machine learning projects and ideas, which are beneficial for beginners as well as experienced professionals. Whether you’re a final-year student aiming for a standout resume or someone building a career, these machine learning projects provide hands-on experience, launching you into the exciting world of machine learning and data science.

Top Machine Learning Project with Source Code [2024]

We mainly include projects that solve real-world problems, to demonstrate how machine learning addresses them, for example: Online Payment Fraud Detection using Machine Learning in Python, Rainfall Prediction using Machine Learning in Python, and Facemask Detection using TensorFlow in Python.

  • Machine Learning Project for Beginners
  • Advanced Machine Learning Projects

These projects provide a great opportunity for developers to apply their knowledge of machine learning and build applications that benefit society. By implementing these data science projects, you’ll become familiar with a practical working environment where you follow instructions in real time.

Machine Learning Project for Beginners in 2024 [Source Code]

Let’s look at some of the best new machine learning projects for beginners in this section. Each project deals with a different set of issues, including supervised and unsupervised learning, classification, regression, and clustering. By the time they finish these projects, beginners will have a better understanding of the fundamentals of machine learning and be better prepared to tackle more challenging tasks.

1. Healthcare

  • ML | Heart Disease Prediction Using Logistic Regression
  • Prediction of Wine type using Deep Learning
  • Parkinson’s Disease Prediction using Machine Learning in Python
  • ML | Kaggle Breast Cancer Wisconsin Diagnosis using KNN and Cross-Validation

2. Finance and Economics

  • Credit Card Fraud Detection

3. Food and Beverage

  • Wine Quality Prediction

4. Retail and Commerce

  • Sales Forecast Prediction – Python
  • IPL Score Prediction Using Deep Learning

5. Health and Fitness

  • Medical Insurance Price Prediction using Machine Learning in Python

6. Transportation and Traffic

7. Environmental Science

  • Rainfall Prediction using Machine Learning in Python

8. Text and Image Processing

  • Cartooning an Image using OpenCV – Python
  • Count number of Faces using Python – OpenCV

9. Social Media and Sentiment Analysis

10. Other Important Machine Learning Projects

  • Human Scream Detection and Analysis for Controlling Crime Rate
  • Spaceship Titanic Project using Machine Learning in Python
  • Inventory Demand Forecasting using Machine Learning in Python
  • Waiter’s Tip Prediction using Machine Learning
  • Fake News Detection using Machine Learning

Advanced Machine Learning Projects With Source Code [2024]

This section covers a variety of advanced machine learning projects that are intended to be challenging and span a wide range of topics. They involve building deep learning models, dealing with unstructured data, and training sophisticated models such as convolutional neural networks, gated recurrent units, large language models, and reinforcement learning models.

1. Image and Video Processing

  • Cat & Dog Classification using Convolutional Neural Network in Python
  • Residual Networks (ResNet) – Deep Learning

2. Recommendation Systems

  • Ted Talks Recommendation System with Machine Learning

3. Speech and Language Processing

  • Fine-tuning the BERT model for Sentiment Analysis
  • Sentiment Analysis with Recurrent Neural Networks (RNN)
  • Autocorrect Feature Using NLP In Python

4. Health and Medical Applications

5. Security and Surveillance

  • Detect and Recognize Car License Plate from a video in real-time

6. Gaming and Entertainment

  • AI-Driven Snake Game using Deep Q Learning

7. Other Advanced Machine Learning Projects

  • Face and Hand Landmarks Detection using Python
  • Human Activity Recognition – Using Deep Learning Model
  • How can Tensorflow be used with the abalone dataset to build a sequential model?

Machine Learning Projects – FAQs

What are some good machine learning projects?

For beginners, recommended machine learning projects include sentiment analysis, sales forecast prediction, and image recognition.  

How do I start an ML project?

To start a machine learning project, the first steps involve collecting data, preprocessing it, constructing data models, and then training those models with that data.

Which language is used for machine learning?

Python and R are the most popular and widely used programming languages for machine learning.

Why do we need to build machine learning projects?

We need to build machine learning projects to solve complex problems, automate tasks and improve decision-making.

What is the future of machine learning?

Machine learning is a fast-growing field of study and research, which means that the demand for machine learning professionals is also growing. 

machine-learning-projects

Here are 175 public repositories matching this topic...

ashishpatel26 / 500-ai-machine-learning-deep-learning-computer-vision-nlp-projects-with-code

500 AI Machine learning Deep learning Computer vision NLP Projects with code

  • Updated Apr 10, 2024

deepfence / FlowMeter

⭐ ⭐ Use ML to classify flows and packets as benign or malicious. ⭐ ⭐

  • Updated Apr 8, 2024

qxresearch / qxresearch-event-1

Python hands on tutorial with 50+ Python Application (10 lines of code) @xiaowuc2

  • Updated Feb 15, 2024

milaan9 / 93_Python_Data_Analytics_Projects

This repository contains all the data analytics projects that I've worked on in python.

  • Updated Dec 9, 2022
  • Jupyter Notebook

veeralakrishna / DataCamp-Project-Solutions-Python

DataCamp Project Solutions

  • Updated Aug 23, 2023

shsarv / Machine-Learning-Projects

This Contain Some Machine Learning Projects that I have done while understanding ML Concepts.

  • Updated Mar 9, 2024

durgeshsamariya / Data-Science-Machine-Learning-Project-with-Source-Code

Data Science and Machine Learning projects with source code.

  • Updated Feb 11, 2024

Arshad221b / Sign-Language-Recognition

Indian Sign language Recognition using OpenCV

  • Updated Jun 12, 2023

fsiddh / Machine-Learning-Masters

This repository consists of content, assignments, assignment solutions, and study material provided by the ineoron ML masters course.

  • Updated Dec 8, 2022

anubhavshrimal / Machine-Learning

The projects I do in Machine Learning with PyTorch, keras, Tensorflow, scikit learn and Python.

  • Updated Oct 2, 2020

lawmurray / Birch

A probabilistic programming language that combines automatic differentiation, automatic marginalization, and automatic conditioning within Monte Carlo methods.

  • Updated Sep 21, 2023

inboxpraveen / movie-recommendation-system

Movie Recommendation System with Complete End-to-End Pipeline, Model Integration & Web Application Hosted. It will also help you build similar projects.

  • Updated Dec 13, 2023

Viveckh / LilHomie

A Machine Learning Project implemented from scratch which involves web scraping, data engineering, exploratory data analysis and machine learning to predict housing prices in New York Tri-State Area.

ayushreal / Signature-recognition

Signature recognition is a behavioural biometric. It can be operated in two different ways: Static: In this mode, users write their signature on paper, digitize it through an optical scanner or a camera, and the biometric system recognizes the signature analyzing its shape. This group is also known as “off-line”. Dynamic: In this mode, users wri…

  • Updated Apr 10, 2018

vvssttkk / dst

yet another custom data science template via cookiecutter

  • Updated Apr 21, 2023

mandliya / ml

A 60 days+ streak of daily learning of ML/DL/Maths concepts through projects

  • Updated Jun 30, 2020

souvikmajumder26 / Land-Cover-Semantic-Segmentation-PyTorch

🛣 Building an end-to-end Promptable Semantic Segmentation (Computer Vision) project from training to inferencing a model on LandCover.ai data (Satellite Imagery).

  • Updated Jun 20, 2023

thomas-young-2013 / mindware

An efficient open-source AutoML system for automating machine learning lifecycle, including feature engineering, neural architecture search, and hyper-parameter tuning.

  • Updated Jan 26, 2022

FaizanZaheerGit / StudentPerformancePrediction-ML

This is a simple machine learning project using classifiers for predicting factors which affect student grades, using data from CSV file

  • Updated Apr 3, 2024

leadthefuture / ML-and-DataScience-preparation

RoadMap and preparation material for Machine Learning and Data Science - From beginner to expert.

  • Updated Oct 25, 2022


12 Great Machine Learning Projects to Get You Started in AI

Everyone remotely connected to the data world is abuzz about machine learning. Instead of just knowing what machine learning is, why not go one better and start working with it?

If you’re looking to dive into the world of ML but don’t know where to start, I’ve curated 12 hands-on machine learning projects to help kickstart your journey. These projects range from simpler classification tasks to more complex challenges in natural language processing and image recognition.

Whether you’re a new entrant to data analytics looking for a portfolio project idea, or an intermediate coder interested in machine learning implementations, there’s something here for you. 

While you’re at it, check out our other explainer articles on machine learning models, career paths, and essential machine learning tools.

We’ll cover:

  • Classification projects
  • Natural language processing projects
  • Recommendation projects
  • Computer vision projects
  • 12 great machine learning projects

Classification projects

Classification involves sorting data into predefined classes or labels based on their features. Classification projects are a great entry point for newcomers to machine learning, as they help you understand foundational concepts through real-world problems. We’ll take a look at three such machine learning projects.

1. Predicting wine quality

Oenology, the science of wine, offers a fun introduction to the art of classification. UCI’s wine quality dataset can be used to distinguish between red and white wines.

You can use libraries such as scikit-learn or TensorFlow to predict the quality of wines based on their 11 physicochemical properties, such as acidity, presence of sulfates, and sugar. You’ll learn how to understand feature importance and handle imbalanced datasets through a real-world dataset. While the main task is classification, you can also extend the project to run regression analysis to predict wine quality scores.
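
As a rough sketch of how this workflow might look (not part of the original article), assuming the UCI winequality-red.csv file has been downloaded locally; binarizing quality at a threshold of 7 is an illustrative choice:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Load the red-wine file from the UCI wine quality dataset (semicolon-separated).
df = pd.read_csv("winequality-red.csv", sep=";")

# Binarize the target for classification: wines rated 7 or higher are "good".
X = df.drop(columns="quality")
y = (df["quality"] >= 7).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

# Feature importance hints at which physicochemical properties matter most.
print(sorted(zip(clf.feature_importances_, X.columns), reverse=True)[:5])
```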

2. Tree species identification

UCI’s leaf dataset is a great way to understand image preprocessing and feature extraction. This project looks at how to best classify tree species based on the shape and texture of their leaves. 

For extra credit, you could augment the original dataset with leaves from your own neighborhood, or extract your own set of features. The learning curve is slightly steeper than most beginner classification projects, but you’ll learn how to create more robust machine learning models .

3. Exoplanet discovery 

For the more extraterrestrially-inclined, Caltech’s Kepler Exoplanet dataset offers a unique classification challenge. Learn how to detect exoplanets (planets outside our solar system) by analyzing light curves observed through NASA’s space telescopes. 

You’ll be challenged to find ways to improve the accuracy of your model through applying advanced time-series analysis on sequential data, and learn how to handle rare events through anomaly detection. While the project requires more intensive domain knowledge research, it’s also an opportunity to demonstrate your ability to dive deep into a new subject area and come up with a useful model. 

Natural language processing projects

Natural language processing stands at the intersection of linguistics and machine learning. It’s a great subject area for coders interested in teaching machines how to understand, interpret, and generate human language in new and useful ways. I’ve listed three compelling NLP projects that are applicable across industries.

1. Spam detection

Emails would be unusable without spam detection algorithms working in the background to filter out unsolicited messages. UCI’s SMS spam collection dataset contains 5,574 messages labeled as either spam or ham (not spam). 

You can use Python’s Natural Language Toolkit (NLTK) or scikit-learn to preprocess the text, extract features, and run your prediction model. This is an excellent project for learning how to work with imbalanced datasets, which are common in real-world business scenarios, since spam messages make up only a tiny fraction of the total.
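
A minimal baseline sketch, assuming the UCI SMSSpamCollection file (tab-separated label/message pairs) sits in the working directory; TF-IDF with multinomial Naive Bayes is one common choice, not the article's prescribed method:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Each line of the file is "<label>\t<message>", the label being "ham" or "spam".
df = pd.read_csv("SMSSpamCollection", sep="\t", names=["label", "message"])

X_train, X_test, y_train, y_test = train_test_split(
    df["message"], df["label"], test_size=0.2, random_state=42,
    stratify=df["label"])  # stratify preserves the spam/ham imbalance in both splits

model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```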

2. Sentiment analysis

Every day, millions of people express their opinions online, whether through product reviews or social media posts. 

The sentiment140 dataset, released by a group of Stanford researchers, contains 1.6 million tweets for you to train your machine learning model on. It labels tweets as positive, negative, or neutral, which makes it a relatively easy introduction to the art of sentiment analysis. Learn how to interpret human emotions in text, handle slang, and make decisions on ambiguous or nuanced statements.

You’ll learn how to use techniques such as word embeddings and recurrent neural networks, which feature in many NLP applications. You can brush up on your skills in our full guide to sentiment analysis.

3. Fake news detector

Google “fake news detection” and you may not be surprised to learn how many active research projects exist to solve this problem. Identifying bias and misinformation is especially difficult given the speed of content creation. 

There are a number of fake news datasets where you can apply popular machine learning libraries like TensorFlow to handle large text data, understand what embeddings are, and grapple with the challenge of creating a model that can reliably understand nuance at scale. 

Recommendation projects

Recommendation systems, or RecSys in industry speak, power many of the most popular services we use on a day-to-day basis. These systems try to optimize and personalize our experiences based on our preferences and interactions. These machine learning projects are challenging because it can be hard to predict what users like from noisy datasets.

1. Movie recommendations

The MovieLens dataset spans 25 million ratings across 62,000 movies. Learn how to use Python libraries such as Surprise or LightFM to build recommender systems.

You’ll come away with a strong foundation in collaborative filtering, a technique to make recommendations based on user-user or item-item similarities. It’s also a great way to understand concepts like matrix factorization and user embeddings.
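A minimal collaborative-filtering sketch with the Surprise library (install via pip install scikit-surprise); it uses Surprise's small built-in MovieLens 100k sample, which the library offers to download on first use, rather than the full 25M dataset:

```python
from surprise import SVD, Dataset
from surprise.model_selection import cross_validate

# User-item-rating triples; Surprise offers to download this sample on first use.
data = Dataset.load_builtin("ml-100k")

# SVD is a matrix factorization model: it learns latent user and item embeddings.
algo = SVD(n_factors=100)
cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)
```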

2. Building Goodreads

Avid readers will particularly enjoy diving into this project: UCSD’s Goodreads’s dataset is a rich repository of user-book interactions, with more than 2 million books rated by nearly 1 million users.

Building a reading recommendation system will introduce you to concepts like content-based filtering, where recommendations are based on items’ features and users’ preferences.

3. Restaurant ratings

Yelp’s dataset of almost 7 million restaurant reviews can be used to build a model that recommends restaurants based on reviews and restaurant features. 

Learn how to handle text data, extract features, and understand how to solve thorny “cold start” problems in ML, where your model needs to make accurate recommendations for new users.

Computer vision projects

Teaching machines to “see” offers a glimpse into a future where machines can interpret and analyze visual data. If you’re interested in more advanced projects involving the analysis of pixels and patterns, here are three projects with real-world applications.

1. Facial expressions

Google’s facial expression dataset contains 156,000 images annotated by human raters to classify them according to various emotions. 

You’ll learn how to build a model that can recognize and correctly classify facial expressions using Python libraries such as OpenCV for face detection and Keras for deep learning. You can take this project in many different directions; one good use case would be to learn how to use convolutional neural networks (CNNs) to enable real-time video processing of human emotions.
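
A minimal face-detection sketch using OpenCV's bundled Haar cascade; the cropped faces would then feed an expression classifier such as a small Keras CNN. The input filename is hypothetical:

```python
import cv2

# OpenCV ships several pretrained Haar cascades; this one detects frontal faces.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("photo.jpg")  # hypothetical input image; must exist on disk
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    face_crop = gray[y:y + h, x:x + w]  # region to feed the expression model
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("faces.jpg", img)
```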

2. Traffic analysis

The promise of self-driving cars has dominated the ML news cycle for a while now. Have you ever wondered about the algorithms powering this? 

If so, check out the University at Albany’s UA-DETRAC dataset, with 10 hours of traffic video from China. Use Python libraries like TensorFlow’s Object Detection API or YOLO’s state-of-the-art object detection to build an ML system that can detect and analyze vehicles in real time. While this project will immerse you in challenging concepts like object detection techniques and tracking algorithms, you’ll come away with an impressive portfolio project at the intersection of computer vision and urban planning.

3. Human activity 

The University of Central Florida has released an action recognition dataset with more than 13,000 videos across 101 action categories, including knitting, diving, basketball and surfing. 

You’re tasked with building a model that can correctly label these activities, and more, through temporal sequence analysis. You’ll learn how to process video data and work with 3D convolutional neural networks. This is a great project for more advanced learners looking to understand the specific complexities of identifying human motion through machine learning.

12 great machine learning projects

From classifying datasets and deciphering nuanced textual sentiments in NLP, to creating accurate recommendation systems and grappling with the challenges of computer vision, these 12 machine learning projects offer many ways for you to learn new techniques and come away with a strong portfolio project to showcase your skills.

If the world of data analytics interests you but you don’t know where to start, why not explore our Machine Learning with Python Course ? It covers the basics of machine learning as a field through hands-on projects with an expert mentor, and will give you a good idea of whether or not it’s a career path you’re interested in pursuing further.

You may also be interested in the following articles:

  • What Does a Machine Learning Engineer Do?
  • Bias in Machine Learning: What Are the Ethics of AI?
  • 22 Most-Asked Machine Learning Interview Questions (and Answers!)


Machine Learning: Algorithms, Real-World Applications and Research Directions

  • Review Article
  • Published: 22 March 2021
  • Volume 2, article number 160 (2021)

  • Iqbal H. Sarker, ORCID: orcid.org/0000-0003-1740-5517


In the current age of the Fourth Industrial Revolution (4IR or Industry 4.0), the digital world has a wealth of data, such as Internet of Things (IoT) data, cybersecurity data, mobile data, business data, social media data, health data, etc. To intelligently analyze these data and develop the corresponding smart and automated applications, knowledge of artificial intelligence (AI), particularly machine learning (ML), is the key. Various types of machine learning algorithms, such as supervised, unsupervised, semi-supervised, and reinforcement learning, exist in the area. Besides, deep learning, which is part of a broader family of machine learning methods, can intelligently analyze data on a large scale. In this paper, we present a comprehensive view on these machine learning algorithms that can be applied to enhance the intelligence and the capabilities of an application. Thus, this study’s key contribution is explaining the principles of different machine learning techniques and their applicability in various real-world application domains, such as cybersecurity systems, smart cities, healthcare, e-commerce, agriculture, and many more. We also highlight the challenges and potential research directions based on our study. Overall, this paper aims to serve as a reference point for both academia and industry professionals as well as for decision-makers in various real-world situations and application areas, particularly from the technical point of view.


Introduction

We live in the age of data, where everything around us is connected to a data source, and everything in our lives is digitally recorded [ 21 , 103 ]. For instance, the current electronic world has a wealth of various kinds of data, such as the Internet of Things (IoT) data, cybersecurity data, smart city data, business data, smartphone data, social media data, health data, COVID-19 data, and many more. The data can be structured, semi-structured, or unstructured, discussed briefly in Sect. “ Types of Real-World Data and Machine Learning Techniques ”, which is increasing day-by-day. Extracting insights from these data can be used to build various intelligent applications in the relevant domains. For instance, to build a data-driven automated and intelligent cybersecurity system, the relevant cybersecurity data can be used [ 105 ]; to build personalized context-aware smart mobile applications, the relevant mobile data can be used [ 103 ], and so on. Thus, the data management tools and techniques having the capability of extracting insights or useful knowledge from the data in a timely and intelligent way is urgently needed, on which the real-world applications are based.

Fig. 1: The worldwide popularity score of various types of ML algorithms (supervised, unsupervised, semi-supervised, and reinforcement) on a scale of 0 (min) to 100 (max) over time, where the x-axis represents the timestamp and the y-axis the corresponding score

Artificial intelligence (AI), particularly machine learning (ML), has grown rapidly in recent years in the context of data analysis and computing, which typically allows applications to function in an intelligent manner [ 95 ]. ML usually provides systems with the ability to learn and improve from experience automatically without being specifically programmed, and is generally referred to as one of the most popular recent technologies of the fourth industrial revolution (4IR or Industry 4.0) [ 103 , 105 ]. “Industry 4.0” [ 114 ] is typically the ongoing automation of conventional manufacturing and industrial practices, including exploratory data processing, using new smart technologies such as machine learning automation. Thus, to intelligently analyze these data and to develop the corresponding real-world applications, machine learning algorithms are the key. The learning algorithms can be categorized into four major types: supervised, unsupervised, semi-supervised, and reinforcement learning [ 75 ], discussed briefly in Sect. “Types of Real-World Data and Machine Learning Techniques”. The popularity of these approaches to learning is increasing day by day, as shown in Fig. 1, based on data collected from Google Trends [ 4 ] over the last five years. The x-axis of the figure indicates the specific dates, and the corresponding popularity score, within the range of \(0\) (minimum) to \(100\) (maximum), is shown on the y-axis. According to Fig. 1, the popularity indication values for these learning types were low in 2015 and have been increasing day by day. These statistics motivate us to study machine learning in this paper, which can play an important role in the real world through Industry 4.0 automation.

In general, the effectiveness and the efficiency of a machine learning solution depend on the nature and characteristics of the data and the performance of the learning algorithms. In the area of machine learning algorithms, classification analysis, regression, data clustering, feature engineering and dimensionality reduction, association rule learning, and reinforcement learning techniques exist to effectively build data-driven systems [ 41 , 125 ]. Besides, deep learning, which originated from the artificial neural network and is known as part of a wider family of machine learning approaches, can be used to intelligently analyze data [ 96 ]. Thus, selecting a proper learning algorithm that is suitable for the target application in a particular domain is challenging. The reason is that the purpose of different learning algorithms is different; even the outcome of different learning algorithms in a similar category may vary depending on the data characteristics [ 106 ]. Thus, it is important to understand the principles of various machine learning algorithms and their applicability in various real-world application areas, such as IoT systems, cybersecurity services, business and recommendation systems, smart cities, healthcare and COVID-19, context-aware systems, sustainable agriculture, and many more, which are explained briefly in Sect. “Applications of Machine Learning”.

Based on the importance and potentiality of “Machine Learning” to analyze the data mentioned above, in this paper, we provide a comprehensive view on various types of machine learning algorithms that can be applied to enhance the intelligence and the capabilities of an application. Thus, the key contribution of this study is explaining the principles and potentiality of different machine learning techniques, and their applicability in various real-world application areas mentioned earlier. The purpose of this paper is, therefore, to provide a basic guide for those academia and industry people who want to study, research, and develop data-driven automated and intelligent systems in the relevant areas based on machine learning techniques.

The key contributions of this paper are listed as follows:

To define the scope of our study by taking into account the nature and characteristics of various types of real-world data and the capabilities of various learning techniques.

To provide a comprehensive view on machine learning algorithms that can be applied to enhance the intelligence and capabilities of a data-driven application.

To discuss the applicability of machine learning-based solutions in various real-world application domains.

To highlight and summarize the potential research directions within the scope of our study for intelligent data analysis and services.

The rest of the paper is organized as follows. The next section presents the types of data and machine learning algorithms in a broader sense and defines the scope of our study. We briefly discuss and explain different machine learning algorithms in the subsequent section followed by which various real-world application areas based on machine learning algorithms are discussed and summarized. In the penultimate section, we highlight several research issues and potential future directions, and the final section concludes this paper.

Types of Real-World Data and Machine Learning Techniques

Machine learning algorithms typically consume and process data to learn the related patterns about individuals, business processes, transactions, events, and so on. In the following, we discuss various types of real-world data as well as categories of machine learning algorithms.

Types of Real-World Data

Usually, the availability of data is considered as the key to construct a machine learning model or data-driven real-world systems [ 103 , 105 ]. Data can be of various forms, such as structured, semi-structured, or unstructured [ 41 , 72 ]. Besides, the “metadata” is another type that typically represents data about the data. In the following, we briefly discuss these types of data.

Structured: It has a well-defined structure, conforms to a data model following a standard order, which is highly organized and easily accessed, and used by an entity or a computer program. In well-defined schemes, such as relational databases, structured data are typically stored, i.e., in a tabular format. For instance, names, dates, addresses, credit card numbers, stock information, geolocation, etc. are examples of structured data.

Unstructured: On the other hand, there is no pre-defined format or organization for unstructured data, making it much more difficult to capture, process, and analyze, mostly containing text and multimedia material. For example, sensor data, emails, blog entries, wikis, and word processing documents, PDF files, audio files, videos, images, presentations, web pages, and many other types of business documents can be considered as unstructured data.

Semi-structured: Semi-structured data are not stored in a relational database like the structured data mentioned above, but it does have certain organizational properties that make it easier to analyze. HTML, XML, JSON documents, NoSQL databases, etc., are some examples of semi-structured data.

Metadata: It is not the normal form of data, but “data about data”. The primary difference between “data” and “metadata” is that data are simply the material that can classify, measure, or even document something relative to an organization’s data properties. On the other hand, metadata describes the relevant data information, giving it more significance for data users. A basic example of a document’s metadata might be the author, file size, date generated by the document, keywords to define the document, etc.

In the area of machine learning and data science, researchers use various widely used datasets for different purposes. These are, for example, cybersecurity datasets such as NSL-KDD [ 119 ], UNSW-NB15 [ 76 ], ISCX’12 [ 1 ], CIC-DDoS2019 [ 2 ], Bot-IoT [ 59 ], etc., smartphone datasets such as phone call logs [ 84 , 101 ], SMS Log [ 29 ], mobile application usages logs [ 137 ] [ 117 ], mobile phone notification logs [ 73 ] etc., IoT data [ 16 , 57 , 62 ], agriculture and e-commerce data [ 120 , 138 ], health data such as heart disease [ 92 ], diabetes mellitus [ 83 , 134 ], COVID-19 [ 43 , 74 ], etc., and many more in various application domains. The data can be in different types discussed above, which may vary from application to application in the real world. To analyze such data in a particular problem domain, and to extract the insights or useful knowledge from the data for building the real-world intelligent applications, different types of machine learning techniques can be used according to their learning capabilities, which is discussed in the following.

Types of Machine Learning Techniques

Machine Learning algorithms are mainly divided into four categories: Supervised learning, Unsupervised learning, Semi-supervised learning, and Reinforcement learning [ 75 ], as shown in Fig. 2 . In the following, we briefly discuss each type of learning technique with the scope of their applicability to solve real-world problems.

Fig. 2: Various types of machine learning techniques

Supervised: Supervised learning is typically the task of machine learning to learn a function that maps an input to an output based on sample input-output pairs [ 41 ]. It uses labeled training data and a collection of training examples to infer a function. Supervised learning is carried out when certain goals are identified to be accomplished from a certain set of inputs [ 105 ], i.e., a task-driven approach . The most common supervised tasks are “classification” that separates the data, and “regression” that fits the data. For instance, predicting the class label or sentiment of a piece of text, like a tweet or a product review, i.e., text classification, is an example of supervised learning.

Unsupervised: Unsupervised learning analyzes unlabeled datasets without the need for human interference, i.e., a data-driven process [ 41 ]. This is widely used for extracting generative features, identifying meaningful trends and structures, groupings in results, and exploratory purposes. The most common unsupervised learning tasks are clustering, density estimation, feature learning, dimensionality reduction, finding association rules, anomaly detection, etc.

Semi-supervised: Semi-supervised learning can be defined as a hybridization of the above-mentioned supervised and unsupervised methods, as it operates on both labeled and unlabeled data [ 41 , 105 ]. Thus, it falls between learning “without supervision” and learning “with supervision”. In the real world, labeled data could be rare in several contexts, and unlabeled data are numerous, where semi-supervised learning is useful [ 75 ]. The ultimate goal of a semi-supervised learning model is to provide a better outcome for prediction than that produced using the labeled data alone from the model. Some application areas where semi-supervised learning is used include machine translation, fraud detection, labeling data and text classification.

Reinforcement: Reinforcement learning is a type of machine learning algorithm that enables software agents and machines to automatically evaluate the optimal behavior in a particular context or environment to improve their efficiency [ 52 ], i.e., an environment-driven approach . This type of learning is based on reward or penalty, and its ultimate goal is to use insights obtained from interacting with the environment to take actions that increase the reward or minimize the risk [ 75 ]. It is a powerful tool for training AI models that can help increase automation or optimize the operational efficiency of sophisticated systems such as robotics, autonomous driving tasks, manufacturing, and supply chain logistics; however, it is not preferable for solving basic or straightforward problems.

Thus, to build effective models in various application areas, different types of machine learning techniques can play a significant role according to their learning capabilities, depending on the nature of the data discussed earlier and the target outcome. In Table 1, we summarize various types of machine learning techniques with examples. In the following, we provide a comprehensive view of machine learning algorithms that can be applied to enhance the intelligence and capabilities of a data-driven application.

Machine Learning Tasks and Algorithms

In this section, we discuss various machine learning algorithms that include classification analysis, regression analysis, data clustering, association rule learning, feature engineering for dimensionality reduction, as well as deep learning methods. A general structure of a machine learning-based predictive model has been shown in Fig. 3 , where the model is trained from historical data in phase 1 and the outcome is generated in phase 2 for the new test data.

Fig. 3: A general structure of a machine learning-based predictive model considering both the training and testing phases

Classification Analysis

Classification is regarded as a supervised learning method in machine learning, referring to a problem of predictive modeling, where a class label is predicted for a given example [ 41 ]. Mathematically, it maps a function (f) from input variables (X) to output variables (Y) as targets, labels, or categories. To predict the class of given data points, it can be carried out on structured or unstructured data. For example, spam detection, such as “spam” and “not spam” in email service providers, can be a classification problem. In the following, we summarize the common classification problems.

Binary classification: It refers to the classification tasks having two class labels such as “true and false” or “yes and no” [ 41 ]. In such binary classification tasks, one class could be the normal state, while the abnormal state could be another class. For instance, “cancer not detected” is the normal state of a task that involves a medical test, and “cancer detected” could be considered as the abnormal state. Similarly, “spam” and “not spam” in the above example of email service providers are considered as binary classification.

Multiclass classification: Traditionally, this refers to those classification tasks having more than two class labels [ 41 ]. The multiclass classification does not have the principle of normal and abnormal outcomes, unlike binary classification tasks. Instead, within a range of specified classes, examples are classified as belonging to one. For example, it can be a multiclass classification task to classify various types of network attacks in the NSL-KDD [ 119 ] dataset, where the attack categories are classified into four class labels, such as DoS (Denial of Service Attack), U2R (User to Root Attack), R2L (Root to Local Attack), and Probing Attack.

Multi-label classification: In machine learning, multi-label classification is an important consideration where an example is associated with several classes or labels. Thus, it is a generalization of multiclass classification, where the classes involved in the problem are hierarchically structured, and each example may simultaneously belong to more than one class in each hierarchical level, e.g., multi-level text classification. For instance, Google news can be presented under the categories of a “city name”, “technology”, or “latest news”, etc. Multi-label classification includes advanced machine learning algorithms that support predicting various mutually non-exclusive classes or labels, unlike traditional classification tasks where class labels are mutually exclusive [ 82 ].

Many classification algorithms have been proposed in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the most common and popular methods that are used widely in various application areas.

Naive Bayes (NB): The naive Bayes algorithm is based on Bayes’ theorem with the assumption of independence between each pair of features [ 51 ]. It works well and can be used for both binary and multi-class categories in many real-world situations, such as document or text classification, spam filtering, etc. The NB classifier can be used to effectively classify noisy instances in the data and to construct a robust prediction model [ 94 ]. The key benefit is that, compared to more sophisticated approaches, it needs only a small amount of training data to estimate the necessary parameters quickly [ 82 ]. However, its performance may suffer due to its strong assumption of feature independence. Gaussian, Multinomial, Complement, Bernoulli, and Categorical are the common variants of the NB classifier [ 82 ].
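
As a brief, illustrative sketch (not part of the paper), training the Gaussian variant with scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)  # continuous features suit the Gaussian variant
print(cross_val_score(GaussianNB(), X, y, cv=5).mean())  # mean CV accuracy
```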

Linear Discriminant Analysis (LDA): Linear Discriminant Analysis (LDA) is a linear decision boundary classifier created by fitting class conditional densities to data and applying Bayes’ rule [ 51 , 82 ]. This method is also known as a generalization of Fisher’s linear discriminant, which projects a given dataset into a lower-dimensional space, i.e., a reduction of dimensionality that minimizes the complexity of the model or reduces the resulting model’s computational costs. The standard LDA model usually suits each class with a Gaussian density, assuming that all classes share the same covariance matrix [ 82 ]. LDA is closely related to ANOVA (analysis of variance) and regression analysis, which seek to express one dependent variable as a linear combination of other features or measurements.

Logistic regression (LR): Another common probabilistic statistical model used to solve classification problems in machine learning is logistic regression (LR) [ 64 ]. Logistic regression typically estimates probabilities using the logistic function, also referred to as the sigmoid function, \(g(z) = \frac{1}{1 + e^{-z}}\) (Eq. 1). It can overfit high-dimensional datasets and works well when the dataset can be separated linearly. Regularization (L1 and L2) techniques [ 82 ] can be used to avoid over-fitting in such scenarios. The assumption of linearity between the dependent and independent variables is considered a major drawback of logistic regression. It can be used for both classification and regression problems, but it is more commonly used for classification.

K-nearest neighbors (KNN): K-nearest neighbors (KNN) [ 9 ] is an “instance-based learning” or non-generalizing learning algorithm, also known as a “lazy learning” algorithm. It does not focus on constructing a general internal model; instead, it stores all instances corresponding to training data in n-dimensional space. KNN uses data and classifies new data points based on similarity measures (e.g., the Euclidean distance function) [ 82 ]. Classification is computed from a simple majority vote of the k nearest neighbors of each point. It is quite robust to noisy training data, and accuracy depends on the data quality. The biggest issue with KNN is choosing the optimal number of neighbors to consider. KNN can be used for classification as well as regression.
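
A short illustrative sketch (not from the paper) of the choice-of-k issue, using scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 5, 15):  # accuracy varies with the number of neighbors
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"k={k}: accuracy={acc:.3f}")
```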

Support vector machine (SVM): In machine learning, another common technique that can be used for classification, regression, or other tasks is a support vector machine (SVM) [ 56 ]. In high- or infinite-dimensional space, a support vector machine constructs a hyper-plane or set of hyper-planes. Intuitively, the hyper-plane, which has the greatest distance from the nearest training data points in any class, achieves a strong separation since, in general, the greater the margin, the lower the classifier’s generalization error. It is effective in high-dimensional spaces and can behave differently based on different mathematical functions known as the kernel. Linear, polynomial, radial basis function (RBF), sigmoid, etc., are the popular kernel functions used in SVM classifier [ 82 ]. However, when the data set contains more noise, such as overlapping target classes, SVM does not perform well.
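
An illustrative scikit-learn sketch (not from the paper) comparing the kernel functions mentioned above, with standardized features since SVMs are scale-sensitive:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    print(kernel, cross_val_score(model, X, y, cv=5).mean().round(3))
```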

Decision tree (DT): Decision tree (DT) [ 88 ] is a well-known non-parametric supervised learning method. DT learning methods are used for both classification and regression tasks [ 82 ]. ID3 [ 87 ], C4.5 [ 88 ], and CART [ 20 ] are well-known DT algorithms. Moreover, the recently proposed BehavDT [ 100 ] and IntrudTree [ 97 ] by Sarker et al. are effective in the relevant application domains, such as user behavior analytics and cybersecurity analytics, respectively. By sorting down the tree from the root to some leaf node, as shown in Fig. 4, DT classifies the instances. Instances are classified by checking the attribute defined by that node, starting at the root node of the tree, and then moving down the tree branch corresponding to the attribute value. For splitting, the most popular criteria are “gini” for the Gini impurity and “entropy” for the information gain, which can be expressed mathematically as \(\mathrm{Gini} = 1 - \sum_{i} p_i^2\) and \(\mathrm{Entropy} = -\sum_{i} p_i \log_2 p_i\), where \(p_i\) is the proportion of instances of class i at a given node [ 82 ].

Fig. 4: An example of a decision tree structure

Fig. 5: An example of a random forest structure considering multiple decision trees

Random forest (RF): A random forest classifier [ 19 ] is well known as an ensemble classification technique that is used in the field of machine learning and data science in various application areas. This method uses “parallel ensembling” which fits several decision tree classifiers in parallel, as shown in Fig. 5 , on different data set sub-samples and uses majority voting or averages for the outcome or final result. It thus minimizes the over-fitting problem and increases the prediction accuracy and control [ 82 ]. Therefore, the RF learning model with multiple decision trees is typically more accurate than a single decision tree based model [ 106 ]. To build a series of decision trees with controlled variation, it combines bootstrap aggregation (bagging) [ 18 ] and random feature selection [ 11 ]. It is adaptable to both classification and regression problems and fits well for both categorical and continuous values.
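
An illustrative scikit-learn sketch (not from the paper): bagging over decision trees with random feature selection at each split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
# max_features="sqrt" enables random feature selection at each split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())
```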

Adaptive Boosting (AdaBoost): Adaptive Boosting (AdaBoost) is an ensemble learning process that employs an iterative approach to improve poor classifiers by learning from their errors. It was developed by Yoav Freund et al. [ 35 ] and is also known as “meta-learning”. Unlike the random forest, which uses parallel ensembling, AdaBoost uses “sequential ensembling”. It creates a powerful classifier by combining many poorly performing classifiers to obtain a good classifier of high accuracy. In that sense, AdaBoost is called an adaptive classifier because it significantly improves the efficiency of the classifier, but in some instances it can trigger overfitting. AdaBoost is best used to boost the performance of decision trees (the base estimator [ 82 ]) on binary classification problems; however, it is sensitive to noisy data and outliers.

Extreme gradient boosting (XGBoost): Gradient Boosting, like Random Forests [ 19 ] above, is an ensemble learning algorithm that generates a final model based on a series of individual models, typically decision trees. The gradient is used to minimize the loss function, similar to how neural networks [ 41 ] use gradient descent to optimize weights. Extreme Gradient Boosting (XGBoost) is a form of gradient boosting that takes more detailed approximations into account when determining the best model [ 82 ]. It computes second-order gradients of the loss function to minimize loss and advanced regularization (L1 and L2) [ 82 ], which reduces over-fitting, and improves model generalization and performance. XGBoost is fast to interpret and can handle large-sized datasets well.
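
An illustrative sketch with the xgboost library (not from the paper; install via pip install xgboost), where reg_alpha and reg_lambda map to the L1/L2 penalties described above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
# reg_alpha (L1) and reg_lambda (L2) are the regularization terms noted above.
model = XGBClassifier(n_estimators=200, learning_rate=0.1,
                      reg_alpha=0.0, reg_lambda=1.0)
print(cross_val_score(model, X, y, cv=5).mean())
```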

Stochastic gradient descent (SGD): Stochastic gradient descent (SGD) [ 41 ] is an iterative method for optimizing an objective function with appropriate smoothness properties, where the word ‘stochastic’ refers to random probability. This reduces the computational burden, particularly in high-dimensional optimization problems, allowing for faster iterations in exchange for a lower convergence rate. A gradient is the slope of a function that calculates a variable’s degree of change in response to another variable’s changes. Mathematically, gradient descent is a convex function whose output is a partial derivative of a set of its input parameters. Let \(\alpha\) be the learning rate and \(J_i\) the cost of the \(i^\mathrm{th}\) training example; then Eq. (4) gives the stochastic gradient descent weight update at the \(j^\mathrm{th}\) iteration: \(w_{j+1} = w_j - \alpha \frac{\partial J_i}{\partial w_j}\). In large-scale and sparse machine learning, SGD has been successfully applied to problems often encountered in text classification and natural language processing [ 82 ]. However, SGD is sensitive to feature scaling and needs a range of hyperparameters, such as the regularization parameter and the number of iterations.
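
An illustrative scikit-learn sketch (not from the paper), with the feature scaling the text flags as important:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# Scale first, since SGD is sensitive to feature scaling.
model = make_pipeline(StandardScaler(),
                      SGDClassifier(alpha=1e-4, max_iter=1000, random_state=0))
print(cross_val_score(model, X, y, cv=5).mean())
```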

Rule-based classification : The term rule-based classification can be used to refer to any classification scheme that makes use of IF-THEN rules for class prediction. Several classification algorithms such as Zero-R [ 125 ], One-R [ 47 ], decision trees [ 87 , 88 ], DTNB [ 110 ], Ripple Down Rule learner (RIDOR) [ 125 ], Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [ 126 ] exist with the ability of rule generation. The decision tree is one of the most common rule-based classification algorithms among these techniques because it has several advantages, such as being easier to interpret; the ability to handle high-dimensional data; simplicity and speed; good accuracy; and the capability to produce rules for human clear and understandable classification [ 127 ] [ 128 ]. The decision tree-based rules also provide significant accuracy in a prediction model for unseen test cases [ 106 ]. Since the rules are easily interpretable, these rule-based classifiers are often used to produce descriptive models that can describe a system including the entities and their relationships.

Fig. 6: Classification vs. regression. In classification, the dotted line represents a linear boundary that separates the two classes; in regression, the dotted line models the linear relationship between the two variables

Regression Analysis

Regression analysis includes several methods of machine learning that allow one to predict a continuous (y) outcome variable based on the value of one or more (x) predictor variables [ 41 ]. The most significant distinction between classification and regression is that classification predicts distinct class labels, while regression facilitates the prediction of a continuous quantity. Figure 6 shows an example of how classification differs from regression models. Some overlaps are often found between the two types of machine learning algorithms. Regression models are now widely used in a variety of fields, including financial forecasting or prediction, cost estimation, trend analysis, marketing, time series estimation, drug response modeling, and many more. Some of the familiar types of regression algorithms are linear, polynomial, lasso, and ridge regression, etc., which are explained briefly in the following.

Simple and multiple linear regression: This is one of the most popular ML modeling techniques as well as a well-known regression technique. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the form of the regression line is linear. Linear regression creates a relationship between the dependent variable (Y) and one or more independent variables (X), known as the regression line, using the best-fit straight line [41]. It is defined by the following equations:

\[ y = a + bx + e \qquad (5) \]

\[ y = a + b_1x_1 + b_2x_2 + \cdots + b_nx_n + e \qquad (6) \]

where a is the intercept, b is the slope of the line, and e is the error term. These equations can be used to predict the value of the target variable based on the given predictor variable(s). Multiple linear regression is an extension of simple linear regression that allows two or more predictor variables to model a response variable, y, as a linear function [41], defined in Eq. (6), whereas simple linear regression has only one independent variable, defined in Eq. (5).
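
A minimal sketch of both forms with scikit-learn follows; the toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simple linear regression (Eq. 5): one predictor.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 4.2, 5.9, 8.1])
simple = LinearRegression().fit(X, y)
print(simple.intercept_, simple.coef_)  # estimates of a and b

# Multiple linear regression (Eq. 6): two predictors.
X_multi = np.array([[1.0, 0.5], [2.0, 1.5], [3.0, 0.2], [4.0, 2.0]])
multi = LinearRegression().fit(X_multi, y)
print(multi.intercept_, multi.coef_)    # estimates of a, b1, b2
```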

Polynomial regression: Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is not linear but is modeled as an \(n^\mathrm{th}\)-degree polynomial in x [82]. The equation for polynomial regression is derived from the linear regression equation (polynomial regression of degree 1) and is defined as below:

\[ y = b_0 + b_1x + b_2x^2 + \cdots + b_nx^n \]

Here, y is the predicted/target output, \(b_0, b_1, \ldots, b_n\) are the regression coefficients, and x is an independent/input variable. In simple words, if the data are not distributed linearly but instead follow an \(n^\mathrm{th}\)-degree polynomial, then polynomial regression is used to obtain the desired output.
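
In practice this is often implemented by expanding the features and then fitting a linear model, as in the following scikit-learn sketch; the degree and data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.linspace(-3, 3, 20).reshape(-1, 1)
y = 1.0 + 2.0 * X.ravel() - 0.5 * X.ravel() ** 2  # quadratic ground truth

# Expand x into [1, x, x^2], then fit ordinary linear regression on it.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly.fit(X, y)
print(poly.predict([[1.5]]))
```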

LASSO and ridge regression: LASSO and ridge regression are well known as powerful techniques typically used for building learning models in the presence of a large number of features, due to their capability of preventing over-fitting and reducing the complexity of the model. The LASSO (least absolute shrinkage and selection operator) regression model uses the L1 regularization technique [82], which applies shrinkage by penalizing the “absolute value of the magnitude of coefficients” (L1 penalty). As a result, LASSO can shrink coefficients exactly to zero. Thus, LASSO regression aims to find the subset of predictors that minimizes the prediction error for a quantitative response variable. On the other hand, ridge regression uses L2 regularization [82], which penalizes the “squared magnitude of coefficients” (L2 penalty). Thus, ridge regression forces the weights to be small but never sets a coefficient exactly to zero, yielding a non-sparse solution. Overall, LASSO regression is useful for obtaining a subset of predictors by eliminating less important features, and ridge regression is useful when a dataset has “multicollinearity”, i.e., predictors that are correlated with other predictors.
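
The contrast is easy to see on synthetic data; in this illustrative sketch (all values assumed), the L1 model zeroes out uninformative coefficients while the L2 model only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty: most coefficients become exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: coefficients shrink but stay nonzero
print(np.round(lasso.coef_, 2))
print(np.round(ridge.coef_, 2))
```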

Cluster Analysis

Cluster analysis, also known as clustering, is an unsupervised machine learning technique for identifying and grouping related data points in large datasets without concern for a specific outcome. It groups a collection of objects in such a way that objects in the same category, called a cluster, are in some sense more similar to each other than objects in other groups [41]. It is often used as a data analysis technique to discover interesting trends or patterns in data, e.g., groups of consumers based on their behavior. Clustering can be used in a broad range of application areas, such as cybersecurity, e-commerce, mobile data processing, health analytics, user modeling, and behavioral analytics. In the following, we briefly discuss and summarize various types of clustering methods.

Partitioning methods: Based on the features and similarities in the data, this clustering approach categorizes the data into multiple groups or clusters. Data scientists or analysts typically determine the number of clusters to produce, either dynamically or statically, depending on the nature of the target application. The most common clustering algorithms based on partitioning methods are K-means [69], K-medoids [80], CLARA [55], etc.

Density-based methods: To identify distinct groups or clusters, these methods use the concept that a cluster in the data space is a contiguous region of high point density, isolated from other such clusters by contiguous regions of low point density. Points that are not part of a cluster are considered noise. The typical density-based clustering algorithms are DBSCAN [32], OPTICS [12], etc. Density-based methods typically struggle with clusters of varying density and with high-dimensional data.

Hierarchical-based methods: Hierarchical clustering typically seeks to construct a hierarchy of clusters, i.e., a tree structure. Strategies for hierarchical clustering generally fall into two types: (i) agglomerative, a “bottom-up” approach in which each observation begins in its own cluster and pairs of clusters are merged as one moves up the hierarchy, and (ii) divisive, a “top-down” approach in which all observations begin in one cluster and splits are performed recursively as one moves down the hierarchy, as shown in Fig. 7. Our earlier proposed BOTS technique, Sarker et al. [102], is an example of a hierarchical, particularly bottom-up, clustering algorithm.

Grid-based methods: To deal with massive datasets, grid-based clustering is especially suitable. To obtain clusters, the principle is first to summarize the dataset with a grid representation and then to combine grid cells. STING [ 122 ], CLIQUE [ 6 ], etc. are the standard algorithms of grid-based clustering.

Model-based methods: There are mainly two types of model-based clustering algorithms: one that uses statistical learning, and the other based on a method of neural network learning [ 130 ]. For instance, GMM [ 89 ] is an example of a statistical learning method, and SOM [ 22 ] [ 96 ] is an example of a neural network learning method.

Constraint-based methods: Constraint-based clustering is a semi-supervised approach to data clustering that uses constraints to incorporate domain knowledge. Application- or user-oriented constraints are incorporated to perform the clustering. The typical algorithms of this kind of clustering are COP K-means [121], CMWK-Means [27], etc.

Fig. 7 A graphical interpretation of the widely used hierarchical clustering (bottom-up and top-down) technique

Many clustering algorithms with the ability to group data have been proposed in the machine learning and data science literature [41, 125]. In the following, we summarize the popular methods that are widely used in various application areas.

K-means clustering: K-means clustering [69] is a fast, robust, and simple algorithm that provides reliable results when the data sets are well separated from each other. The algorithm allocates the data points to clusters in such a way that the sum of the squared distances between the data points and their cluster centroid is as small as possible. In other words, the K-means algorithm identifies k centroids and then assigns each data point to the nearest cluster, keeping the clusters as compact as possible. Since it begins with a random selection of cluster centers, the results can be inconsistent. Moreover, since extreme values can easily affect a mean, the K-means algorithm is sensitive to outliers. K-medoids clustering [91] is a variant of K-means that is more robust to noise and outliers.
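
A minimal scikit-learn sketch follows; the blob data and the choice of k = 3 are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# k must be chosen up front; n_init reruns the random initialization
# several times to reduce the inconsistency noted above.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
print(km.labels_[:10])
```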

Mean-shift clustering: Mean-shift clustering [37] is a nonparametric clustering technique that does not require prior knowledge of the number of clusters or constraints on cluster shape. Mean-shift clustering aims to discover “blobs” in a smooth distribution or density of samples [82]. It is a centroid-based algorithm that works by updating centroid candidates to be the mean of the points in a given region. To form the final set of centroids, these candidates are filtered in a post-processing stage to remove near-duplicates. Cluster analysis in computer vision and image processing are examples of its application domains. Mean shift has the disadvantage of being computationally expensive. Moreover, it does not work well in high-dimensional settings, where the number of clusters can shift abruptly.

DBSCAN: Density-based spatial clustering of applications with noise (DBSCAN) [32] is a base algorithm for density-based clustering that is widely used in data mining and machine learning. It is a non-parametric density-based clustering technique for separating high-density clusters from low-density clusters. DBSCAN’s main idea is that a point belongs to a cluster if it is close to many points from that cluster. It can find clusters of various shapes and sizes in a vast volume of data that is noisy and contains outliers. Unlike k-means, DBSCAN does not require a priori specification of the number of clusters and can find arbitrarily shaped clusters. Although k-means is much faster than DBSCAN, DBSCAN is efficient at finding high-density regions as well as outliers, i.e., it is robust to outliers.
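
The following sketch shows DBSCAN on the classic two-moons data, where k-means fails because the clusters are not spherical; the eps and min_samples settings are illustrative.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps is the neighborhood radius; min_samples is the density threshold.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))  # label -1, if present, marks noise points
```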

GMM clustering: Gaussian mixture models (GMMs), a distribution-based clustering approach, are often used for data clustering. A Gaussian mixture model is a probabilistic model in which all the data points are assumed to be generated by a mixture of a finite number of Gaussian distributions with unknown parameters [82]. To find the Gaussian parameters for each cluster, an optimization algorithm called expectation-maximization (EM) [82] can be used. EM is an iterative method that uses a statistical model to estimate the parameters. In contrast to k-means, Gaussian mixture models account for uncertainty and return the probability that a data point belongs to each of the k clusters. GMM clustering is more robust than k-means and works well even with non-linear data distributions.

Agglomerative hierarchical clustering: The most common method of hierarchical clustering used to group objects in clusters based on their similarity is agglomerative clustering. This technique uses a bottom-up approach, where each object is first treated as a singleton cluster by the algorithm. Following that, pairs of clusters are merged one by one until all clusters have been merged into a single large cluster containing all objects. The result is a dendrogram, which is a tree-based representation of the elements. Single linkage [ 115 ], Complete linkage [ 116 ], BOTS [ 102 ] etc. are some examples of such techniques. The main advantage of agglomerative hierarchical clustering over k-means is that the tree-structure hierarchy generated by agglomerative clustering is more informative than the unstructured collection of flat clusters returned by k-means, which can help to make better decisions in the relevant application areas.
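
A short sketch using scikit-learn's agglomerative implementation is given below; the data and the complete-linkage choice are illustrative.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=1)

# Bottom-up merging of clusters; linkage="complete" matches the
# complete-linkage criterion cited above ("single" is also available).
agg = AgglomerativeClustering(n_clusters=3, linkage="complete").fit(X)
print(agg.labels_)
```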

Dimensionality Reduction and Feature Learning

In machine learning and data science, high-dimensional data processing is a challenging task for both researchers and application developers. Thus, dimensionality reduction, which is an unsupervised learning technique, is important because it leads to better human interpretation, lowers computational costs, and avoids overfitting and redundancy by simplifying models. Both feature selection and feature extraction can be used for dimensionality reduction. The primary distinction between the two is that “feature selection” keeps a subset of the original features [97], while “feature extraction” creates brand-new ones [98]. In the following, we briefly discuss these techniques.

Feature selection: The selection of features, also known as the selection of variables or attributes in the data, is the process of choosing a subset of unique features (variables, predictors) to use in building a machine learning and data science model. It decreases a model’s complexity by eliminating irrelevant or less important features and allows for faster training of machine learning algorithms. A right and optimal subset of the selected features in a problem domain is capable of minimizing the overfitting problem by simplifying and generalizing the model, as well as increasing the model’s accuracy [97]. Thus, “feature selection” [66, 99] is considered one of the primary concepts in machine learning, greatly affecting the effectiveness and efficiency of the target machine learning model. The chi-squared test, analysis of variance (ANOVA) test, Pearson’s correlation coefficient, and recursive feature elimination are some popular techniques that can be used for feature selection.

Feature extraction: In a machine learning-based model or system, feature extraction techniques usually provide a better understanding of the data, a way to improve prediction accuracy, and a means to reduce computational cost or training time. The aim of “feature extraction” [66, 99] is to reduce the number of features in a dataset by generating new features from the existing ones and then discarding the original features. The majority of the information found in the original set of features can then be summarized using this new, reduced set of features. For instance, principal component analysis (PCA) is often used as a dimensionality-reduction technique to extract a lower-dimensional space by creating brand-new components from the existing features in a dataset [98].

Many algorithms have been proposed to reduce data dimensions in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the popular methods that are used widely in various application areas.

Variance threshold: A simple baseline approach to feature selection is the variance threshold [82]. This excludes all features of low variance, i.e., all features whose variance does not exceed the threshold. By default it eliminates all zero-variance features, i.e., features that have the same value in all samples. This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can therefore be used for unsupervised learning.
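
As illustrated in the toy sketch below (values assumed), the constant first column is dropped without ever consulting a target variable.

```python
from sklearn.feature_selection import VarianceThreshold

X = [[0, 2.0, 1], [0, 1.5, 0], [0, 2.5, 1], [0, 2.2, 0]]  # first column is constant

vt = VarianceThreshold(threshold=0.0)  # the default: drop zero-variance features
print(vt.fit_transform(X))             # constant column removed; no y required
```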

Pearson correlation: Pearson’s correlation is another method for understanding a feature’s relation to the response variable and can be used for feature selection [99]. This method is also used for finding the association between features in a dataset. The resulting value lies in \([-1, 1]\), where \(-1\) means perfect negative correlation, \(+1\) means perfect positive correlation, and 0 means that the two variables do not have a linear correlation. If X and Y are two random variables, then the correlation coefficient between X and Y is defined as [41]

\[ r(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \, \sigma_Y} \]

where \(\mathrm{cov}(X, Y)\) is the covariance of X and Y, and \(\sigma_X\) and \(\sigma_Y\) are their standard deviations.

ANOVA: Analysis of variance (ANOVA) is a statistical tool used to test whether the mean values of two or more groups differ significantly from each other. ANOVA assumes a linear relationship between the variables and the target, and that the variables are normally distributed. To statistically test the equality of means, the ANOVA method utilizes F-tests. For feature selection, the resulting “ANOVA F-value” [82] of this test can be used, whereby features that are independent of the target variable can be omitted.

Chi square: The chi-square \({\chi }^2\) [82] statistic is an estimate of the difference between the observed and expected frequencies of a series of events or variables. The value of \({\chi }^2\) depends on the magnitude of the difference between the observed and expected values, the degrees of freedom, and the sample size. The chi-square \({\chi }^2\) test is commonly used for testing relationships between categorical variables. If \(O_i\) represents the observed value and \(E_i\) the expected value, then

\[ \chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i} \]
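
In scikit-learn, both the ANOVA F-test and the chi-square test are available as univariate scoring functions for SelectKBest, as the sketch below illustrates; the iris data and k = 2 are assumptions, and chi2 requires non-negative features.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, f_classif

X, y = load_iris(return_X_y=True)  # all features non-negative, as chi2 requires

print(SelectKBest(f_classif, k=2).fit(X, y).get_support())  # ANOVA F-test
print(SelectKBest(chi2, k=2).fit(X, y).get_support())       # chi-square test
```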

Recursive feature elimination (RFE): Recursive feature elimination (RFE) is a brute-force approach to feature selection. RFE [82] fits the model and removes the weakest feature, repeating the process until the specified number of features is reached. Features are ranked by the model’s coefficients or feature importances. By recursively removing a small number of features per iteration, RFE aims to eliminate dependencies and collinearity in the model.
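
A short sketch of RFE wrapped around a linear model follows; the estimator and the target feature count are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Repeatedly refit the estimator and drop the weakest feature until 2 remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print(rfe.support_)   # mask of kept features
print(rfe.ranking_)   # 1 = kept; higher values were eliminated earlier
```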

Model-based selection: To reduce the dimensionality of the data, linear models penalized with L1 regularization can be used. Least absolute shrinkage and selection operator (LASSO) regression is a type of linear regression with the property of shrinking some of the coefficients to zero [82]; such features can then be removed from the model. Thus, the penalized LASSO regression method is often used in machine learning to select a subset of variables. The Extra Trees classifier [82] is an example of a tree-based estimator that can be used to compute impurity-based feature importances, which can then be used to discard irrelevant features.

Principal component analysis (PCA): Principal component analysis (PCA) is a well-known unsupervised learning approach in the field of machine learning and data science. PCA is a mathematical technique that transforms a set of correlated variables into a set of uncorrelated variables known as principal components [48, 81]. Figure 8 shows an example of the effect of PCA in different dimension spaces: Fig. 8a shows the original features in 3D space, and Fig. 8b shows the created principal components, projecting the data onto a 2D plane with PC1 and PC2 and onto a 1D line with PC1, respectively. Thus, PCA can be used as a feature extraction technique that reduces the dimensionality of a dataset in order to build an effective machine learning model [98]. Technically, PCA identifies the eigenvectors of a covariance matrix with the highest eigenvalues and then uses those to project the data into a new subspace of equal or fewer dimensions [82].
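
A minimal PCA sketch with scikit-learn is shown below; reducing iris's four features to two components is an illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)             # project 4-D features onto PC1 and PC2
X_2d = pca.fit_transform(X)
print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```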

Fig. 8 An example of principal component analysis (PCA) and the created principal components PC1 and PC2 in different dimension spaces

Association Rule Learning

Association rule learning is a rule-based machine learning approach for discovering interesting relationships between variables in large datasets, expressed as “IF-THEN” statements [7]. One example is: “if a customer buys a computer or laptop (an item), s/he is likely to also buy anti-virus software (another item) at the same time”. Association rules are employed today in many application areas, including IoT services, medical diagnosis, usage behavior analytics, web usage mining, smartphone applications, cybersecurity applications, and bioinformatics. In comparison to sequence mining, association rule learning does not usually take into account the order of items within or across transactions. A common way of measuring the usefulness of association rules is to use the ‘support’ and ‘confidence’ parameters introduced in [7].

In the data mining literature, many association rule learning methods have been proposed, such as logic dependent [ 34 ], frequent pattern based [ 8 , 49 , 68 ], and tree-based [ 42 ]. The most popular association rule learning algorithms are summarized below.

AIS and SETM: AIS is the first algorithm, proposed by Agrawal et al. [7], for association rule mining. The AIS algorithm’s main downside is that too many candidate itemsets are generated, requiring more space and wasting a lot of effort. This algorithm requires too many passes over the entire dataset to produce the rules. Another approach, SETM [49], exhibits good performance and stable behavior with respect to execution time; however, it suffers from the same flaw as the AIS algorithm.

Apriori: For generating association rules from a given dataset, Agrawal et al. [8] proposed the Apriori, Apriori-TID, and Apriori-Hybrid algorithms. These later algorithms outperform the AIS and SETM algorithms mentioned above due to the Apriori property of frequent itemsets [8]. The term ‘Apriori’ usually refers to having prior knowledge of frequent itemset properties. Apriori uses a “bottom-up” approach to generate the candidate itemsets. To reduce the search space, Apriori uses the property that “all subsets of a frequent itemset must be frequent; and if an itemset is infrequent, then all its supersets must also be infrequent”. Another approach, predictive Apriori [108], can also generate rules; however, it can produce unexpected results, as it combines both support and confidence. Apriori [8] remains the most widely applied technique in mining association rules.
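
As a brief illustration using the third-party mlxtend library (an assumed tooling choice, not part of the original algorithms' implementations), the sketch below mines frequent itemsets with Apriori from a tiny one-hot-encoded basket table and derives rules from them.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded transactions: each row is a customer basket (toy data).
baskets = pd.DataFrame(
    [[1, 1, 0], [1, 1, 1], [0, 1, 1], [1, 1, 1]],
    columns=["laptop", "antivirus", "mouse"],
).astype(bool)

frequent = apriori(baskets, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```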

ECLAT: This technique was proposed by Zaki et al. [ 131 ] and stands for Equivalence Class Clustering and bottom-up Lattice Traversal. ECLAT uses a depth-first search to find frequent itemsets. In contrast to the Apriori [ 8 ] algorithm, which represents data in a horizontal pattern, it represents data vertically. Hence, the ECLAT algorithm is more efficient and scalable in the area of association rule learning. This algorithm is better suited for small and medium datasets whereas the Apriori algorithm is used for large datasets.

FP-Growth: Another common association rule learning technique, based on the frequent-pattern tree (FP-tree) proposed by Han et al. [42], is frequent pattern growth, known as FP-Growth. The key difference from Apriori is that while generating rules, the Apriori algorithm [8] generates frequent candidate itemsets, whereas the FP-Growth algorithm [42] avoids candidate generation and instead builds a tree using a ‘divide and conquer’ strategy. Due to its sophistication, however, the FP-tree is challenging to use in an interactive mining environment [133]. Moreover, the FP-tree may not fit into memory for massive datasets, making it challenging to process big data as well. Another solution, RARM (Rapid Association Rule Mining), was proposed by Das et al. [26] but faces a related FP-tree issue [133].

ABC-RuleMiner: ABC-RuleMiner is a rule-based machine learning method, recently proposed in our earlier paper, Sarker et al. [104], that discovers interesting non-redundant rules to provide real-world intelligent services. This algorithm effectively identifies redundancy in associations by taking into account the impact or precedence of the related contextual features, and discovers a set of non-redundant association rules. The algorithm first constructs an association generation tree (AGT) in a top-down fashion and then extracts the association rules by traversing the tree. Thus, ABC-RuleMiner is more potent than traditional rule-based methods in terms of both non-redundant rule generation and intelligent decision-making, particularly in a context-aware smart computing environment where human or user preferences are involved.

Among the association rule learning techniques discussed above, Apriori [ 8 ] is the most widely used algorithm for discovering association rules from a given dataset [ 133 ]. The main strength of the association learning technique is its comprehensiveness, as it generates all associations that satisfy the user-specified constraints, such as minimum support and confidence value. The ABC-RuleMiner approach [ 104 ] discussed earlier could give significant results in terms of non-redundant rule generation and intelligent decision-making for the relevant application areas in the real world.

Reinforcement Learning

Reinforcement learning (RL) is a machine learning technique that allows an agent to learn by trial and error in an interactive environment, using feedback from its own actions and experiences. Unlike supervised learning, which is based on given sample data or examples, the RL method learns by interacting with the environment. The problem to be solved in reinforcement learning is defined as a Markov decision process (MDP) [86], i.e., it is all about making decisions sequentially. An RL problem typically includes four elements: agent, environment, rewards, and policy.

RL can be split roughly into model-based and model-free techniques. Model-based RL infers optimal behavior from a model of the environment, built by performing actions and observing the results, which include the next state and the immediate reward [85]. AlphaZero and AlphaGo [113] are examples of model-based approaches. On the other hand, a model-free approach does not use the transition probability distribution and the reward function associated with the MDP. Q-learning, deep Q-networks, Monte Carlo control, SARSA (state-action-reward-state-action), etc. are some examples of model-free algorithms [52]. The key difference between the two families is whether the agent learns or is given a model of the environment’s dynamics, which model-based RL requires and model-free RL does not. In the following, we discuss the popular RL algorithms.

Monte Carlo methods: Monte Carlo techniques, or Monte Carlo experiments, are a wide category of computational algorithms that rely on repeated random sampling to obtain numerical results [52]. The underlying concept is to use randomness to solve problems that are deterministic in principle. Optimization, numerical integration, and drawing samples from a probability distribution are the three problem classes where Monte Carlo techniques are most commonly used.

Q-learning: Q-learning is a model-free reinforcement learning algorithm that learns the quality of actions, telling an agent what action to take under what circumstances [52]. It does not need a model of the environment (hence the term “model-free”), and it can deal with stochastic transitions and rewards without requiring adaptations. The ‘Q’ in Q-learning usually stands for quality, as the algorithm calculates the maximum expected reward for a given action in a given state.
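
A minimal tabular sketch of the standard Q-learning update is given below; the state/action sizes, learning rate, and discount factor are illustrative assumptions.

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9  # learning rate and discount factor

def q_update(s, a, r, s_next):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

# Example transition: in state 0, action 1 yields reward 1.0 and lands in state 2.
q_update(s=0, a=1, r=1.0, s_next=2)
print(Q[0])
```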

Deep Q-learning: The basic working step in deep Q-learning [52] is that the current state is fed into a neural network, which returns the Q-values of all possible actions as output. Q-learning works well when the setting is reasonably simple; however, when the number of states and actions becomes large, a deep neural network can be used as a function approximator.

Reinforcement learning, along with supervised and unsupervised learning, is one of the basic machine learning paradigms. RL can be used to solve numerous real-world problems in various fields, such as game theory, control theory, operations analysis, information theory, simulation-based optimization, manufacturing, supply chain logistics, multi-agent systems, swarm intelligence, aircraft control, robot motion control, and many more.

Artificial Neural Network and Deep Learning

Deep learning is part of a wider family of artificial neural network (ANN)-based machine learning approaches with representation learning. Deep learning provides a computational architecture that combines several processing layers, such as input, hidden, and output layers, to learn from data [41]. The main advantage of deep learning over traditional machine learning methods is its better performance in several cases, particularly when learning from large datasets [105, 129]. Figure 9 shows the general performance of deep learning compared to traditional machine learning as the amount of data increases; however, this may vary depending on the data characteristics and the experimental setup.

Fig. 9 Machine learning and deep learning performance in general with the amount of data

The most common deep learning algorithms are the Multi-layer Perceptron (MLP), the Convolutional Neural Network (CNN, or ConvNet), and the Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) [96]. In the following, we discuss various types of deep learning methods that can be used to build effective data-driven models for various purposes.

Fig. 10 A structure of an artificial neural network model with multiple processing layers

MLP: The base architecture of deep learning, also known as the feed-forward artificial neural network, is the multilayer perceptron (MLP) [82]. A typical MLP is a fully connected network consisting of an input layer, one or more hidden layers, and an output layer, as shown in Fig. 10. Each node in one layer connects, with a certain weight, to every node in the following layer. MLP utilizes the “backpropagation” technique [41], the most fundamental building block of neural network training, to adjust the weight values internally while building the model. MLP is sensitive to feature scaling and offers a variety of hyperparameters to tune, such as the number of hidden layers, neurons, and iterations, which can make the model computationally costly.
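
A compact scikit-learn sketch follows; the hidden-layer sizes and the digits dataset are illustrative choices, and the features are standardized because of the scaling sensitivity noted above.

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

mlp = make_pipeline(
    StandardScaler(),  # MLP is sensitive to feature scaling
    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0),
)
mlp.fit(X, y)
print(mlp.score(X, y))
```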

CNN or ConvNet: The convolutional neural network (CNN) [65] enhances the design of the standard ANN with convolutional layers, pooling layers, and fully connected layers, as shown in Fig. 11. As it takes advantage of the two-dimensional (2D) structure of the input data, it is broadly used in areas such as image and video recognition, image processing and classification, medical image analysis, and natural language processing. Although a CNN carries a greater computational burden, it has the advantage of automatically detecting the important features without any manual intervention, and hence CNN is considered more powerful than a conventional ANN. A number of advanced deep learning models based on CNN can be used in the field, such as AlexNet [60], Xception [24], Inception [118], Visual Geometry Group (VGG) [44], ResNet [45], etc.
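
The following Keras sketch stacks convolution, pooling, and dense layers in the pattern of Fig. 11; the layer sizes and the 28x28 grayscale input shape are illustrative assumptions.

```python
import tensorflow as tf

# Convolution -> pooling blocks followed by a dense classifier head.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),  # e.g., 10 output classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```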

LSTM-RNN: Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in the area of deep learning [38]. Unlike normal feed-forward neural networks, LSTM has feedback connections. LSTM networks are well suited for analyzing and learning sequential data, such as classifying, processing, and predicting data based on time series, which differentiates them from other conventional networks. Thus, LSTM can be used when the data are in a sequential format, such as time series or sentences, and it is commonly applied in time-series analysis, natural language processing, speech recognition, etc.
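
A minimal Keras sketch of an LSTM sequence classifier follows; the sequence length, feature count, and binary output are illustrative assumptions.

```python
import tensorflow as tf

# Binary classifier over sequences of 100 timesteps with 8 features each.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(100, 8)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```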

Fig. 11 An example of a convolutional neural network (CNN or ConvNet) including multiple convolution and pooling layers

In addition to the most common deep learning methods discussed above, several other deep learning approaches [96] exist for various purposes. For instance, the self-organizing map (SOM) [58] uses unsupervised learning to represent high-dimensional data as a 2D grid map, thereby achieving dimensionality reduction. The autoencoder (AE) [15] is another learning technique widely used for dimensionality reduction as well as feature extraction in unsupervised learning tasks. Restricted Boltzmann machines (RBMs) [46] can be used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling. A deep belief network (DBN) is typically composed of simple, unsupervised networks, such as restricted Boltzmann machines (RBMs) or autoencoders, and a backpropagation neural network (BPNN) [123]. A generative adversarial network (GAN) [39] is a deep learning network that can generate data with characteristics close to the actual input data. Transfer learning, the re-use of a pre-trained model on a new problem, is currently very popular because it allows training deep neural networks with comparatively little data [124]. A brief discussion of these artificial neural network (ANN) and deep learning (DL) models is given in our earlier paper, Sarker et al. [96].

Overall, based on the learning techniques discussed above, we can conclude that the various types of machine learning techniques, such as classification analysis, regression, data clustering, feature selection and extraction, dimensionality reduction, association rule learning, reinforcement learning, and deep learning, can play a significant role for various purposes according to their capabilities. In the following section, we discuss several application areas based on machine learning algorithms.

Applications of Machine Learning

In the current age of the Fourth Industrial Revolution (4IR), machine learning has become popular in various application areas because of its capability to learn from the past and make intelligent decisions. In the following, we summarize and discuss ten popular application areas of machine learning technology.

Predictive analytics and intelligent decision-making: A major application field of machine learning is intelligent decision-making through data-driven predictive analytics [21, 70]. The basis of predictive analytics is capturing and exploiting relationships between explanatory variables and predicted variables from previous events to predict an unknown outcome [41]. Examples include identifying suspects or criminals after a crime has been committed, or detecting credit card fraud as it happens. In another application, machine learning algorithms can assist retailers in better understanding consumer preferences and behavior, managing inventory, avoiding out-of-stock situations, and optimizing logistics and warehousing in e-commerce. Various machine learning algorithms, such as decision trees, support vector machines, and artificial neural networks [106, 125], are commonly used in the area. Since accurate predictions provide insight into the unknown, they can improve the decisions of industries, businesses, and almost any organization, including government agencies, e-commerce, telecommunications, banking and financial services, healthcare, sales and marketing, transportation, social networking, and many others.

Cybersecurity and threat intelligence: Cybersecurity is one of the most essential areas of Industry 4.0 [114]; it is typically the practice of protecting networks, systems, hardware, and data from digital attacks [114]. Machine learning has become a crucial cybersecurity technology that constantly learns by analyzing data to identify patterns, better detect malware in encrypted traffic, find insider threats, predict where “bad neighborhoods” are online, keep people safe while browsing, and secure data in the cloud by uncovering suspicious activity. For instance, clustering techniques can be used to identify cyber-anomalies, policy violations, etc. Machine learning classification models that take into account the impact of security features are useful for detecting various types of cyber-attacks or intrusions [97]. Various deep learning-based security models can also be used on large-scale security datasets [96, 129]. Moreover, security policy rules generated by association rule learning techniques can play a significant role in building a rule-based security system [105]. Thus, the various learning techniques discussed in Sect. “Machine Learning Tasks and Algorithms” can enable cybersecurity professionals to be more proactive in efficiently preventing threats and cyber-attacks.

Internet of things (IoT) and smart cities: The Internet of Things (IoT) is another essential area of Industry 4.0 [114]; it turns everyday objects into smart objects by allowing them to transmit data and automate tasks without the need for human interaction. IoT is, therefore, considered the big frontier that can enhance almost all activities in our lives, such as smart governance, smart homes, education, communication, transportation, retail, agriculture, health care, business, and many more [70]. The smart city is one of IoT’s core fields of application, using technologies to enhance city services and residents’ living experiences [132, 135]. As machine learning utilizes experience to recognize trends and create models that help predict future behavior and events, it has become a crucial technology for IoT applications [103]. For example, predicting traffic in smart cities, predicting parking availability, estimating citizens’ total energy usage for a particular period, and making context-aware and timely decisions for people are some tasks that can be solved using machine learning techniques according to people’s current needs.

Traffic prediction and transportation: Transportation systems have become a crucial component of every country’s economic development. Nonetheless, several cities around the world are experiencing an excessive rise in traffic volume, resulting in serious issues such as delays, traffic congestion, higher fuel prices, increased CO\(_2\) pollution, accidents, emergencies, and a decline in modern society’s quality of life [40]. Thus, an intelligent transportation system that predicts future traffic is important and is an indispensable part of a smart city. Accurate traffic prediction based on machine and deep learning modeling can help to minimize these issues [17, 30, 31]. For example, based on travel history and trends across various routes, machine learning can assist transportation companies in predicting possible issues that may occur on specific routes and recommending that their customers take a different path. Ultimately, these learning-based data-driven models help improve traffic flow, increase the usage and efficiency of sustainable modes of transportation, and limit real-world disruption by modeling and visualizing future changes.

Healthcare and COVID-19 pandemic: Machine learning can help to solve diagnostic and prognostic problems in a variety of medical domains, such as disease prediction, medical knowledge extraction, detecting regularities in data, patient management, etc. [33, 77, 112]. Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus, according to the World Health Organization (WHO) [3]. Recently, learning techniques have become popular in the battle against COVID-19 [61, 63]. For the COVID-19 pandemic, learning techniques are used to classify patients at high risk, estimate mortality rates, and detect other anomalies [61]. They can also be used to better understand the virus’s origin, predict COVID-19 outbreaks, and support disease diagnosis and treatment [14, 50]. With the help of machine learning, researchers can forecast where and when COVID-19 is likely to spread and notify those regions to make the required arrangements. Deep learning also provides exciting solutions to the problems of medical image processing and is seen as a crucial technique for potential applications, particularly for the COVID-19 pandemic [10, 78, 111]. Overall, machine and deep learning techniques can help fight the COVID-19 pandemic and support intelligent clinical decision-making in the domain of healthcare.

E-commerce and product recommendations: Product recommendation is one of the most well known and widely used applications of machine learning, and it is one of the most prominent features of almost any e-commerce website today. Machine learning technology can assist businesses in analyzing their consumers’ purchasing histories and making customized product suggestions for their next purchase based on their behavior and preferences. E-commerce companies, for example, can easily position product suggestions and offers by analyzing browsing trends and click-through rates of specific items. Using predictive modeling based on machine learning techniques, many online retailers, such as Amazon [ 71 ], can better manage inventory, prevent out-of-stock situations, and optimize logistics and warehousing. The future of sales and marketing is the ability to capture, evaluate, and use consumer data to provide a customized shopping experience. Furthermore, machine learning techniques enable companies to create packages and content that are tailored to the needs of their customers, allowing them to maintain existing customers while attracting new ones.

NLP and sentiment analysis: Natural language processing (NLP) involves the reading and understanding of spoken or written language through the medium of a computer [79, 103]. Thus, NLP helps computers, for instance, to read text, hear speech, interpret it, analyze sentiment, and decide which aspects are significant, and machine learning techniques can be used for these tasks. Virtual personal assistants, chatbots, speech recognition, document description, and language or machine translation are some examples of NLP-related tasks. Sentiment analysis [90] (also referred to as opinion mining or emotion AI) is an NLP sub-field that seeks to identify and extract public mood and views within a given text, drawing on blogs, reviews, social media, forums, news, etc. For instance, businesses and brands use sentiment analysis to understand the social sentiment of their brand, product, or service through social media platforms or the web as a whole. Overall, sentiment analysis is considered a machine learning task that analyzes texts for polarity, such as “positive”, “negative”, or “neutral”, along with more intense emotions such as very happy, happy, sad, very sad, angry, interested, or not interested.

Image, speech and pattern recognition: Image recognition [36] is a well-known and widespread example of machine learning in the real world that can identify an object in a digital image. Labeling an x-ray as cancerous or not, character recognition, face detection in an image, and tagging suggestions on social media, e.g., Facebook, are common examples of image recognition. Speech recognition [23] is also very popular; it typically uses sound and linguistic models, e.g., Google Assistant, Cortana, Siri, Alexa, etc. [67], where machine learning methods are used. Pattern recognition [13] is defined as the automated recognition of patterns and regularities in data, e.g., image analysis. Several machine learning techniques, such as classification, feature selection, clustering, and sequence labeling methods, are used in the area.

Sustainable agriculture: Agriculture is essential to the survival of all human activities [109]. Sustainable agriculture practices help to improve agricultural productivity while also reducing negative impacts on the environment [5, 25, 109]. Sustainable agriculture supply chains are knowledge-intensive and based on information, skills, technologies, etc., where knowledge transfer encourages farmers to improve their decisions to adopt sustainable agriculture practices, utilizing the increasing amount of data captured by emerging technologies, e.g., the Internet of Things (IoT), mobile technologies and devices, etc. [5, 53, 54]. Machine learning can be applied in various phases of sustainable agriculture: in the pre-production phase, for the prediction of crop yield, soil properties, irrigation requirements, etc.; in the production phase, for weather prediction, disease detection, weed detection, soil nutrient management, livestock management, etc.; in the processing phase, for demand estimation, production planning, etc.; and in the distribution phase, for inventory management, consumer analysis, etc.

User behavior analytics and context-aware smartphone applications: Context-awareness is a system’s ability to capture knowledge about its surroundings at any moment and modify its behavior accordingly [28, 93]. Context-aware computing uses software and hardware to automatically collect and interpret data for direct responses. The mobile app development environment has changed greatly with the power of AI, particularly machine learning techniques, through their ability to learn from contextual data [103, 136]. Thus, the developers of mobile apps can rely on machine learning to create smart apps that can understand human behavior and support and entertain users [107, 137, 140]. Machine learning techniques are applicable to building various personalized, data-driven, context-aware systems, such as smart interruption management, smart mobile recommendation, context-aware smart searching, and decision-making that intelligently assists mobile phone users in a pervasive computing environment. For example, context-aware association rules can be used to build an intelligent phone call application [104]. Clustering approaches are useful for capturing users’ diverse behavioral activities by taking into account time-series data [102]. To predict future events in various contexts, classification methods can be used [106, 139]. Thus, the various learning techniques discussed in Sect. “Machine Learning Tasks and Algorithms” can help to build context-aware, adaptive, and smart applications according to the preferences of mobile phone users.

In addition to these application areas, machine learning-based models can also apply to several other domains such as bioinformatics, cheminformatics, computer networks, DNA sequence classification, economics and banking, robotics, advanced engineering, and many more.

Challenges and Research Directions

Our study on machine learning algorithms for intelligent data analysis and applications opens several research issues in the area. Thus, in this section, we summarize and discuss the challenges faced and the potential research opportunities and future directions.

In general, the effectiveness and efficiency of a machine learning-based solution depend on the nature and characteristics of the data and the performance of the learning algorithms. Collecting data in relevant domains, such as cybersecurity, IoT, healthcare, and agriculture, discussed in Sect. “Applications of Machine Learning”, is not straightforward, although the current cyberspace enables the production of a huge amount of data with very high frequency. Thus, collecting useful data for the target machine learning-based applications, e.g., smart city applications, and managing that data are important for further analysis. Therefore, a more in-depth investigation of data collection methods is needed when working with real-world data. Moreover, historical data may contain many ambiguous values, missing values, outliers, and meaningless entries. The machine learning algorithms discussed in Sect. “Machine Learning Tasks and Algorithms” are highly impacted by the quality and availability of the data used for training, and consequently so is the resultant model. Thus, accurately cleaning and pre-processing the diverse data collected from diverse sources is a challenging task. Therefore, effectively modifying or enhancing existing pre-processing methods, or proposing new data preparation techniques, is required to effectively use the learning algorithms in the associated application domain.

Many machine learning algorithms exist to analyze data and extract insights, as summarized in Sect. “Machine Learning Tasks and Algorithms”. Thus, selecting a proper learning algorithm that is suitable for the target application is challenging, because the outcome of different learning algorithms may vary depending on the data characteristics [106]. Selecting the wrong learning algorithm would produce unexpected outcomes, which may lead to a loss of effort as well as of the model’s effectiveness and accuracy. In terms of model building, the techniques discussed in Sect. “Machine Learning Tasks and Algorithms” can be used directly to solve many real-world issues in diverse domains, such as cybersecurity, smart cities, and healthcare, summarized in Sect. “Applications of Machine Learning”. However, hybrid learning models, e.g., ensembles of methods, modifications or enhancements of existing learning techniques, or the design of new learning methods, could be potential future work in the area.

Thus, the ultimate success of a machine learning-based solution and its corresponding applications mainly depends on both the data and the learning algorithms. If the data are of poor quality for learning, such as being non-representative, low-quality, containing irrelevant features, or being insufficient in quantity for training, the machine learning models may become useless or produce lower accuracy. Therefore, effectively processing the data and handling the diverse learning algorithms are important for a machine learning-based solution and, eventually, for building intelligent applications.

Conclusion

In this paper, we have conducted a comprehensive overview of machine learning algorithms for intelligent data analysis and applications. According to our goal, we have briefly discussed how various types of machine learning methods can be used to solve various real-world issues. A successful machine learning model depends on both the data and the performance of the learning algorithms. Sophisticated learning algorithms must be trained on collected real-world data and knowledge related to the target application before the system can assist with intelligent decision-making. We also discussed several popular application areas based on machine learning techniques to highlight their applicability to various real-world issues. Finally, we summarized and discussed the challenges faced and the potential research opportunities and future directions in the area. The challenges identified create promising research opportunities in the field, which must be addressed with effective solutions in various application areas. Overall, we believe that our study on machine learning-based solutions opens up a promising direction and can serve as a reference guide for potential research and applications for academia, industry professionals, and decision-makers, from a technical point of view.

References

Canadian Institute for Cybersecurity, University of New Brunswick, ISCX dataset. http://www.unb.ca/cic/datasets/index.html/ (accessed 20 October 2019).

CIC-DDoS2019 [online]. Available: https://www.unb.ca/cic/datasets/ddos-2019.html/ (accessed 28 March 2020).

World Health Organization (WHO). http://www.who.int/.

Google Trends. https://trends.google.com/trends/, 2019.

Adnan N, Nordin Shahrina Md, Rahman I, Noor A. The effects of knowledge transfer on farmers decision making toward sustainable agriculture practices. World J Sci Technol Sustain Dev. 2018.

Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the 1998 ACM SIGMOD international conference on Management of data. 1998; 94–105

Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. In: ACM SIGMOD Record. ACM. 1993;22: 207–216

Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Fast algorithms for mining association rules. In: Proceedings of the International Joint Conference on Very Large Data Bases, Santiago Chile. 1994; 1215: 487–499.

Aha DW, Kibler D, Albert M. Instance-based learning algorithms. Mach Learn. 1991;6(1):37–66.

Alakus TB, Turkoglu I. Comparison of deep learning approaches to predict covid-19 infection. Chaos Solit Fract. 2020;140.

Amit Y, Geman D. Shape quantization and recognition with randomized trees. Neural Comput. 1997;9(7):1545–88.

Ankerst M, Breunig MM, Kriegel H-P, Sander J. Optics: ordering points to identify the clustering structure. ACM Sigmod Record. 1999;28(2):49–60.

Anzai Y. Pattern recognition and machine learning. Elsevier; 2012.

Ardabili SF, Mosavi A, Ghamisi P, Ferdinand F, Varkonyi-Koczy AR, Reuter U, Rabczuk T, Atkinson PM. Covid-19 outbreak prediction with machine learning. Algorithms. 2020;13(10):249.

Baldi P. Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of ICML workshop on unsupervised and transfer learning. 2012; 37–49.

Balducci F, Impedovo D, Pirlo G. Machine learning applications on agricultural datasets for smart farm enhancement. Machines. 2018;6(3):38.

Boukerche A, Wang J. Machine learning-based traffic prediction models for intelligent transportation systems. Comput Netw. 2020;181.

Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123–40.

Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees. CRC Press; 1984.

Cao L. Data science: a comprehensive overview. ACM Comput Surv (CSUR). 2017;50(3):43.

Carpenter GA, Grossberg S. A massively parallel architecture for a self-organizing neural pattern recognition machine. Comput Vis Graph Image Process. 1987;37(1):54–115.

Chiu C-C, Sainath TN, Wu Y, Prabhavalkar R, Nguyen P, Chen Z, Kannan A, Weiss RJ, Rao K, Gonina E, et al. State-of-the-art speech recognition with sequence-to-sequence models. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE. 2018; 4774–4778.

Chollet F. Xception: deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.

Cobuloglu H, Büyüktahtakın IE. A stochastic multi-criteria decision analysis for sustainable biomass crop selection. Expert Syst Appl. 2015;42(15–16):6065–74.

Das A, Ng W-K, Woon Y-K. Rapid association rule mining. In: Proceedings of the tenth international conference on Information and knowledge management, pages 474–481. ACM, 2001.

de Amorim RC. Constrained clustering with minkowski weighted k-means. In: 2012 IEEE 13th International Symposium on Computational Intelligence and Informatics (CINTI), pages 13–17. IEEE, 2012.

Dey AK. Understanding and using context. Person Ubiquit Comput. 2001;5(1):4–7.

Eagle N, Pentland AS. Reality mining: sensing complex social systems. Person Ubiquit Comput. 2006;10(4):255–68.

Essien A, Petrounias I, Sampaio P, Sampaio S. Improving urban traffic speed prediction using data source fusion and deep learning. In: 2019 IEEE International Conference on Big Data and Smart Computing (BigComp), IEEE. 2019; 1–8.

Essien A, Petrounias I, Sampaio P, Sampaio S. A deep-learning model for urban traffic flow prediction with traffic events mined from twitter. In: World Wide Web. 2020; 1–24.

Ester M, Kriegel H-P, Sander J, Xiaowei X, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd. 1996;96:226–31.

Fatima M, Pasha M, et al. Survey of machine learning algorithms for disease diagnostic. J Intell Learn Syst Appl. 2017;9(01):1.

Flach PA, Lachiche N. Confirmation-guided discovery of first-order rules with tertius. Mach Learn. 2001;42(1–2):61–95.

Freund Y, Schapire RE, et al. Experiments with a new boosting algorithm. In: Icml, Citeseer. 1996; 96: 148–156

Fujiyoshi H, Hirakawa T, Yamashita T. Deep learning-based image recognition for autonomous driving. IATSS Res. 2019;43(4):244–52.

Fukunaga K, Hostetler L. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans Inform Theory. 1975;21(1):32–40.

Goodfellow I, Bengio Y, Courville A, Bengio Y. Deep learning. Cambridge: MIT Press; 2016.

Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: Advances in neural information processing systems. 2014: 2672–2680.

Guerrero-Ibáñez J, Zeadally S, Contreras-Castillo J. Sensor technologies for intelligent transportation systems. Sensors. 2018;18(4):1212.

Han J, Pei J, Kamber M. Data mining: concepts and techniques. Amsterdam: Elsevier; 2011.

Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. In: ACM Sigmod Record, ACM. 2000;29: 1–12.

Harmon SA, Sanford TH, Sheng X, Turkbey EB, Roth H, Ziyue X, Yang D, Myronenko A, Anderson V, Amalou A, et al. Artificial intelligence for the detection of covid-19 pneumonia on chest ct using multinational datasets. Nat Commun. 2020;11(1):1–7.

He K, Zhang X, Ren S, Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell. 2015;37(9):1904–16.

He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016: 770–778.

Hinton GE. A practical guide to training restricted boltzmann machines. In: Neural networks: Tricks of the trade. Springer. 2012; 599-619

Holte RC. Very simple classification rules perform well on most commonly used datasets. Mach Learn. 1993;11(1):63–90.

Hotelling H. Analysis of a complex of statistical variables into principal components. J Edu Psychol. 1933;24(6):417.

Houtsma M, Swami A. Set-oriented mining for association rules in relational databases. In: Proceedings of the Eleventh International Conference on Data Engineering, IEEE. 1995; 25–33.

Jamshidi M, Lalbakhsh A, Talla J, Peroutka Z, Hadjilooei F, Lalbakhsh P, Jamshidi M, La Spada L, Mirmozafari M, Dehghani M, et al. Artificial intelligence and covid-19: deep learning approaches for diagnosis and treatment. IEEE Access. 2020;8:109581–95.

John GH, Langley P. Estimating continuous distributions in bayesian classifiers. In: Proceedings of the Eleventh conference on Uncertainty in artificial intelligence, Morgan Kaufmann Publishers Inc. 1995; 338–345

Kaelbling LP, Littman ML, Moore AW. Reinforcement learning: a survey. J Artif Intell Res. 1996;4:237–85.

Kamble SS, Gunasekaran A, Gawankar SA. Sustainable industry 4.0 framework: a systematic literature review identifying the current trends and future perspectives. Process Saf Environ Protect. 2018;117:408–25.

Kamble SS, Gunasekaran A, Gawankar SA. Achieving sustainable performance in a data-driven agriculture supply chain: a review for research and applications. Int J Prod Econ. 2020;219:179–94.

Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis, vol. 344. John Wiley & Sons; 2009.

Keerthi SS, Shevade SK, Bhattacharyya C, Radha Krishna MK. Improvements to platt’s smo algorithm for svm classifier design. Neural Comput. 2001;13(3):637–49.

Khadse V, Mahalle PN, Biraris SV. An empirical comparison of supervised machine learning algorithms for internet of things data. In: 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), IEEE. 2018; 1–6

Kohonen T. The self-organizing map. Proc IEEE. 1990;78(9):1464–80.

Koroniotis N, Moustafa N, Sitnikova E, Turnbull B. Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: bot-iot dataset. Fut Gen Comput Syst. 2019;100:779–96.

Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, 2012: 1097–1105

Kushwaha S, Bahl S, Bagha AK, Parmar KS, Javaid M, Haleem A, Singh RP. Significant applications of machine learning for covid-19 pandemic. J Ind Integr Manag. 2020;5(4).

Lade P, Ghosh R, Srinivasan S. Manufacturing analytics and industrial internet of things. IEEE Intell Syst. 2017;32(3):74–9.

Lalmuanawma S, Hussain J, Chhakchhuak L. Applications of machine learning and artificial intelligence for covid-19 (sars-cov-2) pandemic: a review. Chaos Sol Fract. 2020:110059 .

LeCessie S, Van Houwelingen JC. Ridge estimators in logistic regression. J R Stat Soc Ser C (Appl Stat). 1992;41(1):191–201.

LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–324.

Liu H, Motoda H. Feature extraction, construction and selection: A data mining perspective, vol. 453. Springer Science & Business Media; 1998.

López G, Quesada L, Guerrero LA. Alexa vs. siri vs. cortana vs. google assistant: a comparison of speech-based natural user interfaces. In: International Conference on Applied Human Factors and Ergonomics, Springer. 2017; 241–250.

Liu B, HsuW, Ma Y. Integrating classification and association rule mining. In: Proceedings of the fourth international conference on knowledge discovery and data mining, 1998.

MacQueen J, et al. Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, 1967;volume 1, pages 281–297. Oakland, CA, USA.

Mahdavinejad MS, Rezvan M, Barekatain M, Adibi P, Barnaghi P, Sheth AP. Machine learning for internet of things data analysis: a survey. Digit Commun Netw. 2018;4(3):161–75.

Marchand A, Marx P. Automated product recommendations with preference-based explanations. J Retail. 2020;96(3):328–43.

McCallum A. Information extraction: distilling structured data from unstructured text. Queue. 2005;3(9):48–57.

Mehrotra A, Hendley R, Musolesi M. Prefminer: mining user’s preferences for intelligent mobile notification management. In: Proceedings of the International Joint Conference on Pervasive and Ubiquitous Computing, Heidelberg, Germany, 12–16 September, 2016; pp. 1223–1234. ACM, New York, USA. .

Mohamadou Y, Halidou A, Kapen PT. A review of mathematical modeling, artificial intelligence and datasets used in the study, prediction and management of covid-19. Appl Intell. 2020;50(11):3913–25.

Mohammed M, Khan MB, Bashier Mohammed BE. Machine learning: algorithms and applications. CRC Press; 2016.

Book   Google Scholar  

Moustafa N, Slay J. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In: 2015 military communications and information systems conference (MilCIS), 2015;pages 1–6. IEEE .

Nilashi M, Ibrahim OB, Ahmadi H, Shahmoradi L. An analytical method for diseases prediction using machine learning techniques. Comput Chem Eng. 2017;106:212–23.

Yujin O, Park S, Ye JC. Deep learning covid-19 features on cxr using limited training data sets. IEEE Trans Med Imaging. 2020;39(8):2688–700.

Otter DW, Medina JR , Kalita JK. A survey of the usages of deep learning for natural language processing. IEEE Trans Neural Netw Learn Syst. 2020.

Park H-S, Jun C-H. A simple and fast algorithm for k-medoids clustering. Expert Syst Appl. 2009;36(2):3336–41.

Liii Pearson K. on lines and planes of closest fit to systems of points in space. Lond Edinb Dublin Philos Mag J Sci. 1901;2(11):559–72.

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825–30.

MathSciNet   MATH   Google Scholar  

Perveen S, Shahbaz M, Keshavjee K, Guergachi A. Metabolic syndrome and development of diabetes mellitus: predictive modeling based on machine learning techniques. IEEE Access. 2018;7:1365–75.

Santi P, Ram D, Rob C, Nathan E. Behavior-based adaptive call predictor. ACM Trans Auton Adapt Syst. 2011;6(3):21:1–21:28.

Polydoros AS, Nalpantidis L. Survey of model-based reinforcement learning: applications on robotics. J Intell Robot Syst. 2017;86(2):153–73.

Puterman ML. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons; 2014.

Quinlan JR. Induction of decision trees. Mach Learn. 1986;1:81–106.

Quinlan JR. C4.5: programs for machine learning. Mach Learn. 1993.

Rasmussen C. The infinite gaussian mixture model. Adv Neural Inform Process Syst. 1999;12:554–60.

Ravi K, Ravi V. A survey on opinion mining and sentiment analysis: tasks, approaches and applications. Knowl Syst. 2015;89:14–46.

Rokach L. A survey of clustering algorithms. In: Data mining and knowledge discovery handbook, pages 269–298. Springer, 2010.

Safdar S, Zafar S, Zafar N, Khan NF. Machine learning based decision support systems (dss) for heart disease diagnosis: a review. Artif Intell Rev. 2018;50(4):597–623.

Sarker IH. Context-aware rule learning from smartphone data: survey, challenges and future directions. J Big Data. 2019;6(1):1–25.

Sarker IH. A machine learning based robust prediction model for real-life mobile phone data. Internet Things. 2019;5:180–93.

Sarker IH. Ai-driven cybersecurity: an overview, security intelligence modeling and research directions. SN Comput Sci. 2021.

Sarker IH. Deep cybersecurity: a comprehensive overview from neural network and deep learning perspective. SN Comput Sci. 2021.

Sarker IH, Abushark YB, Alsolami F, Khan A. Intrudtree: a machine learning based cyber security intrusion detection model. Symmetry. 2020;12(5):754.

Sarker IH, Abushark YB, Khan A. Contextpca: predicting context-aware smartphone apps usage based on machine learning techniques. Symmetry. 2020;12(4):499.

Sarker IH, Alqahtani H, Alsolami F, Khan A, Abushark YB, Siddiqui MK. Context pre-modeling: an empirical analysis for classification based user-centric context-aware predictive modeling. J Big Data. 2020;7(1):1–23.

Sarker IH, Alan C, Jun H, Khan AI, Abushark YB, Khaled S. Behavdt: a behavioral decision tree learning to build user-centric context-aware predictive model. Mob Netw Appl. 2019; 1–11.

Sarker IH, Colman A, Kabir MA, Han J. Phone call log as a context source to modeling individual user behavior. In: Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing (Ubicomp): Adjunct, Germany, pages 630–634. ACM, 2016.

Sarker IH, Colman A, Kabir MA, Han J. Individualized time-series segmentation for mining mobile phone user behavior. Comput J Oxf Univ UK. 2018;61(3):349–68.

Sarker IH, Hoque MM, MdK Uddin, Tawfeeq A. Mobile data science and intelligent apps: concepts, ai-based modeling and research directions. Mob Netw Appl, pages 1–19, 2020.

Sarker IH, Kayes ASM. Abc-ruleminer: user behavioral rule-based machine learning method for context-aware intelligent services. J Netw Comput Appl. 2020; page 102762

Sarker IH, Kayes ASM, Badsha S, Alqahtani H, Watters P, Ng A. Cybersecurity data science: an overview from machine learning perspective. J Big Data. 2020;7(1):1–29.

Sarker IH, Watters P, Kayes ASM. Effectiveness analysis of machine learning classification models for predicting personalized context-aware smartphone usage. J Big Data. 2019;6(1):1–28.

Sarker IH, Salah K. Appspred: predicting context-aware smartphone apps using random forest learning. Internet Things. 2019;8:

Scheffer T. Finding association rules that trade support optimally against confidence. Intell Data Anal. 2005;9(4):381–95.

Sharma R, Kamble SS, Gunasekaran A, Kumar V, Kumar A. A systematic literature review on machine learning applications for sustainable agriculture supply chain performance. Comput Oper Res. 2020;119:

Shengli S, Ling CX. Hybrid cost-sensitive decision tree, knowledge discovery in databases. In: PKDD 2005, Proceedings of 9th European Conference on Principles and Practice of Knowledge Discovery in Databases. Lecture Notes in Computer Science, volume 3721, 2005.

Shorten C, Khoshgoftaar TM, Furht B. Deep learning applications for covid-19. J Big Data. 2021;8(1):1–54.

Gökhan S, Nevin Y. Data analysis in health and big data: a machine learning medical diagnosis model based on patients’ complaints. Commun Stat Theory Methods. 2019;1–10

Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, et al. Mastering the game of go with deep neural networks and tree search. nature. 2016;529(7587):484–9.

Ślusarczyk B. Industry 4.0: Are we ready? Polish J Manag Stud. 17, 2018.

Sneath Peter HA. The application of computers to taxonomy. J Gen Microbiol. 1957;17(1).

Sorensen T. Method of establishing groups of equal amplitude in plant sociology based on similarity of species. Biol Skr. 1948; 5.

Srinivasan V, Moghaddam S, Mukherji A. Mobileminer: mining your frequent patterns on your phone. In: Proceedings of the International Joint Conference on Pervasive and Ubiquitous Computing, Seattle, WA, USA, 13-17 September, pp. 389–400. ACM, New York, USA. 2014.

Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015; pages 1–9.

Tavallaee M, Bagheri E, Lu W, Ghorbani AA. A detailed analysis of the kdd cup 99 data set. In. IEEE symposium on computational intelligence for security and defense applications. IEEE. 2009;2009:1–6.

Tsagkias M. Tracy HK, Surya K, Vanessa M, de Rijke M. Challenges and research opportunities in ecommerce search and recommendations. In: ACM SIGIR Forum. volume 54. NY, USA: ACM New York; 2021. p. 1–23.

Wagstaff K, Cardie C, Rogers S, Schrödl S, et al. Constrained k-means clustering with background knowledge. Icml. 2001;1:577–84.

Wang W, Yang J, Muntz R, et al. Sting: a statistical information grid approach to spatial data mining. VLDB. 1997;97:186–95.

Wei P, Li Y, Zhang Z, Tao H, Li Z, Liu D. An optimization method for intrusion detection classification model based on deep belief network. IEEE Access. 2019;7:87593–605.

Weiss K, Khoshgoftaar TM, Wang DD. A survey of transfer learning. J Big data. 2016;3(1):9.

Witten IH, Frank E. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann; 2005.

Witten IH, Frank E, Trigg LE, Hall MA, Holmes G, Cunningham SJ. Weka: practical machine learning tools and techniques with java implementations. 1999.

Wu C-C, Yen-Liang C, Yi-Hung L, Xiang-Yu Y. Decision tree induction with a constrained number of leaf nodes. Appl Intell. 2016;45(3):673–85.

Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY, et al. Top 10 algorithms in data mining. Knowl Inform Syst. 2008;14(1):1–37.

Xin Y, Kong L, Liu Z, Chen Y, Li Y, Zhu H, Gao M, Hou H, Wang C. Machine learning and deep learning methods for cybersecurity. IEEE Access. 2018;6:35365–81.

Xu D, Yingjie T. A comprehensive survey of clustering algorithms. Ann Data Sci. 2015;2(2):165–93.

Zaki MJ. Scalable algorithms for association mining. IEEE Trans Knowl Data Eng. 2000;12(3):372–90.

Zanella A, Bui N, Castellani A, Vangelista L, Zorzi M. Internet of things for smart cities. IEEE Internet Things J. 2014;1(1):22–32.

Zhao Q, Bhowmick SS. Association rule mining: a survey. Singapore: Nanyang Technological University; 2003.

Zheng T, Xie W, Xu L, He X, Zhang Y, You M, Yang G, Chen Y. A machine learning-based framework to identify type 2 diabetes through electronic health records. Int J Med Inform. 2017;97:120–7.

Zheng Y, Rajasegarar S, Leckie C. Parking availability prediction for sensor-enabled car parks in smart cities. In: Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP), 2015 IEEE Tenth International Conference on. IEEE, 2015; pages 1–6.

Zhu H, Cao H, Chen E, Xiong H, Tian J. Exploiting enriched contextual information for mobile app classification. In: Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, 2012; pages 1617–1621

Zhu H, Chen E, Xiong H, Kuifei Y, Cao H, Tian J. Mining mobile user preferences for personalized context-aware recommendation. ACM Trans Intell Syst Technol (TIST). 2014;5(4):58.

Zikang H, Yong Y, Guofeng Y, Xinyu Z. Sentiment analysis of agricultural product ecommerce review data based on deep learning. In: 2020 International Conference on Internet of Things and Intelligent Applications (ITIA), IEEE, 2020; pages 1–7

Zulkernain S, Madiraju P, Ahamed SI. A context aware interruption management system for mobile devices. In: Mobile Wireless Middleware, Operating Systems, and Applications. Springer. 2010; pages 221–234

Zulkernain S, Madiraju P, Ahamed S, Stamm K. A mobile intelligent interruption management system. J UCS. 2010;16(15):2060–80.


Author information

Authors and Affiliations

Swinburne University of Technology, Melbourne, VIC, 3122, Australia

Department of Computer Science and Engineering, Chittagong University of Engineering & Technology, 4349, Chattogram, Bangladesh


Corresponding author

Correspondence to Iqbal H. Sarker.

Ethics declarations

Conflict of Interest

The author declares no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K N and M. Shivakumar.


About this article

Sarker, I.H. Machine Learning: Algorithms, Real-World Applications and Research Directions. SN Computer Science 2, 160 (2021). https://doi.org/10.1007/s42979-021-00592-x


Received: 27 January 2021

Accepted: 12 March 2021

Published: 22 March 2021

DOI: https://doi.org/10.1007/s42979-021-00592-x


Keywords

  • Machine learning
  • Deep learning
  • Artificial intelligence
  • Data science
  • Data-driven decision-making
  • Predictive analytics
  • Intelligent applications

Trending Research

AutoCodeRover: Autonomous Program Improvement

nus-apr/auto-code-rover • 8 Apr 2024

Recent progress in Large Language Models (LLMs) has significantly impacted the development process, where developers can use LLM-based programming assistants to achieve automated coding.

InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models


We present InstantMesh, a feed-forward framework for instant 3D mesh generation from a single image, featuring state-of-the-art generation quality and significant training scalability.

MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators

Recent advances in Text-to-Video generation (T2V) have achieved remarkable success in synthesizing high-quality general videos from textual descriptions.

Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models

stanford-oval/storm • 22 Feb 2024

We study how to apply large language models to write grounded and organized long-form articles from scratch, with comparable breadth and depth to Wikipedia pages.

Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged.


InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation

Tuning-free diffusion-based models have demonstrated significant potential in the realm of image personalization and customization.


AIOS: LLM Agent Operating System

agiresearch/aios • 25 Mar 2024

Inspired by these challenges, this paper presents AIOS, an LLM agent operating system, which embeds large language model into operating systems (OS) as the brain of the OS, enabling an operating system "with soul" -- an important step towards AGI.


Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction".
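To make the ordering concrete, here is a small toy sketch (invented for illustration, assuming numpy; the block-averaging "tokenizer" and the 1, 2, 4 scale schedule are stand-ins, not the paper's implementation) contrasting raster-scan next-token targets with coarse-to-fine next-scale targets:

```python
import numpy as np

def raster_scan_sequence(token_map):
    """Standard next-token order: flatten the 2D token map row by row."""
    return token_map.flatten()

def next_scale_sequence(token_map, scales=(1, 2, 4)):
    """Coarse-to-fine next-scale order: concatenate progressively finer
    versions of the map. Block averaging stands in for the paper's
    multi-scale tokenizer (an assumption made for this toy)."""
    side = token_map.shape[0]
    parts = []
    for s in scales:
        block = side // s
        coarse = token_map.reshape(s, block, s, block).mean(axis=(1, 3))
        parts.append(coarse.flatten())
    return np.concatenate(parts)

tokens = np.arange(16, dtype=float).reshape(4, 4)
print(raster_scan_sequence(tokens))  # 16 targets, row-major
print(next_scale_sequence(tokens))   # 1 + 4 + 16 = 21 targets, coarse to fine
```

Under the next-scale ordering, the model predicts each whole next-resolution map conditioned on all coarser maps, rather than one token at a time.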


Rho-1: Not All Tokens Are What You Need

microsoft/rho • 11 Apr 2024

After fine-tuning, Rho-1-1B and 7B achieved state-of-the-art results of 40.6% and 51.8% on the MATH dataset, respectively - matching DeepSeekMath with only 3% of the pretraining tokens.

LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders

We outperform encoder-only models by a large margin on word-level tasks and reach a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB).


Generative MRI

Robust MRI Using New Algorithms for Generative Machine Learning

Magnetic Resonance Imaging (MRI) is currently expensive and time consuming. Exam times can take upwards of 30 minutes to an hour. As a result, patient motion...

Research Projects

  • Exploiting Shared Representations for Personalized Federated Learning
  • Learning under Ambiguity with Different Amounts of Annotation
  • Optimal Transport in Large-Scale Machine Learning Applications
  • STEADY: Simultaneous State Estimation and Dynamics Learning from Indirect Observations
  • SCALP: Supervised Contrastive Learning for Cardiopulmonary Disease Classification and Localization in Chest X-rays Using Patient Metadata
  • Knowledge-Augmented Contrastive Learning for Abnormality Classification and Localization in Chest X-rays with Radiomics Using a Feedback Loop
  • Multitasking Models Are Robust to Structural Failure: A Neural Model for Bilingual Cognitive Reserve
  • Fairness in Imaging with Deep Learning
  • Fruit-Fly Inspired Neighbor Encoding for Classification
  • A Universal Law of Robustness via Isoperimetry
  • Intermediate Layer Optimization for Inverse Problems Using Deep Generative Models




Journal of Machine Learning Research

The Journal of Machine Learning Research (JMLR), established in 2000, provides an international forum for the electronic and paper publication of high-quality scholarly articles in all areas of machine learning. All published papers are freely available online.

  • 2024.02.18: Volume 24 completed; Volume 25 began.
  • 2023.01.20: Volume 23 completed; Volume 24 began.
  • 2022.07.20: New special issue on climate change.
  • 2022.02.18: New blog post: Retrospectives from 20 Years of JMLR.
  • 2022.01.25: Volume 22 completed; Volume 23 began.
  • 2021.12.02: Message from outgoing co-EiC Bernhard Schölkopf.
  • 2021.02.10: Volume 21 completed; Volume 22 began.
  • More news ...

Latest papers

More PAC-Bayes bounds: From bounded losses, to losses with general tail behaviors, to anytime validity Borja Rodríguez-Gálvez, Ragnar Thobaben, Mikael Skoglund , 2024. [ abs ][ pdf ][ bib ]

Neural Hilbert Ladders: Multi-Layer Neural Networks in Function Space Zhengdao Chen , 2024. [ abs ][ pdf ][ bib ]

QDax: A Library for Quality-Diversity and Population-based Algorithms with Hardware Acceleration Felix Chalumeau, Bryan Lim, Raphaël Boige, Maxime Allard, Luca Grillotti, Manon Flageat, Valentin Macé, Guillaume Richard, Arthur Flajolet, Thomas Pierrot, Antoine Cully , 2024. (Machine Learning Open Source Software Paper) [ abs ][ pdf ][ bib ]      [ code ]

Random Forest Weighted Local Fréchet Regression with Random Objects Rui Qiu, Zhou Yu, Ruoqing Zhu , 2024. [ abs ][ pdf ][ bib ]      [ code ]

PhAST: Physics-Aware, Scalable, and Task-Specific GNNs for Accelerated Catalyst Design Alexandre Duval, Victor Schmidt, Santiago Miret, Yoshua Bengio, Alex Hernández-García, David Rolnick , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Unsupervised Anomaly Detection Algorithms on Real-world Data: How Many Do We Need? Roel Bouman, Zaharah Bukhsh, Tom Heskes , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Multi-class Probabilistic Bounds for Majority Vote Classifiers with Partially Labeled Data Vasilii Feofanov, Emilie Devijver, Massih-Reza Amini , 2024. [ abs ][ pdf ][ bib ]

Information Processing Equalities and the Information–Risk Bridge Robert C. Williamson, Zac Cranko , 2024. [ abs ][ pdf ][ bib ]

Nonparametric Regression for 3D Point Cloud Learning Xinyi Li, Shan Yu, Yueying Wang, Guannan Wang, Li Wang, Ming-Jun Lai , 2024. [ abs ][ pdf ][ bib ]      [ code ]

AMLB: an AutoML Benchmark Pieter Gijsbers, Marcos L. P. Bueno, Stefan Coors, Erin LeDell, Sébastien Poirier, Janek Thomas, Bernd Bischl, Joaquin Vanschoren , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Materials Discovery using Max K-Armed Bandit Nobuaki Kikkawa, Hiroshi Ohno , 2024. [ abs ][ pdf ][ bib ]

Semi-supervised Inference for Block-wise Missing Data without Imputation Shanshan Song, Yuanyuan Lin, Yong Zhou , 2024. [ abs ][ pdf ][ bib ]

Adaptivity and Non-stationarity: Problem-dependent Dynamic Regret for Online Convex Optimization Peng Zhao, Yu-Jie Zhang, Lijun Zhang, Zhi-Hua Zhou , 2024. [ abs ][ pdf ][ bib ]

Scaling Speech Technology to 1,000+ Languages Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli , 2024. [ abs ][ pdf ][ bib ]      [ code ]

MAP- and MLE-Based Teaching Hans Ulrich Simon, Jan Arne Telle , 2024. [ abs ][ pdf ][ bib ]

A General Framework for the Analysis of Kernel-based Tests Tamara Fernández, Nicolás Rivera , 2024. [ abs ][ pdf ][ bib ]

Overparametrized Multi-layer Neural Networks: Uniform Concentration of Neural Tangent Kernel and Convergence of Stochastic Gradient Descent Jiaming Xu, Hanjing Zhu , 2024. [ abs ][ pdf ][ bib ]

Sparse Representer Theorems for Learning in Reproducing Kernel Banach Spaces Rui Wang, Yuesheng Xu, Mingsong Yan , 2024. [ abs ][ pdf ][ bib ]

Exploration of the Search Space of Gaussian Graphical Models for Paired Data Alberto Roverato, Dung Ngoc Nguyen , 2024. [ abs ][ pdf ][ bib ]

The good, the bad and the ugly sides of data augmentation: An implicit spectral regularization perspective Chi-Heng Lin, Chiraag Kaushik, Eva L. Dyer, Vidya Muthukumar , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Stochastic Approximation with Decision-Dependent Distributions: Asymptotic Normality and Optimality Joshua Cutler, Mateo Díaz, Dmitriy Drusvyatskiy , 2024. [ abs ][ pdf ][ bib ]

Minimax Rates for High-Dimensional Random Tessellation Forests Eliza O'Reilly, Ngoc Mai Tran , 2024. [ abs ][ pdf ][ bib ]

Nonparametric Estimation of Non-Crossing Quantile Regression Process with Deep ReQU Neural Networks Guohao Shen, Yuling Jiao, Yuanyuan Lin, Joel L. Horowitz, Jian Huang , 2024. [ abs ][ pdf ][ bib ]

Spatial meshing for general Bayesian multivariate models Michele Peruzzi, David B. Dunson , 2024. [ abs ][ pdf ][ bib ]      [ code ]

A Semi-parametric Estimation of Personalized Dose-response Function Using Instrumental Variables Wei Luo, Yeying Zhu, Xuekui Zhang, Lin Lin , 2024. [ abs ][ pdf ][ bib ]

Learning Non-Gaussian Graphical Models via Hessian Scores and Triangular Transport Ricardo Baptista, Youssef Marzouk, Rebecca Morrison, Olivier Zahm , 2024. [ abs ][ pdf ][ bib ]      [ code ]

On the Learnability of Out-of-distribution Detection Zhen Fang, Yixuan Li, Feng Liu, Bo Han, Jie Lu , 2024. [ abs ][ pdf ][ bib ]

Win: Weight-Decay-Integrated Nesterov Acceleration for Faster Network Training Pan Zhou, Xingyu Xie, Zhouchen Lin, Kim-Chuan Toh, Shuicheng Yan , 2024. [ abs ][ pdf ][ bib ]      [ code ]

On the Eigenvalue Decay Rates of a Class of Neural-Network Related Kernel Functions Defined on General Domains Yicheng Li, Zixiong Yu, Guhan Chen, Qian Lin , 2024. [ abs ][ pdf ][ bib ]

Tight Convergence Rate Bounds for Optimization Under Power Law Spectral Conditions Maksim Velikanov, Dmitry Yarotsky , 2024. [ abs ][ pdf ][ bib ]

ptwt - The PyTorch Wavelet Toolbox Moritz Wolter, Felix Blanke, Jochen Garcke, Charles Tapley Hoyt , 2024. (Machine Learning Open Source Software Paper) [ abs ][ pdf ][ bib ]      [ code ]

Choosing the Number of Topics in LDA Models – A Monte Carlo Comparison of Selection Criteria Victor Bystrov, Viktoriia Naboka-Krell, Anna Staszewska-Bystrova, Peter Winker , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Functional Directed Acyclic Graphs Kuang-Yao Lee, Lexin Li, Bing Li , 2024. [ abs ][ pdf ][ bib ]

Unlabeled Principal Component Analysis and Matrix Completion Yunzhen Yao, Liangzu Peng, Manolis C. Tsakiris , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Distributed Estimation on Semi-Supervised Generalized Linear Model Jiyuan Tu, Weidong Liu, Xiaojun Mao , 2024. [ abs ][ pdf ][ bib ]

Towards Explainable Evaluation Metrics for Machine Translation Christoph Leiter, Piyawat Lertvittayakumjorn, Marina Fomicheva, Wei Zhao, Yang Gao, Steffen Eger , 2024. [ abs ][ pdf ][ bib ]

Differentially private methods for managing model uncertainty in linear regression Víctor Peña, Andrés F. Barrientos , 2024. [ abs ][ pdf ][ bib ]

Data Summarization via Bilevel Optimization Zalán Borsos, Mojmír Mutný, Marco Tagliasacchi, Andreas Krause , 2024. [ abs ][ pdf ][ bib ]

Pareto Smoothed Importance Sampling Aki Vehtari, Daniel Simpson, Andrew Gelman, Yuling Yao, Jonah Gabry , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Policy Gradient Methods in the Presence of Symmetries and State Abstractions Prakash Panangaden, Sahand Rezaei-Shoshtari, Rosie Zhao, David Meger, Doina Precup , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Scaling Instruction-Finetuned Language Models Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, Jason Wei , 2024. [ abs ][ pdf ][ bib ]

Tangential Wasserstein Projections Florian Gunsilius, Meng Hsuan Hsieh, Myung Jin Lee , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Learnability of Linear Port-Hamiltonian Systems Juan-Pablo Ortega, Daiying Yin , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Off-Policy Action Anticipation in Multi-Agent Reinforcement Learning Ariyan Bighashdel, Daan de Geus, Pavol Jancura, Gijs Dubbelman , 2024. [ abs ][ pdf ][ bib ]      [ code ]

On Unbiased Estimation for Partially Observed Diffusions Jeremy Heng, Jeremie Houssineau, Ajay Jasra , 2024. (Machine Learning Open Source Software Paper) [ abs ][ pdf ][ bib ]      [ code ]

Improving Lipschitz-Constrained Neural Networks by Learning Activation Functions Stanislas Ducotterd, Alexis Goujon, Pakshal Bohra, Dimitris Perdios, Sebastian Neumayer, Michael Unser , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Mathematical Framework for Online Social Media Auditing Wasim Huleihel, Yehonathan Refael , 2024. [ abs ][ pdf ][ bib ]

An Embedding Framework for the Design and Analysis of Consistent Polyhedral Surrogates Jessie Finocchiaro, Rafael M. Frongillo, Bo Waggoner , 2024. [ abs ][ pdf ][ bib ]

Low-rank Variational Bayes correction to the Laplace method Janet van Niekerk, Haavard Rue , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Scaling the Convex Barrier with Sparse Dual Algorithms Alessandro De Palma, Harkirat Singh Behl, Rudy Bunel, Philip H.S. Torr, M. Pawan Kumar , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Causal-learn: Causal Discovery in Python Yujia Zheng, Biwei Huang, Wei Chen, Joseph Ramsey, Mingming Gong, Ruichu Cai, Shohei Shimizu, Peter Spirtes, Kun Zhang , 2024. (Machine Learning Open Source Software Paper) [ abs ][ pdf ][ bib ]      [ code ]

Decomposed Linear Dynamical Systems (dLDS) for learning the latent components of neural dynamics Noga Mudrik, Yenho Chen, Eva Yezerets, Christopher J. Rozell, Adam S. Charles , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Existence and Minimax Theorems for Adversarial Surrogate Risks in Binary Classification Natalie S. Frank, Jonathan Niles-Weed , 2024. [ abs ][ pdf ][ bib ]

Data Thinning for Convolution-Closed Distributions Anna Neufeld, Ameer Dharamshi, Lucy L. Gao, Daniela Witten , 2024. [ abs ][ pdf ][ bib ]      [ code ]

A projected semismooth Newton method for a class of nonconvex composite programs with strong prox-regularity Jiang Hu, Kangkang Deng, Jiayuan Wu, Quanzheng Li , 2024. [ abs ][ pdf ][ bib ]

Revisiting RIP Guarantees for Sketching Operators on Mixture Models Ayoub Belhadji, Rémi Gribonval , 2024. [ abs ][ pdf ][ bib ]

Monotonic Risk Relationships under Distribution Shifts for Regularized Risk Minimization Daniel LeJeune, Jiayu Liu, Reinhard Heckel , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Polygonal Unadjusted Langevin Algorithms: Creating stable and efficient adaptive algorithms for neural networks Dong-Young Lim, Sotirios Sabanis , 2024. [ abs ][ pdf ][ bib ]

Axiomatic effect propagation in structural causal models Raghav Singal, George Michailidis , 2024. [ abs ][ pdf ][ bib ]

Optimal First-Order Algorithms as a Function of Inequalities Chanwoo Park, Ernest K. Ryu , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Resource-Efficient Neural Networks for Embedded Systems Wolfgang Roth, Günther Schindler, Bernhard Klein, Robert Peharz, Sebastian Tschiatschek, Holger Fröning, Franz Pernkopf, Zoubin Ghahramani , 2024. [ abs ][ pdf ][ bib ]

Trained Transformers Learn Linear Models In-Context Ruiqi Zhang, Spencer Frei, Peter L. Bartlett , 2024. [ abs ][ pdf ][ bib ]

Adam-family Methods for Nonsmooth Optimization with Convergence Guarantees Nachuan Xiao, Xiaoyin Hu, Xin Liu, Kim-Chuan Toh , 2024. [ abs ][ pdf ][ bib ]

Efficient Modality Selection in Multimodal Learning Yifei He, Runxiang Cheng, Gargi Balasubramaniam, Yao-Hung Hubert Tsai, Han Zhao , 2024. [ abs ][ pdf ][ bib ]

A Multilabel Classification Framework for Approximate Nearest Neighbor Search Ville Hyvönen, Elias Jääsaari, Teemu Roos , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Probabilistic Forecasting with Generative Networks via Scoring Rule Minimization Lorenzo Pacchiardi, Rilwan A. Adewoyin, Peter Dueben, Ritabrata Dutta , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Multiple Descent in the Multiple Random Feature Model Xuran Meng, Jianfeng Yao, Yuan Cao , 2024. [ abs ][ pdf ][ bib ]

Mean-Square Analysis of Discretized Itô Diffusions for Heavy-tailed Sampling Ye He, Tyler Farghly, Krishnakumar Balasubramanian, Murat A. Erdogdu , 2024. [ abs ][ pdf ][ bib ]

Invariant and Equivariant Reynolds Networks Akiyoshi Sannai, Makoto Kawano, Wataru Kumagai , 2024. (Machine Learning Open Source Software Paper) [ abs ][ pdf ][ bib ]      [ code ]

Personalized PCA: Decoupling Shared and Unique Features Naichen Shi, Raed Al Kontar , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Survival Kernets: Scalable and Interpretable Deep Kernel Survival Analysis with an Accuracy Guarantee George H. Chen , 2024. [ abs ][ pdf ][ bib ]      [ code ]

On the Sample Complexity and Metastability of Heavy-tailed Policy Search in Continuous Control Amrit Singh Bedi, Anjaly Parayil, Junyu Zhang, Mengdi Wang, Alec Koppel , 2024. [ abs ][ pdf ][ bib ]

Convergence for nonconvex ADMM, with applications to CT imaging Rina Foygel Barber, Emil Y. Sidky , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Distributed Gaussian Mean Estimation under Communication Constraints: Optimal Rates and Communication-Efficient Algorithms T. Tony Cai, Hongji Wei , 2024. [ abs ][ pdf ][ bib ]

Sparse NMF with Archetypal Regularization: Computational and Robustness Properties Kayhan Behdin, Rahul Mazumder , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Deep Network Approximation: Beyond ReLU to Diverse Activation Functions Shijun Zhang, Jianfeng Lu, Hongkai Zhao , 2024. [ abs ][ pdf ][ bib ]

Effect-Invariant Mechanisms for Policy Generalization Sorawit Saengkyongam, Niklas Pfister, Predrag Klasnja, Susan Murphy, Jonas Peters , 2024. [ abs ][ pdf ][ bib ]

Pygmtools: A Python Graph Matching Toolkit Runzhong Wang, Ziao Guo, Wenzheng Pan, Jiale Ma, Yikai Zhang, Nan Yang, Qi Liu, Longxuan Wei, Hanxue Zhang, Chang Liu, Zetian Jiang, Xiaokang Yang, Junchi Yan , 2024. (Machine Learning Open Source Software Paper) [ abs ][ pdf ][ bib ]      [ code ]

Heterogeneous-Agent Reinforcement Learning Yifan Zhong, Jakub Grudzien Kuba, Xidong Feng, Siyi Hu, Jiaming Ji, Yaodong Yang , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Sample-efficient Adversarial Imitation Learning Dahuin Jung, Hyungyu Lee, Sungroh Yoon , 2024. [ abs ][ pdf ][ bib ]

Stochastic Modified Flows, Mean-Field Limits and Dynamics of Stochastic Gradient Descent Benjamin Gess, Sebastian Kassing, Vitalii Konarovskyi , 2024. [ abs ][ pdf ][ bib ]

Rates of convergence for density estimation with generative adversarial networks Nikita Puchkin, Sergey Samsonov, Denis Belomestny, Eric Moulines, Alexey Naumov , 2024. [ abs ][ pdf ][ bib ]

Additive smoothing error in backward variational inference for general state-space models Mathis Chagneux, Elisabeth Gassiat, Pierre Gloaguen, Sylvain Le Corff , 2024. [ abs ][ pdf ][ bib ]

Optimal Bump Functions for Shallow ReLU networks: Weight Decay, Depth Separation, Curse of Dimensionality Stephan Wojtowytsch , 2024. [ abs ][ pdf ][ bib ]

Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees Alexander Terenin, David R. Burt, Artem Artemev, Seth Flaxman, Mark van der Wilk, Carl Edward Rasmussen, Hong Ge , 2024. [ abs ][ pdf ][ bib ]      [ code ]

On Tail Decay Rate Estimation of Loss Function Distributions Etrit Haxholli, Marco Lorenzi , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Deep Nonparametric Estimation of Operators between Infinite Dimensional Spaces Hao Liu, Haizhao Yang, Minshuo Chen, Tuo Zhao, Wenjing Liao , 2024. [ abs ][ pdf ][ bib ]

Post-Regularization Confidence Bands for Ordinary Differential Equations Xiaowu Dai, Lexin Li , 2024. [ abs ][ pdf ][ bib ]

On the Generalization of Stochastic Gradient Descent with Momentum Ali Ramezani-Kebrya, Kimon Antonakopoulos, Volkan Cevher, Ashish Khisti, Ben Liang , 2024. [ abs ][ pdf ][ bib ]

Pursuit of the Cluster Structure of Network Lasso: Recovery Condition and Non-convex Extension Shotaro Yagishita, Jun-ya Gotoh , 2024. [ abs ][ pdf ][ bib ]

Iterate Averaging in the Quest for Best Test Error Diego Granziol, Nicholas P. Baskerville, Xingchen Wan, Samuel Albanie, Stephen Roberts , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Nonparametric Inference under B-bits Quantization Kexuan Li, Ruiqi Liu, Ganggang Xu, Zuofeng Shang , 2024. [ abs ][ pdf ][ bib ]

Black Box Variational Inference with a Deterministic Objective: Faster, More Accurate, and Even More Black Box Ryan Giordano, Martin Ingram, Tamara Broderick , 2024. [ abs ][ pdf ][ bib ]      [ code ]

On Sufficient Graphical Models Bing Li, Kyongwon Kim , 2024. [ abs ][ pdf ][ bib ]

Localized Debiased Machine Learning: Efficient Inference on Quantile Treatment Effects and Beyond Nathan Kallus, Xiaojie Mao, Masatoshi Uehara , 2024. [ abs ][ pdf ][ bib ]      [ code ]

On the Effect of Initialization: The Scaling Path of 2-Layer Neural Networks Sebastian Neumayer, Lénaïc Chizat, Michael Unser , 2024. [ abs ][ pdf ][ bib ]

Improving physics-informed neural networks with meta-learned optimization Alex Bihlo , 2024. [ abs ][ pdf ][ bib ]

A Comparison of Continuous-Time Approximations to Stochastic Gradient Descent Stefan Ankirchner, Stefan Perko , 2024. [ abs ][ pdf ][ bib ]

Critically Assessing the State of the Art in Neural Network Verification Matthias König, Annelot W. Bosman, Holger H. Hoos, Jan N. van Rijn , 2024. [ abs ][ pdf ][ bib ]

Estimating the Minimizer and the Minimum Value of a Regression Function under Passive Design Arya Akhavan, Davit Gogolashvili, Alexandre B. Tsybakov , 2024. [ abs ][ pdf ][ bib ]

Modeling Random Networks with Heterogeneous Reciprocity Daniel Cirkovic, Tiandong Wang , 2024. [ abs ][ pdf ][ bib ]

Exploration, Exploitation, and Engagement in Multi-Armed Bandits with Abandonment Zixian Yang, Xin Liu, Lei Ying , 2024. [ abs ][ pdf ][ bib ]

On Efficient and Scalable Computation of the Nonparametric Maximum Likelihood Estimator in Mixture Models Yangjing Zhang, Ying Cui, Bodhisattva Sen, Kim-Chuan Toh , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Decorrelated Variable Importance Isabella Verdinelli, Larry Wasserman , 2024. [ abs ][ pdf ][ bib ]

Model-Free Representation Learning and Exploration in Low-Rank MDPs Aditya Modi, Jinglin Chen, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal , 2024. [ abs ][ pdf ][ bib ]

Seeded Graph Matching for the Correlated Gaussian Wigner Model via the Projected Power Method Ernesto Araya, Guillaume Braun, Hemant Tyagi , 2024. [ abs ][ pdf ][ bib ]      [ code ]

Fast Policy Extragradient Methods for Competitive Games with Entropy Regularization Shicong Cen, Yuting Wei, Yuejie Chi , 2024. [ abs ][ pdf ][ bib ]

Power of knockoff: The impact of ranking algorithm, augmented design, and symmetric statistic Zheng Tracy Ke, Jun S. Liu, Yucong Ma , 2024. [ abs ][ pdf ][ bib ]

Lower Complexity Bounds of Finite-Sum Optimization Problems: The Results and Construction Yuze Han, Guangzeng Xie, Zhihua Zhang , 2024. [ abs ][ pdf ][ bib ]

On Truthing Issues in Supervised Classification Jonathan K. Su , 2024. [ abs ][ pdf ][ bib ]


Analytics Insight

Top 10 Research and Thesis Topics for ML Projects in 2022


This article features the top 10 research and thesis topics for ML projects for students to try in 2022.

  • Text Mining and Text Classification
  • Image-Based Applications
  • Machine Vision
  • Optimization
  • Voice Classification
  • Sentiment Analysis
  • Recommendation Framework Project
  • Mall Customers' Project
  • Object Detection with Deep Learning


Machine Learning: Recently Published Documents



An explainable machine learning model for identifying geographical origins of sea cucumber Apostichopus japonicus based on multi-element profile

A comparison of machine learning- and regression-based models for predicting ductility ratio of RC beam-column joints

Alexa, is this a historical record?

Digital transformation in government has brought an increase in the scale, variety, and complexity of records and greater levels of disorganised data. Current practices for selecting records for transfer to The National Archives (TNA) were developed to deal with paper records and are struggling to deal with this shift. This article examines the background to the problem and outlines a project that TNA undertook to research the feasibility of using commercially available artificial intelligence tools to aid selection. The project AI for Selection evaluated a range of commercial solutions varying from off-the-shelf products to cloud-hosted machine learning platforms, as well as a benchmarking tool developed in-house. Suitability of tools depended on several factors, including requirements and skills of transferring bodies as well as the tools’ usability and configurability. This article also explores questions around trust and explainability of decisions made when using AI for sensitive tasks such as selection.
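The feasibility study described above is straightforward to prototype. The sketch below is a minimal, hypothetical benchmark, assuming scikit-learn, that compares two off-the-shelf classifiers on a "select for transfer" task; the toy records and labels are invented placeholders, not TNA's data or its in-house tool:

```python
# Hypothetical benchmarking sketch for AI-assisted record selection.
# The corpus and labels below are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

documents = [
    "minutes of departmental policy committee meeting",
    "draft legislation briefing for the minister",
    "routine office stationery purchase order",
    "duplicate circulation copy of press notice",
    "final report of public inquiry into rail safety",
    "canteen lunch menu for week commencing monday",
    "signed treaty ratification correspondence",
    "car park allocation list for staff",
]
labels = [1, 1, 0, 0, 1, 0, 1, 0]  # 1 = select for transfer, 0 = do not select

for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("naive Bayes", MultinomialNB())]:
    # TF-IDF features feed a simple classifier; scores come from cross-validation.
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipeline, documents, labels, cv=2, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.2f}")
```

As the article notes, raw accuracy is only part of the picture; suitability also depends on the requirements and skills of transferring bodies and on each tool's usability and configurability.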

Automated Text Classification of Maintenance Data of Higher Education Buildings Using Text Mining and Machine Learning Techniques

Data-driven analysis and machine learning for energy prediction in distributed photovoltaic generation plants: a case study in Queensland, Australia

Modeling nutrient removal by membrane bioreactor at a sewage treatment plant using machine learning models

Big Five personality prediction based on Indonesian tweets using machine learning methods

The popularity of social media has drawn the attention of researchers who have conducted cross-disciplinary studies examining the relationship between personality traits and behavior on social media. Most current work focuses on personality prediction from English texts, but Indonesian has received scant attention. This research therefore aims to predict users' personalities from Indonesian social media text using machine learning techniques. The paper evaluates several machine learning techniques, including naive Bayes (NB), K-nearest neighbors (KNN), and support vector machine (SVM), based on semantic features including emotion, sentiment, and publicly available Twitter profile information. Personality is predicted using the Big Five personality model, the most appropriate model for predicting user personality in social media, and the relationships between the semantic features and the Big Five dimensions are examined. The experimental results indicate that the Big Five personality types exhibit distinct emotional, sentimental, and social characteristics and that SVM outperformed NB and KNN for Indonesian. In addition, several Indonesian terms were observed that specifically refer to each personality type, each with distinct emotional, sentimental, and social features.
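As a rough, hedged illustration of the comparison reported above, the sketch below pits NB, KNN and SVM against one another; the tiny English "tweets", the two-trait labels, and the plain TF-IDF features are all invented stand-ins for the study's Indonesian corpus and its emotion/sentiment/profile features:

```python
# Hedged sketch: comparing NB, KNN and SVM text classifiers, as in the
# abstract above. The mini "tweets" and two-trait labels are invented
# placeholders; TF-IDF stands in for the paper's semantic features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tweets = [
    "love meeting new people at every event, the more the merrier",
    "big party tonight, bringing all my friends along",
    "had an amazing time chatting with strangers on the train",
    "organising a huge group trip, everyone is invited",
    "perfect evening: tea, a novel, and complete silence",
    "cancelled my plans to stay home and read, no regrets",
    "crowds drain me, a quiet walk alone recharges me",
    "spent the weekend journaling by myself, wonderful",
]
traits = ["extraversion"] * 4 + ["introversion"] * 4  # toy stand-in labels

models = {
    "NB": MultinomialNB(),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "SVM": LinearSVC(),
}
for name, model in models.items():
    pipeline = make_pipeline(TfidfVectorizer(), model)
    scores = cross_val_score(pipeline, tweets, traits, cv=2, scoring="f1_macro")
    print(f"{name}: macro-F1 = {scores.mean():.2f}")
```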

Compressive strength of concrete with recycled aggregate; a machine learning-based evaluation

Temperature prediction of flat steel box girders of long-span bridges utilizing in situ environmental parameters and machine learning

Computer-assisted cohort identification in practice

The standard approach to expert-in-the-loop machine learning is active learning, where, repeatedly, an expert is asked to annotate one or more records and the machine finds a classifier that respects all annotations made until that point. We propose an alternative approach, IQRef, in which the expert iteratively designs a classifier and the machine helps him or her to determine how well it is performing and, importantly, when to stop, by reporting statistics on a fixed, hold-out sample of annotated records. We justify our approach based on prior work giving a theoretical model of how to re-use hold-out data. We compare the two approaches in the context of identifying a cohort of EHRs and examine their strengths and weaknesses through a case study arising from an optometric research problem. We conclude that both approaches are complementary, and we recommend that they be employed in conjunction to address the problem of cohort identification in health research.
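For contrast with IQRef, here is a minimal sketch of the standard pool-based active-learning loop the abstract describes, assuming scikit-learn; synthetic data stands in for the EHR cohort and the ground-truth labels play the role of the expert annotator:

```python
# Hedged sketch of pool-based active learning with uncertainty sampling,
# the "standard approach" described above. Synthetic data stands in for
# the EHR records, and ground truth plays the role of the expert.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
# Seed the labeled set with a few records from each class.
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
pool = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for round_ in range(5):
    model.fit(X[labeled], y[labeled])
    # Query the pool record the current model is least certain about ...
    probs = model.predict_proba(X[pool])[:, 1]
    query = pool[int(np.argmin(np.abs(probs - 0.5)))]
    # ... and ask the "expert" (here, ground truth) to annotate it.
    labeled.append(query)
    pool.remove(query)
    print(f"round {round_}: {len(labeled)} labels collected")
```

IQRef inverts this loop: rather than the machine choosing which records the expert annotates, the expert designs the classifier and the machine repeatedly scores it on one fixed, annotated hold-out sample, also signalling when to stop.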


Grad Coach

Research Topics & Ideas

Artificial Intelligence (AI) and Machine Learning (ML)


If you’re just starting out exploring AI-related research topics for your dissertation, thesis or research project, you’ve come to the right place. In this post, we’ll help kickstart your research topic ideation process by providing a hearty list of research topics and ideas, including examples from past studies.

PS – This is just the start…

We know it’s exciting to run through a list of research topics, but please keep in mind that this list is just a starting point. To develop a suitable research topic, you’ll need to identify a clear and convincing research gap, and a viable plan to fill that gap.

If this sounds foreign to you, check out our free research topic webinar that explores how to find and refine a high-quality research topic, from scratch. Alternatively, if you’d like hands-on help, consider our 1-on-1 coaching service.


AI-Related Research Topics & Ideas

Below you’ll find a list of AI and machine learning-related research topic ideas. These are intentionally broad and generic, so keep in mind that you will need to refine them a little. Nevertheless, they should inspire some ideas for your project.

  • Developing AI algorithms for early detection of chronic diseases using patient data.
  • The use of deep learning in enhancing the accuracy of weather prediction models.
  • Machine learning techniques for real-time language translation in social media platforms.
  • AI-driven approaches to improve cybersecurity in financial transactions.
  • The role of AI in optimizing supply chain logistics for e-commerce.
  • Investigating the impact of machine learning in personalized education systems.
  • The use of AI in predictive maintenance for industrial machinery.
  • Developing ethical frameworks for AI decision-making in healthcare.
  • The application of ML algorithms in autonomous vehicle navigation systems.
  • AI in agricultural technology: Optimizing crop yield predictions.
  • Machine learning techniques for enhancing image recognition in security systems.
  • AI-powered chatbots: Improving customer service efficiency in retail.
  • The impact of AI on enhancing energy efficiency in smart buildings.
  • Deep learning in drug discovery and pharmaceutical research.
  • The use of AI in detecting and combating online misinformation.
  • Machine learning models for real-time traffic prediction and management.
  • AI applications in facial recognition: Privacy and ethical considerations.
  • The effectiveness of ML in financial market prediction and analysis.
  • Developing AI tools for real-time monitoring of environmental pollution.
  • Machine learning for automated content moderation on social platforms.
  • The role of AI in enhancing the accuracy of medical diagnostics.
  • AI in space exploration: Automated data analysis and interpretation.
  • Machine learning techniques in identifying genetic markers for diseases.
  • AI-driven personal finance management tools.
  • The use of AI in developing adaptive learning technologies for disabled students.


AI & ML Research Topic Ideas (Continued)

  • Machine learning in cybersecurity threat detection and response.
  • AI applications in virtual reality and augmented reality experiences.
  • Developing ethical AI systems for recruitment and hiring processes.
  • Machine learning for sentiment analysis in customer feedback.
  • AI in sports analytics for performance enhancement and injury prevention.
  • The role of AI in improving urban planning and smart city initiatives.
  • Machine learning models for predicting consumer behaviour trends.
  • AI and ML in artistic creation: Music, visual arts, and literature.
  • The use of AI in automated drone navigation for delivery services.
  • Developing AI algorithms for effective waste management and recycling.
  • Machine learning in seismology for earthquake prediction.
  • AI-powered tools for enhancing online privacy and data protection.
  • The application of ML in enhancing speech recognition technologies.
  • Investigating the role of AI in mental health assessment and therapy.
  • Machine learning for optimization of renewable energy systems.
  • AI in fashion: Predicting trends and personalizing customer experiences.
  • The impact of AI on legal research and case analysis.
  • Developing AI systems for real-time language interpretation for the deaf and hard of hearing.
  • Machine learning in genomic data analysis for personalized medicine.
  • AI-driven algorithms for credit scoring in microfinance.
  • The use of AI in enhancing public safety and emergency response systems.
  • Machine learning for improving water quality monitoring and management.
  • AI applications in wildlife conservation and habitat monitoring.
  • The role of AI in streamlining manufacturing processes.
  • Investigating the use of AI in enhancing the accessibility of digital content for visually impaired users.

Recent AI & ML-Related Studies

While the ideas we’ve presented above are a decent starting point for finding a research topic in AI, they are fairly generic and non-specific. So, it helps to look at actual studies in the AI and machine learning space to see how this all comes together in practice.

Below, we’ve included a selection of AI-related studies to help refine your thinking. These are actual studies, so they can provide some useful insight as to what a research topic looks like in practice.

  • An overview of artificial intelligence in diabetic retinopathy and other ocular diseases (Sheng et al., 2022)
  • HOW DOES ARTIFICIAL INTELLIGENCE HELP ASTRONOMY? A REVIEW (Patel, 2022)
  • Editorial: Artificial Intelligence in Bioinformatics and Drug Repurposing: Methods and Applications (Zheng et al., 2022)
  • Review of Artificial Intelligence and Machine Learning Technologies: Classification, Restrictions, Opportunities, and Challenges (Mukhamediev et al., 2022)
  • Will digitization, big data, and artificial intelligence – and deep learning–based algorithm govern the practice of medicine? (Goh, 2022)
  • Flower Classifier Web App Using Ml & Flask Web Framework (Singh et al., 2022)
  • Object-based Classification of Natural Scenes Using Machine Learning Methods (Jasim & Younis, 2023)
  • Automated Training Data Construction using Measurements for High-Level Learning-Based FPGA Power Modeling (Richa et al., 2022)
  • Artificial Intelligence (AI) and Internet of Medical Things (IoMT) Assisted Biomedical Systems for Intelligent Healthcare (Manickam et al., 2022)
  • Critical Review of Air Quality Prediction using Machine Learning Techniques (Sharma et al., 2022)
  • Artificial Intelligence: New Frontiers in Real–Time Inverse Scattering and Electromagnetic Imaging (Salucci et al., 2022)
  • Machine learning alternative to systems biology should not solely depend on data (Yeo & Selvarajoo, 2022)
  • Measurement-While-Drilling Based Estimation of Dynamic Penetrometer Values Using Decision Trees and Random Forests (García et al., 2022).
  • Artificial Intelligence in the Diagnosis of Oral Diseases: Applications and Pitfalls (Patil et al., 2022).
  • Automated Machine Learning on High Dimensional Big Data for Prediction Tasks (Jayanthi & Devi, 2022)
  • Breakdown of Machine Learning Algorithms (Meena & Sehrawat, 2022)
  • Technology-Enabled, Evidence-Driven, and Patient-Centered: The Way Forward for Regulating Software as a Medical Device (Carolan et al., 2021)
  • Machine Learning in Tourism (Rugge, 2022)
  • Towards a training data model for artificial intelligence in earth observation (Yue et al., 2022)
  • Classification of Music Generality using ANN, CNN and RNN-LSTM (Tripathy & Patel, 2022)

As you can see, these research topics are a lot more focused than the generic topic ideas we presented earlier. So, in order for you to develop a high-quality research topic, you’ll need to get specific and laser-focused on a specific context with specific variables of interest. In the video below, we explore some other important things you’ll need to consider when crafting your research topic.

Get 1-On-1 Help

If you’re still unsure about how to find a quality research topic, check out our Research Topic Kickstarter service, which is the perfect starting point for developing a unique, well-justified research topic.



Research Group

Machine Learning


We study a range of research areas related to machine learning and their applications in robotics, health care, language processing, information retrieval and more. These include precision medicine, motion planning, computer vision, Bayesian inference, graphical models, and statistical inference and estimation. Our work is interdisciplinary and deeply rooted in systems and computer science theory.

Many of our researchers have affiliations with other groups at MIT, including the Institute for Medical Engineering & Science (IMES) and the Institute for Data, Systems and Society (IDSS).


If you would like to contact us about our work, please refer to our members below and reach out to one of the group leads directly.



Broderick-headshot

Tamara Broderick

Tommi Jaakkola

Tommi Jaakkola

Jegelka-headshot

Stefanie Jegelka

Kaelbling

Leslie Kaelbling

David Sontag headshot

David Sontag

Optimal transport for statistics and machine learning.

default headshot

Justin Solomon

Sensible deep learning for 3d data, interpretability in complex machine learning models, diversity-inducing probability measures, geometry in large-scale machine learning, robust optimization in machine learning and data mining, structured prediction through randomization, andreea gane, tamir hazan, learning optimal interventions.

Gifford

Learning Strategic Games

Scalable bayesian inference with optimization, learning from streaming network data, different types of approximations for fast and accurate probabilistic inference, finite approximations of infinite models, tractable models of sparse networks, scalable bayesian inference via adaptive data summaries,   12 more.

Recent news:

  • Automated method helps researchers quantify uncertainty in their predictions: a new technique can help researchers who use Bayesian inference achieve more accurate results more quickly, without a lot of additional work.
  • How symmetry can come to the aid of machine learning: new MIT research provides a theoretical proof for a phenomenon observed in practice, namely that encoding symmetries in the machine learning model helps the model learn with fewer data.
  • Explained: Generative AI. MIT AI experts help break down the ins and outs of this increasingly popular, and ubiquitous, technology.
  • Two CSAILers within School of Engineering granted tenure in 2023
  • Unpacking the “black box” to build better AI models
  • Avoiding shortcut solutions in artificial intelligence
  • Using artificial intelligence to improve early breast cancer detection: a model developed at MIT’s Computer Science and Artificial Intelligence Laboratory could reduce false positives and unnecessary surgeries.
  • …and 4 more

Machine Learning - CMU

Machine Learning Department Research

Research

Additional Research

  • In-depth descriptions of selected research projects are available on the ML@CMU blog .
  • Additional research projects are described on the individual home pages of  ML faculty .
  • Examples of projects open to undergraduates can be found on our ML Minor Senior Projects page.

Dissertations & Reports

  • PhD Dissertations
  • Data Analysis Projects
  • Technical Reports



Ideas Made to Matter

Artificial Intelligence

MIT Sloan research on artificial intelligence and machine learning

Oct 26, 2022

There’s little question artificial intelligence and machine learning are playing an increasing role in making business decisions. A 2022 survey of senior data and technology executives by NewVantage Partners found that 92% of large companies reported achieving returns on their data and AI investments — an increase from 48% in 2017.

But as these technologies enter the mainstream, new issues arise: How will they change the nature of workflow and workplace connection? Will they be ethically harnessed? Will they replace humans?

Here’s what to consider as AI and machine learning become omnipresent, according to MIT Sloan researchers, visiting scholars, and industry experts.

Artificial intelligence is changing most occupations, but it is far from replacing humans , according to a book examining the findings of the MIT Task Force on the Work of the Future .


MIT researchers David Autor, David Mindell, and Elisabeth B. Reynolds argue it is essential to understand the capabilities and limitations of artificial intelligence as we think about how it will impact jobs.

Today’s AI challenges center on physical dexterity, social interaction, and judgment. Consider a home health aide, whose responsibilities include providing physical assistance to a fragile human, observing their behavior, and communicating with family and doctors. Not until automation reaches that level can it truly be considered what scholars call “artificial general intelligence.”

It’s possible to harness AI to create a more equitable future.   Across industries, workers are worried that automation and artificial intelligence will steal their jobs. MIT Sloan professor of management Thomas Kochan shares those concerns, but also sees “tremendous” innovative potential in new technologies to create “a productive and more equitable future.”

In his online executive education course, “ Leading the Future of Work ,” Kochan lays out a four-pronged roadmap for work of the future:

  • Strive to become a “high road” company   that creates value for all stakeholders, including their employees.
  • Use advanced technology to drive innovation and augment work. 
  • Train and upskill the workforce to work alongside new technologies. 
  • Rebuild a dialogue with labor leaders for a mutually beneficial future. 

An AI-fueled productivity boom is on the way. Consider the internet: Its fundamental technologies took root in the 1960s and 1970s but didn’t become commercialized until the mid-1990s. Stanford University’s Erik Brynjolfsson, part of the MIT Task Force on the Work of the Future, calls this phenomenon a “ J-curve ,” when technological acceptance is “slow and incremental at first, then accelerates to break through into broad acceptance.”

Now, businesses should prepare for an AI-fueled J-curve as the technology takes off. Companies should focus on incorporating artificial intelligence and machine learning  into work processes and preparing employees, Brynjolfsson said at an EmTech Next conference, while policymakers should move to ensure its adoption doesn’t contribute to inequality.


AI requires stakeholder buy-in. Machine learning tools are used in a variety of fields. But getting tech into the workplace is just one step — these tools are only successful if they’re integrated into workflows and if people trust them enough to depend on them.

A key to successful adoption is an ongoing dialogue between technology developers and end users, according to research from MIT Sloan professor Kate Kellogg and co-authors.

“Managers and developers need to engage in a back-and-forth process to build, evaluate, and refine the tools in order for them to be useful in practice,” Kellogg said.

Moreover, stakeholders need to believe that AI programs are accurate and trustworthy. Artificial intelligence explainability can help. Researchers from the  MIT Center for Information Systems Research define AI explainability as “the ability to manage AI initiatives in ways that ensure models are value-generating, compliant, representative, and reliable.”

Artificial intelligence explainability is an emerging field, the researchers admit. They recommend that companies start by identifying units and organizations that are already creating effective AI explanations, and identifying practices that the organization’s own AI project teams have already adopted.

AI and machine learning are transforming digital marketing. Most marketers are concerned about retention and revenue, but without good forecasts, decisions about effective marketing interventions can be arbitrary, said Dean Eckles, social and digital experimentation research group lead at the MIT Initiative on the Digital Economy.

Machine learning will change that, helping to predict customer behavior and intuit customer needs.

“There’s a lot of value to applying statistical machine learning to predict long-term and hard-to-measure outcomes,” Eckles said.

Companies such as Wayfair and Spotify leverage machine learning for bespoke customer experiences, from highly tailored furniture search results to personalized suggested playlists. And, when COVID-19 spread, Moderna used its long-held automated processes and AI algorithms to increase the number of small-scale messenger RNA (mRNA) batches needed to run clinical experiments. This groundwork contributed to Moderna releasing one of the first COVID-19 vaccines (using mRNA) in the pandemic’s early days.

Good data makes for good AI. Having five essential data capabilities, such as data scientists and a data platform, helps companies build successful artificial intelligence programs.

The key is to take an enterprise — rather than local — perspective as AI project teams learn and mature, according to researchers from MIT CISR. When companies identify and accumulate expertise and practices from their AI teams, they can create reusable and refinable practices and build capability. This accelerates new AI projects and sets up those future teams for success.

Embracing data-centric AI is also important. This is “the discipline of systematically engineering the data needed to build a successful AI system,” according to Andrew Ng, SM ’98, founder of the Google Brain research lab and former chief scientist at Baidu.

Focusing on high-quality data that's consistently labeled could unlock the value of AI for sectors such as health care, government technology, and manufacturing, Ng said at an EmTech Digital conference.

And even if AI can’t solve all of the world’s problems — and might even cause some new ones along the way — it can at least make Wordle a little bit easier to solve.


reason.town


How to Write a Machine Learning Research Proposal

Contents:

  • Introduction
  • What Is a Machine Learning Research Proposal?
  • The Structure of a Machine Learning Research Proposal
  • Tips for Writing a Machine Learning Research Proposal
  • How to Get Started with Writing a Machine Learning Research Proposal
  • The Importance of a Machine Learning Research Proposal
  • Why You Should Take the Time to Write a Machine Learning Research Proposal
  • How to Make Your Machine Learning Research Proposal Stand Out
  • The Bottom Line: Why Writing a Machine Learning Research Proposal Is Worth It
  • Further Resources on Writing Machine Learning Research Proposals

If you want to get into machine learning, you first need to get past the research proposal stage. We’ll show you how.


A machine learning research proposal is a document that summarizes your research project, methods, and expected outcomes. It is typically used to secure funding for your project from a sponsor or institution, and can also be used by peers to assess your project. Your proposal should be clear, concise, and well-organized. It should also provide enough detail to allow reviewers to assess your project’s feasibility and potential impact.

In this guide, we will cover the basics of what you need to include in a machine learning research proposal. We will also provide some tips on how to create a strong proposal that is more likely to be funded.

A machine learning research proposal is a document that describes a proposed research project that uses machine learning algorithms and techniques. The proposal should include a brief overview of the problem to be tackled, the proposed solution, and the expected results. It should also briefly describe the dataset to be used, the evaluation metric, and any other relevant details.

There is no one-size-fits-all answer to this question, as the structure of a machine learning research proposal will vary depending on the specific research question you are proposing to answer, the methods you plan to use, and the overall focus of your proposal. However, there are some general principles that all good proposals should follow.

In general, a machine learning research proposal should include:

- A summary of the problem you are trying to solve and the motivation for solving it
- A brief overview of previous work in this area, including any relevant background information
- A description of your proposed solution and a discussion of how it compares to existing approaches
- An evaluation plan outlining how you will evaluate the effectiveness of your proposed solution
- A discussion of any potential risks or limitations associated with your proposed research

Useful tips for writing a machine learning research proposal:

-Your proposal should address a specific problem or question in machine learning.

-Before writing your proposal, familiarize yourself with the existing literature in the field. Your proposal should build on the existing body of knowledge and contribute to the understanding of the chosen problem or question.

-Your proposal should be clear and concise. It should be easy for non-experts to understand what you are proposing and why it is important.

-Your proposal should be well organized. Include a brief introduction, literature review, methodology, expected results, and significance of your work.

-Make sure to proofread your proposal carefully before submitting it.

A machine learning research proposal is a document that outlines the problem you want to solve with machine learning, the methods you will use to solve it, the data you will use, and the anticipated results. This guide provides an overview of what should be included in a machine learning research proposal so that you can get started on writing your own.

1. Introduction
2. Problem statement
3. Methodology
4. Data
5. Evaluation
6. References

A machine learning research proposal is a document that outlines the rationale for a proposed machine learning research project. The proposal should convince potential supervisors or funding bodies that the project is worthwhile and that the researcher is competent to undertake it.

The proposal should include:

– A clear statement of the problem to be addressed or the question to be answered
– A review of relevant literature
– An outline of the proposed research methodology
– A discussion of the expected outcome of the research
– A realistic timeline for completing the project

A machine learning research proposal is not just a formal exercise; it is an opportunity to sell your idea to potential supervisors or funding bodies. Take advantage of this opportunity by doing your best to make your proposal as clear, concise, and convincing as possible.

Your machine learning research proposal is your chance to sell your project to potential supervisors and funders. It should be clear and concise, and it should make a strong case for why your project is worth undertaking.

A well-written proposal will convince others that you have a worthwhile project and that you have the necessary skills and experience to complete it successfully. It will also help you to clarify your own ideas and focus your research.

Writing a machine learning research proposal can seem like a daunting task, but it doesn’t have to be. If you take it one step at a time, you’ll be well on your way to writing a strong proposal that will get the support you need.

In order to make your machine learning research proposal stand out, you will need to do several things. First, make sure that your proposal is well written and free of grammatical errors. Second, make sure that your proposal is clear and concise. Third, make sure that your proposal is organized and includes all of the necessary information. Finally, be sure to proofread your proposal carefully before submitting it.

The benefits of writing a machine learning research proposal go beyond helping you get funding for your project. A good proposal will also force you to think carefully about your problem and how you plan to solve it. This process can help you identify potential flaws in your approach and make sure that your project is as strong as possible before you start.

It can also be helpful to have a machine learning research proposal on hand when you’re talking to potential collaborators or presenting your work to a wider audience. A well-written proposal can give people a clear sense of what your project is about and why it’s important, which can make it easier to get buy-in and find people who are excited to work with you.

In short, writing a machine learning research proposal is a valuable exercise that can help you hone your ideas and make sure that your project is as strong as possible before you start.

Here are some further resources on writing machine learning research proposals:

– How to Write a Machine Learning Research Paper: https://MachineLearningMastery.com/how-to-write-a-machine-learning-research-paper/

– 10 Tips for Writing a Machine Learning Research Paper: https://blog.MachineLearning.net/10-tips-for-writing-a-machine-learning-research-paper/

Please also see our other blog post on writing research proposals: https://www.MachineLearningMastery.com/how-to-write-a-research-proposal/



Open access | Published: 07 April 2023

Using machine learning to predict student retention from socio-demographic characteristics and app-based engagement metrics

Sandra C. Matz, Christina S. Bukow, Heinrich Peters, Christine Deacons, Alice Dinu & Clemens Stachl

Scientific Reports, volume 13, Article number: 5705 (2023)


An Author Correction to this article was published on 21 June 2023; this article has been updated.

Student attrition poses a major challenge to academic institutions, funding bodies and students. With the rise of Big Data and predictive analytics, a growing body of work in higher education research has demonstrated the feasibility of predicting student dropout from readily available macro-level (e.g., socio-demographics or early performance metrics) and micro-level data (e.g., logins to learning management systems). Yet, the existing work has largely overlooked a critical meso-level element of student success known to drive retention: students’ experience at university and their social embeddedness within their cohort. In partnership with a mobile application that facilitates communication between students and universities, we collected both (1) institutional macro-level data and (2) behavioral micro and meso-level engagement data (e.g., the quantity and quality of interactions with university services and events as well as with other students) to predict dropout after the first semester. Analyzing the records of 50,095 students from four US universities and community colleges, we demonstrate that the combined macro and meso-level data can predict dropout with high levels of predictive performance (average AUC across linear and non-linear models = 78%; max AUC = 88%). Behavioral engagement variables representing students’ experience at university (e.g., network centrality, app engagement, event ratings) were found to add incremental predictive power beyond institutional variables (e.g., GPA or ethnicity). Finally, we highlight the generalizability of our results by showing that models trained on one university can predict retention at another university with reasonably high levels of predictive performance.


Introduction

In the US, only about 60% of full-time students graduate from their program 1 , 2 , with the majority of those who discontinue their studies dropping out during their first year 3 . These high attrition rates pose major challenges to students, universities, and funding bodies alike 4 , 5 .

Dropping out of university without a degree negatively impacts students’ finances and mental health. Over 65% of US undergraduate students receive student loans to help pay for college, causing them to incur heavy debts over the course of their studies 6 . According to the U.S. Department of Education, students who take out a loan but never graduate are three times more likely to default on loan repayment than students who graduate 7 . This is hardly surprising, given that students who drop out of university without a degree, earn 66% less than university graduates with a bachelor's degree and are far more likely to be unemployed 2 . In addition to financial losses, the feeling of failure often adversely impacts students’ well-being and mental health 8 .

At the same time, student attrition negatively impacts universities and federal funding bodies. For universities, student attrition results in revenue losses of approximately $16.5 billion per year through lost tuition fees 9 , 10 . Similarly, student attrition wastes valuable resources provided by states and federal governments. For example, the US Department of Education Integrated Postsecondary Education Data System (IPEDS) shows that between 2003 and 2008, state and federal governments together provided more than $9 billion in grants and subsidies to students who did not return to the institution where they were enrolled for a second year 11 .

Given the high costs of attrition, the ability to predict at-risk students – and to provide them with additional support – is critical 12 , 13 . As most dropouts occur during the first year 14 , such predictions are most valuable if they can identify at-risk students as early as possible 13 , 15 , 16 . The earlier one can identify students who might struggle, the better the chances that interventions aimed at protecting them from gradually falling behind – and eventually discontinuing their studies – will be effective 17 , 18 .

Indicators of student retention

Previous research has identified various predictors of student retention, including previous academic performance, demographic and socio-economic factors, and the social embeddedness of a student in their home institution 19 , 20 , 21 , 22 , 23 .

Prior academic performance (e.g., high school GPA, SAT and ACT scores or college GPA) has been identified as one of the most consistent predictors of student retention: Students who are more successful academically are less likely to drop out 17 , 21 , 24 , 25 , 26 , 27 , 28 , 29 . Similarly, research has highlighted the role of demographic and socio-economic variables, including age, gender, and ethnicity 12 , 19 , 25 , 27 , 30 as well as socio-economic status 31 in predicting a student's likelihood of persisting. For example, women are more likely to continue their studies than men 12 , 30 , 32 , 33 while White and Asian students are more likely to persist than students from other ethnic groups 19 , 27 , 30 . Moreover, a student’s socio-economic status and immediate financial situation have been shown to impact retention. Students are more likely to discontinue their studies if they are first-generation students 34 , 35 , 36 or experience high levels of financial distress (e.g., due to student loans or working nearly full time to cover college expenses) 37 , 38 . In contrast, students who receive financial support that does not have to be repaid post-graduation are more likely to complete their degree 39 , 40 .

While most of the outlined predictors of student retention are relatively stable intrapersonal characteristics that are oftentimes difficult or costly to change, research also points to a more malleable pillar of retention: the students’ experience at university, in particular the extent to which they are successfully integrated and socialized into the institution 16 , 22 , 41 , 42 . As Bean (2005) notes, “few would deny that the social lives of students in college and their exchanges with others inside and outside the institution are important in retention decisions” (p. 227) 41 . The extent to which a student is socially integrated and embedded into their institution has been studied in a number of ways, relating retention to the development of friendships with fellow students 43 , the student’s position in social networks 16 , 29 , the experience of social connectedness 44 and a sense of belonging 42 , 45 , 46 . Taken together, these studies suggest that interactions with peers as well as faculty and staff – for example through participation in campus activities, membership of organizations, and the pursuit of extracurricular activities – help students better integrate into university life 44 , 47 . In contrast, a lack of social integration resulting from commuting (i.e., not living on campus with other students) has been shown to negatively impact a student’s chances of completing their degree 48 , 49 , 50 , 51 . In short, the more strongly a student is embedded and feels integrated into the university community – particularly in their first year – the less likely they are to drop out 42 , 52 .

Predicting retention using machine learning

A large proportion of research on student attrition has focused on understanding and explaining drivers of student retention. However, alongside the rise of computational methods and predictive modeling in the social sciences 53 , 54 , 55 , educational researchers and practitioners have started exploring the feasibility and value of data-driven approaches in supporting institutional decision making and educational effectiveness (for excellent overviews of the growing field see 56 , 57 ). In line with this broader trend, a growing body of work has shown the potential of predicting student dropout with the help of machine learning. In contrast to traditional inferential approaches, machine learning approaches are predominantly concerned with predictive performance (i.e., the ability to accurately forecast behavior that has not yet occurred) 54 . In the context of student retention this means: How accurately can we predict whether a student is going to complete or discontinue their studies (in the future) by analyzing their demographic and socio-economic characteristics, their past and current academic performance, as well as their current embeddedness in the university system and culture?

Echoing the National Academy of Education’s statement (2017) that “in the educational context, big data typically take the form of administrative data and learning process data, with each offering their own promise for educational research” (p.4) 58 , the vast majority of existing studies have focused on the prediction of student retention from demographic and socio-economic characteristics as well as students’ academic history and current performance 13 , 59 , 60 , 61 , 62 , 63 , 64 , 65 , 66 . In a recent study, Aulck and colleagues trained a model on the administrative data of over 66,000 first-year students enrolled in a public US university (e.g., race, gender, high school GPA, entrance exam scores and early college performance/transcript data) to predict whether they would re-enroll in the second year and eventually graduate 59 . Specifically, they used a range of linear and non-linear machine learning models (e.g., regularized logistic regression, k-nearest neighbor, random forest, support vector machine, and gradient boosted trees) to predict retention out-of-sample using standard cross-validation procedures. Their model was able to predict dropouts with an accuracy of 88% and graduation with an accuracy of 81% (where 50% is chance).

While the existing body of work provides robust evidence for the potential of predictive models for identifying at-risk students, it is based on similar sets of macro-level data (e.g., institutional data, academic performance) or micro-level data (e.g., click-stream data). Almost entirely absent from this research is data on students’ daily experience and engagement with both other students and the university itself (meso-level). Although there have been a small number of studies trying to capture part of this experience by inferring social networks from smart card transactions made by students at the same time and place 16 or engagement metrics with an open online course 67 , none of the existing work has offered a more holistic and comprehensive view on students’ daily experience. One potential explanation for this gap is that information about students’ social interactions with classmates or their day-to-day engagement with university services and events is difficult to track. While universities often have access to demographic or socio-economic variables through their Student Information Systems (SISs), and can easily track their academic performance, most universities do not have an easy way of capturing students’ deeper engagement with the system.

Research overview

In this research, we partner with an educational software company – READY Education – that offers a virtual one-stop interaction platform in the form of a smartphone application to facilitate communication between students, faculty, and staff. Students receive relevant information and announcements, can manage their university activities, and interact with fellow students in various ways. For example, the app offers a social media experience like Facebook, including private messaging, groups, public walls, and friendships. In addition, it captures students’ engagement with the university by asking them to check into events (e.g., orientation, campus events, and student services) using QR code functionality and prompting them to rate their experience afterwards (see Methods for more details on the features we extracted from this data). As a result, the READY Education app allows us to observe a comprehensive set of information about students that includes both (i) institutional data (i.e., demographic and socio-economic characteristics as well as academic performance), and (ii) their idiosyncratic experience at university captured by their daily interactions with other students and the university services/events. Combining the two data sources captures a student’s profile more holistically and makes it possible to consider potential interactions between the variable sets. For example, being tightly embedded in a social support network of friends might be more important for retention among first-generation students who might not receive the same level of academic support or learn about implicit academic norms and rules from their parents.

Building on this unique dataset, we use machine learning models to predict student retention (i.e., dropout) from both institutional and behavioral engagement data. Given the desire to identify at-risk students as early as possible, we only use information gathered in the students’ first semester to predict whether the student dropped out at any point in time during their program. To thoroughly validate and scrutinize our analytical approach, generate insights for potential interventions, and probe the generalizability of our predictive models across different universities, we investigate the following three research questions:

How accurately can we predict a student's likelihood of discontinuing their studies using information from the first term of their studies (i.e., institutional data, behavioral engagement data, and a combination of both)?

Which features are the most predictive of student retention?

How well do the predictive models generalize across universities (i.e., how well can we predict retention for students at one university using a model trained on data from another university, and vice versa)?

Participants

We analyze de-identified data from four institutions with a total of 50,095 students (min = 476, max = 45,062). All students provided informed consent to the use of the anonymized data by READY Education and research partners. All experimental protocols were approved by the Columbia University Ethics Board, and all methods carried out were in accordance with the Board’s guidelines and regulations. The data stem from two sources: (a) institutional data and (b) behavioral engagement data. The institutional data collected by the universities contain socio-demographics (e.g., gender, ethnicity), general study information (e.g., term of admission, study program), financial information (e.g., Pell eligibility), students’ academic achievement scores (e.g., GPA, ACT) as well as the retention status. The latter indicates whether students continued or dropped out and serves as the outcome variable. As different universities collect different information about their students, the scope of institutional data varied between universities. Table 1 provides a descriptive overview of the most important sociodemographic characteristics for each of the four universities. In addition, it provides a descriptive overview of the app usage, including the average number of logs per student, the total number of sessions and logs, as well as the percentage of students in a cohort using the app (i.e., coverage). The broad coverage of students using the app, ranging between 70 and 98%, results in a largely representative sample of the student populations at the respective universities.

Notably, Universities 1–3 are traditional university campuses, while University 4 is a combination of 16 different community colleges. Given that there is considerable heterogeneity across campuses, the predictive accuracies for University 4 are expected a priori to be lower than those observed for Universities 1–3 (and partly speak to the generalizability of findings already). The decision to include University 4 as a single entity was based on the fact that separating out the 16 colleges would have resulted in an over-representation of community colleges that all share similar characteristics, thereby artificially inflating the observed cross-university accuracies. Given these limitations (and the fact that the university itself collapsed the college campuses for many of their internal reports), we decided to analyze it as a single unit, acknowledging that this approach brings its own limitations.

The behavioral engagement data were generated through the app (see Table 1 for the specific data collection windows at each university). Behavioral engagement data were available in the form of time-stamped event-logs (i.e., each row in the raw data represented a registered event such as a tab clicked, a comment posted, a message sent). Each log could be assigned to a particular student via an anonymized, unique identifier. Across all four universities, the engagement data contained 7,477,630 sessions (Mean = 1,869,408, SD = 3,329,852) and 17,032,633 logs (Mean = 4,258,158, SD = 6,963,613). For a complete overview of all behavioral engagement metrics including a description, see Table S1 in the Supplementary Materials.

Pre-processing and feature extraction

As a first step, we cleaned both the institutional and app data. For the institutional data, we excluded students who did not use the app and therefore could not be assigned a unique identifier. In addition, we excluded students without a term of admission to guarantee that we are only observing the students’ first semester. Lastly, we removed duplicate entries resulting from dual enrollment in different programs. For the app usage data, we visually inspected the variables in our data set for outliers that might stem from technical issues. We pre-processed data that reflected clicking through the app, named “clicked_[…]” and “viewed_[…]” (see Table S1 in the Supplementary Materials). A small number of observations showed unrealistically high numbers of clicks on the same tab in a very short period, which is likely a reflection of a student repeatedly clicking on a tab due to long loading time or other technical issues. To avoid oversampling these behaviors, we removed all clicks of the same type which were made by the same person less than one minute apart.
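To make the click de-duplication rule concrete, here is a minimal pandas sketch. The table layout and column names (student_id, event_type, timestamp) are assumptions for illustration; the paper does not specify its tooling or schema.

```python
import pandas as pd

# Hypothetical event-log schema: one row per registered event.
logs = pd.read_csv("event_logs.csv", parse_dates=["timestamp"])
logs = logs.sort_values(["student_id", "event_type", "timestamp"])

# Time elapsed since the previous event of the same type by the same student.
gap = logs.groupby(["student_id", "event_type"])["timestamp"].diff()

# Keep the first click in each burst; drop repeats less than one minute apart,
# which likely reflect repeated clicking during long loading times.
logs_clean = logs[gap.isna() | (gap >= pd.Timedelta(minutes=1))]
```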

We extracted up to 462 features for each university across two broad categories: (i) institutional features and (ii) engagement features, using evidence from previous research as a reference point (see Table S2 in the Supplementary Materials for a comprehensive overview of all features and their availability for each of the universities). Institutional features contain students’ demographic, socio-economic and academic information. The engagement features represent the students’ behavior during their first term of studies. They can be further divided into app engagement and community engagement. The app engagement features represent the students’ behavior related to app usage, such as whether the students used the app before the start of the semester, how often they clicked on notifications or the community tabs, or whether their app use increased over the course of the semester. The community engagement features reflect social behavior and interaction with others, e.g., the number of messages sent, posts and comments made, events visited or a student’s position in the network as inferred from friendships and direct messages. Importantly, many of the features in our dataset will be inter-correlated. For example, living in college accommodation could signal higher levels of socio-economic status, but also make it more likely that students attend campus events and connect with other students living on campus. While intercorrelations among predictors are a challenge for standard inferential statistical techniques such as regression analyses, the methods we apply in this paper can account for a large number of correlated predictors.

Institutional features were directly derived from the data recorded by the institutions. As noted above, not all features were available for all universities, resulting in slightly different feature sets across universities. The engagement features were extracted from the app usage data. As we focused on an early prediction of drop-out, we restricted the data to event-logs that were recorded in the respective students' first term. Notably, the data captures students’ engagement as a time-stamped series of events, offering fine-grained insights into their daily experience. For reasons of simplicity and interpretability (see research question 2), we collapse the data into a single entry for each student. Specifically, we describe a student’s overall experience during the first semester by calculating distribution measures for each student such as the arithmetic mean, standard deviation, kurtosis, skewness, and sum values. For example, we calculate how many daily messages a particular student sent or received during their first semester, or how many campus events they attended in total. However, we also account for changes in a student’s behavior over time by calculating more complex features such as entropy (e.g., the extent to which a person has frequent contact with few people or the same degree of contact with many people) and the development of specific behaviors over time measured by the slope of regression analyses, as well as features representing the regularity of behavior (e.g., the deviation of time between sending messages). Overall, the feature set was aimed at describing a student’s overall engagement with campus resources and other students during the first semester as well as changes in engagement over time. Finally, we extracted some of the features separately for weekdays and weekends to account for differences and similarities in students’ activities during the week and the weekend. For example, little social interaction on weekdays might predict retention differently than little social interaction on the weekend.
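As an illustration of this kind of feature construction, the sketch below aggregates a hypothetical daily message count into the distribution, trend, and entropy features described above. The input format and column names are assumptions, not the paper's actual pipeline.

```python
import pandas as pd
from scipy.stats import entropy, linregress

# Assumed input: one row per student per day with a daily message count.
daily = pd.read_csv("daily_messages.csv")  # student_id, day_index, n_messages

def summarize(g: pd.DataFrame) -> pd.Series:
    x = g["n_messages"].astype(float)
    feats = {"msg_sum": x.sum(), "msg_mean": x.mean(), "msg_sd": x.std(),
             "msg_skew": x.skew(), "msg_kurtosis": x.kurt()}
    # Trend over the semester: slope of a regression of daily counts on time.
    feats["msg_slope"] = linregress(g["day_index"], x).slope if len(x) > 1 else 0.0
    # Entropy: high if messaging is spread evenly across days, low if bursty.
    feats["msg_entropy"] = entropy(x / x.sum()) if x.sum() > 0 else 0.0
    return pd.Series(feats)

features = daily.groupby("student_id").apply(summarize)
```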

We further cleaned the data by discarding participants for whom the retention status was missing and those in which 95% or more of the values were zero or missing. Furthermore, features were removed if they showed little or no variance across participants, which makes them essentially meaningless in a prediction task. Specifically, we excluded numerical features which showed the same values for more than 90% of observations and categorical features which showed the same value for all observations.

In addition to these general pre-processing procedures, we integrated additional pre-processing steps into the resampling prior to training the models to avoid an overestimation of model performance 68 . To prevent problems with categorical features that occur when there are fewer levels in the test than in the training data, we first removed categories that did not occur in the training data. Second, we removed constant categorical features containing a single value only (and therefore no variation). Third, we imputed missing values using the following procedures: Categorical features were imputed with the mode. Following commonly used approaches to dealing with missing data, the imputation of numeric features varied between the learners. For the elastic net, we imputed those features with the median. For the random forest, we used twice the maximum to give missing values a distinct meaning that would allow the model to leverage this information. Lastly, we used the "Synthetic Minority Oversampling Technique" (SMOTE) to create artificial examples for the minority class in the training data 69 . The only exception was University 4 which followed a different procedure due to the large sample size and estimated computing power for implementing SMOTE. Instead of oversampling minority cases, we downsampled majority cases such that the positive and negative class were balanced. This was done to address the class imbalance caused by most students continuing their studies rather than dropping out 12 .
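The crucial detail here is that imputation and oversampling are fit on training folds only. In Python, an imbalanced-learn pipeline enforces this automatically; the following is a rough analogue of the setup described above (the authors' own implementation, presumably in R, may differ):

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

# SMOTE and imputation run only when the pipeline is fit, i.e., inside each
# training fold during cross-validation, never on held-out test data.
model = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("smote", SMOTE(k_neighbors=5, random_state=42)),
    ("forest", RandomForestClassifier(random_state=42)),
])
```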

Predictive modeling approach

We predicted the retention status (1 = dropped out, 0 = continued) in a binary prediction task, with three sets of features: (1) institutional features (2) engagement features, and (3) a combined set of all features. To ensure the robustness of our predictions and to identify the model which is best suited for the current prediction context 54 , we compared a linear classifier ( elastic net; implemented in glmnet 4.1–4) 70 , 71 and a nonlinear classifier ( random forest ; implemented in randomForest 4.7–1) 72 , 73 . Both models are particularly well suited for our prediction context and are common choices in computational social science. That is, simple linear or logistic regression models are not suitable to work with datasets that have many inter-correlated predictors (in our case, a total of 462 predictors many of which are highly correlated) due to a high risk of overfitting. Both the elastic net and the random forest algorithm can effectively utilize large feature sets while reducing the risk of overfitting. We evaluate the performance of our six models for each school (2 algorithms and 3 feature sets), using out-of-sample benchmark experiments that estimate predictive performance and compare it against a common non-informative baseline model. The baseline represents a null-model that does not include any features, but instead always predicts the majority class, which in our samples means “continued.” 74 Below, we provide more details about the specific algorithms (i.e., elastic net and random forest), the cross-validation procedure, and the performance metrics we used for model evaluation.
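As a sketch of this benchmarking setup, the snippet below compares both classifier families against a majority-class baseline on synthetic data. The paper itself used R's glmnet and randomForest; scikit-learn equivalents are shown here purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the student-by-feature matrix; 1 = dropped out.
X, y = make_classification(n_samples=1000, n_features=40, weights=[0.85],
                           random_state=0)

models = {
    "baseline (majority class)": DummyClassifier(strategy="most_frequent"),
    "elastic net": LogisticRegression(penalty="elasticnet", solver="saga",
                                      l1_ratio=0.5, max_iter=10000),
    "random forest": RandomForestClassifier(n_estimators=500, random_state=0),
}
for name, clf in models.items():
    auc = cross_val_score(clf, X, y, cv=10, scoring="roc_auc")
    print(f"{name}: mean AUC = {auc.mean():.3f}")
```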

Elastic net model

The elastic net is a regularized regression approach that combines advantages of ridge regression 75 with those of the LASSO 76 and is motivated by the need to handle large feature sets. The elastic net shrinks the beta-coefficients of features that add little predictive value (e.g., intercorrelated, little variance). Additionally, the elastic net can effectively remove variables from the model by reducing the respective beta coefficients to zero 70 . Unlike classical regression models, the elastic net does not aim to optimize the sum of least squares, but includes two penalty terms (L1, L2) that incentivize the model to reduce the estimated beta value of features that do not add information to the model. Combining the L1 (the sum of absolute values of the coefficients) and L2 (the sum of the squared values of the coefficients) penalties, elastic net addresses the limitations of alternative linear models such as LASSO regression (not capable of handling multi-collinearity) and Ridge Regression (may not produce sparse-enough solutions) 70 .

Formally, following Hastie and Qian (2016), the elastic net model for binary classification problems can be written as follows 77 . Suppose the response variable takes values in G = {0,1} and y_i denotes I(g_i = 1); the model is

$$\Pr(G = 1 \mid x) = \frac{e^{\beta_0 + x^\top \beta}}{1 + e^{\beta_0 + x^\top \beta}}$$

After applying the log-odds transformation, the model formula can be written as

$$\log \frac{\Pr(G = 1 \mid x)}{\Pr(G = 0 \mid x)} = \beta_0 + x^\top \beta$$

The objective function for logistic regression is the penalized negative binomial log-likelihood

$$\min_{(\beta_0,\,\beta)} \; -\left[ \frac{1}{N} \sum_{i=1}^{N} y_i \left(\beta_0 + x_i^\top \beta\right) - \log\left(1 + e^{\beta_0 + x_i^\top \beta}\right) \right] + \lambda \left[ \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right]$$

where λ is the regularization parameter that controls the overall strength of the regularization, and α is the mixing parameter that controls the balance between L1 and L2 regularization, with α values closer to one resulting in sparser models (lasso regression: α = 1; ridge regression: α = 0). β represents the coefficients of the regression model, ||β||_1 is the L1 norm of the coefficients (the sum of the absolute values of the coefficients), and ||β||_2^2 is the squared L2 norm of the coefficients (the sum of the squared values of the coefficients).

The regularized regression approach is especially relevant for our model because many of the app-based engagement features are highly correlated (e.g., the number of clicks is related to the number of activities registered in the app). In addition, we favored the elastic net algorithm over more complex alternatives, because the regularized beta coefficients can be interpreted as feature importance, allowing insights into which predictors are most informative of college dropout 78 , 79 .
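A minimal scikit-learn sketch of this idea, reading the regularized coefficients as feature importances, is shown below. Parameter values are illustrative, not the tuned values from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=30, random_state=0)

# l1_ratio plays the role of glmnet's alpha; C acts roughly as 1/lambda.
# Standardizing first puts all coefficients on a comparable scale.
enet = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga",
                       l1_ratio=0.5, C=0.1, max_iter=10000),
)
enet.fit(X, y)

# Exact zeros mark features the L1 penalty effectively removed from the model.
coef = enet.named_steps["logisticregression"].coef_.ravel()
print(sum(coef == 0), "features removed")
```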

Random forest model

Random forest models are a widely used ensemble learning method that grows many bagged and decorrelated decision trees to come up with a “collective” prediction of the outcome (i.e., the outcome that is chosen by most trees in a classification problem) 72 . Individual decision trees recursively split the feature space (rules to distinguish classes) with the goal to separate the different classes of the criterion (drop out vs. remain in our case). For a detailed description of how individual decision trees operate and translate to a random forest see Pargent, Schoedel & Stachl 80 .

Unlike the elastic net, random forest models can account for nonlinear associations between features and criterion and automatically include multi-dimensional interactions between features. Each decision tree in a random forest considers a random subset of bootstrapped cases and features, thereby increasing the variance of predictions across trees and the robustness of the overall prediction. For the splitting in each node of each tree, a random subset of features (mtry hyperparameter that we optimize in our models) are used by randomly drawing from the total set. For each split, all combinations of split variables and split points are compared, with the model choosing the splits that optimize the separation between classes 72 .

The random forest algorithm can be formally described as follows (adapted from Hastie et al., 2016, p. 588):

1. For b = 1 to B:
   (a) Draw a bootstrap sample of size N from the training data.
   (b) Grow a decision tree on the bootstrapped data by recursively repeating the following steps for each terminal node of the tree, until the minimum node size is reached:
      i. Select m variables at random from the p variables.
      ii. Pick the best variable/split-point among the m according to the loss function (in our case, Gini-impurity decrease).
      iii. Split the node into two daughter nodes.
2. Output the ensemble of trees.

New predictions can then be made by generating a prediction for each tree and aggregating the results using a majority vote.

The aggregation of predictions across trees in random forests improves the prediction performance compared to individual decision trees, as it can benefit from the trees’ variance and greatly reduces it to arrive at a single prediction 72 , 81 .
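In scikit-learn terms, mtry corresponds to max_features and the minimum node size to min_samples_leaf; below is a rough sketch of such a model with illustrative hyperparameter values (again, the paper's models were fit with R's randomForest package).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=30, random_state=0)

rf = RandomForestClassifier(
    n_estimators=500,     # B: number of bootstrapped trees
    max_features="sqrt",  # m: random feature subset evaluated at each split
    min_samples_leaf=1,   # minimum terminal node size
    criterion="gini",     # split quality: Gini-impurity decrease
    random_state=0,
)
rf.fit(X, y)

# scikit-learn averages class probabilities across trees; thresholding the
# average at 0.5 reproduces the majority vote described above.
dropout_probability = rf.predict_proba(X)[:, 1]
```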

(Nested) Cross-validation: Out-of-sample model evaluation

We evaluate the performance of our predictive models using an out-of-sample validation approach. The idea behind out-of-sample validation is to increase the likelihood that a model will accurately predict student dropout on new data (e.g. new students) by using different datasets when training and evaluating the model. A commonly used, efficient technique for out-of-sample validation is to repeatedly fit (cf. training) and evaluate (cf. testing) models on non-overlapping parts of the same datasets and to combine the individual estimates across multiple iterations. This procedure – known as cross-validation – can also be used for model optimization (e.g., hyperparameter-tuning, pre-processing, variable selection), by repeatedly evaluating different settings for optimal predictive performance. When both approaches are combined, evaluation and optimization steps need to be performed in a nested fashion to ensure a strict separation of training and test data for a realistic out-of-sample performance estimation. The general idea is to emulate all modeling steps in each fold of the resampling as if it were a single in-sample model. Here, we use nested cross-validation to estimate the predictive performance of our models, to optimize model hyperparameters, and to pre-process data. We illustrate the procedure in Fig.  1 .

Figure 1: Schematic cross-validation procedure for out-of-sample predictions. The figure shows a tenfold cross-validation in the outer loop, which is used to estimate the overall performance of the model by comparing the predicted outcomes for each student in the previously unseen test set with their actual outcomes. Within each of the 10 outer loops, a fivefold cross-validation in the inner loop is used to fine-tune model hyperparameters by evaluating different model settings.

The cross-validation procedure works as follows: Say we have a dataset with 1,000 students. In a first step, the dataset is split into ten different subsamples, each containing data from 100 students. In the first round, nine of these subsamples are used for training (i.e., fitting the model to estimate parameters, green boxes). That means, the data from the first 900 students will be included in training the model to relate the different features to the retention outcome. Once training is completed, the model’s performance can be evaluated on the data of the remaining 100 students (i.e., test dataset, blue boxes). For each student, the actual outcome (retained or discontinued, grey and black figures) is compared to the predicted outcome (retained or discontinued, grey and black figures). This comparison allows for the calculation of various performance metrics (see “ Performance metrics ” section below for more details). In contrast to the application of traditional inferential statistics, the evaluation process in predictive models separates the data used to train a model from the data used to evaluate these associations. Hence any overfitting that occurs at the training stage (e.g., using researcher degrees of freedom or due to the model learning relationships that are unique to the training data), hurts the predictive performance in the testing stage. To further increase the robustness of findings and leverage the entire dataset, this process is repeated for all 10 subsamples, such that each subsample is used nine times for training and once for testing. Finally, the obtained estimates from those ten iterations are aggregated to arrive at a cross-validated estimate of model performance. This tenfold cross validation procedure is referred to as the “outer loop”.

In addition to the outer loop, our models also contain an "inner loop". The inner loop consists of an additional cross-validation procedure that is used to identify ideal hyperparameter settings (see "Hyperparameter tuning" section below). That is, in each of the ten iterations of the outer loop, the training sample is further divided into training and test sets to identify the best hyperparameter configuration before model evaluation in the outer loop. We used fivefold cross-validation in the inner loop. All analysis scripts for the pre-processing and modeling steps are available on OSF ( https://osf.io/bhaqp/?view_only=629696d6b2854aa9834d5745425cdbbc ).
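To make the nested procedure concrete, the sketch below shows how the two loops interlock. The paper's analyses were conducted in R; this is a minimal Python/scikit-learn analogue with placeholder data, not the authors' actual pipeline.

```python
# Minimal sketch of nested cross-validation (placeholder data, assumed shapes).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))      # hypothetical student features
y = rng.integers(0, 2, size=1000)    # hypothetical retention labels

# Inner loop: fivefold CV over randomly drawn hyperparameter settings.
inner = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"max_features": list(range(1, 21)),
                         "min_samples_leaf": list(range(1, 6))},
    n_iter=50, cv=5, scoring="roc_auc", random_state=0)

# Outer loop: tenfold CV estimates out-of-sample performance; each outer
# training fold re-runs the entire inner search, so test data never
# influence hyperparameter choices.
outer_auc = cross_val_score(inner, X, y, cv=10, scoring="roc_auc")
print(outer_auc.mean(), outer_auc.std())
```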

Performance metrics

We evaluate model performance based on four different metrics. Our main metric for model performance is AUC (area under the receiver operating characteristic curve). AUC is commonly used to assess the performance of a model over a 50%-chance baseline and can range between 0 and 1. The AUC metric captures the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate (TPR, or recall; i.e., the percentage of correctly classified dropouts among all students who actually dropped out) against the false positive rate (FPR; i.e., the percentage of students erroneously classified as dropouts among all the students who actually continued). When the AUC is 0.5, the model's predictive performance is equal to chance, i.e., a coin flip. The closer to 1, the better the model distinguishes between students who continued and those who dropped out.

In addition, we report the F1 score, which ranges between 0 and 1 [82]. The F1 score is the harmonic mean of the model's positive predictive value (or precision, i.e., the percentage of correctly classified dropouts among all students predicted to have dropped out) and the model's TPR. A high F1 score hence indicates that there are both few false positives and few false negatives.

Given the specific context, we also report the TPR and the true negative rate (TNR; i.e., the percentage of students correctly predicted to continue among all students who actually continued). Depending on their objective, universities might place a stronger emphasis on optimizing the TPR, to make sure that no student who is at risk of dropping out gets overlooked, or on optimizing the TNR, to save resources and ensure that students do not get overly burdened. Notably, in most cases, universities are likely to strive for a balance between the two, which is reflected in our main AUC measure. All reported performance metrics represent the mean predictive performance across the 10 cross-validation folds of the outer loop [54].
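The following sketch shows how the four metrics can be computed from out-of-sample predictions; the arrays are hypothetical, and the positive class is coded as dropout, matching the definitions above.

```python
# Sketch: computing AUC, F1, TPR and TNR from held-out predictions.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # 1 = dropped out
y_score = np.array([0.9, 0.2, 0.6, 0.8, 0.4, 0.1, 0.3, 0.7])  # predicted P(dropout)
y_pred = (y_score >= 0.5).astype(int)

auc = roc_auc_score(y_true, y_score)   # threshold-free ranking quality
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision and TPR
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)                   # dropouts correctly flagged
tnr = tn / (tn + fp)                   # continuing students correctly identified
print(auc, f1, tpr, tnr)
```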

Hyperparameter tuning

We used a randomized search with 50 iterations and fivefold cross-validation for hyperparameter tuning in the inner loop of our cross-validation. The randomized search algorithm fits models with hyperparameter configurations randomly selected from a previously defined hyperparameter space and then picks the model that shows the best generalized performance averaged over the five cross-validation folds. The best hyperparameter configuration is then used for training in the outer resampling loop to evaluate model performance.

For the elastic net classifier, we tuned the regularization parameter lambda, the decision rule used to choose lambda, and the L1-ratio parameter. The search space for lambda encompassed the 100 glmnet default values [71]. The space of decision rules for lambda included lambda.min, which chooses the value of lambda that results in the minimum mean cross-validation error, and lambda.1se, which chooses the value of lambda that results in the most regularized model such that the cross-validation error remains within one standard error of the minimum. The search space for the L1-ratio parameter ranged from 0 (ridge) to 1 (lasso). For the random forest classifier, we tuned the number of features considered for each split within a decision tree (mtry) and the minimum node size (i.e., how many cases are required to be left in the resulting end-nodes of the tree). The search space for mtry was set to a range of 1 to p, where p represents the dimensionality of the feature space. The search space for minimum node size was set to a range of 1 to 5. Additionally, for both models, we tuned two parameters of the SMOTE algorithm: the oversampling rate and the number of nearest neighbors used to generate new samples. The oversampling rate was set to a range of 2 to 15 and the number of nearest neighbors was set to a range of 1 to 10.
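As an illustration of how such a search can be wired up so that SMOTE only ever sees training folds, here is a hedged Python sketch using imbalanced-learn and scikit-learn. The authors worked in R, and imblearn's SMOTE is parameterized by a target class ratio rather than the oversampling rate described above, so the search space below is only an approximation.

```python
# Sketch: inner-loop randomized search over model and SMOTE settings.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from scipy.stats import loguniform, uniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Elastic net via penalized logistic regression; SMOTE runs inside the
# pipeline so synthetic samples are generated from training folds only.
elastic_net = Pipeline([
    ("smote", SMOTE()),
    ("clf", LogisticRegression(penalty="elasticnet", solver="saga",
                               max_iter=5000)),
])

param_space = {
    "smote__sampling_strategy": uniform(0.5, 0.5),  # target ratio in [0.5, 1.0]
    "smote__k_neighbors": list(range(1, 11)),       # 1-10 nearest neighbors
    "clf__l1_ratio": uniform(0, 1),                 # 0 = ridge ... 1 = lasso
    "clf__C": loguniform(1e-3, 1e2),                # inverse regularization strength
}

search = RandomizedSearchCV(elastic_net, param_space, n_iter=50, cv=5,
                            scoring="roc_auc", random_state=0)
# search.fit(X_train, y_train)  # X_train, y_train: one outer training fold
```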

RQ1: How accurately can we predict a student's likelihood of discontinuing their studies using information from the first term of their studies?

Figure 2 displays AUC scores (Y-axis) across the different universities (rows), separated by feature set (colors) and predictive algorithm (X-axis labels). The figure displays the distribution of AUC scores across the 10 cross-validation folds, alongside their mean and standard deviation. Independent t-tests using Holm corrections for multiple comparisons indicate statistically significant differences in predictive performance across the different models and feature sets within each university. Table 2 provides the predictive performance across all four metrics.

Figure 2. AUC performance across the four universities for different feature sets and models. Inst. = institutional data; Engag. = engagement data; (EN) = elastic net; (RF) = random forest.

Overall, our models showed high predictive accuracy across universities, models, feature sets, and performance metrics, significantly outperforming the baseline in all instances. Our main performance metric, AUC, reached an average of 73% (where 50% is chance), with a maximum of 88% for the random forest model and the full feature set in University 1. Both institutional features and engagement features contributed significantly to predictive performance, highlighting the fact that a student's likelihood of dropping out is a function of both their more stable socio-demographic characteristics and their experience of campus life. In most cases, the joint model (i.e., the combination of institutional and engagement features) performed better than either of the individual models alone. Finally, the random forest models produced higher predictive performance than the elastic net in most cases (average AUC: elastic net = 70%, random forest = 75%), suggesting that the features are likely to interact with one another in predicting student retention and might not always be linearly related to the outcome.

RQ2: Which features are the most predictive of student retention?

To provide insights into the underlying relationships between student retention and socio-demographic as well as behavioral features, we examined two indicators of feature importance that each offer unique insights. First, we calculated the zero-order correlations between the features and the outcome for each of the four universities. We chose zero-order correlations over elastic net coefficients as they represent the relationships unaltered by the model's regularization procedure (i.e., the relationship between a feature and the outcome is shown independently of the importance of the other features in the model). To improve the robustness of our findings, we only included the variables that passed the threshold for data inclusion in our models and had less than 50% of the data imputed. The top third of Table 3 displays the 10 most important features (i.e., highest absolute correlation with retention). The sign in brackets indicates the direction of the effect, with (+) indicating a protective factor and (−) indicating a risk factor. Features that showed up in the top 10 for more than one university are highlighted in bold.
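For a binary outcome such as retention, the zero-order correlation with a continuous feature is the point-biserial correlation; a minimal sketch with hypothetical feature names follows.

```python
# Sketch: ranking features by absolute zero-order correlation with retention.
import pandas as pd
from scipy.stats import pointbiserialr

df = pd.DataFrame({                       # hypothetical per-student data
    "gpa": [3.2, 2.1, 3.8, 2.9, 3.5, 1.9],
    "events_attended": [5, 0, 8, 2, 6, 1],
    "retained": [1, 0, 1, 0, 1, 0],       # 1 = continued studies
})

corrs = {}
for col in ["gpa", "events_attended"]:
    r, _ = pointbiserialr(df["retained"], df[col])
    corrs[col] = r

ranked = sorted(corrs.items(), key=lambda kv: abs(kv[1]), reverse=True)
print(ranked)                             # most predictive features first
```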

Second, we calculated model-based feature importance scores for the elastic net and random forest models. For the elastic net model, feature importance is reported as the model coefficient after shrinking the coefficients according to their incremental predictive power. Compared to the zero-order correlation, the elastic net coefficients hence identify the features that carry the strongest unique variance. For the random forest models, feature importance is reported as permutation importance, a model-agnostic metric that estimates the importance of a feature by observing the drop in model predictive performance when the actual association between the feature and the outcome is broken by randomly shuffling observations [72, 83]. A feature is considered important if shuffling its values increases the model error (and therefore decreases the model's predictive performance). In contrast to the coefficients from the elastic net model, the permutation feature importance scores are undirected and do not provide insights into the specific nature of the relationship between the feature and the outcome. However, they account for the fact that some features might not be predictive themselves but could still prove valuable to overall model performance because they moderate the impact of other features. For example, minority or first-generation students might benefit more from being embedded in a strong social network than majority students who do not face the same barriers and are likely to have a stronger external support network. The bottom of Table 3 displays the 10 most important features in the elastic net and random forest models (i.e., highest variable importance).
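A hedged sketch of the permutation idea, using scikit-learn's implementation on synthetic data (the authors used R; the data and feature indices here are purely illustrative):

```python
# Sketch: permutation importance = drop in held-out AUC after shuffling
# one feature at a time.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                          # synthetic features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=300) > 0).astype(int)

model = RandomForestClassifier(random_state=0).fit(X[:200], y[:200])
res = permutation_importance(model, X[200:], y[200:],  # held-out data only
                             scoring="roc_auc", n_repeats=10, random_state=0)
for i in res.importances_mean.argsort()[::-1]:
    print(f"feature {i}: mean AUC drop = {res.importances_mean[i]:.3f}")
```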

Supporting the findings reported in RQ1, the zero-order correlations confirm that both institutional and behavioral engagement features play an important role in predicting student retention. In line with prior work, students' performance (measured by GPA or ACT) repeatedly appeared as one of the most important predictors across universities and models. In addition, many of the engagement features (e.g., services attended, chat-message network centrality) are related to social activities or network position, supporting the notion that a student's social connections and support play a critical role in student retention. Moreover, the extent to which students are positively engaged with their institutions (e.g., by attending events and rating them highly) appears to play a critical role in preventing dropout.

RQ3: How well do the predictive models generalize across universities?

To test the generalizability of our models across universities, we used the predictive model trained on one university (e.g., University 1) to predict retention at the remaining three universities (e.g., Universities 2–4). Figure 3A,B displays the AUCs across all possible pairs, indicating which university was used for training (X-axis) and which was used for testing (Y-axis; see Figure S1 in the SI for the corresponding F1, TNR, and TPR results). A code sketch of this train-on-one, test-on-another procedure follows the figure.

Figure 3. Performance (average AUC) of cross-university predictions.
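Procedurally, the test amounts to fitting on one site's full dataset and scoring another's. A minimal sketch, assuming a hypothetical `data` dict mapping university names to feature matrices and labels, and any fitted classifier with `predict_proba`:

```python
# Sketch: cross-university generalization of a retention model.
from sklearn.metrics import roc_auc_score

def cross_university_auc(model, data):
    """data: {"U1": (X, y), ...}; returns AUC for every train/test pair."""
    scores = {}
    for train_u, (X_tr, y_tr) in data.items():
        model.fit(X_tr, y_tr)                      # train on one university
        for test_u, (X_te, y_te) in data.items():
            if test_u != train_u:                  # evaluate on the others
                proba = model.predict_proba(X_te)[:, 1]
                scores[(train_u, test_u)] = roc_auc_score(y_te, proba)
    return scores
```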

Overall, we observed reasonably high levels of predictive performance when applying a model trained on one university to the data of another. The average AUC was 63% (for both the elastic net and the random forest), with the highest predictive performance reaching 74% (trained on University 1, predicting University 2), just one percentage point short of the performance achieved by the university's own model (trained on University 2, predicting University 2). Contrary to the findings in RQ1, the random forest models did not perform better than the elastic net when making predictions for other universities. This suggests that the benefits afforded by the random forest models stem from complex interaction patterns that are somewhat unique to each university and might not generalize well to new contexts. The main outlier in generalizability was University 4: none of the other universities' models reached accuracies much better than chance when predicting its students, and its own model produced relatively low accuracy when predicting student retention at Universities 1–2. This is likely a result of the fact that University 4 was qualitatively different from the other universities in several ways, including the fact that it was a community college and consisted of 16 different campuses that were merged for the purpose of this analysis (see Methods for more details).

We show that student retention can be predicted from institutional data, behavioral engagement data, and their combination. Using data from over 50,000 students across four universities, our predictive models achieve out-of-sample accuracies of up to 88% (where 50% is chance). Notably, while both institutional data and behavioral engagement data significantly predict retention, the combination of the two performs best in most instances. This finding is further supported by our feature importance analyses, which suggest that both institutional and behavioral engagement features are among the most important predictors of student retention. Specifically, academic performance as measured by GPA, behavioral metrics associated with campus engagement (e.g., event attendance or ratings), and a student's position in the network (e.g., closeness or centrality) were shown to consistently act as protective factors. Finally, we highlight the generalizability of our models across universities. Models trained on one university were able to predict student retention at another with reasonably high levels of predictive performance. As one might expect, generalizability across universities heavily depends on the extent to which the universities are similar on important structural dimensions, with prediction accuracies dropping radically in cases where similarity is low (see the low cross-generalization for University 4).

Contributions to the scientific literature

Our findings contribute to the existing literature in several ways. First, they respond to recent calls for more predictive research in psychology [54, 55] as well as the use of Big Data analytics in education research [56, 57]. Not only do our models consider socio-demographic characteristics that are collected by universities, but they also capture students' daily experience and university engagement by tracking behaviors via the READY Education app. Our findings suggest that these more psychological predictors of student retention can improve the performance of predictive models above and beyond socio-demographic variables. This is consistent with previous findings suggesting that the inclusion of engagement metrics improves the performance of predictive models [16, 84, 85]. Overall, our models showed superior accuracies to models from former studies that were trained only on demographics and transcript records [15, 25] or less comprehensive behavioral features [16], and they provided results comparable to those reported in studies that additionally included a wide range of socio-economic variables [12]. Given that the READY Education app captures only a fraction of the students' actual experience, the high predictive accuracies make an even stronger case for the importance of student engagement in college retention.

Second, our findings provide insights into the features that are most important in predicting whether a student is going to drop out or not. By doing so, they complement our predictive approach with layers of understanding that are conducive not only to validating our models but also to generating insights into potential protective and risk factors. Most importantly, our findings highlight the relevance of the behavioral engagement metrics for predicting student retention. Most features identified as important in the prediction were related to app and community engagement. In line with previous research, features indicative of early and deep social integration, such as interactions with peers and faculty or the development of friendships and social networks, were found to be highly predictive [16, 41]. For example, it is reasonable to assume that a short time between app registration and the first visit to a campus event (one of the features identified as important) has a positive impact on retention, because campus events offer ideal opportunities for students to socialize [86]. Early participation in a campus event implies early integration and networking with others, protecting students from perceived stress [87] and providing better social and emotional support [88]. In contrast, a student who never attends an event, or does so very late in the semester, may be less connected to campus life and the student community, which in turn increases the likelihood of dropping out. This interpretation is strengthened by the fact that a high proportion of positive event ratings was identified as an important predictor of a student continuing their studies. Students who enjoy an event are likely to feel more comfortable, be embedded in university life, and build more and stronger connections. This might result in a virtuous cycle in which students continue attending events and over time create a strong social connection to their peers. As in most previous work, a high GPA was consistently related to a higher likelihood of continuing one's studies [21, 24]. Although its importance varied across universities, ethnicity was also found to play a major role in retention, with consistent inequalities replicating in our predictive models [12, 19, 47]. For example, Black students were on average more likely to drop out, suggesting that universities should dedicate additional resources to protect this group. Importantly, all qualitative interpretations are post-hoc. While many of the findings are intuitive and align with previous research on the topic, future studies should validate our results and investigate the causality underlying the effects in experimental or longitudinal within-person designs [54, 78].

Finally, our findings are the first to explore the extent to which the relationships between certain socio-demographic and behavioral characteristics might be idiosyncratic and unique to a specific university. By being able to compare the models across four different universities, we were able to show that many of the insights gained from one university can be leveraged to predict student retention at another. However, our findings also point to important boundary conditions: The more dissimilar universities are in their organizational structures and student experience, the more idiosyncratic the patterns between certain socio-demographic and behavioral features with student retention will be and the harder it is to merely translate general insights to the specific university campus.

Practical contributions

Our findings also have important practical implications. In the US, student attrition results in an average annual revenue loss of approximately $16.5 billion [9, 10] and over $9 billion wasted in federal and state grants and subsidies that are awarded to students who do not finish their degree [11]. Hence, it is critical to identify potential dropouts as early and as accurately as possible in order to offer dedicated support and allocate resources where they are needed the most. Our models rely exclusively on data collected in the first semester at university and are therefore an ideal "early warning" system for universities that want to predict whether their students will likely continue their studies or drop out at some point. Depending on the university's resources and goals, the predictive models can be optimized for different performance measures. Indeed, a university might decide to focus on the true positive rate to capture as many dropouts as possible. While this would mean erroneously classifying "healthy" students as potential dropouts, universities might decide that the burden of providing "unnecessary" support to these healthy students is worth the reduced risk of missing a dropout. Importantly, our models go beyond mere socio-demographic variables and allow for a more nuanced, personal model that considers not just "who someone is" but also what their experience on campus looks like. As such, our models make it possible to acknowledge individuality rather than relying on over-generalized assessments of entire socio-demographic segments.

Importantly, however, it is critical to subject these models to continuous quality assurance. While predictive models could allow universities to flag at-risk students early, they could also perpetuate biases that become calcified in the predictive models themselves. For example, students who are traditionally less likely to discontinue their studies might have to display a much higher level of dysfunctional engagement behavior before their file gets flagged as "at-risk". Similarly, a person from a traditionally underrepresented group might receive an unnecessarily high volume of additional check-ins even though they are generally flourishing in their day-to-day experience. Given that being labeled as "at-risk" carries a stigma that could be particularly harmful to historically marginalized groups, it will be critical to monitor both the performance of the model over time and the perception of its helpfulness among administrators, faculty, and students.

Limitations and future research

Our study has several limitations and highlights avenues for future research. First, our sample consisted of four US universities. Thus, our results are not necessarily generalizable to countries with more collectivistic cultures and other education systems, such as those in Asia, where the reasons for dropping out might be different [89, 90], or in Europe, where most students work part-time jobs and live off-campus. Future research should investigate the extent to which our models generalize to other cultural contexts and identify the features of student retention that are universally valid across contexts.

Second, our predictive models relied on app usage data. Therefore, our predictive approach could only be applied to students who decided to use the app. This selection, in and of itself, is likely to introduce a sampling bias, as students who decide to use the app might be more likely to be retained in the first place, restricting the variance in observations and excluding students for whom app usage data were not available. However, as our findings suggest, the institutional data alone provide predictive performance independent of the app features, making them a viable alternative for students who do not use the app.

Third, our predictive models rely on cross-sectional predictions. That is, we observe a student's behavior over the course of an entire semester and, based on the patterns observed in other students, predict whether that student is likely to drop out or not. Future research could try to improve both the predictive performance of the model and its usefulness for applied contexts by modeling within-person trends dynamically. Given enough data, the model could observe a person's baseline behavior and identify changes from that baseline as potentially problematic. For example, more social contact with other students might be considered a protective factor in our cross-sectional model. However, there are substantial individual differences in how much social contact individuals seek out and enjoy [91]. Hence, sending 10 chat messages a week might be a lot for one person, but very little for another. Future research should therefore investigate whether the behavioral engagement features allow for a more dynamic within-person model that takes base rates into account and provides a dynamic, momentary assessment of a student's likelihood of dropping out.

Fourth, although the engagement data were captured as a longitudinal time series with time-stamped events, we collapsed the data into a single set of cross-sectional features for each student. Although some of these features capture variation in behaviors over time (e.g., entropy and linear trends), future research should try to implement more advanced machine learning models that account for the time series data directly. For example, long short-term memory networks (LSTMs) [92], a type of recurrent neural network, are capable of learning patterns in longitudinal, sequential data like ours.
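To illustrate the idea (this is not part of the original analyses), here is a minimal PyTorch sketch of an LSTM that consumes a per-student sequence of weekly engagement vectors and outputs a dropout probability; all shapes and feature counts are hypothetical.

```python
# Sketch: sequence model for weekly engagement data (hypothetical shapes).
import torch
import torch.nn as nn

class RetentionLSTM(nn.Module):
    def __init__(self, n_features=12, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, weeks, n_features)
        _, (h_n, _) = self.lstm(x)        # h_n: final hidden state
        return torch.sigmoid(self.head(h_n[-1]))  # predicted P(dropout)

model = RetentionLSTM()
weekly = torch.randn(4, 15, 12)           # 4 students x 15 weeks x 12 features
print(model(weekly).shape)                # -> torch.Size([4, 1])
```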

Fifth, even though the current research provides initial insights into the workings of the models by highlighting the importance of certain features, the conclusions that can be drawn from these analyses are limited as the importance metrics are calculated for the overall population. Future research could aim to calculate the importance of certain features at the individual level to test whether their importance varies across certain socio-demographic features. Estimating the importance of a person’s position in the social network on an individual level, for example, would make it possible to see whether the importance is correlated with institutional data such as minority or first-generation status.

Finally, our results lay the foundation for developing interventions that foster retention by shaping students' experience at university [93]. Interventions that have been shown to have a positive effect on retention include orientation programs and academic advising [94], student support services like mentoring and coaching, as well as need-based grants [95]. However, to date, the first-year experience programs meant to strengthen the social integration of first-year students do not seem to have yielded positive results [96, 97]. Our findings could support the development of interventions aimed at improving and maintaining student integration on campus. On a high level, the insights into the most important features provide an empirical path for developing relevant interventions that target the most important levers of student retention. For example, the fact that the time between registration and the first event attendance has such a big impact on student retention means that universities should do everything they can to get students to attend events as early as possible. Similarly, they could develop interventions that lead to more cohesive networks among cohorts and make sure that all students connect to their community. On a deeper, more sophisticated level, new approaches to model explainability could allow universities to tailor their interventions to each student [98, 99]. For example, explainable AI makes it possible to derive decision rules for each student, indicating which features were critical in predicting that student's outcome. While student A might be predicted to drop out because they are disconnected from the network, student B might be predicted to drop out because they do not access the right information on the app. Given this information, universities would be able to personalize their offerings to the specific needs of the student. While student A might be encouraged to spend more time socializing with other students, student B might be reminded to check out important course information. Hence, predictive models could not only be used to identify students at risk but also provide an automated path to offering personalized guidance and support.
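As a sketch of what such per-student explanations could look like in practice, the snippet below uses the SHAP library, in the spirit of the explainability approaches in refs. 98 and 99; `model` and `X_test` are assumed to come from an earlier fitted pipeline and are not defined here.

```python
# Hedged sketch: per-student feature attributions with SHAP.
import shap

explainer = shap.TreeExplainer(model)        # model: a fitted tree ensemble (assumed)
shap_values = explainer.shap_values(X_test)  # attributions per feature, per student
# For a given student, large positive attributions mark the features that
# pushed the prediction toward "dropout" (e.g., network isolation for one
# student, low app engagement for another), supporting tailored outreach.
```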

For every discontinued course of study, an educational dream shatters. And every shattered dream has a negative long-term impact on both the student and the university the student attended. In this study, we introduce an approach to accurately predicting student retention after the first term. Our results show that student retention can be predicted with relatively high levels of predictive performance when considering institutional data, behavioral engagement data, or a combination of the two. By combining socio-demographic characteristics with passively observed behavioral traces reflecting a student's daily activities, our models offer a holistic picture of students' university experiences and their relation to retention. Overall, such predictive models have great potential both for the early identification of at-risk students and for enabling timely, evidence-based interventions.

Data availability

Raw data are not publicly available due to their proprietary nature and the risks associated with de-anonymization, but they are available from the corresponding author upon reasonable request. The pre-processed data and all analysis code are available on OSF ( https://osf.io/bhaqp/ ) to facilitate reproducibility of our work. Data were analyzed using R, version 4.0.0 (R Core Team, 2020; see subsections for specific packages and versions used). The study's design relies on secondary data and the analyses were not preregistered.

Change history

21 June 2023

A Correction to this paper has been published: https://doi.org/10.1038/s41598-023-36579-2

References

Ginder, S. A., Kelly-Reid, J. E. & Mann, F. B. Graduation Rates for Selected Cohorts, 2009–14; Outcome Measures for Cohort Year 2009–10; Student Financial Aid, Academic Year 2016–17; and Admissions in Postsecondary Institutions, Fall 2017. First Look (Provisional Data). NCES 2018–151. National Center for Education Statistics (2018).

Snyder, T. D., de Brey, C. & Dillow, S. A. Digest of Education Statistics 2017 NCES 2018-070. Natl. Cent. Educ. Stat. (2019).

NSC Research Center. Persistence & Retention – 2019. NSC Research Center https://nscresearchcenter.org/snapshotreport35-first-year-persistence-and-retention/ (2019).

Bound, J., Lovenheim, M. F. & Turner, S. Why have college completion rates declined? An analysis of changing student preparation and collegiate resources. Am. Econ. J. Appl. Econ. 2 , 129–157 (2010).

Bowen, W. G., Chingos, M. M. & McPherson, M. S. Crossing the Finish Line (Princeton University Press, 2009).

McFarland, J. et al. The Condition of Education 2019. NCES 2019-144. Natl. Cent. Educ. Stat. (2019).

U.S. Department of Education. Fact sheet: Focusing higher education on student success. [Fact Sheet] (2015).

Freudenberg, N. & Ruglis, J. Peer reviewed: Reframing school dropout as a public health issue. Prev. Chronic Dis. 4 , 4 (2007).

Raisman, N. The cost of college attrition at four-year colleges & universities-an analysis of 1669 US institutions. Policy Perspect. (2013).

Wellman, J., Johnson, N. & Steele, P. Measuring (and Managing) the Invisible Costs of Postsecondary Attrition. Policy brief. Delta Cost Proj. Am. Instit. Res. (2012).

Schneider, M. Finishing the first lap: The cost of first year student attrition in America’s four year colleges and universities (American Institutes for Research, 2010).

Delen, D. A comparative analysis of machine learning techniques for student retention management. Decis. Support Syst. 49 , 498–506 (2010).

Yu, R., Lee, H. & Kizilcec, R. F. Should College Dropout Prediction Models Include Protected Attributes? in Proceedings of the Eighth ACM Conference on Learning@ Scale 91–100 (2021).

Tinto, V. Reconstructing the first year of college. Plan. High. Educ. 25 , 1–6 (1996).

Ortiz-Lozano, J. M., Rua-Vieites, A., Bilbao-Calabuig, P. & Casadesús-Fa, M. University student retention: Best time and data to identify undergraduate students at risk of dropout. Innov. Educ. Teach. Int. 57 , 74–85 (2020).

Ram, S., Wang, Y., Currim, F. & Currim, S. Using big data for predicting freshmen retention. in 2015 international conference on information systems: Exploring the information frontier, ICIS 2015 (Association for Information Systems, 2015).

Levitz, R. S., Noel, L. & Richter, B. J. Strategic moves for retention success. N. Dir. High. Educ. 1999 , 31–49 (1999).

Veenstra, C. P. A strategy for improving freshman college retention. J. Qual. Particip. 31 , 19–23 (2009).

Astin, A. W. How, “good” is your institution’s retention rate?. Res. High. Educ. 38 , 647–658 (1997).

Coleman, J. S. Social capital in the creation of human capital. Am. J. Sociol. 94 , S95–S120 (1988).

Reason, R. D. Student variables that predict retention: Recent research and new developments. J. Stud. Aff. Res. Pract. 40 , 704–723 (2003).

Tinto, V. Dropout from higher education: A theoretical synthesis of recent research. Rev Educ Res 45 , 89–125 (1975).

Tinto, V. Completing college: Rethinking institutional action (University of Chicago Press, 2012).

Astin, A. Retaining and Satisfying Students. Educ. Rec. 68 , 36–42 (1987).

Aulck, L., Velagapudi, N., Blumenstock, J. & West, J. Predicting student dropout in higher education. arXiv preprint arXiv:1606.06364 (2016).

Bogard, M., Helbig, T., Huff, G. & James, C. A comparison of empirical models for predicting student retention (Western Kentucky University, 2011).

Murtaugh, P. A., Burns, L. D. & Schuster, J. Predicting the retention of university students. Res. High. Educ. 40 , 355–371 (1999).

Porter, K. B. Current trends in student retention: A literature review. Teach. Learn. Nurs. 3 , 3–5 (2008).

Thomas, S. L. Ties that bind: A social network approach to understanding student integration and persistence. J. High. Educ. 71 , 591–615 (2000).

Peltier, G. L., Laden, R. & Matranga, M. Student persistence in college: A review of research. J. Coll. Stud. Ret. 1 , 357–375 (2000).

Nandeshwar, A., Menzies, T. & Nelson, A. Learning patterns of university student retention. Expert Syst. Appl. 38 , 14984–14996 (2011).

Boero, G., Laureti, T. & Naylor, R. An econometric analysis of student withdrawal and progression in post-reform Italian universities. (2005).

Tinto, V. Leaving college: Rethinking the causes and cures of student attrition (ERIC, 1987).

Choy, S. Students whose parents did not go to college: Postsecondary access, persistence, and attainment. Findings from the condition of education, 2001. (2001).

Ishitani, T. T. Studying attrition and degree completion behavior among first-generation college students in the United States. J. High. Educ. 77 , 861–885 (2006).

Thayer, P. B. Retention of students from first generation and low income backgrounds. (2000).

Britt, S. L., Ammerman, D. A., Barrett, S. F. & Jones, S. Student loans, financial stress, and college student retention. J. Stud. Financ. Aid 47 , 3 (2017).

McKinney, L. & Burridge, A. B. Helping or hindering? The effects of loans on community college student persistence. Res. High Educ. 56 , 299–324 (2015).

Hochstein, S. K. & Butler, R. R. The effects of the composition of a financial aids package on student retention. J. Stud. Financ. Aid 13 , 21–26 (1983).

Singell, L. D. Jr. Come and stay a while: Does financial aid effect retention conditioned on enrollment at a large public university?. Econ. Educ. Rev. 23 , 459–471 (2004).

Bean, J. P. Nine themes of college student. Coll. Stud. Retent. Formula Stud. Success 215 , 243 (2005).

Tinto, V. Through the eyes of students. J. Coll. Stud. Ret. 19 , 254–269 (2017).

Cabrera, A. F., Nora, A. & Castaneda, M. B. College persistence: Structural equations modeling test of an integrated model of student retention. J. High. Educ. 64 , 123–139 (1993).

Roberts, J. & Styron, R. Student satisfaction and persistence: Factors vital to student retention. Res. High. Educ. J. 6 , 1 (2010).

Gopalan, M. & Brady, S. T. College students’ sense of belonging: A national perspective. Educ. Res. 49 , 134–137 (2020).

Hoffman, M., Richmond, J., Morrow, J. & Salomone, K. Investigating, “sense of belonging” in first-year college students. J. Coll. Stud. Ret. 4 , 227–256 (2002).

Terenzini, P. T. & Pascarella, E. T. Toward the validation of Tinto’s model of college student attrition: A review of recent studies. Res. High Educ. 12 , 271–282 (1980).

Astin, A. W. The impact of dormitory living on students. Educational record (1973).

Astin, A. W. Student involvement: A developmental theory for higher education. J. Coll. Stud. Pers. 25 , 297–308 (1984).

Terenzini, P. T. & Pascarella, E. T. Studying college students in the 21st century: Meeting new challenges. Rev. High Ed. 21 , 151–165 (1998).

Thompson, J., Samiratedu, V. & Rafter, J. The effects of on-campus residence on first-time college students. NASPA J. 31 , 41–47 (1993).

Tinto, V. Research and practice of student retention: What next?. J. Coll. Stud. Ret. 8 , 1–19 (2006).

Lazer, D. et al. Computational social science. Science 323, 721–723 (2009).

Yarkoni, T. & Westfall, J. Choosing prediction over explanation in psychology: Lessons from machine learning. Perspect. Psychol. Sci. 12 , 1100–1122 (2017).

Peters, H., Marrero, Z. & Gosling, S. D. The Big Data toolkit for psychologists: Data sources and methodologies. in The psychology of technology: Social science research in the age of Big Data. 87–124 (American Psychological Association, 2022). doi: https://doi.org/10.1037/0000290-004 .

Fischer, C. et al. Mining big data in education: Affordances and challenges. Rev. Res. Educ. 44 , 130–160 (2020).

Hilbert, S. et al. Machine learning for the educational sciences. Rev. Educ. 9 , e3310 (2021).

National Academy of Education. Big data in education: Balancing the benefits of educational research and student privacy . (2017).

Aulck, L., Nambi, D., Velagapudi, N., Blumenstock, J. & West, J. Mining university registrar records to predict first-year undergraduate attrition. Int. Educ. Data Min. Soc. (2019).

Beaulac, C. & Rosenthal, J. S. Predicting university students’ academic success and major using random forests. Res. High Educ. 60 , 1048–1064 (2019).

Berens, J., Schneider, K., Görtz, S., Oster, S. & Burghoff, J. Early detection of students at risk–predicting student dropouts using administrative student data and machine learning methods. Available at SSRN 3275433 (2018).

Dawson, S., Jovanovic, J., Gašević, D. & Pardo, A. From prediction to impact: Evaluation of a learning analytics retention program. in Proceedings of the seventh international learning analytics & knowledge conference 474–478 (2017).

Dekker, G. W., Pechenizkiy, M. & Vleeshouwers, J. M. Predicting students drop Out: A case study. Int. Work. Group Educ. Data Min. (2009).

del Bonifro, F., Gabbrielli, M., Lisanti, G. & Zingaro, S. P. Student dropout prediction. in International Conference on Artificial Intelligence in Education 129–140 (Springer, 2020).

Hutt, S., Gardner, M., Duckworth, A. L. & D’Mello, S. K. Evaluating fairness and generalizability in models predicting on-time graduation from college applications. Int. Educ. Data Min. Soc. (2019).

Jayaprakash, S. M., Moody, E. W., Lauría, E. J. M., Regan, J. R. & Baron, J. D. Early alert of academically at-risk students: An open source analytics initiative. J. Learn. Anal. 1 , 6–47 (2014).

Balakrishnan, G. & Coetzee, D. Predicting student retention in massive open online courses using hidden markov models. Elect. Eng. Comput. Sci. Univ. Calif. Berkeley 53 , 57–58 (2013).

Hastie, T., Tibshirani, R. & Friedman, J. The elements of statistical learning (Springer series in statistics, New York, NY, USA, 2001).

Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 16 , 321–357 (2002).

Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Seri. B Stat. Methodol. 67 , 301–320 (2005).

Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33 , 1 (2010).

Breiman, L. Random forests. Mach. Learn. 45 , 5–32 (2001).

Liaw, A. & Wiener, M. Classification and regression by randomForest. R News 2 , 18–22 (2002).

Pargent, F., Schoedel, R. & Stachl, C. An introduction to machine learning for psychologists in R. Psyarxiv (2022).

Hoerl, A. E. & Kennard, R. W. Ridge Regression. in Encyclopedia of Statistical Sciences vol. 8 129–136 (John Wiley & Sons, Inc., 2004).

Tibshirani, R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B (Methodol.) 58 , 267–288 (1996).

Hastie, T. & Qian, J. Glmnet vignette. vol. 9 1–42 https://hastie.su.domains/Papers/Glmnet_Vignette.pdf (2016).

Orrù, G., Monaro, M., Conversano, C., Gemignani, A. & Sartori, G. Machine learning in psychometrics and psychological research. Front. Psychol. 10 , 2970 (2020).

Pargent, F. & Albert-von der Gönna, J. Predictive modeling with psychological panel data. Z Psychol (2019).

Pargent, F., Schoedel, R. & Stachl, C. Best practices in supervised machine learning: A tutorial for psychologists. Doi: https://doi.org/10.31234/osf.io/89snd (2023).

Friedman, J., Hastie, T. & Tibshirani, R. The elements of statistical learning Vol. 1 (Springer series in statistics, 2001).

Van Rijsbergen, C. J. Information Retrieval (Butterworths, London, 1979).

Molnar, C. Interpretable Machine Learning (Lulu.com, 2020).

Aguiar, E., Ambrose, G. A., Chawla, N. V., Goodrich, V. & Brockman, J. Engagement vs performance: Using electronic portfolios to predict first semester engineering student persistence. J. Learn. Anal. 1 (2014).

Chai, K. E. K. & Gibson, D. Predicting the risk of attrition for undergraduate students with time based modelling. Int. Assoc. Dev. Inf. Soc. (2015).

Saenz, T., Marcoulides, G. A., Junn, E. & Young, R. The relationship between college experience and academic performance among minority students. Int. J. Educ. Manag (1999).

Pidgeon, A. M., Coast, G., Coast, G. & Coast, G. Psychosocial moderators of perceived stress, anxiety and depression in university students: An international study. Open J. Soc. Sci. 2 , 23 (2014).

Wilcox, P., Winn, S. & Fyvie-Gauld, M. ‘It was nothing to do with the university, it was just the people’: The role of social support in the first-year experience of higher education. Stud. High. Educ. 30 , 707–722 (2005).

Guiffrida, D. A. Toward a cultural advancement of Tinto’s theory. Rev. High Ed. 29 , 451–472 (2006).

Triandis, H. C., McCusker, C. & Hui, C. H. Multimethod probes of individualism and collectivism. J. Pers. Soc. Psychol. 59 , 1006 (1990).

Watson, D. & Clark, L. A. Extraversion and its positive emotional core. in Handbook of personality psychology 767–793 (Elsevier, 1997).

Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R. & Schmidhuber, J. LSTM: A search space odyssey. IEEE Trans. Neural Netw. Learn. Syst. 28 , 2222–2232 (2017).

Arnold, K. E. & Pistilli, M. D. Course signals at Purdue: Using learning analytics to increase student success. in Proceedings of the 2nd international conference on learning analytics and knowledge 267–270 (2012).

Braxton, J. M. & McClendon, S. A. The fostering of social integration and retention through institutional practice. J. Coll. Stud. Ret. 3 , 57–71 (2001).

Sneyers, E. & de Witte, K. Interventions in higher education and their effect on student success: A meta-analysis. Educ. Rev. (Birm) 70 , 208–228 (2018).

Jamelske, E. Measuring the impact of a university first-year experience program on student GPA and retention. High Educ. (Dordr) 57 , 373–391 (2009).

Purdie, J. R. & Rosser, V. J. Examining the academic performance and retention of first-year students in living-learning communities and first-year experience courses. Coll. Stud. Aff. J. 29 , 95 (2011).

Lundberg, S. M. et al. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2 , 56–67 (2020).

Ramon, Y., Farrokhnia, R. A., Matz, S. C. & Martens, D. Explainable AI for psychological profiling from behavioral data: An application to big five personality predictions from financial transaction records. Information 12 , 518 (2021).

Author information

Alice Dinu is an Independent Researcher.

Authors and Affiliations

Columbia University, New York, USA

Sandra C. Matz & Heinrich Peters

Ludwig Maximilian University of Munich, Munich, Germany

Christina S. Bukow

Ready Education, Montreal, Canada

Christine Deacons

University of St. Gallen, St. Gallen, Switzerland

Clemens Stachl

Montreal, Canada

Contributions

S.C.M., C.B., A.D., H.P., and C.S. designed the research. C.D. and A.D. provided the data. S.C.M., C.B., and H.P. analyzed the data. S.C.M. and C.B. wrote the manuscript. All authors reviewed the manuscript. Earlier versions of this research were part of C.B.'s master's thesis, which was supervised by S.C.M. and C.S.

Corresponding author

Correspondence to Sandra C. Matz .

Ethics declarations

Competing interests

C.D. is a former employee of Ready Education. None of the other authors have conflicts of interest related to this submission.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this Article was revised: Alice Dinu was omitted from the author list in the original version of this Article. The Author Contributions section now reads: “S.C.M., C.B, A.D., H.P., and C.S. designed the research. C.D. and A.D. provided the data. S.C.M, C.B. and H.P. analyzed the data. S.C.M and C.B. wrote the manuscript. All authors reviewed the manuscript. Earlier versions of this research were part of the C.B.’s masters thesis which was supervised by S.C.M. and C.S.” Additionally, the Article contained an error in Data Availability section and the legend of Figure 2 was incomplete.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article

Matz, S.C., Bukow, C.S., Peters, H. et al. Using machine learning to predict student retention from socio-demographic characteristics and app-based engagement metrics. Sci Rep 13 , 5705 (2023). https://doi.org/10.1038/s41598-023-32484-w

Download citation

Received: 09 August 2022

Accepted: 28 March 2023

Published: 07 April 2023

DOI: https://doi.org/10.1038/s41598-023-32484-w



Research Projects with Graph Research Lab @ ANU

11 Apr 2024

Graph Research Lab @ ANU (https://graphlabanu.github.io/website/) aims to investigate graph-related problems by marrying the best of two worlds: traditional graph algorithms and new machine learning techniques to bridge the gap between combinatorial generalization and deep learning.

Graph Research Lab offers the following projects:

(1) Graph Deep Learning - Theory and Practice

Graph deep learning aims to apply deep learning techniques to learn from complex data represented as graphs. In recent years, owing to rising interest in network analysis and prediction, Graph Neural Networks (GNNs) have become a powerful deep learning approach widely applied in various fields, e.g., object recognition, image classification, and semantic segmentation. However, graphs live in irregular, non-Euclidean domains. This raises the challenge of how to design deep learning techniques that can effectively extract useful features from (possibly arbitrary) graphs. This project looks into recent progress on the theoretical foundations of graph deep learning to explore the connections among the representation power, generalisation ability, and optimisation performance of GNNs, as well as various graph properties.
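For readers new to the area, the following minimal NumPy sketch shows the core message-passing idea underlying GNNs: each node updates its features by aggregating its neighbours' features and applying a learned transformation. All matrices here are hypothetical placeholders, not part of the project description.

```python
# Sketch: one mean-aggregation message-passing layer on a random graph.
import numpy as np

def gnn_layer(A, H, W):
    """A: (n, n) adjacency with self-loops; H: (n, d) node features;
    W: (d, d_out) learnable weights. Returns updated node features."""
    deg = A.sum(axis=1, keepdims=True)           # node degrees (>= 1 here)
    return np.maximum((A @ H) / deg @ W, 0.0)    # mean-aggregate, project, ReLU

n, d = 5, 8
A = np.eye(n) + (np.random.rand(n, n) > 0.7)     # random graph + self-loops
H = np.random.randn(n, d)                        # initial node features
W = np.random.randn(d, 4)                        # layer weights
print(gnn_layer(A, H, W).shape)                  # -> (5, 4)
```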

Students are required to have a solid background in mathematics and computer science. ANU students are expected to have finished COMP4670 or have an equivalent experience.

Background literature:

  • A New Perspective on "How Graph Neural Networks Go Beyond Weisfeiler-Lehman?", A. Wijesinghe and Q. Wang, ICLR 2022.
  • Future Directions in Foundations of Graph Machine Learning, Morris et al., arXiv 2024.

(2) Scalable Graph Algorithms for Data Science/Network Science

Graph algorithms provide fundamental and powerful ways of exploiting the structure of graphs. In today’s real-world applications, graphs are ubiquitously used for representing complex objects and their relationships such as cities in a road network, atoms in a molecule, friendships in social networks, connections in computer networks, and links among web pages. This project aims to develop new techniques and algorithms to advance state-of-the-art research in the area of graph algorithms, graph theory, and network science.

Students are required to have a solid background in algorithms and excellent programming skills in C/C++/Java. ANU students are expected to have finished COMP3600 or have an equivalent experience.

Background literature:

  • Hierarchical Cut Labelling - Scaling Up Distance Queries on Road Networks, Farhan et al., SIGMOD 2024.
  • Query-by-Sketch: Scaling Shortest Path Graph Queries on Very Large Networks, Ye et al., SIGMOD 2021.

Title: GANsemble for Small and Imbalanced Data Sets: A Baseline for Synthetic Microplastics Data

Abstract: Microplastic particle ingestion or inhalation by humans is a problem of growing concern. Unfortunately, current research methods that use machine learning to understand their potential harms are obstructed by a lack of available data. Deep learning techniques in particular are challenged by such domains where only small or imbalanced data sets are available. Overcoming this challenge often involves oversampling underrepresented classes or augmenting the existing data to improve model performance. This paper proposes GANsemble: a two-module framework connecting data augmentation with conditional generative adversarial networks (cGANs) to generate class-conditioned synthetic data. First, the data chooser module automates augmentation strategy selection by searching for the best data augmentation strategy. Next, the cGAN module uses this strategy to train a cGAN for generating enhanced synthetic data. We experiment with the GANsemble framework on a small and imbalanced microplastics data set. A Microplastic-cGAN (MPcGAN) algorithm is introduced, and baselines for synthetic microplastics (SYMP) data are established in terms of Frechet Inception Distance (FID) and Inception Scores (IS). We also provide a synthetic microplastics filter (SYMP-Filter) algorithm to increase the quality of generated SYMP. Additionally, we show the best amount of oversampling with augmentation to fix class imbalance in small microplastics data sets. To our knowledge, this study is the first application of generative AI to synthetically create microplastics data.
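To make the cGAN component of the abstract concrete, here is a hedged, minimal PyTorch sketch of a class-conditioned generator: noise is concatenated with a label embedding so that samples can be drawn per class. All architecture sizes are hypothetical and are not taken from the paper.

```python
# Sketch: conditional generator in the style of a cGAN (hypothetical sizes).
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, z_dim=64, n_classes=5, out_dim=28 * 28):
        super().__init__()
        self.embed = nn.Embedding(n_classes, n_classes)   # label embedding
        self.net = nn.Sequential(
            nn.Linear(z_dim + n_classes, 128), nn.ReLU(),
            nn.Linear(128, out_dim), nn.Tanh(),
        )

    def forward(self, z, labels):
        # Concatenate noise with the label embedding -> class-conditioned sample.
        return self.net(torch.cat([z, self.embed(labels)], dim=1))

g = ConditionalGenerator()
z = torch.randn(8, 64)                 # noise batch
labels = torch.randint(0, 5, (8,))     # requested classes
print(g(z, labels).shape)              # -> torch.Size([8, 784])
```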


How to increase flexibility in robotics?

The FZI project "GANResilRob - Generative Adversarial Networks and Semantics for Resilient, Flexible Production Robots" focuses on a crucial topic for the future: the further development of production robots.

Industrial robots are precise, fast, and powerful. In the changed circumstances of the past two years, however, one specific shortcoming has become apparent: a lack of flexibility. The project "GANResilRob - Generative Adversarial Networks and Semantics for Resilient, Flexible Production Robots", presented by the FZI Research Center for Information Technology in Karlsruhe, therefore focuses on anticipating exceptional situations. In industrial production, large numbers of people often work close together in enclosed spaces. In the event of a pandemic or political conflict, for example, this can pose a risk to both the health of employees and the productivity of the company. The "GANResilRob" approach aims at a flexible, resilient industry based on the combination of machine learning, generative adversarial networks (GANs), artificial intelligence (AI)-based semantic interpretation, and intuitive task programming.

For cases in which, for example, material supply chains are interrupted, the FZI Research Center for Information Technology will develop production systems that learn tasks remotely and automatically adapt production to the changed conditions. The FZI's tasks include researching the flexible and resilient robotic assembly of vacuum chambers and possibilities for recycling components from old cars.


Research Methods & Ethics

Developing novel scientific approaches and examining key ethical considerations to improve the quality of serious illness research

We seek—above all—to conduct studies with results that can be trusted. We are constantly developing and testing new research designs and state-of-the-art methods that push the boundaries of what is scientifically possible without ever compromising the rights, safety, and well-being of patients, care partners, or clinicians.

PAIR Center investigators develop new methods across the full spectrum of the clinical research process—from how we enroll participants into clinical trials to statistical methods for analyzing data.

Three ways we address persistent and complex issues in serious illness research:

  • Recruiting and retaining research participants who reflect the diversity of the American population.
  • Evaluating whether interventions work when the outcomes we care about, such as quality of life, cannot be fully assessed because patients die or are lost to follow-up.
  • Using many different analytic strategies in tandem to ensure that our study results maximize learning and impact.

Of course, with new methods come new ethical questions, and the PAIR Center does not shy away from these issues. Rather, we generate evidence to test the assumptions that often underlie ethical concerns and interrogate our findings to ensure that they apply equally across all groups of patients who differ in race, gender, socioeconomic status, and other key factors. We regularly consult ethicists, patient advocates, and patients themselves on our research methods.

Most importantly, we publish and speak internationally about this work, such that it not only informs the conduct and analysis of all PAIR Center studies but improves the conduct of science worldwide.

Related Projects

Behavioral Economics To Transform Trial Enrollment Representativeness (BETTER)

Using Natural Language Processing and Machine Learning To Identify Potentially Preventable Hospital Admissions Among Outpatients With Chronic Lung Diseases

Developing Quality-Weighted Hospital-Free Days as a Novel, Patient-Centered Outcome for Trials for Patients with Acute Respiratory Failure (QHARP)

Related Publications

Win statistics in pulmonary arterial hypertension clinical trials

Accounting for competing events when evaluating long-term outcomes in survivors of critical illness

Effects of duration of follow-up and lag in data collection on the performance of adaptive clinical trials

Take a deeper dive into our research.

Learn more about how the PAIR Center is making a difference.



COMMENTS

  1. 21 Machine Learning Projects [Beginner to Advanced Guide]

    Scoping is the first stage of machine learning project planning, and offers an opportunity to settle on a data question, ... Pick a small, succinctly defined problem and research a large, relevant data set to increase the odds that your project will generate a positive return on investment. Test Your Hypothesis. In a machine learning context, ...

  2. 100+ Machine Learning Projects with Source Code [2024]

    Discover the top 100+ beginner-friendly machine learning projects for 2024, complete with source code in Python. Kick-start your career in machine learning with these exciting project ideas tailored for beginners. ... Machine Learning has gained a lot of popularity and has become a necessary tool for research purposes as well as for business. It is a ...

  3. 35 Remarkable Machine Learning Projects for You to Try

    This project assists businesses in better understanding their customers, tailoring marketing strategies, and personalizing customer experiences. 8. House Price Prediction. This entails designing a machine learning model to predict house prices based on location, size, number of rooms, amenities, etc.

  4. machine-learning-projects · GitHub Topics · GitHub

    To associate your repository with the machine-learning-projects topic, visit your repo's landing page and select "manage topics." GitHub is where people build software. More than 100 million people use GitHub to discover, fork, and contribute to over 420 million projects.

  5. 12 Great Machine Learning Projects to Get You Started in AI

    We'll take a look at three such machine learning projects. 1. Predicting wine quality. Oenology, or the science of wines, offers a fun introduction to the art of classification. UCI's wine quality dataset can be used to distinguish between red and white wines.

  6. 180 Data Science and Machine Learning Projects with Python

    Data Science Project on Extracting HOG Features. Data Science Project — Email Spam Detection with Machine Learning. Data Science Project — Heart Disease Prediction with Machine Learning. Data Science ...

  7. Machine Learning: Algorithms, Real-World Applications and Research

    Supervised: Supervised learning is typically the task of machine learning to learn a function that maps an input to an output based on sample input-output pairs. It uses labeled training data and a collection of training examples to infer a function. Supervised learning is carried out when certain goals are identified to be accomplished from a certain set of inputs, i.e., a task-driven ...

  8. The latest in Machine Learning

    Papers With Code highlights trending Machine Learning research and the code to implement it. ... (VAR), a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan ...

  9. Machine learning

    Machine learning articles from across Nature Portfolio. Machine learning is the ability of a machine to improve its performance based on previous results. Machine learning methods enable computers ...

  10. Research Projects

    Robust MRI Using New Algorithms for Generative Machine Learning. Magnetic Resonance Imaging (MRI) is currently expensive and time consuming. Exam times can take upwards of 30 minutes to an hour. As a result, patient motion... Research Project Details.

  11. Research Projects

    Machine Learning research projects. Our academics are at the forefront of researching and developing new technologies. We have a successful history of translating that research into practice for the benefit of our partners. We have also supported many PhD students on the path to successful careers. Contact our Graduate Services team.

  12. Journal of Machine Learning Research

    The Journal of Machine Learning Research (JMLR, ISSN 1533-7928) provides an international forum for the electronic and paper publication of high-quality scholarly articles in all areas of machine learning. All published papers are freely available online. JMLR has a commitment to rigorous yet rapid reviewing. Final versions are published immediately ...

  13. Machine learning in project analytics: a data-driven framework and case

    This study considered four machine learning algorithms to explore the causes of project cost overrun using the research dataset mentioned above. They are support vector machine, logistic ...

  14. Research guidelines for Machine Learning projects

    1. Machine Learning projects can be delivered in two stages. The first stage is named Research and is about answering the question: can we make a machine learning model out of this pile of data that serves the needs of the client? The deliverable is a proof of concept or a feasibility study. The second stage is named Development and it is about ...

  15. Top 10 Research and Thesis Topics for ML Projects in 2022

    In this tech-driven world, selecting research and thesis topics for machine learning projects is a first step for master's and doctoral scholars. Selecting and working on a thesis topic in machine learning is not an easy task, as machine learning uses statistical algorithms to make computers work in a certain way without being explicitly ...

  16. machine learning Latest Research Papers

    Find the latest published documents for machine learning, related hot topics, top authors, the most cited documents, and related journals. ... This article examines the background to the problem and outlines a project that TNA undertook to research the feasibility of using commercially available artificial ...

  17. AI & Machine Learning Research Topics (+ Free Webinar)

    AI-Related Research Topics & Ideas. Below you'll find a list of AI and machine learning-related research topic ideas. These are intentionally broad and generic, so keep in mind that you will need to refine them a little. Nevertheless, they should inspire some ideas for your project.

  18. Top 310+ Machine Learning Projects for 2023 [Source Code Included]

    2. Iris Flowers Classification Project. Project idea - The iris flowers have different species, and you can distinguish them based on the lengths of their petals and sepals. This is a basic project for machine learning beginners to predict the species of a new iris flower (a minimal scikit-learn version appears after this list). Dataset: Iris Flowers Classification Dataset.

  19. Machine Learning

    We study a range of research areas related to machine learning and their applications for robotics, health care, language processing, information retrieval and more. These subjects include precision medicine, motion planning, computer vision, Bayesian inference, graphical models, and statistical inference and estimation. Our work is ...

  20. Research

    Our principal research interests lie in the development of machine learning and statistical methodology, and large-scale computational system and architecture, for solving problems involving automated learning, reasoning, and decision-making in high-dimensional, multimodal, and dynamic possible worlds in artificial, biological, and social ...

  21. MIT Sloan research on artificial intelligence and machine learning

    There's little question artificial intelligence and machine learning are playing an increased role in making business decisions. A 2022 survey of senior data and technology executives by NewVantage Partners found that 92% of large companies reported achieving returns on their data and AI investments — an increase from 48% in 2017.

  22. How to Write a Machine Learning Research Proposal

    A machine learning research proposal is a document that describes a proposed research project that uses machine learning algorithms and techniques. The proposal should include a brief overview of the problem to be tackled, the proposed solution, and the expected results. It should also briefly describe the dataset to be used, the evaluation ...

  23. Using machine learning to predict student retention from socio ...

    In this research, we partner with an educational software company - READY Education - that offers a virtual one-stop interaction platform in the form of a smartphone application to facilitate ...

  24. Popular Machine Learning Libraries to Know

    However, with an extensive machine learning library available, you can still reap the benefits of this powerful language. Python: Widely used for machine learning, Python is a beginner-friendly language with numerous machine learning libraries to assist with areas such as data analysis, computer vision, natural language processing, and more.

  25. Research Projects with Graph Research Lab @ ANU

    Graph Research Lab offers the following projects: (1) Graph Deep Learning - Theory and Practice. Graph deep learning aims to apply deep learning techniques to learn from complex data that is represented as graphs. In recent years, due to rising trends in network analysis and prediction, Graph Neural Networks (GNNs) as a powerful deep ...

  26. AI, Machine Learning & Robotics Research

    AI, Machine Learning & Robotics research at Drexel University's College of Computing & Informatics (CCI) explores algorithms, mathematics, and applications of artificial intelligence (AI) through fundamental and applied work in computer vision and pattern recognition, data science, knowledge representation and automatic metadata generation/extraction, logic, cognitive modeling and neural ...

  27. GANsemble for Small and Imbalanced Data Sets: A Baseline for Synthetic

    Microplastic particle ingestion or inhalation by humans is a problem of growing concern. Unfortunately, current research methods that use machine learning to understand their potential harms are obstructed by a lack of available data. Deep learning techniques in particular are challenged by such domains where only small or imbalanced data sets are available. Overcoming this challenge often ...

  28. AI & Machine Learning: How to increase flexibility in robotics?

    Excerpt of the FZI "GANResilRob" article quoted in full earlier in this document.

  29. Prediction of Residual Aluminum Concentration and Size in Water ...

    Effluent water in China's water transfer project varies widely, creating the possibility of excessive residual aluminum if the coagulant dose is set improperly. Controlling residual aluminum based on operator experience alone carries substantial risk, whereas machine learning can recommend an appropriate dose from the effluent water parameters.

  30. Research Methods & Ethics

    Excerpt of the PAIR Center "Research Methods & Ethics" page reproduced earlier in this document.
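Several of the excerpts above (items 7 and 18 in particular) describe supervised classification in the abstract, and the Iris project is small enough to show end to end. The sketch below uses only standard scikit-learn calls and is a generic illustration, not code from any of the cited sources.

```python
# Minimal supervised-learning example for the Iris project (item 18):
# learn a mapping from labeled petal/sepal measurements to species,
# the input -> output function-fitting described in item 7.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # 150 flowers, 4 measurements each

# Hold out a test set so accuracy reflects flowers the model never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)  # infer the function from labeled pairs
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

Swapping KNeighborsClassifier for logistic regression or a decision tree changes only two lines, which is why this dataset is the usual first stop in the beginner project lists above.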