Georgetown University

Biomedical Graduate Education

Capstone Projects

2022-2023 Graduates

Nelson Moore

Data Scientist at Essential Software Inc

Capstone Project: Modeling and code implementation to support data search and filter through the NCI Cancer Data Aggregator

Industry Mentor: Frederick National Laboratory for Cancer Research (FNLCR)

Joelle Fitzgerald

Business Analyst at Ascension Health Care

Capstone Project: Analysis of patient safety event report data

Industry Mentor: MedStar Health, National Center for Human Factors in Healthcare

Kader (Abdelkader) Bouregag

Healthcare Xplorer | Medical Informatics at Genentech (internship)

Capstone Project: Transforming the Immuno-Oncology data to the OMOP CDM

Industry Mentor: MSKCC / MedStar / Georgetown University / Hackensack

Junaid Imam

Data Scientist at Medstar Institute

Capstone Project: Create a [trans-]eQTL visualization tool

Industry Mentor: Pfizer Inc / Harvard

Abbie Gillen

Staff Data Analyst at Nice Healthcare

Capstone Project: Predicting Nice Healthcare utilization

Industry Mentor: Nice Healthcare

Capstone Project: Next Generation Data Commons

Industry Mentor: ICF International

2021-2022 Graduates

Ahson Saiyed

NLP Engineer/Data Scientist at TrinetX

Capstone Project: Research Data Platform Pipelines

Industry Mentor: Invitae

Walid Nashashibi

Data Scientist at FEMA

Capstone Project: Xenopus RNA-Seq Analysis to Understand Tissue Regeneration Mechanisms

Industry Mentor: FDA

Tony Albini

Data Analyst at ClearView Healthcare Partners

Capstone Project: Data Mining to understand the patient landscape of the Chronic Kidney Disease population

Industry Mentor: AstraZeneca

Anvitha Gooty Agraharam

Business Account Manager at GeneData

Capstone Project: Computational estimation of Pleiotropy in Genome-Phenome Associations for target discovery

Industry Mentor: AstraZeneca

Natalie Cortopassi

Researcher at the Institute for Health Metrics and Evaluation

Capstone Project: Analysis of Clinical Trial Attrition in Neuropsychiatric Clinical Trials using Machine Learning

Industry Mentor: AstraZeneca

Christle Iroezi

Business System Analyst at Centene Corporation

Capstone Project: Visualize Digital Healthcare ROI

Industry Mentor: MedStar Health

R & D Analyst II at GEICO

Capstone Project: Heat Waves and Health Outcomes

Industry Mentor: ICF

Research Specialist at Georgetown University

Capstone Project: Mental Health Data Commons

Industry Mentor: ICF

2020-2021 Graduates

Technology Transformation Analyst, Grant Thornton LLP

Capstone Project: Research Data Platform Pipelines

Industry Mentor: Invitae

Research Technician at Georgetown University

Capstone Project: Using a configurable, open-source framework to create a fully functional data commons with the REMBRANDT dataset

Industry Mentor: Frederick National Laboratory for Cancer Research (FNLCR)

Consultant at Deloitte

Capstone Project: Building a patient-centric data warehouse

Industry Mentor: Invitae

Marcio Rosas

Project Manager of Technology and Informatics at Georgetown University

Capstone Project: Knowledge-Based Predictive Modeling of Clinical Trials Enrollment Rates

Industry Mentor: AstraZeneca

Yuezheng (Kerry) He

Data Product Associate at YipitData

Capstone Project: ClinicalTrials2Vec – Accelerating trial-level computing using a vectorized model of clinical trial summaries and results

Industry Mentor: AstraZeneca

Data Programmer at Chemonics International

Capstone Project: Multi-scale modeling to enable data-driven biomarker and target discovery

Industry Mentor: AstraZeneca

2019-2020 Graduates

Pratyush Tandale

Informatics Specialist I at Mayo Clinic

Capstone Project: Improving clinical mapping process for lab data using LOINC

Industry Mentor: Flatiron Roche

Shabeeb Kannattikuni

Senior Statistical Programmer at PRA Health Sciences (ICON plc)

Capstone Project: NGS Data Analysis for the QA of viral vaccines

Industry Mentor: Argentys Informatics

Fuyuan Wang (Bruce)

Software Engineer at Essential Software Inc., Frederick National Labs

Capstone Project: Cancer Data Model Visualization framework

Industry Mentor: Frederick National Laboratory for Cancer Research

Ayah Elshikh

Capstone Project: NGS Data Analysis for the QA of viral vaccines

Industry Mentor: Argentys Informatics

Yue (Lilian) Li

Biostatistician and Statistical Programmer, Baim Institute for Clinical Research

Capstone Project: Analysis of COVID-19 Serological test data to improve the COVID-19 Detection capabilities

Industry Mentor: Argentys Informatics

Algorithm Performance Engineer at Optovue

Capstone Project: Socioeconomic factors in readmissions after major cancer surgery

Industry Mentor: MedStar Health

Jiazhong Zhang

Management Trainee at China Bohai Bank

Jianyi Zhang


10 Unique Data Science Capstone Project Ideas

A capstone project is a culminating assignment that allows students to demonstrate the skills and knowledge they’ve acquired throughout their degree program. For data science students, it’s a chance to tackle a substantial real-world data problem.

If you’re short on time, here’s a quick answer to your question: Some great data science capstone ideas include analyzing health trends, building a predictive movie recommendation system, optimizing traffic patterns, forecasting cryptocurrency prices, and more.

In this comprehensive guide, we will explore 10 unique capstone project ideas for data science students. We’ll overview potential data sources, analysis methods, and practical applications for each idea.

Whether you want to work with social media datasets, geospatial data, or anything in between, you’re sure to find an interesting capstone topic.

Project Idea #1: Analyzing Health Trends

When it comes to data science capstone projects, analyzing health trends is an intriguing idea that can have a significant impact on public health. By leveraging data from various sources, data scientists can uncover valuable insights that can help improve healthcare outcomes and inform policy decisions.

Data Sources

There are several data sources that can be used to analyze health trends. One of the most common sources is electronic health records (EHRs), which contain a wealth of information about patient demographics, medical history, and treatment outcomes.

Other sources include health surveys, wearable devices, social media, and even environmental data.

Analysis Approaches

When analyzing health trends, data scientists can employ a variety of analysis approaches. Descriptive analysis can provide a snapshot of current health trends, such as the prevalence of certain diseases or the distribution of risk factors.

Predictive analysis can be used to forecast future health outcomes, such as predicting disease outbreaks or identifying individuals at high risk for certain conditions. Machine learning algorithms can be trained to identify patterns and make accurate predictions based on large datasets.
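
To make this concrete, here is a minimal sketch of the predictive approach, assuming a hypothetical EHR-style extract (ehr_extract.csv) whose column names are purely illustrative:

```python
# A minimal predictive-modeling sketch; the file and column names are
# hypothetical stand-ins for an EHR-derived table.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("ehr_extract.csv")
X = df[["age", "bmi", "systolic_bp", "smoker", "hba1c"]]
y = df["developed_condition"]  # 1 = condition diagnosed during follow-up

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=300, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```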

Applications

The applications of analyzing health trends are vast and far-reaching. By understanding patterns and trends in health data, policymakers can make informed decisions about resource allocation and public health initiatives.

Healthcare providers can use these insights to develop personalized treatment plans and interventions. Researchers can uncover new insights into disease progression and identify potential targets for intervention.

Ultimately, analyzing health trends has the potential to improve overall population health and reduce healthcare costs.

Project Idea #2: Movie Recommendation System

When developing a movie recommendation system, there are several data sources that can be used to gather information about movies and user preferences. One popular data source is the MovieLens dataset, which contains a large collection of movie ratings provided by users.

Another source is IMDb, a trusted website that provides comprehensive information about movies, including user ratings and reviews. Additionally, streaming platforms like Netflix and Amazon Prime also provide access to user ratings and viewing history, which can be valuable for building an accurate recommendation system.

There are several analysis approaches that can be employed to build a movie recommendation system. One common approach is collaborative filtering, which uses user ratings and preferences to identify patterns and make recommendations based on similar users’ preferences.

Another approach is content-based filtering, which analyzes the characteristics of movies (such as genre, director, and actors) to recommend similar movies to users. Hybrid approaches that combine both collaborative and content-based filtering techniques are also popular, as they can provide more accurate and diverse recommendations.
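
As a concrete starting point, here is a minimal item-based collaborative-filtering sketch. It assumes a MovieLens-style ratings.csv with user_id, movie_id, and rating columns; the filename and the movie_id of 1 are illustrative:

```python
# Item-based collaborative filtering via cosine similarity on the
# user-movie rating matrix.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

ratings = pd.read_csv("ratings.csv")  # columns: user_id, movie_id, rating

# Users as rows, movies as columns; unrated cells treated as 0.
matrix = ratings.pivot_table(index="user_id", columns="movie_id",
                             values="rating").fillna(0)

# Cosine similarity between movies, based on their co-rating patterns.
item_sim = pd.DataFrame(cosine_similarity(matrix.T),
                        index=matrix.columns, columns=matrix.columns)

def recommend_similar(movie_id, k=5):
    """Return the k movies whose rating patterns are most similar."""
    return item_sim[movie_id].drop(movie_id).nlargest(k)

print(recommend_similar(movie_id=1))
```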

A movie recommendation system has numerous applications in the entertainment industry. One application is to enhance the user experience on streaming platforms by providing personalized movie recommendations based on individual preferences.

This can help users discover new movies they might enjoy and improve overall satisfaction with the platform. Additionally, movie recommendation systems can be used by movie production companies to analyze user preferences and trends, aiding in the decision-making process for creating new movies.

Finally, movie recommendation systems can also be utilized by movie critics and reviewers to identify movies that are likely to be well-received by audiences.

For more information on movie recommendation systems, you can visit https://www.kaggle.com/rounakbanik/movie-recommender-systems or https://www.researchgate.net/publication/221364567_A_new_movie_recommendation_system_for_large-scale_data.

Project Idea #3: Optimizing Traffic Patterns

When it comes to optimizing traffic patterns, there are several data sources that can be utilized. One of the most prominent sources is real-time traffic data collected from various sources such as GPS devices, traffic cameras, and mobile applications.

This data provides valuable insights into the current traffic conditions, including congestion, accidents, and road closures. Additionally, historical traffic data can also be used to identify recurring patterns and trends in traffic flow.

Other data sources that can be used include weather data, which can help in understanding how weather conditions impact traffic patterns, and social media data, which can provide information about events or incidents that may affect traffic.

Optimizing traffic patterns requires the use of advanced data analysis techniques. One approach is to use machine learning algorithms to predict traffic patterns based on historical and real-time data.

These algorithms can analyze various factors such as time of day, day of the week, weather conditions, and events to predict traffic congestion and suggest alternative routes.
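
A minimal sketch of that idea, assuming a hypothetical traffic_counts.csv of hourly sensor readings already joined with weather and calendar fields:

```python
# Predicting traffic volume from time-of-day, day-of-week, and weather
# features; all column names are illustrative.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

traffic = pd.read_csv("traffic_counts.csv", parse_dates=["timestamp"])
traffic["hour"] = traffic["timestamp"].dt.hour
traffic["weekday"] = traffic["timestamp"].dt.weekday

features = ["hour", "weekday", "temperature_c", "precip_mm", "is_holiday"]
X_train, X_test, y_train, y_test = train_test_split(
    traffic[features], traffic["vehicles_per_hour"], random_state=42
)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```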

Another approach is to use network analysis to identify bottlenecks and areas of congestion in the road network. By analyzing the flow of traffic and identifying areas where traffic slows down or comes to a halt, transportation authorities can make informed decisions on how to optimize traffic flow.

The optimization of traffic patterns has numerous applications and benefits. One of the main benefits is the reduction of traffic congestion, which can lead to significant time and fuel savings for commuters.

By optimizing traffic patterns, transportation authorities can also improve road safety by reducing the likelihood of accidents caused by congestion.

Additionally, optimizing traffic patterns can have positive environmental impacts by reducing greenhouse gas emissions. By minimizing the time spent idling in traffic, vehicles can operate more efficiently and emit fewer pollutants.

Furthermore, optimizing traffic patterns can have economic benefits by improving the flow of goods and services. Efficient traffic patterns can reduce delivery times and increase productivity for businesses.

Project Idea #4: Forecasting Cryptocurrency Prices

With the growing popularity of cryptocurrencies like Bitcoin and Ethereum, forecasting their prices has become an exciting and challenging task for data scientists. This project idea involves using historical data to predict future price movements and trends in the cryptocurrency market.

When working on this project, data scientists can gather cryptocurrency price data from various sources such as cryptocurrency exchanges, financial websites, or APIs. Websites like CoinMarketCap (https://coinmarketcap.com/) provide comprehensive data on various cryptocurrencies, including historical price data.

Additionally, platforms like CryptoCompare (https://www.cryptocompare.com/) offer real-time and historical data for different cryptocurrencies.

To forecast cryptocurrency prices, data scientists can employ various analysis approaches. Some common techniques include:

  • Time Series Analysis: This approach involves analyzing historical price data to identify patterns, trends, and seasonality in cryptocurrency prices. Techniques like moving averages, autoregressive integrated moving average (ARIMA), or exponential smoothing can be used to make predictions (see the ARIMA sketch after this list).
  • Machine Learning: Machine learning algorithms, such as random forests, support vector machines, or neural networks, can be trained on historical cryptocurrency data to predict future price movements. These algorithms can consider multiple variables, such as trading volume, market sentiment, or external factors, to make accurate predictions.
  • Sentiment Analysis: This approach involves analyzing social media sentiment and news articles related to cryptocurrencies to gauge market sentiment. By considering the collective sentiment, data scientists can predict how positive or negative sentiment can impact cryptocurrency prices.
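
A minimal ARIMA sketch, assuming a hypothetical btc_daily.csv of daily closing prices (e.g., exported from CoinMarketCap) with date and close columns; the (1, 1, 1) order is a placeholder rather than a tuned choice:

```python
# Fit a simple ARIMA model to daily closing prices and forecast a week
# ahead; order selection (e.g., via AIC) is omitted for brevity.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

prices = pd.read_csv("btc_daily.csv", parse_dates=["date"], index_col="date")
series = prices["close"].asfreq("D").ffill()  # enforce a regular daily index

fit = ARIMA(series, order=(1, 1, 1)).fit()
print(fit.forecast(steps=7))  # predicted closing prices for the next 7 days
```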

Forecasting cryptocurrency prices can have several practical applications:

  • Investment Decision Making: Accurate price forecasts can help investors make informed decisions when buying or selling cryptocurrencies. By considering the predicted price movements, investors can optimize their investment strategies and potentially maximize their returns.
  • Trading Strategies: Traders can use price forecasts to develop trading strategies, such as trend following or mean reversion. By leveraging predicted price movements, traders can make profitable trades in the volatile cryptocurrency market.
  • Risk Management: Cryptocurrency price forecasts can help individuals and organizations manage their risk exposure. By understanding potential price fluctuations, risk management strategies can be implemented to mitigate losses.

Project Idea #5: Predicting Flight Delays

One interesting and practical data science capstone project idea is to create a model that can predict flight delays. Flight delays can cause a lot of inconvenience for passengers and can have a significant impact on travel plans.

By developing a predictive model, airlines and travelers can be better prepared for potential delays and take appropriate actions.

To create a flight delay prediction model, you would need to gather relevant data from various sources. Some potential data sources include:

  • Flight data from airlines or aviation organizations
  • Weather data from meteorological agencies
  • Historical flight delay data from airports

By combining these different data sources, you can build a comprehensive dataset that captures the factors contributing to flight delays.

Once you have collected the necessary data, you can employ different analysis approaches to predict flight delays. Some common approaches include:

  • Machine learning algorithms such as decision trees, random forests, or neural networks
  • Time series analysis to identify patterns and trends in flight delay data
  • Feature engineering to extract relevant features from the dataset

By applying these analysis techniques, you can develop a model that can accurately predict flight delays based on the available data.
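
Here is a minimal sketch of such a model, assuming a hypothetical flights_with_weather.csv in which flight records have already been joined with weather observations; every column name is illustrative:

```python
# A gradient-boosted classifier for "delayed by 15+ minutes at departure".
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

flights = pd.read_csv("flights_with_weather.csv")
features = ["dep_hour", "day_of_week", "carrier_code", "distance_km",
            "wind_speed", "visibility_km"]
X = pd.get_dummies(flights[features], columns=["carrier_code"])
y = flights["delayed_15min"]  # 1 if departure was delayed 15+ minutes

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)
model = GradientBoostingClassifier().fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```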

The applications of a flight delay prediction model are numerous. Airlines can use the model to optimize their operations, improve scheduling, and minimize disruptions caused by delays. Travelers can benefit from the model by being alerted in advance about potential delays and making necessary adjustments to their travel plans.

Additionally, airports can use the model to improve resource allocation and manage passenger flow during periods of high delay probability. Overall, a flight delay prediction model can significantly enhance the efficiency and customer satisfaction in the aviation industry.

Project Idea #6: Fighting Fake News

With the rise of social media and the easy access to information, the spread of fake news has become a significant concern. Data science can play a crucial role in combating this issue by developing innovative solutions.

Here are some aspects to consider when working on a project that aims to fight fake news.

When it comes to fighting fake news, having reliable data sources is essential. There are several trustworthy platforms that provide access to credible news articles and fact-checking databases. Websites like Snopes and FactCheck.org are good starting points for obtaining accurate information.

Additionally, social media platforms such as Twitter and Facebook can be valuable sources for analyzing the spread of misinformation.

One approach to analyzing fake news is by utilizing natural language processing (NLP) techniques. NLP can help identify patterns and linguistic cues that indicate the presence of misleading information.

Sentiment analysis can also be employed to determine the emotional tone of news articles or social media posts, which can be an indicator of potential bias or misinformation.

Another approach is network analysis, which focuses on understanding how information spreads through social networks. By analyzing the connections between users and the content they share, it becomes possible to identify patterns of misinformation dissemination.

Network analysis can also help in identifying influential sources and detecting coordinated efforts to spread fake news.

The applications of a project aiming to fight fake news are numerous. One possible application is the development of a browser extension or a mobile application that provides users with real-time fact-checking information.

This tool could flag potentially misleading articles or social media posts and provide users with accurate information to help them make informed decisions.

Another application could be the creation of an algorithm that automatically identifies fake news articles and separates them from reliable sources. This algorithm could be integrated into news aggregation platforms to help users distinguish between credible and non-credible information.
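
A minimal sketch of such a classifier, assuming a hypothetical labeled_articles.csv with text and label columns (label 1 = fake); TF-IDF plus logistic regression is a common, simple baseline rather than the only option:

```python
# Baseline fake-news classifier: TF-IDF features + logistic regression.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

articles = pd.read_csv("labeled_articles.csv")  # columns: text, label
X_train, X_test, y_train, y_test = train_test_split(
    articles["text"], articles["label"], stratify=articles["label"],
    random_state=42
)

clf = make_pipeline(TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("Held-out accuracy:", clf.score(X_test, y_test))
```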

Project Idea #7: Analyzing Social Media Sentiment

Social media platforms have become a treasure trove of valuable data for businesses and researchers alike. When analyzing social media sentiment, there are several data sources that can be tapped into. The most popular ones include:

  • Twitter: With its vast user base and real-time nature, Twitter is often the go-to platform for sentiment analysis. Researchers can gather tweets containing specific keywords or hashtags to analyze the sentiment of a particular topic.
  • Facebook: Facebook offers rich data for sentiment analysis, including posts, comments, and reactions. Analyzing the sentiment of Facebook posts can provide valuable insights into user opinions and preferences.
  • Instagram: Instagram’s visual nature makes it an interesting platform for sentiment analysis. By analyzing the comments and captions on Instagram posts, researchers can gain insights into the sentiment associated with different images or topics.
  • Reddit: Reddit is a popular platform for discussions on various topics. By analyzing the sentiment of comments and posts on specific subreddits, researchers can gain insights into the sentiment of different communities.

These are just a few examples of the data sources that can be used for analyzing social media sentiment. Depending on the research goals, other platforms such as LinkedIn, YouTube, and TikTok can also be explored.

When it comes to analyzing social media sentiment, there are various approaches that can be employed. Some commonly used analysis techniques include:

  • Lexicon-based analysis: This approach involves using predefined sentiment lexicons to assign sentiment scores to words or phrases in social media posts. By aggregating these scores, researchers can determine the overall sentiment of a post or a collection of posts.
  • Machine learning: Machine learning algorithms can be trained to classify social media posts into positive, negative, or neutral sentiment categories. These algorithms learn from labeled data and can make predictions on new, unlabeled data.
  • Deep learning: Deep learning techniques, such as recurrent neural networks (RNNs) or convolutional neural networks (CNNs), can be used to capture the complex patterns and dependencies in social media data. These models can learn to extract sentiment information from textual or visual content.

It is important to note that the choice of analysis approach depends on the specific research objectives, available resources, and the nature of the social media data being analyzed.
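
To illustrate the lexicon-based approach above, here is a minimal sketch using NLTK's VADER analyzer, which is tuned for short social media text; the example tweets are placeholders:

```python
# Lexicon-based sentiment scoring with NLTK's VADER.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

tweets = [
    "Absolutely loving the new update, great work!",
    "This rollout has been a complete disaster.",
]
for tweet in tweets:
    scores = sia.polarity_scores(tweet)  # neg/neu/pos plus compound in [-1, 1]
    print(f"{scores['compound']:+.2f}  {tweet}")
```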

Analyzing social media sentiment has a wide range of applications across different industries. Here are a few examples:

  • Brand reputation management: By analyzing social media sentiment, businesses can monitor and manage their brand reputation. They can identify potential issues, respond to customer feedback, and take proactive measures to maintain a positive image.
  • Market research: Social media sentiment analysis can provide valuable insights into consumer opinions and preferences. Businesses can use this information to understand market trends, identify customer needs, and develop targeted marketing strategies.
  • Customer feedback analysis: Social media sentiment analysis can help businesses understand customer satisfaction levels and identify areas for improvement. By analyzing sentiment in customer feedback, companies can make data-driven decisions to enhance their products or services.
  • Public opinion analysis: Researchers can analyze social media sentiment to study public opinion on various topics, such as political events, social issues, or product launches. This information can be used to understand public sentiment, predict trends, and inform decision-making.

These are just a few examples of how analyzing social media sentiment can be applied in real-world scenarios. The insights gained from sentiment analysis can help businesses and researchers make informed decisions, improve customer experience, and drive innovation.

Project Idea #8: Improving Online Ad Targeting

Improving online ad targeting involves analyzing various data sources to gain insights into users’ preferences and behaviors. These data sources may include:

  • Website analytics: Gathering data from websites to understand user engagement, page views, and click-through rates.
  • Demographic data: Utilizing information such as age, gender, location, and income to create targeted ad campaigns.
  • Social media data: Extracting data from platforms like Facebook, Twitter, and Instagram to understand users’ interests and online behavior.
  • Search engine data: Analyzing search queries and user behavior on search engines to identify intent and preferences.

By combining and analyzing these diverse data sources, data scientists can gain a comprehensive understanding of users and their ad preferences.

To improve online ad targeting, data scientists can employ various analysis approaches:

  • Segmentation analysis: Dividing users into distinct groups based on shared characteristics and preferences.
  • Collaborative filtering: Recommending ads based on users with similar preferences and behaviors.
  • Predictive modeling: Developing algorithms to predict users’ likelihood of engaging with specific ads.
  • Machine learning: Utilizing algorithms that can continuously learn from user interactions to optimize ad targeting.

These analysis approaches help data scientists uncover patterns and insights that can enhance the effectiveness of online ad campaigns.
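
As one concrete example, the predictive-modeling approach can be sketched as a click-probability model; the hypothetical ad_impressions.csv and all feature names below are illustrative:

```python
# Estimating P(click) per impression with logistic regression.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

impressions = pd.read_csv("ad_impressions.csv")
X = pd.get_dummies(impressions[["age_band", "device", "ad_category",
                                "hour_of_day"]])
y = impressions["clicked"]  # 1 if the impression led to a click

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Rank held-out impressions by predicted click probability.
click_prob = model.predict_proba(X_test)[:, 1]
print(pd.Series(click_prob).describe())
```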

Improved online ad targeting has numerous applications:

  • Increased ad revenue: By delivering more relevant ads to users, advertisers can expect higher click-through rates and conversions.
  • Better user experience: Users are more likely to engage with ads that align with their interests, leading to a more positive browsing experience.
  • Reduced ad fatigue: By targeting ads more effectively, users are less likely to feel overwhelmed by irrelevant or repetitive advertisements.
  • Maximized ad budget: Advertisers can optimize their budget by focusing on the most promising target audiences.

Project Idea #9: Enhancing Customer Segmentation

Enhancing customer segmentation involves gathering relevant data from various sources to gain insights into customer behavior, preferences, and demographics. Some common data sources include:

  • Customer transaction data
  • Customer surveys and feedback
  • Social media data
  • Website analytics
  • Customer support interactions

By combining data from these sources, businesses can create a comprehensive profile of their customers and identify patterns and trends that will help in improving their segmentation strategies.

There are several analysis approaches that can be used to enhance customer segmentation:

  • Clustering: Using clustering algorithms to group customers based on similar characteristics or behaviors.
  • Classification: Building predictive models to assign customers to different segments based on their attributes.
  • Association Rule Mining: Identifying relationships and patterns in customer data to uncover hidden insights.
  • Sentiment Analysis: Analyzing customer feedback and social media data to understand customer sentiment and preferences.

These analysis approaches can be used individually or in combination to enhance customer segmentation and create more targeted marketing strategies.
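
A minimal clustering sketch of the idea, assuming a hypothetical customers.csv with RFM-style columns (recency, frequency, spend); four clusters is an arbitrary starting point:

```python
# k-means segmentation on standardized customer features.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

customers = pd.read_csv("customers.csv")
cols = ["recency_days", "order_frequency", "total_spend"]

# Standardize so no single feature dominates the distance metric.
scaled = StandardScaler().fit_transform(customers[cols])

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(scaled)
customers["segment"] = kmeans.labels_
print(customers.groupby("segment")[cols].mean())  # profile each segment
```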

Enhancing customer segmentation can have numerous applications across industries:

  • Personalized marketing campaigns: By understanding customer preferences and behaviors, businesses can tailor their marketing messages to individual customers, increasing the likelihood of engagement and conversion.
  • Product recommendations: By segmenting customers based on their purchase history and preferences, businesses can provide personalized product recommendations, leading to higher customer satisfaction and sales.
  • Customer retention: By identifying at-risk customers and understanding their needs, businesses can implement targeted retention strategies to reduce churn and improve customer loyalty.
  • Market segmentation: By identifying distinct customer segments, businesses can develop tailored product offerings and marketing strategies for each segment, maximizing the effectiveness of their marketing efforts.

Project Idea #10: Building a Chatbot

A chatbot is a computer program that uses artificial intelligence to simulate human conversation. It can interact with users in a natural language through text or voice. Building a chatbot can be an exciting and challenging data science capstone project.

It requires a combination of natural language processing, machine learning, and programming skills.

When building a chatbot, data sources play a crucial role in training and improving its performance. There are various data sources that can be used:

  • Chat logs: Analyzing existing chat logs can help in understanding common user queries, responses, and patterns. This data can be used to train the chatbot on how to respond to different types of questions and scenarios.
  • Knowledge bases: Integrating a knowledge base can provide the chatbot with a wide range of information and facts. This can be useful in answering specific questions or providing detailed explanations on certain topics.
  • APIs: Utilizing APIs from different platforms can enhance the chatbot’s capabilities. For example, integrating a weather API can allow the chatbot to provide real-time weather information based on user queries.

There are several analysis approaches that can be used to build an efficient and effective chatbot:

  • Natural Language Processing (NLP): NLP techniques enable the chatbot to understand and interpret user queries. This involves tasks such as tokenization, part-of-speech tagging, named entity recognition, and sentiment analysis.
  • Intent recognition: Identifying the intent behind user queries is crucial for providing accurate responses. Machine learning algorithms can be trained to classify user intents based on the input text (see the sketch after this list).
  • Contextual understanding: Chatbots need to understand the context of the conversation to provide relevant and meaningful responses. Techniques such as sequence-to-sequence models or attention mechanisms can be used to capture contextual information.
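
A minimal intent-recognition sketch: TF-IDF features and a linear classifier trained on a tiny hand-labeled set of utterances (all examples and intent names below are invented for illustration):

```python
# Toy intent classifier; a real chatbot would use far more training data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

utterances = ["what's the weather tomorrow", "will it rain today",
              "book a table for two", "reserve dinner at 7pm",
              "tell me a joke", "say something funny"]
intents = ["weather", "weather", "booking", "booking",
           "smalltalk", "smalltalk"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(utterances, intents)

print(clf.predict(["will it snow today"]))  # expected: ['weather']
```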

Chatbots have a wide range of applications in various industries:

  • Customer support: Chatbots can be used to handle customer queries and provide instant support. They can assist with common troubleshooting issues, answer frequently asked questions, and escalate complex queries to human agents when necessary.
  • E-commerce: Chatbots can enhance the shopping experience by assisting users in finding products, providing recommendations, and answering product-related queries.
  • Healthcare: Chatbots can be deployed in healthcare settings to provide preliminary medical advice, answer general health-related questions, and assist with appointment scheduling.

Building a chatbot as a data science capstone project not only showcases your technical skills but also allows you to explore the exciting field of artificial intelligence and natural language processing.

It can be a great opportunity to create a practical and useful tool that can benefit users in various domains.

Completing an in-depth capstone project is the perfect way for data science students to demonstrate their technical skills and business acumen. This guide outlined 10 unique project ideas spanning industries like healthcare, transportation, finance, and more.

By identifying the ideal data sources, analysis techniques, and practical applications for their chosen project, students can produce an impressive capstone that solves real-world problems and showcases their abilities.


Data Science: Capstone

To become an expert, you need practice and experience.

Show what you’ve learned from the Professional Certificate Program in Data Science.


What You'll Learn

To become an expert data scientist you need practice and experience. By completing this capstone project you will get an opportunity to apply the knowledge and skills in R data analysis that you have gained throughout the series. This final project will test your skills in data visualization, probability, inference and modeling, data wrangling, data organization, regression, and machine learning.

Unlike the rest of our Professional Certificate Program in Data Science, in this course you will receive much less guidance from the instructors. When you complete the project, you will have a data product to show off to potential employers or educational programs, a strong indicator of your expertise in the field of data science.

The course will be delivered via edX and connect learners around the world. By the end of the course, participants will understand the following concepts:

  • How to apply the knowledge base and skills learned throughout the series to a real-world problem
  • How to independently work on a data analysis project

Your Instructors

Rafael Irizarry

Professor of Biostatistics at Harvard University

Ways to take this course

When you enroll in this course, you will have the option of pursuing a Verified Certificate or Auditing the Course.

A Verified Certificate costs $149 and provides unlimited access to full course materials, activities, tests, and forums. At the end of the course, learners who earn a passing grade can receive a certificate. 

Alternatively, learners can Audit the course for free and have access to select course material, activities, tests, and forums.  Please note that this track does not offer a certificate for learners who earn a passing grade.



Master of Science in Health Data Science


Dartmouth’s Master of Science degree in Health Data Science positions students to be biomedical data scientists who are at the forefront of innovative solutions to the health care industry’s greatest challenges.


Capstone summary

Students will apply and refine their statistical, computational, and investigative skill sets in a research project in health data science through the capstone experience offered with this MS degree. This three-month experience culminates in a white paper and research presentation. The course trains students in professional skills such as executing a research project, scientific writing, and presenting to a broader audience that may not have domain expertise. The goal of this experience is to equip students with the high-value professional skills necessary in the biomedical analysis workforce, in addition to the skills gained through degree coursework. As data science and biomedical research become increasingly interdisciplinary, the professional skills sharpened in an interdisciplinary setting such as this capstone experience will increase student success post-graduation.

The capstone experience includes preparatory coursework and a capstone course. 

Capstone experience overview

  • Winter Term: Capstone preparation course (0.5 Units)  
  • Spring Term: Capstone preparation course (0.5 Units) 
  • Summer Term: Capstone course (3.0 Units)  

Available capstone tracks:  

  • Individual project with a Dartmouth PI 
  • External internship experience 
  • Student-led group project using publicly available data 


🏥👩🏽‍⚕️ Data Science Course Capstone Project - Healthcare domain - Diabetes Detection

This is a comprehensive project completed by me as part of the Data Science Post Graduate Programme. The project applies multiple classification algorithms to a dataset of health/diagnostic variables to predict whether a person has diabetes. Apart from extensive EDA to understand the distribution and other aspects of the data, pre-processing was done to identify values that were missing or did not make sense within certain columns, and imputation techniques were deployed to treat missing values. For classification, the balance of classes was also reviewed and treated using SMOTE. Finally, models were built using various classification algorithms and compared for accuracy on various metrics. Lastly, the project contains a dashboard on the original data built in Tableau.

You can view the full project code at this GitHub link

Note: This is an academic project completed by me as part of my Post Graduate program in Data Science from Purdue University through Simplilearn. This project was for final course completion.

Business Scenario

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Build a model to accurately predict whether the patients in the dataset have diabetes or not.

Analysis Steps

Data cleaning and exploratory data analysis

[Figure: histograms of the variables]

There are integer as well as float data types among the variables in this dataset. Create a count (frequency) plot describing the data types and the count of variables.

[Figure: count plot of data types]

Check the balance of the data (to review imbalanced classes for the classification problem) by plotting the count of outcomes by their value. Review the findings and plan the future course of action.

[Figure: class imbalance plot]

We notice that there is class imbalance. The diabetic class (1) is the minority class, with 35% of the samples, while the non-diabetic class (0) accounts for 65% of the total samples. We need to balance the data by oversampling the minority class or undersampling the majority class; this helps ensure the model performs well across both classes. We can apply the SMOTE (synthetic minority oversampling technique) method to balance the samples by oversampling the minority class (class 1, diabetic), since we want the model to more accurately predict when an individual has diabetes.
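
A minimal sketch of that balancing step using the imbalanced-learn package, with SMOTE applied to the training split only so that no synthetic samples leak into evaluation (file and column names follow the standard Pima diabetes CSV):

```python
# Oversample the minority (diabetic) class with SMOTE.
from collections import Counter

import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

df = pd.read_csv("diabetes.csv")
X, y = df.drop(columns="Outcome"), df["Outcome"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("Before:", Counter(y_train))  # roughly 65% class 0, 35% class 1
print("After: ", Counter(y_res))    # balanced 50/50 after oversampling
```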

Create scatter charts between pairs of variables to understand the relationships. Describe the findings.

[Figure: pair plots of the variables]

We review the scatter charts to analyse inter-relations between the variables and observe the following:

Perform correlation analysis. Visually explore it using a heat map.

[Figure: correlation matrix heat map]

Observation: As mentioned in the pair-plot analysis, the variable Glucose has the highest correlation with the outcome.
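
A minimal sketch of the correlation heat map, assuming the diabetes data is loaded into a pandas DataFrame:

```python
# Correlation matrix of the diagnostic variables, visualized with seaborn.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("diabetes.csv")

plt.figure(figsize=(9, 7))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix of diagnostic variables")
plt.tight_layout()
plt.show()
```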

Model Building

[Figure: confusion matrix]

Note: The ROC (Receiver Operating Characteristic) curve tells us how well the model can distinguish between two classes (e.g., whether a patient has a disease or not). Better models can accurately distinguish between the two, whereas a poor model will have difficulty doing so. This is quantified in the AUC score.

Final Analysis: Based on the classification report:
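
To illustrate the ROC/AUC note above, here is a minimal sketch on a synthetic binary problem standing in for the diabetes data (the 65/35 class weighting mirrors the dataset):

```python
# Compute the ROC curve and AUC score for a held-out test set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=768, weights=[0.65], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
prob = model.predict_proba(X_test)[:, 1]  # P(class 1) for each test row

fpr, tpr, _ = roc_curve(y_test, prob)       # points along the ROC curve
print("AUC:", roc_auc_score(y_test, prob))  # 1.0 = perfect, 0.5 = random
```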

Data Reporting

[Figure: Tableau dashboard]

Tools used:

This project was completed in Python using Jupyter notebooks. Common libraries used for analysis include NumPy, pandas, scikit-learn, Matplotlib, seaborn, and XGBoost.


Data Science


DS 401 Capstone Projects

Selected projects from 2022:

P1: Public Health - Overdose Data Dashboard  

P2: USDA Commodity Dashboard(s)

P3: Classification and Analysis Pipeline of Political Video Advertisements - Dashboard with Google Cloud

P4: CSAFE Assessing and modeling quality of 3d topographic scans of fired bullets

P5: Department of Residence - The Impact of Living on Campus on Student Success

The Public Science Collaborative at ISU is looking for advanced data science students to join a research project focusing on the engineering and visualization of public health data. The key task for the spring semester will be to build an opioid overdose data dashboard similar to this one in California. DS 401 interns will work in a supervised, collaborative team science environment to clean, analyze, and visualize data from four data sets, including a) vital statistics mortality data, b) emergency department overdose data, c) substance abuse treatment episode data, and d) the Iowa Youth Survey dataset. We are a pluralistic coding environment and welcome students using Python, R, Stata, SAS, and other data management and analytic platforms.

Because this project is funded by an Overdose Data to Action grant from the Centers for Disease Control, students who are accepted to the project will have the opportunity to pair a funded research assistantship with their DS 401 internship. This opportunity would be an especially good fit for students who are interested in data visualization and data science communication.

Project advisor: Shawn F. Dorius - Associate Professor of Sociology

DS401-S22-P1-Poster_PublicHealth.pdf

Students selecting this project will develop a series of dashboards using Tableau. These dashboards will utilize data from the USDA Agricultural Census to show trends in production for selected commodities (such as apples, cheese, grapes, dairy, pork, lettuce, tomatoes, potatoes, strawberries, bees, and honey or wine). Trends may also include the monthly or annual quantity, the number of producers, acres in production, total sales, and other metrics at multiple geographies (county, state, nation). Students will also incorporate demographic data for selected areas of interest that highlights the potential regional market and the market and consumption profile (food expenditures, farmers' market density, schools with farm-to-school programs, etc.). Students will be provided with access to Tableau and Tableau Server and will utilize R’s TidyCensus package to acquire data from the American Community Survey (ACS).

Project advisors:

Christopher J. Seeger, PLA, GISP - Professor, GIS Specialist and Director of Extension Indicators Program and 2022 DSPG Chair

Bailey Hanson, GISP - GIS Specialist; leads the GIS program and the Data for Decision Makers program. Her background includes a Master's in Human-Computer Interaction.

DS401-S22-P2-Poster_USDA.pdf

Using the public data from the Google Transparency Report, this project will create a pipeline for extracting, processing, classifying, and visualizing the Political Ads data on the Google Cloud computing platform.

Campaign advertising through social media platforms has been growing at a high rate, creating a large volume of content on the Internet. To increase transparency in federal campaign advertising, Google Inc. created the Google Transparency Report (GTR). GTR provides websites and searchable databases about federal election campaign ads aired on Google and partners’ platforms. According to GTR, political advertisers have spent around $800M on election campaigns since May 2018.

This project built a platform for collecting video ads aired on YouTube and for automated content analysis. It is able to: 1) automatically classify a video ad as either political or non-political; 2) analyze predicted political ads into one of the types of interest to political science scholars: promote, attack, or contrast; 3) extract issues of interest for political science research; 4) determine the polarity and subjectivity of a given ad; and 5) create various visualization charts from the preceding analysis.

Adisak Sukul - Associate Teaching Professor, Computer Science; instructor for Data Science courses; Google Cloud Faculty Expert

DS401-S22-P3-Poster_GoogleTransparencyReport.pdf

A large part of a forensic examiner’s job is to visually compare evidence to decide whether two pieces of evidence come from the same source (e.g. bullets fired from the same barrel, prints from the same shoe, the same finger).

3D digital microscopy provides a basis for bringing in algorithms to make comparisons of evidence objective and to quantify similarities (or dissimilarities). The high-resolution microscopy lab at Iowa State has acquired scans of bullet lands.

Good-quality scans are essential for assessing the similarity of the striations (the marks engraved on the bullet as it passes through the barrel). 

In this project, the goal is to derive features capturing (aspects of) the quality of scans and build a model to predict a quality indicator. Ideally, this feedback will be given at the time of scanning, such that a lack of quality can be addressed immediately.

Students will work under the guidance of Dr. Heike Hofmann to derive features capturing scan quality, work on a model incorporating these scan analytics, and depending on time, design an app for giving feedback to scanning personnel.

Preferred skills: proficiency in R; knowledge of HTML/JavaScript would be a plus.

Heike Hofmann, Professor and Professor in Charge of the Data Science Program, Department of Statistics

Final R Package:  https://github.com/heike/DS401

DS401-S22-P4-Poster_CSAFE.pdf

Project Description: The Department of Residence is interested in understanding how living on campus, both your first year and subsequent years after, impacts student success measures such as graduation and retention.  We’re also looking to understand whether those impacts are the same or different for different sub-groups of students (such as students of color, first-generation students, etc.).  The audience for this data would be considered a non-technical audience, with a limited background in understanding and analyzing data.  The data file is already compiled and will be provided to this team.  No preference for analysis software. 

This project contains sensitive and private information. All of the findings from this project will remain private.

Dr. Elizabeth Housholder serves as the Senior Research Analyst for the Department of Residence.

DS401-S22-P5-Poster_Residence.pdf


High-dimensional Statistical Learning, Causal Inference, Robust ML, Fair ML

Post-Prediction Inference on Political Twitter

  • Group members: Luis Ledezma-Ramos, Dylan Haar, Alicia Gunawan

Abstract: Having observed data seems to be a necessary requirement to conduct inference, but what happens when observed outcomes cannot easily be obtained? The simplest practice seems to proceed with using predicted outcomes, but without any corrections this can result in issues like bias and incorrect standard errors. Our project studies a correction method for inference conducted on predicted, not observed outcomes—called post-prediction inference—through the lens of political data. We are investigating the kinds of phrases or words in a tweet that will most strongly indicate a person’s political alignment to US politics. We have discovered that these correction techniques are promising in their ability to correct for post-prediction inference in the field of political science.

NFL-Analysis

  • Group members: Jonathan Langley, Sujeet Yeramareddy, Yong Liu

Abstract: After researching a new inference correction approach called post-prediction inference, we chose to apply it to sports analysis based on NFL games. We designed a model that can predict the spread of a football game, such as which team will win and what the margin of victory will be. We then analyzed the most and least important features so that we can accurately correct inference for these variables and more accurately understand their impact on our response variable, Spread.

Machine Learning (TBA)

Investigation on Latent Dirichlet Allocation

  • Group members: Duha Aldebakel, Rui Zhang, Anthony Limon, Yu Cao

Abstract: We explore both Markov chain Monte Carlo algorithms and variational inference methods for Latent Dirichlet Allocation (LDA), a generative probabilistic topic model for data such as text. As a generative model, LDA treats data as observations arising from a generative probabilistic process that includes hidden variables, i.e., structure we want to find in the data. Topic modeling allows us to organize, understand, and annotate documents according to the discovered structure. For text data, the hidden variables reflect the thematic structure of a corpus that we cannot observe directly; we only have access to the documents of the collection themselves. Our aim is to infer this hidden structure through posterior inference, that is, to compute the conditional distribution of the hidden variables given our observations, and we use our knowledge from Q1 about inference methods to solve this problem.

Wildfire and Environmental Data Analysis

Machine Learning for Physical Systems: Locating Sound with Machine Learning

  • Group members: Raymond Zhao, Brady Zhou

Abstract: In this domain, we learned about methods for localizing sound waves using special devices called microphone arrays. Broadly speaking, such a device can figure out what a sound is and where it came from. With the growing ubiquity of microphone devices, we find this to be a potentially useful application. The baseline method involves what is called an "affine mapping", which is essentially another form of linear transformation. In this project, we examined how machine learning techniques such as neural networks, support vector machines, and random forests may benefit (or not benefit) this field.

Environmental Monitoring, Remote Sensing, Cyber-Physical Systems, Engineers for Exploration

E4E Microfaune Project

  • Group members: Jinsong Yang, Qiaochen Sun

Abstract: Nowadays, human activities such as wildfires and hunting have become the largest factors with serious negative effects on biodiversity. To deeply understand how anthropogenic activities affect wildlife populations, field biologists utilize automated image classification driven by neural networks to extract relevant biodiversity information from images. However, for small animals such as insects or birds, cameras do not work very well: it is extremely hard for cameras to capture the movement and activities of small animals. To solve this problem, passive acoustic monitoring (PAM) has become one of the most popular methods. We can use the sounds collected through PAM to train machine learning models that tell us how the biodiversity of these small animals fluctuates. The goal of the whole program is to assess the biodiversity of these small animals (most of them birds), but the program can be divided into many small parts. Jinsong and I will focus on an intermediate step: generating subsets of audio recordings that have a higher probability of containing vocalizations of interest, which helps our labeling volunteers save time and energy. This reduces the amount of time and resources required to obtain enough training data for species-level classifiers. We perform the same processing as in AID_NeurIPS_2021; only the data differs between the two GitHub repositories. Here, we use the Peru data instead of the Coastal_Reserve data.

  • Group members: Harsha Jagarlamudi, Kelly Kong

Eco-Acoustic Event Detection: Classifying temporal presence of birds in recorded bird vocalization audio

  • Group members: Alan Arce, Edmundo Zamora

Abstract: Leveraging deep learning methods to classify the temporal presence of birds in recorded bird vocalization audio, using a hybrid CNN-RNN model trained on audio data, in the interest of benefiting wildlife monitoring and preservation.

Pyrenote - User Profile Design & Accessible Data

  • Group members: Dylan Nelson

Abstract: Pyrenote is a project in development by a growing group of student researchers here at UCSD. Its primary purpose is to allow anyone to contribute to research by labeling data in an intuitive and accessible way. Right now it is being used to develop a kind of voice recognition for birds: the goal is an algorithm that can strongly label data (say, where in the clip a bird is calling and which bird is making the call). To do this, a very large dataset needs to be labeled. I worked mostly on the user experience side, allowing users to interact with their labeling in new ways, such as keeping tabs on their progress and reaching goals. The User Profile page was the primary means of surfacing this data and was developed iteratively as a whole new page for the site.

Pyrenote Web Developer

  • Group members: Wesley Zhen

Abstract: The website, Pyrenote, is helping scientists track bird populations by identifying them using machine learning classifiers on publicly annotated audio recordings. I have implemented three features over the course of two academic quarters aimed at streamlining user experience and improving scalability. The added scalability will be useful for future projects as we start becoming more ambitious with the number of users we bring to the site.

Spread of Misinformation Online

Who Is Spreading Misinformation and Worries on Twitter

  • Group members: Lehan Li, Ruojia Tao

Abstract: The spread of misinformation over social media poses challenges to daily information intake and exchange. Especially under the current COVID-19 pandemic, the spread of misinformation regarding the disease and vaccination threatens individuals' well-being and general public health. People's worries also increase with misinformation, such as rumors of food and water shortages. This project investigates the spread of misinformation over social media (Twitter) during the COVID-19 pandemic. Two main directions are investigated. The first is an analysis of the effect of bot users on the spread of misinformation: what role do bot accounts play in spreading misinformation, and where are they located in the social network? The second is sentiment analysis examining users' attitudes toward misinformation: how does sentiment spread across different places in the social network? We also mix the two directions and ask how bot users relate to positive and negative emotions. Since online social media users form social networks, the project further investigates the effect of network structure on both topics, including how the proportion of bot users and users' attitudes toward misinformation change as the network becomes more concentrated and tightly connected.

Misinformation on Reddit

  • Group members: Samuel Huang, David Aminifard

Abstract: As social media has grown in popularity, Reddit in particular, its use for rapidly sharing information organized by categories or topics (subreddits) has had massive implications for how people are exposed to information and for the quality of the information they interact with. While Reddit has its benefits, e.g., providing near-instant access to real-time, categorized information, it has possibly played a role in worsening divisions and the spread of misinformation. Our results showed that subreddits with the highest proportions of misinformation posts tend to lean toward politics and news. In addition, we found that despite the frequency of misinformation per subreddit, the average upvote ratio per submission seemed consistently high, which indicates that subreddits tend to be ideologically homogeneous.

The Spread of YouTube Misinformation Through Twitter

  • Group members: Alisha Sehgal, Anamika Gupta

Abstract: In our capstone project, we explore the spread of misinformation online. More specifically, we look at the spread of misinformation across Twitter and YouTube because of the large role these two social media platforms play in the dissemination of news and information. Our main objectives are to understand how YouTube videos contribute to spreading misinformation on Twitter, to evaluate how effectively YouTube removes misinformation, and to ask whether these policies also prevent users from engaging with misinformation. We take a novel approach of analyzing tweets, YouTube video captions, and other metadata using NLP to determine the presence of misinformation and investigate how individuals interact with or spread it. Our research focuses on the domain of public health, as this is the subject of many conspiracies, varying opinions, and fake news.

Particle Physics

Understanding Higgs Boson Particle Jets with Graph Neural Networks

  • Group members: Charul Sharma, Rui Lu, Bryan Ambriz

Abstract: Extending last quarter's work on deep sets neural networks, fully connected neural network classifiers, adversarial deep set models, and the designed decorrelated tagger (DDT), this quarter we went further by trying different graph neural network layers such as GENConv and EdgeConv. These layers play incredibly important roles in boosting the performance of our basic GNN model. We also evaluated the model using ROC (receiver operating characteristic) curves and their AUC (area under the curve). Meanwhile, based on experience from project one and past projects in the particle physics domain, we added an exploratory data analysis section covering basic theory, bootstrapping, and sanity checks on our dataset. We have not yet produced fully optimal outcomes: the EdgeConv part is finished, and in the following weeks we plan to finish GENConv and possibly try other layers to see how much more performance we can gain.
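For readers unfamiliar with EdgeConv, here is a minimal PyTorch Geometric sketch of an EdgeConv-based jet tagger; the feature counts and layer sizes are hypothetical, and a GENConv layer could be swapped into the same slot:

```python
import torch
import torch.nn as nn
from torch_geometric.data import Data
from torch_geometric.nn import EdgeConv, global_mean_pool

class JetTagger(nn.Module):
    """Minimal EdgeConv-based binary jet tagger (illustrative sizes)."""
    def __init__(self, in_feats=4, hidden=64):
        super().__init__()
        # EdgeConv applies its MLP to [x_i, x_j - x_i] for each edge.
        self.conv = EdgeConv(nn.Sequential(
            nn.Linear(2 * in_feats, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden)))
        self.head = nn.Linear(hidden, 2)           # QCD vs. signal logits

    def forward(self, data):
        h = self.conv(data.x, data.edge_index)
        h = global_mean_pool(h, data.batch)        # one vector per jet
        return self.head(h)

# One toy "jet": 10 particles, 4 features each, fully connected edges.
x = torch.randn(10, 4)
idx = torch.combinations(torch.arange(10), r=2).t()
edge_index = torch.cat([idx, idx.flip(0)], dim=1)  # both edge directions
jet = Data(x=x, edge_index=edge_index, batch=torch.zeros(10, dtype=torch.long))
print(JetTagger()(jet).shape)                      # torch.Size([1, 2])
```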

Predicting a Particle's True Mass

  • Group members: Jayden Lee, Dan Ngo, Isac Lee

Abstract: The Large Hadron Collider (LHC) collides protons traveling near light speed to generate high-energy collisions. These collisions produce new particles and have led to the discovery of new elementary particles (e.g., the Higgs boson). One key piece of information to collect from a collision event is the structure of the particle jet, a collective spray of decaying particles that travel in the same direction, since accurately identifying the type of these jets - QCD or signal - plays a crucial role in the discovery of high-energy elementary particles like the Higgs. Several properties determine jet type, with jet mass being one of the strongest indicators in jet type classification. A previously studied jet mass estimation method, "soft drop declustering," has been one of the most effective ways to make rough estimates of jet mass. With this in mind, we aim to apply machine learning to jet mass estimation through various neural network architectures. With data collected and processed by CERN, we implemented a model capable of improving jet mass prediction from jet features.

Mathematical Signal Processing (compression of deep nets, or optimization for data-science/ML)

Graph Neural Networks: Graph Neural Network Based Recommender Systems for Spotify Playlists

  • Group members: Benjamin Becze, Jiayun Wang, Shone Patil

Abstract: With the rise of music streaming services on the internet in the 2010s, many listeners have moved away from radio stations to streaming services like Spotify and Apple Music. This shift offers more specificity and personalization in users' listening experiences, especially with the ability to create playlists of whatever songs they wish. Often user playlists share a genre or theme across songs, and some streaming services like Spotify offer recommendations to expand a user's existing playlist based on the songs in it. Using the Node2vec and GraphSAGE graph neural network methods, we set out to create a recommender system for songs to add to an existing playlist, drawing information from a vast graph of songs built from playlist co-occurrences. The result is a personalized song recommender based not only on Spotify's community of playlist creators, but also on the specific features within a song.
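A toy sketch of the node2vec side of this idea, using uniform random walks (node2vec with p = q = 1) over a playlist co-occurrence graph and gensim's skip-gram Word2Vec; the playlists and hyperparameters below are made up:

```python
import random
import networkx as nx
from gensim.models import Word2Vec

# Hypothetical toy data: each playlist is a list of track IDs; songs that
# co-occur in a playlist get an edge in the song graph.
playlists = [["a", "b", "c"], ["b", "c", "d"], ["c", "d", "e"], ["a", "e"]]
G = nx.Graph()
for pl in playlists:
    for i, s in enumerate(pl):
        for t in pl[i + 1:]:
            G.add_edge(s, t)

def random_walk(g, start, length=10):
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(list(g.neighbors(walk[-1]))))
    return walk

# node2vec reduces to uniform random walks when p = q = 1; embed the walks
# with skip-gram Word2Vec to get one vector per song.
walks = [random_walk(G, n) for n in G.nodes for _ in range(20)]
model = Word2Vec(walks, vector_size=32, window=3, min_count=1, sg=1, epochs=5)
print(model.wv.most_similar("a", topn=2))   # candidate songs to recommend
```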

Dynamic Stock Industry Classification

  • Group members: Sheng Yang

Abstract: We use graph-based analysis to re-classify stocks in the China A-share market and improve Markowitz portfolio optimization.

NLP, Misinformation

HDSI Faculty Exploration Tool

  • Group members: Martha Yanez, Sijie Liu, Siddhi Patel, Brian Qian

Abstract: The Halıcıoğlu Data Science Institute (HDSI) at the University of California, San Diego is dedicated to the discovery of new methods and the training of students and faculty to use data science to solve problems in the current world. HDSI has several industry partners that often search for assistance with their daily activities and need experts in different domain areas. Currently, around 55 professors are affiliated with HDSI. They have diverse research interests and have written numerous papers in their own fields. Our goal was to create a tool that allows HDSI to select the best fit from its faculty, based on their published work, to aid industry partners in their specific endeavors. We did this with natural language processing (NLP), collecting the abstracts of the faculty's published work and organizing them by topic. We then obtained the proportion of each faculty member's papers associated with each topic and drew a relationship between researchers and their most-published topics. This allows HDSI to personalize recommendations of faculty candidates to an industry partner's particular job.
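One common way to organize abstracts by topic, shown here purely as an illustration (the write-up does not specify the exact algorithm), is Latent Dirichlet Allocation over bag-of-words counts:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for faculty paper abstracts.
abstracts = [
    "deep learning for medical image segmentation",
    "causal inference with observational health data",
    "graph neural networks for particle physics",
    "bayesian methods for clinical trial design",
]

counts = CountVectorizer(stop_words="english").fit_transform(abstracts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each row is one abstract's topic mixture; averaging rows per faculty
# member would give the per-faculty topic proportions described above.
print(lda.transform(counts).round(2))
```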

  • Group members: Du Xiang

AI in Healthcare, Deep Reinforcement Learning, Trustworthy Machine Learning

Improving Robustness in Deep Fusion Modeling Against Adversarial Attacks

  • Group members: Ayush More, Amy Nguyen

Abstract: Autonomous vehicles rely heavily on deep fusion models, which draw on multiple input sources for inference and decision making. By using data from several inputs, a deep fusion model benefits from shared information, which is primarily associated with robustness, as the input sources can face different levels of corruption. It is therefore highly important that the deep fusion models used in autonomous vehicles are robust to corruption, especially on input sources that are weighted more heavily in different conditions. We explore a different approach to building robustness into a deep fusion model: adversarial training. We fine-tune the model on adversarial examples and evaluate its robustness against single-source noise and other forms of corruption. Our experimental results show that adversarial training was effective in improving the robustness of a deep fusion object detector against adversarial noise and Gaussian noise while maintaining performance on clean data. The results also highlight the lack of robustness of models that are not trained to handle adversarial examples. We believe this is relevant given the risks autonomous vehicles pose to pedestrians: it is important to ensure the inferences and decisions made by the model are robust against corruption, especially when it is intentional and comes from outside threats.
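A minimal sketch of adversarial fine-tuning with FGSM-generated examples, on a stand-in linear classifier rather than a fusion object detector; the 50/50 clean/adversarial mix and epsilon are illustrative choices:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.03):
    """Generate FGSM adversarial examples: one signed-gradient step."""
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).detach()

def adversarial_step(model, opt, x, y, eps=0.03):
    """One training step on a mix of clean and adversarial inputs, so the
    model keeps clean accuracy while gaining robustness."""
    x_adv = fgsm(model, x, y, eps)
    opt.zero_grad()
    loss = 0.5 * F.cross_entropy(model(x), y) \
         + 0.5 * F.cross_entropy(model(x_adv), y)
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with a linear classifier on random data.
model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
print(adversarial_step(model, opt, x, y))
```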

Healthcare: Adversarial Defense In Medical Deep Learning Systems

  • Group members: Rakesh Senthilvelan, Madeline Tjoa

Abstract: To combat such adversarial instances, models must be robustly trained to best protect against the methods these attacks use on deep learning systems. In the scope of this paper, we look into the fast gradient sign method (FGSM) and projected gradient descent (PGD), two methods used in adversarial attacks to maximize the loss function and push the affected system toward opposing predictions, and we train our models against them to achieve stronger accuracy when faced with adversarial examples.
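As an illustration of the stronger of the two attacks, here is a minimal PGD implementation under the usual L-infinity threat model; epsilon, step size, and step count are placeholder values:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.03, alpha=0.01, steps=10):
    """Projected gradient descent: repeated signed-gradient steps, each
    followed by projection back into the L-inf ball of radius eps."""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)  # random start
    for _ in range(steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        F.cross_entropy(model(x_adv), y).backward()
        with torch.no_grad():
            x_adv = x_adv + alpha * x_adv.grad.sign()    # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)     # project
    return x_adv.detach()

model = torch.nn.Linear(10, 2)
x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
print((pgd_attack(model, x, y) - x).abs().max())  # bounded by eps
```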

Satellite image analysis

ML for Finance, ML for Healthcare, Fair ML, ML for Science, Actionable Recourse

  • Group members: Shweta Kumar, Trevor Tuttle, Takashi Yabuta, Mizuki Kadowaki, Jeffrey Feng

Abstract: In American society today there is a constant, encouraged reliance on credit, despite credit not being available to everyone as a legal right. Currently, countless methods of evaluating an individual's creditworthiness are in practice. In an effort to regulate the selection criteria of different financial institutions, the Equal Credit Opportunity Act (ECOA) requires that applicants denied a loan receive an Adverse Action notice, a statement from the creditor explaining the reason for the denial. However, these adverse action notices are frequently unactionable and ineffective at providing feedback that gives an individual recourse, the ability to act upon a reason for denial to raise one's odds of getting accepted for a loan. In our project, we explore whether it is possible to create an interactive interface that personalizes adverse action notices in alignment with personal preferences, so individuals can gain recourse.

Social Media; Online Communities; Text Analysis; Ethics

Finding Commonalities in Misinformative Articles Across Topics

  • Group members: Hwang Yu, Maximilian Halvax, Lucas Nguyen

Abstract: To combat the large-scale distribution of misinformation online, we wanted to develop a way to flag news articles that are misinformative and could potentially mislead the general public. In addition to flagging news articles, we also wanted to find commonalities among the misinformation we found. Did some topics in particular contain more misleading information than others? How much overlap do these articles have when we break their content down with TF-IDF and examine which words carry the most importance across various misinformation-detection models? We trained our models on four different topics: economics, politics, science, and general, a dataset encompassing the three previous topics. We found that general included the most overlap overall, while the specific topics, though mostly distinct from one another, had certain models that still emphasized similar words, indicating a possible pattern of misinformative language in these articles. From these results, we believe we can find patterns that direct further investigation into how misinformation is written and distributed online.
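A stripped-down sketch of this TF-IDF pipeline with made-up example articles; inspecting the largest model coefficients is one way to compare which words different topic models lean on:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled articles: 1 = misinformative, 0 = reliable.
texts = ["miracle cure doctors hate", "fed raises interest rates",
         "aliens control the election", "senate passes budget bill"]
labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(texts), labels)

# The largest coefficients surface the words the classifier leans on most,
# which is how word-importance overlap across topic models can be compared.
top = np.argsort(clf.coef_[0])[-3:]
print(vec.get_feature_names_out()[top])
```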

The Effect of Twitter Cancel Culture on the Music Industry

  • Group members: Peter Wu, Nikitha Gopal, Abigail Velasquez

Abstract: Musicians often trend on social media for various reasons, but in recent years there has been a rise in musicians being "canceled" for committing offensive or socially unacceptable behavior. Due to the wide accessibility of social media, the masses are able to hold musicians accountable for their actions through "cancel culture," a form of modern ostracism. Twitter has become a well-known platform for "cancel culture," as users can easily spread hashtags and see what's trending, which also has the potential to facilitate the spread of toxicity. We analyze how public sentiment towards canceled musicians on Twitter changes with respect to the type of issue they were canceled for, their background, and the strength of their parasocial relationship with their fans. Through our research, we aim to determine whether "cancel culture" leads to an increase in toxicity and negative sentiment towards a canceled individual.

Analyzing single cell multimodality data via (coupled) autoencoder neural networks

Coupled Autoencoders for Single-Cell Data Analysis

  • Group members: Alex Nguyen, Brian Vi

Abstract: Historically, analysis of single-cell data has been difficult to perform, because data collection methods often destroy the cell in the process of collecting information. However, an ongoing endeavor of biological data science has been to analyze the different modalities, or forms, of the genetic information within a cell. Doing so will give modern medicine a greater understanding of cellular functions and how cells behave in the context of illness. Information on the three modalities of DNA, RNA, and protein can be collected safely, and because they are the same information in different forms, analysis done on one can be extrapolated to understand the cell as a whole. Previous research by Gala, R., Budzillo, A., Baftizadeh, F., et al. captured gene expression in neurons with a neural network called a coupled autoencoder. This autoencoder framework can reconstruct its inputs, allowing prediction from one input to another, as well as align the multiple inputs in the same low-dimensional representation. In our paper, we build this coupled autoencoder on a dataset of cells taken from several sites of the human body, predicting protein information from RNA. We find that the autoencoder adequately clusters the cell types in its lower-dimensional representation and performs decently on the prediction task, showing that the coupled autoencoder is a powerful tool that may prove a valuable asset in single-cell data analysis.
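A minimal PyTorch sketch of the coupled-autoencoder idea: two modality-specific autoencoders whose latent codes are pulled together by an alignment penalty, enabling RNA-to-protein prediction. The dimensions below are hypothetical stand-ins:

```python
import torch
import torch.nn as nn

class CoupledAE(nn.Module):
    """Two autoencoders (RNA and protein) coupled by an alignment loss."""
    def __init__(self, rna_dim=2000, prot_dim=134, latent=16):
        super().__init__()
        self.enc_rna = nn.Sequential(nn.Linear(rna_dim, 128), nn.ReLU(), nn.Linear(128, latent))
        self.dec_rna = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(), nn.Linear(128, rna_dim))
        self.enc_prot = nn.Sequential(nn.Linear(prot_dim, 64), nn.ReLU(), nn.Linear(64, latent))
        self.dec_prot = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(), nn.Linear(64, prot_dim))

    def loss(self, rna, prot):
        z_r, z_p = self.enc_rna(rna), self.enc_prot(prot)
        recon = ((self.dec_rna(z_r) - rna) ** 2).mean() \
              + ((self.dec_prot(z_p) - prot) ** 2).mean()
        align = ((z_r - z_p) ** 2).mean()        # couple the latent spaces
        return recon + align

    def rna_to_protein(self, rna):
        return self.dec_prot(self.enc_rna(rna))  # cross-modal prediction

model = CoupledAE()
rna, prot = torch.randn(8, 2000), torch.randn(8, 134)
print(model.loss(rna, prot).item(), model.rna_to_protein(rna).shape)
```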

Machine Learning, Natural Language Processing

On Evaluating the Robustness of Language Models with Tuning

  • Group members: Lechuan Wang, Colin Wang, Yutong Luo

Abstract: Prompt tuning and prefix tuning are two effective mechanisms for leveraging frozen language models to perform downstream tasks. Robustness reflects a model's resilience in its output under change or noise in the input. In this project, we analyze the robustness of natural language models under various tuning methods with respect to a domain shift (i.e., training on one domain but evaluating on out-of-domain data). We apply both prompt tuning and prefix tuning to T5 models for reading comprehension (i.e., question answering) and to GPT-2 models for table-to-text generation.
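A minimal sketch of prompt tuning on T5 using the Hugging Face peft library (one plausible tooling choice, not necessarily the one used here); prefix tuning would use PrefixTuningConfig in the same slot:

```python
from transformers import AutoModelForSeq2SeqLM
from peft import PromptTuningConfig, TaskType, get_peft_model

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Prompt tuning freezes the base model and learns only `num_virtual_tokens`
# soft prompt embeddings prepended to every input; prefix tuning instead
# inserts trainable key/value prefixes at every attention layer.
config = PromptTuningConfig(task_type=TaskType.SEQ_2_SEQ_LM,
                            num_virtual_tokens=20)
model = get_peft_model(model, config)
model.print_trainable_parameters()   # only the soft prompt is trainable
```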

Activity Based Travel Models and Feature Selection

A Tree-Based Model for Activity-Based Travel Models and Feature Selection

  • Group members: Lisa Kuwahara, Ruiqin Li, Sophia Lau

Abstract: In a previous study, Deloitte Consulting LLP developed a method of creating city simulations through cellular location and geospatial data. Using these simulations of human activity and traffic patterns, better decisions can be made regarding modes of transportation or road construction. However, the current commonly used method of estimating transportation mode choice is a utility model that involves many features and coefficients that may not necessarily be important but still make the model more complex. Instead, we used a tree-based approach - in particular, XGBoost - to identify just the features that are important for determining mode choice so that we can create a model that is simpler, robust, and easily deployable, in addition to performing better than the original utility model on both the full dataset and population subsets.
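A small sketch of the tree-based idea on synthetic data: fit an XGBoost classifier for mode choice, then keep only the features with the highest gain-based importance. The feature counts and class labels below are made up:

```python
import numpy as np
import xgboost as xgb

# Hypothetical trip features; y is a 4-class mode choice
# (e.g. 0=walk, 1=bike, 2=car, 3=transit).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))
y = rng.integers(0, 4, size=1000)

model = xgb.XGBClassifier(n_estimators=100, max_depth=4,
                          objective="multi:softprob")
model.fit(X, y)

# Gain-based importances: keep only features that actually move the model.
keep = np.argsort(model.feature_importances_)[::-1][:5]
print("top features:", keep)
```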

Explainable AI, Causal Inference

Explainable AI

  • Group members: Jerry Chan, Apoorv Pochiraju, Zhendong Wang, Yujie Zhang

Abstract: Nowadays, algorithmic decision-making systems are very common in people's daily lives. Gradually, some algorithms have become too complex for humans to interpret, such as black-box machine learning models and deep neural networks. To assess the fairness of these models and make them better tools for different parties, we need explainable AI (XAI) to uncover the reasoning behind the predictions made by black-box models. In our project, we focus on using techniques from causal inference and explainable AI to interpret various classification models across several domains. In particular, we are interested in three domains: healthcare, finance, and the housing market. Within each domain, we first train four binary classification models, and we have four goals in general: 1) explaining black-box models both globally and locally with various XAI methods; 2) assessing the fairness of each learning algorithm with regard to different sensitive attributes; 3) generating recourse for individuals, a set of minimal actions that changes the prediction of a black-box model; and 4) evaluating the explanations from those XAI methods using domain knowledge.
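As one example of a local/global XAI method of the kind described (the abstract does not commit to a specific library), SHAP values can be computed model-agnostically:

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Model-agnostic explainer over predicted probabilities: each row of
# .values attributes one prediction to the input features (a local
# explanation); averaging absolute values gives a global picture.
background = X.iloc[:50]
explainer = shap.Explainer(lambda d: model.predict_proba(d)[:, 1], background)
sv = explainer(X.iloc[:10])
print(sv.values.shape)   # (10, n_features)
```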

AutoML Platforms

Deep Learning Transformer Models for Feature Type Inference

  • Group members: Andrew Shen, Tanveer Mittal

Abstract: The first step AutoML software must take after loading the data is to identify the feature types of the individual columns in the input. This information allows the software to understand the data and preprocess it so machine learning algorithms can run on it. Project SortingHat of the ADA Lab at UCSD frames this task of feature type inference as a multiclass machine learning classification problem. Models defined in the original SortingHat feature type inference paper use three sets of features as input: (1) the name of the given column, (2) five non-null sample values, and (3) descriptive numeric statistics about the column. The textual features are easy to access; however, the descriptive statistics previous models rely on require a full pass through the data, which makes preprocessing less scalable. Our goal is to produce models that rely less on these statistics by better leveraging the textual features. As an extension of Project SortingHat, we experimented with deep learning transformer models and with varying the sample sizes used by random forest models. We found that our transformer models achieved state-of-the-art results on this task, outperforming all existing tools and ML models benchmarked against SortingHat's ML Data Prep Zoo. Our best model uses a pretrained Bidirectional Encoder Representations from Transformers (BERT) language model to produce word embeddings, which are then processed by a convolutional neural network (CNN). As a result of this project, we have published two BERT-CNN models through the PyTorch Hub API, so software engineers can easily integrate our models or train similar ones for AutoML platforms and other automated data preparation applications. Our best model uses all the features defined above, while the other uses only column names and sample values, offering comparable performance and much better scalability for all input data.
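A simplified sketch of the BERT-plus-CNN idea, not the published SortingHat model itself: embed the column name and sample values with a frozen BERT, convolve over the token embeddings, and classify into the 9 SortingHat feature types:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

class ColumnTypeCNN(nn.Module):
    """BERT token embeddings for 'column name + sample values' text,
    a 1-D convolution over tokens, and a 9-way feature-type head."""
    def __init__(self, n_types=9):
        super().__init__()
        self.conv = nn.Conv1d(768, 128, kernel_size=3, padding=1)
        self.head = nn.Linear(128, n_types)

    def forward(self, texts):
        enc = tok(texts, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():                     # frozen BERT for the sketch
            emb = bert(**enc).last_hidden_state   # (batch, seq, 768)
        h = torch.relu(self.conv(emb.transpose(1, 2)))
        return self.head(h.max(dim=2).values)    # max-pool over tokens

model = ColumnTypeCNN()
print(model(["price 9.99 12.50 3.10", "zipcode 92093 10001"]).shape)
```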

Exploring Noise in Data: Applications to ML Models

  • Group members: Cheolmin Hwang, Amelia Kawasaki, Robert Dunn

Abstract: In machine learning, models are commonly built to avoid what is known as overfitting: fitting a model exactly to its training data, which tends to cause poor performance on new, unseen examples. To generalize beyond the examples found in a given training set, models are therefore built with techniques that avoid fitting the data exactly. However, overfitting does not always behave the way one might expect, as we show by fitting models to data with controlled levels of noise. Specifically, some models fit exactly to data with high levels of noise still produce results with high accuracy, whereas others are more prone to overfitting.

Group Testing for Optimizing COVID-19 Testing

COVID-19 Group Testing Optimization Strategies

  • Group members: Mengfan Chen, Jeffrey Chu, Vincent Lee, Ethan Dinh-Luong

Abstract: The COVID-19 pandemic, which has persisted for more than two years, has been combated by efficient testing strategies that reliably identify positive individuals to slow the spread of disease. As opposed to other pooling strategies in this domain, the methods described in this paper prioritize true negative samples over overall accuracy. In our Monte Carlo simulations, both nonadaptive and adaptive testing strategies with random pool sampling reached accuracies of at least 95% across varying pool sizes and population sizes while decreasing the number of tests given. A split tensor rank-2 method attempts to identify all infected samples within 961 samples, converging to 99 tests as the prevalence of infection converges to 1%.
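For intuition on why pooling reduces test counts, here is a Monte Carlo simulation of classic two-stage Dorfman pooling, a simpler scheme than the tensor method described above:

```python
import numpy as np

def dorfman_tests(population, pool_size, prevalence, rng):
    """Two-stage pooling: test each pool once; retest members of positive
    pools individually. Returns the total number of tests used."""
    infected = rng.random(population) < prevalence
    tests = 0
    for start in range(0, population, pool_size):
        pool = infected[start:start + pool_size]
        tests += 1                       # one test for the whole pool
        if pool.any():
            tests += len(pool)           # individual follow-up tests
    return tests

rng = np.random.default_rng(0)
trials = [dorfman_tests(961, 31, 0.01, rng) for _ in range(1000)]
print("mean tests for 961 samples:", np.mean(trials))  # far fewer than 961
```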

Causal Discovery

Patterns of Fairness in Machine Learning

  • Group members: Daniel Tong, Anne Xu, Praveen Nair

Abstract: Machine learning tools are increasingly used for decision-making in contexts with crucial ramifications. However, a growing body of research has established that machine learning models are not immune to bias, especially on protected characteristics. This has led to efforts to create mathematical definitions of fairness that estimate whether, given a prediction task and a certain protected attribute, an algorithm is being fair to members of all classes. But just as philosophical definitions of fairness vary widely, mathematical definitions vary as well, and fairness conditions can in fact be mutually exclusive. In addition, the choice of model to optimize for fairness is a difficult decision we have little intuition about. Consequently, our capstone project centers on an empirical analysis of the relationships between machine learning models, datasets, and various fairness metrics. We produce a 3-dimensional matrix of the performance of a given machine learning model, for a given definition of fairness, on a given dataset. Using this matrix on a sample of 8 datasets, 7 classification models, and 9 fairness metrics, we discover empirical relationships between model type and performance on specific metrics, in addition to correlations between metric values across different dataset-model pairs. We also offer a website and command-line interface so users can perform this experimentation on their own datasets.
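Two of the most common fairness metrics in such a matrix can be computed in a few lines; the data below is random and only illustrates the definitions:

```python
import numpy as np

def demographic_parity_diff(y_pred, group):
    """Difference in positive-prediction rates between two groups."""
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equal_opportunity_diff(y_true, y_pred, group):
    """Difference in true-positive rates between two groups."""
    tpr = lambda g: y_pred[(group == g) & (y_true == 1)].mean()
    return abs(tpr(0) - tpr(1))

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)
y_pred = rng.integers(0, 2, 1000)
group = rng.integers(0, 2, 1000)     # protected attribute
print(demographic_parity_diff(y_pred, group),
      equal_opportunity_diff(y_true, y_pred, group))
```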

Causal Effects of Socioeconomic and Political Factors on Life Expectancy in 166 Different Countries

  • Group members: Adam Kreitzman, Maxwell Levitt, Emily Ramond

Abstract: This project examines causal relationships between various socioeconomic variables and life expectancy outcomes in 166 different countries. It can account for new, unseen data and variables through an intuitive, well-documented data pipeline built around the PC algorithm, with updated code to handle missingness in the data. With access to this model and pipeline, we hope that questions such as "do authoritarian countries have a direct relation to life expectancy?" or "how does women's representation in government affect perceived social support?" can now be answered and understood. Through our own analysis, we found intriguing results, such as that a higher Perception of Corruption is distinctly related to a lower Life Ladder score, and that higher quality-of-life perceptions are related to lower economic inequality. These results aim to educate not only the general public, but government officials as well.
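A minimal sketch of running the off-the-shelf PC algorithm with the causallearn package on toy data (the project's version additionally handles missing values):

```python
import numpy as np
from causallearn.search.ConstraintBased.PC import pc

# Toy data with a known chain X0 -> X1 -> X2 and no missing values.
rng = np.random.default_rng(0)
x0 = rng.normal(size=2000)
x1 = 0.8 * x0 + rng.normal(size=2000)
x2 = 0.8 * x1 + rng.normal(size=2000)
data = np.column_stack([x0, x1, x2])

cg = pc(data, alpha=0.05)        # constraint-based causal discovery
print(cg.G)                      # recovered graph over X0, X1, X2
```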

Time Series Analysis in Health

Time Series Analysis on the Effect of Light Exposure on Sleep Quality

  • Group members: Shubham Kaushal, Yuxiang Hu, Alex Liu

Abstract: The increase in artificial light exposure that has come with the growing prevalence of technology affects the human sleep cycle and circadian rhythm. The goal of this project is to determine how different colors and intensities of light exposure prior to sleep affect sleep quality, through the classification of time series data.

Sleep Stage Classification for Patients With Sleep Apnea

  • Group members: Kevin Chin, Yilan Guo, Shaheen Daneshvar

Abstract: Sleep is not uniform; it consists of four stages: N1, N2, N3, and REM sleep. The analysis of sleep stages is essential for understanding and diagnosing sleep-related diseases, such as insomnia, narcolepsy, and sleep apnea; however, sleep stage classifiers often do not generalize to patients with sleep apnea. The goal of our project is to build a sleep stage classifier specifically for people with sleep apnea and to understand how it differs from classifiers for typical sleepers. We then explore whether the inclusion and featurization of ECG data improves the performance of our model.

Environmental Health Exposures & Pollution Modeling & Land-Use Change Dynamics

Supervised Classification Approach to Wildfire Mapping in Northern California

  • Group members: Alice Lu, Oscar Jimenez, Anthony Chi, Jaskaranpal Singh

Abstract: Burn severity maps are an important tool for understanding fire damage and managing forest recovery. We have identified several issues with the current mapping methods used by federal agencies that affect the completeness, consistency, and efficiency of their burn severity maps. To address these issues, we demonstrate the use of machine learning as an alternative to traditional methods of producing severity maps, which rely on in-situ data and spectral indices derived from image algebra. We trained several supervised classifiers on sample data collected from 17 wildfires across Northern California and evaluated their performance at mapping fire severity.

Network Performance Classification

Network Signal Anomaly Detection

  • Group members: Laura Diao, Benjamin Sam, Jenna Yang

Abstract: Network degradation occurs in many forms, and our project will focus on two common factors: packet loss and latency. Packet loss occurs when one or more data packets transmitted across a computer network fail to reach their destination. Latency can be defined as a measure of delay for data to transmit across a network. For internet users, high rates of packet loss and significant latency can manifest in jitter or lag, which are indicators of overall poor network performance as perceived by the end user. Thus, when issues arise in these two factors, it would be beneficial for internet service providers to know exactly when the user is experiencing problems in real time. In real world scenarios, situations or environments such as poor port quality, overloaded ports, network congestion and more can impact overall network performance. In order to detect some of these issues in network transmission data, we built an anomaly detection system that predicts the estimated packet loss and latency of a connection and detects whether there is a significant degradation of network quality for the duration of the connection.

Real Time Anomaly Detection in Networks

  • Group members: Justin Harsono, Charlie Tran, Tatum Maston

Abstract: Internet companies are expected to deliver the speed their customers have paid for. However, for reasons such as congestion or connectivity issues, perceptible degradations in network quality are inevitable. To keep customers satisfied, monitoring systems must be built to inspect the quality of the connection. Our goal is to build a model that detects, in real time, these regions of network degradation so that appropriate recovery can be enacted to offset them. Our solution combines two anomaly detection methods that successfully detect shifts in the data based on a rolling window of recent observations.
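A toy version of the rolling-window idea: flag points that sit several rolling standard deviations from the rolling mean. The window size and threshold are placeholder values, and the real system combines two detectors:

```python
import numpy as np
import pandas as pd

def rolling_anomalies(series, window=60, threshold=3.0):
    """Flag points more than `threshold` rolling standard deviations away
    from the rolling mean: a simple real-time shift detector."""
    s = pd.Series(series)
    z = (s - s.rolling(window).mean()) / s.rolling(window).std()
    return z.abs() > threshold

# Toy latency trace with a degradation injected at t = 300.
rng = np.random.default_rng(0)
latency = rng.normal(20, 2, 500)
latency[300:] += 15
print(np.where(rolling_anomalies(latency))[0][:5])   # first flagged points
```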

System Usage Reporting

Intel Telemetry: Data Collection & Time-Series Prediction of App Usage

  • Group members: Srikar Prayaga, Andrew Chin, Arjun Sawhney

Abstract: Despite advancements in hardware technology, PC users continue to face frustrating app launch times, especially on lower-end Windows machines. The desktop experience differs vastly from the instantaneous app launches and optimized experience we have come to expect even from low-end smartphones. We propose a solution that preemptively runs Windows apps in the background based on the user's app usage patterns. Our solution is two-step. First, we built telemetry collector modules in C/C++ to collect real-world app usage data from two of our personal Windows 10 devices. Next, we developed neural network models in Python, trained on the collected data, to predict app usage times and corresponding launch sequences. We achieved strong results on our chosen evaluation metrics across different user profiles.

Predicting Application Use to Reduce User Wait Time

  • Group members: Sasami Scott, Timothy Tran, Andy Do

Abstract: Our goal for this project was to lower user wait time when loading programs by predicting the next application to be used. To obtain the needed data, we created data collection libraries. Using this data, we built both a Hidden Markov Model (HMM) and a Long Short-Term Memory (LSTM) model, and the latter proved better. With the LSTM, we can predict application usage times and expand this concept to more applications. We created multiple LSTM models with varying results and ultimately chose the one with the most potential, which reported 90% accuracy.

INTELlinext: A Fully Integrated LSTM and HMM-Based Solution for Next-App Prediction With Intel SUR SDK Data Collection

  • Group members: Jared Thach, Hiroki Hoshida, Cyril Gorlla

Abstract: As the power of modern computing devices increases, so too do user expectations for them. Despite advancements in technology, computer users are often faced with the dreaded spinning icon waiting for an application to load. Building upon our previous work developing data collectors with the Intel System Usage Reporting (SUR) SDK, we introduce INTELlinext, a comprehensive solution for next-app prediction for application preload to improve perceived system fluidity. We develop a Hidden Markov Model (HMM) for prediction of the k most likely next apps, achieving an accuracy of 64% when k = 3. We then implement a long short-term memory (LSTM) model to predict the total duration that applications will be used. After hyperparameter optimization leading to an optimal lookback value of 5 previous applications, we are able to predict the usage time of a given application with a mean absolute error of ~45 seconds. Our work constitutes a promising comprehensive application preload solution with data collection based on the Intel SUR SDK and prediction with machine learning.
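A simplified stand-in for the LSTM duration regressor described above, with hypothetical input features and the reported lookback of 5; L1 loss mirrors the mean-absolute-error metric:

```python
import torch
import torch.nn as nn

class DurationLSTM(nn.Module):
    """Regress the next app-usage duration from the previous `lookback`
    usage records (a simplified stand-in for the INTELlinext setup)."""
    def __init__(self, n_feats=3, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, lookback, n_feats)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # predict from the last step

lookback = 5                              # the optimal value reported above
model = DurationLSTM()
x = torch.randn(16, lookback, 3)          # e.g. app id, hour, last duration
y = torch.rand(16, 1) * 600               # durations in seconds
loss = nn.L1Loss()(model(x), y)           # L1 matches the MAE metric
print(loss.item())
```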

CodeAvail

21 Interesting Data Science Capstone Project Ideas [2024]


Data science, encompassing the analysis and interpretation of data, stands as a cornerstone of modern innovation. 

Capstone projects in data science education play a pivotal role, offering students hands-on experience to apply theoretical concepts in practical settings. 

These projects serve as a culmination of their learning journey, providing invaluable opportunities for skill development and problem-solving. 

Our blog is dedicated to guiding prospective students through the selection process of data science capstone project ideas. It offers curated ideas and insights to help them embark on a fulfilling educational experience. 

Join us as we navigate the dynamic world of data science, empowering students to thrive in this exciting field.

Data Science Capstone Project: A Comprehensive Overview


Data science capstone projects are an essential component of data science education, providing students with the opportunity to apply their knowledge and skills to real-world problems. 

Capstone projects challenge students to acquire and analyze data to solve real-world problems. These projects are designed to test students’ skills in data visualization, probability, inference and modeling, data wrangling, data organization, regression, and machine learning. 

In addition, capstone projects are conducted with industry, government, and academic partners, and most projects are sponsored by an organization. 

The projects are drawn from real-world problems, and students work in teams consisting of two to four students and a faculty advisor. 

Ultimately, the goal of the capstone project is to create a usable, public data product that can showcase students' skills to potential employers. 

Best Data Science Capstone Project Ideas – According to Skill Level

Data science capstone projects are a great way to showcase your skills and apply what you’ve learned in a real-world context. Here are some project ideas categorized by skill level:


Beginner-Level Data Science Capstone Project Ideas


1. Exploratory Data Analysis (EDA) on a Dataset

Start by analyzing a dataset of your choice and exploring its characteristics, trends, and relationships. Practice using basic statistical techniques and visualization tools to gain insights and present your findings clearly and understandably.
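A starter EDA loop might look like the following, using the penguins sample dataset that ships with seaborn as a stand-in for your own data:

```python
import pandas as pd
import seaborn as sns

# Any tabular dataset works; penguins ships with seaborn.
df = sns.load_dataset("penguins")

print(df.describe())                    # central tendency and spread
print(df.isna().sum())                  # missingness per column
print(df.corr(numeric_only=True))       # pairwise relationships
sns.pairplot(df, hue="species")         # visual overview of the dataset
```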

2. Predictive Modeling with Linear Regression

Build a simple linear regression model to predict a target variable based on one or more input features. Learn about model evaluation techniques such as mean squared error and R-squared, and interpret the results to make meaningful predictions.
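A minimal end-to-end version with scikit-learn, using the built-in California housing data as an example target:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)
print("MSE:", mean_squared_error(y_te, pred))   # squared-error evaluation
print("R^2:", r2_score(y_te, pred))             # variance explained
```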

3. Classification with Decision Trees

Use decision tree algorithms to classify data into distinct categories. Learn how to preprocess data, train a decision tree model, and evaluate its performance using metrics like accuracy, precision, and recall. Apply your model to practical scenarios like predicting customer churn or classifying spam emails.
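For instance, on scikit-learn's built-in breast cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Limiting depth is a simple guard against overfitting the training data.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
pred = tree.predict(X_te)
print(accuracy_score(y_te, pred), precision_score(y_te, pred),
      recall_score(y_te, pred))
```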

4. Clustering with K-Means

Explore unsupervised learning by applying the K-Means algorithm to group similar data points together. Practice feature scaling and model evaluation to identify meaningful clusters within your dataset. Apply your clustering model to segment customers or analyze patterns in market data.
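A compact example with synthetic blobs, including the feature scaling and a silhouette check:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
X = StandardScaler().fit_transform(X)        # scale features first

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print("silhouette:", silhouette_score(X, km.labels_))  # cluster quality
```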

5. Sentiment Analysis on Text Data

Dive into natural language processing (NLP) by analyzing text data to determine sentiment polarity (positive, negative, or neutral). 

Learn about tokenization, text preprocessing, and sentiment analysis techniques using libraries like NLTK or spaCy. Apply your skills to analyze product reviews or social media comments.
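For example, NLTK's VADER analyzer gives a compound polarity score that can be thresholded into the three classes:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")               # one-time lexicon download
sia = SentimentIntensityAnalyzer()

for review in ["Great battery life!", "Terrible support, never again."]:
    score = sia.polarity_scores(review)["compound"]
    label = ("positive" if score > 0.05
             else "negative" if score < -0.05 else "neutral")
    print(label, score)
```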

6. Time Series Forecasting

Predict future trends or values based on historical time series data. Learn about time series decomposition, trend analysis, and seasonal patterns using methods like ARIMA or exponential smoothing. Apply your forecasting skills to predict stock prices, weather patterns, or sales trends.
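A small ARIMA example with statsmodels on a synthetic trend series:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Toy monthly series with drift plus noise.
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(0.5, 1.0, 120))

model = ARIMA(y, order=(1, 1, 1)).fit()   # AR(1), first difference, MA(1)
print(model.forecast(steps=12))           # next 12 periods
```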

7. Image Classification with Convolutional Neural Networks (CNNs)

Explore deep learning concepts by building a basic CNN model to classify images into different categories. 

Learn about convolutional layers, pooling, and fully connected layers, and experiment with different architectures to improve model performance. Apply your CNN model to tasks like recognizing handwritten digits or classifying images of animals.
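A minimal PyTorch CNN for MNIST-sized inputs illustrates the conv-pool-linear pattern:

```python
import torch
import torch.nn as nn

# Minimal CNN for 28x28 grayscale digits (MNIST-like shapes).
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),            # fully connected classifier head
)
logits = model(torch.randn(8, 1, 28, 28))
print(logits.shape)                       # torch.Size([8, 10])
```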

Intermediate-Level Data Science Capstone Project Ideas


8. Customer Segmentation and Market Basket Analysis

Utilize advanced clustering techniques to segment customers based on their purchasing behavior. Conduct market basket analysis to identify frequent item associations and recommend personalized product suggestions. 

Implement techniques like the Apriori algorithm or association rules mining to uncover valuable insights for targeted marketing strategies.
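A toy example with mlxtend's Apriori implementation (thresholds chosen so this tiny basket set yields rules):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

baskets = [["milk", "bread"], ["milk", "diapers", "beer"],
           ["bread", "diapers"], ["milk", "bread", "diapers"]]
onehot = (pd.DataFrame([{item: True for item in b} for b in baskets])
          .fillna(False).astype(bool))   # one-hot basket encoding

frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "confidence", "lift"]])
```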

9. Time Series Anomaly Detection

Apply anomaly detection algorithms to identify unusual patterns or outliers in time series data. Utilize techniques such as moving average, Z-score, or autoencoders to detect anomalies in various domains, including finance, IoT sensors, or network traffic. 

Develop robust anomaly detection models to enhance data security and predictive maintenance.

10. Recommendation System Development

Build a recommendation engine to suggest personalized items or content to users based on their preferences and behavior. Implement collaborative filtering, content-based filtering, or hybrid recommendation approaches to improve user engagement and satisfaction. 

Evaluate the performance of your recommendation system using metrics like precision, recall, and mean average precision.
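A bare-bones item-item collaborative filter on a toy ratings matrix shows the core idea:

```python
import numpy as np

# Rows = users, columns = items; 0 means unrated (toy matrix).
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

# Item-item cosine similarity, then score unseen items for user 0 by a
# similarity-weighted sum of that user's existing ratings.
norms = np.linalg.norm(R, axis=0, keepdims=True)
sim = (R.T @ R) / (norms.T @ norms + 1e-9)
user = 0
scores = sim @ R[user] / (np.abs(sim).sum(axis=1) + 1e-9)
scores[R[user] > 0] = -np.inf               # mask already-rated items
print("recommend item:", int(np.argmax(scores)))
```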

11. Natural Language Processing for Topic Modeling

Dive deeper into NLP by exploring topic modeling techniques to extract meaningful topics from text data. 

Implement algorithms like Latent Dirichlet Allocation (LDA) or Non-Negative Matrix Factorization (NMF) to identify hidden themes or subjects within large text corpora. Apply topic modeling to analyze customer feedback, news articles, or academic papers.

12. Fraud Detection in Financial Transactions

Develop a fraud detection system using machine learning algorithms to identify suspicious activities in financial transactions. Utilize supervised learning techniques such as logistic regression, random forests, or gradient boosting to classify transactions as fraudulent or legitimate. 

Employ feature engineering and model evaluation to improve fraud detection accuracy and minimize false positives.

13. Predictive Maintenance for Industrial Equipment

Implement predictive maintenance techniques to anticipate equipment failures and prevent costly downtime. 

Analyze sensor data from machinery using machine learning algorithms like support vector machines or recurrent neural networks to predict when maintenance is required. Optimize maintenance schedules to minimize downtime and maximize operational efficiency.

14. Healthcare Data Analysis and Disease Prediction

Utilize healthcare datasets to analyze patient demographics, medical history, and diagnostic tests to predict the likelihood of disease occurrence or progression. 

Apply machine learning algorithms such as logistic regression, decision trees, or support vector machines to develop predictive models for diseases like diabetes, cancer, or heart disease. Evaluate model performance using metrics like sensitivity, specificity, and area under the ROC curve.

Advanced Level Data Science Capstone Project Ideas


15. Deep Learning for Image Generation

Explore generative adversarial networks (GANs) or variational autoencoders (VAEs) to generate realistic images from scratch. Experiment with architectures like DCGAN or StyleGAN to create high-resolution images of faces, landscapes, or artwork. 

Evaluate image quality and diversity using perceptual metrics and human judgment.

16. Reinforcement Learning for Game Playing

Implement reinforcement learning algorithms like deep Q-learning or policy gradients to train agents to play complex games like Atari or board games. 

Experiment with exploration-exploitation strategies and reward-shaping techniques to improve agent performance and achieve superhuman levels of gameplay.

17. Anomaly Detection in Streaming Data

Develop real-time anomaly detection systems to identify abnormal behavior in streaming data streams such as network traffic, sensor readings, or financial transactions. 

Utilize online learning algorithms like streaming k-means or Isolation Forest to detect anomalies and trigger timely alerts for intervention.
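As a simple stand-in for the streaming case, an Isolation Forest fit on a recent window of normal traffic can score newly arriving records (refitting the window periodically approximates online behavior):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
train = rng.normal(0, 1, (1000, 3))          # recent "normal" traffic window
model = IsolationForest(random_state=0).fit(train)

# Score each newly arriving record; -1 marks an anomaly.
new_points = np.vstack([rng.normal(0, 1, (5, 3)), [[8, 8, 8]]])
print(model.predict(new_points))             # flags the injected outlier
```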

18. Multi-Modal Sentiment Analysis

Extend sentiment analysis to incorporate multiple modalities such as text, images, and audio to capture rich emotional expressions. 

Utilize deep learning architectures like multimodal transformers or fusion models to analyze sentiment across different modalities and improve understanding of complex human emotions.

19. Graph Neural Networks for Social Network Analysis

Apply graph neural networks (GNNs) to model and analyze complex relational data in social networks. Use techniques like graph convolutional networks (GCNs) or graph attention networks (GATs) to learn node embeddings and predict node properties such as community detection or influential users.

20. Time Series Forecasting with Deep Learning

Explore advanced deep learning architectures like long short-term memory (LSTM) networks or transformer-based models for time series forecasting. 

Utilize attention mechanisms and multi-horizon forecasting to capture long-term dependencies and improve prediction accuracy in dynamic and volatile environments.

21. Adversarial Robustness in Machine Learning

Investigate techniques to improve the robustness of machine learning models against adversarial attacks. 

Explore methods like adversarial training, defensive distillation, or certified robustness to mitigate vulnerabilities and ensure model reliability under adversarial perturbations, particularly in critical applications like autonomous vehicles or healthcare.

These project ideas cater to various skill levels in data science, ranging from beginners to experts. Choose a project that aligns with your interests and skill level, and don’t hesitate to experiment and learn along the way!

Factors to Consider When Choosing a Data Science Capstone Project

Choosing the right data science capstone project is crucial for your learning experience and effectively showcasing your skills. Here are some factors to consider when selecting a data science capstone project:

Personal Interest

Select a project that aligns with your passions and career goals to stay motivated and engaged throughout the process.

Data Availability

Ensure access to relevant and sufficient data to complete the project and draw meaningful insights effectively.

Complexity Level

Consider your current skill level and choose a project that challenges you without overwhelming you, allowing for growth and learning.

Real-World Impact

Aim for projects with practical applications or societal relevance to showcase your ability to solve tangible problems.

Resource Requirements

Evaluate the availability of resources such as time, computing power, and software tools needed to execute the project successfully.

Mentorship and Support

Seek projects with opportunities for guidance and feedback from mentors or peers to enhance your learning experience.

Novelty and Innovation

Choose projects that push boundaries and explore new techniques or approaches to demonstrate creativity and originality in your work.

Tips for Successfully Completing a Data Science Capstone Project

Successfully completing a data science capstone project requires careful planning, effective execution, and strong communication skills. Here are some tips to help you navigate through the process:

  • Plan and Prioritize: Break down the project into manageable tasks and create a timeline to stay organized and focused.
  • Understand the Problem: Clearly define the project objectives, requirements, and expected outcomes before analyzing.
  • Explore and Experiment: Experiment with different methodologies, algorithms, and techniques to find the most suitable approach.
  • Document and Iterate: Document your process, results, and insights thoroughly, and iterate on your analyses based on feedback and new findings.
  • Collaborate and Seek Feedback: Collaborate with peers, mentors, and stakeholders, actively seeking feedback to improve your work and decision-making.
  • Practice Communication: Communicate your findings effectively through clear visualizations, reports, and presentations tailored to your audience’s understanding.
  • Reflect and Learn: Reflect on your challenges, successes, and lessons learned throughout the project to inform your future endeavors and continuous improvement.

By following these tips, you can successfully navigate the data science capstone project and demonstrate your skills and expertise in the field.

Wrapping Up

In wrapping up, data science capstone project ideas are invaluable in bridging the gap between theory and practice, offering students a chance to apply their knowledge in real-world scenarios.

They are a cornerstone of data science education, fostering critical thinking, problem-solving, and practical skills development. 

As you embark on your journey, don’t hesitate to explore diverse and challenging project ideas. Embrace the opportunity to push boundaries, innovate, and make meaningful contributions to the field. 

Share your insights, challenges, and successes with others, and invite fellow enthusiasts to exchange ideas and experiences. 

1. What is the purpose of a data science capstone project?

A data science capstone project serves as a culmination of a student’s learning experience, allowing them to apply their knowledge and skills to solve real-world problems in the field of data science. It provides hands-on experience and showcases their ability to analyze data, derive insights, and communicate findings effectively.

2. What are some examples of data science capstone projects?

Data science capstone projects can cover a wide range of topics and domains, including predictive modeling, natural language processing, image classification, recommendation systems, and more. Examples may include analyzing customer behavior, predicting stock prices, sentiment analysis on social media data, or detecting anomalies in financial transactions.

3. How long does it typically take to complete a data science capstone project?

The duration of a data science capstone project can vary depending on factors such as project complexity, available resources, and individual pace. Generally, it may take several weeks to several months to complete a project, including tasks such as data collection, preprocessing, analysis, modeling, and presentation of findings.


NYC Data Science Academy
A Data Investigation of Healthcare Insurance Fraud



According to Blue Cross Blue Shield data, approximately 3-10% of US healthcare spending, or $68-$230 billion, goes to fraudulent healthcare claims and their management. This is especially detrimental for those who require government assistance through systems like Medicare. Government resources are already limited, and alleviating the pressures created by fraud may allow more freedom to help those in need.

Many of these fraudulent claims come in a variety of forms, including but not limited to: billing for services that were not provided, duplicate submission of a claim for the same service, misrepresenting the service provided, charging for a more complex or expensive service than was actually provided, and even billing for a covered service when the service actually provided was not covered.

The Problem at Hand

This project is a proof of concept for using data science as a tool to improve upon existing fraud detection models, and subsequently save government programs like Medicare millions of dollars in insurance fraud management. You may find the code for this project on my GitHub.

Utilizing historical data from Medicare itself, exploratory data analysis and modeling were performed, and areas for improvement upon the existing system were brought to light. The Medicare data used for this analysis comes from a problem hosted on Kaggle, with the purpose of identifying problematic providers who consistently submit fraudulent claims to Medicare and those who are more likely to commit healthcare fraud. Understanding their behavioral patterns, along with how they relate to inpatient/outpatient care and claims, can help the healthcare system save resources and devote them to people who need them.

Within the sample datasets, there are over 1,300 unique providers and ~140,000 unique beneficiaries, who submitted over 500,000 claims between November 2008 and December 2009. Data categories cover areas like deductible paid, reimbursement amounts, provider identifiers, medical history, and other insurance-related descriptors.

Exploring the Data


The dataset was conveniently broken up into categories that included things like the amount reimbursed per claim and whether or not the claim was flagged for fraud. One of the first things that stuck out once the data was cleaned was that, out of the total amount of money that was to go to claims during the sample year, more than half of those funds would have gone to fraudulent claims. The difference isn't large enough to be statistically significant, but it is definitely large enough to warrant further investigation.

Differences Between Fraudulent and Non-Fraudulent


As we continue to look at the differences between fraudulent and non-fraudulent claims, it is visually clear that the average claim amount flagged as fraud is larger than the average non-fraud claim. This makes sense realistically: a provider would want to make as much money as possible back on a fraudulent claim, so on average one would expect higher amounts.

Once again, these differences did not turn out to be statistically significant, due to the very large variance in the means of the two classifications, but the average claim amounts can be used later in the analysis to assess the performance of this investigation. Similarly, the total amount of non-fraud claims was also visually higher than that of claims flagged as fraudulent.

Data of Procedure and Diagnosis Code

[Figure: Proportion of fraud vs. non-fraud claims, organized by procedure code (left) and diagnosis code (right)]

In the figures above, the proportion of fraud/non-fraud claims is organized by procedure code (left) and diagnosis code (right). The procedure code figure shows the top 10 codes by money involved in the transaction. It is clear that for these codes, claims are flagged as fraudulent more often than not. Given that these transactions have the most money involved, this makes sense: if a crime is intentionally being committed, one would want to reap the most reward from the transaction.

Looking at the diagnosis code figure, organized in a similar manner, non-fraud claims are more prevalent than fraudulent ones. This also makes sense realistically, as there is likely more money to be made treating a condition than diagnosing it, so fraudulent claims may be less common in that regard.

[Figure: Top 20 physicians by code, plotted against fraud vs. non-fraud claim counts]

If we look at the top 20 physicians, by code, plotted against claim count (fraud vs. non-fraud), we can see a number of physicians with high counts of claims flagged as fraudulent. It's possible this is just due to the procedures they commonly perform, the particular field they are in, or even the practice they work for, but it does show that there are areas where fraud is clear, and these areas can theoretically be weeded out systematically.

[Figure: Claim amounts plotted across all providers, fraud vs. non-fraud]

Although there is a very large variance in the means between fraud and non-fraud claims, when claims are plotted against providers as a whole (above), we can clearly see that fraudulent claims are buried in the mountain of verifiable ones. There isn't necessarily a cut-and-dried way to distinguish the claim types from one another, and as investigators, we have to be clever in our tool use.

Data on Feature Engineering

[Figure: Physician-to-patient ratio per provider for fraudulent and non-fraudulent claims, with KDE (bottom left) and per-claim-type (bottom right) views]

Graph analysis tools like NetworkX can be used to establish connections between data points that aren't readily available. In this case, NetworkX was used to form, for each provider, the ratio of the number of physicians associated with that provider to the number of patients associated with it. This ratio, visualized in the figures above, shows that for fraudulent claims, providers tend to be connected to more patients than physicians; however, this also holds for non-fraud claims, just to a lesser extent.

Although this difference was not statistically significant, all the ratios were plotted on a KDE (bottom left) and by claim type (bottom right) to see if some kind of bimodal shape would form, from which a ratio threshold for determining fraud could be established. Unfortunately, the figures only further established how hard the task at hand is to resolve. It is very clear how deeply ingrained and mixed in the fraudulent claims are with the non-fraud claims, to the point where their density shapes almost completely overlap.

In addition to the NetworkX ratio, other feature engineering steps included stratifying the target variable for an even class distribution across a 70/30 train/test split, and oversampling the training data with SMOTE; model accuracy improved once the classes were balanced (a sketch follows). Unnecessary categories were also removed, as they were only used for grouping and offered no further information for modeling: fields like Beneficiary ID, Claim ID, specific operating/attending physician and diagnosis codes, and admission and discharge dates ranked low in feature importance during the investigation.
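A minimal sketch of the split-then-oversample step, using scikit-learn and imbalanced-learn on synthetic data in place of the engineered claim features (sizes and class weights are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Synthetic stand-in for the engineered claim features and fraud label
# (roughly 38% positive, mirroring the dataset's fraud share).
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.62, 0.38], random_state=42)

# 70/30 split, stratified on y so both sets keep the same class mix.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# SMOTE is applied to the training fold only; the test set stays untouched.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
```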

Data Analysis: Logistic Regression & Random Forest

[Figure: Logistic Regression model results]

Since the target of the data we are analyzing is binary (i.e., Fraud/Not Fraud), Logistic Regression was the natural place to start, as this tool is well suited to binary classification. After fine-tuning the model to its best parameters using GridSearch, the model handled the training data fairly well. Its detection of true positives and true negatives was fairly high in both the training and validation sets; when it came to the F-Score, however, the model did not do well on unseen data.

This tends to happen when a model finds patterns in the training data because it has a lot to work with, but generalizes poorly to newer or smaller data, also known as overfitting. This kind of issue can sometimes be mitigated with techniques such as increasing the sample size or randomly removing features from the data programmatically, which forces the model to find new patterns. A minimal tuning sketch follows.
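Continuing from the split/SMOTE sketch above, a hedged sketch of GridSearch-tuned Logistic Regression; the parameter grid is an illustrative assumption, since the post does not list its exact grid.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scale inside the pipeline so each CV fold is scaled independently.
pipe = Pipeline([("scale", StandardScaler()),
                 ("lr", LogisticRegression(max_iter=1000))])

# Grid values are illustrative; the post does not list its exact grid.
grid = GridSearchCV(pipe, {"lr__C": [0.01, 0.1, 1, 10]},
                    scoring="f1", cv=5)
grid.fit(X_train_bal, y_train_bal)
print(grid.best_params_, round(grid.best_score_, 3))
```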

[Figure: Logistic Regression feature importances]

Model Data Analysis and Significance

If we take a look at the measures of importance taken from the LR model, the top four features that stood out were:

  • Per Provider Average Insurance Claim Amount Reimbursed
  • Insurance Claim Amount Reimbursed
  • Per Claim Diagnosis Code 2 Average Insurance Claim Amount Reimbursed
  • Per Attending Physician Average Insurance Claim Amount Reimbursed

These all make sense for the most part: the quantities one would expect fraud to revolve around are the claim amounts tied to individual physicians and providers. The model appears to be doing its job.
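One common proxy for importance in a logistic regression on standardized features is the absolute coefficient size; a sketch continuing from the fitted grid search above (with real data, the indices would map back to the engineered feature names):

```python
import numpy as np

# The pipeline's fitted LR; coefficient magnitude on scaled features is a
# rough importance proxy, not the post's exact ranking method.
best_lr = grid.best_estimator_.named_steps["lr"]
strength = np.abs(best_lr.coef_.ravel())
for i in np.argsort(strength)[::-1][:4]:
    print(f"feature_{i}: |coef| = {strength[i]:.3f}")
```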

[Figure: Logistic Regression confusion matrix]

Despite currently being overfit, the model performed well in most respects. The rate of actual fraud that was flagged and the share of valid claims incorrectly flagged were both acceptable. The main concern is the false-negative rate, which may be a prominent source of error. Realistically, any false positives flagged by future versions of this model could be verified with providers on a case-by-case basis, although that number should also come down with further tuning.


Visualization of Random Forest Machine Learning Model Results

[Figure: Random Forest model results]

A Random Forest model was developed alongside the LR model for comparison, and it performed similarly. It did fairly well in all respects after tuning with GridSearch, and yielded a slightly different yet broadly similar ranking of feature importance. It too, however, suffered a lower F-Score on validation, indicating that it is also overfit. A comparable tuning sketch follows.
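A hedged Random Forest counterpart to the LR sketch, again with an illustrative grid rather than the post's exact settings:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the post only says GridSearch was used for tuning.
rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    scoring="f1", cv=5)
rf_grid.fit(X_train_bal, y_train_bal)

# Impurity-based importances give the forest's own feature ranking.
print(rf_grid.best_params_)
print(rf_grid.best_estimator_.feature_importances_.round(3))
```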


Model Evaluations/Takeaway

This investigation produced some interesting takeaways about the current Medicare insurance fraud problem. The physician-to-patient ratio shows fraud thoroughly mixed in with non-fraud, which highlights the difficulty of the problem at hand. Claims are difficult to distinguish from one another, and criminal activity appears well hidden among the mass of data submitted to Medicare. Some strengths and weaknesses of the model (metrics computed as in the sketch after this list) were:

  • Accuracy, sensitivity, and specificity are good, but F1 is low on validation
  • False positives and false negatives matter because each represents monetary value, so this effectiveness score is important
  • False-positive flags (5.07% of claims, per the cost analysis below) can theoretically be resolved on a case-by-case basis
  • The true-positive flag rate is relatively high, which means the model is catching fraud properly
  • The false-negative flag rate is a concerning source of error; we don't want fraud to go unnoticed
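Continuing from the sketches above, these metrics can be read off a validation confusion matrix; classification_report includes the per-class F1 the post flags as weak.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Validation-set evaluation of the tuned LR from the sketches above.
y_pred = grid.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"sensitivity = {tp / (tp + fn):.3f}, specificity = {tn / (tn + fp):.3f}")
print(classification_report(y_test, y_pred, digits=3))  # includes per-class F1
```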

Cost/Benefit Data Analysis

While error rates are concerning, the bottom line for government programs that support people in need is money: how much does fraud cost, and how much can be saved? If we break down the issue and how this model handled it, this is what we can say (the sketch after the list below reproduces the arithmetic).

Original Data

  • Originally, Medicare found that 38% of claims submitted for reimbursement were fraudulent. The average fraudulent claim reimbursement is ~$1,300 per claim if none of them were caught.
  • The training data contains ~558,211 total claims.
  • 38% of total claims is ~212,120 claims.
  • At an average of $1,300 per fraudulent claim, that is ~$275,756,234 saved if they are all caught.
  • The other 62% of claims accounts for ~346,091 claims, or (at ~$700 per claim) ~$242,263,700.
  • At those rates, the amount saved with Medicare's current fraud detection method is $275,756,234 - $242,263,700 = ~$33,492,534.

Under my LR machine learning model:

  • True negatives amount to 41.90%, or ~233,890 claims, or $163,723,286 at the $700 average.
  • False negatives add 8.10%, or ~45,215 claims, or $58,779,618 at the $1,300 average, for a total of ~$222,502,904 paid out.
  • True positives amount to 44.93%, or ~250,804 claims, or $326,045,462 saved at the $1,300 average.
  • False positives account for 5.07%, or ~28,301 claims, or $19,810,908 at the $700 average.
  • Adding up everything paid out gives ~$242,313,812.
  • The amount saved catching fraud is ~$326,045,462 - $242,313,812 = ~$83,731,650.
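A quick arithmetic check of these figures; the $1,300 and $700 per-claim averages are the post's assumptions, not computed here.

```python
# Reproducing the back-of-the-envelope arithmetic above.
total_claims = 558_211
avg_fraud, avg_valid = 1_300, 700

# Baseline: 38% fraudulent, 62% valid.
fraud_saved = 0.38 * total_claims * avg_fraud   # ~ $275.8M if all caught
valid_paid  = 0.62 * total_claims * avg_valid   # ~ $242.3M legitimately paid
print(f"baseline net: ~${fraud_saved - valid_paid:,.0f}")   # ~ $33.5M

# Confusion-matrix shares from the LR model.
tn, fn, tp, fp = 0.4190, 0.0810, 0.4493, 0.0507
paid_out = (tn * avg_valid + fn * avg_fraud + fp * avg_valid) * total_claims
caught   = tp * avg_fraud * total_claims        # ~ $326.0M in flagged fraud
print(f"model net:    ~${caught - paid_out:,.0f}")          # ~ $83.7M
```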


About Author


David Green


Capstone Projects: Fall 2020

Please click the links below to view PDF posters from the Fall 2020 capstone projects.

  • Reconstruction of Coordination Ellipsis from Clinical Trial Eligibility Criteria Text
  • COVID-19 Randomized Controlled Trial (RCT) Summarization
  • Peace Speech Analysis via NLP
  • Measuring Startup Strategy and Its Evolution (Team 1)
  • Measuring Startup Strategy and Its Evolution (Team 2)
  • Outdated Posts about AWS on StackOverflow

  • Exploring Thematic Fit in Language with Neural Models
  • Improving automatic event understanding through sequential and non-sequential deep learning architectures
  • Predicting Forward Citations for Patents
  • Auto-annotation of Pathology Images
  • Deep Learning in Cardiology
  • Automatic Photodamage assessment from Facial Image
  • Betta Fish Evolutionary Morphology
  • Detecting Mosaic Mutations with Deep Learning
  • Deep Learning Methods to Discover Novel Cells
  • Active Learning for Computer Vision

  • Energy Efficient Machine Learning at the Edge
  • Energy Efficient AI on Edge
  • Use Neural Structured Learning for Beaconing Detection
  • Identifying Trading Opportunities using Unsupervised Learning
  • Reinforcement Learning for Trading
  • Reinforcement Learning for Taxi Driver Re-positioning Problem in NYC
  • Sensor-Based Repackaged Android App Detection
  • Sensor-Based Repackaged Malware Detection
  • Predicting Covid-19 outbreaks using social media images
  • Automated Model Reduction for Atmospheric Chemical Mechanisms (AMORE Project)
  • AutoML Prediction Machine of Adverse Outcomes Following Hip Fracture Surgery
  • Identifying patients missing HS diagnosis

  • Returns Propensity Prediction for Online Orders
  • Predicting the thermodynamic stability of perovskite oxides
  • Detecting Market Manipulation in Small – Cap Equities
  • Clustering Analysis of Investors & Trending Topic Detection



The objective of this project is to predict patients’ healthcare costs and to identify factors contributing to this prediction.

Shivam12591/Healthcare-Insurance-Analysis_Capstone-Project


Project Task: Week 1 Data science/data analysis

Collate the files so that all the information is in one place

Check for missing values in the dataset

Find the percentage of rows that have trivial value (for example, ?), and delete such rows if they do not contain significant information

Use the necessary transformation methods to deal with the nominal and ordinal categorical variables in the dataset

The dataset has a State ID column with around 16 states. The states are not represented in equal proportions in the data, and creating dummy variables for all of them may result in too many insignificant predictors; only R1011, R1012, and R1013 are worth investigating further. Create a suitable strategy for creating dummy variables under these constraints.
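One plausible strategy, sketched on a toy frame (column name "State ID" as described above; everything outside the three states falls into an implicit baseline):

```python
import pandas as pd

# Toy frame with the "State ID" column. Indicators are created only for
# the three states of interest; all other states fall into the implicit
# baseline, avoiding ~16 sparse dummies.
df = pd.DataFrame({"State ID": ["R1011", "R1012", "R1013", "R1014", "R1020"]})
for state in ["R1011", "R1012", "R1013"]:
    df[state] = (df["State ID"] == state).astype(int)
print(df)
```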

The variable NumberOfMajorSurgeries also appears to have string values. Apply a suitable method to clean up this variable. Note: Use Excel as well as Python to complete the tasks

Age appears to be a significant factor in this analysis. Calculate the patients' ages based on their dates of birth.

The gender of the patient may be an important factor in determining the cost of hospitalization. The salutations in a beneficiary's name can be used to determine their gender. Make a new field for the beneficiary's gender.
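A minimal pandas sketch covering both the age and gender fields; the column names, date format, and reference date are illustrative assumptions about the real files.

```python
import pandas as pd

# Toy rows; column names, date format, and the reference date are
# illustrative assumptions, not the actual file layout.
df = pd.DataFrame({
    "name": ["Mr. John Roe", "Ms. Jayna Christopher"],
    "dob":  ["1970-05-02", "1988-12-28"],
})

# Age in whole years as of a fixed reference date.
ref = pd.Timestamp("2023-01-01")
df["age"] = (ref - pd.to_datetime(df["dob"])).dt.days // 365

# Gender inferred from the salutation at the start of the name.
df["gender"] = df["name"].str.extract(r"^(Mr|Mrs|Ms)\.")[0].map(
    {"Mr": "male", "Mrs": "female", "Ms": "female"})
print(df)
```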

You should also visualize the distribution of costs using a histogram, box and whisker plot, and swarm plot.

State how the distribution is different across gender and tiers of hospitals

Create a radar chart to showcase the median hospitalization cost for each tier of hospitals

Create a frequency table and a stacked bar chart to visualize the count of people in the different tiers of cities and hospitals. Note: Use Excel as well as Python to complete the tasks.

  • Test the following null hypotheses (a sketch of these tests follows; note: use Excel as well as Python to complete the tasks):
    a. The average hospitalization costs for the three types of hospitals are not significantly different
    b. The average hospitalization costs for the three types of cities are not significantly different
    c. The average hospitalization cost for smokers is not significantly different from the average cost for nonsmokers
    d. Smoking and heart issues are independent
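A hedged SciPy sketch of tests that fit these hypotheses (one-way ANOVA for a/b, a two-sample t-test for c, a chi-square independence test for d), run on synthetic data with illustrative column names:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in with illustrative column names.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "cost": rng.lognormal(10, 0.5, 600),
    "hospital_tier": rng.choice([1, 2, 3], 600),
    "city_tier": rng.choice([1, 2, 3], 600),
    "smoker": rng.choice([0, 1], 600),
    "heart_issues": rng.choice([0, 1], 600),
})

# (a) one-way ANOVA across hospital tiers; (b) is analogous on city_tier.
print(stats.f_oneway(*[g["cost"] for _, g in df.groupby("hospital_tier")]))

# (c) Welch two-sample t-test, smokers vs. nonsmokers.
print(stats.ttest_ind(df.loc[df.smoker == 1, "cost"],
                      df.loc[df.smoker == 0, "cost"], equal_var=False))

# (d) chi-square test of independence, smoking vs. heart issues.
print(stats.chi2_contingency(pd.crosstab(df["smoker"], df["heart_issues"])))
```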

Project Task: Week 2 Machine Learning

Examine the correlation between predictors to identify highly correlated predictors. Use a heatmap to visualize this.

Develop and evaluate the final model using regression with a stochastic gradient descent optimizer, ensuring that you apply all of the following suggestions:

  • Perform the stratified 5-fold cross-validation technique for model building and validation
  • Use standardization and hyperparameter tuning effectively
  • Use sklearn pipelines
  • Use appropriate regularization techniques to address the bias-variance trade-off

  a. Create five folds in the data, and introduce a variable to identify the folds
  b. For each fold, run a for loop and ensure that 80 percent of the data is used to train the model and the remaining 20 percent is used to validate it in each iteration
  c. Develop five distinct models and five distinct validation scores (root mean squared error values)
  d. Determine the variable importance scores, and identify the redundant variables

A minimal sketch of the fold loop follows.
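A minimal sketch of the fold loop under these suggestions, using synthetic data in place of the merged, encoded hospitalization table; hyperparameters are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the merged, encoded hospitalization table.
X, y = make_regression(n_samples=1000, n_features=8, noise=20, random_state=0)

# Five folds: each iteration trains on ~80% and validates on the rest,
# yielding five fitted models and five RMSE scores.
for fold, (tr, va) in enumerate(KFold(n_splits=5, shuffle=True,
                                      random_state=0).split(X)):
    model = make_pipeline(
        StandardScaler(),                          # standardization
        SGDRegressor(penalty="l2", alpha=1e-3,     # L2 regularization
                     max_iter=2000, random_state=0))
    model.fit(X[tr], y[tr])
    rmse = mean_squared_error(y[va], model.predict(X[va])) ** 0.5
    print(f"fold {fold}: RMSE = {rmse:.1f}")
```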

Use random forest and extreme gradient boosting for cost prediction, share your cross-validation results, and calculate the variable importance scores. A brief sketch follows.
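A brief sketch reusing X, y from the previous fold-loop sketch; XGBoost is an external package assumed installed here.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor  # external package; assumed installed

# 5-fold cross-validated RMSE for each model.
for model in (RandomForestRegressor(random_state=0),
              XGBRegressor(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    print(type(model).__name__, f"mean RMSE = {-scores.mean():.1f}")

# Variable importance from a forest fitted on the full table.
rf = RandomForestRegressor(random_state=0).fit(X, y)
print(rf.feature_importances_.round(3))
```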

Case scenario: Estimate the cost of hospitalization for Ms. Jayna Christopher (date of birth 12/28/1988; height 170 cm; weight 85 kg). She lives in a tier-1 city, and her state's State ID is R1011. She lives with her partner and two children. She was found to be nondiabetic (HbA1c = 5.8). She smokes but is otherwise healthy. She has had no transplants or major surgeries. Her father died of lung cancer. Hospitalization costs will be estimated using tier-1 hospitals.

Find the predicted hospitalization cost using all five models. The predicted value should be the mean of the five models' predicted values.

Project Task: Week 2 SQL

  • To gain a comprehensive understanding of the factors influencing hospitalization costs, it is necessary to combine the tables provided. Merge the two tables by first identifying the columns that will help you in merging.
    a. In both tables, add a Primary Key constraint for these columns. Hint: You can remove duplicates and null values from the column and then use ALTER TABLE to add a Primary Key constraint.

Retrieve information about people who are diabetic and have heart problems, including their average age, average number of dependent children, average BMI, and average hospitalization costs

Find the average hospitalization cost for each hospital tier and each city level

Determine the number of people who have had major surgery with a history of cancer

Determine the number of tier-1 hospitals in each state

Project Task: Week 2 Tableau

  • Create a dashboard in Tableau by selecting the appropriate chart types and business metrics. Note: Put more emphasis on data storytelling.
