Data Mining Case Studies: Real-world Applications and Success Stories

  • Key Takeaways

Data mining has improved the decision-making process for over 80% of companies. (Source: Gartner).

Statista reports that global spending on robotic process automation (RPA) is projected to reach $98 billion by 2024, indicating a significant investment in automation technologies.

According to Grand View Research, the global data mining market will reach $16.9 billion by 2027.

Ethical Data Mining preserves individual rights and fosters trust.

A successful implementation requires defining clear goals, choosing data wisely, and constant adaptation.

Data mining case studies serve as a guide for modern businesses navigating the complex world of information exploration. At its core, data mining is the process of extracting valuable insights from large datasets and transforming them into actionable intelligence. This article takes a journey through the applications, significance, and success stories of data mining to demonstrate its transformative power in modern decision-making.

Businesses across all industries are increasingly using data mining in an age where data is a key component of strategic decision making. This allows them to extract meaningful patterns and trends from the sea of data that continues to grow. Data mining is more than just analysis. It’s a catalyst for innovation that propels organizations to make informed strategic decisions. Businesses can gain an edge on the market by leveraging data to optimize their operations and improve their competitiveness.

We go beyond the theoretical definitions to explore real-world examples and success stories, which highlight the tangible impact data mining has. These case studies show how data mining can be a catalyst for change. From improving marketing strategies to revolutionizing healthcare, they demonstrate the power of data mining. Data mining is a field that goes beyond data analysis. It’s a force for change in business, research, and more.

1. The Importance of Data Mining for Modern Business: Understanding the Role in Decision Making

Data mining has taken on a central role in the modern world of business. Companies today are awash in data, and making informed decisions with that data can be crucial to staying competitive. This article explores the many aspects of data mining and its impact on decisions.

  • 1.1. Unraveling the Data Landscape

Businesses generate a staggering amount of data, including customer interactions, market patterns, and internal operations. Without effective tools for sorting through all this data, decision-makers face information overload. Data mining organizes and structures this vast amount of data and extracts patterns and insights from it, acting as a compass that guides decision-makers through the complex landscape of data.

  • 1.2. Empowering Strategic Decision Making

Data mining is a powerful tool for strategic decision making. Businesses can predict future trends and market behavior by analyzing historical data. This insight allows businesses to better align their strategies with predicted shifts. Data mining can provide the strategic insights required for successful decision making, whether it is launching a product, optimizing a supply chain, or adjusting pricing strategies.

  • 1.3. Customer-Centric Decision Making

Understanding and meeting the needs of customers is paramount in an era where customer-centricity reigns. Data mining is crucial in determining customer preferences, behaviors, and feedback. This information allows businesses to customize products and services in order to meet the expectations of customers, increase satisfaction and build lasting relationships. With customer-centric insights, decision-makers can make choices that resonate with their target audiences and foster loyalty and brand advocacy.

  • 1.4. Making informed choices to reduce risks

Data mining is a powerful tool for risk mitigation. Decision-makers are able to make proactive choices by analyzing historical data, identifying risks, and minimizing negative impacts. Data mining gives businesses the insight they need to make informed decisions, whether it is about financial risks, volatility in the market, or disruptions to supply chains.

  • 1.5. Continuous improvement through feedback loops

Through feedback loops, data mining can foster a culture of continuous improvement. Businesses can refine strategies and tactics by analyzing past decisions. Iterative, data-driven decision-making ensures that decisions are not static but evolve with changing circumstances. In today’s fast-paced business environment, the ability to adapt and learn based on data is essential for success.

2. Data Mining: Applications across industries

Data mining is transforming the way companies operate and make business decisions. This article explores the various applications of data-mining, highlighting case studies that illuminate its impact in the healthcare, retail, and finance sectors.

  • 2.1. Healthcare Case Studies: Revolutionizing Patient Care

Data mining is a powerful tool in the healthcare industry, where it can improve patient outcomes and treatment plans. Discover compelling case studies in which data mining played a crucial role in predicting patterns of disease, optimizing treatment, and improving patient care. These examples, which range from early detection of health risks to personalized medicine, show the impact that data mining has had on the healthcare industry.

  • 2.2. Retail Success stories: Enhancing customer experiences

Retail is at the forefront of leveraging data mining to enhance customer experiences and streamline operations. Discover success stories of how data mining empowered businesses to better understand consumer behavior, optimize their inventory management, and create personalized marketing strategies. These case studies, which range from e-commerce giants to brick-and-mortar shops, show how data mining can boost sales, improve customer satisfaction, and transform the retail landscape.

  • 2.3. Financial Sector Examples: Navigating Complexity

Data mining is a valuable tool in the finance industry, where precision and risk assessment are key. Explore case studies that demonstrate how data mining can be used for fraud detection and risk assessment. These examples demonstrate how financial institutions use data mining to make better decisions, protect against fraud, and customize services to their clients’ needs.

  • 2.4. Data Mining and Education: Shaping The Future of Learning

Beyond healthcare, retail, and finance, data mining is being used in the education sector to enhance learning. Learn how educational institutions use data mining to optimize learning outcomes, analyze student performance, and personalize materials. These examples, ranging from adaptive learning platforms to predictive analytics, demonstrate the potential for data mining to revolutionize how we approach education.

  • 2.5. Manufacturing efficiency: Streamlining production processes

Data mining is a powerful tool for streamlining manufacturing processes. Examine case studies that demonstrate how data mining can be used to improve supply chain management, predict maintenance requirements, and increase overall operational efficiency. These examples show how data-driven insights can lead to cost savings, increased productivity, and a competitive advantage in manufacturing.

Data mining is a key component in each of these applications. It unlocks insights, streamlines operations, and shapes the future of decisions. Data mining is transforming the landscapes of many industries, including healthcare, retail, education, finance, and manufacturing.

3. Data Mining Techniques – Unveiling Insight Extraction

Data mining techniques help businesses gain an edge by extracting valuable insights and information from large datasets. This exploration will provide an overview of the most popular data mining methods, and back each one with insightful case studies.

  • 3.1. Popular Data Mining Techniques

3.1.1. Clustering Analysis

Clustering involves grouping data points based on similarity. The method is useful for uncovering structure in data sets and can be used to segment customers, detect anomalies, or recognize recurring patterns. The case studies will show how clustering has been used to improve marketing strategies, streamline product lines, and increase overall operational efficiency.
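
To make the idea concrete, here is a minimal customer-segmentation sketch in Python. It assumes the scikit-learn and pandas libraries are available, and the customer metrics, segment count, and figures are invented purely for illustration rather than taken from any case study.

```python
# Minimal k-means customer segmentation sketch (illustrative only).
# Assumes scikit-learn and pandas are installed; column names are made up.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer metrics: annual spend and number of orders.
customers = pd.DataFrame({
    "annual_spend": [120, 150, 900, 950, 60, 80, 1100, 130],
    "orders_per_year": [4, 5, 22, 25, 2, 3, 30, 6],
})

# Scale features so spend and order counts contribute comparably.
X = StandardScaler().fit_transform(customers)

# Group customers into three segments.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
customers["segment"] = kmeans.fit_predict(X)

print(customers.sort_values("segment"))
```

In practice, the number of clusters and the features used would be chosen to match the business question, for example with the elbow method or silhouette scores.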

3.1.2. Association Rule Mining

Association rule mining reveals relationships between variables within large datasets. Market basket analysis is a common application of association rule mining, identifying patterns of products that co-occur in transactions. Real-world examples show how association rule mining is used in retail to improve product placement, increase sales, and enhance the customer experience.
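
As a hedged illustration of the underlying idea, the short sketch below computes support and confidence for item pairs in plain Python; the baskets are hypothetical, and a production analysis would typically rely on a dedicated implementation of an algorithm such as Apriori rather than this simplified pair counting.

```python
# Minimal market-basket sketch in plain Python (illustrative only).
# Counts how often item pairs co-occur and derives support and confidence.
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each set is one shopping basket.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

n = len(transactions)
for (a, b), count in pair_counts.most_common():
    support = count / n                  # share of baskets containing both items
    confidence = count / item_counts[a]  # P(b is bought | a is bought)
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```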

3.1.3. Decision Tree Analysis

A decision tree is a visual representation of the process of making decisions. The technique is a powerful tool for classification tasks, helping businesses make decisions based on a set of criteria. Through case studies, you will learn how decision tree analysis has been used for disease diagnosis in healthcare, fraud detection in finance, and predictive maintenance in manufacturing.
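
The following sketch, offered as an assumption-laden example rather than a reproduction of any case study, trains a small decision tree with scikit-learn on made-up churn data and prints the learned rules, which is what makes the technique attractive as a “white box” method.

```python
# Minimal decision-tree classification sketch (illustrative only).
# Assumes scikit-learn is installed; the features and labels are made up.
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical records: [age, monthly_usage_hours]; label: 1 = churned, 0 = stayed.
X = [[25, 40], [32, 5], [47, 2], [51, 35], [23, 3], [40, 20], [36, 1], [29, 30]]
y = [0, 1, 1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The learned rules are human-readable, which is why decision trees
# are popular as a "white box" technique.
print(export_text(tree, feature_names=["age", "monthly_usage_hours"]))
print(tree.predict([[30, 4]]))  # predict for a new customer
```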

3.1.4. Regression Analysis

Regression analysis is a way to explore the relationship between variables. This allows businesses to predict and understand how one variable affects another. Discover case studies that demonstrate how regression analysis is used to predict customer behavior, forecast sales trends, and optimize pricing strategies.
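
Here is a minimal regression sketch, assuming scikit-learn and NumPy are installed; the price and sales figures are fabricated for illustration. It fits a linear model relating unit price to units sold and uses it to estimate demand at a new price point.

```python
# Minimal regression sketch (illustrative only).
# Assumes scikit-learn and numpy are installed; the data points are made up.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: unit price vs. weekly units sold.
price = np.array([[9.5], [10.0], [10.5], [11.0], [11.5], [12.0]])
units_sold = np.array([520, 500, 470, 440, 410, 380])

model = LinearRegression().fit(price, units_sold)

print("change in units sold per $1 price increase:", model.coef_[0])
print("estimated demand at $10.75:", model.predict([[10.75]])[0])
```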

4. Data Mining Case Studies: Real Success Stories

Success stories in the dynamic world of data mining are a testament to the power of extracting actionable insight from large datasets. This exploration is centered on three compelling cases that demonstrate the real-world impact data mining has in different business scenarios. These narratives, which range from improving customer retention and optimizing supply-chain operations to enhancing marketing strategies, shed light on tangible benefits organizations can reap by harnessing data.

  • 4.1. Improving Customer Retention 

This case study shows how a leading e-commerce company revolutionized its customer retention strategy by leveraging the power of data-mining. The company analyzed customer behavior, engagement patterns, and purchase histories to identify key factors that influence customer loyalty. The business saw a notable increase in customer satisfaction as well as a reduction in churn by implementing personalized recommendations and targeted promotional campaigns based on data. This success story shows how data mining can not only identify areas of improvement but also provide actionable solutions to foster long-term relationships with customers.

  • 4.2. The Case for Optimizing Supply Chain Efficiency

Data mining was a game changer for a multinational corporation struggling with complex supply-chain challenges. The company identified inefficiencies and bottlenecks in its supply chain through a thorough analysis of historical data and market trends. The organization saw a significant increase in efficiency after implementing data-driven optimizations such as demand forecasting and inventory management tools. This case study shows how data mining can transform supply chain management into a strategic asset, resulting in cost savings and improved business performance.

  • 4.3. The Precision of Marketing Campaigns

Data mining has proven to be the compass that guided a global advertising firm to success in the highly competitive world of marketing. The agency used advanced analytics to analyze consumer demographics, internet behavior, and response patterns in order to create marketing campaigns tailored with unmatched precision. The result? A significant increase in conversion rates and ROI for clients. This case study is a beacon for businesses looking to elevate their marketing strategy through strategic data mining, and it emphasizes the crucial role data plays in modern marketing.

  • 4.4. Overcoming challenges and gaining insights

These success stories highlight the transformative power of data mining. However, they also reveal the challenges that organizations face when implementing and leveraging this technology. Data mining is a complex landscape that requires a strategic view. From privacy concerns to the demand for data analysts with the right skills, it’s important to take a long-term approach. The overarching message is that the benefits of increased customer retention, improved supply chains, and enhanced marketing strategies outweigh any challenges. Data mining is a world of endless possibilities for businesses that continue to innovate and adjust. It allows organizations to thrive in a data-driven environment.

5. Benefits and ROI: Demonstrating tangible benefits

Businesses are increasingly realizing the benefits of data mining in the current dynamic environment. The benefits are numerous and tangible, ranging from improved decision-making to increased operational efficiency. We’ll explore these benefits, and how businesses can leverage data mining to achieve significant gains.

  • 5.1. Enhancing Decision Making

Data mining provides businesses with actionable insight derived from massive datasets. Analyzing patterns and trends allows organizations to make more informed decisions. This reduces uncertainty and increases the chances of success. There are many case studies that show how data mining has transformed the decision-making process of businesses in various sectors.

  • 5.2. Operational Efficiency

Data mining is essential to achieving efficiency, which is the cornerstone of any successful business. Organizations can improve their efficiency by optimizing processes, identifying bottlenecks, and streamlining operations. These real-world examples show how businesses have made remarkable improvements in their operations, leading to savings and resource optimization.

  • 5.3. Personalized Customer Experiences

Data mining has the ability to customize experiences for customers. Businesses can increase customer satisfaction and loyalty by analyzing the behavior and preferences of their customers. Discover case studies that show how data mining has been used to create engaging and personalized customer journeys.

  • 5.4. Competitive Advantage

Gaining a competitive advantage is essential in today’s highly competitive environment. Data mining gives businesses insights into the market, competitor strategies, and customer expectations. These insights can give organizations a competitive edge and help them achieve success. Look at case studies that show how companies have outperformed their competitors by using data mining.

  • 5.5. Risk Mitigation

Data mining allows organizations to identify and mitigate threats proactively. Data mining tools are able to analyze patterns and predict risks, whether it is fraud detection, disruptions in the supply chain, or fluctuations in the market. Discover real-world cases where companies have used data mining to manage risk and protect their operations.

  • 5.6. Increased Revenue

Data mining can have a direct effect on revenue generation. Businesses can achieve significant revenue growth by identifying sales opportunities and optimizing pricing strategies. Explore case studies to see how data mining is a major driver of revenue growth for diverse businesses.

  • 5.7. Employee Productivity and Satisfaction

Data mining can improve internal processes as well. Businesses can improve productivity by analyzing employee data. Learn how companies have used data mining to improve employee performance and satisfaction.

  • 5.8. Adaptability to market changes

Adaptability is essential in today’s fast-paced environment. Data mining gives businesses the tools to respond quickly to changes in market conditions, new trends, and changing consumer preferences. Data mining case studies show how companies have used it to adapt strategies and stay agile in dynamic markets.

6. Calculating ROI and Benefits

To justify investments, businesses must also quantify their return on investment. Calculating ROI for data mining initiatives requires a thorough analysis of the costs, benefits, and long-term impacts. Let’s examine the complexities of ROI within the context of data-mining.

  • 6.1. Cost-Benefit Analysis

Prior to focusing on ROI, companies must perform a cost-benefit assessment of their data mining projects. It involves comparing the costs associated with implementing data-mining tools, training staff, and maintaining infrastructure to the benefits anticipated, such as higher revenue, cost savings and better decision-making. Case studies from real-world situations provide insight into cost-benefit analysis.
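
As a simple, hypothetical illustration of the comparison described above, the sketch below sums assumed costs and benefits and derives a basic ROI figure; real analyses would also discount future cash flows and account for intangible benefits.

```python
# Minimal ROI sketch with hypothetical figures (illustrative only).
# Costs and benefits would come from a real cost-benefit analysis.
costs = {
    "software_licences": 60_000,
    "staff_training": 25_000,
    "infrastructure": 40_000,
}
benefits = {
    "additional_revenue": 150_000,
    "cost_savings": 45_000,
}

total_cost = sum(costs.values())
total_benefit = sum(benefits.values())

roi = (total_benefit - total_cost) / total_cost  # net gain relative to cost
print(f"Total cost: {total_cost:,}")
print(f"Total benefit: {total_benefit:,}")
print(f"ROI: {roi:.0%}")  # 56% on these hypothetical numbers
```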

  • 6.2. Quantifying Tangible and Intangible Benefits

Data mining initiatives can yield tangible and intangible benefits. Quantifying tangible benefits such as an increase in sales or a reduction in operational costs is easier. Intangible benefits such as improved brand reputation or customer satisfaction are also important, but they may require a nuanced measurement approach. Examine case studies that quantify both types.

  • 6.3. Long-term Impact Assessment

ROI calculations should not be restricted to immediate gains. Businesses need to assess the impact their data mining projects will have in the future. Consider factors like sustainability, scalability, and ongoing benefits. Case studies that demonstrate the success of data-mining strategies over time can provide valuable insight into long-term impact assessment.

  • 6.4. Key Performance Indicators for ROI

Businesses must establish KPIs that are aligned with their goals in order to measure ROI. KPIs can be used to evaluate the success of data-mining initiatives, whether it is tracking sales growth, customer satisfaction rates, or operational efficiency. Explore case studies to learn how to select and monitor KPIs strategically for ROI measurement.

  • 6.5. Benchmarking against Industry Standards

It is important to compare the performance of data-mining initiatives with industry standards in order to determine if a company is achieving optimal outcomes. Case studies that demonstrate benchmarking strategies against peers in the industry provide valuable insight into best practices and areas for improvement.

  • 6.6. Scalability and flexibility

Scalability and flexibility should be considered when calculating ROI. Businesses that can scale their data-mining capabilities seamlessly to meet changing needs and industry shifts are more likely to see a sustained ROI. These real-world examples show how companies have created flexible and scalable data mining infrastructures.

  • 6.7. Learning from Failures

Data mining initiatives do not always yield the ROI expected. It is important to learn from failures in order to refine strategies and improve future initiatives. Data mining case studies that openly discuss challenges, setbacks and the adjustments made by companies can be a valuable resource for those who are just starting out.

  • 6.8. Communicating ROI Success Stories

It is important to effectively communicate the success stories behind data mining initiatives in order to gain support from the rest of the organization. Discover case studies that show how companies have created compelling narratives to showcase their ROI and foster a culture of data-driven decision making.

7. Data Mining Ethics

Data mining is a field where ethical considerations are crucial to ensuring transparent and responsible practices. It is important to carefully navigate the ethical landscape as organizations use data to extract valuable insights. This section examines ethical issues in data mining and highlights cases that demonstrate ethical practices.

  • 7.1. Understanding Ethical Considerations

Data mining ethics revolves around privacy, consent, and the responsible use of information. Businesses are faced with the question of how they collect and use data. Ethical considerations also include bias in data and the fairness of algorithms.

  • 7.2. Balancing Innovation and Privacy

Finding the right balance between privacy and innovation is a major ethical issue in data mining. To gain a market edge through data insights, organizations must walk a tightrope: innovating without compromising the privacy of the individuals behind the data. Case studies will illuminate how companies have struck this balance successfully.

  • 7.3. Transparency and informed consent

Transparency is another important aspect of ethical data mining: individuals should be informed and give consent before their data is used. This subtopic explores the importance of transparency in data collection and processing, with case studies highlighting organizations that have set exemplary standards for obtaining informed consent.

9. Implementing Data Mining

Data mining is a powerful tool for businesses, but it requires a strategy to avoid common pitfalls. This section is a comprehensive guide to help businesses implement data mining effectively. It outlines step-by-step procedures and warns against potential challenges.

  • 9.1. Determining the scope and objectives

Data mining begins with defining the objectives and scope of the project. It is important to identify the business challenges to be solved and understand how data mining can help. The importance of a clearly defined scope will be illustrated by practical examples and case studies.

  • 9.2. Selecting data sources and preprocessing

The success of any data mining project depends on the selection of the correct data sources. This section examines the criteria to be used when selecting data sources. It emphasizes the importance of good data quality. It also explains the steps to be taken in order to prepare and clean the data before analysis.

  • 9.3. Choosing the Right Algorithms

Data mining projects depend on selecting the right algorithms to extract patterns and insights. Businesses will learn about different algorithms and how they are used, and case studies show examples of when the right algorithms have led to breakthroughs.

  • 9.4. Avoiding Common Pitfalls

Businesses must be aware that data mining has many benefits but also some potential pitfalls. This subtopic highlights the common challenges and mistakes that businesses face during data mining projects. It provides them with knowledge they can use to overcome these obstacles.

  • 9.5. Data Security

When dealing with sensitive information, security is of paramount importance. This section examines the data security measures that matter when implementing data mining, offering case studies that highlight the consequences of ignoring security protocols and demonstrating successful methods of safeguarding data.

  • 9.6. Addressing Resistance and Culture Shift

Data mining initiatives can be hampered by resistance to change or a lack of data-driven culture. This sub-topic discusses how to overcome resistance in an organization and create a culture of data-driven decision making. Case studies illustrate successful cultural shifts.

  • 9.7. Continuous Monitoring and Adaptation

Data mining is a process that requires constant monitoring and adaptation. This section stresses the importance of continuous evaluation, adjustment, and learning from insights gained through data-mining. The case studies will show how organizations have adapted to changing data patterns.

  • 9.8. Data Mining as a Strategic Advantage

The ultimate goal of implementing data mining is to achieve a competitive advantage. This subtopic brings together the entire data mining process by highlighting case studies that demonstrate how businesses have successfully used data mining insights to improve operations, make informed decisions, and stay competitive.

10. Conclusion

The exploration of Data Mining Ethics highlights the importance of ethical considerations as the landscape of data usage continues to evolve. Data mining ethics become increasingly important as organizations use data to drive innovation and inform their decision-making. Creating a delicate balance between innovation, privacy, and business practices is a challenge that requires businesses to be transparent and get informed consent. Case studies from the real world have shown how organizations have successfully navigated ethical waters. They also show that ethical data mining preserves individual privacy and fosters trust.

The journey of Implementing Data Mining is complex and rewarding. This step-by-step guide provides businesses with the blueprint they need to succeed, from setting clear goals and choosing appropriate data sources to selecting the right algorithms. The path to success is not without challenges, and this section also explores common pitfalls such as ignoring data security measures or encountering resistance to change.

In a broader context, the integration of ethical considerations and strategic implementation speaks to both the responsibility and the opportunity that come with the age of data. Organizations that implement data mining strategies while prioritizing ethical practices are not only responsible stewards of data but also leaders in their industries. The synergy of ethics with strategy is not merely a risk-mitigation tool; it helps ensure that data mining has positive and sustainable effects on the business world, society, and the evolving landscape of technological innovation. The intersection of ethics and implementation is where data mining’s true potential is realized.

  • Q. What ethical considerations are important in data mining?

Privacy and consent are important ethical considerations for data mining.

  • Q. How can companies avoid common pitfalls when implementing data mining?

By ensuring the security of data, addressing cultural opposition, and encouraging continuous learning and adaptation.

  • Q. Why is transparency important in data mining?

Transparency and consent to use collected data ethically are key elements of building trust.

  • Q. What are the main steps to implement data mining in businesses?

Define your objectives, select data sources, choose algorithms, and monitor continuously.

  • Q. How can successful organizations use data mining to gain a strategic advantage?

By making informed decisions, improving operations, and staying ahead of the competition.

Companies understand that data mining can provide insights to improve the organization. Yet, many struggle with the right types of data to collect, where to start, or what project may benefit from data mining.

Examining the data mining success of others in a variety of circumstances illuminates how certain methods and software in the market can assist companies. See below how five organizations benefited from data mining in different industries: cybersecurity, finance, health care, logistics, and media.

1. Cerner Corporation

Over 14,000 hospitals, physician’s offices, and other medical facilities use Cerner Corporation’s software solutions.

This access allows Cerner to combine patient medical records and medical device data into an integrated medical database and improve health care.

Using Cloudera’s data mining tools allows data from different devices to feed into a common database and helps predict medical conditions.

“In our first attempts to build this common platform, we immediately ran into roadblocks,” says Ryan Brush, senior director and distinguished engineer at Cerner.

“Our clients are reporting that the new system has actually saved hundreds of lives by being able to predict if a patient is septic more effectively than they could before.”

Industry: Health care

Data mining provider: Cloudera

  • Collect data from unlimited and different sources
  • Enhance operational and financial performance for health care facilities
  • Improve patient diagnosis and save lives

Read the Cerner Corporation and Cloudera, Inc. case study.

2. DHL Temperature Management Solutions

DHL Temperature Management Solutions provides temperature-controlled pharmaceutical logistics to ensure pharmaceutical and biological goods stay within required temperature ranges to retain potency.

Previously, DHL transferred data into spreadsheets that took a week to compile and would only contain a portion of the potential information.

Moving to DOMO’s data mining platform allows for real-time reporting of a broader set of data categories to improve insight.

“We’re able to pinpoint issues that we couldn’t see before. For example, a certain product, on a certain lane, at a certain station is experiencing an issue repeatedly,” says Dina Bunn, global head of central operations and IT for DHL Temperature Management Solutions.

Industry: Logistics

Data mining provider: DOMO

  • Real-time versus week-old logistics information
  • More insight into sources of delays or problems at both a high and a detailed level
  • More customer engagement

Read the DHL and DOMO case study.

3. Nasdaq

The Nasdaq electronic stock exchange integrates Sisense’s data mining capabilities into its IR Insight software to help customers analyze huge data sets.

“Our customers rely on a range of content sets, including information that they license from others, as well as data that they input themselves,” says James Tickner, head of data analytics for Nasdaq Corporate Solutions.

“Being able to layer those together and attain a new level of value from content that they’ve been looking at for years but in another context.”

The combined application provides real-time analysis and clear reports that are easy for customers to understand and communicate internally.

Industry: Finance

Data mining provider: Sisense

  • Meets rigorous data security regulations
  • Quickly processes huge data sets from a variety of sources
  • Provides clients with new ways to visualize and interpret data to extract new value

Read or watch the Nasdaq and Sisense case study.

4. PBS

The Public Broadcasting Service (PBS) in the U.S. manages an online website serving 353 PBS member stations and their viewers. Its 330 million sessions, 800 million page views, and 17.5 million episode plays generate enormous amounts of data that the PBS team struggled to analyze.

PBS worked with LunaMetrics to perform data mining on the Google Analytics 360 platform to speed up insights into PBS customers.

Dan Haggerty, director of digital analytics for PBS, says “that was the coolest thing about it. A machine took our data without prior assumptions and reaffirmed and strengthened ideas that subject matter experts already suspected about our audiences based on our contextual knowledge.”

Industry: Media

Data mining provider: Google Analytics and LunaMetrics

  • Identified seven key audience segments based on web behaviors
  • Developed in-depth personas per segment through data mining
  • Insights help direct future content and feature development

Read the PBS, LunaMetrics, and Google Analytics case study.

5. The Pegasus Group

Cyber attackers compromised and targeted the data mining system (DMS) of a major network client of The Pegasus Group and launched a distributed denial-of-service (DDoS) attack against 1,500 services.

Under extreme time pressure, The Pegasus Group needed to find a way to use data mining to analyze up to 35GB of data with no prior knowledge of the data contents.

“[I analyzed] the first three million lines and [used RapidMiner’s data mining to perform] a stratified sampling to see which ones [were] benign, which packets [were] really part of the network, and which packets were part of the attack,” says Rodrigo Fuentealba Cartes of The Pegasus Group.

“In just 15 minutes … I used this amazing simulator to see what kinds of parameters I could use to filter packets … and in another two hours, the attack was stopped.”

Industry: Cybersecurity

Data mining provider: RapidMiner

  • Uploaded and analyzed three million lines of data 
  • Recommended analysis models provided answers within 15 minutes
  • Data analysis suggested solutions that stopped the attack within two hours

Watch The Pegasus Group and RapidMiner case study.

In the last decade, advances in processing power and speed have allowed us to move from tedious and time-consuming manual practices to fast and easy automated data analysis. The more complex the data sets collected, the greater the potential to uncover relevant information. Retailers, banks, manufacturers, healthcare companies, etc., are using data mining to uncover the relationships between everything from price optimisation, promotions and demographics to how economics, risk, competition and online presence affect their business models, revenues, operations and customer relationships. Today, data scientists have become indispensable to organisations around the world as companies seek to achieve bigger goals than ever before with data science. In this article, you will learn about the main use cases of data mining and how it has opened up a world of possibilities for businesses.

Today, organisations have access to more data than ever before. However, making sense of the huge volumes of structured and unstructured data to implement improvements across the organisation can be extremely difficult due to the sheer volume of information.

What is Data Mining?

Data mining, also called knowledge discovery in databases, is the process of analyzing massive volumes of data to discover interesting and useful patterns, relationships, and anomalies, and to turn raw data into business intelligence that helps companies solve problems, mitigate risks, and seize new opportunities. The field combines tools from statistics and artificial intelligence with database management to analyze large digital collections, known as data sets, and to predict outcomes. Data mining is widely used in business, scientific research, and government security.

The data mining process breaks down into five steps:

1. Organizations collect data and load it into their data warehouses.
2. They store and manage the data, either on in-house servers or in the cloud.
3. Business analysts, management teams, and information technology professionals access the data and determine how they want to organize it.
4. Application software sorts the data based on the user’s results.
5. The end user presents the data in an easy-to-share format, such as a graph or table.

Data mining practitioners typically achieve timely, reliable results by following a structured, repeatable process that involves these six steps:

  • Business understanding: Developing a thorough understanding of the project parameters, including the current business situation, the primary business objective of the project, and the criteria for success.
  • Data understanding: Determining the data that will be needed to solve the problem and gathering it from all available sources.
  • Data preparation: Preparing the data in the appropriate format to answer the business question, fixing any data quality problems such as missing or duplicate data.
  • Modeling: Using algorithms to identify patterns within the data.
  • Evaluation: Determining whether and how well the results delivered by a given model will help achieve the business goal. There is often an iterative phase to find the best algorithm in order to achieve the best result.
  • Deployment: Making the results of the project available to decision-makers.

Data Mining Techniques

There are many data mining techniques that organisations can use to turn raw data into actionable insights. These techniques range from advanced AI to the fundamentals of data preparation, which are essential to maximising the value of data investments:

1. Pattern tracking

Pattern tracking is a fundamental technique of data mining. It is about identifying and monitoring trends or patterns in data to make intelligent inferences about business outcomes. When an organisation identifies a trend in sales data, for example, it has a basis for taking action to leverage that information. If it is determined that a certain product sells better than others for a particular demographic, an organisation can use this knowledge to create similar products or services, or simply stock the original product better for that demographic.

2. Data cleaning and preparation

Data cleaning and preparation is an essential part of the data mining process. Raw data must be cleaned and formatted to be useful for the various analysis methods. Data cleaning and preparation includes various elements of data modelling, transformation, migration, integration and aggregation. It is a necessary step in understanding the basic characteristics and attributes of the data to determine its best use.
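
A minimal cleaning-and-preparation sketch, assuming the pandas library and using made-up column names and values, might look like the following: duplicates are dropped, country codes are normalised, and text fields are converted to proper date and numeric types.

```python
# Minimal data cleaning and preparation sketch (illustrative only).
# Assumes pandas is installed; column names and values are made up.
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "signup_date": ["2021-01-05", "2021-02-10", "2021-02-10", None, "2021-03-01"],
    "country": ["ES", "es", "es", "FR", None],
    "monthly_spend": ["100", "80", "80", "120", "90"],
})

clean = (
    raw.drop_duplicates(subset="customer_id")  # remove duplicate records
       .assign(
           country=lambda df: df["country"].str.upper().fillna("UNKNOWN"),
           signup_date=lambda df: pd.to_datetime(df["signup_date"]),
           monthly_spend=lambda df: pd.to_numeric(df["monthly_spend"]),
       )
)

print(clean.dtypes)
print(clean)
```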

3. Classification

Classification-based data mining techniques involve analysing the various attributes associated with different types of data. Once organisations have identified the key characteristics of these data types, they can categorise or classify the corresponding data. This is essential for identifying, for example, personally identifiable information that organisations may wish to protect or delete from records.

4. Outlier detection

Outlier detection identifies anomalies in data sets. Once organisations have found outliers in their data, it is easier to understand why these anomalies occur and to prepare for any future occurrences to better meet business objectives. For example, if there is a spike in the use of transactional credit card systems at a certain time of day, organisations can leverage this information by discovering the reason for the spike to optimise their sales for the rest of the day.
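
As a hedged illustration, the sketch below flags unusually large transaction amounts using the interquartile range, a simple statistical rule of thumb; the amounts are invented, and real systems often combine such rules with model-based detectors.

```python
# Minimal outlier detection sketch using the interquartile range (illustrative only).
# Assumes pandas is installed; the transaction amounts are made up.
import pandas as pd

amounts = pd.Series([12, 15, 14, 13, 16, 15, 14, 250, 13, 12, 15, 480])

q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = amounts[(amounts < lower) | (amounts > upper)]
print("normal range:", round(lower, 2), "to", round(upper, 2))
print("flagged transactions:")
print(outliers)
```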

5. Association

Association is a data mining technique related to statistics. It indicates that certain data is related to other data or data-driven events. It is similar to the notion of co-occurrence in machine learning, where the probability of one data-based event is indicated by the presence of another. This means that data analysis shows that there is a relationship between two data events: for example, the fact that the purchase of hamburgers is frequently accompanied by the purchase of chips.

6. Clustering

Clustering is an analysis technique that relies on visual approaches to understanding data. Clustering mechanisms use graphs to show where the distribution of data is with respect to different types of metrics. Clustering techniques also use different colours to show the distribution of data. Graphical approaches are ideal for using cluster analysis. With graphs and clustering in particular, users can visually see how data is distributed to identify trends that are relevant to their business objectives.

7. Regression

Regression techniques are useful for identifying the nature of the relationship between variables in a data set. These relationships may be causal in some cases, or simply correlated in others. Regression is a simple white box technique that clearly reveals the relationship between variables. Regression techniques are used in some aspects of forecasting and data modelling.

8. Sequential patterns

This data mining technique focuses on finding a series of events that occur in sequence. It is particularly useful for transactional data mining. For example, this technique can reveal which items of clothing customers are most likely to buy after an initial purchase of, say, a pair of shoes. Understanding sequential patterns can help organisations to recommend additional items to customers to boost sales.

9. Prediction

Prediction is a very powerful aspect of data mining and is one of the four branches of analytics. Predictive analytics uses patterns found in current or historical data to extend them into the future. In this way, it gives organisations insight into trends that will occur in their data in the future. There are several different approaches to using predictive analytics. Some of the more advanced ones involve aspects of machine learning and artificial intelligence. However, predictive analytics does not necessarily rely on these techniques, but can also be facilitated by simpler algorithms.
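
To illustrate the prediction idea in its simplest form, the sketch below fits a linear trend to twelve months of made-up sales figures with NumPy and extends it three months into the future; real forecasting would normally account for seasonality and uncertainty.

```python
# Minimal prediction sketch: extend a sales trend into the future (illustrative only).
# Assumes numpy is installed; the monthly figures are made up.
import numpy as np

# Twelve months of hypothetical historical sales.
months = np.arange(12)
sales = np.array([100, 104, 110, 113, 120, 123, 130, 134, 141, 144, 150, 156])

# Fit a simple linear trend (degree-1 polynomial) to the history.
slope, intercept = np.polyfit(months, sales, deg=1)

# Project the trend three months ahead.
future_months = np.arange(12, 15)
forecast = slope * future_months + intercept
print("forecast for the next three months:", np.round(forecast, 1))
```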

10. Decision trees

Decision trees are a specific type of predictive model that allows organisations to efficiently extract data. Technically, a decision tree is part of machine learning, but it is better known as a “white box” machine learning technique due to its extremely simple nature. A decision tree allows users to clearly understand how data inputs affect outcomes. When multiple decision tree models are combined, they create predictive analytics models known as a random forest. Complicated random forest models are considered “black box” machine learning techniques, because it is not always easy to understand their results based on their inputs. However, in most cases, this basic form of ensemble modelling is more accurate than using decision trees alone.

11. Neural networks

A neural network is a specific type of machine learning model that is often used with AI and deep learning. So called because they have different layers that resemble the functioning of neurons in the human brain, neural networks are one of the most accurate machine learning models used today.

12. Visualization

Data visualisations are another important part of data mining. They offer users a view of data based on sensory perceptions that people can see. Today’s data visualisations are dynamic, useful for real-time data streaming, and are characterised by different colours that reveal different trends and patterns in the data. Dashboards are a powerful way to use data visualisations to uncover information about data operations. Organisations can base dashboards on different metrics and use visualisations to highlight patterns in the data, rather than simply using numerical results from statistical models.
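
A minimal visualisation sketch, assuming matplotlib and using invented segment figures, shows how a simple chart can make a mining result easier to communicate than a table of numbers.

```python
# Minimal data visualization sketch (illustrative only).
# Assumes matplotlib is installed; the segment figures are made up.
import matplotlib.pyplot as plt

segments = ["Bargain hunters", "Regulars", "High spenders"]
revenue = [42_000, 78_000, 125_000]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(segments, revenue, color=["#8ecae6", "#219ebc", "#023047"])
ax.set_ylabel("Annual revenue")
ax.set_title("Revenue by customer segment")
for i, value in enumerate(revenue):
    ax.text(i, value, f"{value:,}", ha="center", va="bottom")  # label each bar

plt.tight_layout()
plt.show()
```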

13. Statistical techniques

Statistical techniques are at the heart of most analyses involved in the data mining process. Different analysis models are based on statistical concepts, which produce numerical values applicable to specific business objectives. For example, neural networks use complex statistics based on different weights and measures to determine whether an image is a dog or a cat in image recognition systems.

14. Long-term memory processing

Long-term memory processing refers to the ability to analyse data over long periods. Historical data stored in data warehouses is useful for this purpose. When an organisation can analyse data over a long period of time, it is able to identify patterns that would otherwise be too subtle to detect. For example, by analysing attrition over a period of several years, an organisation can find subtle clues that could lead to a reduction in attrition in finance.

15. Data warehousing

Data warehousing is an important part of the data mining process. Traditionally, data warehousing was about storing structured data in relational database management systems so that it could be analysed for business intelligence, reporting and basic dashboards. Today, there are cloud-based data warehouses and semi-structured and unstructured data warehouses such as Hadoop. While data warehouses were traditionally used for historical data, many modern approaches can provide deep analysis of data in real time.

16. Machine learning and artificial intelligence

Machine learning and artificial intelligence (AI) represent some of the most advanced developments in the field of data mining. Advanced forms of machine learning, such as deep learning, offer highly accurate predictions when working with large-scale data. They are therefore useful for data processing in AI implementations such as computer vision, speech recognition or sophisticated text analysis using natural language processing. These data mining techniques help to determine the value of semi-structured and unstructured data.

Why is data mining important?

Data mining allows you to:

  • Sift through all the chaotic and repetitive noise in your data.
  • Understand what is relevant and then make good use of that information to assess likely outcomes.
  • Accelerate the pace of making informed decisions.

Benefits of Data Mining

  • Data mining helps companies get knowledge-based information.
  • It can be implemented in new systems as well as existing platforms.
  • Data mining helps organizations make profitable adjustments in operations and production.
  • It facilitates automated prediction of trends and behaviors as well as automated discovery of hidden patterns.
  • Data mining is a cost-effective and efficient solution compared to other statistical data applications.
  • Data mining helps with the decision-making process.
  • It is a speedy process which makes it easy for users to analyze huge amounts of data in less time.

Data Mining use cases and examples

The predictive capacity of data mining has changed the design of business strategies. Now, you can understand the present to anticipate the future. These are some use cases and examples of data mining across industries today:

  • Marketing: Data mining is used to explore increasingly large databases and to improve market segmentation. By analysing the relationships between parameters such as customer age, gender, tastes, etc., it is possible to guess their behaviour in order to direct personalised loyalty campaigns. Data mining in marketing also predicts which users are likely to unsubscribe from a service, what interests them based on their searches, or what a mailing list should include to achieve a higher response rate.
  • Banking: Banks use data mining to better understand market risks. It is commonly applied to credit ratings and to intelligent anti-fraud systems to analyse transactions, card transactions, purchasing patterns and customer financial data. Data mining also allows banks to learn more about customers’ online preferences or habits to optimise the return on their marketing campaigns, study the performance of sales channels or manage regulatory compliance obligations.
  • Education: Data mining helps educators access student data, predict achievement levels and find students or groups of students who need extra attention, for example, students who are weak in maths.
  • E-commerce: E-commerce websites use data mining to offer cross-sells and up-sells. One of the most famous names is Amazon, which uses data mining techniques to bring more customers into its e-commerce store.
  • Retail: Supermarkets, for example, use joint purchasing patterns to identify product associations and decide how to place them in the aisles and on the shelves. Data mining also detects which offers are most valued by customers or increase sales at the checkout queue.
  • Service providers: Service providers such as mobile phone and utility companies use data mining to predict why a customer is likely to leave. They analyse billing details, customer service interactions and complaints made to the company to assign each customer a probability score and offer incentives.
  • Medicine: Data mining enables more accurate diagnostics. Having all of the patient’s information, such as medical records, physical examinations, and treatment patterns, allows more effective treatments to be prescribed. It also enables more effective, efficient and cost-effective management of health resources by identifying risks, predicting illnesses in certain segments of the population or forecasting the length of hospital admission. Detecting fraud and irregularities, and strengthening ties with patients through an enhanced knowledge of their needs, are also advantages of using data mining in medicine.
  • Insurance: Data mining helps insurance companies price their products profitably and promote new offers to new or existing customers.
  • Manufacturing: With the help of data mining, manufacturers can predict wear and tear on production assets and anticipate maintenance, which helps them minimize downtime.
  • Crime investigation: Data mining helps crime investigation agencies decide where to deploy the police workforce (where is a crime most likely to happen, and when?), whom to search at a border crossing, and so on.
  • Television and radio: There are networks that apply real-time data mining to measure their online television (IPTV) and radio audiences. These systems collect and analyse, on the fly, anonymous information from channel views, broadcasts and programming. Data mining allows networks to make personalised recommendations to radio listeners and TV viewers, as well as get to know their interests and activities in real time and better understand their behaviour. Networks also gain valuable knowledge for their advertisers, who use this data to target their potential customers more accurately.

Organizations across industries are achieving transformative results from data mining:

  • Bayer helps farmers with sustainable food production: Weeds that damage crops have been a problem for farmers since farming began. A proper solution is to apply a narrow-spectrum herbicide that effectively kills the exact species of weed in the field while having as few undesirable side effects as possible. But to do that, farmers first need to accurately identify the weeds in their fields. Using Talend Real-time Big Data, Bayer Digital Farming developed WEEDSCOUT, a new application farmers can download free. The app uses machine learning and artificial intelligence to match photos of weeds in a Bayer database with weed photos farmers send in. It gives growers the opportunity to more precisely predict the impact of their actions, such as the choice of seed variety, the application rate of crop protection products, or harvest timing.
  • Air France KLM caters to customer travel preferences: The airline uses data mining techniques to create a 360-degree customer view by integrating data from trip searches, bookings, and flight operations with web, social media, call center, and airport lounge interactions. It uses this deep customer insight to create personalized travel experiences.
  • Groupon aligns marketing activities: One of Groupon’s key challenges is processing the massive volume of data it uses to provide its shopping service. Every day, the company processes more than a terabyte of raw data in real time and stores this information in various database systems. Data mining allows Groupon to align marketing activities more closely with customer preferences, analyzing a terabyte of customer data in real time and helping the company identify trends as they emerge.
  • Domino’s helps customers build the perfect pizza: The largest pizza company in the world collects data from 85,000 structured and unstructured data sources, including point-of-sale systems and 26 supply chain centers, and through all its channels, including text messages, social media, and Amazon Echo. This level of insight has improved business performance while enabling one-to-one buying experiences across touchpoints.

You can use data mining to solve almost any business problem that involves data, including:

  • Increasing revenue.
  • Understanding customer segments and preferences.
  • Acquiring new customers.
  • Improving cross-selling and up-selling.
  • Retaining customers and increasing loyalty.
  • Increasing ROI from marketing campaigns.
  • Detecting fraud.
  • Identifying credit risks.
  • Monitoring operational performance.

Data mining tools

Organizations can get started with data mining by accessing the necessary tools. Because the data mining process starts right after data ingestion, it’s critical to find data preparation tools that support different data structures necessary for data mining analytics. Organizations will also want to classify data in order to explore it with the numerous techniques discussed above.

1. Oracle Data Mining

Oracle Data Mining, popularly known as ODM, is a module of the Oracle Advanced Analytics Database. This data mining tool allows data analysts to generate detailed insights and make predictions. It helps predict customer behavior, develop customer profiles, and identify cross-selling opportunities.

2. RapidMiner

RapidMiner is one of the best predictive analysis systems. It is written in the Java programming language and provides an integrated environment for deep learning, text mining, machine learning and predictive analysis. It offers a range of products to build new data mining processes and set up predictive analysis.

3. Orange Data Mining

Orange is a software suite for machine learning and data mining that is particularly strong at data visualization and is built from components. The components of Orange are called “widgets.” These widgets range from preprocessing and data visualization to the assessment of algorithms and predictive modeling. Widgets deliver significant functionality such as displaying data tables and selecting features, reading data, training predictors, comparing learning algorithms, and visualizing data elements.

4. Weka

Weka is an open-source machine learning software with a vast collection of algorithms for data mining, written in the Java programming language. Its GUI facilitates easy access to all its features, and it supports different data mining tasks, like preprocessing, classification, regression, clustering, and visualization, in a graphical interface that makes it easy to use. For each of these tasks, Weka provides built-in machine learning algorithms which allow you to quickly test your ideas and deploy models without writing any code.

5. KNIME

KNIME is an integration platform for data analytics and reporting developed by KNIME.com AG. It operates on the concept of a modular data pipeline and consists of various machine learning and data mining components embedded together. It is a free, open-source platform whose intuitive interface allows you to create end-to-end data science workflows, from modeling to production, while pre-built components enable fast modeling without entering a single line of code. A set of powerful extensions and integrations makes KNIME a versatile and scalable platform for processing complex types of data and using advanced algorithms. With KNIME, data scientists can create applications and services for analytics or business intelligence. In the financial industry, for instance, common use cases include credit scoring, fraud detection, and credit risk assessment.

6. Sisense. Sisense is another effective data mining tool and is particularly well suited to BI reporting within an organization. It can handle and process data for both small- and large-scale organizations, and it instantly analyzes and visualizes big and disparate datasets. It is an ideal tool for creating dashboards with a wide variety of visualizations: data from various sources can be combined into a common repository and refined to generate rich reports that are shared across departments. The reports Sisense generates are highly visual, and the tool is designed for non-technical users, offering drag-and-drop functionality and widgets. Different widgets can be selected to present reports as pie charts, line charts, bar graphs, and so on, depending on the organization’s needs, and reports can be drilled into with a click to check detailed, comprehensive data.

7. Dundas. Dundas is another excellent dashboard, reporting, and data analytics tool. It is reliable, with rapid integrations and quick insights, and provides unlimited data transformation patterns with attractive tables, charts, and graphs. Dundas BI organizes data in well-defined structures to ease processing for the user, and its relational approach facilitates multi-dimensional analysis focused on business-critical matters. Because it generates reliable reports, it can reduce costs and eliminate the need for additional software.

8. InetSoft. InetSoft is an analytics dashboard and reporting tool that supports iterative development of data reports and views and generates pixel-perfect reports. It allows quick and flexible transformation of data from various sources.

9. Qlik. Qlik is a data mining and visualization tool that also offers dashboards and supports multiple data sources and file types. Its features include drag-and-drop interfaces for creating flexible, interactive data visualizations that respond instantly to interactions and changes, straightforward security for data and content across all devices, and a centralized hub for sharing relevant analyses, including apps and stories.

10. MonkeyLearn. MonkeyLearn is a machine learning platform that specializes in text mining. With its user-friendly interface, you can easily integrate MonkeyLearn with your existing tools to perform data mining in real time. You can start immediately with pre-trained text mining models, such as a sentiment analyzer, or build a customized solution for more specific business needs. MonkeyLearn supports various text mining tasks, from detecting topics, sentiment, and intent to extracting keywords and named entities. Its text mining tools are already being used to automate ticket tagging and routing in customer support, automatically detect negative feedback on social media, and deliver fine-grained insights that lead to better decision-making.

I hope you found this article useful. If you need any help with data mining or a data science project in general, contact us; we have experts in this field.


Data mining in clinical big data: the frequently used databases, steps, and methodological models

Wen-Tao Wu, Yuan-Jie Li, Ao-Zi Feng, Tao Huang, An-Ding Xu & Jun Lyu (ORCID: 0000-0002-2237-8771)

Military Medical Research, volume 8, Article number: 44 (2021)


Many high-quality studies have emerged from public databases, such as Surveillance, Epidemiology, and End Results (SEER), the National Health and Nutrition Examination Survey (NHANES), The Cancer Genome Atlas (TCGA), and the Medical Information Mart for Intensive Care (MIMIC); however, these data are often characterized by a high degree of dimensional heterogeneity, timeliness, scarcity, and irregularity, and as a result their value has not been fully utilized. Data-mining technology has been a frontier field in medical research, as it demonstrates excellent performance in evaluating patient risks and assisting clinical decision-making when building disease-prediction models. Therefore, data mining has unique advantages in clinical big-data research, especially in large-scale medical public databases. This article introduced the main medical public databases and described the steps, tasks, and models of data mining in simple language. Additionally, we described data-mining methods along with their practical applications. The goal of this work was to aid clinical researchers in gaining a clear and intuitive understanding of the application of data-mining technology to clinical big data in order to promote the production of research results that are beneficial to doctors and patients.

With the rapid development of computer software/hardware and internet technology, the amount of data has increased at an amazing speed. “Big data” as an abstract concept currently affects all walks of life [ 1 ], and although its importance has been recognized, its definition varies slightly from field to field. In the field of computer science, big data refers to a dataset that cannot be perceived, acquired, managed, processed, or served within a tolerable time by using traditional IT and software and hardware tools. Generally, big data refers to a dataset that exceeds the scope of the simple databases and data-processing architectures used in the early days of computing; it is characterized by high-volume, high-dimensional data that is rapidly updated, and it represents a phenomenon that has emerged in the digital age. Across the medical industry, various types of medical data are generated at high speed, and trends indicate that applying big data in the medical field helps improve the quality of medical care and optimizes medical processes and management strategies [ 2 , 3 ]. Currently, this trend is shifting from civilian medicine to military medicine. For example, the United States is exploring the potential of using one of its largest healthcare systems (the Military Healthcare System) to provide healthcare to veterans, which could benefit more than 9 million eligible personnel [ 4 ]. Another data-management system has been developed to assess the physical and mental health of active-duty personnel and is expected to yield significant economic benefits to the military medical system [ 5 ]. However, in medical research, the wide variety of clinical data and the differences between medical concepts across classification standards result in a high degree of dimensional heterogeneity, timeliness, scarcity, and irregularity in existing clinical data [ 6 , 7 ]. Furthermore, new data-analysis techniques have yet to be popularized in medical research [ 8 ]. These factors hinder the full realization of the value of existing data, and the intensive exploration of the value of clinical data remains a challenging problem.

Computer scientists have made outstanding contributions to the application of big data and introduced the concept of data mining to solve difficulties associated with such applications. Data mining (also known as knowledge discovery in databases) refers to the process of extracting potentially useful information and knowledge hidden in a large amount of incomplete, noisy, fuzzy, and random practical application data [ 9 ]. Unlike traditional research methods, several data-mining technologies mine information to discover knowledge based on the premise of unclear assumptions (i.e., they are directly applied without prior research design). The obtained information should have previously unknown, valid, and practical characteristics [ 9 ]. Data-mining technology does not aim to replace traditional statistical analysis techniques, but it does seek to extend and expand statistical analysis methodologies. From a practical point of view, machine learning (ML) is the main analytical method in data mining, as it represents a method of training models by using data and then using those models for predicting outcomes. Given the rapid progress of data-mining technology and its excellent performance in other industries and fields, it has introduced new opportunities and prospects to clinical big-data research [ 10 ]. Large amounts of high quality medical data are available to researchers in the form of public databases, which enable more researchers to participate in the process of medical data mining in the hope that the generated results can further guide clinical practice.

This article provided an overview for medical researchers interested in studying the application of data mining to clinical big data. To allow a clearer understanding of this application, the second part of the paper introduced the concept of public databases and summarized those commonly used in medical research. In the third part, we offered an overview of data mining, introducing its models, tasks, and processes, and summarized specific data-mining methods. In the fourth and fifth parts, we introduced data-mining algorithms commonly used in clinical practice along with specific cases in order to help clinical researchers clearly and intuitively understand the application of data-mining technology to clinical big data. Finally, we discussed the advantages and disadvantages of data mining in clinical analysis and offered insight into possible future applications.

Overview of common public medical databases

A public database is a data repository used for research and dedicated to housing data related to scientific research on an open platform. Such databases collect and store heterogeneous, multi-dimensional health and medical research data in a structured form and are characterized by massive scale, multiple ownership, complexity, and security requirements. These databases cover a wide range of data, including those related to cancer research, disease burden, nutrition and health, and genetics and the environment. Table 1 summarizes the main public medical databases [ 11 , 12 , 13 , 14 , 15 , 16 , 17 , 18 , 19 , 20 , 21 , 22 , 23 , 24 , 25 , 26 ]. Researchers can apply for access to data based on the scope of the database and the application procedures required to perform relevant medical research.

Data mining: an overview

Data mining is a multidisciplinary field at the intersection of database technology, statistics, ML, and pattern recognition that profits from all these disciplines [ 27 ]. Although this approach is not yet widespread in the field of medical research, several studies have demonstrated the promise of data mining in building disease-prediction models, assessing patient risk, and helping physicians make clinical decisions [ 28 , 29 , 30 , 31 ].

Data-mining models

Data mining has two kinds of models: descriptive and predictive. Predictive models are used to predict unknown or future values of variables of interest, whereas descriptive models are used to find human-interpretable patterns that describe the data [ 32 ].

Data-mining tasks

A model is usually implemented by a task, with the goal of description being to generalize patterns of potential associations in the data. Therefore, using a descriptive model usually results in a few collections with the same or similar attributes. Prediction mainly refers to estimation of the variable value of a specific attribute based on the variable values of other attributes, including classification and regression [ 33 ].

Data-mining methods

After the data-mining model and task are defined, the data-mining methods required to build the approach are chosen based on the disciplines involved. The choice of method depends on whether or not dependent variables (labels) are present in the analysis. Predictions with dependent variables (labels) are generated through supervised learning, which can be performed using linear regression, generalized linear regression, a proportional hazards model (the Cox regression model), a competitive risk model, decision trees, the random forest (RF) algorithm, and support vector machines (SVMs). In contrast, unsupervised learning involves no labels; the learning model instead infers internal structure in the data. Common unsupervised learning methods include principal component analysis (PCA), association analysis, and clustering analysis.
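As a rough illustration of this supervised/unsupervised distinction, the minimal Python sketch below (not from the paper; scikit-learn and synthetic data are assumptions) fits a labelled classifier next to a label-free clustering and dimensionality reduction on the same feature matrix.

```python
# A minimal sketch contrasting supervised and unsupervised learning with
# scikit-learn on synthetic data (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Supervised: labels (y) are available, so we fit a classifier.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", clf.score(X, y))

# Unsupervised: no labels are used; we look for structure in X alone.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
components = PCA(n_components=2).fit_transform(X)
print("cluster sizes:", np.bincount(clusters))
print("reduced shape:", components.shape)
```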

Data-mining algorithms for clinical big data

Data mining based on clinical big data can produce effective and valuable knowledge, which is essential for accurate clinical decision-making and risk assessment [ 34 ]. Data-mining algorithms enable realization of these goals.

Supervised learning

A concept often mentioned in supervised learning is the partitioning of datasets. To prevent overfitting of a model, a dataset can generally be divided into two or three parts: a training set, validation set, and test set. Ripley [ 35 ] defined these parts as a set of examples used for learning and fitting the parameters (i.e., weights) of the classifier, a set of examples used to tune the parameters (i.e., architecture) of a classifier, and a set of examples used only to assess the generalization performance of a fully-specified classifier, respectively. Briefly, the training set is used to train the model or determine the model parameters, the validation set is used to perform model selection, and the test set is used to verify model performance. In practice, data are often divided into only training and test sets, with the validation set less frequently involved. It should be emphasized that the results on the test set do not guarantee model correctness but only show that similar data can obtain similar results using the model. Therefore, the applicability of a model should be analysed in combination with the specific problems in the research. Classical statistical methods, such as linear regression, generalized linear regression, and proportional hazards models, have been widely used in medical research. Notably, most of these classical statistical methods have certain data requirements or assumptions; however, in the face of complicated clinical data, assumptions about data distribution are difficult to make. In contrast, some ML methods (algorithmic models) make no assumptions about the data and cross-validate the results; thus, they are likely to be favoured by clinical researchers [ 36 ]. For these reasons, this chapter focuses on ML methods that do not require assumptions about data distribution, as well as classical statistical methods that are used in specific situations.
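A minimal sketch of the train/validation/test partition described above is shown below; the 60/20/20 split, the synthetic data, and the choice of classifier are illustrative assumptions, not values from the text.

```python
# Partitioning a dataset into training, validation, and test sets with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# First carve out the test set, then split the remainder into train/validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy (used for model selection):", model.score(X_val, y_val))
print("test accuracy (final performance check):", model.score(X_test, y_test))
```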

Decision tree

A decision tree is a basic classification and regression method that generates a result similar to the tree structure of a flowchart, where each tree node represents a test on an attribute, each branch represents the output of an attribute, each leaf node (decision node) represents a class or class distribution, and the topmost part of the tree is the root node [ 37 ]. The decision tree model is called a classification tree when used for classification and a regression tree when used for regression. Studies have demonstrated the utility of the decision tree model in clinical applications. In a study on the prognosis of breast cancer patients, a decision tree model and a classical logistic regression model were constructed, respectively, with the predictive performance of the different models indicating that the decision tree model showed stronger predictive power when using real clinical data [ 38 ]. Similarly, the decision tree model has been applied to other areas of clinical medicine, including diagnosis of kidney stones [ 39 ], predicting the risk of sudden cardiac arrest [ 40 ], and exploration of the risk factors of type II diabetes [ 41 ]. A common feature of these studies is the use of a decision tree model to explore the interaction between variables and classify subjects into homogeneous categories based on their observed characteristics. In fact, because the decision tree accounts for the strong interaction between variables, it is more suitable for use with decision algorithms that follow the same structure [ 42 ]. In the construction of clinical prediction models and exploration of disease risk factors and patient prognosis, the decision tree model might offer more advantages and practical application value than some classical algorithms. Although the decision tree has many advantages, it recursively separates observations into branches to construct a tree; therefore, in terms of data imbalance, the precision of decision tree models needs improvement.
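To make the flowchart-like structure concrete, the sketch below fits a shallow classification tree with scikit-learn and prints its rules; the bundled breast cancer dataset is an illustrative stand-in and is not the data from the cited prognosis study.

```python
# A minimal classification-tree sketch (illustrative data and settings).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Limiting depth keeps the tree readable and curbs overfitting.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=list(X.columns)))  # root, branches, leaf nodes
```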

The RF method

The RF algorithm was developed as an application of an ensemble-learning method based on a collection of decision trees. The bootstrap method [ 43 ] is used to randomly retrieve sample sets from the training set, with decision trees generated by the bootstrap method constituting a “random forest” and predictions based on this derived from an ensemble average or majority vote. The biggest advantage of the RF method is that the random sampling of predictor variables at each decision tree node decreases the correlation among the trees in the forest, thereby improving the precision of ensemble predictions [ 44 ]. Given that a single decision tree model might encounter the problem of overfitting [ 45 ], the initial application of RF minimizes overfitting in classification and regression and improves predictive accuracy [ 44 ]. Taylor et al. [ 46 ] highlighted the potential of RF in correctly differentiating in-hospital mortality in patients experiencing sepsis after admission to the emergency department. Nowhere in the healthcare system is the need more pressing to find methods to reduce uncertainty than in the fast, chaotic environment of the emergency department. The authors demonstrated that the predictive performance of the RF method was superior to that of traditional emergency medicine methods and the methods enabled evaluation of more clinical variables than traditional modelling methods, which subsequently allowed the discovery of clinical variables not expected to be of predictive value or which otherwise would have been omitted as a rare predictor [ 46 ]. Another study based on the Medical Information Mart for Intensive Care (MIMIC) II database [ 47 ] found that RF had excellent predictive power regarding intensive care unit (ICU) mortality [ 48 ]. These studies showed that the application of RF to big data stored in the hospital healthcare system provided a new data-driven method for predictive analysis in critical care. Additionally, random survival forests have recently been developed to analyse survival data, especially right-censored survival data [ 49 , 50 ], which can help researchers conduct survival analyses in clinical oncology and help develop personalized treatment regimens that benefit patients [ 51 ].
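The sketch below shows the two ingredients named above, bootstrapped trees and random feature sub-sampling, using scikit-learn's random forest; the dataset and hyperparameters are illustrative assumptions rather than settings from the cited studies.

```python
# A minimal random forest sketch: an ensemble of bootstrapped decision trees.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=500,     # number of bootstrapped trees in the forest
    max_features="sqrt",  # random subset of predictors considered at each split
    oob_score=True,       # out-of-bag estimate derived from the bootstrap samples
    random_state=0,
)
rf.fit(X, y)
print("out-of-bag accuracy:", rf.oob_score_)
print("5-fold CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```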

Support vector machines

The SVM is a relatively new classification and prediction method developed by Cortes and Vapnik; it is a data-driven approach that does not require assumptions about data distribution [ 52 ]. The core purpose of an SVM is to identify a separation boundary (called a hyperplane) to help classify cases; thus, the advantages of SVMs are most evident when classifying and predicting cases based on high dimensional data or data with a small sample size [ 53 , 54 ].

In a study of drug compliance in patients with heart failure, researchers used an SVM to build a predictive model for patient compliance in order to overcome the problem of a large number of input variables relative to the number of available observations [ 55 ]. Additionally, the mechanisms of certain chronic and complex diseases observed in clinical practice remain unclear, and many risk factors, including gene–gene interactions and gene-environment interactions, must be considered in the research of such diseases [ 55 , 56 ]. SVMs are capable of addressing these issues. Yu et al. [ 54 ] applied an SVM for predicting diabetes onset based on data from the National Health and Nutrition Examination Survey (NHANES). Furthermore, these models have strong discrimination ability, making SVMs a promising classification approach for detecting individuals with chronic and complex diseases. However, a disadvantage of SVMs is that when the number of observation samples is large, the method becomes time- and resource-intensive, which is often highly inefficient.
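A minimal sketch of an SVM classifier follows; feature standardization is included because the margin is scale-sensitive, and the dataset and kernel settings are illustrative assumptions, not those of the cited studies.

```python
# A minimal SVM sketch: find a separating hyperplane in a transformed feature space.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize features, then fit an RBF-kernel SVM.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))
```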

Competitive risk model

Kaplan–Meier marginal regression and the Cox proportional hazards model are widely used in survival analysis in clinical studies. Classical survival analysis usually considers only one endpoint, such as the impact of patient survival time. However, in clinical medical research, multiple endpoints usually coexist, and these endpoints compete with one another to generate competitive risk data [ 57 ]. In the case of multiple endpoint events, the use of a single endpoint-analysis method can lead to a biased estimation of the probability of endpoint events due to the existence of competitive risks [ 58 ]. The competitive risk model is a classical statistical model based on the hypothesis of data distribution. Its main advantage is its accurate estimation of the cumulative incidence of outcomes for right-censored survival data with multiple endpoints [ 59 ]. In data analysis, the cumulative risk rate is estimated using the cumulative incidence function in single-factor analysis, and Gray’s test is used for between-group comparisons [ 60 ].
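To show what the cumulative incidence function looks like in practice, the sketch below computes the standard nonparametric estimator for an event of interest in the presence of a competing event on made-up data; it is not a Fine-Gray regression, and the tiny dataset is purely an assumption for illustration.

```python
# Nonparametric cumulative incidence: CIF(t) = sum over event times of S(t-) * d1/n,
# where S is the all-cause Kaplan-Meier survival estimate.
import numpy as np
import pandas as pd

# time = follow-up time; event = 0 censored, 1 event of interest, 2 competing event
df = pd.DataFrame({
    "time":  [2, 3, 3, 5, 7, 8, 8, 10, 12, 15],
    "event": [1, 0, 2, 1, 2, 1, 0,  1,  2,  0],
})

times = np.sort(df.loc[df.event > 0, "time"].unique())
surv, cif = 1.0, 0.0
rows = []
for t in times:
    n_at_risk = (df.time >= t).sum()
    d_interest = ((df.time == t) & (df.event == 1)).sum()
    d_any = ((df.time == t) & (df.event > 0)).sum()
    cif += surv * d_interest / n_at_risk   # add incidence mass for the event of interest at t
    surv *= 1 - d_any / n_at_risk          # update all-cause survival (any event removes subjects)
    rows.append({"time": t, "CIF_event1": round(cif, 3)})

print(pd.DataFrame(rows))
```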

Multifactor analysis uses the Fine-Gray and cause-specific (CS) risk models to explore the cumulative risk rate [ 61 ]. The difference between the Fine-Gray and CS models is that the former is applicable to establishing a clinical prediction model and predicting the risk of a single endpoint of interest [ 62 ], whereas the latter is suitable for answering etiological questions, where the regression coefficient reflects the relative effect of covariates on the increased incidence of the main endpoint in the target event-free risk set [ 63 ]. Currently, in databases with CS records, such as Surveillance, Epidemiology, and End Results (SEER), competitive risk models exhibit good performance in exploring disease-risk factors and prognosis [ 64 ]. A study of prognosis in patients with oesophageal cancer from SEER showed that Cox proportional risk models might misestimate the effects of age and disease location on patient prognosis, whereas competitive risk models provide more accurate estimates of factors affecting patient prognosis [ 65 ]. In another study of the prognosis of penile cancer patients, researchers found that using a competitive risk model was more helpful in developing personalized treatment plans [ 66 ].

Unsupervised learning

In many data-analysis processes, the amount of usable labelled data is small, and labelling data is a tedious process [ 67 ]. Unsupervised learning is therefore needed to judge and categorize data according to similarities, characteristics, and correlations; it has three main applications: data clustering, association analysis, and dimensionality reduction. Accordingly, the unsupervised learning methods introduced in this section include clustering analysis, association rules, and PCA.

Clustering analysis

The classification algorithm needs to “know” information concerning each category in advance, with all of the data to be classified having corresponding categories. When the above conditions cannot be met, cluster analysis can be applied to solve the problem [ 68 ]. Clustering places similar objects into different categories or subsets through the process of static classification. Consequently, objects in the same subset have similar properties. Many kinds of clustering techniques exist. Here, we introduced the four most commonly used clustering techniques.

Partition clustering

The core idea of this clustering method regards the centre of the data point as the centre of the cluster. The k-means method [ 69 ] is a representative example of this technique. The k-means method takes n observations and an integer, k , and outputs a partition of the n observations into k sets such that each observation belongs to the cluster with the nearest mean [ 70 ]. The k-means method exhibits low time complexity and high computing efficiency but has a poor processing effect on high dimensional data and cannot identify nonspherical clusters.
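A minimal k-means sketch follows; the number of clusters and the synthetic blobs are illustrative assumptions.

```python
# Partition clustering with k-means: each observation joins the cluster with the nearest centre.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
X = StandardScaler().fit_transform(X)   # k-means is distance-based, so scale the features

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print("cluster centres:\n", km.cluster_centers_)
print("within-cluster sum of squares (inertia):", km.inertia_)
```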

Hierarchical clustering

The hierarchical clustering algorithm decomposes a dataset hierarchically to facilitate the subsequent clustering [ 71 ]. Common algorithms for hierarchical clustering include BIRCH [ 72 ], CURE [ 73 ], and ROCK [ 74 ]. The algorithm starts by treating every point as a cluster and then merges clusters according to their closeness; the grouping process ends when further merging would give undesirable results or when only one cluster remains. This method has wide applicability, and the relationships between clusters are easy to detect; however, its time complexity is high [ 75 ].
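The sketch below runs agglomerative (bottom-up) hierarchical clustering with SciPy; the Ward linkage and the cut into three clusters are illustrative choices, not algorithms named in the text.

```python
# Agglomerative hierarchical clustering: start with every point as its own cluster
# and repeatedly merge the closest pair.
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

Z = linkage(X, method="ward")                      # build the merge tree (dendrogram)
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
print("cluster labels:", labels)
```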

Clustering according to density

The density algorithm takes areas presenting a high degree of data density and defines these as belonging to the same cluster [ 76 ]. This method aims to find arbitrarily-shaped clusters, with the most representative algorithm being DBSCAN [ 77 ]. In practice, DBSCAN does not need to input the number of clusters to be partitioned and can handle clusters of various shapes; however, the time complexity of the algorithm is high. Furthermore, when data density is irregular, the quality of the clusters decreases; thus, DBSCAN cannot process high dimensional data [ 75 ].
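A minimal DBSCAN sketch follows; the `eps` and `min_samples` values are illustrative assumptions and would need tuning on real data.

```python
# Density-based clustering with DBSCAN: dense regions become clusters,
# sparse points are labelled -1 (noise); no cluster count is specified up front.
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print("labels found (-1 means noise):", set(db.labels_))
```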

Clustering according to a grid

Neither partition nor hierarchical clustering can identify clusters with nonconvex shapes. Although a density-based algorithm can accomplish this task, its time complexity is high. To address this problem, data-mining researchers proposed grid-based algorithms, which change the original data space into a grid structure of a certain size. A representative algorithm is STING, which divides the data space into several square cells according to different resolutions and clusters the data at different structural levels [ 78 ]. The main advantage of this method is its high processing speed, which depends only on the number of units in each dimension of the quantized space.

In clinical studies, subjects tend to be actual patients. Although researchers adopt complex inclusion and exclusion criteria before determining the subjects to be included in the analyses, heterogeneity among different patients cannot be avoided [ 79 , 80 ]. The most common application of cluster analysis in clinical big data is in classifying heterogeneous mixed groups into homogeneous groups according to the characteristics of existing data (i.e., “subgroups” of patients or observed objects are identified) [ 81 , 82 ]. This new information can then be used in the future to develop patient-oriented medical-management strategies. Docampo et al. [ 81 ] used hierarchical clustering to reduce heterogeneity and identify subgroups of clinical fibromyalgia, which aided the evaluation and management of fibromyalgia. Additionally, Guo et al. [ 83 ] used k-means clustering to divide patients with essential hypertension into four subgroups, which revealed that the potential risk of coronary heart disease differed between different subgroups. On the other hand, density- and grid-based clustering algorithms have mostly been used to process large numbers of images generated in basic research and clinical practice, with current studies focused on developing new tools to help clinical research and practices based on these technologies [ 84 , 85 ]. Cluster analysis will continue to have extensive application prospects along with the increasing emphasis on personalized treatment.

Association rules

Association rules discover interesting associations and correlations between item sets in large amounts of data. They were first proposed by Agrawal et al. [ 86 ] and applied to analyse customer buying habits to help retailers create sales plans. Data mining based on association rules identifies rules in a two-step process: (1) all high frequency item sets in the collection are listed, and (2) frequent association rules are generated from these high frequency item sets [ 87 ]. Therefore, before association rules can be obtained, sets of frequent items must be calculated using certain algorithms. The Apriori algorithm is based on the a priori principle and finds all frequent item sets in the database transactions that satisfy a minimum support threshold and any other restrictions [ 88 ]. Other algorithms are mostly variants of the Apriori algorithm [ 64 ]. The Apriori algorithm must scan the entire database on every pass; therefore, its performance deteriorates as database size increases [ 89 ], making it potentially unsuitable for analysing large databases. The frequent pattern (FP) growth algorithm was proposed to improve efficiency: after the first scan, it compresses the frequency set in the database into an FP tree while retaining the associated information and then mines the conditional libraries separately [ 90 ]. Association-rule technology is often used in medical research to identify association rules between disease risk factors (i.e., to explore the joint effects of disease risk factors in combination with other risk factors). For example, Li et al. [ 91 ] used the association-rule algorithm to identify atrial fibrillation as the most important stroke risk factor, followed by diabetes and a family history of stroke. Based on the same principle, association rules can also be used to evaluate treatment effects and other aspects. For example, Guo et al. [ 92 ] used the FP-growth algorithm to generate association rules and evaluate individual characteristics and treatment effects of patients with diabetes, thereby helping to reduce the readmission rate of these patients. Association rules reveal a connection between premises and conclusions; however, the reasonable and reliable application of this information can only be achieved through validation by experienced medical professionals and through extensive causal research [ 92 ].
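As a rough illustration of the two-step process above, the sketch below uses the third-party mlxtend package (an assumption; it is not mentioned in the text and must be installed separately) to run FP-growth on a handful of made-up risk-factor records and then derive rules from the frequent item sets.

```python
# Association-rule mining: (1) find frequent item sets, (2) generate rules from them.
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules

# One-hot "risk factor" records, invented for illustration only.
records = pd.DataFrame(
    [
        {"atrial_fibrillation": 1, "diabetes": 1, "family_history": 0, "stroke": 1},
        {"atrial_fibrillation": 1, "diabetes": 0, "family_history": 1, "stroke": 1},
        {"atrial_fibrillation": 0, "diabetes": 1, "family_history": 0, "stroke": 0},
        {"atrial_fibrillation": 1, "diabetes": 1, "family_history": 1, "stroke": 1},
        {"atrial_fibrillation": 0, "diabetes": 0, "family_history": 0, "stroke": 0},
    ]
).astype(bool)

# Step 1: frequent item sets (FP-growth avoids repeated full database scans).
frequent = fpgrowth(records, min_support=0.4, use_colnames=True)
# Step 2: rules that meet a minimum confidence threshold.
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```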

Principal component analysis

PCA is a widely used data-mining method that aims to reduce data dimensionality in an interpretable way while retaining most of the information present in the data [ 93 , 94 ]. The main purpose of PCA is descriptive, as it requires no assumptions about data distribution and is, therefore, an adaptive and exploratory method. During data analysis, the main steps of PCA are standardization of the original data, calculation of the correlation coefficient matrix, calculation of eigenvalues and eigenvectors, selection of principal components, and calculation of the comprehensive evaluation value. PCA does not often appear as a stand-alone method, as it is usually combined with other statistical methods [ 95 ]. In practical clinical studies, the existence of multicollinearity often leads to deviation in multivariate analysis. A feasible solution is to construct a regression model by PCA, which replaces the original independent variables with each principal component as a new independent variable for regression analysis; this is most commonly seen in the analysis of dietary patterns in nutritional epidemiology [ 96 ]. In a study of socioeconomic status and child-developmental delays, PCA was used to derive a new variable (the household wealth index) from a series of household property reports, and this new variable was incorporated as the main analytical variable into the logistic regression model [ 97 ]. Additionally, PCA can be combined with cluster analysis. Burgel et al. [ 98 ] used PCA to transform clinical data in order to address the lack of independence between existing variables when exploring the heterogeneity of different subtypes of chronic obstructive pulmonary disease. Therefore, in the study of subtypes and heterogeneity of clinical diseases, PCA can eliminate noisy variables that could potentially corrupt the cluster structure, thereby increasing the accuracy of clustering results [ 98 , 99 ].
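A minimal sketch of the PCA steps listed above follows (standardize, decompose, inspect explained variance, keep the leading components); the dataset and the number of retained components are illustrative assumptions.

```python
# PCA: reduce dimensionality while retaining most of the variance.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)     # standardize the original data

pca = PCA(n_components=5).fit(X_std)          # eigen-decomposition of the standardized data
print("explained variance ratio:", pca.explained_variance_ratio_)

scores = pca.transform(X_std)                 # component scores usable as new regression inputs
print("reduced data shape:", scores.shape)
```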

The data-mining process and examples of its application using common public databases

Open-access databases have the advantages of large volumes of data, wide data coverage, rich data information, and a cost-efficient method of research, making them beneficial to medical researchers. In this chapter, we introduced the data-mining process and methods and their application in research based on examples of utilizing public databases and data-mining algorithms.

The data-mining process

Figure  1 shows a series of research concepts. The data-mining process is divided into several steps: (1) database selection according to the research purpose; (2) data extraction and integration, including downloading the required data and combining data from multiple sources; (3) data cleaning and transformation, including removal of incorrect data, filling in missing data, generating new variables, converting data format, and ensuring data consistency; (4) data mining, involving extraction of implicit relational patterns through traditional statistics or ML; (5) pattern evaluation, which focuses on the validity parameters and values of the relationship patterns of the extracted data; and (6) assessment of the results, involving translation of the extracted data-relationship model into comprehensible knowledge made available to the public.

Figure 1. The steps of data mining in medical public databases.
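A compressed, end-to-end sketch of steps (2) through (5) is shown below on a made-up extract; the column names, imputation choices, and model are assumptions for illustration only, not the workflow of any cited study.

```python
# Minimal pipeline mirroring steps (2)-(5) of the data-mining process.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# (2) data extraction/integration: pretend this came from a downloaded export
raw = pd.DataFrame({
    "age":     [67, 54, np.nan, 81, 45, 73, 60, 58],
    "lactate": [3.1, 1.2, 2.4, 5.6, 0.9, np.nan, 2.2, 1.8],
    "died":    [1, 0, 0, 1, 0, 1, 0, 0],
})

# (3) cleaning/transformation: impute missing values and derive a new variable
clean = raw.fillna(raw.median(numeric_only=True))
clean["elderly"] = (clean["age"] >= 65).astype(int)

# (4) data mining: fit a simple predictive model
X, y = clean[["age", "lactate", "elderly"]], clean["died"]
model = LogisticRegression(max_iter=1000).fit(X, y)

# (5) pattern evaluation: in-sample discrimination (a real study would use held-out data)
print("in-sample AUC:", roc_auc_score(y, model.predict_proba(X)[:, 1]))
```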

Examples of data-mining applied using public databases

Establishment of warning models for the early prediction of disease

A previous study identified sepsis as a major cause of death in ICU patients [ 100 ]. The authors noted that previously developed predictive models used a limited number of variables and that model performance required improvement. The data-mining process applied to address these issues was as follows: (1) data selection using the MIMIC III database; (2) extraction and integration of three types of data, including multivariate features (demographic information and clinical biochemical indicators), time series data (temperature, blood pressure, and heart rate), and clinical latent features (various scores related to disease); (3) data cleaning and transformation, including fixing irregular time series measurements, estimating missing values, deleting outliers, and addressing data imbalance; (4) data mining through the use of logistic regression, a decision tree, the RF algorithm, an SVM, and an ensemble algorithm (a combination of multiple classifiers) to establish the prediction model; (5) pattern evaluation using sensitivity, precision, and the area under the receiver operating characteristic curve to evaluate model performance; and (6) evaluation of the results, in this case whether the model could predict the prognosis of patients with sepsis and whether it outperformed current scoring systems.
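The sketch below mirrors steps (4) and (5) in miniature: several classifiers plus a soft-voting ensemble are compared by AUC. The synthetic, imbalanced data stand in for the MIMIC-III features and, like the model settings, are assumptions rather than the published pipeline.

```python
# Comparing single classifiers with a soft-voting ensemble on imbalanced synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=30, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "rf": RandomForestClassifier(n_estimators=300, random_state=0),
}
# The ensemble averages the predicted probabilities of the three base classifiers.
models["ensemble"] = VotingClassifier(estimators=list(models.items()), voting="soft")

for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```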

Exploring prognostic risk factors in cancer patients

Wu et al. [ 101 ] noted that traditional survival-analysis methods often ignore the influence of competitive risk events, such as suicide and car accidents, on outcomes, leading to deviations and misjudgements when estimating the effects of risk factors. They used the SEER database, which offers cause-of-death data for cancer patients, and a competitive risk model to address this problem according to the following process: (1) data were obtained from the SEER database; (2) the demography, clinical characteristics, treatment modality, and cause of death of cecum cancer patients were extracted from the database; (3) patient records were deleted when demographic, clinical, therapeutic, or cause-of-death variables were missing; (4) Cox regression and two kinds of competitive risk models were applied for survival analysis; (5) the results were compared between the three different models; and (6) the results revealed that, for survival data with multiple endpoints, the competitive risk model was more favourable.

Derivation of dietary patterns

A study by Martínez Steele et al. [ 102 ] applied PCA for nutritional epidemiological analysis to determine dietary patterns and evaluate the overall nutritional quality of the population based on those patterns. Their process involved the following: (1) data were extracted from the NHANES database covering the years 2009–2010; (2) demographic characteristics and two 24 h dietary recall interviews were obtained; (3) data were weighted and excluded based on subjects not meeting specific criteria; (4) PCA was used to determine dietary patterns in the United States population, and Gaussian regression and restricted cubic splines were used to assess associations between ultra-processed foods and nutritional balance; (5) eigenvalues, scree plots, and the interpretability of the principal components were reviewed to screen and evaluate the results; and (6) the results revealed a negative association between ultra-processed food intake and overall dietary quality. Their findings indicated that a nutritionally balanced eating pattern was characterized by a diet high in fibre, potassium, magnesium, and vitamin C intake along with low sugar and saturated fat consumption.

The use of “big data” has changed multiple aspects of modern life, and combining it with data-mining methods can further improve the status quo [ 86 ]. The aim of this study was to aid clinical researchers in understanding the application of data-mining technology to clinical big data and public medical databases in order to further their research goals and benefit clinicians and patients. The examples provided offer insight into the data-mining process as applied for the purposes of clinical research. Notably, researchers have raised concerns that big data and data-mining methods are not a perfect fit for replicating actual clinical conditions and that the results could mislead doctors and patients [ 86 ]. Therefore, given the rate at which new technologies and trends progress, it is necessary to maintain a positive attitude concerning their potential impact while remaining cautious in examining the results provided by their application.

In the future, the healthcare system will need to utilize increasingly larger volumes of big data with higher dimensionality. The tasks and objectives of data analysis will also have higher demands, including higher degrees of visualization, results with increased accuracy, and stronger real-time performance. As a result, the methods used to mine and process big data will continue to improve. Furthermore, to increase the formality and standardization of data-mining methods, it is possible that a new programming language specifically for this purpose will need to be developed, as well as novel methods capable of addressing unstructured data, such as graphics, audio, and text represented by handwriting. In terms of application, the development of data-management and disease-screening systems for large-scale populations, such as the military, will help determine the best interventions and formulation of auxiliary standards capable of benefitting both cost-efficiency and personnel. Data-mining technology can also be applied to hospital management in order to improve patient satisfaction, detect medical-insurance fraud and abuse, and reduce costs and losses while improving management efficiency. Currently, this technology is being applied for predicting patient disease, with further improvements resulting in the increased accuracy and speed of these predictions. Moreover, it is worth noting that technological development will concomitantly require higher quality data, which will be a prerequisite for accurate application of the technology.

Finally, the ultimate goal of this study was to explain the methods associated with data mining and commonly used to process clinical big data. This review will potentially promote further study and aid doctors and patients.

Abbreviations

BioLINCC: Biologic Specimen and Data Repositories Information Coordinating Center
CHARLS: China Health and Retirement Longitudinal Study
CHNS: China Health and Nutrition Survey
CKB: China Kadoorie Biobank
CS: Cause-specific risk
CTD: Comparative Toxicogenomics Database
eICU-CRD: eICU Collaborative Research Database
FP: Frequent pattern
GBD: Global burden of disease
GEO: Gene Expression Omnibus
HRS: Health and Retirement Study
ICGC: International Cancer Genome Consortium
MIMIC: Medical Information Mart for Intensive Care
ML: Machine learning
NHANES: National Health and Nutrition Examination Survey
PCA: Principal component analysis
PIC: Paediatric intensive care
RF: Random forest
SEER: Surveillance, Epidemiology, and End Results
SVM: Support vector machine
TCGA: The Cancer Genome Atlas

References

Herland M, Khoshgoftaar TM, Wald R. A review of data mining using big data in health informatics. J Big Data. 2014;1(1):1–35.

Wang F, Zhang P, Wang X, Hu J. Clinical risk prediction by exploring high-order feature correlations. AMIA Annu Symp Proc. 2014;2014:1170–9.

Xu R, Li L, Wang Q. dRiskKB: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text. BMC Bioinform. 2014;15:105. https://doi.org/10.1186/1471-2105-15-105 .

Ramachandran S, Erraguntla M, Mayer R, Benjamin P, Editors. Data mining in military health systems-clinical and administrative applications. In: 2007 IEEE international conference on automation science and engineering; 2007. https://doi.org/10.1109/COASE.2007.4341764 .

Vie LL, Scheier LM, Lester PB, Ho TE, Labarthe DR, Seligman MEP. The US army person-event data environment: a military-civilian big data enterprise. Big Data. 2015;3(2):67–79. https://doi.org/10.1089/big.2014.0055 .

Mohan A, Blough DM, Kurc T, Post A, Saltz J. Detection of conflicts and inconsistencies in taxonomy-based authorization policies. IEEE Int Conf Bioinform Biomed. 2012;2011:590–4. https://doi.org/10.1109/BIBM.2011.79 .

Luo J, Wu M, Gopukumar D, Zhao Y. Big data application in biomedical research and health care: a literature review. Biomed Inform Insights. 2016;8:1–10. https://doi.org/10.4137/BII.S31559 .

Bellazzi R, Zupan B. Predictive data mining in clinical medicine: current issues and guidelines. Int J Med Inform. 2008;77(2):81–97.

Sahu H, Shrma S, Gondhalakar S. A brief overview on data mining survey. Int J Comput Technol Electron Eng. 2011;1(3):114–21.

Obermeyer Z, Emanuel EJ. Predicting the future - big data, machine learning, and clinical medicine. N Engl J Med. 2016;375(13):1216–9.

Doll KM, Rademaker A, Sosa JA. Practical guide to surgical data sets: surveillance, epidemiology, and end results (SEER) database. JAMA Surg. 2018;153(6):588–9.

Johnson AE, Pollard TJ, Shen L, Lehman LW, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3: 160035. https://doi.org/10.1038/sdata.2016.35 .

Ahluwalia N, Dwyer J, Terry A, Moshfegh A, Johnson C. Update on NHANES dietary data: focus on collection, release, analytical considerations, and uses to inform public policy. Adv Nutr. 2016;7(1):121–34.

Vos T, Lim SS, Abbafati C, Abbas KM, Abbasi M, Abbasifard M, et al. Global burden of 369 diseases and injuries in 204 countries and territories, 1990–2019: a systematic analysis for the Global Burden of Disease Study 2019. Lancet. 2020;396(10258):1204–22. https://doi.org/10.1016/S0140-6736(20)30925-9 .

Palmer LJ. UK Biobank: Bank on it. Lancet. 2007;369(9578):1980–2. https://doi.org/10.1016/S0140-6736(07)60924-6 .

Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, Mills GB, Shaw KR, Ozenberger BA, et al. The cancer genome atlas pan-cancer analysis project. Nat Genet. 2013;45(10):1113–20. https://doi.org/10.1038/ng.2764 .

Davis S, Meltzer PS. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics. 2007;23(14):1846–7.

Zhang J, Bajari R, Andric D, Gerthoffert F, Lepsa A, Nahal-Bose H, et al. The international cancer genome consortium data portal. Nat Biotechnol. 2019;37(4):367–9.

Chen Z, Chen J, Collins R, Guo Y, Peto R, Wu F, et al. China Kadoorie Biobank of 0.5 million people: survey methods, baseline characteristics and long-term follow-up. Int J Epidemiol. 2011;40(6):1652–66.

Davis AP, Grondin CJ, Johnson RJ, Sciaky D, McMorran R, Wiegers J, et al. The comparative toxicogenomics database: update 2019. Nucleic Acids Res. 2019;47(D1):D948–54. https://doi.org/10.1093/nar/gky868 .

Zeng X, Yu G, Lu Y, Tan L, Wu X, Shi S, et al. PIC, a paediatric-specific intensive care database. Sci Data. 2020;7(1):14.

Giffen CA, Carroll LE, Adams JT, Brennan SP, Coady SA, Wagner EL. Providing contemporary access to historical biospecimen collections: development of the NHLBI Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC). Biopreserv Biobank. 2015;13(4):271–9.

Zhang B, Zhai FY, Du SF, Popkin BM. The China Health and Nutrition Survey, 1989–2011. Obes Rev. 2014;15(Suppl 1):2–7. https://doi.org/10.1111/obr.12119 .

Zhao Y, Hu Y, Smith JP, Strauss J, Yang G. Cohort profile: the China Health and Retirement Longitudinal Study (CHARLS). Int J Epidemiol. 2014;43(1):61–8.

Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG, Badawi O. The eICU collaborative research database, a freely available multi-centre database for critical care research. Sci Data. 2018;5:180178. https://doi.org/10.1038/sdata.2018.178 .

Fisher GG, Ryan LH. Overview of the health and retirement study and introduction to the special issue. Work Aging Retire. 2018;4(1):1–9.

Iavindrasana J, Cohen G, Depeursinge A, Müller H, Meyer R, Geissbuhler A. Clinical data mining: a review. Yearb Med Inform. 2009:121–33.

Zhang Y, Guo SL, Han LN, Li TL. Application and exploration of big data mining in clinical medicine. Chin Med J. 2016;129(6):731–8. https://doi.org/10.4103/0366-6999.178019 .

Ngiam KY, Khor IW. Big data and machine learning algorithms for health-care delivery. Lancet Oncol. 2019;20(5):e262–73.

Huang C, Murugiah K, Mahajan S, Li S-X, Dhruva SS, Haimovich JS, et al. Enhancing the prediction of acute kidney injury risk after percutaneous coronary intervention using machine learning techniques: a retrospective cohort study. PLoS Med. 2018;15(11):e1002703.

Rahimian F, Salimi-Khorshidi G, Payberah AH, Tran J, Ayala Solares R, Raimondi F, et al. Predicting the risk of emergency admission with machine learning: development and validation using linked electronic health records. PLoS Med. 2018;15(11):e1002695.

Kantardzic M. Data Mining: concepts, models, methods, and algorithms. Technometrics. 2003;45(3):277.

Jothi N, Husain W. Data mining in healthcare—a review. Procedia Comput Sci. 2015;72:306–13.

Piatetsky-Shapiro G, Tamayo P. Microarray data mining: facing the challenges. SIGKDD. 2003;5(2):1–5. https://doi.org/10.1145/980972.980974 .

Ripley BD. Pattern recognition and neural networks. Cambridge: Cambridge University Press; 1996.

Arlot S, Celisse A. A survey of cross-validation procedures for model selection. Stat Surv. 2010;4:40–79. https://doi.org/10.1214/09-SS054 .

Shouval R, Bondi O, Mishan H, Shimoni A, Unger R, Nagler A. Application of machine learning algorithms for clinical predictive modelling: a data-mining approach in SCT. Bone Marrow Transp. 2014;49(3):332–7.

Momenyan S, Baghestani AR, Momenyan N, Naseri P, Akbari ME. Survival prediction of patients with breast cancer: comparisons of decision tree and logistic regression analysis. Int J Cancer Manag. 2018;11(7):e9176.

Topaloğlu M, Malkoç G. Decision tree application for renal calculi diagnosis. Int J Appl Math Electron Comput. 2016. https://doi.org/10.18100/ijamec.281134.

Li H, Wu TT, Yang DL, Guo YS, Liu PC, Chen Y, et al. Decision tree model for predicting in-hospital cardiac arrest among patients admitted with acute coronary syndrome. Clin Cardiol. 2019;42(11):1087–93.

Ramezankhani A, Hadavandi E, Pournik O, Shahrabi J, Azizi F, Hadaegh F. Decision tree-based modelling for identification of potential interactions between type 2 diabetes risk factors: a decade follow-up in a Middle East prospective cohort study. BMJ Open. 2016;6(12):e013336.

Carmona-Bayonas A, Jiménez-Fonseca P, Font C, Fenoy F, Otero R, Beato C, et al. Predicting serious complications in patients with cancer and pulmonary embolism using decision tree modelling: the EPIPHANY Index. Br J Cancer. 2017;116(8):994–1001.

Efron B. Bootstrap methods: another look at the jackknife. In: Kotz S, Johnson NL, editors. Breakthroughs in statistics. New York: Springer; 1992. p. 569–93.

Breiman L. Random forests. Mach Learn. 2001;45(1):5–32. https://doi.org/10.1023/A:1010933404324 .

Franklin J. The elements of statistical learning: data mining, inference and prediction. Math Intell. 2005;27(2):83–5.

Taylor RA, Pare JR, Venkatesh AK, Mowafi H, Melnick ER, Fleischman W, et al. Prediction of in-hospital mortality in emergency department patients with sepsis: a local big data-driven, machine learning approach. Acad Emerg Med. 2016;23(3):269–78.

Lee J, Scott DJ, Villarroel M, Clifford GD, Saeed M, Mark RG. Open-access MIMIC-II database for intensive care research. Annu Int Conf IEEE Eng Med Biol Soc. 2011:8315–8. https://doi.org/10.1109/IEMBS.2011.6092050 .

Lee J. Patient-specific predictive modelling using random forests: an observational study for the critically Ill. JMIR Med Inform. 2017;5(1):e3.

Wongvibulsin S, Wu KC, Zeger SL. Clinical risk prediction with random forests for survival, longitudinal, and multivariate (RF-SLAM) data analysis. BMC Med Res Methodol. 2019;20(1):1.

Taylor JMG. Random survival forests. J Thorac Oncol. 2011;6(12):1974–5.

Hu C, Steingrimsson JA. Personalized risk prediction in clinical oncology research: applications and practical issues using survival trees and random forests. J Biopharm Stat. 2018;28(2):333–49.


Funding

This study was supported by the National Social Science Foundation of China (No. 16BGL183).

Author information

Wen-Tao Wu and Yuan-Jie Li have contributed equally to this work

Authors and Affiliations

Department of Clinical Research, The First Affiliated Hospital of Jinan University, Tianhe District, 613 W. Huangpu Avenue, Guangzhou, 510632, Guangdong, China

Wen-Tao Wu, Ao-Zi Feng, Li Li, Tao Huang & Jun Lyu

School of Public Health, Xi’an Jiaotong University Health Science Center, Xi’an, 710061, Shaanxi, China

Department of Human Anatomy, Histology and Embryology, School of Basic Medical Sciences, Xi’an Jiaotong University Health Science Center, Xi’an, 710061, Shaanxi, China

Yuan-Jie Li

Department of Neurology, The First Affiliated Hospital of Jinan University, Tianhe District, 613 W. Huangpu Avenue, Guangzhou, 510632, Guangdong, China


Contributions

WTW, YJL and JL designed the review. JL, AZF, TH, LL and ADX reviewed and criticized the original paper. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to An-Ding Xu or Jun Lyu.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article.

Wu, WT., Li, YJ., Feng, AZ. et al. Data mining in clinical big data: the frequently used databases, steps, and methodological models. Military Med Res 8, 44 (2021). https://doi.org/10.1186/s40779-021-00338-z


Received: 24 January 2020

Accepted: 03 August 2021

Published: 11 August 2021

DOI: https://doi.org/10.1186/s40779-021-00338-z


Keywords

  • Clinical big data
  • Data mining
  • Medical public database



Text and data mining: Case studies


This page outlines different case studies and use cases. The librarian-researcher case studies highlight the interaction between library professionals, researchers, scholarly resources and tools, while the external case studies focus on the research impact from text and data mining activities.

Librarian-Researcher case studies

  • Content Extraction from Web of Science. This case study shows the interplay between researchers, library professionals and licensed databases to get the appropriate content and licensing sorted before the next part of the research project can begin.
  • Digital Humanities Assessment Case Study. This case study shows the interplay between lecturers/researchers, licensed databases, open data, students and librarians to get the appropriate content and licensing sorted before the student assessment can be tested, clarified and set.

External case studies

  • Deconstructing Climate Change: facilitating network analysis of scientific influence. This research project brought together three leading academic institutions (The University of Melbourne, La Trobe University and INSEAD) and was conducted in partnership with Clarivate Analytics (a provider of intellectual property and scientific information, decision support tools and services) to access Clarivate's Custom Data Sets based on their Web of Science Core Collection.
  • An Epidemiology of Information: Data Mining the 1918 Influenza Pandemic. Virginia Tech's project on using newspapers to study the 1918 influenza pandemic, funded through the "Digging into Data Challenge" of the National Endowment for the Humanities.

Int J Environ Res Public Health

Data Mining in Healthcare: Applying Strategic Intelligence Techniques to Depict 25 Years of Research Development

Maikel Luis Kolling

1 Graduate Program of Industrial Systems and Processes, University of Santa Cruz do Sul, Santa Cruz do Sul 96816-501, Brazil

Leonardo B. Furstenau

2 Department of Industrial Engineering, Federal University of Rio Grande do Sul, Porto Alegre 90035-190, Brazil

Michele Kremer Sott

Bruna Rabaioli

3 Department of Medicine, University of Santa Cruz do Sul, Santa Cruz do Sul 96816-501, Brazil

Pedro Henrique Ulmi

4 Department of Computer Science, University of Santa Cruz do Sul, Santa Cruz do Sul 96816-501, Brazil

Nicola Luigi Bragazzi

5 Laboratory for Industrial and Applied Mathematics (LIAM), Department of Mathematics and Statistics, York University, Toronto, ON M3J 1P3, Canada

Leonel Pablo Carvalho Tedesco

Associated Data

Not applicable.

In order to identify the strategic topics and the thematic evolution structure of data mining applied to healthcare, in this paper, a bibliometric performance and network analysis (BPNA) was conducted. For this purpose, 6138 articles were sourced from the Web of Science covering the period from 1995 to July 2020, and the SciMAT software was used. Our results present a strategic diagram composed of 19 themes, of which the 8 motor themes (‘NEURAL-NETWORKS’, ‘CANCER’, ‘ELECTRONIC-HEALTH-RECORDS’, ‘DIABETES-MELLITUS’, ‘ALZHEIMER’S-DISEASE’, ‘BREAST-CANCER’, ‘DEPRESSION’, and ‘RANDOM-FOREST’) are depicted in a thematic network. An in-depth analysis was carried out in order to find hidden patterns and to provide a general perspective of the field. The thematic network structure is arranged such that its themes are organized into two different areas, (i) practices and techniques related to data mining in healthcare, and (ii) health concepts and diseases supported by data mining, embodying, respectively, the hotspots related to the data mining and medical scopes and hence demonstrating the field’s evolution over time. Such results form a basis for future research and facilitate decision-making by researchers and practitioners, institutions, and governments interested in data mining in healthcare.

1. Introduction

Deriving from Industry 4.0, which pursues greater autonomy and efficiency through data-driven automation and artificial intelligence employing cyber-physical spaces, Healthcare 4.0 portrays the overhaul of medical business models towards data-driven management [ 1 ]. In such environments, substantial amounts of information associated with organizational processes and patient care are generated. Furthermore, the maturation of state-of-the-art technologies, namely wearable devices, which are likely to transform the whole industry through more personalized and proactive treatments, will lead to a noteworthy increase in patient data. Moreover, annual global growth in healthcare data is forecast to soon exceed 1.2 exabytes a year [ 1 ]. Despite the massive and growing volume of health and patient care information [ 2 ], it is still, to a great extent, underused [ 3 ].

Data mining, a subfield of artificial intelligence that makes use of vast amounts of data in order to allow significant information to be extracted through previously unknown patterns, has been progressively applied in healthcare to assist clinical diagnoses and disease predictions [ 2 ]. Such information is known to be rather complex and difficult to analyze. Furthermore, data mining concepts can also support the analysis and classification of colossal bulks of information, grouping variables with similar behaviors and foreseeing future events, among other advantages for monitoring and managing health systems, while continually safeguarding patients’ privacy [ 4 ]. The knowledge resulting from the application of the aforesaid methods may potentially improve resource management and patient care systems and assist in infection control and risk stratification [ 5 ]. Several studies in healthcare have explored data mining techniques to predict disease incidence [ 6 ], characterize patients in pandemic scenarios [ 7 ], identify depressive symptoms [ 8 ], predict diabetes [ 9 ] and cancer [ 10 ], and model scenarios in emergency departments [ 11 ], among others. Thus, the utilization of data mining in health organizations improves the efficiency of service provision [ 12 ] and the quality of decision-making, and reduces human subjectivity and errors [ 13 ].

The understanding of data mining in the healthcare sector is, in this context, vital, and some researchers have executed bibliometric analyses in the field with the intention of investigating the challenges, limitations, novel opportunities, and trends [ 14 , 15 , 16 , 17 ]. However, at the time of this study, there were no published works that provided a complete analysis of the field using a bibliometric performance and network analysis (BPNA) (see Table 1 ). In light of this, we have defined three research questions:

  • RQ1: What are the strategic themes of data mining in healthcare?
  • RQ2: How is the thematic evolution structure of data mining in healthcare?
  • RQ3: What are the trends and opportunities of data mining in healthcare for academics and practitioners?

Existing bibliometric analysis of data mining in healthcare in Web of Science (WoS).

Thus, with the objective of laying out a superior understanding of data mining usage in the healthcare sector and answering the defined research questions, we performed a bibliometric performance and network analysis (BPNA) to set forth an overview of the area. We used the Science Mapping Analysis Software Tool (SciMAT), a software package developed by Cobo et al. [ 18 ] with the purpose of identifying strategic themes and the thematic evolution structure of a given field, which can be used as a strategic intelligence tool. Strategic intelligence, an approach that can enhance decision-making in terms of science and technology trends [ 19 , 20 , 21 , 22 , 23 , 24 , 25 , 26 , 27 ], can help researchers and practitioners to understand the area, devise new ideas for future works, and identify the trends and opportunities of data mining in healthcare.

This research is structured as follows: Section 2 highlights the methodology and the dataset. Section 3 presents the bibliometric performance of data mining in healthcare. In Section 4 , the strategic diagram presents the most relevant themes according to our bibliometric indicators as well as the thematic network structure of the motor themes and the thematic evolution structure, which provide a complete overview of data mining over time. Section 5 presents the conclusions, limitations, and suggestions for future works.

2. Methodology and Dataset

Attracting attention from companies, universities, and scientific journals, bibliometric analysis enhances decision-making by providing a reliable method to collect information from databases, transform those data into knowledge, and stimulate the development of wisdom. Furthermore, bibliometric techniques can provide broader and different perspectives on scientific production by using advanced measurement tools and methods to depict how authors, works, journals, and institutions are advancing in a specific field of research through the hidden patterns embedded in large datasets.

The existing works on bibliometric analysis of data mining in healthcare in the Web of Science are shown in Table 1 , which indicates that only three such studies have been performed and explains how their approaches differ from this work.

2.1. Methodology

For this study, we applied BPNA, a method that combines science mapping with performance analysis, to the field of data mining in healthcare with the support of the SciMAT software. This methodology was chosen because such a combination, in addition to assisting decision-making for academics and practitioners, allows a deep investigation of the research field and offers a new perspective on its intricacies. The BPNA conducted in this paper was composed of the four steps outlined below.

2.1.1. Discovery of Research Themes

The themes were identified using a frequency and network reduction of keywords. In this process, the keywords were first normalized using Salton’s cosine, a similarity coefficient, and then clustered through the simple center algorithm. Finally, the co-word network of the thematic evolution structure was normalized using the equivalence index.
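To make these normalizations concrete, the following minimal Python sketch computes Salton's cosine and the equivalence index from raw keyword counts; the counts used here are hypothetical, and SciMAT's internal implementation may differ in detail.

```python
import math

def salton_cosine(c_ij: int, c_i: int, c_j: int) -> float:
    """Salton's cosine: co-occurrence count normalized by the geometric
    mean of the two keywords' individual document frequencies."""
    return c_ij / math.sqrt(c_i * c_j) if c_i and c_j else 0.0

def equivalence_index(c_ij: int, c_i: int, c_j: int) -> float:
    """Equivalence index: squared co-occurrence divided by the product
    of the two keywords' individual document frequencies."""
    return (c_ij ** 2) / (c_i * c_j) if c_i and c_j else 0.0

# Hypothetical counts: keyword A appears in 336 documents, keyword B in 250,
# and they co-occur in 120 documents.
print(salton_cosine(120, 336, 250))      # ~0.414
print(equivalence_index(120, 336, 250))  # ~0.171
```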

2.1.2. Depicting Research Themes

The previously identified themes were then plotted on a bi-dimensional diagram composed of four quadrants, in which the “vertical axis” characterizes the density (D) and the “horizontal axis” characterizes the centrality (C) of the theme [ 28 , 29 ] ( Figure 1 a) [ 18 , 20 , 25 , 30 , 31 , 32 , 33 ]. A minimal sketch of this quadrant classification is given after the list below.

Figure 1. Strategic diagram ( a ). Thematic network structure ( b ). Thematic evolution structure ( c ).

  • (a) First quadrant—motor themes: trending themes for the field of research with high development.
  • (b) Second quadrant—basic and transversal themes: themes that are inclined to become motor themes in the future due to their high centrality.
  • (c) Third quadrant—emerging or declining themes: themes that require a qualitative analysis to define whether they are emerging or declining.
  • (d) Fourth quadrant—highly developed and isolated themes: themes that are no longer trending due to a new concept or technology.
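As referenced above, the following is a minimal sketch of how themes could be assigned to the four quadrants from (centrality, density) scores. The scores are hypothetical and the quadrant boundaries are simply set at the median of each axis; SciMAT's actual thresholds and values may differ.

```python
from statistics import median

# Hypothetical (centrality, density) scores for a few themes.
themes = {
    "NEURAL-NETWORKS": (0.95, 0.90),
    "CANCER": (0.85, 0.80),
    "CORONARY-ARTERY-DISEASE": (0.70, 0.30),
    "CLUSTER-ANALYSIS": (0.20, 0.25),
    "METABOLOMICS": (0.15, 0.75),
}

c_med = median(c for c, _ in themes.values())
d_med = median(d for _, d in themes.values())

def quadrant(centrality: float, density: float) -> str:
    """Assign a theme to one of the four strategic-diagram quadrants."""
    if centrality >= c_med and density >= d_med:
        return "motor theme"
    if centrality >= c_med:
        return "basic and transversal theme"
    if density >= d_med:
        return "highly developed and isolated theme"
    return "emerging or declining theme"

for name, (c, d) in themes.items():
    print(f"{name}: {quadrant(c, d)}")
```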

2.1.3. Thematic Network Structure and Detection of Thematic Areas

The results were organized and structured in (a) a strategic diagram, (b) a thematic network structure of motor themes, and (c) a thematic evolution structure. The thematic network structure ( Figure 1 b) represents the co-occurrence between the research themes and underlines the number of relationships (C) and the internal strength among them (D). The thematic evolution structure ( Figure 1 c) provides a picture of how the themes preserve a conceptual nexus throughout consecutive sub-periods [ 23 , 34 ]. The size of the clusters is proportional to the number of core documents, and the links indicate co-occurrence among the clusters. Solid lines indicate that clusters share the main theme, and dashed lines represent shared cluster elements that are not the names of the themes [ 35 ]. The thickness of the lines is proportional to the inclusion index, which indicates that the themes have elements in common [ 35 ]. Furthermore, in the thematic network structure, the themes were manually classified into data mining techniques and medical research concepts.
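For illustration, a minimal sketch of the commonly used form of the inclusion index (shared elements divided by the size of the smaller cluster) is given below; the keyword sets are hypothetical, and SciMAT's exact computation may differ.

```python
def inclusion_index(a: set, b: set) -> float:
    """Share of elements of the smaller cluster that also appear in the other one."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

# Hypothetical keyword sets of two clusters from consecutive sub-periods.
knowledge_discovery = {"knowledge-discovery", "classification", "prediction", "rules"}
machine_learning = {"machine-learning", "classification", "prediction", "deep-learning"}

print(inclusion_index(knowledge_discovery, machine_learning))  # 0.5
```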

2.1.4. Performance Analysis

The scientific contribution was measured by analyzing the most important research themes and thematic areas using the h-index, sum of citations, core documents, centrality, density, and nexus among themes. The results can be used as a strategic intelligence approach to identify the most relevant topics in the research field.
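As a small illustration of these indicators, the sketch below computes the h-index and the sum of citations for a hypothetical set of core documents belonging to one theme.

```python
def h_index(citations: list[int]) -> int:
    """Largest h such that at least h publications have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical citation counts for the core documents of one theme.
core_docs = [25, 8, 5, 3, 3, 1]
print("h-index:", h_index(core_docs))        # 3
print("sum of citations:", sum(core_docs))   # 45
```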

2.2. Dataset

Composed of 6138 non-duplicated articles and reviews in the English language, the dataset used in this work was sourced from the Web of Science (WoS) database utilizing the following query string: (“data mining” and (“health*” OR “clinic*” OR “medic*” OR “disease”)). The documents were then processed and had their keywords, both the authors’ keywords and the index (controlled and uncontrolled) terms, extracted and grouped in accordance with their meaning. In order to remove duplicates and terms with fewer than two occurrences in the documents, a preprocessing step was applied to the authors, years, publication dates, and keywords. For instance, the preprocessing reduced the total number of keywords from 21,838 to 5310, thus improving the clarity of the bibliometric analysis. With the exception of the strategic diagram, which was plotted using a single period (1995–July 2020), the timeline in this study was divided into three sub-periods: 1995–2003, 2004–2012, and 2013–July 2020.
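The preprocessing described above (grouping keyword variants, removing rare terms, and splitting the timeline into sub-periods) could be sketched as follows; the records, the variant mapping, and the frequency threshold are purely illustrative and are not taken from the study's actual data.

```python
from collections import Counter

# Hypothetical raw records: (publication year, keywords of one document).
records = [
    (1998, ["neural networks", "neural-network", "diagnosis"]),
    (2007, ["data mining", "roc analysis", "diagnosis"]),
    (2016, ["machine learning", "electronic health records", "neural networks"]),
]

# Group spelling variants under one canonical keyword (hypothetical mapping).
canonical = {"neural-network": "neural networks", "machine learning": "machine-learning"}

def normalize(keyword: str) -> str:
    return canonical.get(keyword, keyword)

counts = Counter(normalize(kw) for _, kws in records for kw in kws)
kept = {kw for kw, n in counts.items() if n >= 2}   # drop terms occurring only once

def sub_period(year: int) -> str:
    if year <= 2003:
        return "1995-2003"
    return "2004-2012" if year <= 2012 else "2013-2020"

for year, kws in records:
    filtered = sorted({normalize(kw) for kw in kws if normalize(kw) in kept})
    print(sub_period(year), filtered)
```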

Subsequently, a network reduction was applied in order to exclude irrelevant words and co-occurrences. For the network extraction, co-occurrence among words was identified. For the mapping process, we used the simple center algorithm. Finally, a core mapper was used, and the h-index and sum of citations were selected. Figure 2 summarizes the steps of the BPNA.

Figure 2. Workflow of the bibliometric performance and network analysis (BPNA).

3. Bibliometric Performance of Data Mining in Healthcare

In this section, we measured the performance of the field of data mining in healthcare in terms of publications and citations over time, the most productive and cited researchers, as well as productivity of scientific journals, institutions, countries, and most important research areas in the WoS. To do this, we used indicators such as: number of publications, sum of citations by year, journal impact factor (JIF), geographic distribution of publications, and research field. For this, we examined the complete period (1995 to July 2020).

3.1. Publications and Citations Over Time

Figure 3 shows the performance analysis of publications and citations of data mining in healthcare over time, from 1995 to July 2020, in the WoS. The first sub-period (1995–2003) shows the beginning of the research field, with 316 documents and a total of 13,483 citations. The first article in the WoS was published by Szolovits (1995) [ 36 ], who presented a tutorial on handling uncertainty in healthcare and highlighted the importance of developing data mining techniques to assist the healthcare sector. This sub-period shows a slightly increasing number of citations until 2003, and the year with the highest number of citations was 2002.

Figure 3. Number of publications over time (1995–July 2020).

This slight increase continues from the first sub-period to the second sub-period (2004–2012), with a total of 1572 publications and 55,734 citations. The year 2006 presents the highest number of citations, mainly due to the study by Fawcett [ 37 ], which attracted 7762 citations. The author introduced the concept of Receiver Operating Characteristic (ROC) analysis, a technique widely used in data mining to assist medical decision-making.
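For readers unfamiliar with ROC analysis, the following hedged sketch, using scikit-learn and synthetic data rather than the data of any study cited here, shows how an ROC curve and its area under the curve (AUC) are typically computed for a binary clinical classifier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary "diagnosis" problem, for illustration only.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]          # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_te, scores)  # operating points of the ROC curve
print("AUC:", roc_auc_score(y_te, scores))
```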

From the second to the third sub-period, it is possible to observe a huge increase in the number of publications (4250 publications) and 41,821 citations. This sharp increase may have occurred due to the creation of strategies to implement emerging technologies in the healthcare sector in order to move forward with the third digital revolution in healthcare, the so-called Healthcare 4.0 [ 1 , 38 ]. Furthermore, although the citations show an overall positive trend, a downward trend can still be observed from 2014 to 2020. As Wang [ 39 ] highlights, this may happen because a scientific document needs three to seven years to reach its peak point of citation [ 34 ]; therefore, this apparent decline is unlikely to be a real trend.

3.2. Most Productive and Cited Authors

Table 2 displays the most productive and most cited authors in data mining in healthcare in the WoS from 1995 to July 2020. Leading as the most productive researcher in the field is Li, Chien-Feng, a pathologist at Chi Mei Hospital, which is sixth-ranked in publication numbers; he dedicates his studies to the molecular diagnosis of cancer with innovative technologies. Next, Acharya, U. Rajendra, ranked in the top 1% of highly cited researchers in computer science for five consecutive years (2016, 2017, 2018, 2019, and 2020) according to Thomson's Essential Science Indicators, shares second place with Chung, Kyungyong from the Division of Engineering and Computer Science at Kyonggi University in Suwon-si, South Korea. On the other hand, Bate, Andrew C., a member of the Food and Drug Administration (FDA) Science Council Pharmacovigilance Subcommittee (the FDA being the fourth-ranked institution in publication count), is the most cited researcher with 945 citations. Subsequently, Lindquist, Marie, who monitors global pharmacovigilance and data management development at the World Health Organization (WHO), is ranked second with 943 citations. Finally, Edwards, E.R., an orthopedic surgeon at the Royal Australasian College of Surgeons, is ranked third with 888 citations. Notably, this analysis does not demonstrate a direct correlation between the number of publications and the number of citations.

Most Cited/Productive authors from 1995 to July 2020.

3.3. Productivity of Scientific Journals, Universities, Countries and Most Important Research Fields

Table 3 shows the journals that publish studies related to data mining in healthcare. PLOS One is ranked first with 124 publications, followed by Expert Systems with Applications with 105 and Artificial Intelligence in Medicine with 75. Among these, Expert Systems with Applications had the highest Journal Impact Factor (JIF) for 2019–2020.

Journals that publish studies related to data mining in healthcare.

Table 4 shows the most productive institutions and the most productive countries. The first ranked is Columbia University, followed by the U.S. Food and Drug Administration (FDA) and Harvard University. In terms of country productivity, the United States is first in the ranking, followed by China and England. In comparison with Table 2 , it is possible to notice that the most productive author is not affiliated with the most productive institutions (Columbia University and the FDA). Besides, the institution with the highest number of publications is located in the United States, which is also the most productive country.

Institutions and countries that publish studies related to data mining in healthcare.

Regarding Columbia University, its prominence in data mining in healthcare can be verified through its data science programs, which are among the most highly rated and advanced in the world. We highlight the Columbia Data Science Society, an interdisciplinary society that promotes data science within Columbia University and the New York City community.

The FDA has a data mining council to promote the prioritization and governance of data mining initiatives within the Center for Biologics Evaluation and Research, which assesses spontaneous reports of adverse events after the administration of regulated medical products. In addition, the agency created an Advanced and Standards-Based Network Analyzer for Clinical Assessment and Evaluation (PANACEA), which supports the application of standards recognition and network analysis for reporting these adverse events. It is noteworthy that the FDA Adverse Event Reporting System (FAERS) database is the main resource for identifying adverse reactions to medications marketed in the United States. A text mining system based on EHRs that retrieves important clinical and temporal information is also highlighted, along with support for the Cancer Prevention and Control Division at the Centers for Disease Control and Prevention in a big data project.

Harvard University offers online data mining courses and has a Center for Healthcare Data Analytics, created out of the need to analyze large public and private datasets. Harvard's research covers healthcare funding and provision, quality of care, studies on special and disadvantaged populations, and access to care.

Table 5 presents the most important WoS subject research fields of data mining in healthcare from 1995 to July 2020. Computer Science Artificial Intelligence is the first ranked with 768 documents, followed by Medical Informatics with 744 documents, and Computer Science Information Systems with 722 documents.

Most relevant WoS subject categories and research fields.

4. Science Mapping Analysis of Data Mining in Healthcare

In this section the science mapping analysis of data mining in healthcare is depicted. The strategic diagram shows the most relevant themes in terms of centrality and density. The thematic network structure uncovers the relationship (co-occurrence) between themes and hidden patterns. Lastly, the thematic evolution structure underlines the most important themes of each sub-period and shows how the field of study is evolving over time.

4.1. Strategic Diagram Analysis

Figure 4 presents 19 clusters, 8 of which are categorized as motor themes (‘NEURAL-NETWORKS’, ‘CANCER’, ‘ELECTRONIC-HEALTH-RECORDS’, ‘DIABETES-MELLITUS’, ‘ADVERSE-DRUG-EVENTS’, ‘BREAST-CANCER’, ‘DEPRESSION’ and ‘RANDOM-FOREST’), 2 as basic and transversal themes (‘CORONARY-ARTERY-DISEASE’ and ‘PHOSPHORYLATION’), 7 as emerging or declining themes (‘PERSONALIZED-MEDICINE’, ‘DATA-INTEGRATION’, ‘INTENSIVE-CARE-UNIT’, ‘CLUSTER-ANALYSIS’, ‘INFORMATION-EXTRACTION’, ‘CLOUD-COMPUTING’ and ‘SENSORS’), and 2 as highly developed and isolated themes (‘ALZHEIMERS-DISEASE’ and ‘METABOLOMICS’).

Figure 4. Strategic diagram of data mining in healthcare (1995–July 2020).

Each cluster of themes was measured in terms of core documents, h-index, citations, centrality, and density. The cluster ‘NEURAL-NETWORKS’ has the highest number of core documents (336) and is ranked first in terms of centrality and density. On the other hand, the cluster ‘CANCER’ is the most widely cited with 5810 citations.

4.2. Thematic Network Structure Analysis of Motor Themes

The motor themes have an important role regarding the shape and future of the research field because they correspond to the key topics for everyone interested in the subject. Therefore, they can be considered strategic themes for developing the field of data mining in healthcare. The eight motor themes are discussed below and displayed in Figure 5 together with the network structure of each theme.

Figure 5. Thematic network structure of data mining in healthcare (1995–July 2020). ( a ) The cluster ‘NEURAL-NETWORKS’. ( b ) The cluster ‘CANCER’. ( c ) The cluster ‘ELECTRONIC-HEALTH-RECORDS’. ( d ) The cluster ‘DIABETES-MELLITUS’. ( e ) The cluster ‘BREAST-CANCER’. ( f ) The cluster ‘ALZHEIMER’S DISEASE’. ( g ) The cluster ‘DEPRESSION’. ( h ) The cluster ‘RANDOM-FOREST’.

4.2.1. Neural Network (a)

The cluster ‘NEURAL-NETWORKS’ ( Figure 5 a) is the first ranked in terms of core documents, h-index, centrality, and density. It is strongly influenced by subthemes related to data science algorithms, such as ‘SUPPORT-VECTOR-MACHINE’ and ‘DECISION-TREE’, among others. This network represents the use of data mining techniques to detect patterns and find important information correlated with patient health and medical diagnosis. A reasonable explanation for this network might be the high number of studies that benchmarked neural networks against other techniques to evaluate performance (e.g., resource usage, efficiency, accuracy, scalability, etc.) [ 40 , 41 , 42 ]. Besides, the significant size of the cluster ‘MACHINE-LEARNING’ is expected, since neural networks are a type of machine learning. On the other hand, the subtheme ‘HEART-DISEASE’ stands out as the single disease in this network, which can be justified by the large number of studies aiming to apply data mining to support decision-making in heart disease treatment and diagnosis.
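A typical benchmarking set-up of the kind described above might look like the following sketch, which compares a small neural network against a support vector machine and a decision tree with scikit-learn. The public breast cancer dataset is used only as a stand-in for clinical data, and the models and settings are illustrative rather than those of any cited study.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Public diagnostic dataset, used purely as a stand-in for clinical data.
X, y = load_breast_cancer(return_X_y=True)

models = {
    "neural network": MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    "support vector machine": SVC(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)   # scale features before fitting
    acc = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: mean accuracy = {acc:.3f}")
```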

4.2.2. Cancer (b)

The cluster ‘CANCER’ ( Figure 5 b) is the second ranked in terms of core documents, h-index, and density, and the first in terms of citations (5810). This cluster is highly influenced by subthemes related to studies of cancer gene mutations, such as ‘BIOMARKERS’ and ‘GENE-EXPRESSION’, among others. The use of data mining techniques has been attracting attention and effort from academics in order to help solve problems in the field of oncology. Cancer is known as the disease that kills the most people in the 21st century, driven by factors such as environmental pollution, food pesticides and additives [ 14 ], eating habits, and mental health, among others. Thus, controlling any form of cancer is a global strategy and can be enhanced by applying data mining techniques. Furthermore, the subtheme ‘PROSTATE-CANCER’ highlights that most data mining efforts have focused on prostate cancer studies. Prostate cancer is the most common cancer in men. Despite the benefits of traditional screening exams (digital rectal examination, the prostate-specific antigen blood test, and transrectal ultrasound), such tests still show limited efficacy in reducing mortality [ 43 ]. In this sense, data mining may be a suitable solution, since it has been used in bioinformatics analyses to understand prostate cancer mutations [ 44 , 45 ] and to uncover useful information for diagnoses and future prognostic tests, enhancing both patient and clinical decision-making [ 46 ].

4.2.3. Electronic Health Records (EHR—c)

The cluster ‘ELECTRONIC-HEALTH-RECORDS’ ( Figure 5 c) represents the concept under which patients’ health data are stored. Such data are continuously increasing over time, thereby creating a large amount of data (big data) that has been used as input for healthcare decision support systems to enhance clinical decision-making. The clusters ‘NATURAL-LANGUAGE-PROCESSING’ and ‘TEXT MINING’ highlight that these are the mining techniques most frequently combined with data mining in healthcare. Another pattern that must be highlighted is the considerable density between the clusters ‘SIGNAL-DETECTION’ and ‘PHARMACOVIGILANCE’, which represents the use of data mining to depict a broad range of adverse drug effects and to identify signals almost in real time by using EHRs [ 47 , 48 ]. Besides, the cluster ‘MISSING-DATA’ is related to studies focused on the challenge of incomplete EHRs and missing data in healthcare centers, which compromise the performance of several prediction models [ 49 ]. In this sense, techniques to handle missing data have been under improvement in order to enable accurate prediction based on medical data mining applications [ 50 ].
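As a minimal illustration of handling missing EHR values, the sketch below applies median imputation to a hypothetical extract; real studies typically compare several imputation strategies, and the column names and values here are invented for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical EHR extract with missing laboratory values.
ehr = pd.DataFrame({
    "age":        [54, 61, 47, 70],
    "glucose":    [5.8, np.nan, 6.4, 7.1],
    "creatinine": [np.nan, 88.0, 75.0, np.nan],
})

# Median imputation is a simple baseline; model-based imputation
# (e.g., scikit-learn's IterativeImputer) is often preferred in practice.
imputer = SimpleImputer(strategy="median")
completed = pd.DataFrame(imputer.fit_transform(ehr), columns=ehr.columns)
print(completed)
```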

4.2.4. Diabetes Mellitus (DM—d)

Nowadays, DM is one of the most frequent endocrine disorders [ 51 ]: it affected more than 450 million people worldwide in 2017, is expected to reach 693 million by the year 2045, and cost the health sector 850 billion dollars in 2017 alone [ 52 ]. The cluster ‘DIABETES-MELLITUS’ ( Figure 5 d) has a strong association with the risk-factor group of subthemes (e.g., ‘INSULIN-RESISTANCE’, ‘OBESITY’, ‘BODY-MASS-INDEX’, ‘CARDIOVASCULAR-DISEASE’, and ‘HYPERTENSION’). However, obesity (cluster ‘OBESITY’) is the major risk factor related to DM, particularly in Type 2 Diabetes (T2D) [ 51 ]. T2D, mainly characterized by insulin resistance, accounts for about 90% of diabetic patients worldwide when compared with T1D and T3D [ 51 ]. This might justify the presence of the clusters ‘TYPE-2-DIABETES’ and ‘INSULIN-RESISTANCE’, which seem to be highly developed by data mining academics and practitioners. The massive volume of research into all facets of DM has led to the formation of huge volumes of EHRs, for which the most frequently applied data mining technique is association rules. It is used to identify associations among risk factors [ 51 ], thus justifying the appearance of the cluster ‘ASSOCIATION-RULES’.
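A minimal association-rule sketch of the kind described above, assuming the open-source mlxtend library and a hypothetical one-hot encoded set of patient records, might look as follows; the data and thresholds are illustrative only.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical one-hot encoded patient records (risk factors / diagnoses).
records = pd.DataFrame(
    [
        {"obesity": 1, "hypertension": 1, "type_2_diabetes": 1},
        {"obesity": 1, "hypertension": 0, "type_2_diabetes": 1},
        {"obesity": 0, "hypertension": 1, "type_2_diabetes": 0},
        {"obesity": 1, "hypertension": 1, "type_2_diabetes": 1},
    ]
).astype(bool)

# Frequent itemsets via Apriori, then rules filtered by confidence.
frequent = apriori(records, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```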

4.2.5. Breast Cancer (e)

The cluster ‘BREAST-CANCER’ ( Figure 5 e) represents the most prevalent type of cancer among women, affecting approximately 12.5% of women worldwide [ 53 , 54 ]. The clusters ‘OVEREXPRESSION’ and ‘METASTASIS’ highlight the high number of studies using data mining to understand the association of the overexpression of molecules (e.g., MUC1 [ 54 ], TRIM29 [ 55 ], FKBP4 [ 56 ], etc.) with breast cancer metastasis. Such overexpression of molecules also appears in other forms of cancer, justifying the group of subthemes ‘LUNG CANCER’, ‘GASTRIC-CANCER’, ‘OVARIAN-CANCER’, and ‘COLORECTAL-CANCER’. Moreover, the cluster ‘IMPUTATION’ highlights efforts to develop imputation techniques (for missing data) in breast cancer record analysis [ 57 , 58 ]. Besides, the application of data mining to depict breast cancer characteristics and their causes and effects has been strongly supported by ‘MICROARRAY-DATA’ [ 59 , 60 ], ‘PATHWAY’ [ 61 ], and ‘COMPUTER-AIDED-DIAGNOSIS’ [ 62 ].

4.2.6. Alzheimer’s Disease (AD—f)

The cluster ‘ALZHEIMER’S DISEASE’ ( Figure 5 f) is highly influenced by subthemes related to diseases, such as ‘DEMENTIA’ and ‘PARKINSON’S-DISEASE’. This co-occurrence happens because AD is a neurodegenerative illness that leads to dementia and is closely related to Parkinson’s disease. Studies show that global spending on AD in 2015 was about $828 billion [ 63 ]. In this sense, data mining has been widely used with ‘GENOME-WIDE-ASSOCIATION’ techniques in order to identify genes related to AD [ 64 , 65 ] and to predict AD by mining ‘MRI’ brain images [ 66 , 67 ]. The cluster ‘NF-KAPPA-B’ highlights efforts to identify associations of NF-κB (nuclear factor kappa B) with AD by using data mining techniques, which can be used to advance drug development [ 68 ].

4.2.7. Depression (g)

The cluster ‘DEPRESSION’ ( Figure 5 g) represents a common disease which affects over 260 million people. In the worst cases, it can lead to suicide, which is the second leading cause of death in young adults. The cluster ‘DEPRESSION’ is highly connected, and its connections mostly represent the subthemes that have been the research focus of data mining applications [ 69 ]. The connections with the subthemes ‘SOCIAL-MEDIA’ and ‘ADOLESCENTS’, especially in times of social isolation, are extremely relevant to help identify early symptoms and tendencies among the population [ 70 ]. Furthermore, the presence of ‘COMORBIDITY’ and ‘SYMPTOMS’ is not surprising, given that the knowledge-discovery properties of the data mining field could provide significant insights into the etiology of depression [ 71 ].

4.2.8. Random Forest (h)

The cluster ‘RANDOM-FOREST’ ( Figure 5 h) represents an ensemble learning method that is used, among other things, for classification. The presence of the ‘BAYESIAN-NETWORK’ subtheme, supported by its connection with ‘INFERENCE’, might represent an alternative against which data mining applications using random forests are benchmarked [ 72 ]. Since the ‘RANDOM-FOREST’ cluster has barely passed the threshold from a basic and transversal theme to a motor theme, the works developed under this cluster are not yet as interconnected as those of the previous ones. The most representative subtheme is ‘AIR-POLLUTION’, in conjunction with ‘POLLUTION’, where studies have been performed in order to obtain ‘RISK-ASSESSMENT’ through the exploration of knowledge hidden in large databases [ 73 ].
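To illustrate the ensemble method behind this cluster, the sketch below trains a random forest on synthetic risk-assessment data and reports its AUC and feature importances; it is illustrative only and not the set-up of any cited study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for pollution/exposure features and an outcome label.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=3, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

forest = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)

print("AUC:", roc_auc_score(y_te, forest.predict_proba(X_te)[:, 1]))
print("feature importances:", forest.feature_importances_.round(3))
```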

4.3. Thematic Evolution Structure Analysis

The computer science themes related to data mining and the medical research concepts, depicted, respectively, in the grey and blue areas of the thematic evolution diagram ( Figure 6 ), demonstrate the evolution of the research field over the different sub-periods addressed in this study. Each individual theme’s relevance is illustrated through its cluster size as well as through its relationships throughout the different sub-periods. Thus, in this section, an analysis of the different trends in themes is presented to give a brief insight into the factors that might have influenced their evolution. The analysis is split into two thematic areas: first, the grey area (practices and techniques related to data mining in healthcare) is discussed, followed by the blue one (health concepts and diseases supported by data mining).

Figure 6. Thematic evolution structure of data mining in healthcare (1995–July 2020).

4.3.1. Practices and Techniques Related to Data Mining in Healthcare

The cluster ‘KNOWLEDGE-DISCOVERY’ ( Figure 6 , 1995–2012), often used as a synonym for data mining, provides a broader view of the field than the algorithm-focused data mining theme itself; its appearance and, later in the third sub-period, its fading provide a first insight into the overall evolution of data mining papers applied to healthcare. The occurrence of the knowledge-discovery cluster in the first two sub-periods may reflect a focus on applying data mining techniques to classify and predict conditions in the medical field. This created competition with early machine learning techniques, as evidenced by the presence of the cluster ‘NEURAL-NETWORK’, against which data mining techniques were probably benchmarked. The introduction of the ‘FEATURE-SELECTION’, ‘ARTIFICIAL-INTELLIGENCE’, and ‘MACHINE-LEARNING’ clusters, together with the fading of ‘KNOWLEDGE-DISCOVERY’, could imply a disruption of the field in the third sub-period that changed the perspective of these studies.

One instance of such a disruption could be the well-known paper by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton [ 74 ], in which a novel neural network technique was first applied to a major image recognition competition and obtained a vast advantage over the other algorithms in use. The connection between that work and its impact on data mining in healthcare research is supported by the disappearance of the cluster ‘IMAGE-MINING’ after the second sub-period, which has no connections further on. Furthermore, the presence of the clusters ‘MACHINE-LEARNING’, ‘ARTIFICIAL-INTELLIGENCE’, ‘SUPPORT-VECTOR-MACHINES’, and ‘LOGISTIC-REGRESSION’ may be evidence of a shift of focus in the data mining community for healthcare: rather than attempting to compete with machine learning algorithms, researchers are now striving to further improve the results previously obtained with machine learning through data mining. Moreover, the presence of the large feature-selection cluster, which circumscribes algorithms that enhance classification accuracy through a better selection of parameters, lends credence to this trend, since it may encompass publications from the previously stated clusters.

Although still small, the presence of the cluster ‘SECURITY’ in the last sub-period ( Figure 6 , 2013–2020) is, at the very least, relevant given the sensitive data handled in the medical space, such as patients’ histories and diseases. Above all, recent leaks of personal information have drawn ever-increasing attention to this topic, focusing on, among other things, the de-identification of personal information [ 75 , 76 , 77 ]. These kinds of security processes allow, among other things, data mining researchers to make use of the vast amount of sensitive information stored in hospitals without any linkage that could associate a person with the data. For instance, the MIMIC Critical Care Database [ 78 ], an example of a de-identified database, has enabled further research into many diseases and conditions in a secure way that would otherwise have been extremely impaired by data limitations.

4.3.2. Health Concepts and Disease Supported by Data Mining

The cluster ‘GENE-EXPRESSION’ stands out in the first and second periods ( Figure 6 , 1995–2012) of the medical research concepts and establishes a strong co-occurrence with the cluster ‘CANCER’ in the third sub-period. This link can be explained by research involving microarray technology, which makes it possible to detect deletions and duplications in the human genome by analyzing the expression of thousands of genes in different tissues. It also confirms the importance of genetic screening not only for cancer but for several diseases, such as ‘ALZHEIMER’ and other brain disorders, thereby assisting preventive medicine and enabling more efficient treatment plans [ 79 ]. For example, research was carried out to analyze complex brain disorders such as schizophrenia from gene expression microarrays [ 80 ].

Sequencing technologies have undergone major improvements in recent decades to determine evolutionary changes in genetic and epigenetic mechanisms and in ‘MOLECULAR-CLASSIFICATION’, a topic that gained prominence as a cluster in the first period. An example can be found in a study published in 2010, which combined a global optimization algorithm called Dongguang Li (DGL) with cancer diagnostic methods based on gene selection and microarray analysis. It performed the molecular classification of colon cancers and leukemia and demonstrated the importance of machine learning, data mining, and good optimization algorithms for analyzing microarray data in the presence of subsets of thousands of genes [ 81 ].

The cluster ‘PROSTATE-CANCER’ in the second sub-period ( Figure 6 , 2004–2012) presents a strong conceptual nexus to ‘MOLECULAR-CLASSIFICATION’ in the first sub-period, and the same happens with clusters such as ‘METASTASIS’, ‘BREAST-CANCER’, and ‘ALZHEIMER’, which appear more recently in the third sub-period. The significant increase in the incidence of prostate cancer in recent years creates the need for a greater understanding of the disease in order to increase patient survival, since metastatic prostate cancer has been comparatively under-explored despite having a much lower survival rate than early-stage disease. In this sense, understanding the age-specific survival of hospital patients with prostate cancer using machine learning started to gain attention from academics and highlighted the importance of knowing survival after diagnosis for decision-making and better genetic counseling [ 82 ]. In addition, the relationship between prostate cancer and Alzheimer’s disease is explained by the fact that androgen deprivation therapy, used to treat prostate cancer, is associated with an increased risk of Alzheimer’s disease and dementia [ 81 ]; therefore, the risks and benefits of long-term exposure to this therapy must be weighed. Finally, the relationship between prostate cancer and breast cancer in the thematic evolution can be explained by studies showing that men with a family history of breast cancer have a 21% higher risk of developing prostate cancer, including lethal disease [ 83 ].

The cluster ‘PHARMACOVIGILANCE’ appears in the second sub-period ( Figure 6 , 2004–2012), showing a strong co-occurrence with clusters of the third sub-period: ‘ADVERSE-DRUGS-REACTIONS’ and ‘ELECTRONIC-HEALTH-RECORDS’. In recent years, data mining algorithms have stood out for their usefulness in detecting and screening patients with potential adverse drug reactions and, consequently, they have become a central component of pharmacovigilance, important for reducing the morbidity and mortality associated with the use of medications [ 48 ]. The importance of electronic medical records for pharmacovigilance is evident: they act as a health database and enable drug safety assessors to collect information. In addition, such medical records are also essential to optimize processes within health institutions, ensure greater safety of patient data, integrate information, and facilitate the promotion of science and research in the health field [ 84 ]. These characteristics explain the large number of studies on ‘ELECTRONIC-HEALTH-RECORDS’ in the third sub-period and the growth of this theme in recent years, since the world has been introducing electronic medical records, although a few institutions still use physical medical records.

The cluster ‘DEPRESSION’ appears in the second sub-period ( Figure 6 , 2004–2012) and remains a trend in the third sub-period, with a significant increase in publications on the topic. The disease is widespread and increasing worldwide, yet its treatment and diagnosis still carry many stigmas. Globalization and the contemporary work environment [ 85 ] can be explanatory factors for the increase in the theme from the 2000s onwards, and the COVID-19 pandemic certainly contributed to the large number of articles on mental health published in 2020. In this context, improving the detection of mental disorders is essential for global health; detection can be enhanced by applying data mining to quantitative electroencephalogram signals to classify depressed and healthy people, acting as an adjuvant clinical decision support to identify depression [ 69 ].

5. Conclusions

In this research, we have performed a BPNA to depict the strategic themes, the thematic network structure, and the thematic evolution structure of data mining applied in healthcare. Our results highlighted several significant pieces of information that can be used by decision-makers to advance the field of data mining in healthcare systems. For instance, our results could be used by editors of scientific journals to enhance decision-making regarding special issues and manuscript review. From the same perspective, healthcare institutions could use this research in the recruiting process to better align position needs with candidates’ qualifications based on the expanded clusters. Furthermore, Table 2 presents a series of authors whose collaboration networks may be used as a reference to identify emerging talents in a specific research field and who might become persons of interest for expanding a healthcare institution’s research division. Additionally, Table 3 and Table 4 could also be used by researchers to better align their research intentions and partner institutions to, for instance, encourage the development of data mining applications in healthcare and advance the field’s knowledge.

The strategic diagram ( Figure 4 ) depicted the most important themes in terms of centrality and density. Such results could be used by researchers to provide insights for a better comprehension of how diseases like ‘CANCER’, ‘DIABETES-MELLITUS’, ‘ALZHEIMER’S-DISEASE’, ‘BREAST-CANCER’, ‘DEPRESSION’, and ‘CORONARY-ARTERY-DISEASE’ have made use of the innovations in the data mining field. Interestingly, none of the clusters have highlighted studies related to infectious diseases, and, therefore, it is reasonable to suggest the exploration of data mining techniques in this domain, especially given the global impact that the coronavirus pandemic has had on the world.

The thematic network structure ( Figure 5 ) demonstrates the co-occurrences among clusters and may be used to identify hidden patterns in the field of research to expand the knowledge and promote the development of scientific insights. Even though exhaustive research of the motor themes and their subthemes has been performed in this article, future research must be conducted in order to depict themes from the other quadrants (Q2, Q3, and Q4), especially emerging and declining themes, to bring to light relations between the rise and decay of themes that might be hidden inside the clusters.

The thematic evolution structure showed how the field is evolving over time and presented future trends of data mining in healthcare. It is reasonable to predict that clusters such as ‘NEURAL-NETWORKS’, ‘FEATURE-SELECTION’, and ‘EHR’ will not decay in the near future due to their prevalence in the field and, most likely, due to the exponential increase in the amount of patient health data being generated and stored daily in large data lakes. This unprecedented increase in data volume, which is often of dubious quality, leads to great challenges in the search for hidden information through data mining. Moreover, as a consequence of the ever-increasing data sensitivity, the cluster ‘SECURITY’, which is related to the confidentiality of patient information, is likely to keep growing during the next years as governments and institutions further develop structures, algorithms, and laws that aim to assure data security. In this context, blockchain technologies specifically designed to ensure the integrity and availability of de-identified data, similar to what is done by MIMIC-III (Medical Information Mart for Intensive Care III) [ 78 ], may be crucial to accelerate the advancement of the field by providing reliable information for health researchers across the world. Furthermore, future research should be conducted in order to understand how these themes will behave and evolve during the next years, and to interpret the cluster changes to properly assess the trends presented here. These results could also be used as teaching material for classes, as they provide strategic intelligence applications and the field’s historical data.

In terms of limitations, we used the WoS database since it indexes journals with high JIF. Therefore, we suggest analyzing other databases, such as Scopus and PubMed, in future works. Besides, we used SciMAT to perform the analysis, and other bibliometric software, such as VOSviewer, CiteSpace, and the Sci2 Tool, could be used to explore different points of view. Such information will support this study and future works to advance the field of data mining in healthcare.



Case study: how to apply data mining techniques in a healthcare data warehouse

Affiliation: Rush Medical College, USA

PMID: 11452577

Healthcare provider organizations are faced with a rising number of financial pressures. Both administrators and physicians need help analyzing large amounts of clinical and financial data when making decisions. To assist them, Rush-Presbyterian-St. Luke's Medical Center and Hitachi America, Ltd. (HAL) have partnered to build an enterprise data warehouse and perform a series of case study analyses. This article focuses on one analysis, performed by a team of physicians and computer science researchers using a commercially available on-line analytical processing (OLAP) tool in conjunction with proprietary data mining techniques developed by HAL researchers. The initial objective of the analysis was to discover how to use data mining techniques to make business decisions that can influence cost, revenue, and operational efficiency while maintaining a high level of care. Another objective was to understand how to apply these techniques appropriately and to find a repeatable method for analyzing data and finding business insights. The process used to identify opportunities and effect changes is described.
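The Rush/HAL tooling and data are proprietary, so the sketch below is only a rough, hypothetical illustration of the OLAP-style slicing described above: it pivots invented encounter-level cost and length-of-stay figures by diagnosis-related group (DRG) and physician, producing the kind of roll-up an administrator might drill into before applying further mining techniques.

```python
import pandas as pd

# Invented encounter-level data standing in for a clinical/financial warehouse extract.
encounters = pd.DataFrame({
    "drg":        ["127", "127", "089", "089", "089", "127"],
    "physician":  ["A",   "B",   "A",   "C",   "B",   "A"],
    "total_cost": [8200,  9100,  5400,  7600,  5100,  7900],
    "los_days":   [4,     5,     3,     6,     3,     4],
})

# OLAP-style cube: mean cost and length of stay by DRG and physician,
# with margins giving the roll-up totals used for drill-down.
cube = pd.pivot_table(
    encounters,
    values=["total_cost", "los_days"],
    index="drg",
    columns="physician",
    aggfunc="mean",
    margins=True,
)
print(cube.round(1))
```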

MeSH terms:

  • Database Management Systems / organization & administration*
  • Decision Support Systems, Management*
  • Diagnosis-Related Groups / economics*
  • Efficiency, Organizational / economics
  • Hospital Costs
  • Hospitals, Teaching / economics
  • Hospitals, Teaching / statistics & numerical data*
  • Information Centers / organization & administration*
  • Information Storage and Retrieval / methods*
  • Middle Aged
  • Organizational Case Studies
  • Systems Integration
  • User-Computer Interface

ORIGINAL RESEARCH article

This article is part of the research topic: Environmental Impacts & Risks of Deep-Sea Mining: Recommendations for Exploitation Regulations

Deep learning-assisted biodiversity assessment in deep-sea benthic megafauna communities: a case study in the context of polymetallic nodule mining (Provisionally Accepted)

  • 1 OKEANOS Center, University of the Azores, Portugal
  • 2 Biodata Mining Group, Faculty of Technology, Bielefeld University, Germany


Technological developments have facilitated the collection of large amounts of imagery from isolated deep-sea ecosystems such as abyssal nodule fields. Applying imagery as a monitoring tool in these areas of interest for deep-sea exploitation is extremely valuable. However, in order to collect a comprehensive number of species observations, thousands of images need to be analysed, especially when high diversity is combined with low abundances, as is the case in the abyssal nodule fields. As the visual interpretation of large volumes of imagery and the manual extraction of quantitative information are time-consuming and error-prone, computational detection tools may play a key role in lessening this burden. Yet, there is still no established workflow for efficient marine image analysis using deep learning-based computer vision systems for the task of fauna detection and classification. In this case study, a dataset of 2100 images from the deep-sea polymetallic nodule fields of the eastern Clarion-Clipperton Fracture Zone from the SO268 expedition (2019) was selected to investigate the potential of machine learning-assisted marine image annotation workflows. The Machine Learning Assisted Image Annotation method (MAIA), provided by the BIIGLE system, was applied to different set-ups trained with manually annotated fauna data. The results computed with the different set-ups were compared to those obtained by trained marine biologists in terms of accuracy (i.e., recall and precision) and time. Our results show that MAIA can be applied for general object (i.e., species) detection with satisfactory accuracy (90.1% recall and 13.4% precision) when considered as one intermediate step in a comprehensive annotation workflow. We also investigated the performance for different volumes of training data, MAIA performance tuned for individual morphological groups, and the impact of sediment coverage in the training data. We conclude that: (a) steps must be taken to enable computer vision scientists to access more image data from the CCZ to improve the system's performance, and (b) computational species detection in combination with a posteriori filtering by marine biologists has a higher efficiency than fully manual analyses.
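For orientation, the recall and precision figures reported above amount to straightforward bookkeeping once machine proposals are compared against expert annotations. The sketch below uses invented (image, label) pairs and ignores spatial matching (real evaluations match detections by location, e.g. bounding-box overlap), so it only illustrates the metric definitions.

```python
# Expert (ground-truth) annotations and machine-proposed detections,
# each a set of (image_id, label) pairs; invented example data.
expert = {(1, "holothurian"), (1, "coral"), (2, "sponge"), (3, "ophiuroid")}
proposed = {(1, "holothurian"), (2, "sponge"), (2, "coral"), (3, "anemone")}

true_positives = len(expert & proposed)
recall = true_positives / len(expert)        # share of expert annotations the machine found
precision = true_positives / len(proposed)   # share of machine proposals that were correct

print(f"recall={recall:.1%}, precision={precision:.1%}")
# A high-recall / low-precision profile (as reported above) can be acceptable when
# marine biologists filter the proposals a posteriori, discarding false positives.
```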

Keywords: Marine Imaging, Biodiversity, benthic communities, Computer Vision, deep learning

Received: 05 Jan 2024; Accepted: 26 Mar 2024.


Data mining model for scientific research classification: the case of digital workplace accessibility

  • Research Article
  • Published: 26 March 2024


  • Radka Nacheva (ORCID: 0000-0003-3946-2416)
  • Maciej Czaplewski (ORCID: 0000-0003-1888-8776)
  • Pavel Petrov (ORCID: 0000-0002-1284-2606)

Research classification is an important aspect of conducting research projects because it allows researchers to efficiently identify papers that are in line with the latest research in each field and relevant to their projects. There are different approaches to the classification of research papers, such as subject-based, methodology-based, text-based, and machine learning-based. Each approach has its advantages and disadvantages, and the choice of classification method depends on the specific research question and the available data. The classification of scientific literature helps to better organize and structure the vast amount of information and knowledge generated by scientific research. It enables researchers and other interested parties to access relevant information quickly and efficiently. Classification methods allow easier and more accurate extraction of scientific knowledge to be used as a basis for scientific research in each subject area. In this regard, this paper proposes a research classification model using data mining methods and techniques. To test the model, we selected scientific articles on digital workplace accessibility for the disabled, retrieved from the Scopus and Web of Science repositories. We believe that the classification model is universal and can be applied in other scientific fields.
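As a minimal sketch of the text-based flavour of such a classification model (assuming scikit-learn and invented example abstracts; the authors' actual pipeline is more elaborate), TF-IDF features plus a simple classifier already give the basic workflow:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented abstracts labelled by research area, standing in for Scopus/WoS records.
abstracts = [
    "screen reader support in enterprise web portals for employees with low vision",
    "assistive technology adoption and remote work accessibility policies",
    "gradient boosting for churn prediction in retail transaction data",
    "clustering electronic health records to stratify diabetes patients",
]
labels = ["accessibility", "accessibility", "machine-learning", "machine-learning"]

# TF-IDF features plus a simple probabilistic classifier.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(abstracts, labels)

print(model.predict(["web accessibility audit of a digital workplace portal"]))
# ['accessibility']
```

In practice a labelled corpus of hundreds of abstracts per class and a held-out evaluation would be needed before trusting such a classifier.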



The project "Impact of digitalization on innovative approaches in human resources management" is implemented by the University of Economics—Varna, in the period 2022–2025. The authors express their gratitude to the Bulgarian Scientific Research Fund, Ministry of Education and Science of Bulgaria for the support provided in the implementation of the project "Impact of digitalization on innovative approaches in human resources management," Grant No. BG-175467353-2022-04/12-12-2022, contract No. KP-06-H-65/4 – 2022.

Author information

Authors and Affiliations

Department of Informatics, University of Economics – Varna, 9002, Varna, Bulgaria

Radka Nacheva & Pavel Petrov

Institute of Spatial Management and Socio-Economic Geography, University of Szczecin, 70-453, Szczecin, Poland

Maciej Czaplewski





About this article

Nacheva, R., Czaplewski, M. & Petrov, P. Data mining model for scientific research classification: the case of digital workplace accessibility. Decision (2024). https://doi.org/10.1007/s40622-024-00378-z


Accepted: 20 February 2024

Published: 26 March 2024

DOI: https://doi.org/10.1007/s40622-024-00378-z


Keywords: Data mining, Research classification, Text mining, Workplace accessibility, Digital accessibility

