


Data science: a game changer for science and innovation

  • Regular Paper
  • Open access
  • Published: 19 April 2021
  • Volume 11, pages 263–278 (2021)


  • Valerio Grossi,
  • Fosca Giannotti,
  • Dino Pedreschi,
  • Paolo Manghi,
  • Pasquale Pagano &
  • Massimiliano Assante


This paper shows data science's potential for disruptive innovation in science, industry, policy, and people's lives. We discuss how data science will impact science and society at large in the coming years, including the ethical problems raised by managing data on human behavior, and we consider the quantitative expectations about data science's economic impact. We introduce concepts such as open science and e-infrastructures as useful tools for supporting ethical data science and for training new generations of data scientists. Finally, this work outlines the SoBigData Research Infrastructure, an easy-to-access platform for executing complex data science processes. The services proposed by SoBigData are aimed at using data science to understand the complexity of our contemporary, globally interconnected society.


1 Introduction: from data to knowledge

Data science is an interdisciplinary and pervasive paradigm where different theories and models are combined to transform data into knowledge (and value). Experiments and analyses over massive datasets are functional not only to the validation of existing theories and models but also to the data-driven discovery of patterns emerging from data, which can help scientists design better theories and models, yielding a deeper understanding of the complexity of social, economic, biological, technological, cultural, and natural phenomena. The products of data science are the result of re-interpreting available data for analysis goals that differ from the original reasons motivating data collection. All these aspects are changing the scientific method, research, and the way our society makes decisions [2].

Data science emerges from three concurring factors: (i) the advent of big data, which provides the critical mass of actual examples to learn from; (ii) the advances in data analysis and learning techniques that can produce predictive models and behavioral patterns from big data; and (iii) the advances in high-performance computing infrastructures that make it possible to ingest and manage big data and perform complex analyses [16].

Paper organization Section 2 discusses how data science will impact science and society at large in the coming years. Section 3 outlines the main ethical issues in studying human behavior that data science introduces. In Sect. 4, we show how concepts such as open science and e-infrastructures are effective tools for supporting and disseminating ethical uses of data and for training new generations of data scientists; we illustrate the importance of open data science with examples later in the paper. Finally, we show some use cases of data science through thematic environments that bind datasets to social mining methods.

2 Data science for society, science, industry and business

Fig. 1 Data science as an ecosystem: on the left, the main components enabling data science (data, analytical methods, and infrastructures); on the right, the impact of data science on society, science, and business. All data science activities should be carried out under strict ethical principles

The quality of business decision making, government administration, and scientific research can potentially be improved by analyzing data. Data science offers important insights into many complicated issues, often with remarkable accuracy and timeliness.

Fig. 2 The data science pipeline: raw data are transformed into data ready for analytics; analytical methods then turn these data into knowledge, which is delivered together with results and evaluation measures

As shown in Fig. 1, data science is an ecosystem where the following scientific, technological, and socioeconomic factors interact:

Data: availability of data and access to data sources;

Analytics & computing infrastructures: availability of high-performance analytical processing and open-source analytics;

Skills: availability of highly and appropriately skilled data scientists and engineers;

Ethical & legal aspects: availability of regulatory environments for data ownership and usage, data protection and privacy, security, liability, cybercrime, and intellectual property rights;

Applications: business- and market-ready applications;

Social aspects: focus on major global societal challenges.

Data science, envisioned as the intersection of data mining, big data analytics, artificial intelligence, statistical modeling, and complex systems science, is capable of transparently monitoring data quality and the results of analytical processes. If we want data science to address global challenges and become a determining factor of sustainable development, it is necessary to push towards an open global ecosystem for scientific, industrial, and societal innovation [48]. We need to build an ecosystem of socioeconomic activities where each new idea, product, and service creates opportunities for further ideas, products, and services. An open data strategy, innovation, interoperability, and suitable intellectual property rights can catalyze such an ecosystem and boost economic growth and sustainable development. This strategy also requires "networked thinking" and a participatory, inclusive approach.

Data are relevant in almost all scientific disciplines, and a data-dominated science could lead to the solution of problems currently considered hard or impossible to tackle. It is impossible to cover all the scientific sectors where a data-driven revolution is ongoing; here, we provide just a few examples.

The Sloan Digital Sky Survey has become a central resource for astronomers around the world. Astronomy is being transformed from a discipline in which taking pictures of the sky was a large part of an astronomer's job into one in which the images are already in a database, and the astronomer's task is to find interesting objects and phenomena in it. In the biological sciences, data are stored in public repositories, and an entire discipline, bioinformatics, is devoted to the analysis of such data. Data-centric approaches based on personal behaviors can also support medical applications, analyzing data both at the level of human behavior and at the lower molecular level: for example, integrating genomic data on drug response with users' habits enables a computational drug science for high-precision personalized medicine. In humans, as in other organisms, most cellular components exert their functions through interactions with other cellular components. The totality of these interactions (the human "interactome") is a network with hundreds of thousands of nodes and a much larger number of links. A disease is rarely a consequence of an abnormality in a single gene; rather, the disease phenotype reflects various pathological processes that interact in a complex network. Network-based approaches can have multiple biological and clinical applications, especially in revealing the mechanisms behind complex diseases [6].

Now, we illustrate the typical data science pipeline [50]. People, machines, systems, factories, organizations, communities, and societies produce data. Data are collected in every aspect of our life: when we submit a tax declaration; when a customer orders an item online; when a social media user posts a comment; when an X-ray machine is used to take a picture; when a traveler posts a review of a restaurant; when a sensor in a supply chain sends an alert; or when a scientist conducts an experiment. This huge and heterogeneous quantity of data needs to be extracted, loaded, understood, transformed, and, in many cases, anonymized before it may be used for analysis. Analysis results include routines, automated decisions, predictions, and recommendations; these outcomes need to be interpreted to produce actions and feedback. Furthermore, this scenario must also consider the ethical problems in managing social data. Figure 2 depicts the data science pipeline; a minimal code sketch of its stages follows below. Ethical aspects are important in the application of data science in several sectors, and they are addressed in Sect. 3.
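
As a minimal illustration of these stages, the following sketch (ours, with an assumed input file, column names, and model choice, not the paper's implementation) loads raw records, pseudonymizes identifiers, derives features, and fits a predictive model:

```python
# Minimal data science pipeline sketch: extract -> anonymize -> transform -> analyze.
# The file name, schema, and model below are illustrative assumptions.
import hashlib

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

raw = pd.read_csv("orders.csv")  # hypothetical raw data source

# Anonymization: replace direct identifiers with salted hashes.
SALT = "change-me"
raw["user_id"] = raw["user_id"].map(
    lambda u: hashlib.sha256((SALT + str(u)).encode()).hexdigest()[:16]
)

# Transformation: derive analysis-ready features (assumed columns).
X = raw[["basket_size", "total_amount", "hour_of_day"]]
y = raw["churned"]

# Analysis: fit a model and report a simple evaluation measure.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("holdout accuracy:", model.score(X_te, y_te))
```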

2.1 Impact on society

Data science is an opportunity for improving our society and boosting social progress. It can support policymaking; it offers novel ways to produce high-quality, high-precision statistical information; and it can empower citizens with self-awareness tools. Furthermore, it can help to promote ethical uses of big data.

Modern cities are perfect environments densely traversed by large data flows. Using traffic monitoring systems, environmental sensors, GPS individual traces, and social information, we can organize cities as a collective sharing of resources that need to be optimized, continuously monitored, and promptly adjusted when needed. It is easy to understand the potential of data science by introducing terms such as urban planning, public transportation, reduction of energy consumption, ecological sustainability, safety, and management of mass events. These terms represent only the front line of topics that can benefit from the awareness that big data might provide to city stakeholders [22, 27, 29]. Several methods for analyzing and predicting human mobility are available in the literature. MyWay [47] exploits individual systematic behaviors to predict future human movements by combining individually and collectively learned models. Carpooling [22] is based on mobility data from travelers in a given territory and constructs a network of potential carpooling users; by exploiting its topological properties, it highlights sub-populations with higher chances of forming a carpooling community and the propensity of users to be either drivers or passengers in a shared car (a toy sketch of this network construction follows below). Event attendance prediction [13] analyzes users' call habits, classifying people into behavioral categories that divide them among residents, commuters, and visitors, and allows us to observe the variety of behaviors of city users and the attendance of big events in cities.
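
To illustrate the network-construction idea behind the carpooling analysis, here is a toy sketch (ours, with invented routes, not the method of [22]): users become nodes, and an edge links two users whose routes overlap enough to plausibly share a car.

```python
# Toy carpooling network: nodes are users; an edge connects users whose
# routes overlap enough (Jaccard similarity) to plausibly share a car.
import networkx as nx

# Hypothetical routes as sets of traversed road segments.
routes = {
    "u1": {"s1", "s2", "s3", "s4"},
    "u2": {"s2", "s3", "s4", "s9"},
    "u3": {"s7", "s8"},
}

def overlap(a: set, b: set) -> float:
    """Jaccard similarity between two sets of road segments."""
    return len(a & b) / len(a | b)

g = nx.Graph()
g.add_nodes_from(routes)
for u in routes:
    for v in routes:
        if u < v and overlap(routes[u], routes[v]) >= 0.5:
            g.add_edge(u, v)

# Connected components approximate candidate carpooling communities.
print([sorted(c) for c in nx.connected_components(g)])
```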

Electric mobility is expected to gain importance worldwide. The impact of a complete switch to electric mobility is still under investigation, and what appears critical is the intensity of the power flows due to charging (and fast-recharging) systems, which may challenge the stability of the power network. To avoid instabilities in the charging infrastructure, an accurate prediction of the power flows associated with mobility is needed. Personal mobility data can be used to estimate mobility flows and to simulate the impact of different charging behavioral patterns, predicting power flows and optimizing the placement of charging infrastructure [25, 49]. Lorini et al. [26] provide an example of urban flood prediction that integrates data from the CEM system with Twitter data. The Twitter data are processed using massively multilingual approaches for classification. The model is supervised and requires careful data collection and validation of ground truth about confirmed floods from multiple sources.

Another example of data science for society can be found in the development of applications aimed directly at the individual. In this context, concepts such as personal data stores and personal data analytics aim to implement a new deal on personal data, providing a user-centric view in which data are collected, integrated, and analyzed at the individual level, giving users better awareness of their own behavioral, health, and consumer profiles. Within this user-centric perspective, there is room for an even broader market of business applications, such as high-precision real-time targeted marketing, self-organizing decision making that preserves desired global properties, and sustainability of the transportation or healthcare system. Such contexts emphasize two essential aspects of data science: the need for creativity in exploiting and combining several data sources in novel ways, and the need to give awareness and control of personal data to the users who generate them, to sustain a transparent, trust-based, crowd-sourced data ecosystem [19].

The impact of online social networks on our society has changed the mechanisms behind information spreading and news production. The transformation of media ecosystems and news consumption is having consequences in several fields. A relevant example is the impact of misinformation on society, as in the Brexit referendum, where the massive diffusion of fake news was considered one of the most relevant factors in the outcome of this political event. Examples of achievements are provided by results on the influence of external news media on polarization in online social networks. These results indicate that users are highly polarized towards news sources, i.e., they tend to cite sources that they identify as ideologically similar to themselves. Other results concern echo chambers and the role of social media users: there is a strong correlation between the orientation of the content produced and of the content consumed. In other words, an opinion "echoes" back to the user when others are sharing it in the "chamber" (i.e., the social network around the user) [36]. Other results worth mentioning concern efforts to uncover spam and bot activity in stock microblogs on Twitter: taking inspiration from biological DNA, the idea is to model online users' behavior through strings of characters representing sequences of their actions (a toy sketch of this encoding follows below). With this approach, [11, 12] report that 71% of suspicious users were classified as bots; furthermore, 37% of them were suspended by Twitter a few months after the investigation. Several approaches can be found in the literature, but they generally display some limitations: some address only certain features of the diffusion of misinformation (bot detection, segregation of users according to their opinions, or other social analyses), and there is a lack of comprehensive frameworks for interpreting results. While the former limitation is understandable given how young the research field is, the latter reveals a more fundamental need: without strict statistical validation, it is hard to state which elements are crucial for a well-grounded description of a system. To counter the diffusion of fake news, building a comprehensive dataset providing all the information about publishers, shared contents, and the engagement of users over space and time, together with their profile histories, can help the development of innovative and effective learning models. Both unsupervised and supervised methods will work together to identify misleading information. Multidisciplinary teams of journalists, linguists, behavioral scientists, and similar experts will be needed to identify what amounts to information warfare campaigns. Cyberwarfare and information warfare will be two of the biggest threats the world will face in the 21st century.
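
As a toy illustration of the DNA-inspired encoding mentioned above (ours, not the implementation of [11, 12]), each account's timeline becomes a string over a small action alphabet, and unusually long common substrings across accounts hint at coordinated, bot-like behavior:

```python
# Toy behavioral-DNA sketch: encode each account's actions as a string
# (T=tweet, R=retweet, P=reply); long substrings shared across accounts
# suggest coordinated automation.
from difflib import SequenceMatcher

timelines = {  # hypothetical action sequences
    "acct_a": "TRRPTRRPTRRPTRRP",
    "acct_b": "TRRPTRRPTRRPTRRT",
    "acct_c": "TPRTTPRRTPTRPRTT",
}

def longest_common_substring(s: str, t: str) -> int:
    m = SequenceMatcher(None, s, t).find_longest_match(0, len(s), 0, len(t))
    return m.size

names = sorted(timelines)
for i, u in enumerate(names):
    for v in names[i + 1:]:
        lcs = longest_common_substring(timelines[u], timelines[v])
        # A shared substring spanning most of the timeline is suspicious.
        flag = "suspicious" if lcs >= 12 else "ok"
        print(u, v, "longest common substring =", lcs, flag)
```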

Social sensing methods collect data produced by digital citizens through either opportunistic or participatory crowd-sensing, depending on users' awareness of their involvement. These approaches present a variety of technological and ethical challenges. An example is Twitter Monitor [10], a crowd-sensing tool designed to access Twitter streams through the Twitter Streaming API. It allows launching parallel listening processes to collect different sets of data, creating services for listening campaigns around relevant events such as political elections, natural and human-made disasters, and popular national events [11]. Campaigns can be configured by specifying keywords, accounts, and geographical areas of interest.

Nowcasting financial and economic indicators focuses on the potential of data science as a proxy for well-being and socioeconomic applications. The development of innovative research methods has demonstrated that poverty indicators can be approximated by social and behavioral mobility metrics extracted from mobile phone and GPS data [34], and that Gross Domestic Product can be accurately nowcast using retail supermarket data [18] (a minimal nowcasting sketch follows below). Furthermore, nowcasting demographic aspects of a territory from Twitter data [1] can support official statistics through the estimation of location, occupation, and semantics. Networks are a convenient way to represent the complex interactions among the elements of a large system. In economics, networks are gaining increasing attention because the underlying topology of a networked system affects aggregate output and the propagation of shocks or financial distress, and because the topology allows us to learn something about a node by looking at the properties of its neighbors. Among the most investigated financial and economic networks, we cite work analyzing interbank systems, payment networks between firms, bank-firm bipartite networks, and trading networks between investors [37]. Another interesting phenomenon is the advent of blockchain technology, which underpins the bitcoin cryptocurrency [31].
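
The following sketch shows the shape of such a nowcasting exercise (ours, with synthetic data, not the model of [18]): a quarterly indicator is regressed on high-frequency retail aggregates, and the fitted model produces an estimate for the latest, not-yet-published quarter.

```python
# Minimal nowcasting sketch: regress a quarterly indicator (e.g., GDP growth)
# on high-frequency retail aggregates. All data here are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_quarters = 40
# Hypothetical predictors: supermarket sales growth and a basket-price index.
X = rng.normal(size=(n_quarters, 2))
# Synthetic target correlated with the retail signals, plus noise.
y = 0.6 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=0.1, size=n_quarters)

model = LinearRegression().fit(X[:-1], y[:-1])  # fit on past quarters
print("nowcast for latest quarter:", model.predict(X[-1:])[0])
print("realized value:", y[-1])
```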

Data science is an excellent opportunity for policy, data journalism, and marketing. The online media arena is now available as a real-time experimental society for understanding social mechanisms such as harassment, discrimination, hate, and fake news. In our vision, data science approaches are necessary for better governance. These new approaches complement and change official statistics, representing a cheaper and more timely way of computing them. The impact of data science-driven applications can be particularly significant when the applications help to build new infrastructures or new services for the population.

The availability of massive data portraying soccer performance has facilitated recent advances in soccer analytics. Rossi et al. [42] proposed an innovative machine learning approach to forecasting non-contact injuries for professional soccer players. In [3], we can find the definition of quantitative measures of pressing in defensive phases in soccer. Pappalardo et al. [33] outlined an automatic, data-driven evaluation of performance in soccer and a ranking system for soccer teams. Sports data science is attracting much interest and is now leading to the release of large public datasets of sports events.

Finally, data science has unveiled a shift from statistics on populations to statistics on interlinked entities connected by mutual interactions. This change of perspective reveals universal patterns underlying complex social, economic, technological, and biological systems. It helps us understand the dynamics of how opinions, epidemics, or innovations spread in our society, as well as the mechanisms behind complex systemic diseases, such as cancer and metabolic disorders, revealing hidden relationships between them. For diffusive models and dynamic networks, NDlib [40] is a Python package for the description, simulation, and observation of diffusion processes in complex networks; it collects diffusive models from epidemics and opinion dynamics and allows a scientist to compare simulations over synthetic systems (a short usage sketch follows below). For community discovery, two tools are available for studying the structure of a community and understanding its habits: Demon [9] extracts ego networks (i.e., the set of nodes connected to an ego node) and identifies the real communities by adopting a democratic, bottom-up merging of such structures; Tiles [41] is dedicated to dynamic network data and extracts overlapping communities, tracking their evolution in time through an online iterative procedure.
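
NDlib is a real, documented package; the sketch below follows its documented SIR interface (the graph and parameter values are illustrative choices of ours):

```python
# Simulate an SIR epidemic on a synthetic graph with NDlib [40].
import networkx as nx
import ndlib.models.epidemics as ep
import ndlib.models.ModelConfig as mc

g = nx.erdos_renyi_graph(1000, 0.01)  # synthetic contact network

model = ep.SIRModel(g)
config = mc.Configuration()
config.add_model_parameter("beta", 0.01)              # infection rate
config.add_model_parameter("gamma", 0.005)            # recovery rate
config.add_model_parameter("fraction_infected", 0.05) # initial seeds
model.set_initial_status(config)

iterations = model.iteration_bunch(200)  # run 200 steps
# Final counts per status (0=susceptible, 1=infected, 2=removed).
print(iterations[-1]["node_count"])
```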

2.2 Impact on industry and business

Data science can create an ecosystem of novel data-driven business opportunities. As a general trend across all sectors, massive quantities of data will be made accessible to everybody, allowing entrepreneurs to recognize and rank shortcomings in business processes and to spot potential threats and win-win situations. Ideally, every citizen could derive new business ideas from these patterns. Co-creation enables data scientists to design innovative products and services. By sharing data of various nature and provenance, the value of joined datasets becomes much larger than the sum of the values of the separate datasets.

Gains from data science are expected across all sectors, from industry and production to services and retail. In this context, we cite several macro-areas where data science applications are especially promising. In energy and environment, the digitization of energy systems (from production to distribution) enables the acquisition of real-time, high-resolution data. Coupled with other data sources, such as weather data, usage patterns, and market data, and accompanied by advanced analytics, efficiency levels can be increased immensely. The positive impact on the environment is also enhanced by geospatial data, which help us understand how our planet and its climate are changing and confront major issues such as global warming, preservation of species, and the role and effects of human activities.

The manufacturing and production sector, with growing investments in Industry 4.0 and smart factories whose sensor-equipped machinery is both intelligent and networked (see the internet of things and cyber-physical systems), will be one of the major producers of data in the world. The application of data science in this sector will bring efficiency gains and predictive maintenance. Entirely new business models are expected, since the mass production of individualized products becomes possible, with consumers gaining direct influence and control.

As already stated in Sect. 2.1, data science will contribute to increasing efficiency in public administration processes and healthcare. Security will be enhanced in both the physical and the cyber domain: from financial fraud to public security, data science will contribute to establishing a framework that enables a safe and secure digital economy. Big data exploitation will open up opportunities for innovative, self-organizing ways of managing logistical business processes. Deliveries could be based on predictive monitoring, using data from stores, semantic product memories, internet forums, and weather forecasts, leading to both economic and environmental savings. Let us also consider the impact of personalized services in creating real experiences for tourists. The analysis of real-time, context-aware data (with the help of historical and cultural heritage data) will provide customized information to each tourist and will contribute to better and more efficient management of the whole tourism value chain.

3 Data science ethics

Data science creates great opportunities but also new risks. The use of advanced tools for data analysis could expose sensitive knowledge about individual persons and could invade their privacy. Data science approaches require access to digital records of personal activities that contain potentially sensitive information. Personal information can be used to discriminate against people based on their presumed characteristics. Data-driven algorithms yield classification and prediction models of behavioral traits of individuals, such as credit score, insurance risk, health status, personal preferences, and religious, ethnic, or political orientation, based on personal data disseminated in the digital environment by users (with or, often, without their awareness). The achievements of data science result from re-interpreting available data for analysis goals that differ from the original reasons motivating data collection. For example, mobile phone call records are initially collected by telecom operators for billing and operational aims, but they can be used for accurate and timely demography and human mobility analysis at a country or regional scale. This re-purposing of data clearly shows the importance of legal compliance and of data ethics technologies and safeguards: to protect privacy and anonymity; to secure data; to engage users; to avoid discrimination and misuse; and to ensure transparency, so as to seize the opportunities of data science while controlling the associated risks.

Several aspects should be considered to avoid harming individual privacy. Ethical elements should include: (i) monitoring the compliance of experiments, research protocols, and applications with ethical and juridical standards; (ii) developing big data analytics and social mining tools with value-sensitive-design and privacy-by-design methodologies; and (iii) boosting the excellence and international competitiveness of Europe's big data research in the safe and fair use of big data for research. It is essential to highlight that data scientists using personal and social data, including through infrastructures, have a responsibility to become acquainted with the fundamental ethical aspects of acting as a "data controller." This aspect has to be considered when defining courses for informing and training data scientists about the responsibilities, the possibilities, and the boundaries they have in data manipulation.

Recalling Fig. 2, it is crucial to inject into the data science pipeline the ethical values of fairness (how to avoid unfair and discriminatory decisions), accuracy (how to provide reliable information), confidentiality (how to protect the privacy of the people involved), and transparency (how to make models and decisions comprehensible to all stakeholders). This value-sensitive design must aim at boosting widespread social acceptance of data science without inhibiting its power. Finally, it is essential to consider the impact of the General Data Protection Regulation (GDPR) on (i) companies' duties, and how European companies should comply with the limits on data manipulation that the Regulation requires; and (ii) researchers' duties, highlighting the articles and recitals that specifically mention and explain how research is treated in the GDPR's legal system.

Fig. 3 The relationship between big and open data and how they relate to the broad concept of open government

We complete this section with another important aspect related to open data, i.e., publicly accessible data that people, companies, and organizations can use to launch new ventures, analyze patterns and trends, make data-driven decisions, and solve complex problems. All definitions of open data include two features: (i) the data must be publicly available for anyone to use, and (ii) the data must be licensed in a way that allows reuse. All over the world, government agencies and public organizations are launching initiatives to make data open; listing them all is impossible, but one UN initiative must be mentioned: Global Pulse, meant to implement the vision of a future in which big data is harnessed safely and responsibly as a public good.

Figure 3 shows the relationships between open data and big data. Currently, the problem is not only that government agencies (and some business companies) are collecting personal data about us, but also that we do not know what data are being collected and we do not have access to the information about ourselves. As reported by the World Economic Forum in 2013, it is crucial to understand the value of personal data so that users can make informed decisions. A new branch of philosophy and ethics is emerging to handle issues related to personal data. On the one hand, in all cases where the data might be used for the social good (e.g., medical research, improvement of public transport, contrasting epidemics), understanding the value of personal data means correctly evaluating the balance between public benefits and personal loss of protection. On the other hand, when data are to be used for commercial purposes, that value might instead translate into a simple price for the personal information that a user might sell to a company for its business. In this context, discrimination discovery consists of searching for a priori unknown contexts of suspect discrimination against protected-by-law social groups by analyzing datasets of historical decision records (a toy check of this kind follows below). Machine learning and data mining approaches may be affected by discriminatory rules, and these rules may be deeply hidden within obscure artificial intelligence models. Thus, discrimination discovery consists of understanding whether a predictive model makes direct or indirect discrimination. DCube [43] is a tool for data-driven discrimination discovery and a library of methods for fairness analysis.
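
A minimal example of the kind of check involved (ours, with invented records, not DCube's methodology) compares the rate of positive decisions across groups and applies the common "80% rule":

```python
# Toy disparate-impact check on historical decision records (invented data).
import pandas as pd

df = pd.DataFrame({
    "group":    ["A", "A", "A", "B", "B", "B", "B", "B"],
    "approved": [1, 0, 0, 1, 1, 1, 0, 1],
})

rates = df.groupby("group")["approved"].mean()
ratio = rates["A"] / rates["B"]  # disparate impact ratio
# The "80% rule" flags ratios below 0.8 as suspect discrimination.
print(rates)
print(f"ratio = {ratio:.2f} ->", "suspect" if ratio < 0.8 else "ok")
```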

It is important to evaluate how a mining model or algorithm makes its decisions. The growing field of explainable machine learning provides and continuously expands a set of comprehensive toolkits [21]. For example, X-Lib is a library containing state-of-the-art explanation methods organized within a hierarchical structure and wrapped in a uniform fashion so that they can be easily accessed and used by different users. The library provides support for explaining classification on tabular data and images and for explaining the logic of complex decision systems. X-Lib collects, among others, the following explanation methods: LIME [38], Anchor [39], and DeepExplain, which includes saliency maps [44], Gradient*Input, Integrated Gradients, and DeepLIFT [46]. The Saliency library contains code for SmoothGrad [45], as well as implementations of several other saliency techniques: vanilla gradients, guided backpropagation, and Grad-CAM. (A minimal LIME usage sketch follows below.) Another improvement in this context is the use of robotics and AI in data preparation and curation, and in detecting bias in data, information, and knowledge, as well as misuse and abuse of these assets, with respect to legal, privacy, and ethical issues and to transparency and trust. We cannot rely on human beings alone to do these tasks; we need to exploit the power of robotics and AI to help provide the required protections. Data and information lawyers will play a key role in legal and privacy issues, in the ethical use of these assets, and in the problem of bias in both algorithms and the data, information, and knowledge used to develop analytics solutions. Finally, we can state that data science can help to fill the gap between legislators and technology.
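
LIME [38] is a real, documented library; the sketch below applies it to an illustrative dataset and model of our choosing to explain a single prediction of a black-box classifier:

```python
# Explain one prediction of a black-box model with LIME [38].
# Dataset and classifier are illustrative choices, not X-Lib itself.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=data.feature_names,
    class_names=list(data.target_names),
)
# Which features locally drive the prediction for this instance?
exp = explainer.explain_instance(data.data[0], model.predict_proba, num_features=3)
print(exp.as_list())
```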

4 Big data ecosystem: the role of research infrastructures

Research infrastructures (RIs) play a crucial role in the advent and development of data science. A social mining experiment exploits the main components of data science depicted in Fig. 1 (i.e., data, infrastructures, analytical methods) to enable multidisciplinary scientists and innovators to extract knowledge and to make the experiment reusable by the scientific community and by innovators, providing an impact on science and society.

Resources such as data and methods help domain scientists and data scientists transform a research or innovation question into a responsible, data-driven analytical process. This process is executed on the platform, supporting experiments that yield scientific output, policy recommendations, or innovative proofs of concept. Furthermore, the stewardship of an operational ethical board is a critical success factor for an RI.

An infrastructure typically offers easy-to-use means of defining complex analytical processes and workflows, thus bridging the gap between domain experts and analytical technology. In many instances, domain experts may become a reference for their scientific communities, thus facilitating the engagement of new users in the RI's activities. As a collateral feedback effect, experiments will generate new relevant data, methods, and workflows that can be integrated into the platform by data scientists, contributing to the expansion of the RI's resources. An experiment designed in a node of the RI and executed on the platform returns its results to the entire RI community.

Well-defined thematic environments amplify the achievements of new experiments towards vertical scientific communities (and potential stakeholders) by activating appropriate dissemination channels.

4.1 The SoBigData Research Infrastructure

The SoBigData Research Infrastructure is an ecosystem of human and digital resources, comprising data scientists, analytics, and processes. As shown in Fig. 4, SoBigData is designed to enable multidisciplinary scientists and innovators to realize social mining experiments and to make them reusable by the scientific communities. All its components have been introduced to implement data science from raw data management to knowledge extraction, with particular attention to the legal and ethical aspects reported in Fig. 1. SoBigData supports data science for a cross-disciplinary community of data scientists studying all the elements of societal complexity from a data- and model-driven perspective.

Currently, SoBigData includes scientific, industrial, and other stakeholders. In particular, our stakeholders are data analysts and researchers (35.6%), followed by companies (33.3%) and policymakers and lawmakers (20%). The following sections provide a short but comprehensive overview of the services provided by the SoBigData RI, with special attention to supporting ethical and open data science [15, 16].

4.1.1 Resources, facilities, and access opportunities

Over the past decade, Europe has developed world-leading expertise in building and operating e-infrastructures: large-scale, federated, and distributed online research environments through which researchers can share access to scientific resources (including data, instruments, computing, and communications), regardless of their location. They are meant to support unprecedented scales of international collaboration in science, both within and across disciplines, investing in economies of scale and common behavior, policies, best practices, and standards. They shape a common environment where scientists can create, validate, assess, compare, and share their digital results of science, such as research data and research methods, using a common "digital laboratory" consisting of agreed-upon services and tools.

Fig. 4 The SoBigData Research Infrastructure: an ecosystem of human and digital resources, comprising data scientists, analytical methods, and processes. SoBigData enables multidisciplinary scientists and innovators to carry out experiments and to make them reusable by the community

However, the implementation of workflows that follow the Open Science principles of reproducibility and transparency is hindered by a multitude of real-world problems. One of the most prominent is that the e-infrastructures available to research communities today are far from being well-designed, consistent digital laboratories, neatly organized to share and reuse resources according to common policies, data models, standards, language platforms, and APIs. They are instead "patchworks of systems" that assemble online tools, services, and data sources and evolve to match the requirements of the scientific process and to include new solutions. This degree of heterogeneity precludes the adoption of uniform workflow management systems, standard service-oriented approaches, and routine monitoring and accounting methods. Scientific workflows are typically realized by writing ad hoc code, manipulating data on desktops, alternating the execution of online web services, and sharing software libraries that implement research methods in different languages, desktop tools, and web-accessible execution engines (e.g., Taverna, Knime, Galaxy).

The SoBigData e-infrastructure is based on D4Science services, which provide researchers and practitioners with a working environment where open science practices are transparently promoted and data science practices can be implemented while minimizing the technological integration costs highlighted above.

D4Science is a deployed instance of the gCube technology [4], software conceived to facilitate the integration of web services, code, and applications as resources of different types in a common framework, which in turn enables the construction of Virtual Research Environments (VREs) [7] as combinations of such resources (Fig. 5). As no common framework is trusted and sustained enough to convince resource providers that converging to it would be a worthwhile effort, D4Science implements a "system of systems": resources are integrated at minimal cost to gain in scalability, performance, accounting, provenance tracking, seamless integration with other resources, and visibility to all scientists. The principle is that the cost of "participation" in the framework falls on the infrastructure rather than on resource providers. The infrastructure provides the necessary bridges to include and combine resources that would otherwise be incompatible.

Fig. 5 D4Science: resources from external systems, virtual research environments, and communities

More specifically, via D4Science, SoBigData scientists can integrate and share resources such as datasets, research methods, web services via APIs, and web applications via portlets. Resources can then be integrated, combined, and accessed via VREs, intended as web-based working environments tailored to support the needs of their designated communities, each working on a research question. Research methods are integrated as executable code implementing WPS APIs in different programming languages (e.g., Java, Python, R, Knime, Galaxy), and they can be executed via the Data Miner analytics platform, in parallel and transparently to the users, over powerful and extensible clusters, through simple VRE user interfaces. Scientists using Data Miner in the context of a VRE can select and execute the available methods and share the results with other scientists, who can repeat or reproduce the experiment with a simple click. (A hedged sketch of a generic WPS invocation follows below.)
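
To give a flavor of what invoking a WPS-exposed method looks like from Python, here is a sketch using OWSLib's standard WPS client; the endpoint URL and process identifier are hypothetical placeholders, not actual SoBigData values, and real deployments require authentication:

```python
# Generic WPS invocation with OWSLib (endpoint and identifiers are placeholders).
from owslib.wps import WebProcessingService, monitorExecution

wps = WebProcessingService("https://dataminer.example.org/wps")  # hypothetical

# List the analytical methods the service exposes.
for proc in wps.processes:
    print(proc.identifier, "-", proc.title)

# Execute one method with named inputs (identifiers are illustrative).
execution = wps.execute(
    "my_community.trajectory_builder",
    [("input_table", "gps_points.csv")],
)
monitorExecution(execution)  # poll until the remote job completes
print(execution.status)
```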

D4Science VREs are equipped with core services supporting data analysis and collaboration among their users: (i) a shared workspace to store and organize any version of a research artifact; (ii) a social networking area to hold discussions on any topic (including working versions and released artifacts) and to be informed of happenings; (iii) a Data Miner analytics platform to execute processing tasks (research methods), either natively provided by VRE users or borrowed from other VREs, applied to VRE users' cases and datasets; and (iv) a catalogue-based publishing platform to make the existence of an artifact public and disseminated. Scientists operating within VREs use such facilities continuously, transparently tracking the record of their research activities (actions, authorship, provenance) as well as the products and the links between them (lineage) resulting from every phase of the research life cycle, thus facilitating the publishing of science according to the Open Science principles of transparency and reproducibility [5].

Today, SoBigData integrates the resources in Table 1. By means of such resources, SoBigData scientists have created VREs to deliver the so-called SoBigData exploratories: Explainable Machine Learning, Sports Data Science, Migration Studies, Societal Debates, Well-being & Economy, and City of Citizens. Each exploratory includes the resources required to perform data science workflows in a controlled and shared environment. Resources range from data to methods, described in more detail in the following, together with their use within the exploratories.

All the resources and instruments integrated into the SoBigData RI are structured so as to operate within the confines of current data protection law, with a focus on the General Data Protection Regulation (GDPR), and of the ethical analysis of the fundamental values involved in social mining and AI. Each item in the catalogue has specific fields for managing ethical issues (e.g., whether a dataset contains personal information) and fields for describing and managing intellectual property.

4.1.2 Data resources: social mining and big data ecosystem

The SoBigData RI defines policies supporting users in the collection, description, preservation, and sharing of their datasets. It implements data science by making such data available for collaborative research through various strategies, ranging from sharing open datasets with the scientific community at large to sharing data under disclosure restrictions that allow access only within secure environments.

Several big datasets are available through the SoBigData RI, including network graphs from mobile phone call data; networks crawled from many online social networks, including Facebook and Flickr; transaction micro-data from diverse retailers; query logs from both search engines and e-commerce; society-wide mobile phone call data records; GPS tracks from personal navigation devices; survey data about customer satisfaction or market research; extensive web archives; billions of tweets; and data from location-aware social networks.

4.1.3 Data science through SoBigData exploratories

Exploratories are thematic environments built on top of the SoBigData RI. An exploratory binds datasets to social mining methods, providing the research context for specific data science applications by: (i) supplying the scientific context in which the application is performed, which can be considered a container binding specific methods, applications, services, and datasets; and (ii) stimulating communities to assess the effectiveness of the analytical process, promoting scientific dissemination, result sharing, and reproducibility. The use of exploratories promotes the effectiveness of data science through research infrastructure services. The following sections give a short description of the six SoBigData exploratories. Figure 6 shows the main thematic areas covered by each exploratory. Due to its nature, the Explainable Machine Learning exploratory can be applied to every sector where a black-box machine learning approach is used. The list of exploratories (and the data and methods inside them) is updated continuously and continues to grow over time.

Fig. 6 SoBigData covers six thematic areas, listed horizontally. Each exploratory covers more than one thematic area

City of citizens. This exploratory collects data science applications and methods related to geo-referenced data describing the movements of citizens in a city, a territory, or an entire region. The scientific literature offers several studies and methods that employ a wide variety of data sources to build models of people's mobility and city characteristics [30, 32]. Like ecosystems, cities are open systems that live and develop by means of flows of energy, matter, and information. What distinguishes a city from a colony is the human component (i.e., the process of transformation by cultural and technological evolution). Through this combination, cities are evolutionary systems that develop and co-evolve continuously with their inhabitants [24]. Cities are kaleidoscopes of information generated by a myriad of digital devices woven into the urban fabric. The inclusion of tracking technologies in personal devices has enabled the analysis of large sets of mobility data such as GPS traces and call detail records.

Data science applied to human mobility is one of the critical topics investigated in SoBigData, thanks to the partners' decade-long experience in European projects. The study of human mobility has led to the integration into SoBigData of unique Global Positioning System (GPS) and call detail record (CDR) datasets of people and vehicle movements and geo-referenced social network data, as well as several mobility services: O/D (origin-destination) matrix computation (a toy example follows below), Urban Mobility Atlas (a visual interface to city mobility patterns), GeoTopics (for exploring patterns of urban activity from Foursquare), and predictive models such as MyWay (trajectory prediction) and TripBuilder (helping tourists build personalized tours of a city). In human mobility, research questions come from geographers, urbanists, complexity scientists, data scientists, policymakers, and big data providers, as well as from innovators aiming to build applications for any service in the smart city ecosystem. One idea is to investigate the impact of political events on the well-being of citizens: this exploratory supports the development of "happiness" and "peace" indicators through a text mining/opinion mining pipeline over repositories of online news. These indicators reveal that the level of crime in a territory can be well approximated by analyzing the news related to that territory. More generally, we study the impact of the economy on well-being and vice versa, considering, e.g., that the propagation of financial-distress shocks in an economic or financial system crucially depends on the topology of the network interconnecting its elements.
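
As a toy example of the O/D matrix computation mentioned above (ours, with an invented trip table, not SoBigData's service), trips already map-matched to origin and destination zones are simply cross-tabulated:

```python
# Toy origin-destination (O/D) matrix: count trips between pairs of zones.
import pandas as pd

trips = pd.DataFrame({  # hypothetical trips between city zones
    "origin":      ["A", "A", "B", "C", "B", "A"],
    "destination": ["B", "B", "A", "A", "C", "C"],
})

od_matrix = pd.crosstab(trips["origin"], trips["destination"])
print(od_matrix)
```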

Well-being and economy. This exploratory tests the hypothesis that well-being is correlated with the business performance of companies. The idea is to combine statistical methods and traditional (typically low-frequency) economic data with high-frequency data from non-traditional sources, such as the web and supermarkets, for nowcasting economic, socioeconomic, and well-being indicators. These indicators allow us to study and measure real-life costs through price variation and to infer socioeconomic status. Furthermore, this activity supports studies on the correlation between people's well-being and their social and mobility data. In this context, some basic hypotheses can be summarized as follows: (i) there are curves of age- and gender-based segregation distribution on company boards that are characteristic of the mean credit risk of companies in a region; (ii) a low mean credit risk of companies in a region correlates positively with well-being; (iii) systemic risk correlates highly with well-being indices at the national level. The final aim is to provide national governments with guidelines, methods, and indices for decision making on regulations affecting companies, in order to improve well-being in the country, also considering effective policies to reduce operational risks such as credit risk and external threats to companies [17].

Big data, analyzed through the lenses of data science, provides the means to understand our complex socioeconomic and financial systems. On the one hand, this offers new opportunities to measure the patterns of well-being and poverty at local and global scales, empowering governments and policymakers with the unprecedented opportunity to nowcast relevant economic quantities and compare different countries, regions, and cities. On the other hand, it allows us to investigate the networks underlying the complex systems of economy and finance, and how they affect aggregate output, the propagation of shocks or financial distress, and systemic risk.

Societal debates. This exploratory employs data science approaches to answer research questions such as: who participates in public debates? What is the "big picture" response of citizens to a policy, election, referendum, or other political event? This kind of analysis allows scientists, policymakers, and citizens to understand the online discussion surrounding polarized debates [14]. The personal perception of online discussions on social media is often biased by the so-called filter bubble, in which the automatic curation of content and of relationships between users negatively affects the diversity of opinions available to them. A complete analysis of polarized online debates enables citizens to be better informed and prepared for political outcomes. By analyzing content and conversations on social media and newspaper articles, data scientists study public debates and also assess public sentiment around debated topics, opinion diffusion dynamics, echo chamber formation, polarized discussions, fake news, and propaganda bots. Misinformation is often the result of a distorted perception of concepts that, although unrelated, suddenly appear together in the same narrative. Understanding the details of this process at an early stage may help to prevent the birth and diffusion of fake news. The fight against misinformation includes the development of dynamical models of misinformation diffusion (possibly in contrast to the spread of mainstream news) as well as models of how attention cycles are accelerated and amplified by the infrastructures of online media.

Another important topic covered by this exploratory is the analysis of how social bot activity affects the diffusion of fake news. Determining whether a user account is controlled by a human or a bot is a complex task. To the best of our knowledge, the only openly accessible solution for detecting social bots is Botometer, an API that allows us to interact with an underlying machine learning system (a usage sketch follows below). Although Botometer has proven accurate in detecting social bots, it has limitations due to the features of the Twitter API; hence, an algorithm overcoming the barriers of current recipes is needed.
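
Botometer ships an official Python client (botometer-python); the sketch below follows its documented usage, with placeholder credentials and a hypothetical account name:

```python
# Query Botometer for a bot-likelihood score via its official Python client.
# All credentials below are placeholders.
import botometer

twitter_app_auth = {
    "consumer_key": "...",
    "consumer_secret": "...",
    "access_token": "...",
    "access_token_secret": "...",
}
bom = botometer.Botometer(
    wait_on_ratelimit=True,
    rapidapi_key="...",  # placeholder RapidAPI key
    **twitter_app_auth,
)

result = bom.check_account("@example_account")  # hypothetical account
# CAP = complete automation probability; higher means more bot-like.
print("CAP (universal):", result["cap"]["universal"])
```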

The resources related to the Societal Debates exploratory, especially in the domain of media ecology and the fight against online misinformation, provide easy-to-use services to public bodies, media outlets, and social/political scientists. Furthermore, SoBigData supports new simulation models and experimental processes to validate in vivo the algorithms for fighting misinformation, curbing the pathological acceleration and amplification of online attention cycles, breaking the bubbles, and exploring alternative media and information ecosystems.

Migration studies. Data science is also useful for understanding the migration phenomenon. Knowledge about the number of immigrants living in a particular region is crucial for devising policies that maximize the benefits for both locals and immigrants. These numbers can vary rapidly in space and time, especially in periods of crisis such as wars or natural disasters.

This exploratory provides a set of data and tools for trying to answer questions about migration flows. Through it, a data scientist can study economic models of migration and observe how migrants choose their destination countries. A scientist can investigate what the "opportunities" a country offers to migrants actually are, and whether there are correlations between the number of incoming migrants and the opportunities in the host countries [8]. Furthermore, this exploratory tries to understand how the public perception of migration is changing, using opinion mining analysis. For example, social network analysis enables us to analyze migrants' social networks and to discover the structure of those networks for people who have decided to start a new life in a different country [28].

Finally, we can also evaluate current integration indices based on official statistics and survey data, which can be complemented by big data sources. This exploratory aims to build combined integration indexes that take multiple data sources into account to evaluate integration on various levels. Such sources include mobile phone data, to understand patterns of communication between immigrants and natives; social network data, to assess sentiment towards immigrants and immigration; professional network data (such as LinkedIn), to understand labor market integration; and local data, to understand to what extent moving across borders is associated with a change in migrants’ cultural norms. These indexes are fundamental for evaluating the overall social and economic effects of immigration. The new integration indexes can be applied at various space and time resolutions (small area methods) to obtain a complete picture of integration and to complement official indexes.
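As an illustration of how such a combined index might be assembled, the following pandas sketch min–max normalizes a few indicators and averages them. The indicator columns, their hypothetical sources, and the equal weighting are purely illustrative assumptions, not the exploratory's actual methodology.

```python
# A minimal sketch of a composite integration index; indicator columns,
# their sources, and the equal weighting are illustrative assumptions.
import pandas as pd

regions = pd.DataFrame({
    'region': ['A', 'B', 'C'],
    'native_contact_rate': [0.42, 0.31, 0.55],          # e.g., mobile phone data
    'sentiment_towards_migrants': [0.10, -0.05, 0.22],  # e.g., social media data
    'migrant_employment_rate': [0.61, 0.48, 0.70],      # e.g., professional networks
})

indicators = regions.drop(columns='region')
# Min-max normalize each indicator to [0, 1], then average with equal weights.
normalized = (indicators - indicators.min()) / (indicators.max() - indicators.min())
regions['integration_index'] = normalized.mean(axis=1)
print(regions[['region', 'integration_index']])
```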

Sports data science. The proliferation of new sensing technologies that provide high-fidelity data streams from every game is changing the way scientists, fans, and practitioners conceive of sports performance. Combining these (big) data with the tools of data science makes it possible to unveil the complex models underlying sports performance and enables many challenging tasks: from automatic tactical analysis to data-driven performance ranking, game outcome prediction, and injury forecasting. The idea is to foster research on sports data science in several directions. The application of explainable AI and deep learning techniques can be hugely beneficial to sports data science. For example, by using adversarial learning, we can modify the training plans of players associated with high injury risk and develop training plans that maximize players’ fitness while minimizing their injury risk. Gaming, simulation, and modeling form another set of tools that coaching staff can use to test tactics against a competitor. Furthermore, by using deep learning on time series, we can forecast the evolution of players’ performance and search for young talents.
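The sketch below illustrates the general shape of such a pipeline for injury forecasting: rolling acute and chronic workload features derived from a training-load time series feed a standard classifier. The synthetic data, the feature definitions (including the acute:chronic workload ratio), and the model choice are assumptions for illustration, not the method of the studies cited here.

```python
# A sketch of injury forecasting from workload time series; the synthetic data,
# the 7/28-session windows, and the model are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
sessions = pd.DataFrame({
    'player_id': rng.integers(0, 50, n),
    'daily_load': rng.gamma(2.0, 150.0, n),  # e.g., GPS-derived training load
})

# Rolling features per player: acute (7-session) and chronic (28-session) load.
g = sessions.groupby('player_id')['daily_load']
sessions['acute'] = g.transform(lambda s: s.rolling(7, min_periods=1).mean())
sessions['chronic'] = g.transform(lambda s: s.rolling(28, min_periods=1).mean())
sessions['acwr'] = sessions['acute'] / sessions['chronic']  # acute:chronic ratio

# Synthetic label: higher ACWR -> higher injury probability (illustration only).
p = 1 / (1 + np.exp(-(sessions['acwr'] - 1.3) * 4))
sessions['injured_next_week'] = rng.random(n) < p * 0.2

X = sessions[['acute', 'chronic', 'acwr']]
y = sessions['injured_next_week']
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
print('held-out accuracy:', round(clf.score(X_te, y_te), 3))
```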

This exploratory examines the factors influencing sports success and how to build simulation tools for boosting both individual and collective performance. Furthermore, it describes performance by means of data, statistics, and models, allowing coaches, fans, and practitioners to understand (and boost) sports performance [ 42 ].

Explainable machine learning. Artificial Intelligence, increasingly based on Big Data analytics, is a disruptive technology of our times. This exploratory provides a forum for studying the effects of AI on the future society. In this context, SoBigData studies the future of labor and the workforce through data- and model-driven analyses, simulations, and the development of methods that construct human-understandable explanations of black-box AI models [ 20 ].

Black-box systems for automated decision making map a user’s features into a class that predicts the behavioral traits of individuals, such as credit risk or health status, without exposing the reasons why. Most of the time, the internal reasoning of these algorithms is obscure even to their developers. For this reason, the last decade has witnessed the rise of a black box society. This exploratory is developing a set of techniques and tools that allow data analysts to understand why an algorithm produces a given decision. These approaches are designed not only for uncovering a lack of transparency, but also for discovering possible biases that algorithms inherit from human prejudices and artefacts hidden in the training data (and which may lead to unfair or wrong decisions) [ 35 ].
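One widely used family of such techniques fits an interpretable surrogate model that mimics the black box's behavior. The minimal sketch below trains a shallow decision tree on a random forest's predictions and prints human-readable rules; it is a generic global-surrogate illustration under an assumed dataset, not the specific method developed in this exploratory.

```python
# A minimal global-surrogate sketch: a shallow, readable decision tree is
# fitted to mimic a black-box model. Dataset and tree depth are assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

# Fit the surrogate on the black box's *predictions*, not on the true labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

# "Fidelity": how often the surrogate agrees with the black box.
print('fidelity:', surrogate.score(X, black_box.predict(X)))
print(export_text(surrogate, feature_names=list(X.columns)))  # readable rules
```

A high-fidelity surrogate gives analysts a rough, auditable approximation of the black box's logic; local methods such as LIME or Anchors (both cited in the references) instead explain individual decisions.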

5 Conclusions: individual and collective intelligence

The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s [ 23 ]. Since 2012, 2.5 exabytes (2.5 \(\times 10^{18}\) bytes) of data have been created every day; as of 2014, 2.3 zettabytes (2.3 \(\times 10^{21}\) bytes) of data were generated every day by high-tech corporations worldwide. Soon zettabytes of useful public and private data will be widely and openly available. In the coming years, smart applications such as smart grids, smart logistics, smart factories, and smart cities will be widely deployed across the continent and beyond. Ubiquitous broadband access, mobile technology, social media, services, and the Internet of Things on billions of devices will have contributed to the explosion of generated data, up to a total global estimate of 40 zettabytes.

In this work, we have introduced data science as a new challenge and opportunity for the coming years. In this context, we have tried to summarize concisely several aspects of data science applications and their impact on society, considering both the new services available and the new job prospects. We have also introduced the issues involved in managing data that represent human behavior, and we have shown how difficult it is to preserve personal information and privacy. With the introduction of the SoBigData RI and its exploratories, we have provided virtual environments in which it is possible to understand the potential of data science in different research contexts.

Concluding, we can state that social dilemmas occur when there is a conflict between individual and public interest. Such problems also appear in the ecosystem of distributed AI systems (based on data science tools) and humans, with additional difficulties due, on the one hand, to the relative rigidity of trained AI systems and the necessity of achieving social benefit and, on the other hand, to the necessity of keeping individuals interested. What are the principles and solutions for individual versus social optimization using AI, and how can an optimal balance be achieved? The answer is still open, but these complex systems have to work towards fulfilling collective goals and requirements, with the challenge that human needs change over time and from one context to another. Every AI system should operate within an ethical and social framework in an understandable, verifiable, and justifiable way. Such systems must, in any case, work within the bounds of the rule of law, incorporating the protection of fundamental rights into the AI infrastructure. In other words, the challenge is to develop mechanisms that will result in the system converging to an equilibrium that complies with European values and social objectives (e.g., social inclusion), but without unnecessary losses of efficiency.

Interestingly, data science can play a vital role in enhancing desirable behaviors in the system, e.g., by supporting the coordination and cooperation that are, more often than not, crucial to achieving any meaningful improvement. Our ultimate goal is to build the blueprint of a sociotechnical system in which AI not only cooperates with humans but, if necessary, helps them to learn how to collaborate, along with other desirable behaviors. In this context, it is also essential to understand how to achieve robustness of human and AI ecosystems with respect to various types of malicious behavior, such as abuse of power and exploitation of AI technical weaknesses.

We conclude by paraphrasing Stephen Hawking in his Brief Answers to the Big Questions: the availability of data on its own will not take humanity to the future, but its intelligent and creative use will.

http://www.sdss3.org/collaboration/ .

e.g., https://www.nature.com/sdata/policies/repositories .

Responsible Data Science program: https://redasci.org/ .

https://emergency.copernicus.eu/ .

Nowcasting in economics is the prediction of the present, the very near future, and the very recent past state of an economic indicator.

https://www.unglobalpulse.org/ .

http://sobigdata.eu .

https://www.gcube-system.org/ .

https://sobigdata.d4science.org/catalogue-sobigdata .

http://www.sobigdata.eu/content/urban-mobility-atlas .

http://data.d4science.org/ctlg/ResourceCatalogue/geotopics_-_a_method_and_system_to_explore_urban_activity .

http://data.d4science.org/ctlg/ResourceCatalogue/myway_-_trajectory_prediction .

http://data.d4science.org/ctlg/ResourceCatalogue/tripbuilder .

Abitbol, J.L., Fleury, E., Karsai, M.: Optimal proxy selection for socioeconomic status inference on twitter. Complexity 2019, 6059673:1–6059673:15 (2019). https://doi.org/10.1155/2019/6059673


Amato, G., Candela, L., Castelli, D., Esuli, A., Falchi, F., Gennaro, C., Giannotti, F., Monreale, A., Nanni, M., Pagano, P., Pappalardo, L., Pedreschi, D., Pratesi, F., Rabitti, F., Rinzivillo, S., Rossetti, G., Ruggieri, S., Sebastiani, F., Tesconi, M.: How data mining and machine learning evolved from relational data base to data science. In: Flesca, S., Greco, S., Masciari, E., Saccà, D. (eds.) A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years, Studies in Big Data, vol. 31, pp. 287–306. Springer, Berlin (2018). https://doi.org/10.1007/978-3-319-61893-7_17


Andrienko, G.L., Andrienko, N.V., Budziak, G., Dykes, J., Fuchs, G., von Landesberger, T., Weber, H.: Visual analysis of pressure in football. Data Min. Knowl. Discov. 31 (6), 1793–1839 (2017). https://doi.org/10.1007/s10618-017-0513-2


Assante, M., Candela, L., Castelli, D., Cirillo, R., Coro, G., Frosini, L., Lelii, L., Mangiacrapa, F., Marioli, V., Pagano, P., Panichi, G., Perciante, C., Sinibaldi, F.: The gcube system: delivering virtual research environments as-a-service. Future Gener. Comput. Syst. 95 , 445–453 (2019). https://doi.org/10.1016/j.future.2018.10.035

Assante, M., Candela, L., Castelli, D., Cirillo, R., Coro, G., Frosini, L., Lelii, L., Mangiacrapa, F., Pagano, P., Panichi, G., Sinibaldi, F.: Enacting open science by d4science. Future Gener. Comput. Syst. (2019). https://doi.org/10.1016/j.future.2019.05.063

Barabasi, A.L., Gulbahce, N., Loscalzo, J.: Network medicine: a network-based approach to human disease. Nature reviews. Genetics 12 , 56–68 (2011). https://doi.org/10.1038/nrg2918

Candela, L., Castelli, D., Pagano, P.: Virtual research environments: an overview and a research agenda. Data Sci. J. 12 , GRDI75–GRDI81 (2013). https://doi.org/10.2481/dsj.GRDI-013

Coletto, M., Esuli, A., Lucchese, C., Muntean, C.I., Nardini, F.M., Perego, R., Renso, C.: Sentiment-enhanced multidimensional analysis of online social networks: perception of the mediterranean refugees crisis. In: Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM’16, pp. 1270–1277. IEEE Press, Piscataway, NJ, USA (2016). http://dl.acm.org/citation.cfm?id=3192424.3192657

Coscia, M., Rossetti, G., Giannotti, F., Pedreschi, D.: Uncovering hierarchical and overlapping communities with a local-first approach. TKDD 9 (1), 6:1–6:27 (2014). https://doi.org/10.1145/2629511

Cresci, S., Minutoli, S., Nizzoli, L., Tardelli, S., Tesconi, M.: Enriching digital libraries with crowdsensed data. In: P. Manghi, L. Candela, G. Silvello (eds.) Digital Libraries: Supporting Open Science—15th Italian Research Conference on Digital Libraries, IRCDL 2019, Pisa, Italy, 31 Jan–1 Feb 2019, Proceedings, Communications in Computer and Information Science, vol. 988, pp. 144–158. Springer (2019). https://doi.org/10.1007/978-3-030-11226-4_12

Cresci, S., Petrocchi, M., Spognardi, A., Tognazzi, S.: Better safe than sorry: an adversarial approach to improve social bot detection. In: P. Boldi, B.F. Welles, K. Kinder-Kurlanda, C. Wilson, I. Peters, W.M. Jr. (eds.) Proceedings of the 11th ACM Conference on Web Science, WebSci 2019, Boston, MA, USA, June 30–July 03, 2019, pp. 47–56. ACM (2019). https://doi.org/10.1145/3292522.3326030

Cresci, S., Pietro, R.D., Petrocchi, M., Spognardi, A., Tesconi, M.: Social fingerprinting: detection of spambot groups through dna-inspired behavioral modeling. IEEE Trans. Dependable Sec. Comput. 15 (4), 561–576 (2018). https://doi.org/10.1109/TDSC.2017.2681672

Furletti, B., Trasarti, R., Cintia, P., Gabrielli, L.: Discovering and understanding city events with big data: the case of rome. Information 8 (3), 74 (2017). https://doi.org/10.3390/info8030074

Garimella, K., De Francisci Morales, G., Gionis, A., Mathioudakis, M.: Reducing controversy by connecting opposing views. In: Proceedings of the 10th ACM International Conference on Web Search and Data Mining, WSDM’17, pp. 81–90. ACM, New York, NY, USA (2017). https://doi.org/10.1145/3018661.3018703

Giannotti, F., Trasarti, R., Bontcheva, K., Grossi, V.: Sobigdata: social mining & big data ecosystem. In: P. Champin, F.L. Gandon, M. Lalmas, P.G. Ipeirotis (eds.) Companion of the The Web Conference 2018 on The Web Conference 2018, WWW 2018, Lyon , France, April 23–27, 2018, pp. 437–438. ACM (2018). https://doi.org/10.1145/3184558.3186205

Grossi, V., Rapisarda, B., Giannotti, F., Pedreschi, D.: Data science at sobigdata: the european research infrastructure for social mining and big data analytics. I. J. Data Sci. Anal. 6 (3), 205–216 (2018). https://doi.org/10.1007/s41060-018-0126-x

Grossi, V., Romei, A., Ruggieri, S.: A case study in sequential pattern mining for it-operational risk. In: W. Daelemans, B. Goethals, K. Morik (eds.) Machine Learning and Knowledge Discovery in Databases, European Conference, ECML/PKDD 2008, Antwerp, Belgium, 15–19 Sept 2008, Proceedings, Part I, Lecture Notes in Computer Science, vol. 5211, pp. 424–439. Springer (2008). https://doi.org/10.1007/978-3-540-87479-9_46

Guidotti, R., Coscia, M., Pedreschi, D., Pennacchioli, D.: Going beyond GDP to nowcast well-being using retail market data. In: A. Wierzbicki, U. Brandes, F. Schweitzer, D. Pedreschi (eds.) Advances in Network Science—12th International Conference and School, NetSci-X 2016, Wroclaw, Poland, 11–13 Jan 2016, Proceedings, Lecture Notes in Computer Science, vol. 9564, pp. 29–42. Springer (2016). https://doi.org/10.1007/978-3-319-28361-6_3

Guidotti, R., Monreale, A., Nanni, M., Giannotti, F., Pedreschi, D.: Clustering individual transactional data for masses of users. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 Aug 2017, pp. 195–204. ACM (2017). https://doi.org/10.1145/3097983.3098034

Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., Pedreschi, D.: A survey of methods for explaining black box models. ACM Comput. Surv. 51 (5), 93:1–93:42 (2019). https://doi.org/10.1145/3236009

Guidotti, R., Monreale, A., Turini, F., Pedreschi, D., Giannotti, F.: A survey of methods for explaining black box models. CoRR abs/1802.01933 (2018). arxiv: 1802.01933

Guidotti, R., Nanni, M., Rinzivillo, S., Pedreschi, D., Giannotti, F.: Never drive alone: boosting carpooling with network analysis. Inf. Syst. 64 , 237–257 (2017). https://doi.org/10.1016/j.is.2016.03.006

Hilbert, M., Lopez, P.: The world’s technological capacity to store, communicate, and compute information. Science 332 (6025), 60–65 (2011)

Kennedy, C.A., Stewart, I., Facchini, A., Cersosimo, I., Mele, R., Chen, B., Uda, M., Kansal, A., Chiu, A., Kim, K.G., Dubeux, C., Lebre La Rovere, E., Cunha, B., Pincetl, S., Keirstead, J., Barles, S., Pusaka, S., Gunawan, J., Adegbile, M., Nazariha, M., Hoque, S., Marcotullio, P.J., González Otharán, F., Genena, T., Ibrahim, N., Farooqui, R., Cervantes, G., Sahin, A.D.: Energy and material flows of megacities. Proc. Nat. Acad. Sci. 112 (19), 5985–5990 (2015). https://doi.org/10.1073/pnas.1504315112

Korjani, S., Damiano, A., Mureddu, M., Facchini, A., Caldarelli, G.: Optimal positioning of storage systems in microgrids based on complex networks centrality measures. Sci. Rep. (2018). https://doi.org/10.1038/s41598-018-35128-6

Lorini, V., Castillo, C., Dottori, F., Kalas, M., Nappo, D., Salamon, P.: Integrating social media into a pan-european flood awareness system: a multilingual approach. In: Z. Franco, J.J. González, J.H. Canós (eds.) Proceedings of the 16th International Conference on Information Systems for Crisis Response and Management, València, Spain, 19–22 May 2019. ISCRAM Association (2019). http://idl.iscram.org/files/valeriolorini/2019/1854-_ValerioLorini_etal2019.pdf

Lulli, A., Gabrielli, L., Dazzi, P., Dell’Amico, M., Michiardi, P., Nanni, M., Ricci, L.: Scalable and flexible clustering solutions for mobile phone-based population indicators. Int. J. Data Sci. Anal. 4 (4), 285–299 (2017). https://doi.org/10.1007/s41060-017-0065-y

Moise, I., Gaere, E., Merz, R., Koch, S., Pournaras, E.: Tracking language mobility in the twitter landscape. In: C. Domeniconi, F. Gullo, F. Bonchi, J. Domingo-Ferrer, R.A. Baeza-Yates, Z. Zhou, X. Wu (eds.) IEEE International Conference on Data Mining Workshops, ICDM Workshops 2016, 12–15 Dec 2016, Barcelona, Spain., pp. 663–670. IEEE Computer Society (2016). https://doi.org/10.1109/ICDMW.2016.0099

Nanni, M.: Advancements in mobility data analysis. In: F. Leuzzi, S. Ferilli (eds.) Traffic Mining Applied to Police Activities—Proceedings of the 1st Italian Conference for the Traffic Police (TRAP-2017), Rome, Italy, 25–26 Oct 2017, Advances in Intelligent Systems and Computing, vol. 728, pp. 11–16. Springer (2017). https://doi.org/10.1007/978-3-319-75608-0_2

Nanni, M., Trasarti, R., Monreale, A., Grossi, V., Pedreschi, D.: Driving profiles computation and monitoring for car insurance crm. ACM Trans. Intell. Syst. Technol. 8 (1), 14:1–14:26 (2016). https://doi.org/10.1145/2912148

Pappalardo, G., di Matteo, T., Caldarelli, G., Aste, T.: Blockchain inefficiency in the bitcoin peers network. EPJ Data Sci. 7 (1), 30 (2018). https://doi.org/10.1140/epjds/s13688-018-0159-3

Pappalardo, L., Barlacchi, G., Pellungrini, R., Simini, F.: Human mobility from theory to practice: Data, models and applications. In: S. Amer-Yahia, M. Mahdian, A. Goel, G. Houben, K. Lerman, J.J. McAuley, R.A. Baeza-Yates, L. Zia (eds.) Companion of The 2019 World Wide Web Conference, WWW 2019, San Francisco, CA, USA, 13–17 May 2019., pp. 1311–1312. ACM (2019). https://doi.org/10.1145/3308560.3320099

Pappalardo, L., Cintia, P., Ferragina, P., Massucco, E., Pedreschi, D., Giannotti, F.: Playerank: data-driven performance evaluation and player ranking in soccer via a machine learning approach. ACM TIST 10 (5), 59:1–59:27 (2019). https://doi.org/10.1145/3343172

Pappalardo, L., Vanhoof, M., Gabrielli, L., Smoreda, Z., Pedreschi, D., Giannotti, F.: An analytical framework to nowcast well-being using mobile phone data. CoRR abs/1606.06279 (2016). arxiv: 1606.06279

Pasquale, F.: The Black Box Society: The Secret Algorithms That Control Money and Information. Harvard University Press, Cambridge (2015)


Piškorec, M., Antulov-Fantulin, N., Miholić, I., Šmuc, T., Šikić, M.: Modeling peer and external influence in online social networks: case of 2013 referendum in Croatia. In: Cherifi, C., Cherifi, H., Karsai, M., Musolesi, M. (eds.) Complex Networks & Their Applications VI. Springer, Cham (2018)


Ranco, G., Aleksovski, D., Caldarelli, G., Mozetic, I.: Investigating the relations between twitter sentiment and stock prices. CoRR abs/1506.02431 (2015). arxiv: 1506.02431

Ribeiro, M.T., Singh, S., Guestrin, C.: “why should I trust you?”: Explaining the predictions of any classifier. In: B. Krishnapuram, M. Shah, A.J. Smola, C.C. Aggarwal, D. Shen, R. Rastogi (eds.) Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 Aug 2016, pp. 1135–1144. ACM (2016). https://doi.org/10.1145/2939672.2939778

Ribeiro, M.T., Singh, S., Guestrin, C.: Anchors: high-precision model-agnostic explanations. In: S.A. McIlraith, K.Q. Weinberger (eds.) Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, 2–7 Feb 2018, pp. 1527–1535. AAAI Press (2018). https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16982

Rossetti, G., Milli, L., Rinzivillo, S., Sîrbu, A., Pedreschi, D., Giannotti, F.: Ndlib: a python library to model and analyze diffusion processes over complex networks. Int. J. Data Sci. Anal. 5 (1), 61–79 (2018). https://doi.org/10.1007/s41060-017-0086-6

Rossetti, G., Pappalardo, L., Pedreschi, D., Giannotti, F.: Tiles: an online algorithm for community discovery in dynamic social networks. Mach. Learn. 106 (8), 1213–1241 (2017). https://doi.org/10.1007/s10994-016-5582-8

Rossi, A., Pappalardo, L., Cintia, P., Fernández, J., Iaia, M.F., Medina, D.: Who is going to get hurt? predicting injuries in professional soccer. In: J. Davis, M. Kaytoue, A. Zimmermann (eds.) Proceedings of the 4th Workshop on Machine Learning and Data Mining for Sports Analytics co-located with 2017 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2017), Skopje, Macedonia, 18 Sept 2017., CEUR Workshop Proceedings, vol. 1971, pp. 21–30. CEUR-WS.org (2017). http://ceur-ws.org/Vol-1971/paper-04.pdf

Ruggieri, S., Pedreschi, D., Turini, F.: DCUBE: discrimination discovery in databases. In: A.K. Elmagarmid, D. Agrawal (eds.) Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, 6–10 June 2010, pp. 1127–1130. ACM (2010). https://doi.org/10.1145/1807167.1807298

Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR abs/1312.6034 (2013). http://dblp.uni-trier.de/db/journals/corr/corr1312.html#SimonyanVZ13

Smilkov, D., Thorat, N., Kim, B., Viégas, F.B., Wattenberg, M.: Smoothgrad: removing noise by adding noise. CoRR abs/1706.03825 (2017). arxiv: 1706.03825

Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. In: D. Precup, Y.W. Teh (eds.) Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 70, pp. 3319–3328. PMLR, International Convention Centre, Sydney, Australia (2017). http://proceedings.mlr.press/v70/sundararajan17a.html

Trasarti, R., Guidotti, R., Monreale, A., Giannotti, F.: Myway: location prediction via mobility profiling. Inf. Syst. 64 , 350–367 (2017). https://doi.org/10.1016/j.is.2015.11.002

Traub, J., Quiané-Ruiz, J., Kaoudi, Z., Markl, V.: Agora: Towards an open ecosystem for democratizing data science & artificial intelligence. CoRR abs/1909.03026 (2019). arxiv: 1909.03026

Vazifeh, M.M., Zhang, H., Santi, P., Ratti, C.: Optimizing the deployment of electric vehicle charging stations using pervasive mobility data. Transp Res A Policy Practice 121 (C), 75–91 (2019). https://doi.org/10.1016/j.tra.2019.01.002

Vermeulen, A.F.: Practical Data Science: A Guide to Building the Technology Stack for Turning Data Lakes into Business Assets, 1st edn. Apress, New York (2018)


Acknowledgements

This work is supported by the European Community’s H2020 Program under the scheme ‘INFRAIA-1-2014-2015: Research Infrastructures’, grant agreement #654024 ‘SoBigData: Social Mining and Big Data Ecosystem’, and the scheme ‘INFRAIA-01-2018-2019: Research and Innovation action’, grant agreement #871042 ‘SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics’.

Open access funding provided by Università di Pisa within the CRUI-CARE Agreement.

Author information

Authors and Affiliations

CNR - Istituto Scienza e Tecnologia dell’Informazione A. Faedo, KDDLab, Pisa, Italy

Valerio Grossi & Fosca Giannotti

Department of Computer Science, University of Pisa, Pisa, Italy

Dino Pedreschi

CNR - Istituto Scienza e Tecnologia dell’Informazione A. Faedo, NeMIS, Pisa, Italy

Paolo Manghi, Pasquale Pagano & Massimiliano Assante


Corresponding author

Correspondence to Dino Pedreschi .

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Grossi, V., Giannotti, F., Pedreschi, D. et al. Data science: a game changer for science and innovation. Int J Data Sci Anal 11 , 263–278 (2021). https://doi.org/10.1007/s41060-020-00240-2


Received : 13 July 2019

Accepted : 15 December 2020

Published : 19 April 2021

Issue Date : May 2021

DOI : https://doi.org/10.1007/s41060-020-00240-2


Keywords
  • Responsible data science
  • Research infrastructure
  • Social mining


Reproducibility and Scientific Integrity of Big Data Research in Urban Public Health and Digital Epidemiology: A Call to Action

Ana Cecilia Quiroga Gutierrez

1 Department of Health Sciences and Medicine, University of Lucerne, 6002 Luzern, Switzerland

Daniel J. Lindegger

2 Institute of Global Health, University of Geneva, 1211 Geneva, Switzerland

Ala Taji Heravi

3 CLEAR Methods Center, Department of Clinical Research, Division of Clinical Epidemiology, University Hospital Basel and University of Basel, 4031 Basel, Switzerland

Thomas Stojanov

4 Department of Orthopaedic Surgery and Traumatology, University Hospital of Basel, 4031 Basel, Switzerland

Martin Sykora

5 School of Business and Economics, Centre for Information Management, Loughborough University, Loughborough LE11 3TU, UK

Suzanne Elayan

Stephen J. Mooney

6 Department of Epidemiology, University of Washington, Seattle, WA 98195, USA

John A. Naslund

7 Department of Global Health and Social Medicine, Harvard Medical School, Boston, MA 02115, USA

Marta Fadda

8 Institute of Public Health, Università Della Svizzera Italiana, 6900 Lugano, Switzerland

Oliver Gruebner

9 Epidemiology, Biostatistics and Prevention Institute, University of Zurich, 8001 Zurich, Switzerland

10 Department of Geography, University of Zurich, 8057 Zurich, Switzerland

Associated Data

Not applicable.

The emergence of big data science presents a unique opportunity to improve public-health research practices. Because working with big data is inherently complex, big data research must be clear and transparent to avoid reproducibility issues and positively impact population health. Timely implementation of solution-focused approaches is critical as new data sources and methods take root in public-health research, including urban public health and digital epidemiology. This commentary highlights methodological and analytic approaches that can reduce research waste and improve the reproducibility and replicability of big data research in public health. The recommendations described in this commentary, including a focus on practices, publication norms, and education, are neither exhaustive nor unique to big data, but, nonetheless, implementing them can broadly improve public-health research. Clearly defined and openly shared guidelines will not only improve the quality of current research practices but also initiate change at multiple levels: the individual level, the institutional level, and the international level.

1. Introduction

Research comprises “creative and systematic work undertaken in order to increase the stock of knowledge” [ 1 , 2 ]. Research waste, or research whose results offer no social benefit [ 3 ], was characterized in a landmark series of papers in the Lancet in 2014 [ 4 , 5 ]. The underlying drivers of research waste range from methodological weaknesses in specific studies to systemic shortcomings within the broader research ecosystem, notably a reward system that incentivizes quantity over quality and the exploration of new hypotheses over the confirmation of old ones [ 4 , 5 , 6 , 7 , 8 ].

Published research that cannot be reproduced is wasteful due to doubts about its quality and reliability. Lack of reproducibility is a concern in all scientific research, and it is especially significant in the field of public health, where research aims to improve treatment practices and policies that have widespread implications. In this commentary, we highlight the urgency of improving norms for reproducibility and scientific integrity in urban public health and digital epidemiology and discuss potential approaches. We first discuss some examples of big data sources and their uses in urban public health, digital epidemiology, and other fields, and consider the limitations with the use of big data. We then provide an overview of relevant solutions to address the key challenges to reproducibility and scientific integrity. Finally, we consider some of their expected outcomes, challenges, and implications.

Unreliable research findings also represent a serious challenge in public-health research. While the peer-review process is designed to ensure the quality and integrity of scientific publications, its implementation varies between journals and disciplines and does not guarantee that the data used were properly collected or employed. As a result, reproducibility remains a challenge. This is also true in the context of the emerging field of big data science, largely because of the characteristics of big data, such as their volume, variety, and velocity, as well as the novelty of and excitement surrounding new data science methods, the lack of established reporting standards, and a nascent field that continues to change rapidly in parallel with new technological and analytic innovations. Recent reports have found that most research is not reproducible, with findings casting doubt on the scientific integrity of much of the current research landscape [ 6 , 9 , 10 , 11 , 12 ]. At the root of this reproducibility crisis lies the growing pressure to publish novel and, more importantly, statistically significant results at an accelerated pace [ 13 , 14 ], which encourages low standards of evidence and the disregard of pragmatic metrics, such as clinical or practical significance [ 15 ]. Consequently, the credibility of scientific findings is decreasing, potentially leading to cynicism about, or reputational damage to, the research community [ 16 , 17 ]. Addressing the reproducibility crisis is not only one step towards restoring the public’s trust in scientific research but also a necessary foundation for future research, for guiding evidence-based public-health initiatives and policies [ 18 ], for facilitating the translation and implementation of research findings [ 19 , 20 ], and for accelerating scientific discovery [ 21 ].

While failure to fully document the scientific steps taken in a research project is a fundamental challenge across all research, big data research is additionally burdened by the technical and computational complexities of handling and analysing large datasets. The challenge of ensuring computational capacity, including memory and processing power, to handle the data, as well as statistical and subject matter expertise accounting for data heterogeneity, can lead to reproducibility issues at a more pragmatic level. For example, large datasets derived from social media platforms require data analysis infrastructure, software, and technical skills, which are not always accessible to every research team [ 22 , 23 ]. Likewise, studies involving big data create new methodological challenges for researchers as the complexity for analysis and reporting increases [ 24 ]. This complexity not only requires sophisticated statistical skills but also new guidelines that define how data should be processed, shared, and communicated to guarantee reproducibility and maintain scientific integrity, while protecting private and sensitive information. Some of these challenges lie beyond the abilities and limitations of individual researchers and even institutions, requiring cultural and systematic changes to improve not only the reproducibility but also transparency and quality of big data research in public health.

Importantly, through concerted efforts and collaboration across disciplines, there are opportunities to systematically identify and address this reproducibility crisis and to specifically apply these approaches to big data research in public health. Below, we discuss methodological and analytical approaches to address the previously discussed issues, reduce waste, and improve the reproducibility and replicability of big data research in public health.

Specifically, we focus on approaches to improving reproducibility, which is distinct from replicability. While both are important with regard to research ethics, replicability means “obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data”, whereas reproducibility refers to “obtaining consistent results using the same input data, computational steps, methods and code, and conditions of analysis” [ 25 ]. Though we refer to “reproducibility” throughout this commentary, some of the arguments presented may apply to replicability as well. This is particularly true with regard to transparency when reporting sampling, data collection, aggregation, inference methods, and study context, all of which affect both replication and reproduction [ 26 ].

2. Big Data Sources and Uses in Urban Public Health and Digital Epidemiology

Big data, as well as relevant methods and analytical approaches, have gained increasing popularity in recent years. This is reflected in the growing number of publications and research studies that have implemented big data methods across a variety of fields and sectors, such as manufacturing [ 27 ], supply-chain management [ 28 ], sports [ 29 ], education [ 30 ], and public health [ 31 ].

Public health, including urban health and epidemiological research, is a field where studies increasingly rely on big data methods, such as in the relatively new field of digital epidemiology [ 32 ]. The use of big data in public-health research is often characterized by the ‘3Vs’: variety in the types of data as well as their purposes; volume, the amount of data; and velocity, the speed at which the data are generated [ 33 ]. Because sufficiently large datasets almost invariably produce statistically significant findings, while systematic biases are unaffected by data scale, big data studies are at greater risk of producing inaccurate results [ 34 , 35 , 36 , 37 ], as the short simulation below illustrates.
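The following simulation, with purely illustrative numbers, makes this point concrete: a negligible systematic offset between two otherwise identical populations becomes “highly significant” once n is large, because p-values shrink with sample size while the bias itself stays fixed.

```python
# Illustration (not from the cited studies): a negligible systematic bias
# becomes "statistically significant" at big-data scale, because p-values
# shrink with n while the bias stays the same size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
bias = 0.01                            # tiny systematic measurement offset
for n in (100, 10_000, 1_000_000):
    a = rng.normal(0.0, 1.0, n)        # truly identical populations...
    b = rng.normal(bias, 1.0, n)       # ...except for the small bias
    t, p = stats.ttest_ind(a, b)
    print(f'n={n:>9,}  p={p:.3g}')     # p plummets as n grows
```

The practical implication is that, at big-data scale, statistical significance says little about substantive importance; bias assessment and effect sizes must carry the interpretive weight.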

Big data sources that are used or could potentially be used in fields such as urban public health and digital epidemiology can be divided into two main categories: first, those that are collected or generated with health as a main focus and, second, those that are generated outside this scope but can be associated with, or have an impact on, public health ( Figure 1 ) [ 32 ].

Figure 1. Data sources used in urban public health and digital epidemiology research can broadly be organized along a continuum of the health orientation of the process that generated them.

Data sources generated within the context of public health include large datasets captured within health systems or government health services at the population level, such as the case of Electronic Health Records (EHRs), Electronic Medical Records (EMRs), or personal health records (PHRs) [ 38 ]. Other examples include pharmacy and insurance records, omics data, as well as data collected by sensors and devices that are part of the internet of things (IoT) and are used for health purposes, ranging from smart continuous glucose monitors (CGMs) [ 39 ] to activity and sleep trackers.

In contrast, big data sources generated outside the public-health scope are virtually unlimited and ever-growing, covering nearly all domains of society. As a result, we focus on a selected, non-exhaustive set of examples to illustrate the diverse sources of big data that are used, or could potentially be used, in urban public health and digital epidemiology. Notably, social media have become an important source of big data for research in different fields, including digital epidemiology. Twitter data have proven useful for collecting public-health information, for example, to measure mental health in different patient subgroups [ 40 ]. Examples of big data collected on Twitter that can be used in the context of public-health research are the Harvard CGA Geotweet Archive [ 41 ] and the University of Zurich Social Media Mental Health Surveillance project, with its Geotweet Repository for the wider European region [ 42 ]. Other initiatives, such as the SoBigData Research Infrastructure (RI), aim to foster reproducible and ethical research through the creation of a ‘Social Mining & Big Data Ecosystem’, allowing for the comparison, re-use, and integration of big data, methods, and services into research [ 22 ].

Cities increasingly use technological solutions, including IoT and multiple sensors, to monitor the urban environment, transitioning into Smart Cities with the objective of improving citizens’ quality of life [ 43 , 44 ]. Data stemming from Smart City applications have been used, for example, to predict air quality [ 45 ], analyse transportation to improve road safety [ 46 ], and have the potential to inform urban planning and policy design to build healthier and more sustainable cities [ 47 ].

Data mining techniques also allow for large datasets to be used in the context of urban public health and digital epidemiology. For example, a project using administrative data and data mining techniques in El Salvador identified anomalous spatiotemporal patterns of sexual violence and informed ways in which such analysis can be conducted in real time to allow for local law enforcement agencies and policy makers to respond appropriately [ 48 , 49 ]. Other large-dataset sources, such as transaction data [ 50 ], have been used to investigate the effect of sugar taxes [ 51 ] or labelling [ 52 ] on the consumption of healthy or unhealthy beverages and food products, which can eventually help model their potential impact on health outcomes.

3. Approaches to Improving Reproducibility and Scientific Integrity

Big data science has brought new challenges, to which the scientific community needs to adapt by applying adequate ethical, methodological, and technological frameworks to cope with the increasing amount of data produced [ 53 ]. As a result, the timely adoption of approaches that address reproducibility and scientific-integrity issues is imperative to ensure quality research and outcomes. Timely adoption is relevant not only for the scientific community but also for the general public, which can potentially benefit from the knowledge and advancements resulting from big data research. This is particularly important in the context of urban public health and digital epidemiology, as the use of big data in these fields can help answer highly relevant and pressing descriptive (what is happening), predictive (what could happen), and prescriptive (what should be done) research questions [ 54 ]. A brief summary of the main points discussed in this section can be found in Figure 2 . We divide our proposed solutions into three main domains: (1) good research practice, (2) scientific communication and publication, and (3) education.

Figure 2. Approaches that address good research practice, scientific communication, and education are important to improve reproducibility and scientific integrity.

3.1. Good Research Practice

Practices such as pre-registration of protocols, predefinition of research questions and hypotheses, public sharing of data analysis plans, and communication through reporting guidelines can improve the quality and reliability of research and results [ 55 , 56 ]. For experimental studies, clear and complete reporting and documentation are essential to allow reproduction. Observational studies can also be registered on well-established registries, such as clinicaltrials.gov. Importantly, pre-registration does not preclude publishing exploratory results; rather, it encourages such endeavours to be explicitly described as exploratory, with hypotheses and expected outcomes defined where appropriate [ 35 , 37 ].

Lack of data access is another key challenge to reproducibility. The adoption of open-science practices, including the sharing of data and code, represents a partial solution to this issue [ 57 , 58 ], acknowledging that not all data can be shared openly owing to privacy concerns. Similarly, transparent descriptions of data collection and analytic methods are necessary for reproduction [ 59 ]. For example, in the analysis of human mobility, which has applications in a wide range of fields including public health and digital epidemiology [ 60 , 61 ], the inference of ‘meaningful’ locations [ 62 ] from mobility data has been approached with a multitude of methods, some of which lack sufficient documentation. Whereas a research project using an undocumented method to identify subjects’ homes cannot be reproduced, a project using Chen and Poorthuis’s [ 63 ] R package ‘homelocator’, which is open source and freely available, could be.
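The ‘homelocator’ package itself is written in R; as a language-agnostic illustration of the simplest heuristic behind such tools, the Python sketch below assigns each user the modal location observed during assumed night-time hours. The night-time window and the column names are assumptions, and documented packages implement far richer, validated variants of this idea.

```python
# A sketch of a common heuristic behind home-location inference: the modal
# location observed during assumed night-time hours (22:00-06:00). Column
# names and the time window are illustrative assumptions.
import pandas as pd

pings = pd.DataFrame({
    'user': ['u1'] * 5 + ['u2'] * 4,
    'cell': ['a', 'a', 'b', 'a', 'c', 'x', 'y', 'y', 'y'],
    'timestamp': pd.to_datetime([
        '2023-01-01 23:10', '2023-01-02 01:30', '2023-01-02 14:00',
        '2023-01-03 00:45', '2023-01-03 12:00',
        '2023-01-01 22:00', '2023-01-02 02:15', '2023-01-02 23:50',
        '2023-01-03 03:05',
    ]),
})

night = pings[(pings.timestamp.dt.hour >= 22) | (pings.timestamp.dt.hour < 6)]
homes = (night.groupby('user')['cell']
              .agg(lambda s: s.mode().iloc[0]))  # modal night-time location
print(homes)  # u1 -> 'a', u2 -> 'y'
```

The point for reproducibility is not the heuristic itself but that every such choice (time window, minimum ping count, tie-breaking) must be documented or, better, encapsulated in openly shared code.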

Likewise, a case could be made for collaboratively sharing big data within research networks and IT infrastructures. An example of a project tackling this issue in the context of public health is currently being developed by the Swiss Learning Health System (SLHS), which focuses on the design and implementation of a metadata repository with the goal of developing Integrated Health Information Systems (HISs) in the Swiss context [ 64 , 65 ]. The implementation of such repositories and data-management systems allows for the retrieval of and access to information; nevertheless, as information systems develop, new challenges arise, particularly with regard to infrastructure as well as legal and ethical issues such as data privacy. Solutions are currently in development, and it is likely that decentralised data architectures based on blockchain will play an important role in integrated care and health information models [ 66 ]. We briefly expand on this topic in the Anticipated Challenges section below.

The adoption of appropriate big data handling techniques and analytical methods is also important to ensure the findability, accessibility, interoperability, and reusability (FAIR) [ 67 ] of both data and research outcomes [ 68 ]. Such characteristics allow for different stakeholders to use and reuse data and research outcomes for further research, replication, or even implementation purposes.

Complete and standardised reporting of aspects discussed in this section, for instance, in Reproducibility Network Groups, allows for meta-research and meta-analyses, the detection and minimization of publication bias, and the evaluation of the adherence of researchers to guidelines focused on ensuring scientific integrity. The use of checklists by individual researchers, research groups, departments, or even institutions can motivate the implementation of good research practices as well as clear and transparent reporting, ultimately improving research integrity [ 69 ]. Such checklists can serve as training tools for younger researchers, as well as offer practice guidelines to ensure quality research.

Senior researchers and research institutions are vital when it comes to tackling these challenges as well. The adoption of principles for research conduct, such as the Hong Kong principles, can help minimise the use of questionable research practices [ 70 ]. These principles are to: (1) assess responsible research practices; (2) value complete reporting; (3) reward the practice of open science; (4) acknowledge a broad range of research activities; and (5) recognise essential other tasks, such as peer review and mentoring [ 71 ]. The promotion of these principles by mentors and institutions is a cornerstone of good research practices for younger researchers.

3.2. Scientific Communication

Scientific communication, not only between researchers but also between institutions, should be promoted. Recently, requirements for researchers to make data public or open source have grown popular among journals and major funding agencies in the US, Europe, and globally; this is an important catalyst for open science and addressing issues such as reproducibility [ 72 ].

Likewise, the publication and sharing of protocols, data, code, analyses, and tools are important. This not only facilitates reproducibility but also promotes openness and transparency [ 73 ]. For example, the Journal of Memory and Language adopted a mandatory data-sharing policy in 2019. An evaluation of this policy found that data sharing increased by more than 50% and that the strongest predictor of reproducibility was the sharing of analysis code, which increased the probability of reproducibility by 40% [ 57 ]. Such practices are also fostered by the creation and use of infrastructure, such as the aforementioned SoBigData, and of reproducibility network groups, such as the Swiss Reproducibility Network, a peer-led group that aims to improve both replicability and reproducibility [ 74 ], to improve communication and collaboration, and to encourage the use of rigorous research practices.

When publishing or communicating their work, researchers should also keep in mind that transparency about whether studies are exploratory (hypothesis forming) or confirmatory (hypothesis testing) is important for distinguishing the formation of new hypotheses from the testing of existing ones [ 75 ]; this is particularly important for informing future research. Journal reviewers and referees should also encourage researchers to report this accurately.

Similarly, when publishing results, the quality, impact, and relevance of a publication should be valued more than scores, such as the impact factor, to avoid “publishing for numbers” [ 76 ]. This would, of course, require a shift in the priorities and views shared within the research community and may be a challenging change to effect.

Academic editors can also play an important role by avoiding practices such as ‘cherry-picking’ publications, whether on the basis of the statistical significance of results or the notoriety of the authors. Instead, practical significance, topic relevance, and replication studies should be important factors to consider, as should valuing the reporting of negative results. It is important to acknowledge, though, that scientific publication structures face a considerable number of challenges that hinder the implementation of these practices; some of these points are mentioned in the Challenges section that follows.

3.3. Education

Academic institutions have the responsibility to educate researchers in an integral way, covering not only the correct implementation of methodological approaches and appropriate reporting but also how to conduct research in an ethical way.

First, competence and capacity building should be addressed explicitly through courses, workshops, and competence-building programs aimed at developing technical skills, good research practices, and adequate application of methods and analytical tools. Other activities such as journal clubs can allow researchers to exchange and become familiar with different methodologies, stay up to date with current knowledge and ongoing research, and develop critical thinking skills [ 77 , 78 ], while fostering a mindset for continuous growth and improvement.

Second, by incorporating practice-based education, particularly with research groups that already adhere to best practices, such as the Hong Kong principles, institutions can foster norms valuing reproducibility implicitly as an aspect of researcher education.

4. Expected Outcomes

Ideally, successful implementation of the approaches proposed in Figure 2 , together with methodological and analytical tools such as the standardised protocols suggested by Simera et al. [ 55 ] and the EQUATOR Network reporting guidelines [ 79 ], can lead to a cultural shift in the research community. This, in turn, can enhance the transparency and quality of public-health research using big data by fostering interdisciplinary programs and worldwide cooperation among different health-related stakeholders, such as researchers, policy makers, clinicians, providers, and the public. Improving research quality can increase its value and reliability while decreasing research waste, thus improving the cost–value ratio and trust between stakeholders [ 80 , 81 ] and, as previously stated, facilitating the translation and implementation of research findings [ 18 ].

Just as replicability is fundamental in engineering for creating functioning and reliable products or systems, it is also necessary for modelling and simulation in the fields of urban public health and digital epidemiology [ 82 ]. Simulation approaches built upon reproducible research allow for the construction of accurate prediction models with important implications for healthcare [ 83 ] and public health [ 84 ]. In the same way, the reproduction and replication of results are essential for model validation [ 85 , 86 , 87 ].

The importance of reducing research waste and ensuring the value of health-related research is reflected in the existence of initiatives, such as the AllTrials Campaign, EQUATOR (enhancing the quality and transparency of health research), and EVBRES (evidence-based research), which promote protocol registration, full methods, and result reporting, and new studies that build on an existing evidence base [ 79 , 88 , 89 , 90 ].

Changes in editorial policies and practices can improve critical reflection on research quality by the authors. Having researchers, editors, and reviewers use guidelines [ 91 ], such as ARRIVE [ 92 ] in the case of pre-clinical animal studies or STROBE [ 93 ] for observational studies in epidemiology, can significantly improve reporting and transparency. For example, an observational cohort study analysing the effects of a change in the editorial policy of Nature , which introduced a checklist for manuscript preparation, demonstrated that reporting risk of bias improved substantially as a consequence [ 94 ].

A valuable outcome of adopting open science approaches that could result in improved communication, shared infrastructure, open data, and collaboration between researchers and even institutions is the implementation of competitions, challenges, or even ‘hackathons’. These events are already common among other disciplines, such as computer science, the digital tech sector, and social media research, and are becoming increasingly popular in areas related to public health. Some examples include the Big Data Hackathon San Diego, where the theme for 2022 was ‘Tackling Real-world Challenges in Healthcare’ [ 95 ], and the Yale CBIT Healthcare Hackathon of 2021, which aimed to build solutions to challenges faced in healthcare [ 96 ]. In addition to tackling issues in innovative ways, hackathons and other similar open initiatives invite the public to learn about and engage with science [ 97 ] and can be powerful tools for engaging diverse stakeholders and training beyond the classroom [ 98 ].

5. Anticipated Challenges

While the implementation of the approaches discussed ( Figure 2 ) will ideally translate to a significant reduction in research waste and improvement in scientific research through standardization and transparency, there are also substantial challenges to consider ( Figure 3 ).

Figure 3. Examples of challenges to expect when implementing approaches aimed at improving reproducibility.

First, not all researchers have adequate resources or opportunities to take advantage of new data that can be used to prevent, monitor, and improve population health. Early-career researchers in low-resource settings may be at a particular disadvantage. For these researchers, the barriers to accessing and adequately using big data may be not only financial, when funding is unavailable, but also technical, when the required knowledge and tools are unavailable.

Similarly, events and activities among young researchers can facilitate technical development, networking, and knowledge acquisition, ultimately improving research quality and outcomes. Those who live in environments with limited resources, who are physically isolated, or have limited mobility may not have access to these opportunities. It might be possible to overcome some of these limitations with accessible digital solutions.

Much-needed shifts in the research and publishing culture are held back by structures that currently enable Questionable Research Practices (QRPs), such as cherry picking (presenting favourable evidence or results while hiding unfavourable ones), p-hacking (misusing data through relentless analysis in order to obtain statistically significant results), and HARKing (Hypothesizing After the Results are Known), among others [ 59 , 99 , 100 ]. To overcome these challenges embedded in modern-day research, it is necessary to educate researchers about the scope of misconduct, create structures that prevent it from happening, and scrutinize cases in which such instances appear in order to determine the actual motive [ 101 ].

Conventional data storage and handling strategies are not sufficient when working with big data, as these data often impose additional monetary and computational costs. Some solutions are available to tackle these issues, such as cloud computing platforms that allow end users to access shared resources over the internet [ 102 ]; Data Lakes, centralized repositories that allow for data storage and analysis [ 103 ]; and Data Mesh, a platform architecture that distributes data among several nodes [ 104 ]. Unfortunately, these solutions are not always easily accessible. Additionally, their use has given rise to important debates concerning issues such as data governance and security [ 105 ].

The use of big data, and especially the use of personal and health information, raises privacy issues. The availability of personal and health information that results from the digital transformation represents a constant challenge when it comes to drawing a line between public and private, sensitive and non-sensitive information, and adherence to ethical research practices [ 106 ].

Ethical concerns are not limited to privacy. Big data entails the use of increasingly complex analytical methods that require expertise to deal with noise and uncertainty, and several additional factors may affect the accuracy of research results [ 107 ]. For example, when using machine learning approaches to analyse big data, methods should be chosen cautiously to avoid issues such as undesired data-driven variable selection, algorithmic biases, and overfitting of the analytic models [ 108 ]; the sketch below illustrates one such pitfall. Complexity also increases the need for collaboration, which makes “team science” and other collaborative problem-solving events (such as hackathons) increasingly popular. This, in turn, leads to new requirements to adequately value and acknowledge contributorship [ 109 ].
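The sketch below makes one of these pitfalls concrete: performing data-driven feature selection on the full dataset before cross-validation leaks label information and inflates apparent accuracy, whereas keeping selection inside each training fold yields an honest, near-chance estimate on pure-noise data. The dataset dimensions and model are illustrative assumptions.

```python
# Illustration of one pitfall named above: selecting features on the full
# dataset *before* cross-validation leaks information and inflates accuracy.
# Features are pure noise, so honest accuracy should hover around 0.5.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2000))      # noise features
y = rng.integers(0, 2, 200)           # labels independent of X

# Wrong: feature selection sees all labels before the folds are split.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_sel, y).mean()

# Right: selection happens inside each training fold via a pipeline.
pipe = make_pipeline(SelectKBest(f_classif, k=20),
                     LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X, y).mean()

print(f'leaky CV accuracy:  {leaky:.2f}')   # optimistically high
print(f'honest CV accuracy: {honest:.2f}')  # near chance (0.5)
```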

As statistical methods become more complex and the quantity of data grows, the number of scientific publications is also increasing, making it challenging for already-flawed peer-review systems to keep up by providing high-quality reviews of increasingly complex research. Currently, there are four main expectations of peer-review processes: (i) assure the quality and accuracy of research, (ii) establish a hierarchy of published work, (iii) provide fair and equal opportunities, and (iv) assure fraud-free research [ 110 ]; however, it is not certain whether current peer-review procedures meet, or are capable of meeting, these expectations. Some solutions have been proposed to address these issues, such as the automation of peer-review processes [ 111 ] and the implementation of open review guidelines [ 112 , 113 , 114 ].

6. Conclusions

Big data research presents a unique opportunity for a cultural shift in the way public-health research is conducted today. At the same time, big data will only have a beneficial impact on the field if used adequately, with appropriate measures taken so that its full potential can be harnessed. The inherent complexity of working with large quantities of data requires a clear and transparent framework at multiple levels, ranging from the protocols and methods used by individual scientists to institutions’ guiding principles, research, and publishing practices.

The solutions summarized in this commentary are aimed at enhancing results, reproducibility, and scientific integrity; however, we acknowledge that these solutions are not exhaustive and that there may be many other promising approaches to improve the integrity of big data research as it applies to public health. The solutions described here are in line with “a manifesto for reproducible science” published in Nature Human Behaviour [ 101 ]. Importantly, reproducibility is only of value if the findings are expected to have an important impact on science, health, and society. Reproducibility of results is highly relevant for funding agencies and governments, who often recognize the importance of research projects with well-structured study designs, defined data-processing steps, and transparent analysis plans (e.g., statistical analysis plans) [ 115 , 116 ]. For imaging data, such as radiologic images, analysis pipelines have been shown to be suitable for structuring the analysis pathway [ 117 ]. This is especially important for big data analysis, where interdisciplinarity and collaboration become increasingly important. The development and use of statistical and reporting guidelines support researchers in making their projects more reproducible [ 118 ].

Transparency in all the study-design steps (i.e., from hypothesis generation to availability of collected data and code) is specifically relevant for public health and epidemiological research in order to encourage funding agencies, the public, and other researchers and relevant stakeholders to trust research results [ 119 ]. Similarly, as globalization and digitalization increase the diffusion of infectious diseases [ 120 ] and behavioural risks [ 121 ], research practices that foster reproducible results are imperative to implement and diffuse interventions more swiftly.

We believe that the recommendations outlined in this commentary are not unique to big data and that the entire research community could benefit from these approaches [ 122 , 123 , 124 , 125 , 126 , 127 ]. However, what has been detailed here is specifically pertinent to big data, as an increase in the volume and complexity of the data produced requires more structured data handling to avoid research waste. With clearly defined and openly shared guidelines, we may strengthen the quality of current research and initiate a shift at multiple levels: individual, institutional, and international. Some challenges are to be expected, particularly in finding the right incentives for these changes to stick. Nevertheless, we are confident that, with the right effort, we can put scientific integrity back at the forefront of researchers’ minds and ultimately strengthen the population’s trust in public-health research, particularly research that leverages big data for urban public health and digital epidemiology.

The timely implementation of these solutions is highly relevant, not only to ensure the quality of research and scientific output, but also to enable the use of data sources that originated without public health in mind, spanning various fields relevant to urban public health and digital epidemiology. As outlined in this commentary, such data can originate from multiple sources, such as social media, mobile technologies, urban sensors, and GIS, to name a few. As these data sources grow and become more readily available, it is important for researchers and the scientific community to be prepared to use these valuable and diverse data sources in innovative ways to advance research and practice. This would allow for the expanded use of big data to inform evidence-based decision making that positively impacts public health.

Acknowledgments

We are grateful for the support of the Swiss School of Public Health (SSPH+) and, in particular, to all lecturers and participants of the class of 2021 Big Data in Public Health course. We would also like to extend our gratitude to the reviewers for their excellent feedback and suggestions.

Funding Statement

Swiss School of Public Health (SSPH+) to O.G. This commentary is an outcome of an SSPH+ PhD course on Big Data in Public Health (website: https://ssphplus.ch/en/graduate-campus/en/graduate-campus/course-program/ (accessed on 10 October 2022)).

Author Contributions

Conceptualization, A.C.Q.G., D.J.L., A.T.H., T.S. and O.G.; writing—original draft preparation, A.C.Q.G., D.J.L., A.T.H. and T.S.; writing—review and editing, A.C.Q.G., D.J.L., M.S., S.E., S.J.M., J.A.N., M.F. and O.G.; supervision, O.G. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement, Informed Consent Statement, Data Availability Statement, Conflicts of Interest

The authors declare no conflict of interest.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Data Science

Research Areas


The world is being transformed by data, and data-driven analysis is rapidly becoming an integral part of science and society. Stanford Data Science is a collaborative effort across many departments in all seven schools. We strive to unite existing data science research initiatives and create interdisciplinary collaborations, connecting data science and related methodologists with the disciplines that are being transformed by data science and computation.

Our work supports research in a variety of fields where incredible advances are being made by facilitating meaningful collaborations between domain researchers, who have deep expertise in societal and fundamental research challenges, and methods researchers who are developing next-generation computational tools and techniques, including:

Data Science for Wildland Fire Research

In recent years, wildfire has gone from an infrequent and distant news item to a center-stage issue spanning many consecutive weeks for urban and suburban communities. Frequent wildfires are changing everyday life in California in numerous ways, from public safety power shutoffs to hazardous air quality, that seemed inconceivable as recently as 2015. Moreover, elevated wildfire risk in the western United States (and similar climates globally) is here to stay for the foreseeable future. There is a plethora of problems that need solutions in the wildland fire arena; many of them are well suited to a data-driven approach.


Data Science for Physics

Astrophysicists and particle physicists at Stanford and at the SLAC National Accelerator Laboratory are deeply engaged in studying the Universe at both the largest and smallest scales, with state-of-the-art instrumentation at telescopes and accelerator facilities.

Data Science for Economics

Many of the most pressing questions in empirical economics are causal questions, such as the short- and long-run impact of educational choices on labor market outcomes, and of economic policies on distributions of outcomes. This makes them conceptually quite different from the predictive questions that many of the recently developed methods in machine learning are primarily designed for.

Data Science for Education

Educational data spans K-12 school and district records, digital archives of instructional materials and gradebooks, and student responses on course surveys. Data science applied to actual classroom interaction is also of growing interest and increasingly a reality.

Data Science for Human Health

It is clear that data science will be a driving force in transitioning the world’s healthcare systems from reactive “sick-based” care to proactive, preventive care.

Data Science for Humanity

Our modern era is characterized by massive amounts of data documenting the behaviors of individuals, groups, organizations, cultures, and indeed entire societies. This wealth of data on modern humanity is accompanied by massive digitization of historical data, both textual and numeric, in the form of historic newspapers, literary and linguistic corpora, economic data, censuses, and other government data, gathered and preserved over centuries, and newly digitized, acquired, and provisioned by libraries, scholars, and commercial entities.

Data Science for Linguistics

The impact of data science on linguistics has been profound. All areas of the field depend on having a rich picture of the true range of variation, within dialects, across dialects, and among different languages. The subfield of corpus linguistics is arguably as old as the field itself and, with the advent of computers, gave rise to many core techniques in data science.

Data Science for Nature and Sustainability

Many key sustainability issues translate into decision and optimization problems and could greatly benefit from data-driven decision-making tools. In fact, the impact of modern information technology has been highly uneven, mainly benefiting large firms in profitable sectors, with little or no benefit in terms of the environment. Our vision is that data-driven methods can, and should, play a key role in increasing the efficiency and effectiveness of the way we manage and allocate our natural resources.

Ethics and Data Science

With the emergence of new techniques of machine learning, and the possibility of using algorithms to perform tasks previously done by human beings, as well as to generate new knowledge, we again face a set of new ethical questions.

The Science of Data Science

The practice of data analysis has changed enormously. Data science needs to find new inferential paradigms that allow data exploration prior to the formulation of hypotheses.


Do scientists respond faster than Google trends in discussing COVID-19 issues? A new approach to textual big data

A study in Health Data Science introduces an advanced research framework to dissect the vast textual landscape surrounding COVID-19. This methodology leverages keywords from Google Trends alongside research abstracts from the WHO COVID-19 database, offering a nuanced understanding of the pandemic's discourse dynamics.

Throughout the pandemic, research has underpinned effective policy-making, yet tools like Google Trends (GT) often miss critical nuances captured in scholarly investigations. This study juxtaposes GT data with academic research, illuminating the breadth and depth of scientific engagement with COVID-19 issues compared to public inquiries.

Benson Shu Yan Lam, Associate Professor at The Hang Seng University of Hong Kong, highlights the endeavor to scrutinize the timeliness and interconnectivity of these data repositories, assessing whether GT can reliably signal emerging public concerns or if academic discourses provide earlier, more substantive insights.

A significant finding, as outlined by Amanda Man Ying Chu, Assistant Professor at The Education University of Hong Kong, is the academic community's precedence in addressing COVID-19 topics. Research abstracts not only broached these subjects before they surfaced in GT searches but also delved deeper, offering a richer, more detailed perspective beneficial for policy formulation.

The study also introduces the Coherent Topic Clustering (CTC) method, an innovative text-mining approach that effectively organizes key phrases from voluminous research abstracts, outperforming BERTopic, a contemporary deep learning-based model, in extracting pertinent themes.
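For orientation only: the CTC algorithm itself is not detailed in this summary, so the sketch below shows a generic phrase-clustering baseline (TF-IDF vectors plus k-means) of the kind that topic-mining methods such as CTC and BERTopic are typically benchmarked against. The toy abstracts are invented stand-ins.

```python
# Generic topic-clustering baseline, NOT the paper's CTC method: group short
# texts by TF-IDF similarity using k-means. Toy data for demonstration only.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

abstracts = [
    "vaccine efficacy against severe covid-19 outcomes",
    "mrna vaccine booster immune response in adults",
    "school closures and remote learning during the pandemic",
    "impact of lockdown policies on education outcomes",
]

X = TfidfVectorizer(ngram_range=(1, 2), stop_words="english").fit_transform(abstracts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for label, text in sorted(zip(labels, abstracts)):
    print(label, "|", text)
```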

Looking ahead, Mike Ka Pui So, Professor at The Hong Kong University of Science and Technology, envisions broadening this framework's utility beyond health science, particularly into financial news analysis. This expansion underscores the framework's capacity to meld qualitative and quantitative insights, a crucial frontier in contemporary financial scrutiny.

Reflecting on the implications, Professor So elucidates the prospective integration of diverse data types, marrying textual analysis with traditional numerical datasets to enrich financial analytics, demonstrating the research's interdisciplinary potential.

More information: Benson Shu Yan Lam et al, Do Scholars Respond Faster Than Google Trends in Discussing COVID-19 Issues? An Approach to Textual Big Data, Health Data Science (2024). DOI: 10.34133/hds.0116

Provided by Health Data Science

Network diagram of the GT keywords. Credit: Health Data Science (2024). DOI: 10.34133/hds.0116


There is unequivocal evidence that Earth is warming at an unprecedented rate. Human activity is the principal cause.


  • While Earth’s climate has changed throughout its history, the current warming is happening at a rate not seen in the past 10,000 years.
  • According to the Intergovernmental Panel on Climate Change (IPCC), "Since systematic scientific assessments began in the 1970s, the influence of human activity on the warming of the climate system has evolved from theory to established fact." 1
  • Scientific information taken from natural sources (such as ice cores, rocks, and tree rings) and from modern equipment (like satellites and instruments) shows the signs of a changing climate.
  • From global temperature rise to melting ice sheets, the evidence of a warming planet abounds.

The rate of change since the mid-20th century is unprecedented over millennia.

Earth's climate has changed throughout history. Just in the last 800,000 years, there have been eight cycles of ice ages and warmer periods, with the end of the last ice age about 11,700 years ago marking the beginning of the modern climate era — and of human civilization. Most of these climate changes are attributed to very small variations in Earth’s orbit that change the amount of solar energy our planet receives.


The current warming trend is different because it is clearly the result of human activities since the mid-1800s, and is proceeding at a rate not seen over many recent millennia. 1 It is undeniable that human activities have produced the atmospheric gases that have trapped more of the Sun’s energy in the Earth system. This extra energy has warmed the atmosphere, ocean, and land, and widespread and rapid changes in the atmosphere, ocean, cryosphere, and biosphere have occurred.

Earth-orbiting satellites and new technologies have helped scientists see the big picture, collecting many different types of information about our planet and its climate all over the world. These data, collected over many years, reveal the signs and patterns of a changing climate.

Scientists demonstrated the heat-trapping nature of carbon dioxide and other gases in the mid-19th century. 2 Many of the science instruments NASA uses to study our climate focus on how these gases affect the movement of infrared radiation through the atmosphere. The measured impacts of increases in these gases leave no question that increased greenhouse gas levels warm Earth in response.

Scientific evidence for warming of the climate system is unequivocal.


Intergovernmental Panel on Climate Change

Ice cores drawn from Greenland, Antarctica, and tropical mountain glaciers show that Earth’s climate responds to changes in greenhouse gas levels. Ancient evidence can also be found in tree rings, ocean sediments, coral reefs, and layers of sedimentary rocks. This ancient, or paleoclimate, evidence reveals that current warming is occurring roughly 10 times faster than the average rate of warming after an ice age. Carbon dioxide from human activities is increasing about 250 times faster than it did from natural sources after the last Ice Age. 3

The Evidence for Rapid Climate Change Is Compelling:


Global Temperature Is Rising

The planet's average surface temperature has risen about 2 degrees Fahrenheit (1 degree Celsius) since the late 19th century, a change driven largely by increased carbon dioxide emissions into the atmosphere and other human activities. 4 Most of the warming occurred in the past 40 years, with the seven most recent years being the warmest. The years 2016 and 2020 are tied for the warmest year on record. 5 Image credit: Ashwin Kumar, Creative Commons Attribution-Share Alike 2.0 Generic.


The Ocean Is Getting Warmer

The ocean has absorbed much of this increased heat, with the top 100 meters (about 328 feet) of ocean showing warming of 0.67 degrees Fahrenheit (0.33 degrees Celsius) since 1969. 6 Earth stores 90% of the extra energy in the ocean. Image credit: Kelsey Roberts/USGS


The Ice Sheets Are Shrinking

The Greenland and Antarctic ice sheets have decreased in mass. Data from NASA's Gravity Recovery and Climate Experiment show Greenland lost an average of 279 billion tons of ice per year between 1993 and 2019, while Antarctica lost about 148 billion tons of ice per year. 7 Image: The Antarctic Peninsula, Credit: NASA


Glaciers Are Retreating

Glaciers are retreating almost everywhere around the world — including in the Alps, Himalayas, Andes, Rockies, Alaska, and Africa. 8 Image: Miles Glacier, Alaska Image credit: NASA


Snow Cover Is Decreasing

Satellite observations reveal that the amount of spring snow cover in the Northern Hemisphere has decreased over the past five decades and the snow is melting earlier. 9 Image credit: NASA/JPL-Caltech


Sea Level Is Rising

Global sea level rose about 8 inches (20 centimeters) in the last century. The rate in the last two decades, however, is nearly double that of the last century and accelerating slightly every year. 10 Image credit: U.S. Army Corps of Engineers Norfolk District


Arctic Sea Ice Is Declining

Both the extent and thickness of Arctic sea ice have declined rapidly over the last several decades. 11 Credit: NASA's Scientific Visualization Studio


Extreme Events Are Increasing in Frequency

The number of record high temperature events in the United States has been increasing, while the number of record low temperature events has been decreasing, since 1950. The U.S. has also witnessed increasing numbers of intense rainfall events. 12 Image credit: Régine Fabri,  CC BY-SA 4.0 , via Wikimedia Commons


Ocean Acidification Is Increasing

Since the beginning of the Industrial Revolution, the acidity of surface ocean waters has increased by about 30%. 13, 14 This increase is due to humans emitting more carbon dioxide into the atmosphere and hence more being absorbed into the ocean. The ocean has absorbed between 20% and 30% of total anthropogenic carbon dioxide emissions in recent decades (7.2 to 10.8 billion metric tons per year). 15, 16 Image credit: NOAA

1. IPCC Sixth Assessment Report, WGI, Technical Summary. B.D. Santer et al., “A search for human influences on the thermal structure of the atmosphere.” Nature 382 (04 July 1996): 39-46. https://doi.org/10.1038/382039a0. Gabriele C. Hegerl et al., “Detecting Greenhouse-Gas-Induced Climate Change with an Optimal Fingerprint Method.” Journal of Climate 9 (October 1996): 2281-2306. https://doi.org/10.1175/1520-0442(1996)009<2281:DGGICC>2.0.CO;2. V. Ramaswamy et al., “Anthropogenic and Natural Influences in the Evolution of Lower Stratospheric Cooling.” Science 311 (24 February 2006): 1138-1141. https://doi.org/10.1126/science.1122587. B.D. Santer et al., “Contributions of Anthropogenic and Natural Forcing to Recent Tropopause Height Changes.” Science 301 (25 July 2003): 479-483. https://doi.org/10.1126/science.1084123. T. Westerhold et al., "An astronomically dated record of Earth’s climate and its predictability over the last 66 million years." Science 369 (11 Sept. 2020): 1383-1387. https://doi.org/10.1126/science.1094123

2. In 1824, Joseph Fourier calculated that an Earth-sized planet, at our distance from the Sun, ought to be much colder. He suggested something in the atmosphere must be acting like an insulating blanket. In 1856, Eunice Foote discovered that blanket, showing that carbon dioxide and water vapor in Earth's atmosphere trap escaping infrared (heat) radiation. In the 1860s, physicist John Tyndall recognized Earth's natural greenhouse effect and suggested that slight changes in the atmospheric composition could bring about climatic variations. In 1896, a seminal paper by Swedish scientist Svante Arrhenius first predicted that changes in atmospheric carbon dioxide levels could substantially alter the surface temperature through the greenhouse effect. In 1938, Guy Callendar connected carbon dioxide increases in Earth’s atmosphere to global warming. In 1941, Milutin Milankovic linked ice ages to Earth’s orbital characteristics. Gilbert Plass formulated the Carbon Dioxide Theory of Climate Change in 1956.

3. IPCC Sixth Assessment Report, WG1, Chapter 2; Vostok ice core data; NOAA Mauna Loa CO2 record. O. Gaffney, W. Steffen, "The Anthropocene Equation." The Anthropocene Review 4, issue 1 (April 2017): 53-61. https://doi.org/10.1177/2053019616688022.

4. https://www.ncei.noaa.gov/monitoring https://crudata.uea.ac.uk/cru/data/temperature/ http://data.giss.nasa.gov/gistemp

5. https://www.giss.nasa.gov/research/news/20170118/

6. S. Levitus, J. Antonov, T. Boyer, O. Baranova, H. Garcia, R. Locarnini, A. Mishonov, J. Reagan, D. Seidov, E. Yarosh, M. Zweng, "NCEI ocean heat content, temperature anomalies, salinity anomalies, thermosteric sea level anomalies, halosteric sea level anomalies, and total steric sea level anomalies from 1955 to present calculated from in situ oceanographic subsurface profile data (NCEI Accession 0164586), Version 4.4." (2017) NOAA National Centers for Environmental Information. https://www.nodc.noaa.gov/OC5/3M_HEAT_CONTENT/index3.html. K. von Schuckmann, L. Cheng, D. Palmer, J. Hansen, C. Tassone, V. Aich, S. Adusumilli, H. Beltrami, T. Boyer, F. Cuesta-Valero, D. Desbruyeres, C. Domingues, A. Garcia-Garcia, P. Gentine, J. Gilson, M. Gorfer, L. Haimberger, M. Ishii, G. Johnson, R. Killick, B. King, G. Kirchengast, N. Kolodziejczyk, J. Lyman, B. Marzeion, M. Mayer, M. Monier, D. Monselesan, S. Purkey, D. Roemmich, A. Schweiger, S. Seneviratne, A. Shepherd, D. Slater, A. Steiner, F. Straneo, M.L. Timmermans, S. Wijffels. "Heat stored in the Earth system: where does the energy go?" Earth System Science Data 12, Issue 3 (07 September 2020): 2013-2041. https://doi.org/10.5194/essd-12-2013-2020.

7. I. Velicogna, Yara Mohajerani, A. Geruo, F. Landerer, J. Mouginot, B. Noel, E. Rignot, T. Sutterly, M. van den Broeke, M. Wessem, D. Wiese, "Continuity of Ice Sheet Mass Loss in Greenland and Antarctica From the GRACE and GRACE Follow-On Missions." Geophysical Research Letters 47, Issue 8 (28 April 2020): e2020GL087291. https://doi.org/10.1029/2020GL087291.

8. National Snow and Ice Data Center; World Glacier Monitoring Service

9. National Snow and Ice Data Center; D.A. Robinson, D.K. Hall, and T.L. Mote, "MEaSUREs Northern Hemisphere Terrestrial Snow Cover Extent Daily 25km EASE-Grid 2.0, Version 1" (2017). Boulder, Colorado USA. NASA National Snow and Ice Data Center Distributed Active Archive Center. https://doi.org/10.5067/MEASURES/CRYOSPHERE/nsidc-0530.001. http://nsidc.org/cryosphere/sotc/snow_extent.html; Rutgers University Global Snow Lab, Data History

10. R.S. Nerem, B.D. Beckley, J. T. Fasullo, B.D. Hamlington, D. Masters, and G.T. Mitchum, "Climate-change–driven accelerated sea-level rise detected in the altimeter era." PNAS 15, no. 9 (12 Feb. 2018): 2022-2025. https://doi.org/10.1073/pnas.1717312115.

11. https://nsidc.org/cryosphere/sotc/sea_ice.html Pan-Arctic Ice Ocean Modeling and Assimilation System (PIOMAS, Zhang and Rothrock, 2003) http://psc.apl.washington.edu/research/projects/arctic-sea-ice-volume-anomaly/ http://psc.apl.uw.edu/research/projects/projections-of-an-ice-diminished-arctic-ocean/

12. USGCRP, 2017: Climate Science Special Report: Fourth National Climate Assessment, Volume I [Wuebbles, D.J., D.W. Fahey, K.A. Hibbard, D.J. Dokken, B.C. Stewart, and T.K. Maycock (eds.)]. U.S. Global Change Research Program, Washington, DC, USA, 470 pp, https://doi.org/10.7930/j0j964j6 .

13. http://www.pmel.noaa.gov/co2/story/What+is+Ocean+Acidification%3F

14. http://www.pmel.noaa.gov/co2/story/Ocean+Acidification

15. C.L. Sabine, et al., “The Oceanic Sink for Anthropogenic CO2.” Science 305 (16 July 2004): 367-371. https://doi.org/10.1126/science.1097403.

16. Special Report on the Ocean and Cryosphere in a Changing Climate , Technical Summary, Chapter TS.5, Changing Ocean, Marine Ecosystems, and Dependent Communities, Section 5.2.2.3. https://www.ipcc.ch/srocc/chapter/technical-summary/



UCI MIND

Dr. Craig Stark takes UCI’s women’s health research to new heights with the Ann S. Bowers Women’s Brain Health Initiative

As we celebrate Women’s History Month this March, we also find ourselves at a historic moment in scientific inquiry for women’s health research. Today, nearly two-thirds of Americans with Alzheimer’s disease are women, but the underlying cause of this sex disparity is still poorly understood. For decades, research focusing on women’s health has been inadequate, with a mere 0.5% of all neuroimaging studies conducted over the past 25 years focusing on women’s health. The potential to reach new heights in our understanding of the brain, especially today in the era of “big data” and artificial intelligence, is promising, but requires the right combination of expertise on women’s health and large-scale coordinated efforts to take meaningful steps forward.

The Ann S. Bowers Women’s Brain Health Initiative (WBHI), a new brain imaging consortium officially launched in November, represents a first-of-its-kind effort spanning multiple institutions within the UC system specifically aimed at accelerating transformative discoveries in women’s health through deeply collaborative science. The namesake of the WBHI honors the legacy of the late Ann S. Bowers, a trailblazer known for her leading roles in Intel, Apple, and the Noyce Foundation who dedicated much of her life to philanthropy and advancing efforts in technology and innovation.  

One of the distinguishing features of the WBHI is its commitment to collaboration and data sharing. “Often in the neuroimaging community, we collect small amounts of data, and we do that repetitively. We have a siloed science model of how neuroimaging research is done. It overlooks one of the core features of the UC — we are one university system with campuses spread out across a geographically and demographically diverse state,” says the Director of the WBHI, Emily Jacobs, PhD, a cognitive neuroscientist from UC Santa Barbara (UCSB). Dr. Jacobs has spent her career conducting cutting-edge neuroimaging research for understanding the role of hormone action on the human brain. Her visionary work led to Dr. Jacobs being invited to the White House in March, where she represented the Ann S. Bowers WBHI at the signing of the historic executive order calling for $12B in funding for women’s health research.


President Biden signs the executive order to fund women’s health research. Photo credit: Emily Jacobs, PhD

UCI MIND faculty member and Professor of Neurobiology and Behavior, Craig Stark, PhD, is the Director of the UCI Campus Center for NeuroImaging (CCNI) and a founding member of the WBHI. With UCI as a WBHI data collection site, Dr. Stark and his team of researchers are spearheading data sharing efforts that will support the WBHI vision of generating the most diverse and comprehensive collection of data ever acquired for women’s brain health. Furthermore, UCI Associate Professor of Neurobiology and Behavior, Elizabeth Chrastil, PhD, is co-leading one of the WBHI’s inaugural studies, the Maternal Brain Project, along with Dr. Jacobs; their labs are using precision imaging to map the maternal brain from pre-conception through one year postpartum. Enrollment for this study at each site is currently ongoing.

Together, UCSB and UC Irvine, along with UC Berkeley, UC San Francisco, UC San Diego, and UC Riverside, will harness the collective expertise and resources of the world-class imaging centers at each institution. The WBHI’s Data Coordinating Center at Stanford University, led by Russell Poldrack, PhD, will also play a key role in organizing and preparing the data, which will be housed open access on OpenNeuro. WBHI’s AI Core, a joint endeavor between Cornell University (Co-Director, Amy Kuceyeski, PhD) and UCSB (Co-Director, Nina Miolane, PhD), brings together leaders in computational science from industry and academia to drive discovery for women’s brain health.

To stay updated on WBHI’s latest news and research, continue to follow this blog and visit the Ann S. Bowers WBHI website, where you can find information about participating in research studies and registering for their monthly webinar series.


  • Open access
  • Published: 26 March 2024

Predicting and improving complex beer flavor through machine learning

Michiel Schreurs, Supinya Piampongsant, Miguel Roncoroni, Lloyd Cool, Beatriz Herrera-Malaver, Christophe Vanderaa, Florian A. Theßeling, Łukasz Kreft, Alexander Botzki, Philippe Malcorps, Luk Daenen, Tom Wenseleers & Kevin J. Verstrepen

Nature Communications volume 15, Article number: 2368 (2024)


  • Chemical engineering
  • Gas chromatography
  • Machine learning
  • Metabolomics
  • Taste receptors

The perception and appreciation of food flavor depends on many interacting chemical compounds and external factors, and therefore proves challenging to understand and predict. Here, we combine extensive chemical and sensory analyses of 250 different beers to train machine learning models that allow predicting flavor and consumer appreciation. For each beer, we measure over 200 chemical properties, perform quantitative descriptive sensory analysis with a trained tasting panel and map data from over 180,000 consumer reviews to train 10 different machine learning models. The best-performing algorithm, Gradient Boosting, yields models that significantly outperform predictions based on conventional statistics and accurately predict complex food features and consumer appreciation from chemical profiles. Model dissection allows identifying specific and unexpected compounds as drivers of beer flavor and appreciation. Adding these compounds results in variants of commercial alcoholic and non-alcoholic beers with improved consumer appreciation. Together, our study reveals how big data and machine learning uncover complex links between food chemistry, flavor and consumer perception, and lays the foundation to develop novel, tailored foods with superior flavors.


Introduction

Predicting and understanding food perception and appreciation is one of the major challenges in food science. Accurate modeling of food flavor and appreciation could yield important opportunities for both producers and consumers, including quality control, product fingerprinting, counterfeit detection, spoilage detection, and the development of new products and product combinations (food pairing) 1 , 2 , 3 , 4 , 5 , 6 . Accurate models for flavor and consumer appreciation would contribute greatly to our scientific understanding of how humans perceive and appreciate flavor. Moreover, accurate predictive models would also facilitate and standardize existing food assessment methods and could supplement or replace assessments by trained and consumer tasting panels, which are variable, expensive and time-consuming 7 , 8 , 9 . Lastly, apart from providing objective, quantitative, accurate and contextual information that can help producers, models can also guide consumers in understanding their personal preferences 10 .

Despite the myriad of applications, predicting food flavor and appreciation from its chemical properties remains a largely elusive goal in sensory science, especially for complex food and beverages 11 , 12 . A key obstacle is the immense number of flavor-active chemicals underlying food flavor. Flavor compounds can vary widely in chemical structure and concentration, making them technically challenging and labor-intensive to quantify, even in the face of innovations in metabolomics, such as non-targeted metabolic fingerprinting 13 , 14 . Moreover, sensory analysis is perhaps even more complicated. Flavor perception is highly complex, resulting from hundreds of different molecules interacting at the physiochemical and sensorial level. Sensory perception is often non-linear, characterized by complex and concentration-dependent synergistic and antagonistic effects 15 , 16 , 17 , 18 , 19 , 20 , 21 that are further convoluted by the genetics, environment, culture and psychology of consumers 22 , 23 , 24 . Perceived flavor is therefore difficult to measure, with problems of sensitivity, accuracy, and reproducibility that can only be resolved by gathering sufficiently large datasets 25 . Trained tasting panels are considered the prime source of quality sensory data, but require meticulous training, are low throughput and high cost. Public databases containing consumer reviews of food products could provide a valuable alternative, especially for studying appreciation scores, which do not require formal training 25 . Public databases offer the advantage of amassing large amounts of data, increasing the statistical power to identify potential drivers of appreciation. However, public datasets suffer from biases, including a bias in the volunteers that contribute to the database, as well as confounding factors such as price, cult status and psychological conformity towards previous ratings of the product.

Classical multivariate statistics and machine learning methods have been used to predict flavor of specific compounds by, for example, linking structural properties of a compound to its potential biological activities or linking concentrations of specific compounds to sensory profiles 1 , 26 . Importantly, most previous studies focused on predicting organoleptic properties of single compounds (often based on their chemical structure) 27 , 28 , 29 , 30 , 31 , 32 , 33 , thus ignoring the fact that these compounds are present in a complex matrix in food or beverages and excluding complex interactions between compounds. Moreover, the classical statistics commonly used in sensory science 34 , 35 , 36 , 37 , 38 , 39 require a large sample size and sufficient variance amongst predictors to create accurate models. They are not fit for studying an extensive set of hundreds of interacting flavor compounds, since they are sensitive to outliers, have a high tendency to overfit and are less suited for non-linear and discontinuous relationships 40 .
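To illustrate the overfitting problem described above, the sketch below fits an ordinary linear model with first-order interaction terms to synthetic data, where parameters quickly outnumber samples, and contrasts it with a regularized (Lasso) fit. All dimensions and values are arbitrary choices for demonstration.

```python
# Sketch: with interaction terms, a plain linear model has more parameters than
# samples, fits the training data perfectly, and fails on held-out data;
# L1 regularization (Lasso) tempers this. Synthetic data, illustrative only.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 40))
y = X[:, 0] + 0.5 * X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=100)

# 40 main effects + 780 pairwise interactions = 820 features for 75 train samples
X_int = PolynomialFeatures(degree=2, interaction_only=True,
                           include_bias=False).fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_int, y, random_state=0)

for model in (LinearRegression(), Lasso(alpha=0.05, max_iter=50_000)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__,
          "train R2:", round(model.score(X_tr, y_tr), 2),
          "test R2:", round(model.score(X_te, y_te), 2))
```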

In this study, we combine extensive chemical analyses and sensory data of a set of different commercial beers with machine learning approaches to develop models that predict taste, smell, mouthfeel and appreciation from compound concentrations. Beer is particularly suited to model the relationship between chemistry, flavor and appreciation. First, beer is a complex product, consisting of thousands of flavor compounds that partake in complex sensory interactions 41 , 42 , 43 . This chemical diversity arises from the raw materials (malt, yeast, hops, water and spices) and biochemical conversions during the brewing process (kilning, mashing, boiling, fermentation, maturation and aging) 44 , 45 . Second, the advent of the internet saw beer consumers embrace online review platforms, such as RateBeer (ZX Ventures, Anheuser-Busch InBev SA/NV) and BeerAdvocate (Next Glass, inc.). In this way, the beer community provides massive data sets of beer flavor and appreciation scores, creating extraordinarily large sensory databases to complement the analyses of our professional sensory panel. Specifically, we characterize over 200 chemical properties of 250 commercial beers, spread across 22 beer styles, and link these to the descriptive sensory profiling data of a 16-person in-house trained tasting panel and data acquired from over 180,000 public consumer reviews. These unique and extensive datasets enable us to train a suite of machine learning models to predict flavor and appreciation from a beer’s chemical profile. Dissection of the best-performing models allows us to pinpoint specific compounds as potential drivers of beer flavor and appreciation. Follow-up experiments confirm the importance of these compounds and ultimately allow us to significantly improve the flavor and appreciation of selected commercial beers. Together, our study represents a significant step towards understanding complex flavors and reinforces the value of machine learning to develop and refine complex foods. In this way, it represents a stepping stone for further computer-aided food engineering applications 46 .

To generate a comprehensive dataset on beer flavor, we selected 250 commercial Belgian beers across 22 different beer styles (Supplementary Fig.  S1 ). Beers with ≤ 4.2% alcohol by volume (ABV) were classified as non-alcoholic and low-alcoholic. Blonds and Tripels constitute a significant portion of the dataset (12.4% and 11.2%, respectively) reflecting their presence on the Belgian beer market and the heterogeneity of beers within these styles. By contrast, lager beers are less diverse and dominated by a handful of brands. Rare styles such as Brut or Faro make up only a small fraction of the dataset (2% and 1%, respectively) because fewer of these beers are produced and because they are dominated by distinct characteristics in terms of flavor and chemical composition.

Extensive analysis identifies relationships between chemical compounds in beer

For each beer, we measured 226 different chemical properties, including common brewing parameters such as alcohol content, iso-alpha acids, pH, sugar concentration 47 , and over 200 flavor compounds (Methods, Supplementary Table  S1 ). A large portion (37.2%) are terpenoids arising from hopping, responsible for herbal and fruity flavors 16 , 48 . A second major category are yeast metabolites, such as esters and alcohols, that result in fruity and solvent notes 48 , 49 , 50 . Other measured compounds are primarily derived from malt, or other microbes such as non- Saccharomyces yeasts and bacteria (‘wild flora’). Compounds that arise from spices or staling are labeled under ‘Others’. Five attributes (caloric value, total acids and total ester, hop aroma and sulfur compounds) are calculated from multiple individually measured compounds.

As a first step in identifying relationships between chemical properties, we determined correlations between the concentrations of the compounds (Fig.  1 , upper panel, Supplementary Data  1 and 2 , and Supplementary Fig.  S2 . For the sake of clarity, only a subset of the measured compounds is shown in Fig.  1 ). Compounds of the same origin typically show a positive correlation, while absence of correlation hints at parameters varying independently. For example, the hop aroma compounds citronellol, and alpha-terpineol show moderate correlations with each other (Spearman’s rho=0.39 and 0.57), but not with the bittering hop component iso-alpha acids (Spearman’s rho=0.16 and −0.07). This illustrates how brewers can independently modify hop aroma and bitterness by selecting hop varieties and dosage time. If hops are added early in the boiling phase, chemical conversions increase bitterness while aromas evaporate, conversely, late addition of hops preserves aroma but limits bitterness 51 . Similarly, hop-derived iso-alpha acids show a strong anti-correlation with lactic acid and acetic acid, likely reflecting growth inhibition of lactic acid and acetic acid bacteria, or the consequent use of fewer hops in sour beer styles, such as West Flanders ales and Fruit beers, that rely on these bacteria for their distinct flavors 52 . Finally, yeast-derived esters (ethyl acetate, ethyl decanoate, ethyl hexanoate, ethyl octanoate) and alcohols (ethanol, isoamyl alcohol, isobutanol, and glycerol), correlate with Spearman coefficients above 0.5, suggesting that these secondary metabolites are correlated with the yeast genetic background and/or fermentation parameters and may be difficult to influence individually, although the choice of yeast strain may offer some control 53 .
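For readers who want to reproduce this kind of analysis, the snippet below computes pairwise Spearman rank correlations with pandas. The file name and the three column names are placeholders, not the study’s actual identifiers.

```python
# Minimal sketch: Spearman rank correlations between compound concentrations,
# analogous to Fig. 1 (upper panel). File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("beer_chemistry.csv")  # rows: beers; columns: measured compounds

subset = ["citronellol", "alpha_terpineol", "iso_alpha_acids"]
corr = df[subset].corr(method="spearman")
print(corr.round(2))
```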

Figure 1

Spearman rank correlations are shown. Descriptors are grouped according to their origin (malt (blue), hops (green), yeast (red), wild flora (yellow), Others (black)), and sensory aspect (aroma, taste, palate, and overall appreciation). Please note that for the chemical compounds, for the sake of clarity, only a subset of the total number of measured compounds is shown, with an emphasis on the key compounds for each source. For more details, see the main text and Methods section. Chemical data can be found in Supplementary Data  1 , correlations between all chemical compounds are depicted in Supplementary Fig.  S2 and correlation values can be found in Supplementary Data  2 . See Supplementary Data  4 for sensory panel assessments and Supplementary Data  5 for correlation values between all sensory descriptors.

Interestingly, different beer styles show distinct patterns for some flavor compounds (Supplementary Fig.  S3 ). These observations agree with expectations for key beer styles, and serve as a control for our measurements. For instance, Stouts generally show high values for color (darker), while hoppy beers contain elevated levels of iso-alpha acids, compounds associated with bitter hop taste. Acetic and lactic acid are not prevalent in most beers, with notable exceptions such as Kriek, Lambic, Faro, West Flanders ales and Flanders Old Brown, which use acid-producing bacteria ( Lactobacillus and Pediococcus ) or unconventional yeast ( Brettanomyces ) 54 , 55 . Glycerol, ethanol and esters show similar distributions across all beer styles, reflecting their common origin as products of yeast metabolism during fermentation 45 , 53 . Finally, low/no-alcohol beers contain low concentrations of glycerol and esters. This is in line with the production process for most of the low/no-alcohol beers in our dataset, which are produced through limiting fermentation or by stripping away alcohol via evaporation or dialysis, with both methods having the unintended side-effect of reducing the amount of flavor compounds in the final beer 56 , 57 .

Besides expected associations, our data also reveals less trivial associations between beer styles and specific parameters. For example, geraniol and citronellol, two monoterpenoids responsible for citrus, floral and rose flavors and characteristic of Citra hops, are found in relatively high amounts in Christmas, Saison, and Brett/co-fermented beers, where they may originate from terpenoid-rich spices such as coriander seeds instead of hops 58 .

Tasting panel assessments reveal sensorial relationships in beer

To assess the sensory profile of each beer, a trained tasting panel evaluated each of the 250 beers for 50 sensory attributes, including different hop, malt and yeast flavors, off-flavors and spices. Panelists used a tasting sheet (Supplementary Data  3 ) to score the different attributes. Panel consistency was evaluated by repeating 12 samples across different sessions and performing ANOVA. In 95% of cases no significant difference was found across sessions ( p  > 0.05), indicating good panel consistency (Supplementary Table  S2 ).
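The consistency check described above can be reproduced per attribute with a one-way ANOVA; the sketch below uses scipy with made-up scores for one attribute across three sessions (in the study, 12 repeated samples were tested this way).

```python
# Illustrative sketch of the panel-consistency check: one-way ANOVA comparing
# scores for the same beer attribute across tasting sessions. Data invented.
from scipy.stats import f_oneway

session1 = [3.0, 3.5, 4.0, 3.0, 3.5]  # panel scores, session 1
session2 = [3.5, 3.0, 4.0, 3.5, 3.0]
session3 = [3.0, 4.0, 3.5, 3.0, 3.5]

stat, p = f_oneway(session1, session2, session3)
print(f"F = {stat:.2f}, p = {p:.3f}")  # p > 0.05 suggests consistent scoring
```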

Aroma and taste perception reported by the trained panel are often linked (Fig.  1 , bottom left panel and Supplementary Data  4 and 5 ), with high correlations between hops aroma and taste (Spearman’s rho=0.83). Bitter taste was found to correlate with hop aroma and taste in general (Spearman’s rho=0.80 and 0.69), and particularly with “grassy” noble hops (Spearman’s rho=0.75). Barnyard flavor, most often associated with sour beers, is identified together with stale hops (Spearman’s rho=0.97) that are used in these beers. Lactic and acetic acid, which often co-occur, are correlated (Spearman’s rho=0.66). Interestingly, sweetness and bitterness are anti-correlated (Spearman’s rho = −0.48), confirming the hypothesis that they mask each other 59 , 60 . Beer body is highly correlated with alcohol (Spearman’s rho = 0.79), and overall appreciation is found to correlate with multiple aspects that describe beer mouthfeel (alcohol, carbonation; Spearman’s rho= 0.32, 0.39), as well as with hop and ester aroma intensity (Spearman’s rho=0.39 and 0.35).

Similar to the chemical analyses, sensorial analyses confirmed typical features of specific beer styles (Supplementary Fig.  S4 ). For example, sour beers (Faro, Flanders Old Brown, Fruit beer, Kriek, Lambic, West Flanders ale) were rated acidic, with flavors of both acetic and lactic acid. Hoppy beers were found to be bitter and showed hop-associated aromas like citrus and tropical fruit. Malt taste is most detected among scotch, stout/porters, and strong ales, while low/no-alcohol beers, which often have a reputation for being ‘worty’ (reminiscent of unfermented, sweet malt extract) appear in the middle. Unsurprisingly, hop aromas are most strongly detected among hoppy beers. Like its chemical counterpart (Supplementary Fig.  S3 ), acidity shows a right-skewed distribution, with the most acidic beers being Krieks, Lambics, and West Flanders ales.

Tasting panel assessments of specific flavors correlate with chemical composition

We find that the concentrations of several chemical compounds strongly correlate with specific aroma or taste, as evaluated by the tasting panel (Fig.  2 , Supplementary Fig.  S5 , Supplementary Data  6 ). In some cases, these correlations confirm expectations and serve as a useful control for data quality. For example, iso-alpha acids, the bittering compounds in hops, strongly correlate with bitterness (Spearman’s rho=0.68), while ethanol and glycerol correlate with tasters’ perceptions of alcohol and body, the mouthfeel sensation of fullness (Spearman’s rho=0.82/0.62 and 0.72/0.57 respectively) and darker color from roasted malts is a good indication of malt perception (Spearman’s rho=0.54).

Figure 2

Heatmap colors indicate Spearman’s Rho. Axes are organized according to sensory categories (aroma, taste, mouthfeel, overall), chemical categories and chemical sources in beer (malt (blue), hops (green), yeast (red), wild flora (yellow), Others (black)). See Supplementary Data  6 for all correlation values.

Interestingly, for some relationships between chemical compounds and perceived flavor, correlations are weaker than expected. For example, the rose-smelling phenethyl acetate only weakly correlates with floral aroma. This hints at more complex relationships and interactions between compounds and suggests a need for a more complex model than simple correlations. Lastly, we uncovered unexpected correlations. For instance, the esters ethyl decanoate and ethyl octanoate appear to correlate slightly with hop perception and bitterness, possibly due to their fruity flavor. Iron is anti-correlated with hop aromas and bitterness, most likely because it is also anti-correlated with iso-alpha acids. This could be a sign of metal chelation of hop acids 61 , given that our analyses measure unbound hop acids and total iron content, or could result from the higher iron content in dark and Fruit beers, which typically have less hoppy and bitter flavors 62 .

Public consumer reviews complement expert panel data

To complement and expand the sensory data of our trained tasting panel, we collected 180,000 reviews of our 250 beers from the online consumer review platform RateBeer. This provided numerical scores for beer appearance, aroma, taste, palate, and overall quality, as well as the average overall score.

Public datasets are known to suffer from biases, such as price, cult status and psychological conformity towards previous ratings of a product. For example, prices correlate with appreciation scores for these online consumer reviews (rho=0.49, Supplementary Fig.  S6 ), but not for our trained tasting panel (rho=0.19). This suggests that prices affect consumer appreciation, as has been reported for wine 63 , while blind tastings are unaffected. Moreover, we observe that some beer styles, like lagers and non-alcoholic beers, generally receive lower scores, reflecting that online reviewers are mostly beer aficionados with a preference for specialty beers over lager beers. In general, we find a modest correlation between our trained panel’s overall appreciation score and the online consumer appreciation scores (Fig.  3 , rho=0.29). Apart from the aforementioned biases in the online datasets, serving temperature, sample freshness and surroundings, which are all tightly controlled during the tasting panel sessions, can vary tremendously across online consumers and can further contribute to differences (in appreciation, among other aspects) between the two categories of tasters. Importantly, in contrast to the overall appreciation scores, for many sensory aspects the results from the professional panel correlated well with results obtained from RateBeer reviews. Correlations were highest for features that are relatively easy to recognize even for untrained tasters, like bitterness, sweetness, alcohol and malt aroma (Fig.  3 and below).

Figure 3: RateBeer text mining results can be found in Supplementary Data 7. Rho values shown are Spearman correlation values, with asterisks indicating significant correlations (p < 0.05, two-sided). All p values were smaller than 0.001, except for Esters aroma (0.0553), Esters taste (0.3275), Esters aroma—banana (0.0019), Coriander (0.0508), and Diacetyl (0.0134).

Besides collecting consumer appreciation from these online reviews, we developed automated text analysis tools to gather additional data from the review texts (Supplementary Data 7). Processing the review texts in the RateBeer database yielded results comparable to the scores given by the trained panel for many common sensory aspects, including acidity, bitterness, sweetness, alcohol, malt, and hop tastes (Fig. 3). This is in line with expectations, since these attributes require less training for accurate assessment and are less influenced by environmental factors such as temperature, serving glass, and odors in the environment. Consumer reviews also correlate well with our trained panel for 4-vinyl guaiacol, a compound associated with a very characteristic aroma. By contrast, correlations for more specific aromas like esters, coriander, or diacetyl are much weaker, as these terms are underrepresented in the online reviews, underscoring the importance of a trained tasting panel and standardized tasting sheets with explicit factors to be scored for evaluating specific aspects of a beer. Taken together, our results suggest that public reviews are trustworthy for some, but not all, flavor features and can complement or substitute taste panel data for these sensory aspects.

Models can predict beer sensory profiles from chemical data

The rich datasets of chemical analyses, tasting panel assessments and public reviews gathered in the first part of this study provided us with a unique opportunity to develop predictive models that link chemical data to sensorial features. Given the complexity of beer flavor, basic statistical tools such as correlations or linear regression may not always be the most suitable for making accurate predictions. Instead, we applied different machine learning models that can capture both simple linear and complex interactive relationships. Specifically, we constructed a set of regression models to predict (a) trained panel scores for beer flavor and quality and (b) public reviews’ appreciation scores from beer chemical profiles. We trained and tested 10 different models (Methods): 3 linear regression-based models (simple linear regression with first-order interactions (LR), lasso regression with first-order interactions (Lasso), and partial least squares regression (PLSR)), 5 decision tree models (AdaBoost regressor (ABR), extra trees (ET), gradient boosting regressor (GBR), random forest (RF), and XGBoost regressor (XGBR)), 1 support vector regression (SVR) model, and 1 artificial neural network (ANN) model.

To compare the performance of our machine learning models, the dataset was randomly split into a training and a test set, stratified by beer style. After a model was trained on the training set, its performance was evaluated by its ability to predict the test set, using multi-output models scored by the coefficient of determination (R²; see Methods). Additionally, individual-attribute models were ranked per descriptor and the average rank was calculated, as proposed by Korneva et al. 64. Importantly, both ways of evaluating the models’ performance generally agreed. Performance of the different models varied (Table 1). Notably, all models perform better at predicting RateBeer results than results from our trained tasting panel. One reason could be that sensory data are inherently variable, and this variability is averaged out by the large number of public reviews on RateBeer. Additionally, all tree-based models perform better at predicting taste than aroma. Linear models (LR) performed particularly poorly, with negative R² values, due to severe overfitting (training set R² = 1). Overfitting is a common issue in linear models with many parameters and limited samples, especially when interaction terms further amplify the number of parameters. L1 regularization (Lasso) successfully overcomes this overfitting, out-competing multiple tree-based models on the RateBeer dataset. Similarly, the dimensionality reduction of PLSR avoids overfitting and improves performance to some extent. Still, tree-based models (ABR, ET, GBR, RF and XGBR) show the best performance, out-competing the linear models (LR, Lasso, PLSR) commonly used in sensory science 65.
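
A minimal sketch of this setup with scikit-learn might look as follows; the file names, column names, and hyperparameters are placeholders rather than the study’s actual configuration, and only 3 of the 10 models are shown.

```python
# Illustrative sketch of the comparison setup: style-stratified 70/30
# split, multi-output regressors, and test-set R². File and column names
# are hypothetical placeholders, not the study's actual data layout.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

chem = pd.read_csv("chemical_profiles.csv")   # hypothetical file
sensory = pd.read_csv("sensory_zscores.csv")  # hypothetical file

X_train, X_test, y_train, y_test = train_test_split(
    chem.drop(columns=["style"]), sensory,
    test_size=0.3, stratify=chem["style"], random_state=42)

models = {
    "Lasso": MultiOutputRegressor(Lasso(alpha=0.01)),
    "RF": RandomForestRegressor(random_state=42),   # natively multi-output
    "GBR": MultiOutputRegressor(GradientBoostingRegressor(random_state=42)),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, r2_score(y_test, model.predict(X_test)))  # uniform-average R²
```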

GBR models showed the best overall performance in predicting sensory responses from chemical information, with R² values up to 0.75 depending on the predicted sensory feature (Supplementary Table S4). The GBR models predict consumer appreciation (RateBeer) better than our trained panel’s appreciation (R² = 0.67 versus R² = 0.09) (Supplementary Tables S3 and S4). ANN models showed intermediate performance, likely because neural networks typically perform best with larger datasets 66. The SVR also shows intermediate performance, mostly due to weak predictions of specific attributes that lower the overall performance (Supplementary Table S4).

Model dissection identifies specific, unexpected compounds as drivers of consumer appreciation

Next, we leveraged our models to infer important contributors to sensory perception and consumer appreciation. Consumer preference is a crucial sensory aspect, because a product with low consumer appreciation scores often does not succeed commercially 25. Additionally, the requirement for a large number of representative evaluators makes consumer trials one of the more costly and time-consuming aspects of product development. Hence, a model that predicts the chemical drivers of overall appreciation would be a welcome addition to the available toolbox for food development and optimization.

Since the GBR models trained on our RateBeer dataset showed the best overall performance, we focused on these models. Specifically, we used two approaches to identify important contributors. First, rankings of the most important predictors for each sensorial trait in the GBR models were obtained based on impurity-based feature importance (mean decrease in impurity). High-ranked parameters were hypothesized either to be the true causal chemical properties underlying the trait, to correlate with the actual causal properties, or to take part in sensory interactions affecting the trait 67 (Fig. 4A). In a second approach, we used SHAP 68 to determine which parameters contributed most to the model’s predictions of consumer appreciation (Fig. 4B). SHAP calculates parameter contributions to model predictions on a per-sample basis, which can be aggregated into an importance score.

Figure 4: A The impurity-based feature importance (mean decrease in impurity, MDI) calculated from the Gradient Boosting Regression (GBR) model predicting RateBeer appreciation scores. The top 15 highest-ranked chemical properties are shown. B SHAP summary plot for the top 15 parameters contributing to our GBR model. Each point on the graph represents a sample from our dataset. The color represents the concentration of that parameter, with bluer colors representing low values and redder colors representing higher values. Greater absolute values on the horizontal axis indicate a higher impact of the parameter on the model’s prediction. C Spearman correlations between the 15 most important chemical properties and consumer overall appreciation. Numbers indicate the Spearman rho correlation coefficient and the rank of this correlation compared to all other correlations. The top 15 important compounds were determined using SHAP (panel B).

Both approaches identified ethyl acetate as the most predictive parameter for beer appreciation (Fig.  4 ). Ethyl acetate is the most abundant ester in beer with a typical ‘fruity’, ‘solvent’ and ‘alcoholic’ flavor, but is often considered less important than other esters like isoamyl acetate. The second most important parameter identified by SHAP is ethanol, the most abundant beer compound after water. Apart from directly contributing to beer flavor and mouthfeel, ethanol drastically influences the physical properties of beer, dictating how easily volatile compounds escape the beer matrix to contribute to beer aroma 69 . Importantly, it should also be noted that the importance of ethanol for appreciation is likely inflated by the very low appreciation scores of non-alcoholic beers (Supplementary Fig.  S4 ). Despite not often being considered a driver of beer appreciation, protein level also ranks highly in both approaches, possibly due to its effect on mouthfeel and body 70 . Lactic acid, which contributes to the tart taste of sour beers, is the fourth most important parameter identified by SHAP, possibly due to the generally high appreciation of sour beers in our dataset.

Interestingly, some of the most important predictive parameters in our model are not well-established beer flavors or are even commonly regarded as negative for beer quality. For example, our models identify methanethiol and ethyl phenyl acetate, an ester commonly linked to beer staling 71, as key factors contributing to beer appreciation. Although there is no doubt that high concentrations of these compounds are unpleasant, the effects of more modest concentrations are not yet known 72, 73.

To compare our approach to conventional statistics, we evaluated how well the 15 most important SHAP-derived parameters correlate with consumer appreciation (Fig. 4C). Interestingly, only 6 of the properties derived by SHAP rank amongst the top 15 most correlated parameters. For some chemical compounds, the correlations are so low that they would likely have been considered unimportant. For example, lactic acid, the fourth most important parameter, shows a bimodal distribution for appreciation, with sour beers forming a separate cluster that is missed entirely by the Spearman correlation. Additionally, the correlation plots reveal outliers, emphasizing the need for robust analysis tools. Together, this highlights the need for alternative models, like the gradient boosting model, that better grasp the complexity of (beer) flavor.

Finally, to observe the relationships between these chemical properties and their predicted targets, partial dependence plots were constructed for the six most important predictors of consumer appreciation 74 , 75 , 76 (Supplementary Fig.  S7 ). One-way partial dependence plots show how a change in concentration affects the predicted appreciation. These plots reveal an important limitation of our models: appreciation predictions remain constant at ever-increasing concentrations. This implies that once a threshold concentration is reached, further increasing the concentration does not affect appreciation. This is false, as it is well-documented that certain compounds become unpleasant at high concentrations, including ethyl acetate (‘nail polish’) 77 and methanethiol (‘sulfury’ and ‘rotten cabbage’) 78 . The inability of our models to grasp that flavor compounds have optimal levels, above which they become negative, is a consequence of working with commercial beer brands where (off-)flavors are rarely too high to negatively impact the product. The two-way partial dependence plots show how changing the concentration of two compounds influences predicted appreciation, visualizing their interactions (Supplementary Fig.  S7 ). In our case, the top 5 parameters are dominated by additive or synergistic interactions, with high concentrations for both compounds resulting in the highest predicted appreciation.

To assess the robustness of our best-performing models and model predictions, we performed 100 iterations of the GBR, RF and ET models. In general, all iterations of the models yielded similar performance (Supplementary Fig.  S8 ). Moreover, the main predictors (including the top predictors ethanol and ethyl acetate) remained virtually the same, especially for GBR and RF. For the iterations of the ET model, we did observe more variation in the top predictors, which is likely a consequence of the model’s inherent random architecture in combination with co-correlations between certain predictors. However, even in this case, several of the top predictors (ethanol and ethyl acetate) remain unchanged, although their rank in importance changes (Supplementary Fig.  S8 ).
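
A minimal sketch of such a stability check, assuming a feature DataFrame `X`, a single appreciation target `y`, and a `styles` vector are already defined, might look as follows; the exact resampling scheme used in the study may differ.

```python
# A minimal robustness sketch: refit the GBR across 100 random splits and
# record test R² and the top-ranked features. X (DataFrame), y, and
# styles are assumed to be defined as in the model-training step.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

scores, top_features = [], []
for seed in range(100):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=styles, random_state=seed)
    gbr = GradientBoostingRegressor(random_state=seed).fit(X_tr, y_tr)
    scores.append(r2_score(y_te, gbr.predict(X_te)))
    top_features.append(list(X.columns[np.argsort(gbr.feature_importances_)[::-1][:5]]))
print(f"R² mean={np.mean(scores):.2f}, sd={np.std(scores):.2f}")
```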

Next, we investigated whether combining the RateBeer and trained panel data into one consolidated dataset would lead to stronger models, under the hypothesis that such a model would suffer less from bias in either dataset. A GBR model was trained to predict appreciation on the combined dataset. This model underperformed compared to the RateBeer-only model (R² = 0.67), both without (R² = 0.26) and with (R² = 0.42) a dataset identifier. With the identifier included, the dataset identifier is the most important feature (Supplementary Fig. S9), while most of the feature importances remain unchanged, with ethyl acetate and ethanol ranking highest, as in the original model trained only on RateBeer data. It seems that the large variation in the panel dataset introduces noise, weakening the models’ performance and reliability. In addition, it seems reasonable to assume that the two datasets are fundamentally different, with the panel dataset obtained through blind tastings by a small, trained professional panel.

Lastly, we evaluated whether beer style identifiers would further enhance the model’s performance. A GBR model was trained with parameters that explicitly encoded the styles of the samples. This did not improve model performance (R² = 0.66 with style information vs. R² = 0.67 without). The most important chemical features are consistent with those of the model trained without style information (e.g., ethanol and ethyl acetate), and, with the exception of the most preferred (strong ale) and least preferred (low/no-alcohol) styles, none of the styles were among the most important features (Supplementary Fig. S9, Supplementary Tables S5 and S6). This is likely due to a combination of style-specific chemical signatures, such as iso-alpha acids and lactic acid, that implicitly convey style information to the original models, and the low number of samples belonging to some styles, which makes it difficult for the model to learn style-specific patterns. Moreover, beer styles are not rigorously defined, with some styles overlapping in features and some beers being misattributed to a specific style, all of which adds noise to models that use style parameters.
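
Encoding style labels as explicit model inputs can be done with one-hot dummy columns, as in this brief sketch; `X` and `styles` are assumed inputs, and the study’s exact encoding is not specified beyond “parameters that explicitly encoded the styles”.

```python
# One-hot encode beer styles and append them to the chemical feature
# matrix; X (DataFrame) and styles (Series) are assumed to exist.
import pandas as pd

style_dummies = pd.get_dummies(styles, prefix="style")
X_with_style = pd.concat([X, style_dummies], axis=1)
```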

Model validation

To test whether our predictive models give insight into beer appreciation, we set up experiments aimed at improving existing commercial beers. We specifically selected overall appreciation as the trait to be examined because of its complexity and commercial relevance. Beer flavor comprises a complex bouquet rather than single aromas and tastes 53. Hence, adding a single compound to the extent that a difference is noticeable may lead to an unbalanced, artificial flavor. Therefore, we evaluated the effect of combinations of compounds. Because Blond beers form the largest style group in our dataset, we selected a beer from this style as the starting material for these experiments (Beer 64 in Supplementary Data 1).

In the first set of experiments, we adjusted the concentrations of the compounds that made up the most important predictors of overall appreciation (ethyl acetate, ethanol, lactic acid, ethyl phenyl acetate) together with correlated compounds (ethyl hexanoate, isoamyl acetate, glycerol), bringing them up to 95th-percentile ethanol-normalized concentrations (Methods) within the Blond group (‘Spiked’ concentration in Fig. 5A). Compared to controls, the spiked beers were found to have significantly improved overall appreciation among trained panelists, with panelists noting increased intensity of ester flavors, sweetness, alcohol, and body fullness (Fig. 5B). To disentangle the contribution of ethanol to these results, a second experiment was performed without the addition of ethanol. This resulted in a similar outcome, including increased perception of alcohol and overall appreciation.

Figure 5: Adding the top chemical compounds, identified as the best predictors of appreciation by our model, to poorly appreciated beers results in increased appreciation from our trained panel. Results of sensory tests between base beers and those spiked with compounds identified as the best predictors by the model. A Blond and Non/Low-alcohol (0.0% ABV) base beers were brought up to 95th-percentile ethanol-normalized concentrations within each style. B For each sensory attribute, tasters indicated the more intense sample and selected the sample they preferred. The numbers above the bars correspond to the p values indicating significant changes in perceived flavor (two-sided binomial test: alpha 0.05, n = 20 or 13).

In a final experiment, we tested whether the model’s predictions can boost the appreciation of a non-alcoholic beer (Beer 223 in Supplementary Data 1). Again, the addition of a mixture of predicted compounds (omitting ethanol, in this case) resulted in a significant increase in appreciation, body, ester flavor, and sweetness.

Discussion

Predicting flavor and consumer appreciation from chemical composition is one of the ultimate goals of sensory science. A reliable, systematic, and unbiased way to link chemical profiles to flavor and food appreciation would be a significant asset to the food and beverage industry. Such tools would substantially aid in quality control and recipe development, offer an efficient and cost-effective alternative to pilot studies and consumer trials, and ultimately allow food manufacturers to produce superior, tailor-made products that better meet the demands of specific consumer groups.

A limited set of studies have previously tried, with varying degrees of success, to predict beer flavor and beer popularity based on (a limited set of) chemical compounds and flavors 79, 80. Current sensitive, high-throughput technologies allow measuring an unprecedented number of chemical compounds and properties in a large set of samples, yielding datasets that can train models to help close the gap between chemistry and flavor, even for a complex natural product like beer. To our knowledge, no previous research has gathered data at this scale (250 samples, 226 chemical parameters, 50 sensory attributes and 5 consumer scores) to disentangle and validate the chemical aspects driving beer preference using various machine-learning techniques. We find that modern machine learning models outperform conventional statistical tools, such as correlations and linear models, and can successfully predict flavor appreciation from chemical composition. This can be attributed to the natural incorporation of interactions and non-linear or discontinuous effects in machine learning models, which are not easily captured by linear model architectures. While linear models and partial least squares regression represent the most widespread statistical approaches in sensory science, in part because they allow interpretation 65, 81, 82, modern machine learning methods allow building better predictive models while preserving the possibility to dissect and exploit the underlying patterns. Of the 10 different models we trained, tree-based models, such as our best-performing GBR, showed the best overall performance in predicting sensory responses from chemical information, outcompeting artificial neural networks. This agrees with previous reports for models trained on tabular data 83. Our results are in line with the findings of Colantonio et al., who also identified the gradient boosting architecture as performing best at predicting appreciation and flavor (of tomatoes and blueberries, in their specific study) 26. Importantly, besides our larger experimental scale, we were able to directly confirm our models’ predictions in vivo.

Our study confirms that flavor compound concentration does not always correlate with perception, suggesting complex interactions that are often missed by more conventional statistics and simple models. Specifically, we find that tree-based algorithms may perform best in developing models that link complex food chemistry with aroma. Furthermore, we show that massive datasets of untrained consumer reviews provide a valuable source of data that can complement or even replace trained tasting panels, especially for appreciation and basic flavors, such as sweetness and bitterness. This holds despite biases that are known to occur in such datasets, such as price or conformity bias. Moreover, GBR models predict taste better than aroma. This is likely because taste (e.g., bitterness) often directly relates to the corresponding chemical measurements (e.g., iso-alpha acids), whereas such a link is less clear for aromas, which often result from the interplay of multiple volatile compounds. We also find that our models are best at predicting acidity and alcohol, likely because there is a direct relation between the measured chemical compounds (acids and ethanol) and the corresponding perceived sensorial attributes (acidity and alcohol), and because even untrained consumers are generally able to recognize these flavors and aromas.

The predictions of our final models, trained on review data, hold even for blind tastings with small groups of trained tasters, as demonstrated by our ability to validate specific compounds as drivers of beer flavor and appreciation. Since adding a single compound to the extent of a noticeable difference may result in an unbalanced flavor profile, we specifically tested our identified key drivers as a combination of compounds. While this approach does not allow us to validate if a particular single compound would affect flavor and/or appreciation, our experiments do show that this combination of compounds increases consumer appreciation.

It is important to stress that, while it represents an important step forward, our approach still has several major limitations. A key weakness of the GBR model architecture is that, amongst co-correlating variables, the variable with the largest main effect is consistently preferred for model building. As a result, co-correlating variables often have artificially low importance scores, both for impurity- and SHAP-based methods, as we observed in the comparison to the more randomized Extra Trees models. This implies that chemicals identified as key drivers of a specific sensory feature by GBR might not be the true causative compounds, but rather co-correlate with the actual causative chemical. For example, the high importance of ethyl acetate could be (partially) attributed to the total ester content, ethanol or ethyl hexanoate (rho = 0.77, 0.72 and 0.68, respectively), while ethyl phenyl acetate could hide the importance of prenyl isobutyrate and ethyl benzoate (rho = 0.77 and 0.76). Expanding our GBR model to include beer style as a parameter did not yield additional power or insight. This is likely due to style-specific chemical signatures, such as iso-alpha acids and lactic acid, that implicitly convey style information to the original model, as well as the smaller sample size per style, limiting the power to uncover style-specific patterns. This can be partly attributed to the curse of dimensionality, where the high number of parameters results in the models mainly incorporating single-parameter effects rather than complex interactions such as style-dependent effects 67. A larger number of samples may overcome some of these limitations and offer more insight into style-specific effects. On the other hand, beer style is not a rigid scientific classification, and beers within one style often differ substantially, which further complicates the analysis of style as a model factor.

Our study is limited to beers from Belgian breweries. Although these beers cover a large portion of the beer styles available globally, some beer styles and consumer patterns may be missing, while other features might be overrepresented. For example, many Belgian ales exhibit yeast-driven flavor profiles, which is reflected in the chemical drivers of appreciation discovered by this study. In future work, expanding the scope to include diverse markets and beer styles could lead to the identification of even more drivers of appreciation and better models for special niche products that were not present in our beer set.

In addition to the inherent limitations of GBR models, there are also some limitations associated with studying food aroma. Even though our chemical analyses measured most of the known aroma compounds, the total number of flavor compounds in complex foods like beer is still larger than the subset we were able to measure in this study. For example, hop-derived thiols, which influence flavor at very low concentrations, are notoriously difficult to measure in a high-throughput experiment. Moreover, consumer perception remains subjective and prone to biases that are difficult to avoid. It is also important to stress that the models are still immature and that more extensive datasets will be crucial for developing more complete models in the future. Besides more samples and parameters, our dataset does not include any demographic information about the tasters. Including such data could lead to better models that grasp external factors like age and culture. Another limitation is that our set of beers consists of high-quality end-products and lacks beers that are unfit for sale, which limits the current models’ ability to accurately predict products that are appreciated very poorly. Finally, while the models could be readily applied in quality control, their use in sensory science and product development is restrained by their inability to discern causal relationships. Given that the models cannot distinguish compounds that genuinely drive consumer perception from those that merely correlate, validation experiments remain essential to identify true causative compounds.

Despite the inherent limitations, dissection of our models enabled us to pinpoint specific molecules as potential drivers of beer aroma and consumer appreciation, including compounds that were unexpected and would not have been identified using standard approaches. Important drivers of beer appreciation uncovered by our models include protein levels, ethyl acetate, ethyl phenyl acetate and lactic acid. Currently, many brewers already use lactic acid to acidify their brewing water and ensure optimal pH for enzymatic activity during the mashing process. Our results suggest that adding lactic acid can also improve beer appreciation, although its individual effect remains to be tested. Interestingly, ethanol appears to be unnecessary to improve beer appreciation, both for blond beer and alcohol-free beer. Given the growing consumer interest in alcohol-free beer, with a predicted annual market growth of >7% 84 , it is relevant for brewers to know what compounds can further increase consumer appreciation of these beers. Hence, our model may readily provide avenues to further improve the flavor and consumer appreciation of both alcoholic and non-alcoholic beers, which is generally considered one of the key challenges for future beer production.

Whereas we see a direct implementation of our results for the development of superior alcohol-free beverages and other food products, our study can also serve as a stepping stone for the development of novel alcohol-containing beverages. We want to echo the growing body of scientific evidence for the negative effects of alcohol consumption, both on the individual level by the mutagenic, teratogenic and carcinogenic effects of ethanol 85 , 86 , as well as the burden on society caused by alcohol abuse and addiction. We encourage the use of our results for the production of healthier, tastier products, including novel and improved beverages with lower alcohol contents. Furthermore, we strongly discourage the use of these technologies to improve the appreciation or addictive properties of harmful substances.

The present work demonstrates that despite some important remaining hurdles, combining the latest developments in chemical analyses, sensory analysis and modern machine learning methods offers exciting avenues for food chemistry and engineering. Soon, these tools may provide solutions in quality control and recipe development, as well as new approaches to sensory science and flavor research.

Methods

Beer selection

250 commercial Belgian beers were selected to cover the broad diversity of beer styles and corresponding diversity in chemical composition and aroma. See Supplementary Fig.  S1 .

Chemical dataset

Sample preparation.

Beers within their expiration date were purchased from commercial retailers. Samples were prepared in biological duplicates at room temperature, unless explicitly stated otherwise. Bottle pressure was measured with a manual pressure device (Steinfurth Mess-Systeme GmbH) and used to calculate CO2 concentration. The beer was poured through two filter papers (Macherey-Nagel, 500713032 MN 713 ¼) to remove carbon dioxide and prevent spontaneous foaming. Samples were then prepared for measurements by targeted Headspace-Gas Chromatography-Flame Ionization Detector/Flame Photometric Detector (HS-GC-FID/FPD), Headspace-Solid Phase Microextraction-Gas Chromatography-Mass Spectrometry (HS-SPME-GC-MS), colorimetric analysis, enzymatic analysis, and Near-Infrared (NIR) analysis, as described in the sections below. The mean values of biological duplicates are reported for each compound.

HS-GC-FID/FPD

HS-GC-FID/FPD (Shimadzu GC 2010 Plus) was used to measure higher alcohols, acetaldehyde, esters, 4-vinyl guaiacol, and sulfur compounds. Each measurement comprised 5 ml of sample pipetted into a 20 ml glass vial containing 1.75 g NaCl (VWR, 27810.295). 100 µl of 2-heptanol (Sigma-Aldrich, H3003) (internal standard) solution in ethanol (Fisher Chemical, E/0650DF/C17) was added for a final concentration of 2.44 mg/L. Samples were flushed with nitrogen for 10 s, sealed with a silicone septum, stored at −80 °C and analyzed in batches of 20.

The GC was equipped with a DB-WAXetr column (length, 30 m; internal diameter, 0.32 mm; layer thickness, 0.50 µm; Agilent Technologies, Santa Clara, CA, USA) connected to the FID, and an HP-5 column (length, 30 m; internal diameter, 0.25 mm; layer thickness, 0.25 µm; Agilent Technologies, Santa Clara, CA, USA) connected to the FPD. N2 was used as the carrier gas. Samples were incubated for 20 min at 70 °C in the headspace autosampler (flow rate, 35 cm/s; injection volume, 1000 µL; injection mode, split; Combi PAL autosampler, CTC Analytics, Switzerland). The injector, FID and FPD temperatures were kept at 250 °C. The GC oven temperature was first held at 50 °C for 5 min, then raised to 80 °C at a rate of 5 °C/min, followed by a second ramp of 4 °C/min until 200 °C (held for 3 min) and a final ramp of 4 °C/min until 230 °C (held for 1 min). Results were analyzed with the GCSolution software version 2.4 (Shimadzu, Kyoto, Japan). The GC was calibrated with a 5% EtOH solution (VWR International) containing the volatiles under study (Supplementary Table S7).

HS-SPME-GC-MS

HS-SPME-GC-MS (Shimadzu GCMS-QP-2010 Ultra) was used to measure additional volatile compounds, mainly comprising terpenoids and esters. Samples were analyzed by HS-SPME using a triphase DVB/Carboxen/PDMS 50/30 μm SPME fiber (Supelco Co., Bellefonte, PA, USA) followed by gas chromatography (Thermo Fisher Scientific Trace 1300 series, USA) coupled to a mass spectrometer (Thermo Fisher Scientific ISQ series MS) equipped with a TriPlus RSH autosampler. 5 ml of degassed beer sample was placed in 20 ml vials containing 1.75 g NaCl (VWR, 27810.295). 5 µl internal standard mix was added, containing 2-heptanol (1 g/L) (Sigma-Aldrich, H3003), 4-fluorobenzaldehyde (1 g/L) (Sigma-Aldrich, 128376), 2,3-hexanedione (1 g/L) (Sigma-Aldrich, 144169) and guaiacol (1 g/L) (Sigma-Aldrich, W253200) in ethanol (Fisher Chemical, E/0650DF/C17). Each sample was incubated at 60 °C in the autosampler oven with constant agitation. After 5 min equilibration, the SPME fiber was exposed to the sample headspace for 30 min. The compounds trapped on the fiber were thermally desorbed in the injection port of the chromatograph by heating the fiber for 15 min at 270 °C.

The GC-MS was equipped with a low-polarity RXi-5Sil MS column (length, 20 m; internal diameter, 0.18 mm; layer thickness, 0.18 µm; Restek, Bellefonte, PA, USA). Injection was performed in splitless mode at 320 °C, with a split flow of 9 ml/min, a purge flow of 5 ml/min and an open valve time of 3 min. To obtain a pulsed injection, a programmed gas flow was used whereby the helium gas flow was set at 2.7 mL/min for 0.1 min, followed by a decrease in flow of 20 ml/min to the normal 0.9 mL/min. The temperature was first held at 30 °C for 3 min, then raised to 80 °C at a rate of 7 °C/min, followed by a second ramp of 2 °C/min until 125 °C and a final ramp of 8 °C/min to a final temperature of 270 °C.

Mass acquisition range was 33 to 550 amu at a scan rate of 5 scans/s. Electron impact ionization energy was 70 eV. The interface and ion source were kept at 275 °C and 250 °C, respectively. A mix of linear n-alkanes (from C7 to C40, Supelco Co.) was injected into the GC-MS under identical conditions to serve as external retention index markers. Identification and quantification of the compounds were performed using an in-house developed R script as described in Goelen et al. and Reher et al. 87, 88 (for package information, see Supplementary Table S8). Briefly, chromatograms were analyzed using AMDIS (v2.71) 89 to separate overlapping peaks and obtain pure compound spectra. The NIST MS Search software (v2.0 g) in combination with the NIST2017, FFNSC3 and Adams4 libraries was used to manually identify the empirical spectra, taking into account the expected retention time. After background subtraction and correction for retention time shifts between samples run on different days (based on alkane ladders), compound elution profiles were extracted and integrated using a file with 284 target compounds of interest, which were either recovered in our identified AMDIS list of spectra or were known to occur in beer. Compound elution profiles were estimated for every peak in every chromatogram over a time-restricted window using weighted non-negative least squares analysis, after which peak areas were integrated 87, 88. Batch effect correction was performed by normalizing against the most stable internal standard compound, 4-fluorobenzaldehyde. Of the 284 target compounds analyzed, 167 were visually judged to have reliable elution profiles and were used for the final analysis.
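
The elution-profile estimation itself was done with the in-house R script cited above; purely as an illustration of the core weighted non-negative least squares step, a Python sketch with synthetic data could look like this (the spectra, weights, and dimensions are all invented).

```python
# Illustration only: the study used an in-house R script. Core idea of the
# weighted non-negative least squares step, with synthetic data: S holds
# pure component spectra (m/z bins × compounds), x is one observed scan,
# and w are per-bin weights. All values below are invented.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
S = rng.random((200, 3))                  # hypothetical pure spectra
true_coef = np.array([0.5, 0.0, 1.2])     # hypothetical contributions
x = S @ true_coef + 0.01 * rng.random(200)
w = 1.0 / np.sqrt(x + 0.1)                # hypothetical weights

coef, _ = nnls(w[:, None] * S, w * x)     # weighted NNLS for one scan
print(coef)                               # approximately recovers true_coef
```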

Discrete photometric and enzymatic analysis

Discrete photometric and enzymatic analysis (Thermo Scientific TM Gallery TM Plus Beermaster Discrete Analyzer) was used to measure acetic acid, ammonia, beta-glucan, iso-alpha acids, color, sugars, glycerol, iron, pH, protein, and sulfite. 2 ml of sample volume was used for the analyses. Information regarding the reagents and standard solutions used for analyses and calibrations is included in Supplementary Table  S7 and Supplementary Table  S9 .

NIR analyses

NIR analysis (Anton Paar Alcolyzer Beer ME System) was used to measure ethanol. Measurements comprised 50 ml of sample, and a 10% EtOH solution was used for calibration.

Correlation calculations

Pairwise Spearman Rank correlations were calculated between all chemical properties.
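
Assuming the chemical measurements are held in a samples × properties DataFrame `chem` (a hypothetical name), this computation is a short sketch in Python:

```python
# Pairwise Spearman rank correlations (with p values) between all chemical
# properties; chem is an assumed samples × properties DataFrame.
import pandas as pd
from scipy.stats import spearmanr

rho, pval = spearmanr(chem, axis=0)   # each column = one property
rho = pd.DataFrame(rho, index=chem.columns, columns=chem.columns)
```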

Sensory dataset

Trained panel.

Our trained tasting panel consisted of volunteers who gave prior verbal informed consent. All compounds used for the validation experiment were of food-grade quality. The tasting sessions were approved by the Social and Societal Ethics Committee of the KU Leuven (G-2022-5677-R2(MAR)). All online reviewers agreed to the Terms and Conditions of the RateBeer website.

Sensory analysis was performed according to the American Society of Brewing Chemists (ASBC) Sensory Analysis Methods 90. Thirty volunteers were screened through a series of triangle tests. The sixteen most sensitive and consistent tasters were retained as taste panel members. The resulting panel was diverse in age [22–42, mean: 29], sex [56% male] and nationality [7 different countries]. The panel developed a consensus vocabulary to describe beer aroma, taste and mouthfeel. Panelists were trained to identify and score 50 different attributes, using a 7-point scale to rate attribute intensity. The scoring sheet is included as Supplementary Data 3. Sensory assessments took place between 10 a.m. and noon. The beers were served in black-colored glasses. Per session, between 5 and 12 beers of the same style were tasted at 12 °C to 16 °C. Two reference beers were added to each set and indicated as ‘Reference 1 & 2’, allowing panel members to calibrate their ratings. Not all panelists were present at every tasting. Scores were scaled by standard deviation and mean-centered per taster. Values are represented as z-scores and clustered by Euclidean distance. Pairwise Spearman correlations were calculated between taste and aroma sensory attributes. Panel consistency was evaluated by repeating samples in different sessions and performing ANOVA to identify differences, using the ‘stats’ package (v4.2.2) in R (for package information, see Supplementary Table S8).
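
For illustration, per-taster standardization of this kind can be expressed with a pandas groupby, assuming a long-format DataFrame `scores` with hypothetical column names:

```python
# Standardize each taster's ratings against their own mean and standard
# deviation; scores is an assumed long-format DataFrame with columns
# ["taster", "attribute", "score"].
import pandas as pd

scores["z"] = scores.groupby("taster")["score"].transform(
    lambda s: (s - s.mean()) / s.std())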

Online reviews from a public database

The ‘scrapy’ package in Python (v3.6) was used to collect 232,288 online reviews (mean = 922, min = 6, max = 5343) from RateBeer, an online beer review database (for package information, see Supplementary Table S8). Each review entry comprised 5 numerical scores (appearance, aroma, taste, palate and overall quality) and an optional review text. The total number of reviews per reviewer was collected separately. Numerical scores were scaled and centered per rater, and mean scores were calculated per beer.

For the review texts, the language was estimated using the packages ‘langdetect’ and ‘langid’ in Python. Reviews that were classified as English by both packages were kept. Reviewers with fewer than 100 entries overall were discarded. 181,025 reviews from >6000 reviewers from >40 countries remained. Text processing was done using the ‘nltk’ package in Python. Texts were corrected for slang and misspellings; proper nouns and rare words relevant to the beer context were specified and kept as-is (‘Chimay’, ‘Lambic’, etc.). A dictionary of semantically similar sensorial terms, for example ‘floral’ and ‘flower’, was created, and such terms were collapsed into a single term. Words were stemmed and lemmatized to avoid identifying words such as ‘acid’ and ‘acidity’ as separate terms. Numbers and punctuation were removed.
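
A sketch of this double language filter, using the two packages named above (the error handling is our own addition, not the study’s code):

```python
# Keep a review only when both detectors independently classify it as
# English; the try/except guards against empty or undetectable texts.
import langid                    # pip install langid
from langdetect import detect    # pip install langdetect

def is_english(text: str) -> bool:
    try:
        return detect(text) == "en" and langid.classify(text)[0] == "en"
    except Exception:
        return False
```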

Sentences from up to 50 randomly chosen reviews per beer were manually categorized according to the aspect of beer they describe (appearance, aroma, taste, palate, overall quality—not to be confused with the 5 numerical scores described above) or flagged as irrelevant if they contained no useful information. If a beer contained fewer than 50 reviews, all reviews were manually classified. This labeled data set was used to train a model that classified the rest of the sentences for all beers 91 . Sentences describing taste and aroma were extracted, and term frequency–inverse document frequency (TFIDF) was implemented to calculate enrichment scores for sensorial words per beer.
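
As a sketch, the TF-IDF enrichment step could be implemented with scikit-learn; `sentences_per_beer` (a list of taste/aroma sentences per beer) and the `sensory_terms` lexicon are hypothetical names for the inputs described above.

```python
# TF-IDF enrichment of sensory terms per beer: one "document" per beer,
# built from its taste/aroma sentences, scored against a flavor lexicon.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [" ".join(sents) for sents in sentences_per_beer]
tfidf = TfidfVectorizer(vocabulary=sensory_terms)
enrichment = tfidf.fit_transform(docs)   # beers × sensory-terms matrix
```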

The sex of the tasting subject was not considered when building our sensory database. Instead, results from different panelists were averaged, both for our trained panel (56% male, 44% female) and the RateBeer reviews (70% male, 30% female for RateBeer as a whole).

Beer price collection and processing

Beer prices were collected from the following stores: Colruyt, Delhaize, Total Wine, BeerHawk, The Belgian Beer Shop, The Belgian Shop, and Beer of Belgium. Where applicable, prices were converted to Euros and normalized per liter. Spearman correlations were calculated between these prices and mean overall appreciation scores from RateBeer and the taste panel, respectively.

Pairwise Spearman Rank correlations were calculated between all sensory properties.

Machine learning models

Predictive modeling of sensory profiles from chemical data.

Regression models were constructed to predict (a) trained panel scores for beer flavors and quality from beer chemical profiles and (b) public reviews’ appreciation scores from beer chemical profiles. Z-scores were used to represent sensory attributes in both data sets. Chemical properties with log-normal distributions (Shapiro-Wilk test, p < 0.05) were log-transformed. Missing chemical measurements (0.1% of all data) were replaced with mean values per attribute. Observations from 250 beers were randomly separated into a training set (70%, 175 beers) and a test set (30%, 75 beers), stratified per beer style. Chemical measurements (p = 231 parameters) were normalized based on the training set average and standard deviation. In total, ten models were trained: three linear regression-based models (linear regression with first-order interaction terms (LR), lasso regression with first-order interaction terms (Lasso), and partial least squares regression (PLSR)); five decision tree models (AdaBoost regressor (ABR), Extra Trees (ET), Gradient Boosting regressor (GBR), Random Forest (RF), and XGBoost regressor (XGBR)); one support vector regression model (SVR); and one artificial neural network model (ANN). The models were implemented using the ‘scikit-learn’ package (v1.2.2) and ‘xgboost’ package (v1.7.3) in Python (v3.9.16). Models were trained, and hyperparameters optimized, using five-fold cross-validated grid search with the coefficient of determination (R²) as the evaluation metric. The ANN (scikit-learn’s MLPRegressor) was optimized using Bayesian Tree-Structured Parzen Estimator optimization with the ‘Optuna’ Python package (v3.2.0). Individual models were trained per attribute, and a multi-output model was trained on all attributes simultaneously.
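
The sketch below illustrates this tuning scheme for one tree model and the ANN; the parameter grid and Optuna search space are invented placeholders, not the grids used in the study, and `X_train`/`y_train` are assumed from the split above.

```python
# Illustrative tuning setup: five-fold cross-validated grid search (R²)
# for the GBR, and a TPE-based Optuna study for the MLPRegressor ANN.
import optuna
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neural_network import MLPRegressor

grid = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid={"n_estimators": [100, 300],
                "learning_rate": [0.03, 0.1],
                "max_depth": [2, 3, 5]},
    scoring="r2", cv=5)
grid.fit(X_train, y_train)

def objective(trial):
    ann = MLPRegressor(
        hidden_layer_sizes=(trial.suggest_int("width", 16, 256),),
        alpha=trial.suggest_float("alpha", 1e-5, 1e-1, log=True),
        max_iter=2000)
    return cross_val_score(ann, X_train, y_train, scoring="r2", cv=5).mean()

study = optuna.create_study(direction="maximize")  # TPE sampler by default
study.optimize(objective, n_trials=50)
```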

Model dissection

GBR was found to outperform the other methods, resulting in models with the highest average R² values in both the trained panel and public review datasets. Impurity-based rankings of the most important predictors for each predicted sensorial trait were obtained using the ‘scikit-learn’ package. To observe the relationships between these chemical properties and their predicted targets, partial dependence plots (PDP) were constructed for the six most important predictors of consumer appreciation 74, 75.
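
A minimal sketch of both steps with scikit-learn, assuming a fitted single-output `gbr` model and the training DataFrame `X_train`:

```python
# Rank features by mean decrease in impurity, then draw one-way partial
# dependence for the six top-ranked predictors.
import numpy as np
from sklearn.inspection import PartialDependenceDisplay

ranked = np.argsort(gbr.feature_importances_)[::-1]
top6 = [X_train.columns[i] for i in ranked[:6]]
PartialDependenceDisplay.from_estimator(gbr, X_train, features=top6)
```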

The ‘SHAP’ package in Python (v0.41.0) was implemented to provide an alternative ranking of predictor importance and to visualize the predictors’ effects as a function of their concentration 68 .
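
A corresponding sketch of the SHAP computation, again assuming the fitted `gbr` and a test-set DataFrame `X_test`:

```python
# TreeExplainer on the fitted GBR, then the beeswarm summary of the 15
# largest contributors (cf. Fig. 4B).
import shap

explainer = shap.TreeExplainer(gbr)
shap_values = explainer.shap_values(X_test)  # per-sample contributions
shap.summary_plot(shap_values, X_test, max_display=15)
```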

Validation of causal chemical properties

To validate the effects of the most important model features on predicted sensory attributes, beers were spiked with the chemical compounds identified by the models and descriptive sensory analyses were carried out according to the American Society of Brewing Chemists (ASBC) protocol 90 .

Compound spiking was done 30 min before tasting. Compounds were spiked into fresh beer bottles, which were immediately resealed and inverted three times. Fresh bottles of beer were opened for the same duration, resealed, and inverted three times to serve as controls. Pairs of spiked samples and controls were served simultaneously, chilled and in dark glasses as outlined in the Trained panel section above. Tasters were instructed to select the glass with the higher flavor intensity for each attribute (directional difference test 92) and to select the glass they preferred.

The final concentration after spiking was equal to the within-style average, after normalizing by ethanol concentration. This was done to ensure balanced flavor profiles in the final spiked beer. The same methods were applied to improve a non-alcoholic beer. Compounds were the following: ethyl acetate (Merck KGaA, W241415), ethyl hexanoate (Merck KGaA, W243906), isoamyl acetate (Merck KGaA, W205508), phenethyl acetate (Merck KGaA, W285706), ethanol (96%, Colruyt), glycerol (Merck KGaA, W252506), lactic acid (Merck KGaA, 261106).

Significant differences in preference or perceived intensity were determined by performing the two-sided binomial test on each attribute.
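
For illustration, this test reduces to a few lines with scipy; the counts below are invented, not experimental results.

```python
# Two-sided binomial test per attribute: did significantly more tasters
# pick the spiked sample than expected by chance (p = 0.5)?
from scipy.stats import binomtest

n_tasters, n_chose_spiked = 20, 16   # hypothetical counts
result = binomtest(n_chose_spiked, n_tasters, p=0.5, alternative="two-sided")
print(f"p = {result.pvalue:.4f}")
```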

Reporting summary

Further information on research design is available in the  Nature Portfolio Reporting Summary linked to this article.

Data availability

The data that support the findings of this work are available in the Supplementary Data files and have been deposited to Zenodo under accession code 10653704 93. The RateBeer scores data are under restricted access; they are not publicly available as they are the property of RateBeer (ZX Ventures, USA). Access can be obtained from the authors upon reasonable request and with permission of RateBeer (ZX Ventures, USA). Source data are provided with this paper.

Code availability

The code for training the machine learning models, analyzing the models, and generating the figures has been deposited to Zenodo under accession code 10653704 93 .

References

Tieman, D. et al. A chemical genetic roadmap to improved tomato flavor. Science 355, 391–394 (2017).

Plutowska, B. & Wardencki, W. Application of gas chromatography–olfactometry (GC–O) in analysis and quality assessment of alcoholic beverages – A review. Food Chem. 107 , 449–463 (2008).

Legin, A., Rudnitskaya, A., Seleznev, B. & Vlasov, Y. Electronic tongue for quality assessment of ethanol, vodka and eau-de-vie. Anal. Chim. Acta 534 , 129–135 (2005).

Loutfi, A., Coradeschi, S., Mani, G. K., Shankar, P. & Rayappan, J. B. B. Electronic noses for food quality: A review. J. Food Eng. 144 , 103–111 (2015).

Ahn, Y.-Y., Ahnert, S. E., Bagrow, J. P. & Barabási, A.-L. Flavor network and the principles of food pairing. Sci. Rep. 1 , 196 (2011).

Bartoshuk, L. M. & Klee, H. J. Better fruits and vegetables through sensory analysis. Curr. Biol. 23 , R374–R378 (2013).

Piggott, J. R. Design questions in sensory and consumer science. Food Qual. Prefer. 3293 , 217–220 (1995).

Kermit, M. & Lengard, V. Assessing the performance of a sensory panel-panellist monitoring and tracking. J. Chemom. 19 , 154–161 (2005).

Cook, D. J., Hollowood, T. A., Linforth, R. S. T. & Taylor, A. J. Correlating instrumental measurements of texture and flavour release with human perception. Int. J. Food Sci. Technol. 40 , 631–641 (2005).

Chinchanachokchai, S., Thontirawong, P. & Chinchanachokchai, P. A tale of two recommender systems: The moderating role of consumer expertise on artificial intelligence based product recommendations. J. Retail. Consum. Serv. 61 , 1–12 (2021).

Ross, C. F. Sensory science at the human-machine interface. Trends Food Sci. Technol. 20 , 63–72 (2009).

Chambers, E. IV & Koppel, K. Associations of volatile compounds with sensory aroma and flavor: The complex nature of flavor. Molecules 18 , 4887–4905 (2013).

Pinu, F. R. Metabolomics—The new frontier in food safety and quality research. Food Res. Int. 72 , 80–81 (2015).

Danezis, G. P., Tsagkaris, A. S., Brusic, V. & Georgiou, C. A. Food authentication: state of the art and prospects. Curr. Opin. Food Sci. 10 , 22–31 (2016).

Shepherd, G. M. Smell images and the flavour system in the human brain. Nature 444 , 316–321 (2006).

Meilgaard, M. C. Prediction of flavor differences between beers from their chemical composition. J. Agric. Food Chem. 30 , 1009–1017 (1982).

Xu, L. et al. Widespread receptor-driven modulation in peripheral olfactory coding. Science 368 , eaaz5390 (2020).

Kupferschmidt, K. Following the flavor. Science 340 , 808–809 (2013).

Billesbølle, C. B. et al. Structural basis of odorant recognition by a human odorant receptor. Nature 615 , 742–749 (2023).

Smith, B. Perspective: Complexities of flavour. Nature 486 , S6–S6 (2012).

Pfister, P. et al. Odorant receptor inhibition is fundamental to odor encoding. Curr. Biol. 30 , 2574–2587 (2020).

Moskowitz, H. W., Kumaraiah, V., Sharma, K. N., Jacobs, H. L. & Sharma, S. D. Cross-cultural differences in simple taste preferences. Science 190 , 1217–1218 (1975).

Eriksson, N. et al. A genetic variant near olfactory receptor genes influences cilantro preference. Flavour 1 , 22 (2012).

Ferdenzi, C. et al. Variability of affective responses to odors: Culture, gender, and olfactory knowledge. Chem. Senses 38 , 175–186 (2013).

Lawless, H. T. & Heymann, H. Sensory evaluation of food: Principles and practices. (Springer, New York, NY). https://doi.org/10.1007/978-1-4419-6488-5 (2010).

Colantonio, V. et al. Metabolomic selection for enhanced fruit flavor. Proc. Natl. Acad. Sci. 119 , e2115865119 (2022).

Fritz, F., Preissner, R. & Banerjee, P. VirtualTaste: a web server for the prediction of organoleptic properties of chemical compounds. Nucleic Acids Res 49 , W679–W684 (2021).

Tuwani, R., Wadhwa, S. & Bagler, G. BitterSweet: Building machine learning models for predicting the bitter and sweet taste of small molecules. Sci. Rep. 9 , 1–13 (2019).

Dagan-Wiener, A. et al. Bitter or not? BitterPredict, a tool for predicting taste from chemical structure. Sci. Rep. 7 , 1–13 (2017).

Pallante, L. et al. Toward a general and interpretable umami taste predictor using a multi-objective machine learning approach. Sci. Rep. 12 , 1–11 (2022).

Malavolta, M. et al. A survey on computational taste predictors. Eur. Food Res. Technol. 248 , 2215–2235 (2022).

Lee, B. K. et al. A principal odor map unifies diverse tasks in olfactory perception. Science 381 , 999–1006 (2023).

Mayhew, E. J. et al. Transport features predict if a molecule is odorous. Proc. Natl. Acad. Sci. 119 , e2116576119 (2022).

Niu, Y. et al. Sensory evaluation of the synergism among ester odorants in light aroma-type liquor by odor threshold, aroma intensity and flash GC electronic nose. Food Res. Int. 113 , 102–114 (2018).

Yu, P., Low, M. Y. & Zhou, W. Design of experiments and regression modelling in food flavour and sensory analysis: A review. Trends Food Sci. Technol. 71 , 202–215 (2018).

Oladokun, O. et al. The impact of hop bitter acid and polyphenol profiles on the perceived bitterness of beer. Food Chem. 205 , 212–220 (2016).

Linforth, R., Cabannes, M., Hewson, L., Yang, N. & Taylor, A. Effect of fat content on flavor delivery during consumption: An in vivo model. J. Agric. Food Chem. 58 , 6905–6911 (2010).

Guo, S., Na Jom, K. & Ge, Y. Influence of roasting condition on flavor profile of sunflower seeds: A flavoromics approach. Sci. Rep. 9 , 11295 (2019).

Ren, Q. et al. The changes of microbial community and flavor compound in the fermentation process of Chinese rice wine using Fagopyrum tataricum grain as feedstock. Sci. Rep. 9 , 3365 (2019).

Hastie, T., Friedman, J. & Tibshirani, R. The Elements of Statistical Learning. (Springer, New York, NY). https://doi.org/10.1007/978-0-387-21606-5 (2001).

Dietz, C., Cook, D., Huismann, M., Wilson, C. & Ford, R. The multisensory perception of hop essential oil: a review. J. Inst. Brew. 126 , 320–342 (2020).

Roncoroni, M. & Verstrepen, K. J. Belgian Beer: Tested and Tasted. (Lannoo, 2018).

Meilgaard, M. C. Flavor chemistry of beer: Part II: Flavor and threshold of 239 aroma volatiles. Master Brew. Assoc. Am. Tech. Q. 12 (1975).

Bokulich, N. A. & Bamforth, C. W. The microbiology of malting and brewing. Microbiol. Mol. Biol. Rev. MMBR 77 , 157–172 (2013).

Dzialo, M. C., Park, R., Steensels, J., Lievens, B. & Verstrepen, K. J. Physiology, ecology and industrial applications of aroma formation in yeast. FEMS Microbiol. Rev. 41 , S95–S128 (2017).

Datta, A. et al. Computer-aided food engineering. Nat. Food 3 , 894–904 (2022).

American Society of Brewing Chemists. Beer Methods. (American Society of Brewing Chemists, St. Paul, MN, U.S.A.).

Olaniran, A. O., Hiralal, L., Mokoena, M. P. & Pillay, B. Flavour-active volatile compounds in beer: production, regulation and control. J. Inst. Brew. 123 , 13–23 (2017).

Verstrepen, K. J. et al. Flavor-active esters: Adding fruitiness to beer. J. Biosci. Bioeng. 96 , 110–118 (2003).

Meilgaard, M. C. Flavour chemistry of beer. part I: flavour interaction between principal volatiles. Master Brew. Assoc. Am. Tech. Q 12 , 107–117 (1975).

Briggs, D. E., Boulton, C. A., Brookes, P. A. & Stevens, R. Brewing 227–254. (Woodhead Publishing). https://doi.org/10.1533/9781855739062.227 (2004).

Bossaert, S., Crauwels, S., De Rouck, G. & Lievens, B. The power of sour - A review: Old traditions, new opportunities. BrewingScience 72 , 78–88 (2019).

Verstrepen, K. J. et al. Flavor active esters: Adding fruitiness to beer. J. Biosci. Bioeng. 96 , 110–118 (2003).

Snauwaert, I. et al. Microbial diversity and metabolite composition of Belgian red-brown acidic ales. Int. J. Food Microbiol. 221 , 1–11 (2016).

Spitaels, F. et al. The microbial diversity of traditional spontaneously fermented lambic beer. PLoS ONE 9 , e95384 (2014).

Blanco, C. A., Andrés-Iglesias, C. & Montero, O. Low-alcohol Beers: Flavor Compounds, Defects, and Improvement Strategies. Crit. Rev. Food Sci. Nutr. 56 , 1379–1388 (2016).

Jackowski, M. & Trusek, A. Non-alcoholic beer production – an overview. Pol. J. Chem. Technol. 20, 32–38 (2018).

Takoi, K. et al. The contribution of geraniol metabolism to the citrus flavour of beer: Synergy of geraniol and β-citronellol under coexistence with excess linalool. J. Inst. Brew. 116 , 251–260 (2010).

Kroeze, J. H. & Bartoshuk, L. M. Bitterness suppression as revealed by split-tongue taste stimulation in humans. Physiol. Behav. 35 , 779–783 (1985).

Mennella, J. A. et al. “A spoonful of sugar helps the medicine go down”: Bitter masking by sucrose among children and adults. Chem. Senses 40, 17–25 (2015).

Wietstock, P., Kunz, T., Perreira, F. & Methner, F.-J. Metal chelation behavior of hop acids in buffered model systems. BrewingScience 69 , 56–63 (2016).

Sancho, D., Blanco, C. A., Caballero, I. & Pascual, A. Free iron in pale, dark and alcohol-free commercial lager beers. J. Sci. Food Agric. 91 , 1142–1147 (2011).

Rodrigues, H. & Parr, W. V. Contribution of cross-cultural studies to understanding wine appreciation: A review. Food Res. Int. 115 , 251–258 (2019).

Korneva, E. & Blockeel, H. Towards better evaluation of multi-target regression models. in ECML PKDD 2020 Workshops (eds. Koprinska, I. et al.) 353–362 (Springer International Publishing, Cham, 2020). https://doi.org/10.1007/978-3-030-65965-3_23 .

Ares, G. Mathematical and Statistical Methods in Food Science and Technology. (Wiley, 2013).

Grinsztajn, L., Oyallon, E. & Varoquaux, G. Why do tree-based models still outperform deep learning on tabular data? Preprint at http://arxiv.org/abs/2207.08815 (2022).

Gries, S. T. Statistics for Linguistics with R: A Practical Introduction. (De Gruyter Mouton, 2021). https://doi.org/10.1515/9783110718256.

Lundberg, S. M. et al. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2 , 56–67 (2020).

Ickes, C. M. & Cadwallader, K. R. Effects of ethanol on flavor perception in alcoholic beverages. Chemosens. Percept. 10 , 119–134 (2017).

Kato, M. et al. Influence of high molecular weight polypeptides on the mouthfeel of commercial beer. J. Inst. Brew. 127 , 27–40 (2021).

Wauters, R. et al. Novel Saccharomyces cerevisiae variants slow down the accumulation of staling aldehydes and improve beer shelf-life. Food Chem. 398 , 1–11 (2023).

Li, H., Jia, S. & Zhang, W. Rapid determination of low-level sulfur compounds in beer by headspace gas chromatography with a pulsed flame photometric detector. J. Am. Soc. Brew. Chem. 66 , 188–191 (2008).

Dercksen, A., Laurens, J., Torline, P., Axcell, B. C. & Rohwer, E. Quantitative analysis of volatile sulfur compounds in beer using a membrane extraction interface. J. Am. Soc. Brew. Chem. 54 , 228–233 (1996).

Molnar, C. Interpretable Machine Learning: A Guide for Making Black-Box Models Interpretable. (2020).

Zhao, Q. & Hastie, T. Causal interpretations of black-box models. J. Bus. Econ. Stat. Publ. Am. Stat. Assoc. 39 , 272–281 (2019).

Article   MathSciNet   Google Scholar  

Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning. (Springer, 2019).

Labrado, D. et al. Identification by NMR of key compounds present in beer distillates and residual phases after dealcoholization by vacuum distillation. J. Sci. Food Agric. 100 , 3971–3978 (2020).

Lusk, L. T., Kay, S. B., Porubcan, A. & Ryder, D. S. Key olfactory cues for beer oxidation. J. Am. Soc. Brew. Chem. 70 , 257–261 (2012).

Gonzalez Viejo, C., Torrico, D. D., Dunshea, F. R. & Fuentes, S. Development of artificial neural network models to assess beer acceptability based on sensory properties using a robotic pourer: A comparative model approach to achieve an artificial intelligence system. Beverages 5 , 33 (2019).

Gonzalez Viejo, C., Fuentes, S., Torrico, D. D., Godbole, A. & Dunshea, F. R. Chemical characterization of aromas in beer and their effect on consumers liking. Food Chem. 293 , 479–485 (2019).

Gilbert, J. L. et al. Identifying breeding priorities for blueberry flavor using biochemical, sensory, and genotype by environment analyses. PLOS ONE 10 , 1–21 (2015).

Goulet, C. et al. Role of an esterase in flavor volatile variation within the tomato clade. Proc. Natl. Acad. Sci. 109 , 19009–19014 (2012).

Article   ADS   CAS   PubMed   PubMed Central   Google Scholar  

Borisov, V. et al. Deep Neural Networks and Tabular Data: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 1–21 https://doi.org/10.1109/TNNLS.2022.3229161 (2022).

Statista. Statista Consumer Market Outlook: Beer - Worldwide.

Seitz, H. K. & Stickel, F. Molecular mechanisms of alcoholmediated carcinogenesis. Nat. Rev. Cancer 7 , 599–612 (2007).

Voordeckers, K. et al. Ethanol exposure increases mutation rate through error-prone polymerases. Nat. Commun. 11 , 3664 (2020).

Goelen, T. et al. Bacterial phylogeny predicts volatile organic compound composition and olfactory response of an aphid parasitoid. Oikos 129 , 1415–1428 (2020).

Article   ADS   Google Scholar  

Reher, T. et al. Evaluation of hop (Humulus lupulus) as a repellent for the management of Drosophila suzukii. Crop Prot. 124 , 104839 (2019).

Stein, S. E. An integrated method for spectrum extraction and compound identification from gas chromatography/mass spectrometry data. J. Am. Soc. Mass Spectrom. 10 , 770–781 (1999).

American Society of Brewing Chemists. Sensory Analysis Methods. (American Society of Brewing Chemists, St. Paul, MN, U.S.A., 1992).

McAuley, J., Leskovec, J. & Jurafsky, D. Learning Attitudes and Attributes from Multi-Aspect Reviews. Preprint at https://doi.org/10.48550/arXiv.1210.3926 (2012).

Meilgaard, M. C., Carr, B. T. & Carr, B. T. Sensory Evaluation Techniques. (CRC Press, Boca Raton). https://doi.org/10.1201/b16452 (2014).

Schreurs, M. et al. Data from: Predicting and improving complex beer flavor through machine learning. Zenodo https://doi.org/10.5281/zenodo.10653704 (2024).

Download references

Acknowledgements

We thank all lab members for their discussions and thank all tasting panel members for their contributions. Special thanks go out to Dr. Karin Voordeckers for her tremendous help in proofreading and improving the manuscript. M.S. was supported by a Baillet-Latour fellowship, L.C. acknowledges financial support from KU Leuven (C16/17/006), F.A.T. was supported by a PhD fellowship from FWO (1S08821N). Research in the lab of K.J.V. is supported by KU Leuven, FWO, VIB, VLAIO and the Brewing Science Serves Health Fund. Research in the lab of T.W. is supported by FWO (G.0A51.15) and KU Leuven (C16/17/006).

Author information

These authors contributed equally: Michiel Schreurs, Supinya Piampongsant, Miguel Roncoroni.

Authors and Affiliations

VIB—KU Leuven Center for Microbiology, Gaston Geenslaan 1, B-3001, Leuven, Belgium

Michiel Schreurs, Supinya Piampongsant, Miguel Roncoroni, Lloyd Cool, Beatriz Herrera-Malaver, Florian A. Theßeling & Kevin J. Verstrepen

CMPG Laboratory of Genetics and Genomics, KU Leuven, Gaston Geenslaan 1, B-3001, Leuven, Belgium

Leuven Institute for Beer Research (LIBR), Gaston Geenslaan 1, B-3001, Leuven, Belgium

Laboratory of Socioecology and Social Evolution, KU Leuven, Naamsestraat 59, B-3000, Leuven, Belgium

Lloyd Cool, Christophe Vanderaa & Tom Wenseleers

VIB Bioinformatics Core, VIB, Rijvisschestraat 120, B-9052, Ghent, Belgium

Łukasz Kreft & Alexander Botzki

AB InBev SA/NV, Brouwerijplein 1, B-3000, Leuven, Belgium

Philippe Malcorps & Luk Daenen


Contributions

S.P., M.S. and K.J.V. conceived the experiments. S.P., M.S. and K.J.V. designed the experiments. S.P., M.S., M.R., B.H. and F.A.T. performed the experiments. S.P., M.S., L.C., C.V., L.K., A.B., P.M., L.D., T.W. and K.J.V. contributed analysis ideas. S.P., M.S., L.C., C.V., T.W. and K.J.V. analyzed the data. All authors contributed to writing the manuscript.

Corresponding author

Correspondence to Kevin J. Verstrepen.

Ethics declarations

Competing interests

K.J.V. is affiliated with bar.on. The other authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Florian Bauer, Andrew John Macintosh and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

  • Peer Review File
  • Description of Additional Supplementary Files
  • Supplementary Data 1–7
  • Reporting Summary
  • Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Schreurs, M., Piampongsant, S., Roncoroni, M. et al. Predicting and improving complex beer flavor through machine learning. Nat Commun 15, 2368 (2024). https://doi.org/10.1038/s41467-024-46346-0


Received: 30 October 2023

Accepted: 21 February 2024

Published: 26 March 2024

DOI: https://doi.org/10.1038/s41467-024-46346-0


Title: DISL: Fueling Research with a Large Dataset of Solidity Smart Contracts

Abstract: The DISL dataset features a collection of 514,506 unique Solidity files that have been deployed to Ethereum mainnet. It caters to the need for a large and diverse dataset of real-world smart contracts. DISL serves as a resource for developing machine learning systems and for benchmarking software engineering tools designed for smart contracts. By aggregating every verified smart contract from Etherscan up to January 15, 2024, DISL surpasses existing datasets in size and recency.
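As a concrete illustration of how such a corpus might be consumed for machine-learning work, the sketch below streams records with the Hugging Face datasets library. It is a minimal sketch under stated assumptions: the hub identifier ASSERT-KTH/DISL, the split name, and the source_code field are hypothetical placeholders inferred from the abstract, not details it confirms.

    # Minimal sketch (Python): streaming a large Solidity corpus for ML use.
    # ASSUMPTIONS: the hub id "ASSERT-KTH/DISL", the "train" split, and the
    # "source_code" field are hypothetical; check the dataset card for the
    # real identifiers before running.
    from datasets import load_dataset

    ds = load_dataset("ASSERT-KTH/DISL", split="train", streaming=True)
    for i, record in enumerate(ds):
        src = record.get("source_code") or ""
        # Keep only files that declare a compiler version, as a cheap sanity filter.
        if "pragma solidity" in src:
            print(src.splitlines()[0])
        if i >= 4:  # peek at a handful of records rather than all ~514k files
            break

Streaming avoids materializing half a million source files on disk; for benchmarking tools rather than training, a filtered and deduplicated snapshot would be the more reproducible choice.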




COMMENTS

  1. Scientific Research and Big Data

    Scientific Research and Big Data. First published Fri May 29, 2020. Big Data promises to revolutionise the production of knowledge within and beyond science, by enabling novel, highly efficient ways to plan, conduct, disseminate and assess research. The last few decades have witnessed the creation of novel ways to produce, store, and analyse ...

  2. Full article: Big data for scientific research and discovery

    Huadong Guo. With data volumes expanding beyond the petabyte and exabyte levels across many scientific disciplines, the role of big data in scientific research is becoming increasingly apparent: massive-scale data processing has become valuable to scientific research. The term big data is not only a buzzword and a marketing tool, but also it ...

  3. Big data in Earth science: Emerging practice and promise

    The European Marine Board's Future Science Brief on Big Data in Marine Science identifies continued sensor development; infrastructure for data collection, processing, ... The sheer volume of Big Earth Data can make converting and storing research data to open scientific data formats time-consuming and financially challenging.

  4. The impact of big data on research methods in information science

    Research methods are roadmaps, techniques, and procedures employed in a study to collect data, process data, analyze data, yield findings, and draw a conclusion to achieve the research aims. To a large degree the availability, nature, and size of a dataset can affect the selection of the research methods, even the research topics.

  5. SciSciNet: A large-scale open data lake for the science of science research

    Here we present SciSciNet, a large-scale open data lake for the science of science research, covering over 134M scientific publications and millions of external linkages to funding and public uses.

  6. Big Data Research

    The journal aims to promote and communicate advances in big data research by providing a fast and high quality forum for researchers, practitioners and policy makers from the very many different communities working on, and with, this topic. The journal will accept papers on foundational aspects in dealing with big data, as well as papers on ...

  7. A review of big data and medical research

    In this descriptive review, we highlight the roles of big data, the changing research paradigm, and easy access to research participation via the Internet, fueled by the need for quick answers. Universally, data volume has increased, with the collection rate doubling every 40 months ever since the 1980s (a worked version of this growth rate appears after this list). The big data age, starting in 2002 ...

  8. Big data: Historic advances and emerging trends in biomedical research

    Abstract. Big data is transforming biomedical research by integrating massive amounts of data from laboratory experiments, clinical investigations, healthcare records, and the internet of things. Specifically, the increasing rate at which information is obtained from omics technologies (genomics, epigenomics, transcriptomics, proteomics ...

  9. Home page

    The Journal of Big Data publishes open-access original research on data science and data analytics. Deep learning algorithms and all applications of big data are welcomed. Survey papers and case studies are also considered. The journal examines the challenges facing big data today and going forward including, but not limited to: data capture ...

  10. Moving back to the future of big data-driven research: reflecting on

    While becoming part of our lives, the data collected about individuals in the form of big data is transferred between academic and non-academic research, scientific and commercial enterprises.

  11. Big Data, new epistemologies and paradigm shifts

    Whilst the epistemologies of Big Data empiricism and data-driven science seem set to transform the approach to research taken in the natural, life, physical and engineering sciences, their trajectory in the humanities and social sciences is less certain. ... With respect to the sciences, access to Big Data and new research praxes has led some ...

  12. Applying big data beyond small problems in climate research

    Big data could have big potential in various scientific disciplines, including climate research, but it remains unclear what questions big data can potentially help to answer. The usefulness ...

  13. Big data for scientific research and discovery

    actions are: (1) produce case studies in big data for international scientific programs; (2) promote sharing of big data solutions across scientific disciplines; (3) research policy, ethical and legal issues for big data; and (4) research stewardship and sustainability challenges for big data by establishing Working Groups on big data for ...

  14. Notes to Scientific Research and Big Data

    Notes to Scientific Research and Big Data. 1. When a data collection can or should be regarded as "big data", and the significance of this particular label for research, is discussed at length in Leonelli (2016), Kitchin and McArdle (2016) and Aronova, van Oertzen, and Sepkoski (2017). 2. This understanding of scientific knowledge is also ...

  15. Big Data in Science and Healthcare: A Review of Recent Literature and

    Recognizing, understanding, and using Big Data in terms of scientific research and healthcare are necessary at this time in order to arrive at best evidence in a world of ever increasing data. Further investigation into the limitations of Big Data, such as inconsistencies regarding standards, policy, ethics, gaps in structured databases and ...

  16. Big data: new methods and ideas in geological scientific research

    Big data is not a straightforward rejection and denial of traditional research; on the contrary, big data technology is introduced into traditional research to advance progress in traditional scientific research. Currently, the knowledge structure used in geology is unreasonable. Big data research is a cross-disciplinary field that requires two ...

  17. Data science: a game changer for science and innovation

    This paper shows data science's potential for disruptive innovation in science, industry, policy, and people's lives. We present how data science impacts science and society at large in the coming years, including ethical problems in managing human behavior data and considering the quantitative expectations of data science economic impact. We introduce concepts such as open science and e ...

  18. Reproducibility and Scientific Integrity of Big Data Research in Urban

    Big data science has brought on new challenges, to which the scientific community needs to adapt by applying adequate ethical, methodological, ... Big data research presents a unique opportunity for a cultural shift in the way public-health research is conducted today. At the same time, big data use will only result in a beneficial impact to ...

  19. Economics in the age of big data

    Economic science has evolved over several decades toward greater emphasis on empirical work. The data revolution of the past decade is likely to have a further and profound effect on economic research. Increasingly, economists make use of newly available large-scale administrative data or private sector data that often are obtained through ...

  20. Scientific research and big data

    A 2022 paper that discusses risks associated with collecting large amounts of data for training neural networks, the dilemma of the strong opacity of these systems, and issues related to the misuse of already-trained neural networks, as exemplified by the recent proliferation of deepfakes.

  21. Research Areas

    Research Areas. The world is being transformed by data and data-driven analysis is rapidly becoming an integral part of science and society. Stanford Data Science is a collaborative effort across many departments in all seven schools. We strive to unite existing data science research initiatives and create interdisciplinary collaborations ...

  22. Big data in digital healthcare: lessons learnt and ...

    Big Data initiatives in the United Kingdom. The UK Biobank is a prospective cohort initiative that is composed of individuals between the ages of 40 and 69 before disease onset (Allen et al. 2012 ...

  23. Do scientists respond faster than Google trends in discussing ...

    A study in Health Data Science introduces an advanced research framework to dissect the vast textual landscape surrounding COVID-19. This methodology leverages keywords from Google Trends ...

  24. Big Data Research

    2020 — Volumes 19-22. 2019 — Volumes 15-18. 2018 — Volumes 11-14. 2017 — Volumes 7-10. 2016 — Volumes 3-6. 2015 — Volume 2. 2014 — Volume 1. Read the latest articles of Big Data Research at ScienceDirect.com, Elsevier's leading platform of peer-reviewed scholarly literature.

  25. Evidence

    The current warming trend is different because it is clearly the result of human activities since the mid-1800s, and is proceeding at a rate not seen over many recent millennia. It is undeniable that human activities have produced the atmospheric gases that have trapped more of the Sun's energy in the Earth system. This extra energy has warmed the atmosphere, ocean, and land, and ...

  26. Dr. Craig Stark takes UCI's women's health research to new heights with

    As we celebrate Women's History Month this March, we also find ourselves at a historic moment in scientific inquiry for women's health research. Today, nearly two-thirds of Americans with Alzheimer's disease are women, but the underlying cause of this sex disparity is still poorly understood. For decades, research focusing on women's health has been inadequate, with a mere 0.5% of all ...

  27. New Research Highlights GenAI as an Increasing Priority for Analytics

    insightsoftware, the comprehensive provider of solutions for the Office of the CFO, released Embedded Analytics Insights for 2024, a research report in partnership with Hanover Research uncovering the embedded analytics priorities, trends, and challenges of modern developer teams.The data highlights why developers decide to build or buy analytics functionality and desired analytics features ...

  28. Predicting and improving complex beer flavor through machine ...

    Research in the lab of K.J.V. is supported by KU Leuven, FWO, VIB, VLAIO and the Brewing Science Serves Health Fund. Research in the lab of T.W. is supported by FWO (G.0A51.15) and KU Leuven (C16 ...

  29. [2403.16861] DISL: Fueling Research with A Large Dataset of Solidity

    The DISL dataset features a collection of 514,506 unique Solidity files that have been deployed to Ethereum mainnet. It caters to the need for a large and diverse dataset of real-world smart contracts. DISL serves as a resource for developing machine learning systems and for benchmarking software engineering tools designed for smart contracts. By aggregating every verified smart contract ...
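Comment 7's figure of data volume doubling every 40 months is easier to interpret as an annual rate. The short calculation below, a worked example rather than anything taken from the cited review, converts the doubling time into per-year and per-decade growth factors.

    # Worked example for comment 7: convert a 40-month doubling time into
    # annual and per-decade growth factors.
    doubling_months = 40
    annual_factor = 2 ** (12 / doubling_months)   # ~1.23, i.e. about 23% per year
    decade_factor = 2 ** (120 / doubling_months)  # = 8, an eightfold rise per decade
    print(f"per year: x{annual_factor:.2f}, per decade: x{decade_factor:.0f}")

So a collection rate that doubles every 40 months grows roughly 23% per year and eightfold per decade.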