
Data-Driven Requirements Elicitation: A Systematic Literature Review

  • Review Article
  • Open access
  • Published: 04 January 2021
  • Volume 2, article number 16 (2021)


  • Sachiko Lim,
  • Aron Henriksson &
  • Jelena Zdravkovic

Abstract

Requirements engineering has traditionally been stakeholder-driven. In addition to domain knowledge, widespread digitalization has led to the generation of vast amounts of data (Big Data) from heterogeneous digital sources such as the Internet of Things (IoT), mobile devices, and social networks. The digital transformation has spawned new opportunities to consider such data as potentially valuable sources of requirements, although they are not intentionally created for requirements elicitation. A challenge to data-driven requirements engineering concerns the lack of methods to facilitate seamless and autonomous requirements elicitation from such dynamic and unintended digital sources. There are numerous challenges in processing the data effectively so that it can be fully exploited in organizations. This article, thus, reviews the current state-of-the-art approaches to data-driven requirements elicitation from dynamic data sources and identifies research gaps. We obtained 1848 hits when searching six electronic databases. Through a two-level screening and a complementary forward and backward reference search, 68 papers were selected for final analysis. The results reveal that existing automated requirements elicitation primarily focuses on utilizing human-sourced data, especially online reviews, as requirements sources, and on supervised machine learning for data processing. The outcomes of automated requirements elicitation often amount to mere identification and classification of requirements-related information, or to identification of features, without eliciting requirements in a ready-to-use form. This article highlights the need for developing methods to leverage process-mediated and machine-generated data for requirements elicitation, and for addressing the issues related to the variety, velocity, and volume of Big Data for efficient and effective software development and evolution.


Introduction

Requirements elicitation is one of the most critical activities in requirements engineering, which, in turn, is a major determinant of successful development of information systems [ 1 ]. In conventional requirements engineering, requirements are elicited from domain knowledge obtained from stakeholders, relying primarily on qualitative data collection methods (e.g., interviews, workshops, and focus group discussions) [ 2 ]. The ongoing digitalization of organizations and society at large—as seen, for instance, by the proliferation of e-commerce and the advent of IoT—has led to an unprecedented and increasing amount of high-velocity and heterogeneous data, which is often referred to as Big Data [ 3 ].

The digital transformation has spawned new opportunities to consider this type of dynamic data from digital sources as potentially valuable sources of requirements, in addition to domain knowledge. Harnessing both traditional and new data sources in a complementary fashion may help improve the quality of existing software systems or facilitate the development of new ones. Nevertheless, conventional elicitation techniques are often time-consuming, not sufficiently scalable for processing such fast-growing data, and incapable of considering stakeholder groups that are becoming increasingly large and global. This highlights the need for a data-driven approach to support continuous and automated requirements engineering from ever-growing amounts of data.

There have been numerous efforts to automate requirements elicitation from static data, i.e., data that are generated with a relatively low velocity and rarely updated. These efforts can be grouped according to the following three aims: (1) eliciting requirements from static domain knowledge (e.g., documents written in natural language [ 4 , 5 ], ontologies [ 6 , 7 ], and various types of models, e.g., business process models [ 8 ], UML use cases and sequence diagrams [ 9 ]), (2) performing specific requirements engineering activities based on requirements that have already been elicited (e.g., requirements prioritization [ 10 ], classification of natural language requirements [ 11 ], management of requirements traceability [ 12 ], requirements validation [ 13 ], generation of a conceptual model from natural language requirements [ 14 ]), or (3) developing tools to enhance stakeholders’ ability to perform requirements engineering activities based on static domain knowledge or existing requirements (e.g., tool-support for collaborative requirements prioritization [ 15 ] and requirements negotiation with rule-based reasoning [ 16 ]).

Several systematic reviews have been conducted on automated requirements elicitation from static domain knowledge. Meth et al. conducted a systematic review on tool support for automated requirements elicitation from domain documents written in natural language, where they analyzed and categorized the identified studies according to an analytical framework which consists of tool categories, technological concepts, and evaluation approaches [ 17 ]. Nicolás and Toval conducted a systematic review of the methods and techniques for transforming domain models (e.g., business models, UML models, and user interface models), use cases, scenarios, and user stories into textual requirements [ 18 ]. In both of these reviews, the requirements sources contained static domain knowledge.

Much less focus has been placed on eliciting requirements from dynamic data, that is, data that were not intentionally collected for the purpose of requirements elicitation. There are four main advantages to focusing on dynamic data from such “unintended” digital sources. First, dynamic data-driven requirements engineering facilitates secondary use of data, which eliminates the need for collecting data specifically for requirements engineering, in turn enhancing scalability. Second, unintended digital sources can include data relevant for new system requirements that would otherwise not be discovered, since utilizing such data sources allows for the collection of data from larger, global stakeholder groups that are beyond the reach of an organization relying on traditional elicitation methods [ 19 ]. Including such requirements, which a current software system does not support, can bring business value in the form of improved customer satisfaction, cost and time reduction, and optimized operations [ 20 ]. Third, focusing on dynamic data allows for capturing up-to-date user requirements, which in turn enables timely and effective operational decision making. Finally, dynamic data from unintended digital sources are machine-readable, which facilitates automated and continuous requirements engineering. A fitting requirements elicitation approach provides new opportunities and competitive advantages in a fast-growing market by extracting real-time business insights and knowledge from a variety of digital sources.

Crowd-based requirements engineering (CrowdRE) is a good example of an approach that has taken advantage of dynamic data from unintended digital sources. A primary focus of CrowdRE has been on eliciting requirements from explicit user feedback from crowd users (e.g., app reviews and data from social media) by applying various techniques based on machine learning and natural language processing [ 21 ]. Genc-Nayebi and Abran conducted a systematic review on opinion mining from mobile app store user reviews to identify existing solutions and challenges for mining app reviews, as well as to propose future research directions [ 22 ]. They focused on specific data-mining techniques used for review analysis, domain adaptation methods, evaluation criteria to assess the usefulness and helpfulness of the reviews, techniques for filtering out spam reviews, and application features. Martin et al. [ 26 ] surveyed studies that performed app store analysis to extract both technical and non-technical attributes for software engineering. Tavakoli et al. [ 27 ] conducted a systematic review on techniques and tools for extracting useful software development information through mobile app review mining. The aforementioned literature reviews only focus on utilizing app reviews, leaving out other types of human-sourced data that are potentially useful as requirements sources. There is also a growing interest in embracing contextual and usage data of crowd users (i.e., implicit user feedback) for requirements elicitation. This systematic review, thus, broadens the scope of previous literature reviews by considering more diverse data sources than merely app reviews for requirements elicitation.

Another relevant approach to data-driven requirements engineering is the application of process mining capabilities for requirements engineering. Process mining is an evidence-based approach to infer valuable process-related insights primarily from event logs, discovered models, and pre-defined process models. Process mining can be divided into three types: process discovery, conformance checking, and process enhancement [ 23 ]. Ghasemi and Amyot performed a systematic review on goal-oriented process mining in which the selected studies were categorized into three areas: (1) goal modeling and requirements elicitation, (2) intention mining (i.e., the discovery of intentional process models going beyond mere activity process models), and (3) key performance indicators (i.e., means for monitoring goals) [ 23 ]. Their findings indicate that the amount of research on goal-oriented process mining is still limited. In addition to explicit and implicit user feedback, as well as event logs and process models, there may be more opportunities to leverage a broader range of dynamic data sources for requirements engineering, such as sensor readings.

Zowghi and Coulin [ 24 ] performed a comprehensive survey on techniques, approaches, and tools used for requirements elicitation. However, their work exclusively focused on conventional, stakeholder-driven requirements elicitation methods. Our study instead investigates data-driven requirements elicitation. More recently, Arruda and Madhavji [ 25 ] systematically reviewed the literature on requirements engineering for developing Big Data applications. They identified the process and type of requirements needed for developing Big Data applications, identified challenges associated with requirements engineering in the context of Big Data applications, discussed the available requirements engineering solutions for the development of Big Data applications, and proposed future research directions. This study differs from their work in that we studied methods to elicit requirements from Big Data rather than eliciting requirements for Big Data applications.

To our knowledge, no systematic review has been performed with an explicit focus on automated requirements elicitation for information systems from three types of dynamic data sources: human-sourced data sources, process-mediated data sources, and machine-generated data sources. The aim of this study is, therefore, to perform a comprehensive and systematic review of the research literature on existing state-of-the-art methods for facilitating automatic requirements elicitation for information systems driven by dynamic data from unintended digital sources.

This review may help requirements engineers and researchers understand the existing data-driven requirements elicitation techniques and the gaps that need to be addressed to facilitate data-driven requirements elicitation. These insights may provide a basis for further development of algorithms and methods to leverage the increasing availability of Big Data as requirements sources.

Definitions and Scope

In this study, dynamic data are defined as raw data available in a digital form that change frequently and have not already been analyzed or aggregated. Dynamic data certainly include but are not limited to Big Data, which in itself is challenging to define [ 28 ]. In addition to Big Data, dynamic data also include data that do not strictly meet the 4 Vs of Big Data (i.e., Volume, Variety, Veracity, and Velocity) but are still likely to contain relevant requirements-related information. Domain knowledge includes, for example, intellectual property, business documents, existing system specifications, goals, standards, conferences, and knowledge from customers or external providers.

This study excludes static domain knowledge that is less frequently created or modified and has been the primary focus of existing automated requirements engineering. Unintended digital sources are defined as sources of data generated via digital technologies that are unintended with respect to requirements elicitation. Thus, dynamic data from unintended digital sources are the digital data pulled from data sources that are created/modified frequently without the intention of eliciting requirements.

Of note is that the two terms “dynamic data” and “unintended digital source” together define the scope of this systematic review. For example, although domain documents are often created without the intention of performing requirements engineering, they are not considered to be dynamic data and, therefore, outside of the scope of this study.

Dynamic data from unintended digital sources expand on the notions of explicit and implicit user feedback as defined by Morales-Ramirez et al. [ 29 ]. In their study, user feedback is considered “a reaction of users, which roots in their perceived quality of experience”, which assumes the existence of a specific user. However, many devices collect Big Data without any interacting users, such as environmental IoT sensors that measure temperature, humidity, and pollution levels. Since we foresee the possibility of eliciting requirements from such data sources, we decided to use a term different from “implicit user feedback”. To categorize the sources of data, we used human-sourced, process-mediated, and machine-generated data, following Firmani et al. [ 30 ].

Research Questions

To achieve the aim of the study, we formulated the main research question as follows: how can requirements elicitation from dynamic data be supported through automation? The main research question has been further divided into the following sub-research questions:

RQ1: What types of dynamic data are used for automated requirements elicitation?

We focus on describing the sources of the data, but also study whether there have been attempts to integrate multiple types of data sources and whether domain knowledge has been used in addition to dynamic data.

RQ2: What types of techniques and technologies are used for automating requirements elicitation?

We are interested in learning which underlying techniques and technologies are used in the proposed methods, as well as how they are put together and evaluated.

RQ3: What are the outcomes of automated requirements elicitation?

We assess how far the proposed methods go in automating requirements elicitation, the form of the outputs generated by the data-driven elicitation method, and what types of requirements are elicited.

This systematic review will advance scientific knowledge on data-driven requirements engineering for continuous system development and evolution by (1) providing a holistic analysis of the state-of-the-art methods that support automatic requirements elicitation from dynamic data, (2) identifying associated research gaps, and (3) providing directions for future research. The paper is structured as follows: the second section presents the research methods used in our study; the third section presents an overview of the selected studies and the results based on our analytical framework; the fourth section provides a detailed analysis and discussion of each component of the analytical framework; the fifth section describes potential threats to validity; finally, the last section concludes the paper and suggests directions for future work.

Research Methods

A systematic literature review aims to answer a specific research question using systematic methods to consolidate all relevant evidence that meets pre-defined eligibility criteria [ 3 ]. It consists of three main phases: planning, conducting, and reporting the review. The main activities of the planning phase are problem formulation and protocol development. Before the actual review process started, we formulated research questions. The study protocol was then developed, conforming to the guidelines for systematic literature reviews proposed by Kitchenham and Charters [ 31 ]. The protocol included the following: background, the aim of the study, research questions, selection criteria, data sources (i.e., electronic databases), search strategy, data collection, data synthesis, and the timeline of the study. The protocol was approved by the research group, which consists of the first author and two research experts: one expert in requirements engineering and one expert in data science. The actual review process takes place during the conducting phase, which includes the following activities: identifying potentially eligible studies based on title, abstract, and keywords; selecting eligible studies through full-text screening; extracting and synthesizing data that are relevant to answer the defined research question(s); performing a holistic analysis; and interpreting the findings. During the reporting phase, the synthesized findings are documented and disseminated through an appropriate channel.

Selection Criteria

Inclusion and exclusion criteria were developed to capture the most relevant articles for answering our research questions.

Inclusion Criteria

We included articles that met all the following inclusion criteria:

Requirements elicitation is supported through automation.

Requirements are elicited from digital and dynamic data sources.

Digital and dynamic data sources are created without any intention related to requirements engineering.

Changes in requirements should involve the elicitation of new requirements.

The article has been peer-reviewed.

The full text of the article is written in English.

Exclusion Criteria

We excluded articles that met at least one of the following exclusion criteria:

Requirements are elicited solely from non-dynamic data.

The proposed method is performed based on existing requirements.

The study merely presents the proposed artifact without any, or with insufficient, description of the evaluation methods.

The article is a review paper, keynote talk, or abstract of conference proceedings.

Data Sources

We performed a comprehensive search in six electronic databases (Table 1 ). In the first iteration, we searched Scopus, Web of Science, ACM Digital Library, and IEEE Xplore. Those databases were selected because they together cover the top ten information systems journals and conferences [ 17 ]. In addition, EBSCOhost and ProQuest, which are two major databases in the field of information systems, were searched to maximize the coverage of relevant publications, in line with a previous systematic review in the area [ 17 ]. ProQuest and EBSCOhost include both peer-reviewed and non-peer-reviewed articles. We, however, considered only peer-reviewed articles to be consistent with our inclusion criteria. The differences in the search field across databases are due to the different search functionalities of each electronic database.

Search Strategy

A comprehensive search strategy was developed in consultation with a librarian and the two co-authors who are experts in the fields of requirements engineering and data science, respectively. First, we extracted three key components from the first research question: requirements elicitation, automation, and Big Data sources and related analytics (Table 2 ). These components formed the basis for creating a logical search string. Big Data can refer either to data sources or to analytics/data-driven techniques to process Big Data. The term is also closely related to data-mining/machine-learning/data science/artificial intelligence techniques. We thus included keywords and synonyms that cover both Big Data sources and related analytics.

To construct a search string, keywords and synonyms within the same component were connected by OR-operators, while the key components were connected by AND-operators, meaning that at least one keyword from each component must be present. The search string was adapted to the specific syntax of each database’s search function and was iteratively tested and refined through trial searches to optimize the results.
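
As a rough illustration of this AND/OR construction, the sketch below assembles a query from three keyword blocks; the keyword terms shown are invented stand-ins, not the authors' actual terms (those are listed in Table 2).

```python
# Illustrative sketch of the search-string construction described above.
# The keyword blocks below are invented stand-ins; the actual terms are
# listed in Table 2.
elicitation = ['"requirements elicitation"', '"requirements discovery"']
automation = ['automat*', '"tool support"']
big_data = ['"big data"', '"machine learning"', '"data mining"']

def or_block(terms):
    # Synonyms within one component are connected by OR and parenthesized.
    return "(" + " OR ".join(terms) + ")"

# Components are connected by AND: at least one keyword per block must match.
query = " AND ".join(or_block(b) for b in (elicitation, automation, big_data))
print(query)
```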

Study Selection

The entire search was performed by the first author (SL). Before starting the review process, we tested a small number of articles to establish agreement and consistency among reviewers. We then conducted a pilot study in which three reviewers independently assessed 50 randomly selected papers to estimate the sample size needed to ensure a substantial level of agreement (i.e., 0.61–0.80) on Landis and Koch’s Kappa benchmark scale [ 32 ]. Each paper was screened by assessing its title, abstract, and keywords against our selection criteria (level 1 screening). During level 1 screening, articles were classified into one of three categories: (1) included, (2) excluded, or (3) uncertain. Studies that fell into categories 1 and 3 proceeded to full-text screening (level 2 screening), since the aim of the level 1 screening was to identify potentially relevant articles or those lacking sufficient information to be excluded.

After each reviewer had assessed 50 publications, we computed Fleiss’ Kappa to measure the inter-rater reliability; we did not, however, discuss the results of each reviewer’s assessment. Fleiss’ Kappa was used because there were more than two reviewers, and it was computed to be 0.786. Sample size estimation was performed following a confidence interval approach suggested by Rotondi and Donner [ 33 ]. Using 0.786 as the point estimate of Kappa and 0.61 as the expected lower bound, the required minimum sample size was estimated to be 139. The value of 0.61 was used as the lower bound of Kappa because it is the lower limit of “substantial” inter-rater reliability on the benchmark scale proposed by Landis and Koch [ 32 ], which is what we had aimed for. Since we achieved a substantial level of agreement, and since results were not discussed so as not to influence each other’s decisions, each of the three reviewers independently screened the remaining 89 randomly chosen publications based on titles, abstracts, and keywords (level 1 screening). The overall Fleiss’ Kappa for reviewing the 139 articles was 0.850, which indicates an “almost perfect” agreement according to the benchmark scale proposed by Landis and Koch [ 32 ]. Since we were able to achieve a very high inter-rater reliability, the rest of the level 1 screening was conducted by a single reviewer (SL). However, all three reviewers discussed and reached a consensus on the articles which SL classified as uncertain or could not decide on with sufficient confidence.
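
As an illustration of this reliability computation, the sketch below computes Fleiss' Kappa from a papers-by-reviewers matrix of screening decisions using statsmodels; the ratings are invented, and the choice of statsmodels is an assumption rather than the authors' reported tooling.

```python
# Minimal sketch of the inter-rater reliability computation, assuming the
# three reviewers' level-1 decisions are coded as 0 = excluded,
# 1 = included, 2 = uncertain. The ratings below are invented.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# One row per screened paper, one column per reviewer.
ratings = np.array([
    [1, 1, 1],
    [0, 0, 2],
    [0, 0, 0],
    [1, 2, 1],
    [0, 0, 0],
])

table, _ = aggregate_raters(ratings)  # papers x categories count table
print(fleiss_kappa(table, method="fleiss"))
```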

Before conducting the level 2 screening, we discussed which information should be extracted from the eligible articles. Based on the discussion, we developed a preliminary analytical framework to standardize the information to be extracted. We tested this on a small number of full-text papers and refined the data extraction form accordingly. In the level 2 screening, at least two authors reviewed the full text of each paper that had been identified in level 1 screening to assess its eligibility for the final analysis. In addition to the keyword-based search of the databases, we also performed forward/backward reference searching of all the included studies. SL extracted data from all the eligible studies, while AH and JZ split the data extraction task between them. This was done to ensure that data extracted by SL could be cross-checked by at least one of the two more experienced reviewers. Any disagreements between the two reviewers were referred to the third reviewer and resolved by consensus.

To update the search results, an additional search was performed on July 3, 2020, using the same search query and the same two-level screening process (i.e., screening based on title, abstract, and keywords, followed by full-text screening). While filtering can be performed by specifying the publication date and year in some databases, in other databases the search can only be filtered by publication year. Thus, we manually excluded the studies that had been published before the date of the initial search. However, we did not perform a backward and forward reference search during the updating phase. We then applied the same selection criteria used for the initial search to identify the relevant studies.

Analytical Framework and Data Collection

After considering the selected articles, we iteratively developed and refined an analytical framework, which covers both design and evaluation perspectives, to answer our research questions. The framework consists of three components: types of dynamic data sources used for automated requirements elicitation, techniques and technologies used for automated requirements elicitation, and the outcomes of automated requirements elicitation. Table 3 summarizes the extracted data that are associated with each component of the analytical framework. Each component of the analytical framework is described in detail below.

Types of Dynamic Data Sources Used for Automated Requirements Elicitation

To answer RQ1, we extracted the following information: (1) types of dynamic data sources, (2) types of dynamic data, (3) integration of data sources, (4) relation of dynamic data to a given organization, and (5) additional domain knowledge that is used to elicit system requirements.

Types of Dynamic Data Sources

Dynamic data sources are categorized into one or a combination of human-sourced data sources, process-mediated data sources, and machine-generated data sources [ 30 ]. This provides insights into which types of data sources have drawn the most or the least attention as potential requirements sources in the existing literature. The categorization also helps to analyze whether there exists any process pattern in the automated requirements elicitation within each data source type.

Human-sourced data sources refer to digitized records of human experiences. Examples include social media, blogs, and content from mobile phones. Process-mediated data sources are records of monitored business processes and business events, such as electronic health records, commercial transactions, banking records, and credit card payments. Machine-generated data sources are the records of fixed and mobile sensors and machines that are used to measure events and situations in the physical world. They include, for example, readings from environmental and barometric pressure sensors, outputs of medical devices, satellite image data, and location data such as RFID chip readings and GPS outputs.

Types of Dynamic Data

To understand what types of dynamic data have been used for eliciting system requirements in the existing literature, we extracted the specific types of dynamic data that were used in each of the selected studies and grouped them into eight categories: online reviews (e.g., app reviews, expert reviews, and user reviews), micro-blogs (e.g., Twitter), online discussions/forums, software repositories (e.g., issue tracking systems and GitHub), software/app product descriptions, usage data, sensor readings, and mailing lists.

Integration of Data Sources

We explored whether the study integrates multiple types of dynamic data sources (i.e., any combination of human-sourced, process-mediated, and machine-generated data sources). We classified the selected studies into “yes” if the study has used multiple dynamic data sources, otherwise into “no.”

Relation of Dynamic Data to a Given Organization

Understanding whether requirements are elicited from external or internal data sources in relation to a given organization is important for requirements engineers to identify potential sources that can bring innovations into the requirements engineering process and facilitate software evolution and development of new promising software systems. We thus classified the selected studies into “yes” if the platform is owned by the organization and “no” if it is owned by a third party.

Additional Domain Knowledge that was Used to Elicit System Requirements

We assessed whether the study uses any domain knowledge in combination with dynamic data to explore the possible ways of integrating both dynamic data and domain knowledge. The selected studies were classified into “yes,” if the study uses any domain knowledge in addition to dynamic data, otherwise classified into “no.”

Techniques Used for Automated Requirements Elicitation

To answer RQ2, the following four types of information were extracted: (1) technique(s) used for automated requirements elicitation, including process pattern of automating requirements elicitation, (2) use of aggregation/summarization, (3) use of visualization, and (4) evaluation methods.

Technique(s) Used for Automation

Implementing promising algorithms is a prerequisite for effective and efficient automation of the requirements elicitation process. To identify the state-of-the-art algorithms, specific methods that were used for automating requirements elicitation were extracted and categorized into machine learning, rule-based classification, model-oriented approach, topic modeling, and traditional clustering.

Aggregation/Summarization

Summarization helps requirements engineers pinpoint the relevant information efficiently in the ever-growing amount of data. We thus assessed whether the study summarizes/aggregates requirements-related information to obtain high-level requirements. If summarization/aggregation was performed, we also extracted the specific techniques used.

Visualization

Visualization helps requirements engineers interpret the results of data analysis efficiently and effectively, as well as gain (new) insights into the data. We assessed whether the study visualizes its outputs to enhance their interpretability. If visualization was provided, the specific method used for visualization was also extracted.

Evaluation Methods

To understand how rigorously the performance of the proposed artifact was evaluated, we extracted the methods that were used to assess the artifact. Evaluation methods were further divided into two dimensions: evaluation approach, and evaluation concepts and metrics [ 17 ]. The evaluation approach of each selected study was categorized into one of the following groups: controlled experiment, case study, proof of concept, and other approaches. In a controlled experiment, the proposed artifact is evaluated in a controlled environment [ 34 ]. A case study aims to assess the artifact in depth in a real-world context [ 34 ]. A proof of concept is defined as a demonstration of the proposed artifact to verify its feasibility for a real-world application. Other approaches refer to studies evaluating their artifact in a way that does not fall into any of the aforementioned categories. We also extracted the evaluation concepts and metrics used for the artifact evaluation. Evaluation concepts were classified into one or more of the following categories: completeness, correctness, efficiency, and other evaluation concepts.
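
In this literature, correctness and completeness are commonly operationalized as the precision and recall of the extracted requirements-related information. A minimal sketch under that assumption, with invented labels:

```python
# Sketch of common evaluation metrics, assuming correctness is
# operationalized as precision and completeness as recall. The labels
# below are invented: 1 = requirements-related, 0 = irrelevant.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]  # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1]  # labels predicted by the elicitation tool

print("precision (correctness):", precision_score(y_true, y_pred))  # 1.0
print("recall (completeness):", recall_score(y_true, y_pred))       # 0.75
print("F1:", f1_score(y_true, y_pred))
```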

The Outcomes of Automated Requirements Elicitation

To answer RQ3, we assessed the outcomes of automated requirements elicitation by extracting the following information: (1) types of requirements, (2) expression of the elicited requirements (i.e., the form in which the outputs generated by automated requirements elicitation were expressed), and (3) additional requirements engineering activities supported through automation.

Expression of the Elicited Requirements

To understand how the obtained requirements are expressed and how far the elicitation activity reached, outputs of automated requirements elicitation were extracted, which were grouped into the following categories: identification and classification of requirements-related information, identification of candidate features related to requirements, and elicitation of requirements.

Intended Degree of Automation

Based on the degree of automation of the proposed method, the selected studies were classified into either full automation or semi-automation. We classified a study as full automation if it fulfilled either of the following conditions: (1) the proposed artifact automated the entire requirements elicitation process without human interaction, or (2) the proposed artifact supported only part of the requirements elicitation process, but the part it addressed was fully automated. Semi-automation refers to having a human in the loop, so that requirements elicitation is guided by human interaction.

Additional Requirements Engineering Activity Supported Through Automation

Understanding to what extent the entire requirements engineering process has already been automated is essential to clarify the direction of future research that aims at increasing the level of automation in performing the requirements engineering process. We thus extracted the requirements engineering activity that was supported through automation other than requirements elicitation, if any.

Quality Assessment

We assessed the quality of the selected studies based on the CORE Conference Rankings for conferences, workshops, and symposia, and the SCImago Journal Rank (SJR) indicator for journal papers. We assumed that a study with a higher CORE or SJR score has higher quality than one with a lower score. Papers ranked A*, A, B, or C in the CORE index receive 1.5, 1.5, 1, and 0.5 points, respectively. A paper ranked Q1 or Q2 on the SJR indicator receives 2 and 1.5 points, respectively, while a paper ranked Q3 or Q4 receives 1 point. If a conference/journal paper is not included in the CORE/SJR ranking, the paper scores 0 points.
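
These scoring rules translate directly into a small lookup; the sketch below is a straightforward transcription of the scheme described above (the function and variable names are our own, not the authors'):

```python
# Transcription of the quality-scoring rules above; venues absent from
# the CORE/SJR rankings score 0.
CORE_POINTS = {"A*": 1.5, "A": 1.5, "B": 1.0, "C": 0.5}
SJR_POINTS = {"Q1": 2.0, "Q2": 1.5, "Q3": 1.0, "Q4": 1.0}

def quality_score(core_rank=None, sjr_quartile=None):
    """Return the quality points for a conference (CORE) or journal (SJR) paper."""
    if core_rank is not None:
        return CORE_POINTS.get(core_rank, 0)
    if sjr_quartile is not None:
        return SJR_POINTS.get(sjr_quartile, 0)
    return 0

print(quality_score(core_rank="A"))      # 1.5
print(quality_score(sjr_quartile="Q3"))  # 1.0
```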

Data Synthesis

We narratively synthesized the findings of this systematic review, including basic descriptive statistics, qualitative analyses of (semi-)automated elicitation methods sub-grouped by dynamic data source, identified research gaps, and implications and recommendations for future research.

Figure 1 shows a flow diagram of the article selection. We obtained 1,848 hits when searching the six electronic databases. We removed 458 duplicates, leaving 1,390 articles for level 1 screening (Table 4 ). After level 1 screening, we identified 40 articles to proceed to level 2 screening. The level 2 screening resulted in the inclusion of 29 articles for data extraction. We excluded the remaining eleven papers because: the study did not use dynamic data for requirements elicitation; the study was based on existing requirements that had already been elicited; the study did not automate requirements elicitation to any degree; or the study proposed a method for automated requirements elicitation without sufficient evaluation.

Figure 1: Flow diagram of article selection

In addition, a forward and backward reference search identified 1017 additional articles, of which 22 met our inclusion criteria. Thus, a total of 51 papers were considered in the final analysis at this stage. Reasons why relevant articles were missed by the query-based search but captured by the backward/forward reference search include: the studies using terms such as “elicit requirements”, “requirements”, or “requirements evolution” instead of “requirements elicitation”; using keywords that cover only one or two of the three keyword blocks despite being relevant; or using only the name of a specific analytics technique (e.g., Long Short-Term Memory) rather than the more general terms included in the identified keywords (e.g., machine learning).

To update the search results, we performed an additional search and two-level screening, using the same search query and process. The updated search identified 401 articles after removing duplicates (Table 4 ). The two-level screening resulted in the inclusion of 17 additional studies; however, we did not perform a backward and forward reference search during this phase. We also included one study that was not captured by the search query but was recommended by an expert due to its relevance to our research question. We therefore selected a total of 68 studies for inclusion in this review.

General Characteristics of the Selected Studies

Of the 68 selected articles, conference proceedings are the most frequent publication type ( n  = 41), followed by journal articles ( n  = 16), workshop papers ( n  = 7), and symposium papers ( n  = 4). All selected studies except one (published in 2009) were published between 2012 and 2020. Figure 2 depicts the total number of included papers per publication year. Although the number of publications dropped in 2018, there is, in general, an increasing trend in publications between 2012 and 2019. For the year 2020, the result is shown as of July 3; further observation is thus needed to confirm the increasing trend at the end of the year. The median quality score was 1, with an interquartile range of 0–1.5 (Appendix 2).

Figure 2: Publication trend

Types of Dynamic Data Sources Used for Requirements Elicitation

Dynamic Data Sources Used for Automated Requirements Elicitation

Among the three types of dynamic data sources, human-sourced data sources have primarily been used as requirements sources: the vast majority (93%, n  = 63) of the studies used human-sourced data for eliciting requirements. Only four studies (6%) explored using either machine-generated ( n  = 2) or process-mediated ( n  = 2) data sources. Almost all the studies focused on a single type of dynamic data source; we identified only one study (1%) attempting to integrate multiple types of dynamic data sources.

The Specific Types of Dynamic Data Used for Automated Requirements Elicitation

The following eight types of dynamic data have been used for automated requirements elicitation: online reviews, micro-blogs, online discussions/forums, software repositories, software/app product descriptions, sensor readings, usage data from system–user interactions, and mailing lists (Table 5 ). Online reviews are reviews of a product or service that are posted and shown publicly online by people who have purchased the given service or product. Micro-blogs, which are typically published on social media sites, are a type of blog in which users can post messages in different content formats, such as short texts, audio, video, and images; they are designed for quick conversational interactions among users. Online discussions/forums are online discussion sites where people can post messages to exchange knowledge. Software repositories are platforms for sharing software packages or source code, which primarily contain three elements: a trunk, branches, and tags. This study also considered issue-tracking systems as software repositories; these contain detailed reports of bugs or complaints written as free text. Sensor readings are electrical outputs of devices that detect and respond to inputs from a physical phenomenon, resulting in large amounts of streaming data. Usage data are run-time data collected while users are interacting with a given system. Mailing lists are a type of electronic discussion forum in which e-mail messages sent by individual subscribers are shared with everyone on the list.

Figure  3 depicts the types of dynamic data that have been used for automated requirements elicitation. Online reviews are the most frequently used type of dynamic data for eliciting requirements (53%), followed by micro-blogs (18%) and online discussions/forums (12%), software repositories (10%), and software/app product descriptions (7%). Other types of dynamic data include sensor readings (3%), usage data from system–user interactions (4%), and mailing lists (3%).

Figure 3: Types of dynamic data used for automated requirements elicitation

Several studies used multiple types of human-sourced data to gain complementary information and improve the quality of the analysis. Wang et al. [ 92 ] assessed whether the use of app changelogs improves the accuracy of identifying and classifying functional and non-functional requirements from app reviews, compared to the results obtained from using app reviews alone. Although app changelogs had no additional positive effect on the accuracy of automatic requirements classification in that study, their subsequent study [ 93 ] shows that the accuracy of classifying requirements in app reviews improved when the reviews were augmented with text feature words extracted from app changelogs.

Takahashi et al. [ 100 ] used the Apache Commons User List and App Store reviews; however, those two datasets were used independently, without being integrated, to evaluate their proposed elicitation process. Moreover, Stanik et al. [ 65 ] used three datasets: app reviews, tweets written in English, and tweets written in Italian. On the other hand, Johann et al. [ 94 ] integrated both app reviews and app descriptions to provide information on which app features are or are not actually reviewed. In addition, Ali et al. [ 66 ] combined tweets about a smartwatch with Facebook comments about wearables and smartwatches.

Some studies used multiple types of software repositories. Morales-Ramirez et al. [ 84 ] used two datasets obtained from the issue tracking system of the Apache OpenOffice community and the feedback gathering system of SEnerCON, an industrial project in the home energy management domain. In a different study [ 79 ], open-source software mailing lists and OpenOffice online discussions were used to identify relevant requirements information. Nyamawe et al. [ 87 ] used commits from a GitHub repository and feature requests from the JIRA issue tracker, while Oriol et al. [ 89 ] and Franch et al. [ 88 ] considered heterogeneous software repositories.

Only one study used multiple types of dynamic data sources: Wüest et al. [ 99 ] used both app user feedback (i.e., human-sourced data) and app usage data (i.e., process-mediated data).

Relation of Dynamic Data to an Organization of Interest

The majority of the studies used dynamic data that were external to the organization of interest. Of the 68 studies included in the analysis, 57 studies (85%) used dynamic data that were external to a given organization (i.e., data were collected outside of the organization’s platforms) [ 36 , 37 , 38 , 39 , 40 , 41 , 42 , 43 , 44 , 45 , 46 , 47 , 48 , 49 , 50 , 51 , 52 , 53 , 54 , 55 , 56 , 57 , 58 , 59 , 60 , 61 , 62 , 63 , 64 , 65 , 66 , 67 , 68 , 69 , 70 , 71 , 72 , 73 , 74 , 75 , 76 , 77 , 80 , 81 , 82 , 86 , 87 , 88 , 89 , 90 , 91 , 92 , 93 , 94 , 96 , 101 ]. Nine studies (13%) used dynamic data collected from platforms belonging to the organization: issue tracking systems [ 84 , 85 , 102 ]; user feedback from online discussions and open-source software mailing lists [ 79 ]; sensors attached to an intelligent product, also known as a product-embedded information device (PEID) [ 95 ]; a software production forum [ 103 ]; and a user feedback tool [ 99 ]. Only two studies (3%) used both internal and external dynamic data [ 78 , 100 ].

Additional Use of Domain Knowledge Used for Requirements Elicitation

Only one study considered additional inclusion of domain knowledge in eliciting requirements. Yang et al. [ 44 ] combined the app review analysis and the Wizard-of-Oz technique for the requirements elicitation process. The results indicate that integrating the two sources can complement each other to elicit more comprehensive requirements that cannot be obtained from either one of the sources.

Approaches for Automated Requirements Elicitation

Approaches Used for Human-Sourced Data

Since human-sourced data are typically expressed in natural language, natural language processing (NLP) is commonly used for analyzing this type of data. All of the 63 studies that used human-sourced information started the requirements elicitation process by preprocessing the raw data using NLP techniques. Data preprocessing typically involves removing noise (e.g., HTML tags) to retain only text data. Another critical data preparation activity is tokenization, which means splitting the text into sentences and further into tokens (words, punctuation marks, and digits).

Further analysis of the text using NLP typically involves syntactic analysis, such as part-of-speech tagging. Two studies have used speech-acts, which are acts performed by a speaker when making an utterance, as parameters to train supervised learning algorithms [ 79 , 84 ]. For eliciting requirements, nouns, verbs, and adjectives are often identified since they are more likely used for describing requirements-related information than other parts of speech, including adverbs, numbers, and quantifiers [ 40 ].

A common preprocessing activity is stopword filtering, which involves removing tokens that are common but carry little meaning, including function words (e.g., “the”, “and”, and “this”), punctuation marks (e.g., “.”, “?”, and “!”), special characters (e.g., “#” and “@”), and numbers. Normalization is moreover often carried out by lowercasing (i.e., converting all text data to lowercase), stemming (i.e., reducing inflectional word forms to their root form, such as reducing “play”, “playing”, and “played” to the common root “play”), and lemmatization (i.e., grouping different inflected forms of words that are syntactically different but semantically equal into a base form, called a lemma, such as grouping “sees” and “saw” into the single base form “see”).
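
A minimal preprocessing sketch covering these steps (tokenization, lowercasing, stopword and punctuation filtering, and stemming), using NLTK; the example review is invented, and NLTK is one possible toolkit rather than the one used in any particular reviewed study:

```python
# Minimal preprocessing sketch with NLTK: tokenize, lowercase, filter
# stopwords/punctuation, and stem. The review text is invented.
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

for pkg in ("punkt", "punkt_tab", "stopwords"):  # tokenizer + stopword data
    nltk.download(pkg, quiet=True)

review = "The app keeps crashing! Please add an offline mode."
tokens = nltk.word_tokenize(review.lower())
drop = set(stopwords.words("english")) | set(string.punctuation)
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens if t not in drop])
# ['app', 'keep', 'crash', 'pleas', 'add', 'offlin', 'mode']
```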

Once the text data have been preprocessed, features are typically extracted for the subsequent modeling phase. Feature extraction can be done using a bag of words (i.e., simply counting occurrences of tokens without considering word order or normalizing counts), n -grams (i.e., extracting contiguous sequences of n tokens, such as bi-grams, which are token pairs), and collocations (i.e., extracting sequences of words that co-occur more often than by chance, for example, “strong tea”). To evaluate how important a word is for a given document, bag-of-words counts are often weighted using a weighting scheme such as term frequency-inverse document frequency (tf-idf), which gives high weights to words that have a high frequency in a particular document but a low frequency in the entire set of documents. Other common features are based on syntactic or semantic analysis of the text (e.g., part-of-speech tags). Sentiment analysis, the automated process of identifying and quantifying the opinion or emotional tone of a piece of text through NLP, was used in 18 studies (38%), either to feed into algorithms as features to increase their accuracy or to understand user satisfaction.
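
A minimal feature-extraction sketch with scikit-learn, producing tf-idf-weighted unigram and bigram features as described above; the documents are invented:

```python
# Feature-extraction sketch: tf-idf-weighted unigrams and bigrams over an
# invented corpus of app reviews.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "great app but it crashes on startup",
    "please add dark mode",
    "dark mode would be great",
]
vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X = vectorizer.fit_transform(docs)                # documents x features matrix
print(X.shape)
print(vectorizer.get_feature_names_out()[:5])
```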

After preprocessing the human-sourced data and extracting features for data modeling, the next step of requirements elicitation was to perform either classification or clustering. Classification refers to classifying (text) data into pre-defined categories related to requirements, for example, classifying app reviews into bug reports, feature requests, user experiences, and text ratings [ 38 ]. Classification has been performed using three approaches: machine learning (ML), rule-based classification, or model-oriented approaches. In the ML approach, classification is performed by a model built by a learning algorithm based on pre-labeled data.

In the ML approach, various learning algorithms automatically learn statistical patterns within a set of training data, such that a predictive model is able to predict a class for unseen data. In most studies, ML relied on supervised learning. In supervised ML, a predictive model is built based on instances that were pre-assigned known class labels (i.e., the training set). The model is then used to predict the labels of unseen instances (i.e., the test set). A downside of supervised ML is that it typically requires a large amount of labeled data (i.e., a ground-truth set) to learn accurate predictive models.
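
A minimal supervised-classification sketch in the spirit of the reviewed studies: invented app reviews are classified into pre-defined categories ("bug report" vs. "feature request") using tf-idf features and Multinomial Naïve Bayes. This is an illustrative stand-in, not a reconstruction of any specific study's pipeline:

```python
# Supervised ML sketch: classify invented app reviews into pre-defined
# requirements-related categories from a small pre-labeled training set.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "app crashes when I open the camera",   # bug report
    "freezes on startup every time",        # bug report
    "please add an offline mode",           # feature request
    "would love to see dark mode support",  # feature request
]
train_labels = ["bug report", "bug report", "feature request", "feature request"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)  # learn from pre-labeled data

print(model.predict(["the app keeps crashing", "add a search feature"]))
# ['bug report' 'feature request']
```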

To reduce the cost of labeling a large amount of data, a few studies used the active learning paradigm and semi-supervised machine learning for classification. Active learning enables machines to select the unlabeled data points whose labels would most improve the decision boundary of a given learning algorithm, and interactively queries the user to label those data points, improving classification accuracy at a lower labeling cost. Semi-supervised learning is an intermediate technique between supervised and unsupervised ML, which utilizes both labeled and unlabeled data in the training process.
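
A minimal semi-supervised sketch using scikit-learn's self-training wrapper, where unlabeled instances are marked with the label -1; the texts and labels are invented, and self-training is just one of several semi-supervised strategies used in the reviewed studies:

```python
# Semi-supervised sketch: self-training from a small labeled set plus
# unlabeled reviews. Labels: 1 = requirements-related, 0 = irrelevant,
# -1 = unlabeled. All texts are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.semi_supervised import SelfTrainingClassifier

texts = [
    "please add export to pdf",  # labeled: 1
    "five stars, love it",       # labeled: 0
    "needs a backup feature",    # unlabeled
    "awesome app",               # unlabeled
]
labels = [1, 0, -1, -1]          # -1 marks unlabeled instances

X = TfidfVectorizer().fit_transform(texts)
clf = SelfTrainingClassifier(MultinomialNB()).fit(X, labels)
print(clf.predict(X[2:]))        # predictions for the unlabeled reviews
```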

Rule-based classification is a classification scheme that uses certain rules, such as language patterns. Rule-based classification excels at simpler tasks for which domain experts can define rules, while classification using ML works well for tasks that are easily performed by humans but for which (classification) rules are hard to formulate. However, listing all the rules can be tedious, and rules need to be hand-crafted by skilled experts with abundant domain knowledge. Moreover, rules might need to be refined as new datasets become available, which requires additional resources and limits scalability [ 77 ]. A model-oriented approach, which includes utilizing conceptual or meta-models, is applied to define and relate the mined terms and drive classification.
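
A minimal rule-based classification sketch: hand-crafted language patterns expressed as regular expressions assign sentences to requirements-related categories. The patterns are invented and deliberately simplistic, illustrating both the approach and why rule maintenance limits scalability:

```python
# Rule-based classification sketch: hand-crafted language patterns map
# review sentences to categories. Patterns and sentences are invented.
import re

RULES = [
    (re.compile(r"\b(crash|freeze|bug|error)\w*", re.I), "bug report"),
    (re.compile(r"\b(please add|would be (great|nice)|wish|should have)\b", re.I),
     "feature request"),
]

def classify(sentence):
    # The first matching pattern wins; unmatched sentences fall through.
    for pattern, label in RULES:
        if pattern.search(sentence):
            return label
    return "other"

print(classify("The app crashes on login"))       # bug report
print(classify("Please add fingerprint unlock"))  # feature request
```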

On the other hand, clustering has been performed using either topic modeling or more traditional clustering techniques. Topic modeling is an unsupervised (i.e., learning from unlabeled instances) dimension reduction and clustering technique that aims to discover hidden semantic patterns in a collection of documents. Topic modeling is used to represent an extensive collection of documents as abstract topics consisting of sets of keywords. In automated requirements elicitation, topic modeling is mainly used either for discovering system features or for grouping similar fine-grained features, extracted using different approaches, into high-level features. Traditional clustering is an unsupervised ML technique that aims to discover the intrinsic structure of the data by partitioning a set of data into groups based on similarity and dissimilarity. Among the selected studies, traditional clustering has mainly been used to discover inherent groupings of features in requirements-related information.
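
A minimal unsupervised sketch showing both families on the same invented corpus: LDA topic modeling to discover keyword-based topics, and K-means to partition documents into similarity-based groups:

```python
# Unsupervised sketch: LDA topic modeling and K-means clustering over an
# invented review corpus, mirroring the two clustering families above.
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "dark mode please", "add dark theme", "night mode option",
    "sync across devices", "cloud sync support", "backup and sync",
]

# Topic modeling: discover abstract topics as keyword distributions.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Traditional clustering: partition documents by similarity.
tfidf = TfidfVectorizer().fit_transform(docs)
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf))
```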

Some studies have performed clustering after classification. Classification was first performed to identify and classify requirements-related information, using machine learning or rule-based classification. Clustering is then applied to the identified requirements-related information (e.g., improvement requests), while ignoring data irrelevant to requirements, to discover inherent groupings of features, using topic modeling or traditional clustering. Table 6 provides a more detailed description of the automated approaches proposed in each study.

Figure  4 depicts the descriptive statistics of the approaches for automated requirements elicitation used in the selected studies. For classification, the most commonly used approach was based on the ML approach (60%), followed by rule-based classification (17%) and model-oriented approach (6%). For clustering, topic modeling (16%) was the most commonly used approach, followed by more traditional clustering techniques (13%) and unsupervised rule-based clustering (2%).

Figure 4: Techniques used for requirements elicitation from human-sourced data, grouped according to classification (i.e., machine learning (ML), rule-based classification, and model-oriented approach) and clustering (i.e., topic modeling, traditional clustering, and rule-based unsupervised NLP)

In nine studies, two different approaches were combined. Two studies performed classification with supervised ML for filtering and subsequently conducted clustering with topic modeling [ 47 , 68 ]. Guzman et al. [ 68 ] first ran Multinomial Naïve Bayes and Random Forest, both supervised learning algorithms, to extract tweets that request software improvements. The Biterm Topic Model, a topic model for short texts, was then used to group semantically similar tweets for software evolution. Zhao and Zhao [ 47 ] first used a supervised deep-learning neural network to extract software features and their corresponding sentiments; hierarchical LDA was subsequently used to extract hierarchical software features with positive and negative sentiments.

Two studies performed classification using ML, followed by unsupervised clustering analysis [ 53 , 58 ]. Jiang et al. [ 58 ] used Support Vector Machine, a supervised machine-learning algorithm, for pruning incorrect software features that were extracted from online reviews. K-means clustering, an unsupervised clustering technique, was then performed to categorize the extracted features into semantically similar system aspects. Sun and Peng [ 53 ] first used Naïve Bayes, a supervised machine-learning algorithm, for filtering informative comments, which were subsequently clustered using K-means.

Jiang et al. [ 41 ] first performed rule-based classification based on syntactic parsing and sentiment analysis to extract opinions about software features and their corresponding sentiment words. Subsequently, S-GN, whose base algorithm is a type of K-means clustering, was used to cluster similar opinion expressions about software features into categories representing overall, functional, or quality requirements. On the other hand, Bakar et al. [ 63 ] combined unsupervised clustering analysis and topic modeling: K-means was first run to identify similar documents, after which latent semantic analysis, a type of topic modeling, was performed to group similar software features within the documents.

Guzman and Maalej [ 40 ] and Dalpiaz and Parente [ 46 ] first extracted software features using rule-based classification based on a collocation-finding algorithm; LDA was subsequently applied to group similar software features. Zhang et al. [ 60 ] first used linear regression based on supervised ML to select helpful online reviews. Conjoint analysis (i.e., a statistical technique used in market research to assess and quantify consumers’ valuation of product or service features) was then performed to assess the impact of the features from helpful online reviews on consumers’ overall ratings.

In several studies, visualization has been provided to help requirements engineers efficiently sift through and effectively interpret the most important requirements-related information. Bakiu and Guzman [ 55 ] first performed the aggregation of features; the results were then visualized at two levels of granularity (i.e., high-level and detailed). Sun and Peng [ 53 ] first extracted scenario information from similar user comments and then aggregated and visualized it as aggregated scenario models. Software features [ 52 ] and technically informative information from potential requirements sources [ 64 , 86 ] were summarized, ranked, and visualized using word clouds. Luiz et al. [ 49 ] summarized the overall user evaluation of mobile applications, their features, and the corresponding user sentiment polarity and scores in a single graphical interface. Oriol et al. [ 89 ] implemented a quality-aware strategic dashboard that offers various functionalities (e.g., quality assessment, forecasting techniques, and what-if analysis) and allows for maintaining traceability of the quality requirements generation and documentation process. Wüest et al. [ 99 ] fused user feedback with correlated GPS data and visualized the fused data on a map, equipping the parking app with context-awareness.
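
As a small sketch of the word-cloud style of summarization, the snippet below assumes the third-party wordcloud package; the feature terms and their frequencies are invented.

```python
from wordcloud import WordCloud

# Hypothetical ranked feature terms extracted from user feedback
feature_freq = {"photo upload": 40, "dark mode": 25, "offline sync": 18, "login": 9}

wc = WordCloud(width=600, height=300, background_color="white")
wc.generate_from_frequencies(feature_freq)
wc.to_file("feature_cloud.png")  # an at-a-glance summary for requirements engineers
```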

Techniques Used for Process-Mediated Data

The two studies that used process-mediated data focused on eliciting emerging requirements through observations and analysis of time-series user behavior (i.e., run-time observation of system–user interactions) and the corresponding environmental context values [ 97 , 98 ]. In both studies, Conditional Random Fields (CRF), which is a statistical modeling method, was used to infer goals (i.e., high-level requirements).
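
The following is a minimal sketch of this kind of goal inference, assuming the third-party sklearn-crfsuite package; the actions, context features, and goal labels are hypothetical stand-ins for the situations described in these studies.

```python
import sklearn_crfsuite

# Each training sequence: time-stamped user actions with environmental context
X_train = [
    [{"action": "open_door", "light": "on"}, {"action": "turn_on_tv", "light": "on"}],
    [{"action": "open_fridge", "light": "off"}, {"action": "start_oven", "light": "off"}],
]
# Goal label per time step, taken from pre-defined goals in the domain knowledge
y_train = [["relax", "relax"], ["cook", "cook"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)

# Infer goals for a newly observed behavior sequence
X_new = [[{"action": "open_fridge", "light": "on"}, {"action": "start_oven", "light": "on"}]]
print(crf.predict(X_new))            # inferred goal per time step
print(crf.predict_marginals(X_new))  # confidence levels, used to flag divergent behavior
```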

Xie et al. [ 97 ] proposed a method to elicit requirements consisting of three steps. First, a computational model is trained based on pre-defined user goals in the domain knowledge, using a supervised CRF to infer the user's implicit goals (i.e., outputs) from the observation and analysis of run-time user behavior and the corresponding environmental values (i.e., inputs). After the goal inference, the user's intention (i.e., the execution path) for achieving a given goal is obtained by connecting the situations (i.e., time-stamped sequences of user behavior labeled with a goal and environmental context values) labeled with the same goal into a sequence. Finally, an emerging intention, which is a new sequence pattern of user behavior that has not been pre-defined or captured in the domain knowledge base, is detected.

An emerging intention can occur in three cases: when a user has a new goal, when a user has a new strategy for achieving an existing goal, or when a user cannot perform operations in an intended way due to system flaws. Requirements, thus, can be elicited by having domain experts validate emerging intentions based on the analyses of goal transition, behaviors divergent from optimal usage, and erroneous behavior.

In the analysis of goal transition, domain experts look at pairs of goals that frequently appear consecutively in the goal inference results with a high confidence level assigned by the CRF, and elicit requirements that would make the goal transition smoother.

In the analysis of divergent behavior, domain experts focus on user behaviors that deviate from the expected way to operate the system, because a user's irregular behavior may indicate misunderstanding of required operational procedures, dissatisfaction with the system, or emerging desires. Such divergent behaviors are given a low confidence level by the CRF model.

In the analysis of erroneous behavior, requirements can be elicited by investigating error reports with high occurrences, which may reflect users' emerging desires that are not supported by the current system. In addition, requirements can be elicited from user behaviors that are actually normal but are mistakenly considered erroneous due to system flaws. The proposed method is assumed to be used in a sensor-laden computer application domain; thus, it may also be applicable to machine-generated data. The main challenge, however, is to increase the level of automation for analyzing potential emerging intentions and users' emerging requirements.

Yang et al. in [ 98 ] used CRF to infer goals based on a time-stamped sequence of user behavior that is labeled with a goal and environmental context values, which is called a situation. Based on the results of goal inference, intention inference was performed by relating a sequence of situations that are labeled as the same goal. When an intention has not been pre-defined in the domain knowledge base, the intention is detected as an emerging intention and exported as possible new requirements for future system development or evolution.

However, the methods proposed in both studies still require substantial involvement of human oracles, which needs to be reduced in future research to increase scalability and promote implementation in real-life settings. In addition, the proposed methods do not yet support diverse requirements: the method proposed by Xie et al. [ 97 ] captures only emerging functional but not non-functional requirements, and the approach proposed in [ 98 ] supports only the identification of low-level design alternatives (i.e., new ways of fulfilling a given intention).

Notably, Wüest et al. in [ 99 ] proposed to use both human-sourced and process-mediated data. Their approach is based on the control loop for self-adaptive systems for collecting and analyzing user feedback (i.e., human-sourced data) as well as system usage and location data (i.e., GPS data). The analysis is driven by rules or models of expected system usage. The system decides how to interpret the results of the analysis and modifies its behavior at run-time, which allows for understanding changing user requirements for software evolution.

Techniques Used for Machine-Generated Data

Voet et al. [ 95 ] first extracted goal-relevant usage elements as features from data recorded via a handheld grinder, a type of product-embedded information device (PEID) equipped with sensors and onboard capabilities. Feature selection was then performed to reduce the system workload and improve the prediction accuracy of the machine-learning algorithm, compared to using raw sensor data. Specifically, a support vector machine classifier, a supervised machine-learning algorithm, was used to build and train the model to predict four different usage element states. The model was then tested on sensor data from two different usage scenarios that had not been used for training. The collection of predicted usage element states, or user profiles, can be analyzed manually or by clustering to identify deviations from the intended optimal usage profile. Requirements can then be inferred by analyzing users' deviant behaviors.
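
The sketch below illustrates the general idea of predicting usage element states from windowed sensor features with an SVM; the signals, window features, and state labels are synthetic and do not reproduce the study's setup.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def window_features(signal):
    # Summary features per window instead of raw samples
    return [signal.mean(), signal.std(), np.abs(np.diff(signal)).mean()]

# Hypothetical accelerometer windows for two usage states
idle = [rng.normal(0.0, 0.1, 100) for _ in range(20)]
grinding = [rng.normal(0.0, 1.0, 100) for _ in range(20)]
X = np.array([window_features(w) for w in idle + grinding])
y = ["idle"] * 20 + ["grinding"] * 20

clf = SVC(kernel="rbf").fit(X, y)

# The predicted state sequence of an unseen scenario forms a usage profile;
# deviations from the intended optimal profile hint at new requirements
new_windows = [rng.normal(0.0, 1.0, 100), rng.normal(0.0, 0.1, 100)]
print(clf.predict(np.array([window_features(w) for w in new_windows])))
```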

Liang et al. [ 96 ] mined user behavior patterns from instances of user behavior, which consist of user context (i.e., the time, location, and motion state of the crowd of mobile users) and the currently running apps, using the Apriori-M algorithm, an efficient variant of the Apriori algorithm for frequent itemset mining. User behavior patterns, from which emergent requirements or requirements changes are inferred, are ranked and used for service recommendation. Service recommendation is performed periodically using a service recommendation algorithm, which takes the mined user behavior as input and outputs the apps of which to remind the user. In service recommendation, matching is performed between the current user context and the context of user behavior patterns mined from mobile crowd users, according to the ranking order. If the two match, the mobile app(s) in the user behavior patterns are automatically recommended to the user as solutions to meet the requirements inferred from the user behavior patterns.
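
As an illustration of this style of pattern mining, the sketch below runs the standard Apriori algorithm (via the mlxtend package) over invented context/app transactions; the Apriori-M optimizations of the original study are not reproduced.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Hypothetical behavior instances: context items plus the app in use
transactions = [
    ["morning", "commuting", "news_app"],
    ["morning", "commuting", "news_app"],
    ["evening", "at_home", "streaming_app"],
    ["morning", "commuting", "podcast_app"],
]
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent itemsets are candidate user behavior patterns; a match between the
# current context and a pattern would trigger an app recommendation
patterns = apriori(df, min_support=0.5, use_colnames=True)
print(patterns.sort_values("support", ascending=False))
```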

In summary, most of the existing solutions support the elicitation of requirements from a single data source, primarily from human-sourced data. There is a lack of methods to support requirements elicitation from heterogeneous data sources. In addition, only a few studies have supported context-awareness and real-time data processing and analysis. Those features are crucial for enabling continuous and dynamic elicitation of requirements, which is especially important for context-aware applications and time-critical systems such as health systems. Moreover, many studies lack a discussion of how the proposed solution helps process large volumes of data.

Evaluation Methods

Evaluation methods include three components: evaluation approach, concept, and metrics. Of the 68 selected studies, controlled experiments were the most frequently applied approach for evaluating the proposed artifact (75%), followed by case studies (19%) and proofs of concept (6%) (Fig.  5 a).

Fig. 5 a Evaluation approach, b Evaluation concepts

Among the 51 studies that used controlled experiments, 46 compared the results produced by the proposed artifacts against a manually annotated ground-truth set. For example, Bakiu and Guzman [ 55 ] compared the performance of multi-label classification against a manually created gold standard in classifying features extracted from unseen user reviews into different dimensions of usability and user experience.

Only three studies compared the performance of the proposed artifact with the results of manual analysis without the aid of automation [ 57 , 62 , 78 ]. For example, Bakar et al. [ 62 ] compared the software features that were extracted using their proposed semi-automated method with those that were obtained manually.

Two studies conducted experiments in other ways. Liang et al. [ 96 ] used a longitudinal approach: they compared the obtained user behavior patterns with those collected after a time interval to confirm the correctness of the Apriori-M algorithm. Abad et al. [ 44 ] compared Wizard-of-Oz (WOz) and user review analysis qualitatively. In a few studies [ 46 , 88 , 90 ], the proposed techniques were evaluated with intended users. The rest of the studies used a case study or a proof of concept as an evaluation approach.

The most frequently used evaluation concept was correctness (78%), followed by completeness (74%), no/other metrics (13%), and efficiency (10%) (Fig.  5 b). Other metrics include, for example, usability, creativity, and the intended user's perceived usefulness and satisfaction. Most of the studies combined several evaluation concepts. Three different combinations of the concepts were identified: (1) completeness and correctness ( n  = 42), (2) completeness, correctness, and efficiency ( n  = 7), and (3) correctness and efficiency ( n  = 2). In most cases, correctness and completeness were assessed using precision (i.e., the fraction of correctly predicted instances among the total predicted instances) and recall (i.e., the fraction of correctly predicted positive instances among all instances in the actual class), respectively. In addition, the F-measure was used to address the trade-off between precision and recall.
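
For concreteness, the small example below computes the three metrics with scikit-learn on invented labels (1 = requirements-relevant, 0 = irrelevant).

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

print(precision_score(y_true, y_pred))  # 3/4: correct among all predicted positives
print(recall_score(y_true, y_pred))     # 3/4: retrieved among all actual positives
print(f1_score(y_true, y_pred))         # harmonic mean of the two
```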

Efficiency has been assessed in terms of the size of training data [ 38 , 39 ], the time to recognize and classify software features [ 76 ], the time required to identify relevant requirements information for both manual and automated analysis [ 57 ], the time taken to complete the extraction of software features [ 63 ], the time and space needed to build the classification model [ 48 , 50 , 51 ], and the total execution time of the machine-learning algorithm [ 78 ]. The user's perceived efficiency was measured using a 5-point Likert scale [ 89 ].

Fig. 6 Final outcomes of automated requirements elicitation

The Outcomes of the Automated Requirements Elicitation

Expression of Final Outcomes Produced by the Automated Part of Requirements Elicitation

The outcomes of the automated requirements elicitation have been classified into the following three categories: (1) identification and classification of requirements-related information, (2) identification of candidate features related to requirements, and (3) elicitation of requirements (Table 7 ). Only 21% of the studies have enabled the automated elicitation of requirements. A majority of the studies have resulted in the automated identification and classification of requirements-related information (51%) or the identification of candidate features related to requirements (28%) (Fig.  6 ).

Identification and classification of requirements-related information has been done by classifying dynamic data into different classes of issues based on: relevance to different stakeholders for identifying responsibilities; technical relevance for filtering only relevant data (e.g., classifying into either feature requests or other); and types of technical issues to be inspected (e.g., classifying into feature requests, bug reports, user experiences, and user ratings, or classifying into functional or non-functional requirements). Some studies performed classification at a deeper level (e.g., classifying into four types of non-functional requirements, namely usability, reliability, portability, or performance, or into functional requirements).

Identification of candidate features related to requirements refers to discovering functional components of a system. Features, however, are typically less granular than requirements and do not specify what behavior, conditions, and details would be needed to obtain the full functionality. They thus need to be further processed to become full requirements.

Elicitation of requirements has mostly been done at a high level, in the form of goals, aggregated scenarios, or high-level textual requirements. Franch et al. [ 88 ] and Oriol et al. [ 89 ] semi-automated the elicitation of complete requirements in the form of user stories and requirements specified in a semi-formal language.

Degree of Intended Automation

A proposed artifact was classified into one of two levels of intended automation: intended full automation or semi-automation. Of note is that we consider artifacts that support the automation of requirements elicitation either entirely or partially. Artifacts are classified as intended full automation in the following two circumstances: (1) when the proposed part is automated without human intervention for completion, or (2) when only minimal interaction is needed for completion. Minimal human interaction is defined as human oracles being in the loop once, at the initial stage of the elicitation process, which includes the creation of the ground-truth set and conceptual models as well as the specification of a set of keywords and language patterns. Based on these definitions, the majority of the proposed methods (84%) were intended to be fully automated, while the rest are semi-automated methods that require human oracles to be in the loop for each iteration of the process.

The majority of the selected studies exclusively focused on enabling requirements elicitation from dynamic data, without considering other requirements engineering activities. Of the 68 studies included in the analysis, 50 (74%) exclusively proposed methods to enable automated requirements elicitation, while 18 (26%) supported other requirements engineering activities in addition to requirements elicitation. Prioritization was the most frequently supported additional requirements engineering activity ( n  = 11), followed by elicitation for change management ( n  = 7) and documentation ( n  = 2). More detailed information is provided in Table 8 .

Discussion

We conducted a systematic literature review on the existing data-driven methods for automated requirements elicitation. The main motivations for this review were two-fold: (1) using dynamic data has the potential to enrich stakeholder-driven requirements elicitation by eliciting new requirements that cannot be obtained from other sources, and (2) no systematic review had been conducted on the state-of-the-art methods to elicit requirements from dynamic data from unintended digital sources. Of the 1848 records retrieved by searching six electronic databases and the 1017 articles identified through backward and forward reference search, we selected the studies that met our inclusion criteria, resulting, together with the search update, in 68 studies included in the final analysis to answer the following three research questions. RQ1: What types of dynamic data are used for automated requirements elicitation? RQ2: What types of techniques and technologies are used for automating requirements elicitation? RQ3: What are the outcomes of automated requirements elicitation? In the following sections, we provide a discussion of the main findings, the identified research gaps, and issues to be addressed in future research.

RQ1: What Types of Dynamic Data Are Used for Automated Requirements Elicitation?

Existing research on data-driven requirements elicitation from dynamic data sources has primarily focused on utilizing human-sourced data in the form of online reviews, micro-blogs, online discussions/forums, software repositories, and mailing lists. The use of online reviews was substantially more prevalent than that of other types of human-sourced data. This indicates that current data-driven requirements elicitation is largely crowd-based. In contrast, process-mediated and machine-generated data sources have only in some instances been explored as potential sources of requirements. The predominance of human-sourced information is rather expected and can be explained by two reasons: (1) users' preferences and needs regarding a system are typically explicitly expressed in natural language, from which it is, relatively speaking, straightforward to obtain requirements compared to process-mediated and machine-generated data, and (2) there are abundant sources of human-sourced data that are publicly available and readily accessible.

Much more research is thus needed to develop methods capable of eliciting requirements from process-mediated and machine-generated data, which are not expressed in natural language and from which requirements need to be inferred. There is still a lack of methods to infer requirements, as well as a lack of evidence regarding the applicability of the proposed approaches to more diverse types of process-mediated and machine-generated data. Process-mediated and machine-generated data enable run-time requirements elicitation [ 19 ]. They also help system developers to understand usage data and the corresponding context, which allows the elicitation of performance-related as well as context-dependent requirements [ 19 ]. In addition, almost all of the studies have focused on using only a single type of dynamic data and typically also a single data source.

A few studies have utilized multiple human-sourced data sources; however, there has been only one attempt to combine different types of dynamic data sources. As such, there is currently insufficient evidence that using multiple types of data leads to more effective requirements elicitation, but it remains an open issue that merits investigation. We believe that research in this direction would be highly interesting in an attempt to improve data-driven requirements elicitation, both in terms of the coverage and quality of the elicited requirements. Utilizing semantic technologies can be useful for enabling the integration of heterogeneous data sources [ 107 ].

In addition, only one study integrated dynamic data and domain knowledge to elicit requirements [ 44 ]. The results from that study indicate the potential benefits of using dynamic data together with domain knowledge to elicit requirements that cannot be captured using either one of the data sources. It is likely that domain knowledge, which is typically relatively static but of high quality, can help to enrich data-driven requirements elicitation efforts from dynamic data sources. A larger number of studies are needed to confirm the impacts of integrating domain knowledge with dynamic data on the quality and diversity of outcomes obtained from the automated requirements process.

RQ2: What Types of Techniques and Technologies Are Used for Automating Requirements Elicitation?

Techniques Used for Automated Requirements Elicitation

Human-sourced data are typically expressed in natural language, which is inherently difficult to analyze computationally due to its ambiguous nature and lack of rigid structure. In all the selected studies, human-sourced data have been (pre-)processed using natural language processing techniques to facilitate subsequent analysis. Although the techniques used for preprocessing vary across studies, data cleaning, text normalization, and feature extraction for data modeling are frequently performed preprocessing steps in automated requirements engineering. Commonly used features include surface-level tokens, words, and phrases, but also syntactic (e.g., part-of-speech tags) and semantic features (e.g., the positive/negative/neutral sentiment of a sentence). After data preparation and feature extraction, data modeling or analysis for the purpose of requirements elicitation is typically performed using classification or clustering, or classification followed by clustering.
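
The sketch below strings together typical instances of these preprocessing and feature-extraction steps using NLTK; it is only one of many possible pipelines, and the data-package names may vary across NLTK versions.

```python
import nltk
# Download the required NLTK data packages (names may differ between versions)
for pkg in ["punkt", "averaged_perceptron_tagger", "stopwords", "vader_lexicon"]:
    nltk.download(pkg, quiet=True)
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer

review = "The app CRASHES every time I upload photos!!!"
tokens = [t.lower() for t in nltk.word_tokenize(review) if t.isalpha()]  # cleaning + normalization
content = [t for t in tokens if t not in stopwords.words("english")]     # stop-word removal
pos_tags = nltk.pos_tag(content)                                         # syntactic features
sentiment = SentimentIntensityAnalyzer().polarity_scores(review)         # semantic feature
print(content, pos_tags, sentiment["compound"])
```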

Classification in the context of automated requirements elicitation involves one of the following three tasks: (1) filtering out data irrelevant to requirements, (2) classifying text based on its relevance to different stakeholder groups, or (3) classifying text into different categories of technical issues, such as bug reports and feature requests. The classification tasks have been tackled using either rule-based approaches or machine learning, mostly within the supervised learning paradigm. Although supervised machine learning can achieve high predictive performance on a well-defined classification task, it requires access to a sufficient amount of human-annotated data. As a result, many studies involved humans annotating data into pre-defined classes. The labeling task, however, is labor-intensive, time-consuming, and error-prone due to the considerable amount of noise and the ambiguity inherent in natural language [ 35 ].

Two solutions have been proposed to reduce the cost of labeling a large amount of data: active learning [ 35 ] and semi-supervised machine learning [ 43 ]. Dhinakaran et al. in [ 35 ] showed that classifiers trained with active learning strategies outperformed baseline classifiers passively trained on randomly selected data in classifying app reviews into feature requests, bug reports, user ratings, or user experiences. Deocadez et al. in [ 43 ] demonstrated that three semi-supervised algorithms (i.e., Self-training, RASCO, and Rel-RASCO) with four base classifiers achieved predictive performance comparable to that of classical supervised machine learning in classifying app reviews into functional or non-functional requirements. Although there is not a sufficient number of studies to draw a generalizable conclusion, classification using active learning and semi-supervised machine-learning strategies may have similar potential as conventional supervised machine learning in identifying and classifying requirements-related information, while requiring a much smaller amount of labeled data.
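
A minimal sketch of the self-training idea, using scikit-learn's SelfTrainingClassifier; the reviews, the two labeled examples, and the base classifier are illustrative rather than those used in the cited studies.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

texts = ["app crashes on start", "please add export to pdf",
         "crashes again after the update", "would love a search feature",
         "freezes constantly", "add a dark theme option"]
# Only the first two reviews are labeled; -1 marks unlabeled instances
labels = [0, 1, -1, -1, -1, -1]  # 0 = bug report, 1 = feature request

X = TfidfVectorizer().fit_transform(texts)
model = SelfTrainingClassifier(LogisticRegression(), threshold=0.6)
model.fit(X, labels)  # iteratively pseudo-labels confident unlabeled instances
print(model.predict(X))
```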

Another issue that needs to be addressed when using supervised learning is that human-sourced data sources include a significant proportion of non-informative and irrelevant data. Eliciting requirements from such sources is thus often compared to "looking for a needle in a haystack" [ 70 ]. This leads to a highly unbalanced class distribution, with the non-informative and irrelevant data dominating the informative and relevant classes. The underlying class distribution largely affects the performance of machine learning-based classifiers [ 42 , 71 ]. In one study [ 42 ], the precision, recall, and F 1 measures for the under-represented classes were worse than those for the better-represented classes. Given that the classes relevant to requirements are not represented equally in most real-life situations, this issue needs to be addressed in future research. One possible solution may be applying sampling techniques such as the Synthetic Minority Oversampling Technique (SMOTE) to the training set to increase the number of instances in the class with fewer observations [ 71 , 84 ].
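
The following sketch shows how SMOTE, as implemented in the imbalanced-learn package, could rebalance such a skewed training set; the data are synthetic.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Imbalanced toy data: ~5% requirements-relevant (class 1), ~95% noise (class 0)
X, y = make_classification(n_samples=400, weights=[0.95], random_state=0)
print(Counter(y))

# Synthesize new minority-class instances until the classes are balanced
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))
```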

Contextualization, which is done by filtering out non-informative and irrelevant data, may be another possible solution. Several studies [ 47 , 53 , 58 , 68 ] have used supervised classification before performing finer-grained classification or clustering. Filtering out noisy data can improve classification or clustering accuracy. It also helps requirements engineers pinpoint the data relevant to requirements by automatically discarding non-informative data [ 69 ], and it supports the efficient distribution of data to the appropriate stakeholders within an organization [ 69 ]. Since contextualization reduces the volume of data to be processed further, it mitigates the volume issue of Big Data.

Various supervised learning algorithms have been used to automate the requirements elicitation process. However, there is no "one-size-fits-all" algorithm that performs best in every single case, which is often referred to as the "no free lunch" theorem [ 108 ]. Experimenting with and comparing many different algorithms for a specific problem demands time and machine-learning expertise from requirements engineers on top of their routine work. It would thus be helpful if support tools were to accommodate functions that automatically identify and recommend the best algorithm among the possible options.

Moreover, it would be even more valuable if such tools supported automatic optimization of the parameter configuration, which includes preprocessing, the selection of machine-learning features, hyper-parameter settings, and evaluation metrics.

Supervised machine learning has mainly been used for identifying and classifying data into pre-defined categories related to requirements. This is because supervised machine learning works well for tasks for which classification rules are difficult to formulate. Nevertheless, it requires a sufficient amount of human-annotated data to build a reliable predictive model, which is time-consuming and error-prone to produce. On the other hand, rule-based classification, the second most frequently used classification approach, excels at simpler tasks for which rules can be formulated. In the literature, rule-based classification has been used more frequently for identifying candidate features than for identifying and classifying requirements-related information. For rule-based classification to function well, however, sound domain knowledge is required to appropriately define the rules that drive the classification process and determine its effectiveness.
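
A minimal sketch of such automated configuration search, using scikit-learn's GridSearchCV over an illustrative text-classification pipeline; the candidate settings and the toy data are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = ["app crashes on start", "please add pdf export", "freezes after update",
         "would love dark mode", "crash when opening camera", "add a search bar"]
labels = [0, 1, 0, 1, 0, 1]  # 0 = bug report, 1 = feature request

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
# Search jointly over a preprocessing choice and a hyper-parameter
grid = {"tfidf__ngram_range": [(1, 1), (1, 2)], "clf__alpha": [0.1, 1.0]}
search = GridSearchCV(pipe, grid, cv=2).fit(texts, labels)
print(search.best_params_, search.best_score_)
```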

Clustering has been used primarily for identifying candidate features or grouping semantically similar features. In the selected studies, clustering has been performed using topic modeling or traditional clustering, which can be valuable alternatives to supervised learning in the absence of labeled historical data. More than half of the studies that used clustering first classified data into pre-assigned categories relevant to requirements, primarily using supervised machine learning or rule-based classification. Clustering is subsequently performed on the requirements-related information identified by classification, using topic modeling or traditional clustering. These unsupervised machine-learning techniques, however, often lead to less accurate results than supervised learning, since there is no knowledge about the output data.

The effectiveness of clustering can be affected by many factors (e.g., the number of clusters and the selection of initial seeds), and evaluating unsupervised learning is problematic due to a lack of well-defined metrics. This may be a reason that classification is performed before clustering. Nevertheless, there are efforts to ensure high-quality clustering. Cleland-Huang et al. [ 78 ] proposed the automated forum management (AFM) system, which employs Stable Spherical K-Means (SPK) to mine feature requests from discussion threads in open source forums. In their study, Normalized Mutual Information (NMI) was computed to evaluate and ensure the quality of the clusters. In addition, since the selection of initial seeds highly influences clustering results, the problem was mitigated by applying consensus clustering for the initial clustering. Sun and Peng [ 53 ], on the other hand, used the recommended cluster number (RCN) to determine the optimal number of clusters. Other metrics are also available to evaluate the quality of clustering, such as the Silhouette index; however, no consensus has been reached regarding which measure to use, because the choice depends on the nature of the data and the desired clustering task.
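
For illustration, the sketch below computes both NMI (which requires reference labels) and the silhouette index (which does not) for a K-means clustering of synthetic data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import normalized_mutual_info_score, silhouette_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(normalized_mutual_info_score(y_true, labels))  # external: needs reference labels
print(silhouette_score(X, labels))                   # internal: no labels required
```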

Moreover, only a small proportion of the studies supported the visualization of the obtained results. Data visualization increases the interpretability of the results by leveraging visual capacity, which helps identify new and hidden patterns, outliers, and trends [ 16 ]. It also facilitates communication among different stakeholders within an organization. Providing visualizations, thus, is recommended to help requirements engineers understand the results and make a subsequent decision more efficiently and effectively.

Process-Mediated and Machine-Generated Data Sources

As described in the previous section (i.e., "RQ1: What types of dynamic data are used for automated requirements elicitation?"), our results indicate that there is a huge research gap in eliciting requirements from process-mediated and machine-generated data. Much more research should focus on exploring methods to elicit requirements from data that are not written in natural language. Only two studies leveraged process-mediated data, both utilizing CRF to infer goals, which are high-level requirements. More research is needed to develop methods and algorithms to elicit requirements from various types of process-mediated data.

Likewise, machine-generated data were used as requirements sources in two studies. Liang et al. [ 96 ] proposed to use the Apriori-M algorithm to infer context-aware requirements from behavior patterns mined from the run-time behavior of mobile users. The results of the analysis are used to provide the user with solutions that satisfy the inferred requirements. On the other hand, Voet et al. [ 95 ] proposed a method to classify goal-relevant usage element states using supervised machine learning and infer requirements based on deviations from the optimal usage profile, which can be detected through manual analysis or unsupervised clustering.

Given that IoT data are one of the main driving forces of Big Data generation, there is a pressing need to develop a framework to elicit requirements from IoT data. Applying semantic technologies may be a promising solution to help machines interpret the meaning of data by semantically representing raw data in a human/machine interpretable form [ 107 ], which can facilitate the automatic requirements elicitation from large volumes of heterogeneous IoT data.

Evaluation

Rigorous evaluation is essential for ensuring that a proposed artifact meets its intended objectives, justifying its effectiveness and/or efficiency, and identifying its weaknesses, which need to be rectified in future work. The artifacts proposed in most of the identified studies were primarily evaluated through controlled experiments. Controlled experiments eliminate the influence of extraneous and unwanted variables that could account for a change in the dependent variable(s) other than the independent variable(s) of interest. Their two main advantages are thus: (1) they are the most powerful method for inferring causal relationships between variables, and (2) they can achieve high internal validity [ 109 ]. Nevertheless, their main disadvantage is that, since they are typically conducted in an artificial environment, conclusions may not be valid in real-life settings, which threatens external validity [ 109 ].

Most studies that used controlled experiments as an evaluation approach evaluated the results derived from a proposed artifact against a manually created ground-truth set. The quality of the ground-truth set, however, determines the performance of machine-learning algorithms. The majority of the studies, thus, recruited multiple annotators for the labeling task to obtain a "reliable" ground-truth set, which only contains peer-agreed labels. Some studies used an annotation guideline and performed a pilot run of the classification task on small samples to avoid subjective assessment, reduce disagreement, and increase the quality of the manual labeling [ 38 , 39 , 68 ].

In addition, a few studies compared the performance of automated analysis with a proposed artifact against the performance achieved by relying solely on manual analysis. Groen et al. [ 57 ] justified the efficiency and scalability of automated user review analysis and emphasized the need for automation when analyzing large volumes of dynamic data to support continuous requirements engineering. A case study was the second most frequently used evaluation approach, in which the proposed methods are assessed through in-depth investigation of a specific instance in a real-life context. Proof of concept was used in a small proportion of the selected studies; it demonstrates the feasibility of a proposed artifact theoretically. Although it may be suitable as a preliminary or formative evaluation, it has lower explanatory power compared to comparative evaluations (e.g., controlled experiments and case studies).

Most studies used standard metrics from the field of information retrieval. Completeness and correctness were the most frequently used evaluation concepts, while some studies also assessed the efficiency of an artifact. Recall and precision were often used as metrics to measure completeness and correctness, respectively. Since there is a trade-off between precision and recall, many studies additionally used the F-measure, the weighted harmonic mean of precision and recall. Most of the studies used the F 1 -measure, which assigns equal weights to precision and recall (i.e., the harmonic mean of precision and recall). In Guzman et al. [ 69 ], however, recall was assigned more importance (i.e., a higher weight) than precision, based on the claim that recall should be favored over precision since missing relevant requirements is more detrimental [ 110 ]. On the other hand, precision is also important when dealing with a dataset that contains large amounts of irrelevant information. Future research may explore techniques to optimize F-measures, including a weighted maximum likelihood solution [ 111 ]. Moreover, few studies have compared the effectiveness of automated requirements elicitation with that of traditional, stakeholder-driven requirements elicitation. This can largely be explained by the fact that research on automated requirements elicitation is not yet mature, since most methods have focused on identifying and classifying requirements-related information rather than eliciting requirements. However, this needs to be addressed in future research to demonstrate the value of automated requirements elicitation.
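
As a small worked example of weighting recall over precision, the snippet below compares the F1-measure with the F2-measure (beta = 2, giving recall twice the importance of precision) on invented labels.

```python
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]  # precision = 2/3, recall = 2/4

print(f1_score(y_true, y_pred))             # ~0.571, equal weights
print(fbeta_score(y_true, y_pred, beta=2))  # ~0.526, penalizes the low recall more
```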

RQ3: What Are the Outcomes of Automated Requirements Elicitation?

Expression of Requirements Elicitation

In traditional requirements engineering, requirements elicitation begins with the identification of relevant requirements sources such as stakeholders and domain documents, which is followed by two other sub-activities: the elicitation of existing requirements from the identified sources and elicitation of new and innovative requirements [ 1 ].

On the other hand, dynamic data-driven requirements elicitation has been done in the form of the following three activities: (1) identification and classification of requirements-related information, (2) identification of candidate features related to requirements, and (3) elicitation of requirements. However, these three activities have not necessarily been performed entirely or sequentially. For example, many studies that aim to identify candidate features first performed classification, using supervised learning or rule-based classification, before clustering features using topic modeling or traditional clustering, while the rest directly identified candidate software features, mainly using topic modeling or rule-based classification. One possible reason for performing classification before clustering is that classification can only sort data into coarse categories, which may contain repetitive information expressing the same sentiment, while clustering can further group individual data items in a meaningful way. Thus, the specific combination of the two approaches can facilitate the work of requirements engineers (e.g., requirements reuse).

Most of the proposed methods supported the identification and classification of requirements-related information or the identification of candidate features. Identification and classification of requirements-related information helps requirements engineers save time in data analysis by filtering out a significant amount of irrelevant data and selectively identifying specific types of information of interest, such as feature requests. It also helps to allocate the extracted data based on relevance to stakeholder groups to support parallel data analysis within the same organization. Identification of candidate features helps requirements engineers understand user-preferred features and select features to be considered in software development and evolution. Features, however, are not yet formulated as requirements; transforming them into requirements requires the engagement of requirements engineers.

On the other hand, only about 20% of the studies automated the entire requirements elicitation. In most cases, the elicited requirements are high-level, such as goals, aggregated scenarios, or high-level textual requirements. These high-level requirements, however, do not include details of the features concerned, nor the relevant conditions. This highlights the need for developing additional automated approaches or using traditional elicitation techniques with the involvement of human stakeholders to complete the requirements elicitation process.

A majority of the studies proposed methods that are intended to be fully automated after minimal human intervention at the initial stage of the continuous elicitation process. However, most studies do not yet support the entire requirements elicitation. Given the high volume and velocity of dynamic data, requirements elicitation certainly needs to be automated to enhance efficiency and scalability.

However, fully automated methods are not necessarily better than semi-automated methods with respect to the quality of requirements and the ease of integration into an existing requirements engineering process and organizational workflow. There is a lack of evidence on what level of automation leads to the most effective requirements elicitation within an organization. More research thus needs to be done on whether it is possible, and better, to automate the entire elicitation process, or whether some degree of human-in-the-loop is necessary.

If a semi-automated approach is considered preferable, another issue that needs to be addressed is where and when in the elicitation process humans should come into play to facilitate effective automated requirements elicitation. In addition, the characteristics of dynamic data can change over time; the proposed automated approach should be flexible enough to incorporate and reflect such dynamic changes.

Our results show that three-quarters of the selected studies exclusively focused on requirements elicitation, while only one-quarter supported additional requirements engineering activities, namely requirements prioritization and management of requirements change. No studies, thus, supported the automation of the entire requirements engineering process. A holistic framework therefore needs to be developed to increase the automation level of dynamic data-driven requirements engineering.

Threats to Validity

The results of the review need to be interpreted with caution due to the following limitations.

External validity

All the studies included in the review, except one utilizing user feedback in both English and Italian [ 65 ], focus on eliciting English requirements. Thus, our results cannot be generalized to requirements elicitation in other languages. Further studies are needed to assess the applicability of the techniques used for eliciting English requirements to other languages.

Internal validity

Our search query might have missed potentially important keywords such as "requirements mining", "feedback", and "tool". Not including those keywords may have affected the number of studies included in the analysis. Our search query also failed to capture work following DevOps and human–computer interaction approaches, which may have resulted in omitting some important work. In addition, we did not perform a backward and forward reference search when updating the review, which may also have reduced the number of included studies.

Moreover, a single reviewer performed a large part of the study selection and data extraction, which may have caused errors that impact the results. We partially mitigated this risk by ensuring high inter-rater reliability, tested on a small proportion of randomly selected samples, and by discussing with at least one of the other reviewers to decide on the inclusion of undecided papers, as explained in the "Study Selection" section. Ideally, the entire study selection and data extraction process should have been performed by at least two reviewers.

Another limitation is that we defined an analytical framework for synthesizing the retrieved data in advance; however, the framework was based on a previous systematic review of automated requirements elicitation from domain documents. Moreover, we assessed the quality of individual studies solely based on SJR or CORE scores, which may not always reflect the "true" strength of evidence provided by each study. A more detailed and formal quality assessment could have added value to the review by increasing the reliability of the results.

Publication bias

This review included only published peer-reviewed studies and excluded gray literature and commercial products, which may fill many of the gaps identified in this review. Thus, the frequencies of the techniques and concepts do not imply real-life usage frequencies or degree of usefulness. Including gray literature and commercial products would increase the review’s completeness and timeliness.

Conclusions and Future Work

We have conducted a systematic literature review concerning requirements elicitation from data generated via digital technologies that are unintended with respect to requirements. These sources can include data that are highly relevant for new system requirements and that otherwise could not be obtained from other sources. The motivation behind the proposed approaches lies in the fact that including such requirements, which existing or new software systems do not yet support, can lead to important improvements in system functionality and quality, ensure that requirements are up-to-date, and enable further automation of a continuous elicitation process.

This literature review provides an overview of the state-of-the-art with respect to data-driven requirements elicitation from dynamic data sources. This is the first systematic review focusing on efforts to automate or support requirements elicitation from these types of data sources—often referred to as Big Data—that include not only human-sourced data but also process-mediated and machine-generated data.

We obtained 1848 relevant studies by searching six electronic databases. After two levels of screening and a complementary forward and backward reference search, 51 papers were selected for data analysis. We further performed an additional two-level screening to update our search, which resulted in the inclusion of 17 more studies; thus, in total, 68 studies are included in the final analysis. The selected studies were analyzed to answer the defined research questions concerning (a) the identification of specific data sources and data types used for the elicitation, (b) the methods and techniques used for processing the data, and (c) the classification of the content of the obtained outputs in relation to what is expected from the traditional elicitation process.

The results revealed a clear dominance of human-sourced data compared to process-mediated and machine-generated data sources. As a consequence, the techniques used for data processing are based on natural language processing, while the use of machine learning for classification and clustering is prevalent. The dominant intention of the proposed methods was to automate the elicitation process fully, rather than to combine it with traditional stakeholder-involved approaches.

Furthermore, the results showed that the majority of the studies considered both functional and non-functional (i.e., quality) requirements. The results regarding the completeness and readiness of the elicited data for use in system development or evolution are currently limited: most of the studies obtain some of the information relevant for a requirement's content, some target the identification of core functionality or quality in terms of features, and only a few achieve high-level requirement content. Finally, the majority of the studies evaluated the results in experimental environments, indicating a rather low extent of implementation of the methods in real-life requirements engineering settings.

The obtained results provide several directions for future work. One possible direction concerns the investigation of more extensive use and analysis of non-human-sourced data types. In addition, automatic data fusion and contextualization methods need to be investigated for integrating, processing, and analyzing large volumes of heterogeneous data sources to elicit requirements. Semantic technologies can be a promising solution to address the variety and volume issues of Big Data. Another direction is enabling real-time data processing and analysis to facilitate continuous requirements elicitation from Big Data with high velocity.

Moreover, each proposed solution needs to be evaluated against traditional requirements elicitation to convince practitioners of its value for real-life implementation. Further improvements also need to be made in the content and quality of the elicited data in relation to fully detailed requirements. Finally, a very important direction relates to proposals for enabling context-awareness to capture requirements that change dynamically over time.

References

Pohl K. Requirements engineering: fundamentals, principles, and techniques. Heidelberg: Springer; 2010.


Pacheco C, García I, Reyes M. Requirements elicitation techniques: a systematic literature review based on the maturity of the techniques. IET Softw. 2018;12(4):365–78. https://doi.org/10.1049/iet-sen.2017.0144 .


Chen H, Chiang RHL, Storey VC. Business intelligence and analytics: from big data to big impact. MIS Quart. 2012;36(4):1165–88.

Manrique-Losada B, Zapata-Jaramillo CM, Burgos DA. Re-expressing business processes information from corporate documents into controlled language. In: International Conference on Applications of Natural Language Processing to Information Systems . Cham: Springer, 2016, pp. 376–383.

Hauksdóttir D, Ritsing B, Andersen JC, Mortensen NH. Establishing reusable requirements derived from laws and regulations for medical device development. In 2016 IEEE 24th International Requirements Engineering Conference Workshops (REW) , 2016, pp. 220–228, https://doi.org/10.1109/REW.2016.045 .

Kaiya H, Saeki M. Using domain ontology as domain knowledge for requirements elicitation. In 14th IEEE International Requirements Engineering Conference (RE'06) , pp. 189–198. IEEE, 2006, pp. 186–195, https://doi.org/10.1109/RE.2006.72 .

Zong-yong L, Zhi-xue W, Ying-ying Y, Yue W, Ying L. Towards a multiple ontology framework for requirements elicitation and reuse. In 31st Annual International Computer Software and Applications Conference (COMPSAC 2007) , vol. 1, pp. 189–195. IEEE, 2007, https://doi.org/10.1109/COMPSAC.2007.216 .

Nogueira FA, De Oliveira HC. Application of heuristics in business process models to support software requirements specification. ICEIS. 2017;2:40–51.


Bendjenna H, Zarour NE, Charrel P. MAMIE: A methodology to elicit requirements in inter-company co-operative information systems. In 2008 International Conference on Computational Intelligence for Modelling Control & Automation , 2008, pp. 290–295, https://doi.org/10.1109/CIMCA.2008.101 .

Shao F, Peng R, Lai H, Wang B. DRank: a semi-automated requirements prioritization method based on preferences and dependencies. J Syst Softw. 2017;126:141–56. https://doi.org/10.1016/j.jss.2016.09.043 .

Abad ZSH, Karras O, Ghazi P, Glinz M, Ruhe G, Schneider K. “What works better? A study of classifying requirements. In 2017 IEEE 25th International Requirements Engineering Conference (RE) , pp. 496–501. IEEE, 2017, https://doi.org/10.1109/RE.2017.36 .

Hayes JH, Antoniol G, Adams B, Guehénéuc YG. Inherent characteristics of traceability artifacts: Less is more. In: 2015 IEEE 23rd International Requirements Engineering Conference (RE) , pp. 196–201. IEEE, 2015.

Kamalrudin M, Hosking J, Grundy J. MaramaAIC: tool support for consistency management and validation of requirements. Automat Softw Eng. 2017;24(1):1–45. https://doi.org/10.1007/s10515-016-0192-z .

Ahmed MA, Butt WH, Ahsan I, Anwar MW, Latif M, Azam F. A novel natural language processing (NLP) approach to automatically generate conceptual class model from initial software requirements. In International Conference on Information Science and Applications , pp. 476–484. Springer, Singapore, 2017.

Kifetew F, Munante D, Perini A, Susi A, Siena A, Busetta P. DMGame: A gamified collaborative requirements prioritisation tool. In: 2017 IEEE 25th International Requirements Engineering Conference (RE) , 2017, pp. 468–469, https://doi.org/10.1109/RE.2017.46 .

Ahmad S, Jalil IEA, Ahmad SSS. An enhancement of software requirements negotiation with rule-based reasoning: a conceptual model. J Telecommun Electron Comput Eng (JTEC). 2016;8(10):193–8.

Meth H, Brhel M, Maedche A. The state of the art in automated requirements elicitation. Inf Softw Technol. 2013;55(10):1695–709. https://doi.org/10.1016/j.infsof.2013.03.008 .

Nicolás J, Toval A. On the generation of requirements specifications from software engineering models: a systematic literature review. Inf Softw Technol. 2009;51(9):1291–307. https://doi.org/10.1016/j.infsof.2009.04.001 .

Groen EC, et al. The crowd in requirements engineering: the landscape and challenges. IEEE Softw. 2017;34(2):44–52. https://doi.org/10.1109/MS.2017.33 .

Ferguson M. Big data-why transaction data is mission critical to success. Intelligence Business Strategies Limited. https://public.dhe.ibm.com/common/ssi/ecm/im/en/iml14442usen/IML14442USEN.PDF , 2014.

Maalej W, Nayebi M, Johann T, Ruhe G. Toward data-driven requirements engineering. IEEE Softw. 2016;33(1):48–54. https://doi.org/10.1109/MS.2015.153 .

Genc-Nayebi N, Abran A. A systematic literature review: opinion mining studies from mobile app store user reviews. J Syst Softw. 2017;125:207–19. https://doi.org/10.1016/j.jss.2016.11.027 .

Ghasemi M, Amyot D. From event logs to goals: a systematic literature review of goal-oriented process mining. Requir Eng. 2019;25(1):67–93. https://doi.org/10.1007/s00766-018-00308-3 .

Zowghi D, Coulin C. Requirements elicitation: A survey of technique, approaches and tools. In: Engineering and managing software requirements. Berlin: Springer; 2005. p. 19–46.


Arruda D, Madhavji NH. State of requirements engineering research in the context of Big Data applications. In International Working Conference on Requirements Engineering: Foundation for Software Quality . Cham: Springer, 2018, pp. 307–323.

Martin W, Sarro F, Jia Y, Zhang Y, Harman M. A survey of app store analysis for software engineering. IEEE Trans Software Eng. 2017;43(9):817–47. https://doi.org/10.1109/TSE.2016.2630689 .

Tavakoli M, Zhao L, Heydari A, Nenadić G. Extracting useful software development information from mobile application reviews: a survey of intelligent mining techniques and tools. Exp Syst Appl. 2018;113:186–99.

De Mauro A, Greco M, Grimaldi M. What is big data? A consensual definition and a review of key research topics. In: AIP conference proceedings , vol. 1644, no. 1, pp. 97–104. American Institute of Physics, 2015, https://doi.org/10.1063/1.4907823 .

Morales-Ramirez I, Perini A, Guizzardi RSS. An ontology of online user feedback in software engineering. Appl Ontol. 2015;10(3–4):297–330. https://doi.org/10.3233/AO-150150 .

Firmani D, Mecella M, Scannapieco M, Batini C. On the meaningfulness of ‘big data quality’ (invited paper). Data Sci Eng. 2016;1(1):6–20. https://doi.org/10.1007/s41019-015-0004-7 .

Kitchenham B, Charters S. Guidelines for performing systematic literature reviews in software engineering. 2007.

Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–74. https://doi.org/10.2307/2529310 .


Rotondi MA, Donner A. A confidence interval approach to sample size estimation for interobserver agreement studies with multiple raters and outcomes. J Clin Epidemiol. 2012;65(7):778–84. https://doi.org/10.1016/j.jclinepi.2011.10.019 .

Hevner AR, March ST, Park J, Ram S. Design science in information systems research. MIS Quart. 2004;28(1):75–105. https://doi.org/10.2307/25148625 .

Dhinakaran VT, Pulle R, Ajmeri N, Murukannaiah PK. App review analysis via active learning: Reducing supervision effort without compromising classification accuracy. In 2018 IEEE 26th International Requirements Engineering Conference (RE) , pp. 170–181. IEEE, 2018, https://doi.org/10.1109/RE.2018.00026 .

Do QA, Bhowmik T. Automated generation of creative software requirements: a data-driven approach. In Proceedings of the 1st ACM SIGSOFT International Workshop on Automated Specification Inference , pp. 9–12. 2018, https://doi.org/10.1145/3278177.3278180 .

Groen EC, Kopczynska S, Hauer MP, Krafft TD, Doerr J. Users-the hidden software product quality experts?: A study on how app users report quality aspects in online reviews. In 2017 IEEE 25th International Requirements Engineering Conference (RE) , pp. 80–89. IEEE, 2017, https://doi.org/10.1109/RE.2017.73 .

Maalej W, Kurtanović Z, Nabil H, Stanik C. On the automatic classification of app reviews. Requir Eng. 2016;21(3):311–31. https://doi.org/10.1007/s00766-016-0251-9 .

Maalej W, Nabil H. Bug report, feature request, or simply praise? On automatically classifying app reviews. In 2015 IEEE 23rd international requirements engineering conference (RE) , pp. 116–125. IEEE, 2015, https://doi.org/10.1109/RE.2015.7320414 .

Guzman E, Maalej W. How do users like this feature? A fine grained sentiment analysis of app reviews. In 2014 IEEE 22nd international requirements engineering conference (RE) , pp. 153–162. IEEE, 2014, https://doi.org/10.1109/RE.2014.6912257 .

Jiang W, Ruan H, Zhang L, Lew P, Jiang J. For user-driven software evolution: Requirements elicitation derived from mining online reviews. In Pacific-Asia Conference on Knowledge Discovery and Data Mining , pp. 584–595. Springer, Cham, 2014.

Lu M, Liang P. Automatic classification of non-functional requirements from augmented app user reviews. In Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering , 2017, pp. 344–353.

Deocadez R, Harrison R, Rodriguez D. Automatically classifying requirements from app stores: A preliminary study. In 2017 IEEE 25th International Requirements Engineering Conference Workshops (REW) , Sep. 2017, pp. 367–371, https://doi.org/10.1109/REW.2017.58 .

Abad ZSH, Sims SDV, Cheema A, Nasir MB, Harisinghani P. Learn more, pay less! Lessons learned from applying the wizard-of-oz technique for exploring mobile app requirements. In 2017 IEEE 25th International Requirements Engineering Conference Workshops (REW) , Sep. 2017, pp. 132–138, https://doi.org/10.1109/REW.2017.71 .

Panichella S, Di Sorbo A, Guzman E, Visaggio CA, Canfora G, Gall HC. How can I improve my app? Classifying user reviews for software maintenance and evolution. In 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME) , pp. 281–290. IEEE, 2015, https://doi.org/10.1109/ICSM.2015.7332474 .

Dalpiaz F, Parente M. RE-SWOT: From user feedback to requirements via competitor analysis. In International Working Conference on Requirements Engineering: Foundation for Software Quality , pp. 55–70. Springer, Cham, 2019.

Zhao L, Zhao A. Sentiment analysis based requirement evolution prediction. Fut Internet. 2019;11(2):52.

Jha N, Mahmoud A. Using frame semantics for classifying and summarizing application store reviews. Empir Softw Eng. 2018;23(6):3734–67.

Luiz W, et al. A feature-oriented sentiment rating for mobile app reviews. In Proceedings of the 2018 World Wide Web Conference , pp. 1909–1918. 2018.

Jha N, Mahmoud A. MARC: a mobile application review classifier. In REFSQ Workshops . 2017.

Jha N, Mahmoud A. Mining user requirements from application store reviews using frame semantics. In International working conference on requirements engineering: Foundation for software quality , pp. 273–287. Springer, Cham, 2017.

Carreno LVG, Winbladh K. Analysis of user comments: an approach for software requirements evolution. In 2013 35th international conference on software engineering (ICSE) , pp. 582–591. IEEE, 2013, https://doi.org/10.1109/ICSE.2013.6606604 .

Sun D, Peng R. A scenario model aggregation approach for mobile app requirements evolution based on user comments. In: Requirements engineering in the big data era. Berlin: Springer; 2015. p. 75–91.

Higashi K, Nakagawa H, Tsuchiya T. Improvement of user review classification using keyword expansion. In: Proceedings of the International Conference on Software Engineering and Knowledge Engineering (SEKE) , pp. 125–124. 2018.

Bakiu E, Guzman E. Which feature is unusable? Detecting usability and user experience issues from user reviews. In 2017 IEEE 25th International Requirements Engineering Conference Workshops (REW) , pp. 182–187. IEEE, 2017, https://doi.org/10.1109/REW.2017.76 .

Srisopha K, Behnamghader P, Boehm B. Do users talk about the software in my product? Analyzing user reviews on IoT products. In the Proceedings of CIbSE XXI Ibero-American Conference on Software Engineering (CIbSE) , pp. 551–564, 2018.

Groen EC, Iese F, Schowalter J, Kopczynska S. Is there really a need for using NLP to elicit requirements? A benchmarking study to assess scalability of manual analysis. In REFSQ Workshops. 2018.

Jiang W, Ruan H, Zhang L. Analysis of economic impact of online reviews: an approach for market-driven requirements evolution. In: Zowghi D, Jin Z, editors. Requirements engineering communications in computer and information science, vol. 432. Berlin: Springer; 2014. p. 45–59. https://doi.org/10.1007/978-3-662-43610-3_4 .

Buchan J, Bano M, Zowghi D, Volabouth P. Semi-automated extraction of new requirements from online reviews for software product evolution. In 2018 25th Australasian Software Engineering Conference (ASWEC) , pp. 31–40. IEEE, 2018.

Zhang Z, Qi J, Zhu G. Mining customer requirement from helpful online reviews. 2014 Enterprise Systems Conference , pp. 249–254. IEEE, 2014, https://doi.org/10.1109/ES.2014.38 .

Bakar NH, Kasirun ZM, Salleh N, Jalab HA. Extracting features from online software reviews to aid requirements reuse. Appl Soft Comput J. 2016;49:1297–315. https://doi.org/10.1016/j.asoc.2016.07.048 .

Bakar NH, Kasirun ZM, Salleh N. Terms extractions: an approach for requirements reuse. In 2015 2nd International Conference on Information Science and Security (ICISS) , pp. 1–4. IEEE, 2015.–254. IEEE, 2015, https://doi.org/10.1109/ICISSEC.2015.7371034 .

Bakar NH, Kasirun ZM, Salleh N, Halim A. Extracting software features from online reviews to demonstrate requirements reuse in software engineering. In Proceedings of the International Conference on Computing & Informatics , pp. 184–190. 2017.

Williams G, Mahmoud A. Mining twitter feeds for software user requirements. In 2017 IEEE 25th International Requirements Engineering Conference (RE) , pp. 1–10. IEEE, 2017, https://doi.org/10.1109/RE.2017.14 .

Stanik C, Haering M, Maalej W. Classifying multilingual user feedback using traditional machine learning and deep learning. In 2019 IEEE 27th International Requirements Engineering Conference Workshops (REW) , pp. 220–226. IEEE, 2019.

Ali N, Hwang S, Hong J-E. Your opinions let us know: mining social network sites to evolve software product lines. KSII Trans Internet Inf Syst. 2019;13(8):4191–211. https://doi.org/10.3837/tiis.2019.08.021 .

Alwadain A, Alshargi M. Crowd-generated data mining for continuous requirements elicitation. Int J Adv Comput Sci Appl. 2019;10(9):45–50.

Guzman E, Ibrahim M, Glinz M. A little bird told me: Mining tweets for requirements and software evolution. In 2017 IEEE 25th International Requirements Engineering Conference (RE) , pp. 11–20. IEEE, 2017, https://doi.org/10.1109/RE.2017.88 .

Guzman E, Alkadhi R, Seyff N. An exploratory study of twitter messages about software applications. Requir Eng. 2017;22(3):387–412.

Guzman E, Alkadhi R, Seyff N. A needle in a haystack: What do twitter users say about software? In 2016 IEEE 24th International Requirements Engineering Conference (RE) , pp. 96–105. IEEE, 2016, https://doi.org/10.1109/RE.2016.67 .

Kuehl N. Needmining: towards analytical support for service design. In International Conference on Exploring Services Science , pp. 187–200. Springer, Cham, 2016.

Nguyen V, Svee EO, Zdravkovic J. A semi-automated method for capturing consumer preferences for system requirements. In IFIP Working Conference on The Practice of Enterprise Modeling , pp. 117–132. Springer, Cham, 2016.

Svee EO, Zdravkovic J. A model-based approach for capturing consumer preferences from crowdsources: the case of twitter. In 2016 IEEE Tenth International Conference on Research Challenges in Information Science (RCIS) , pp. 1–12. IEEE, 2016, https://doi.org/10.1109/RCIS.2016.7549323 .

Martens D, Maalej W. Extracting and analyzing context information in user-support conversations on twitter. In 2019 IEEE 27th International Requirements Engineering Conference (RE) , pp. 131–141. IEEE, 2019.

Han X, Li R, Li W, Ding G, Qin S. User requirements dynamic elicitation of complex products from social network service. In 2019 25th International Conference on Automation and Computing (ICAC) , pp. 1–6. IEEE, 2019. https://doi.org/10.23919/IConAC.2019.8895140 .

Vlas RE, Robinson WN. Two rule-based natural language strategies for requirements discovery and classification in open source software development projects. J Manag Inf Syst. 2012;28(4):11–38. https://doi.org/10.2753/MIS0742-1222280402 .

Xiao M, Yin G, Wang T, Yang C, Chen M. Requirement acquisition from social Q&A sites. In: Liu L, Aoyama M, editors. Requirements engineering in the big data era Communications in Computer and Information Science, vol. 558. Berlin: Springer; 2015.

Cleland-Huang J, Dumitru H, Duan C, Castro-Herrera C. Automated support for managing feature requests in open forums. Commun ACM. 2009;52(10):68–74. https://doi.org/10.1145/1562764.1562784 .

Morales-Ramirez I, Kifetew FM, Perini A. Analysis of online discussions in support of requirements discovery. In International Conference on Advanced Information Systems Engineering (CAiSE) , pp. 159–174. Springer, Cham, 2017, https://doi.org/10.1007/978-3-319-59536-8_11 .

Khan JA, Liu L, Wen L. Requirements knowledge acquisition from online user forums. IET Softw. 2020;14(3):242–53. https://doi.org/10.1049/iet-sen.2019.0262 .

Khan JA, Xie Y, Liu L, Wen L. Analysis of requirements-related arguments in user forums. In 2019 IEEE 27th International Requirements Engineering Conference (RE) , pp. 63–74. IEEE, 2019.

Khan JA. Mining requirements arguments from user forums. In 2019 IEEE 27th International Requirements Engineering Conference (RE) , pp. 440–445. IEEE, 2019.

Tizard J, Wang H, Yohannes L, Blincoe K. Can a conversation paint a picture? Mining Requirements in software forums. In 2019 IEEE 27th International Requirements Engineering Conference (RE) , pp. 17–27. IEEE, 2019.

Morales-Ramirez I, Kifetew FM, Perini A. Speech-acts based analysis for requirements discovery from online discussions. Inf Syst. 2018;86:94–112. https://doi.org/10.1016/j.is.2018.08.003 .

Merten T, Falis M, Hübner P, Quirchmayr T, Bürsner S, Paech B. Software feature request detection in issue tracking systems. In 2016 IEEE 24th International Requirements Engineering Conference (RE) , pp. 166–175. IEEE, 2016, https://doi.org/10.1109/RE.2016.8 .

Portugal RLQ, Do Prado Leite JCS, Almentero E. Time-constrained requirements elicitation: Reusing GitHub content. In 2015 IEEE Workshop on Just-In-Time Requirements Engineering (JITRE) , pp. 5–8. IEEE, 2015, https://doi.org/10.1109/JITRE.2015.7330171 .

Nyamawe AS, Liu H, Niu N, Umer Q, Niu Z. Automated recommendation of software refactorings based on feature requests. In 2019 IEEE 27th International Requirements Engineering Conference (RE) , pp. 187–198. IEEE, 2019.

Franch X, et al. Data-driven elicitation, assessment and documentation of quality requirements in agile software development. In International Conference on Advanced Information Systems Engineering , pp. 587–602. Springer, Cham, 2018.

Oriol M, et al. Data-driven and tool-supported elicitation of quality requirements in agile companies. Softw Qual J. 2020. https://doi.org/10.1007/s11219-020-09509-y .

Do QA, Chekuri SR, Bhowmik T. Automated support to capture creative requirements via requirements reuse. In International Conference on Software and Systems Reuse , pp. 47–63. Springer, Cham, 2019, https://doi.org/10.1007/978-3-030-22888-0_4 .

Kang Y, Li H, Lu C, Pu B. A transfer learning algorithm for automatic requirement model generation. J Intell Fuzzy Syst. 2019;36(2):1183–91. https://doi.org/10.3233/JIFS-169892 .

Wang C, Zhang F, Liang P, Daneva M, van Sinderen M. Can app changelogs improve requirements classification from app reviews?: An exploratory study. In Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement , pp. 1–4. 2018, https://doi.org/10.1145/3239235.3267428 .

Wang C, Wang T, Liang P, Daneva M, Van Sinderen M. Augmenting app reviews with app changelogs: An approach for app reviews classification. In Proceedings of the International Conference on Software Engineering and Knowledge Engineering (SEKE) , pp. 398–512. 2019, https://doi.org/10.18293/SEKE2019-176 .

Johann T, Stanik C, Maalej W. SAFE: A simple approach for feature extraction from app descriptions and app reviews. In 2017 IEEE 25th International Requirements Engineering Conference (RE) , pp. 21–30. IEEE, 2017, https://doi.org/10.1109/RE.2017.71 .

Voet H, Altenhof M, Ellerich M, Schmitt RH, Linke B. A framework for the capture and analysis of product usage data for continuous product improvement. J Manuf Sci Eng. 2019;141(2):021010.

Liang W, Qian W, Wu Y, Peng X, Zhao W. Mining context-aware user requirements from crowd contributed mobile data. In Proceedings of the 7th Asia-Pacific Symposium on Internetware , pp. 132–140. 2015, https://doi.org/10.1145/2875913.2875933 .

Xie H, Yang J, Chang CK, Liu L. A statistical analysis approach to predict user’s changing requirements for software service evolution. J Syst Softw. 2017;132:147–64. https://doi.org/10.1016/j.jss.2017.06.071 .

Yang J, Chang CK, Ming H. A situation-centric approach to identifying new user intentions using the mtl method. In 2017 IEEE 41st Annual Computer Software and Applications Conference (COMPSAC) , vol. 1, pp. 347–356. IEEE, 2017.

Wüest D, Fotrousi F, Fricker S. Combining monitoring and autonomous feedback requests to elicit actionable knowledge of system use. In International Working Conference on Requirements Engineering: Foundation for Software Quality , pp. 209–225. Springer, Cham, 2019.

Takahashi H, Nakagawa H, Tsuchiya T. Towards automatic requirements elicitation from feedback comments: Extracting requirements topics using LDA. In Proceedings of the International Conference on Software Engineering and Knowledge Engineering (SEKE) , pp. 489–494. 2015, https://doi.org/10.18293/SEKE2015-103 .

Dhinakaran VT, Pulle R, Ajmeri N, Murukannaiah PK. App review analysis via active learning. In 2018 IEEE 26th International Requirements Engineering Conference (RE) , pp. 170–181. IEEE, 2018.

Licorish SA. Exploring the prevalence and evolution of android concerns: a community viewpoint. JSW. 2016;11(9):848–69.

Tizard J, Wang H, Yohannes L, Blincoe K. Can a conversation paint a picture? Mining requirements in software forums. In 2019 IEEE 27th International Requirements Engineering Conference (RE) , pp. 17–27. IEEE, 2019, https://doi.org/10.1109/RE.2019.00014 .

Al Kilani N, Tailakh R, Hanani A. Automatic classification of apps reviews for requirement engineering: Exploring the customers need from healthcare applications. In 2019 6th International Conference on Social Networks Analysis, Management and Security (SNAMS) , pp. 541–548. IEEE, 2019, https://doi.org/10.1109/SNAMS.2019.8931820 .

Yan X, Guo J, Lan Y, Cheng X. A biterm topic model for short texts. In Proceedings of the 22nd international conference on World Wide Web , pp. 1445–1456. 2013.

Merten T, Falis M, Hübner P, Quirchmayr T, Bürsner S, Paech B. Software feature request detection in issue tracking systems. In 2016 IEEE 24th International Requirements Engineering Conference (RE) , pp. 166–175. IEEE, 2016.

Barnaghi P, Wang W, Henson C, Taylor K. Semantics for the internet of things: early progress and back to the future. Int J Semant Web Inf Syst (IJSWIS). 2012;8(1):1–21. https://doi.org/10.4018/jswis.2012010101 .

Wolpert DH. The lack of a priori distinctions between learning algorithms. Neural Comput. 1996;8(7):1341–90. https://doi.org/10.1162/neco.1996.8.7.1341 .

Johannesson P, Perjons E. An introduction to design science. Berlin: Springer; 2014.

Berry DM. Evaluation of tools for hairy requirements engineering and software engineering tasks. In 2017 IEEE 25th International Requirements Engineering Conference Workshops (REW) , pp. 284–291. IEEE, 2017.

Dimitroff G, Georgiev G, Toloşi L, Popov B. Efficient F measure maximization via weighted maximum likelihood. Mach Learn. 2015;98(3):435–54. https://doi.org/10.1007/s10994-014-5439-y .

Article   MathSciNet   MATH   Google Scholar  

Funding

Open Access funding provided by Stockholm University. This study was funded by the Department of Computer and Systems Sciences, Stockholm University.

Author information

Authors and affiliations

Department of Computer and Systems Sciences, Stockholm University, DSV, PO Box 7003, 164 07, Kista, Stockholm, Sweden

Sachiko Lim, Aron Henriksson & Jelena Zdravkovic


Corresponding author

Correspondence to Sachiko Lim.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary file 1 (PDF 127 KB)

See Table 9

J1: J. A. Khan, L. Liu, and L. Wen. “Requirements knowledge acquisition from online user forums.” IET Software, vol. 14, no. 3, pp. 242–253, 2020, https://doi.org/10.1049/iet-sen.2019.0262.

J2: M. Oriol et al., “Data-driven and tool-supported elicitation of quality requirements in agile companies,” Software Quality Journal , pp. 1–33, 2020, https://doi.org/10.1007/s11219-020-09509-y .

J3: N. Ali, S. Hwang, and J.-E. Hong. “Your opinions let us know: Mining social network sites to evolve software product lines.” KSII Transactions on Internet and Information Systems , vol. 13, no. 8, pp. 4191–4211, 2019, https://doi.org/10.3837/tiis.2019.08.021 .

J4: A. Alwadain and M. Alshargi. “Crowd-generated data mining for continuous requirements elicitation.” International Journal of Advanced Computer Science and Applications , vol. 10, no. 9, pp. 45–50, 2019.

J5: Y. Kang, H. Li, C. Lu, and B. Pu. “A transfer learning algorithm for automatic requirement model generation.” Journal of Intelligent and Fuzzy Systems , vol. 36, no. 2, pp. 1183–1191, 2019, https://doi.org/10.3233/JIFS-169892 .

J6: H. Voet, M. Altenhof, M. Ellerich, R. H. Schmitt, and B. Linke. “A framework for the capture and analysis of product usage data for continuous product improvement.” Journal of Manufacturing Science and Engineering, vol. 141, no. 2, 2019.

J7: L. Zhao and A. Zhao. “Sentiment analysis based requirement evolution prediction.” Future Internet , vol. 11, no. 2, p. 52, 2019.

J8: N. Jha and A. Mahmoud. “Using frame semantics for classifying and summarizing application store reviews.” Empirical Software Engineering, vol. 23, no. 6, pp. 3734–3767, 2018.

J9: I. Morales-Ramirez, F. M. Kifetew, and A. Perini, “Speech-acts based analysis for requirements discovery from online discussions,” Information Systems , vol. 86, pp.94–112, 2018, https://doi.org/10.1016/j.is.2018.08.003 .

J10: E. Guzman, R. Alkadhi, and N. Seyff. “An exploratory study of Twitter messages about software applications.” Requirements Engineering , vol. 22, no. 3, pp. 387–412, 2017.

J11: H. Xie, J. Yang, C. K. Chang, and L. Liu. “A statistical analysis approach to predict user’s changing requirements for software service evolution.” Journal of Systems and Software , vol. 132, pp. 147–164, 2017, https://doi.org/10.1016/j.jss.2017.06.071 .

J12: N. H. Bakar, Z. M. Kasirun, N. Salleh, and H. A. Jalab. “Extracting features from online software reviews to aid requirements reuse.” Applied Soft Computing Journal , vol. 49, pp. 1297–1315, 2016, https://doi.org/10.1016/j.asoc.2016.07.048 .

J13: S. A. Licorish. “Exploring the prevalence and evolution of Android concerns: A community viewpoint.” JSW , vol. 11, no. 9, pp. 848–869, 2016.

J14: W. Maalej, Z. Kurtanović, H. Nabil, and C. Stanik. “On the automatic classification of app reviews.” Requirements Engineering , vol. 21, no. 3, pp. 311–331, 2016, https://doi.org/10.1007/s00766-016-0251-9 .

J15: R. E. Vlas and W. N. Robinson. “Two rule-based natural language strategies for requirements discovery and classification in open source software development projects.” Journal of Management Information Systems, vol. 28, no. 4, pp. 11–38, 2012, https://doi.org/10.2753/MIS0742-1222280402.

J16: J. Cleland-Huang, H. Dumitru, C. Duan, and C. Castro-Herrera. “Automated support for managing feature requests in open forums.” Communications of the ACM , vol. 52, no. 10, pp. 68–74, 2009, https://doi.org/10.1145/1562764.1562784 .

C1: F. Dalpiaz and M. Parente. “RE-SWOT: From user feedback to requirements via competitor analysis.” In  International Working Conference on Requirements Engineering: Foundation for Software Quality , pp. 55–70. Springer, Cham, 2019.

C2: Q. A. Do, S. R. Chekuri, and T. Bhowmik. “Automated support to capture creative requirements via requirements reuse.” In  International Conference on Software and Systems Reuse , pp. 47–63. Springer, Cham, 2019, https://doi.org/10.1007/978-3-030-22888-0_4 .

C3: X. Han, R. Li, W. Li, G. Ding, and S. Qin. “User requirements dynamic elicitation of complex products from social network service.” In  2019 25th International Conference on Automation and Computing (ICAC) , pp. 1–6. IEEE, 2019. https://doi.org/10.23919/IConAC.2019.8895140 .

C4: J. A. Khan. “Mining requirements arguments from user forums.” In  2019 IEEE 27th International Requirements Engineering Conference (RE) , pp. 440–445. IEEE, 2019.

C5: J. A. Khan, Y. Xie, L. Liu, and L. Wen. “Analysis of requirements-related arguments in user forums.” In  2019 IEEE 27th International Requirements Engineering Conference (RE) , pp. 63–74. IEEE, 2019.

C6: N. Al Kilani, R. Tailakh, and A. Hanani. “Automatic classification of apps reviews for requirement engineering: exploring the customers need from healthcare applications.” In  2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS) , pp. 541–548. IEEE, 2019, https://doi.org/10.1109/SNAMS.2019.8931820 .

C7: D. Martens and W. Maalej. “Extracting and analyzing context information in user-support conversations on twitter,” In  2019 IEEE 27th International Requirements Engineering Conference (RE) , pp. 131–141. IEEE, 2019.

C8: A. S. Nyamawe, H. Liu, N. Niu, Q. Umer, and Z. Niu. “Automated recommendation of software refactorings based on feature requests.” In  2019 IEEE 27th International Requirements Engineering Conference (RE) , pp. 187–198. IEEE, 2019.

C9: J. Tizard, H. Wang, L. Yohannes, and K. Blincoe. “Can a conversation paint a picture? Mining requirements in software forums.” In  2019 IEEE 27th International Requirements Engineering Conference (RE) , pp. 17–27. IEEE, 2019.

C10: C. Wang, T. Wang, P. Liang, M. Daneva, and M. Van Sinderen. "Augmenting app review with app changelogs: An approach for app review classification." In  Proceedings of the International Conference on Software Engineering and Knowledge Engineering (SEKE) , pp. 398–512. 2019, https://doi.org/10.18293/SEKE2019-176 .

C11: D. Wüest, F. Fotrousi, and S. Fricker. “Combining monitoring and autonomous feedback requests to elicit actionable knowledge of system use.” In  International Working Conference on Requirements Engineering: Foundation for Software Quality , pp. 209–225. Springer, Cham, 2019.

C12: J. Buchan, M. Bano, D. Zowghi, and P. Volabouth. “Semi-automated extraction of new requirements from online reviews for software product evolution.” In  2018 25th Australasian Software Engineering Conference (ASWEC) , pp. 31–40. IEEE, 2018.

C13: V. T. Dhinakaran, R. Pulle, N. Ajmeri, and P. K. Murukannaiah. "App review analysis via active learning: reducing supervision effort without compromising classification accuracy." In  2018 IEEE 26th International Requirements Engineering Conference (RE) , pp. 170–181. IEEE, 2018, https://doi.org/10.1109/RE.2018.00026 .

C14: X. Franch et al. “Data-driven elicitation, assessment and documentation of quality requirements in agile software development.” In International Conference on Advanced Information Systems Engineering, pp. 587–602. Springer, Cham, 2018.

C15: E. C. Groen, F. Iese, J. Schowalter, and S. Kopczynska. “Is there really a need for using NLP to elicit requirements? A benchmarking study to assess scalability of manual Analysis.” In  REFSQ Workshops . 2018.

C16: K. Higashi, H. Nakagawa, and T. Tsuchiya. “Improvement of user review classification using keyword expansion.” In  Proceedings of the International Conference on Software Engineering and Knowledge Engineering (SEKE) , pp. 125–124. 2018.

C17: W. Luiz et al. “A feature-oriented sentiment rating for mobile app reviews.” In Proceedings of the 2018 World Wide Web Conference, pp. 1909–1918. 2018.

C18: K. Srisopha, P. Behnamghader, and B. Boehm. “Do users talk about the software in my product? Analyzing user reviews on IoT products.” In the Proceedings of CIbSE XXI Ibero-American Conference on Software Engineering (CIbSE), pp. 551–564, 2018.

C19: N. H. Bakar, Z. M. Kasirun, N. Salleh, and A. Halim. “Extracting software features from online reviews to demonstrate requirements reuse in software engineering.” In  Proceedings of the International Conference on Computing & Informatics , pp. 184–190. 2017.

C20: M. Lu and P. Liang. “Automatic classification of non-functional requirements from augmented app user reviews.” In Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering , 2017, pp. 344–353.

C21: E. C. Groen, S. Kopczynska, M. P. Hauer, T. D. Krafft, and J. Doerr. "Users—the hidden software product quality experts?: A study on how app users report quality aspects in online reviews." In  2017 IEEE 25th International Requirements Engineering Conference (RE) , pp. 80–89. IEEE, 2017, https://doi.org/10.1109/RE.2017.73 .

C22: E. Guzman, M. Ibrahim, and M. Glinz. “A little bird told me: Mining tweets for requirements and software evolution.” In  2017 IEEE 25th International Requirements Engineering Conference (RE) , pp. 11–20. IEEE, 2017, https://doi.org/10.1109/RE.2017.88 .

C23: N. Jha and A. Mahmoud. "Mining user requirements from application store reviews using frame semantics." In  International working conference on requirements engineering: Foundation for software quality , pp. 273–287. Springer, Cham, 2017.

C24: I. Morales-Ramirez, F. M. Kifetew, and A. Perini. “Analysis of online discussions in support of requirements discovery.” In  International Conference on Advanced Information Systems Engineering (CAiSE) , pp. 159–174. Springer, Cham, 2017, https://doi.org/10.1007/978-3-319-59536-8_11 .

C25: G. Williams and A. Mahmoud. “Mining twitter feeds for software user requirements.” In  2017 IEEE 25th International Requirements Engineering Conference (RE) , pp. 1–10. IEEE, 2017, https://doi.org/10.1109/RE.2017.14 .

C26: T. Johann, C. Stanik, A. M. Alizadeh B., and W. Maalej. “SAFE: A simple approach for feature extraction from app descriptions and app reviews.” In 2017 IEEE 25th International Requirements Engineering Conference (RE), pp. 21–30. IEEE, 2017, https://doi.org/10.1109/RE.2017.71.

C27: J. Yang, C. K. Chang, and H. Ming. “A situation-centric approach to identifying new user intentions using the mtl method.” In  2017 IEEE 41st Annual Computer Software and Applications Conference (COMPSAC) , vol. 1, pp. 347–356. IEEE, 2017.

C28: E. Guzman, R. Alkadhi, and N. Seyff. “A needle in a haystack: What do twitter users say about software?,” In  2016 IEEE 24th International Requirements Engineering Conference (RE) , pp. 96–105. IEEE, 2016, https://doi.org/10.1109/RE.2016.67 .

C29: N. Kuehl. "Needmining: Towards analytical support for service design." In  International Conference on Exploring Services Science , pp. 187–200. Springer, Cham, 2016.

C30: T. Merten, M. Falis, P. Hübner, T. Quirchmayr, S. Bürsner, and B. Paech. “Software feature request detection in issue tracking systems.” In 2016 IEEE 24th International Requirements Engineering Conference (RE), pp. 166–175. IEEE, 2016, https://doi.org/10.1109/RE.2016.8.

C31: V. Nguyen, E. Svee, and J. Zdravkovic. "A semi-automated method for capturing consumer preferences for system requirements." In  IFIP Working Conference on The Practice of Enterprise Modeling , pp. 117–132. Springer, Cham, 2016.

C32: E. Svee and J. Zdravkovic. "A model-based approach for capturing consumer preferences from crowdsources: the case of Twitter." In  2016 IEEE Tenth International Conference on Research Challenges in Information Science (RCIS) , pp. 1–12. IEEE, 2016, https://doi.org/10.1109/RCIS.2016.7549323 .

C33: N. H. Bakar, Z. M. Kasirun, and N. Salleh. “Terms extractions: An approach for requirements reuse.” In 2015 2nd International Conference on Information Science and Security (ICISS), pp. 1–4. IEEE, 2015, https://doi.org/10.1109/ICISSEC.2015.7371034.

C34: W. Maalej and H. Nabil. "Bug report, feature request, or simply praise? on automatically classifying app reviews." In  2015 IEEE 23rd international requirements engineering conference (RE) , pp. 116–125. IEEE, 2015, https://doi.org/10.1109/RE.2015.7320414 .

C35: S. Panichella, A. Di Sorbo, E. Guzman, C. A. Visaggio, G. Canfora, and H. C. Gall. "How can I improve my app? Classifying user reviews for software maintenance and evolution." In  2015 IEEE International Conference on Software Maintenance and Evolution (ICSME) , pp. 281–290. IEEE, 2015, https://doi.org/10.1109/ICSM.2015.7332474 .

C36: H. Takahashi, H. Nakagawa, and T. Tsuchiya, “Towards automatic requirements elicitation from feedback comments: Extracting requirements topics using LDA,” In  Proceedings of the International Conference on Software Engineering and Knowledge Engineering (SEKE) , pp. 489–494. 2015, https://doi.org/10.18293/SEKE2015-103 .

C37: D. Sun and R. Peng. “A scenario model aggregation approach for mobile app requirements evolution based on user comments.” In  Requirements Engineering in the Big Data Era , pp. 75–91. Springer, Berlin, Heidelberg, 2015.

C38: E. Guzman and W. Maalej. “How do users like this feature? A fine grained sentiment analysis of app reviews.” In  2014 IEEE 22nd international requirements engineering conference (RE) , pp. 153–162. IEEE, 2014, https://doi.org/10.1109/RE.2014.6912257 .

C39: W. Jiang, H. Ruan, L. Zhang, P. Lew, and J. Jiang. “For user-driven software evolution: requirements elicitation derived from mining online reviews.” In  Pacific-Asia Conference on Knowledge Discovery and Data Mining , pp. 584–595. Springer, Cham, 2014.

C40: Z. Zhang, J. Qi, and G. Zhu. “Mining customer requirement from helpful online reviews,” In  2014 Enterprise Systems Conference , pp. 249–254. IEEE, 2014, https://doi.org/10.1109/ES.2014.38 .

C41: L. V. G. Carreno and K. Winbladh. “Analysis of user comments: An approach for software requirements evolution” In  2013 35th international conference on software engineering (ICSE) , pp. 582–591. IEEE, 2013, https://doi.org/10.1109/ICSE.2013.6606604 .

W1: C. Stanik, M. Haering, and W. Maalej. “Classifying multilingual user feedback using traditional machine learning and deep learning.” In  2019 IEEE 27th International Requirements Engineering Conference Workshops (REW) , pp. 220–226. IEEE, 2019.

W2: Q. A. Do and T. Bhowmik. "Automated generation of creative software requirements: a data-driven approach." In  Proceedings of the 1st ACM SIGSOFT International Workshop on Automated Specification Inference , pp. 9–12. 2018, https://doi.org/10.1145/3278177.3278180 .

W3: Z. S. H. Abad, S. D. V. Sims, A. Cheema, M. B. Nasir, and P. Harisinghani. “Learn more, pay less! Lessons learned from applying the wizard-of-oz technique for exploring mobile app requirements.” In 2017 IEEE 25th International Requirements Engineering Conference Workshops (REW) , Sep. 2017, pp. 132–138, https://doi.org/10.1109/REW.2017.71 .

W4: E. Bakiu and E. Guzman. “Which feature is unusable? Detecting usability and user experience issues from user reviews.” In  2017 IEEE 25th International Requirements Engineering Conference Workshops (REW) , pp. 182–187. IEEE, 2017, https://doi.org/10.1109/REW.2017.76 .

W5: R. Deocadez, R. Harrison, and D. Rodriguez. “Automatically classifying requirements from app stores: A preliminary study.” In 2017 IEEE 25th International Requirements Engineering Conference Workshops (REW) , Sep. 2017, pp. 367–371, https://doi.org/10.1109/REW.2017.58 .

W6: N. Jha and A. Mahmoud. “MARC: A mobile application review classifier.” In  REFSQ Workshops . 2017.

W7: R. L. Q. Portugal, J. C. S. Do Prado Leite, and E. Almentero. “Time-constrained requirements elicitation: Reusing GitHub content.” In  2015 IEEE Workshop on Just-In-Time Requirements Engineering (JITRE) , pp. 5–8. IEEE, 2015, https://doi.org/10.1109/JITRE.2015.7330171 .

S1: C. Wang, F. Zhang, P. Liang, M. Daneva, and M. van Sinderen. “Can app changelogs improve requirements classification from app reviews?: An exploratory study.” In  Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement , pp. 1–4. 2018, https://doi.org/10.1145/3239235.3267428 .

S2: W. Liang, W. Qian, Y. Wu, X. Peng, and W. Zhao. “Mining context-aware user requirements from crowd contributed mobile data.” In  Proceedings of the 7th Asia–Pacific Symposium on Internetware , pp. 132–140. 2015, https://doi.org/10.1145/2875913.2875933 .

S3: M. Xiao, G. Yin, T. Wang, C. Yang, and M. Chen. “Requirement acquisition from social Q&A sites.” In Liu L., Aoyama M. (eds) Requirements Engineering in the Big Data Era. Communications in Computer and Information Science , vol 558. Springer, Berlin, Heidelberg, 2015.

S4: W. Jiang, H. Ruan, and L. Zhang. “Analysis of economic impact of online reviews: An approach for market-driven requirements evolution.” In Zowghi D., Jin Z. (eds) Requirements Engineering. Communications in Computer and Information Science, vol. 432, pp. 45–59, 2014, Springer, Berlin, Heidelberg, https://doi.org/10.1007/978-3-662-43610-3_4 .

See Table 10

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Lim, S., Henriksson, A. & Zdravkovic, J. Data-Driven Requirements Elicitation: A Systematic Literature Review. SN COMPUT. SCI. 2, 16 (2021). https://doi.org/10.1007/s42979-020-00416-4


Received: 25 March 2020

Accepted: 02 December 2020

Published: 04 January 2021

DOI: https://doi.org/10.1007/s42979-020-00416-4


Keywords

  • Requirements engineering
  • Requirements elicitation
  • Open access
  • Published: 04 July 2019

The anatomy of the data-driven smart sustainable city: instrumentation, datafication, computerization and related applications

  • Simon Elias Bibri 1,2

Journal of Big Data, volume 6, Article number: 59 (2019)


We are moving into an era where instrumentation, datafication, and computerization are routinely pervading the very fabric of cities, coupled with the interlinking, integration, and coordination of their systems and domains. As a result, vast troves of data are generated and exploited to operate, manage, organize, and regulate urban life, or a deluge of contextual and actionable data is produced, analyzed, and acted upon in real time in relation to various urban processes and practices. This data-driven approach to urbanism is increasingly becoming the mode of production for smart sustainable cities. In other words, a new era is presently unfolding wherein smart sustainable urbanism is increasingly becoming data-driven. However, topical studies tend to deal mostly with data-driven smart urbanism while barely exploring how this approach can improve and advance sustainable urbanism under what is labeled ‘data-driven smart sustainable cities.’ Having a threefold aim, this paper first examines how data-driven smart sustainable cities are being instrumented, datafied, and computerized so as to improve, advance, and maintain their contribution to the goals of sustainable development through more optimized processes and enhanced practices. Second, it highlights and substantiates the great potential of big data technology for enabling such contribution by identifying, synthesizing, distilling, and enumerating the key practical and analytical applications of this advanced technology in relation to multiple urban systems and domains with respect to operations, functions, services, designs, strategies, and policies. Third, it proposes, illustrates, and describes a novel architecture and typology of data-driven smart sustainable cities. Thematic analysis is adopted as the research approach, as it suits the overall aim of this study. I argue that smart sustainable cities are becoming knowable, controllable, and tractable in new dynamic ways thanks to urban science, responsive to the data generated about their systems and domains by reacting to the analytical outcome of many aspects of urbanity in terms of optimizing and enhancing operational functioning, management, planning, design, development, and governance in line with the goals of sustainable development. The proposed architecture, which can be replicated, tested, and evaluated in empirical research, will add depth to studies in the field. This study intervenes in the existing scholarly conversation by bringing new insights to and informing the ongoing debate on smart sustainable urbanism in light of big data science and analytics. This work serves to inform city stakeholders about the pivotal role of data-driven analytic thinking in smart sustainable urbanism practices, and draws special attention to the enormous benefits of the emerging paradigm of big data computing in transforming the future form of such urbanism.

Introduction

Contemporary cities have a key role in strategic sustainable development; therefore, they have gained a central position in operationalizing this notion and applying this discourse. This is clearly reflected in Sustainable Development Goal 11 (SDG 11) of the United Nations’ 2030 Agenda, which entails making cities more sustainable, resilient, inclusive, and safe [53]. In this regard, the UN’s 2030 Agenda regards information and communication technology (ICT) as a means to promote socio-economic development and protect the environment, increase resource efficiency, achieve human progress and knowledge in societies, upgrade legacy infrastructure, and retrofit industries based on sustainable design principles [54]. Hence, the multifaceted potential of the smart city approach as enabled by ICT has been under investigation by the UN [55] through its study on ‘Big Data and the 2030 Agenda for Sustainable Development.’ In particular, there is an urgent need for developing and applying data-driven innovative solutions and sophisticated approaches to overcome the challenges of sustainability and urbanization. In other words, the world is drowning in data; if planners and policymakers realize the potential of harnessing these data in collaboration with data scientists, urban scientists, and computer scientists, the outcome could solve major global challenges [12].

In recent years, there has been a marked intensification of datafication. This is manifested in a radical expansion in the volume, range, variety, and granularity of the data being generated about urban environments and citizens (e.g., [12, 33, 34, 36]), with the primary aim of quantifying the whole of the city and thus putting it in a data format that can be organized, processed, and analyzed to generate useful knowledge for enhanced decision-making, as well as deep insights pertaining to a wide variety of practical uses and applications. We are currently experiencing the accelerated datafication of the city in a rapidly urbanizing world and witnessing the dawn of the big data era not out of the window, but in everyday life. Our urban everydayness is entangled with data sensing, data processing, and communication networking, and our wired world generates and analyzes overwhelming amounts of data. The modern city is turning into constellations of instruments and computers across many scales and morphing into a haze of software instructions, which are becoming essential to the operational functioning, planning, design, development, and governance of the city. The datafication of spatiotemporal citywide events has become a salient factor for the practice of smart sustainable urbanism.

Indeed, as a consequence of datafication, a new era is presently unfolding wherein smart sustainable urbanism is increasingly becoming data-driven [12]. At the heart of such urbanism is a computational understanding of city systems and processes that reduces urban life to logical and algorithmic rules and procedures, while also harnessing urban big data to provide a more holistic and integrated view or synoptic intelligence of the city. This is increasingly being directed towards improving, advancing, and maintaining the contribution of both sustainable cities and smart cities to the goals of sustainable development [12].

Overall, the new era of science and technology embodies an unprecedentedly transformative and constitutive power—manifested not only in the form of revolutionizing science and transforming knowledge, but also in advancing social practices, producing new discourses, catalyzing major shifts, and fostering societal transitions. Of particular relevance, it is instigating a massive change in the way both smart cities and sustainable cities are studied and understood, and in how they are planned, designed, operated, managed, and governed in the face of urbanization. To put it differently, these urban practices are becoming highly responsive to a form of data-driven urbanism that is the key mode of production for what have widely been termed smart sustainable cities whose monitoring, understanding, and analysis are accordingly increasingly relying on big data computing and underpinning technologies.

In a nutshell, the Fourth Scientific Revolution is set to erupt in cities, breaking out suddenly and dramatically, throughout the world. This is manifested in bits meeting bricks on a vast scale as instrumentation, datafication, and computerization are permeating the spaces we live in. The outcome will impact most aspects of urban life, raising questions and issues of urgent concern, especially those related to sustainability and urbanization. This pertains to what dimensions of cities will be most affected; how urban planning, design, development, and governance should change and evolve; and, most importantly, how cities will embrace and prepare for looming technological disruptions and opportunities.

However, topical studies tend to deal mostly with data-driven smart urbanism (e.g., [7, 35, 36, 37, 38, 40]) while barely exploring how this approach can improve and advance sustainable urbanism under what is labeled ‘data-driven smart sustainable cities’ as a leading paradigm of urbanism [11, 12]. Moreover, research on big data applications in the context of smart cities tends to deal largely with economic growth, the quality of life, and governance (e.g., [5, 8, 15, 26, 30, 31, 32, 33, 35, 49]) while overlooking the rather more urgent issues and complex challenges related to sustainability. This paucity of research pertains particularly to the untapped potential of big data technologies and their novel applications for advancing sustainability in the context of smart sustainable cities [8]. Indeed, many of the emerging smart solutions are not aligned with sustainability goals [1]. This relates to the deficiencies and shortcomings of smart cities in this regard (see Bibri [11] for a detailed review).

Having a threefold aim, this paper first examines how data-driven smart sustainable cities are being instrumented, datafied, and computerized so as to improve, advance, and maintain their contribution to the goals of sustainable development through more optimized processes and enhanced practices. Second, it highlights and substantiates the great potential of big data technology for enabling such contribution by identifying, synthesizing, distilling, and enumerating the key practical and analytical applications of this technology in relation to multiple urban systems and domains with respect to operations, functions, services, designs, strategies, and policies. Third, it proposes, illustrates, and describes a novel architecture and typology of data-driven smart sustainable cities. I argue that smart sustainable cities are becoming knowable, controllable, and tractable in new dynamic ways thanks to urban science, responsive to the data generated about their systems and domains by reacting to the analytical outcome of many aspects of urbanity in terms of optimizing and enhancing operational functioning, management, planning, design, development, and governance in line with the goals of sustainable development.

The remainder of this paper is structured as follows. Section “Conceptual background” introduces and describes the key conceptual definitions relevant to the topic of this study. Section “A survey of related work” provides a survey of related work. Section “Method: thematic analysis” outlines the research approach adopted in this study: thematic analysis. Section “Results and discussion” presents and combines results and discussion. As such, it delves into the heart of the data-driven smart sustainable city, covering a range of constituents and underpinnings; identifying, synthesizing, distilling, and enumerating the key practical and analytical applications of big data technology in terms of sustainability effects and benefits; discussing relevant policy and technology issues; and proposing, illustrating, and describing a novel architecture and typology of the data-driven smart sustainable city. The paper ends, in the “Conclusion” section, with concluding remarks, contribution, and further research.

Conceptual background

  • Data-driven smart sustainable cities

‘Data-driven smart sustainable cities’ is a term that has recently gained traction in academia, government, and industry to describe cities that are increasingly composed and monitored by ICT of ubiquitous and pervasive computing. Such cities are thereby able, through city operations centers, planning and policy offices, research centers, innovation labs, and living labs, to use advanced technologies for generating, processing, and analyzing the data deluge in order to enhance decision-making processes and to develop and implement innovative solutions for improving sustainability, efficiency, resilience, equity, and the quality of life [12]. This entails developing a citywide instrumented system (i.e., inter-agency control, planning, innovation, and research hubs) for creating and inventing the future. For example, a data-driven city operations center, which is designed to monitor the city as a whole, pulls together real-time data streams from many different agencies spread across various urban domains and then analyzes them for decision-making and problem-solving purposes: optimizing, regulating, and managing urban operations (e.g., traffic, transport, energy, etc.).
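To make the operations-center idea above more tangible, the following minimal sketch consolidates simulated real-time readings from several agency feeds and flags values that exceed an alert threshold. The domains, sensor identifiers, readings, and thresholds are all invented for illustration; they are not part of any cited system.

```python
import queue
from dataclasses import dataclass

@dataclass
class Reading:
    domain: str      # illustrative urban domains, e.g., "traffic", "energy"
    sensor_id: str
    value: float

# A shared buffer standing in for the real-time streams an operations
# center pulls from different urban agencies.
feed: "queue.Queue[Reading]" = queue.Queue()

# Simulated readings from three agency feeds (hypothetical values).
for r in [Reading("traffic", "junction-12", 340.0),
          Reading("traffic", "junction-12", 910.0),   # unusually high flow
          Reading("energy", "substation-3", 57.2),
          Reading("transport", "bus-line-8", 12.0)]:
    feed.put(r)

# Assumed per-domain alert thresholds (purely illustrative).
THRESHOLDS = {"traffic": 800.0, "energy": 100.0}

# Monitor-analyze-act loop in miniature: aggregate readings per domain
# and flag any value that breaches its domain's threshold.
by_domain = {}
while not feed.empty():
    r = feed.get()
    by_domain.setdefault(r.domain, []).append(r.value)

for domain, values in sorted(by_domain.items()):
    flagged = [v for v in values if v > THRESHOLDS.get(domain, float("inf"))]
    mean = sum(values) / len(values)
    print(f"{domain:10s} readings={len(values)} mean={mean:6.1f} flagged={flagged}")
```

In a real deployment, the in-process queue would be replaced by a distributed message broker and the fixed thresholds by analytical models, but the control loop (ingest, aggregate, decide) is the same.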

  • Datafication

The big data revolution will transform the way we live, work, and think in the city. Datafication has become a buzzword in the era of this revolution. It describes an urban trend in which core city operations and functions come to rely on big data computing and its underpinning technologies. In other words, the notion of datafication denotes that cities today depend upon their data to operate properly, and even to function at all, with regard to many domains of urban life [12]. It also refers to the collective tools, processes, and technologies used to transform a city into a data-driven enterprise. In short, datafication involves turning many aspects of urban life into computerized data and transforming this information into value. As such, this concept helps better frame the changes taking place now [21]. A city that implements datafication is said to be datafied. To datafy a city is to put it in a quantified format so it can be structured and analyzed.

Cities are taking any possible quantifiable metric and squeezing useful knowledge out of it for enhanced decision-making and deep insights pertaining to many domains of urban life. Datafication entails that, in a modern data-oriented urban landscape, a city’s performance is contingent on having control over the storage, management, processing, and analysis of the data, as well as on the extracted knowledge in the form of applied intelligence. Tackling sustainability and urbanization issues is one of the key concerns of the datafication of the contemporary city. To put it differently, the urban world is drowning in data; if planners and policymakers realize the potential of harnessing these data in collaboration with urban scientists and data scientists, the outcome could solve major global challenges. The point at issue is that we generate enormous amounts of data on a daily basis, a binary trail of breadcrumbs that forms a map of urban life in terms of citizens’ experiences and urban dynamics; the resulting disparate datasets can, if harnessed properly, open up a unique window of opportunity and represent a goldmine for making cities more sustainable and in tune with citizens’ actual needs and aspirations.
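As a small, concrete illustration of “putting the city in a quantified format,” the sketch below parses free-form service-request records into structured fields and aggregates them into an analyzable metric (requests per district per hour). The record layout, district names, and events are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw service-request records: "timestamp|district|free text".
raw_events = [
    "2019-07-04T08:12:00|Centrum|streetlight out on main road",
    "2019-07-04T08:47:00|Centrum|overflowing waste bin",
    "2019-07-04T09:05:00|Kista|noise complaint near station",
]

# Datafication step: parse each free-form record into structured fields,
# then aggregate into a quantified metric (requests per district per hour).
counts: Counter = Counter()
for line in raw_events:
    timestamp, district, _text = line.split("|", 2)
    hour = datetime.fromisoformat(timestamp).strftime("%Y-%m-%d %H:00")
    counts[(district, hour)] += 1

for (district, hour), n in sorted(counts.items()):
    print(f"{district:8s} {hour}  requests={n}")
```

Once in this quantified form, such counts can feed analytics pipelines of the kind sketched in the next subsection.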

Big data computing and the underpinning technologies

Big data computing is an emerging paradigm of data science that centers on multidimensional data mining for scientific discovery over large-scale infrastructure. Data mining/knowledge discovery and decision-making from voluminous, varied, real-time, exhaustive, fine-grained, indexical, dynamic, flexible, evolvable, relational data is a daunting task in terms of storage, management, organization, processing, analysis, interpretation, evaluation, modeling, and simulation, as well as in terms of the visualization and deployment of the obtained results for different purposes. Big data computing amalgamates, as underpinning technologies, large-scale computation, new data-intensive techniques and algorithms, and advanced mathematical models to build and perform data analytics. Accordingly, big data computing demands huge storage and computing power for data curation and processing for the purpose of discovering new or extracting useful knowledge, typically intended for immediate use in an array of decision-making processes to achieve different purposes. It entails the following components (see [12] for a detailed descriptive account):

Advanced techniques based on data science fundamental concepts and computer science methods.

Data mining models.

Computational mechanisms involving sophisticated and dedicated software applications and database management systems.

Advanced data mining tasks and algorithms.

Modeling and simulation approaches and prediction and optimization methods.

Data processing platforms.

Cloud and fog computing models.

The term ‘big data’ is essentially used to mean collections of datasets whose volume, velocity, variety, exhaustivity, relationality, and flexibility make it difficult to manage, process, and analyze the data using traditional database systems and software techniques. The term ‘big data analytics’ denotes ‘any vast amount of data that has the potential to be collected, stored, retrieved, integrated, selected, preprocessed, transformed, analyzed, and interpreted for discovering new or extracting useful knowledge. Prior to this, the analytical outcome (the obtained results) can be evaluated and visualised in an understandable format before their deployment for decision-making purposes (e.g., improving, adjusting, or changing an operation, function, service, strategy, or policy)… In the domain of smart sustainable urbanism, big data analytics refers to a collection of sophisticated and dedicated software applications and database management systems run by machines with very high processing power, which can turn a large amount of urban data into useful knowledge for enhanced decision-making and deep insights in relation to various urban domains, such as transport, mobility, traffic, environment, energy, land use, waste management, education, healthcare, public safety, planning and design, and governance’ ([9], p. 234).
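The stages named in the quoted definition (collection, preprocessing, transformation, analysis, and deployment of the results for decision-making) can be traced in a toy end-to-end pipeline such as the one below. The sensor readings, the guideline threshold, and the decision rule are assumptions made for illustration; a production system would realize each stage with distributed data processing platforms rather than plain functions.

```python
# A toy end-to-end pass through the analytics stages named in the quoted
# definition: collect -> preprocess -> transform -> analyze -> deploy.
# All values and the decision rule are illustrative assumptions.

def collect() -> list:
    # Stand-in for ingesting raw readings from urban air-quality sensors.
    return [
        {"sensor": "air-q-1", "pm25": 12.0},
        {"sensor": "air-q-2", "pm25": None},   # incomplete reading
        {"sensor": "air-q-3", "pm25": 61.5},
    ]

def preprocess(rows: list) -> list:
    # Cleaning/selection: drop incomplete records.
    return [r for r in rows if r["pm25"] is not None]

def transform(rows: list) -> list:
    # Projection onto the variable of interest.
    return [r["pm25"] for r in rows]

def analyze(values: list) -> dict:
    # Extract a simple piece of "useful knowledge": the mean level and
    # whether it exceeds an assumed guideline threshold of 25.0.
    mean = sum(values) / len(values)
    return {"mean_pm25": mean, "exceeds_guideline": mean > 25.0}

def deploy(result: dict) -> None:
    # Deployment here is just a decision signal an operations center
    # could act on (e.g., adjusting traffic regulation).
    action = "restrict traffic" if result["exceeds_guideline"] else "no action"
    print(f"mean PM2.5 = {result['mean_pm25']:.1f} -> {action}")

deploy(analyze(transform(preprocess(collect()))))
```

Keeping each stage a separate function mirrors the separation the definition draws between knowledge discovery and the deployment of its results.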

A survey of related work

In one of the earlier works on data-driven urbanism, Batty [5] describes how the growth of big data is shifting the emphasis from longer-term strategic planning to short-term thinking about how cities function and can be managed. His argument revolves around the sea change in the kinds of data that are emerging about what happens where and when in cities, and how it is drastically altering the way we conceive of, understand, and plan smart cities. Bettencourt [7] explores how big data can be useful in urban planning by formalizing the planning process as a general computational problem. The focus in his paper is on scientific (complexity science) and engineering principles (big data technologies) pertaining to data-driven urbanism, and how they particularly relate to urban policy, management, and planning with respect to achieving new solutions to wicked and intractable urban problems. In his article ‘The Real-time City? Big Data and Smart Urbanism,’ Kitchin [33] focuses on smart cities as increasingly composed of and monitored by pervasive and ubiquitous computing and, drawing on a number of examples, details how cities instrumented with digital devices and infrastructure produce big data that enable real-time analysis of city life and new modes of urban governance, and provide the raw material for envisioning and enacting more efficient, competitive, productive, open, and transparent cities. He moreover provides a critical reflection on the implications of big data and smart urbanism, examining five emerging concerns: the politics of big urban data; technocratic governance and city development; corporatization of city governance and technological lock-ins; buggy, brittle, and hackable cities; and the panoptic city. A large part of this examination is also the aim of Kitchin’s [34] paper, which provides a critical overview of data-driven, networked urbanism and smart cities, focusing in particular on the relationship between data and the city (rather than network infrastructure or computational or urban issues), and critically examines a number of urban data issues, including corporatization, ownership, control, privacy and security, anticipatory governance, and technical challenges. Kitchin [36] examines the forms, practices, and ethics of smart cities and urban science, paying particular attention to instrumental rationality and realist epistemology; privacy, dataveillance, and geosurveillance; and data uses, such as social sorting and anticipatory governance. Overall, the above works lack an important strand of the topic of smart or data-driven urbanism, namely sustainability, and also tend to focus on either technical or political issues related to urban big data. In view of that, Bibri [11] provides a comprehensive, state-of-the-art review and synthesis addressing the sustainability and unsustainability of smart urbanism and related big data applications in terms of research issues and debates, knowledge gaps, technological advancements, as well as challenges and common open issues.

Research on big data analytics and its application in the context of smart cities tends to deal largely with economic development (i.e., management, efficiency, effectiveness, innovation, productivity, etc.), the quality of life in terms of service delivery betterment, and governance (e.g., [5, 11, 12, 14, 26, 31, 32, 34, 49]) while overlooking and barely exploring the rather more urgent issues and complex challenges related to sustainability [8]. This paucity of research pertains particularly to the untapped potential of big data technologies and their novel applications for enhancing the environmental and social aspects of sustainability in the context of smart sustainable cities [8, 11, 12]. Indeed, many of the emerging smart solutions are not aligned with sustainability goals [1]. This relates to the deficiencies and misunderstandings of smart cities in this regard [11], to reiterate. Consequently, a recent research wave has started to focus on enhancing smart city approaches to achieve the required level of sustainability using big data applications under what is labeled ‘smart sustainable cities’ or ‘sustainable smart cities’ (e.g., [3, 6, 8, 11, 12]). Therefore, there are only a few studies that have recently focused on the uses of big data applications in relation to the different aspects of sustainability in the context of smart sustainable cities (see, e.g., [8, 9]; Bibri [12, 15]). This lack of research can be explained by the fact that such cities are a new urban phenomenon, and the concept only became widespread during the mid-2010s.

Method: thematic analysis

It is assumed that in data-driven smart sustainable cities, there are concepts and applications that repeat themselves and compose distinct models of such cities in the context of sustainability. Therefore, this paper uses a qualitative approach to identify these concepts and applications as well as the underlying technologies, and ultimately to uncover the constructs behind them. This relates to the thematic analysis approach, where the aim of qualitative studies is to describe and explain a pattern of relationships, a process that entails a set of conceptual and subject categories [46], pertaining in this context to the data-driven smart sustainable city.

Following a set of qualitative 'tactics' suggested by Miles and Huberman [45] that can assist in generating meanings from diverse material, a thematic analysis was designed and employed with two purposes in mind: first, to identify the most advanced big data applications related to the three dimensions of sustainability and related concepts and technologies; second, to conceptualize the theoretical base behind the model of the data-driven smart sustainable city with the underlying technological and other components. As an inductive analytic approach, thematic analysis can be used to address the different types of questions posed by researchers to produce complex conceptual or analytical cross-examinations of meaning in qualitative material. This can be done by discovering patterns, relationships, themes, and concepts in this material, which includes multidisciplinary and interdisciplinary literature. Thematic analysis is thereby an appropriate approach for analyzing and synthesizing a large body of documents, in the form of, for example, conceptual frameworks and descriptive accounts. It can also be applied to produce theory-driven analyses.

The main steps of this study’s thematic analysis approach are as follows:

Review of smart cities, sustainable cities, data-driven cities, big data technologies and their novel applications, and other multidisciplinary and interdisciplinary literature. The aim is to deconstruct a multidisciplinary and interdisciplinary text related to the model of the data-driven smart sustainable city, one that puts emphasis on instrumentation, datafication, and computerization and related big data applications for multiple urban systems and domains. The outcome of this process entails numerous themes, applications, technologies, and urban centers related to the respective model.

Pattern recognition entails the ability to see patterns in seemingly random information. The purpose of this second step is to note major patterns and concepts within the results of the first step, then to look for similarities among them and organize the results by concept (a toy computational illustration of this step is sketched after the list).

Identifying a city model involves recognizing a specific and distinctive model of the data-driven smart sustainable city.

Conceptualization is about finding theoretical relationships among the identified concepts and the data-driven smart sustainable city.
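Although the analysis performed here is qualitative and interpretive, the pattern-recognition step can be loosely illustrated in computational terms. The following toy Python sketch, which is not part of the study's actual method, counts concept occurrences and co-occurrences across a handful of invented corpus snippets; both the snippets and the concept vocabulary are hypothetical stand-ins for the reviewed literature and the themes emerging from step one.

```python
from collections import Counter
from itertools import combinations

# Hypothetical snippets standing in for reviewed abstracts.
documents = [
    "big data analytics for energy efficiency in smart sustainable cities",
    "sensor networks and real-time data for sustainable transport planning",
    "cloud computing platforms for urban big data and traffic management",
]

# Concept vocabulary assumed to emerge from the first (review) step.
concepts = {"big data", "sensor", "cloud", "energy", "transport", "traffic"}

def extract_concepts(text):
    """Return the set of known concepts mentioned in a document."""
    return {c for c in concepts if c in text}

# Step 2 (pattern recognition): count occurrences and co-occurrences.
occurrence = Counter()
co_occurrence = Counter()
for doc in documents:
    found = extract_concepts(doc)
    occurrence.update(found)
    co_occurrence.update(combinations(sorted(found), 2))

print(occurrence.most_common(3))     # dominant concepts across the corpus
print(co_occurrence.most_common(3))  # candidate thematic relationships
```

Frequently co-occurring concept pairs would then be candidates for the similarity grouping described above, while the subsequent model identification and conceptualization steps remain interpretive.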

Results and discussion

As cities are routinely embedded with all kinds of ICT forms, including infrastructure, platforms, systems, devices, sensors and actuators, and networks, the volume of data generated about them is growing exponentially and diversifying, providing rich, heterogeneous streams of information about urban environments and citizens. This data deluge enables the real-time analysis of different urban systems and interconnects data across different urban domains to provide detailed views of the relationships between different forms of data that can be utilized for advancing the various aspects of urbanity through new modes of operational functioning, planning, design, development, and governance in the context of sustainability, as well as providing the raw material for envisioning more sustainable, efficient, resilient, and livable cities. The point at issue is that we generate enormous amounts of data on a daily basis, a binary trail of breadcrumbs that forms a map of urban life in terms of citizens' experiences and urban dynamics, and these disparate datasets, if harnessed properly, open up a unique window of opportunity, and represent a goldmine, for making cities more sustainable and in tune with citizens' actual needs and aspirations.

On the evolving integration of data-driven smart cities and sustainable cities

Both smart cities and sustainable cities are becoming ever more computationally augmented and digitally instrumented and networked, their systems interlinked and integrated, their domains combined and coordinated, and thus their networks coupled and interconnected, and consequently, vast troves of urban data are being generated and used to control, manage, organize, and regulate urban life in real time [ 11 , 12 ]. In other words, the increasing pervasiveness of urban systems, domains, and networks utilizing digital technologies is generating enormous amounts of digital traces capable of reflecting in real time how people make use of urban spaces and infrastructures and how urban activities and processes are performed, an information asset which is being leveraged in steering smart cities and sustainable cities. Indeed, citizens leave their digital traces just about everywhere they go, both voluntarily and involuntarily, and when cross-referenced with each citizen’s spatial, temporal, and geographical contexts, the data harnessed at this scale offers a means of describing, and responding to, the dynamics of the city in real time. In addition to individual citizens, city systems, domains, and networks constitute a key source of data deluge, which is generated by various urban entities, including governmental agencies, authorities, administrators, institutions, organizations, enterprises, and communities by means of urban operations, functions, services, designs, strategies, and policies.

Smart cities are increasingly connecting the ICT infrastructure, the physical infrastructure, the social infrastructure, and the economic infrastructure to leverage their collective intelligence, thereby striving to render themselves more sustainable, efficient, functional, resilient, livable, and equitable. It follows that smart cities of the future seek to solve a fundamental conundrum of cities: ensuring sustainable socio-economic development, equity, and enhanced quality of life at the same time as reducing costs and increasing resource efficiency and environment and infrastructure resilience. This is increasingly enabled by utilizing a fast-flowing torrent of urban data and the rapidly evolving data analytics technologies; algorithmic planning and governance; and responsive, networked urban systems. In particular, the generation of colossal amounts of data and the development of sophisticated data analytics for understanding, monitoring, probing, regulating, and planning the city is one significant aspect of smart cities that is being embraced by sustainable cities to improve, advance, and maintain their contribution to the goals of sustainable development (see, e.g., [8, 9, 12, 15, 17]). Generally, a sustainable city can be understood as a set of approaches to operationalizing sustainable development in, or practically applying the knowledge about sustainability and related technologies to the planning and design of, existing and new cities or districts. It represents an instance of sustainable urban development, a strategic approach to achieving the long-term goals of urban sustainability. Accordingly, it needs to balance the environmental, social, and economic goals of sustainability as an integrated process. Specifically, as put succinctly by Bibri and Krogstie ([14], p. 11), a sustainable city 'strives to maximize the efficiency of energy and material use, create a zero-waste system, support renewable energy production and consumption, promote carbon-neutrality and reduce pollution, decrease transport needs and encourage walking and cycling, provide efficient and sustainable transport, preserve ecosystems and green space, emphasize design scalability and spatial proximity, and promote livability and community-oriented human environments.'

There are different instances of sustainable cities as an umbrella concept, which are identified as models of sustainable urban forms. Of these, the compact city and the eco-city are advocated as more sustainable and environmentally sound models [12]. From a conceptual perspective, Jabareen [28] ranks the compact city as more sustainable than the eco-city. Ideally, the compact city secures socially beneficial, economically viable, and environmentally sound development through dense and mixed-use patterns that rely on sustainable transportation [18, 22, 29, 30]. It emphasizes, in addition to density, mixed land uses, and sustainable transportation, compactness, social mix or diversity, high standards of environmental and urban management systems, energy-efficient buildings, closeness to local squares, more space for bikes and pedestrians, and green areas [12]. The eco-city, in contrast, focuses on renewable resources, passive solar design, ecological and cultural diversity, urban greening, and environmental management and other environmentally sound policies [28]. The eco-city encompasses a wide range of urban-ecological proposals that seek to achieve urban sustainability within different local and national contexts (see [48] for a set of case studies). These proposals cover a wide range of environmental, social, and institutional policies directed at managing urban spaces to achieve sustainability, and this model emphasizes environmental management and promotes the ecological agenda through a set of institutional and policy tools [28]. All in all, the effects of the compact city and the eco-city combined are compatible with the fundamental goals of sustainable development.

Furthermore, for supra-national states, national governments, and city officials, smart cities offer the enticing potential of environmental and socio-economic development: more sustainable, livable, functional, safe, equitable, and transparent cities, and the renewal of urban centers as hubs of innovation and research (e.g., [3, 6, 11, 12, 33, 41, 52]). While there are several main characteristics of a smart city as evidenced by industry and government literature (see, e.g., [27, 33] for an overview), the one that this paper is concerned with focuses on environmental and social sustainability.

There has recently been much enthusiasm in the domain of smart sustainable urbanism about the immense possibilities and fascinating opportunities created by the data deluge and its extensive sources with regard to enhancing and optimizing urban operational functioning, management, planning, design, and governance in line with the goals of sustainable development. This enthusiasm results from thinking about and understanding sustainability and urbanization and their relationships in a data-analytic fashion for the purpose of generating and applying knowledge-driven, fact-based, strategic decisions in relation to such urban domains as transport, traffic, mobility, energy, environment, education, healthcare, public safety, public services, governance, and science and innovation [12].

Therefore, the operational functioning, management, planning, and design of smart sustainable cities as a set of interrelated systems is increasingly being dominated by the use of advanced data, information, and communication technologies. The provision of data from urban operations and functions is offering the prospect of urban environments wherein information on how such cities are functioning and operating is continuously available, and urban planning is facing the prospect of becoming continuous as the data deluge floods in from different urban domains and is updated in real time, thereby allowing for a dynamic conception of planning and a scalable and efficient form of design [12].

Digital instrumentation

The big data revolution is set to erupt in both smart cities and sustainable cities throughout the world. This is manifested in bits meeting bricks on a vast scale as instrumentation routinely pervades the spaces we live in. Smart sustainable cities are depicted as constellations of instruments for measurement and control across many spatial scales that are connected through fixed, wireless ad hoc, and mobile networks with a modicum of intelligence, which provide and coordinate continuous data regarding different aspects of urbanity in terms of the flow of decisions about the physical, infrastructural, operational, functional, and socio-economic forms of smart sustainable cities [8]. As such, the instrumentation of such cities offers the prospect of an objectively measured, real-time analysis of urban life and infrastructure, and opens up dramatically different forms of social organization. It is the ICT industry that is providing the detailed hardware and software that constitute the operating system for smart sustainable cities. This infrastructure entails integration, data collection and mining, decision making, practice enhancement, and service delivery in relation to sustainability, efficiency, resilience, equity, and the quality of life.

While there are different approaches to generating the deluge of urban data (e.g., directed, automated, volunteered, etc.), the automated one is the most common and prominent among them. It pertains to various automatic functions of the devices and systems that are widely deployed and networked across urban environments. Indeed, the automated approach to urban data deluge generation has recently captured the imagination of those concerned with understanding, operating, managing, and planning cities, as well as seeking useful insights into urban systems, in particular in relation to the environment [9]. In particular, there has been increased interest in sensor networks and the IoT as well as the tracking and tracing of people and objects [33]. For example, sensor networks can be used to monitor the operation and condition of urban and public infrastructures, such as roads, rails, tunnels, sewage systems, water systems, power and gas provision systems, hospitals, facilities, and parks, as well as environmental conditions. In this context, smart sustainable/sustainable smart cities offer the prospect of real-time analysis of the processes operating and organizing urban life, which is of paramount importance to advancing the different aspects of sustainability. There are a number of tools and techniques used in the automated approach to generating the urban data deluge (e.g., [6, 9, 12, 23, 33, 39]), including the following (a minimal sketch of how such heterogeneous readings might be normalized follows the list):

Global Positioning System (GPS) in vehicles and on people.

Smart tickets that are used to trace passenger travel.

RFID tags attached to objects and people.

Sensed data generated by a variety of sensors and actuators embedded into the objects or environments that regularly communicate their measurements.

Capture systems in which the means of performing tasks captures data about those tasks.

Digital devices that record and communicate the history of their own use.

Digital traces left through purchase of goods and related demand supply situations.

Transactions and interactions across digital networks that not only transfer information, but also generate data about the transactions and interactions themselves.

Clickstream data that record how people navigate through websites or apps.

Automatic Meter Reading (AMR) that communicates utility usage on a continuous basis.

Automated monitoring of public services provision.

The scanning of machine-readable objects such as travel passes, passports, or barcodes on parcels that register payment and movement through a system.

Machine-to-machine interactions across the IoT.

Uniquely indexical objects and machines that conduct automatic work as part of the IoT, communicating about their use and traceability if they are mobile (automatic doors, lighting and heating systems, washing machines, security alarms, Wi-Fi router boxes, etc.).

Transponders that monitor throughput at toll-booths, measuring vehicle flow along a road or the number of empty spaces in a car park, and track the progress of buses and trains along a route.
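To make the heterogeneity of these automatically generated streams more concrete, the following minimal Python sketch shows one way such readings might be normalized into a common record before storage and analysis. The schema and field names are illustrative assumptions, not a standard adopted by any of the systems listed above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class SensedReading:
    """A generic record for automatically generated urban data."""
    source_id: str   # e.g., a sensor, meter, or transponder identifier
    kind: str        # e.g., "gps", "rfid", "amr", "transponder"
    value: float     # the measured quantity (usage, count, speed, ...)
    unit: str        # e.g., "kWh", "vehicles/min"
    location: tuple  # (latitude, longitude) of the device
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

# A small ingestion step: fold heterogeneous readings into one stream.
readings = [
    SensedReading("amr-0042", "amr", 3.7, "kWh", (63.43, 10.40)),
    SensedReading("loop-17", "transponder", 42.0, "vehicles/min", (63.42, 10.39)),
]
for r in readings:
    print(f"{r.timestamp.isoformat()} {r.kind} {r.source_id}: {r.value} {r.unit}")
```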

In view of the above, embedding more and more advanced ICT in various forms into smart sustainable/sustainable smart cities will undoubtedly continue and even escalate for the purpose of providing the most suitable tools and methods for handling the underlying complexity and thus dealing with the challenges they are facing and will continue to face. In particular, advanced ICT has an instrumental and shaping role not only in monitoring, understanding, and analyzing such cities, but also in improving sustainability, efficiency, resilience, and the quality of life in them. In this regard, the broad availability of urban data is pushing research ever more into further advancing the core enabling technologies of big data analytics towards realizing and implementing urban intelligence functions and related simulation models and optimization and prediction methods.

From a different perspective, not all data are generated equally, and their variety is associated with, among other things, the purpose of their use. There are opportunistic data, which are collected for one purpose and then used for another, e.g., data owned by cellphone companies to run their operations but used by transport companies to better understand urban mobility. User-generated data result from the engagement of citizens, e.g., data from social media platforms, which provide valuable information to better understand today's cities. Purposely sensed data, e.g., automated data, reflect the power of ubiquitous urban sensors that can be deployed ad hoc in public and private spaces to better understand some aspects of urban life and dynamics.

Moreover, the nature of the produced data, and the way they are stored, managed, processed, analyzed, and disciplined, is determined by technical configurations and deployments: the various sensor recording parameters, the length of the collected data, where the sensors are located, what kinds of sensors are embedded in which environments, their settings and calibration, their integration and fusion, and their exhaustiveness [12].

Big data ecosystem and its components

Big data trends are associated with pervasive and ubiquitous computing, which involves myriads of sensors pervading urban environments on a massive scale. Therefore, the volume of the data generated is huge and thus the processes, systems, platforms, infrastructures, and networks involved in handling these data are complex. Mechanisms to store, integrate, manage, process, analyze, and visualize the generated data through scalable applications remain a major scientific and technological challenge in the ambit of data science, urban science, and computer science.

The evolving data deluge is due to a number of the core enabling and driving ICTs of pervasive and ubiquitous computing and thus big data computing. These are being fast embedded into the very fabric of contemporary cities, everyday practices, and spaces, whether the cities are badging or regenerating themselves as smart sustainable, to pave the way for adopting the upcoming innovative solutions to overcome the challenges of sustainability and urbanization in the years ahead. Further, like many areas to which big data computing can be applied, smart sustainable cities require the big data ecosystem and its components to be put in place as part of their ICT infrastructure prior to designing, developing, deploying, implementing, and maintaining the diverse applications that support sustainability and reduce the negative effects of urbanization. As a scientific and technological area, the core enabling technological components underlying the big data ecosystem are under vigorous investigation in both academic circles and the ICT industry towards the development of computationally augmented urban environments as part of the informational landscape of such cities [11]. Big data ecosystems are for capturing data to generate useful knowledge and deep insights. In the sphere of smart sustainable cities, the big data landscape is daunting, and there is no one 'big data ecosystem' or single go-to solution when it comes to building big data architecture. The big data ecosystem involves various technologies differing in quality and form, which allow data to be stored, managed, processed, analyzed, and visualized, and the obtained results to be deployed. It consists of infrastructure and tools for storing, managing, processing, and analyzing data; specialized analytics techniques; and applications. Bibri and Krogstie [16] provide a comprehensive, state-of-the-art review of the core enabling technologies of big data analytics in relation to smart sustainable cities, including a synthesis and illustration of the key computational and analytical techniques, processes, and models associated with the functioning and application of big data analytics. The components addressed by the authors in rather more detail include, but are not limited to, the following:

Pervasive sensing in terms of collecting and measuring urban big data; the IoT and related RFID tags; sensor-based urban reality mining; and sensor technologies, types, and areas in big data computing.

Wireless communication network technologies and smart network infrastructures.

Cloud and fog/edge computing.

Advanced techniques and algorithms.

Conceptual and analytical frameworks.

Generally, big data ecosystems entail a number of permutations of the underlying core enabling technologies as shaped by the scale, complexity, and extension of the city projects and initiatives to be developed and implemented. In this respect, it is necessary, as suggested by Chourabi et al. [20], to take into account flexible design, quick deployment, extensible implementation, comprehensive interconnections, and advanced intelligence. Regardless, while there are some permutations that may well apply to most urban systems and domains, there are some technical aspects and details that remain specific to smart sustainable cities, more specifically to the requirements, objectives, and resources of related projects and initiatives, which are usually determined by and embedded in a given context [11, 16]. Yet most, if not all, of the possible permutations involve sensing technologies and networks, data processing platforms, cloud computing and/or fog computing infrastructures, and wireless communication and networking technologies. These are intended to provide a full analytic system of big data and related functional applications based on advanced decision support systems and strategies: urban intelligence functions and related simulation models and optimization and prediction methods [12]. On this note, Batty et al. [6] state that much of the focus on sustainable smart cities of the future 'will be in evolving new models of the city in its various sectors that pertain to new kinds of data and movements and actions that are largely operated over digital networks while at the same time, relating these to traditional movements and locational activity. Very clear conceptions of how these models might be used to inform planning at different scales and very different time periods are critical to this focus… Quite new forms of integrated and coordinated decision support systems will be forthcoming from research on smart cities of the future'.

Cloud computing for big data analytics

Characteristics and benefits

The term 'cloud computing' has been defined in multiple ways by ICT experts and researchers and a wide range of organizations (e.g., government agencies) and institutions (e.g., educational institutions). Common threads running through most definitions are that cloud computing denotes a computing model in which standardized, scalable, and flexible ICT-enabled capabilities are delivered in real time via the Internet in the form of three types of services to external users or customers: (1) Software-as-a-Service (SaaS), (2) Platform-as-a-Service (PaaS), and (3) Infrastructure-as-a-Service (IaaS). SaaS and PaaS denote the provider's software applications and software development platforms, respectively, and IaaS means virtual servers, storage facilities, processors, and networks as resources, all being delivered over the cloud. Thus, cloud computing consists of several components, which can be rapidly provisioned with minimal management effort. However, the diversity of definitions, coupled with the lack of agreement over what constitutes cloud computing, has created confusion as to what it really means as an emerging computing model, and consequently its definitions have been criticized for being too broad and unclear [12].
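To make the division of responsibility between these three service models concrete, the following minimal Python sketch encodes one common, simplified view of what the provider manages versus what the user manages at each layer; the exact split varies between providers and is given here purely for illustration.

```python
# A simplified, illustrative view of the cloud service model stack.
service_models = {
    "IaaS": {"provider": ["servers", "storage", "network"],
             "user": ["runtime", "platform", "application", "data"]},
    "PaaS": {"provider": ["servers", "storage", "network",
                          "runtime", "platform"],
             "user": ["application", "data"]},
    "SaaS": {"provider": ["servers", "storage", "network",
                          "runtime", "platform", "application"],
             "user": ["data", "configuration"]},
}

for model, split in service_models.items():
    print(f"{model}: provider manages {', '.join(split['provider'])}; "
          f"user manages {', '.join(split['user'])}")
```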

Having attracted attention and gained popularity worldwide, cloud computing is increasingly becoming a key part of the ICT infrastructure of both smart cities and sustainable cities as an extension of distributed and grid computing, due to the prevalence of sensor technologies, storage facilities, pervasive computing infrastructures, and wireless communication networks. In particular, most of these technologies have become technically mature and financially affordable for cloud providers. Through commoditized services, low-cost open-source software, and geographic distribution, cloud computing is becoming an increasingly attractive option.

Users of cloud computing, including individuals, organizations, and government agencies, employ it, through a variety of enabled services, to store and share information; to manage, sift, and analyze databases; and to deploy Web services, including processing huge datasets for complicated scientific problems [8]. Cloud computing can also be used to process urban big data and context data in relation to smart sustainable city applications.

Overall, the key advantages provided by cloud computing technology include cost reduction, location and device independence, virtualization (sharing of servers and storage devices), multi-tenancy (sharing of costs across a large pool of a cloud provider's clients), scalability, performance, reliability, and maintenance [8]. Therefore, opting for cloud computing to perform big data analytics in the realm of smart sustainable cities (see [12] for an illustrative example of the application of cloud computing) remains thus far the most suitable option for the operation of infrastructures, applications, and services whose functioning is contingent upon how urban domains interrelate and collaborate, how efficient they are, and to what extent they are scalable as to achieving and maintaining the required level of sustainability [8].

Elements of big data

Big data analytics can be performed in the Cloud. This involves both big data Platform as a Service (PaaS) and Infrastructure as a Service (IaaS). In line with the definition of cloud computing, there are three main elements of the big data cloud (Fig. 1), which Konugurthi et al. [42] describe as follows:

Fig. 1 Big data cloud components

1. Big Data Infrastructure Services (BDIS): This layer offers core services, such as compute, storage, and data services for big data computing, as described below:

Basic storage service: Provides basic services for data delivery, organized on either physical or virtual infrastructure, and supports various operations, such as create, delete, modify, and update, with a unified data model supporting various types of data (a minimal interface sketch is given after this enumeration).

Data organization and access service: Provides management and location of data resources for all kinds of data, as well as selection, query transformation, aggregation and representation of query results, and semantic querying for selecting the data of interest.

Processing service: Provides mechanisms to access the data of interest and transfer them to compute nodes, efficient scheduling mechanisms to process the data, programming methodologies, and various tools and techniques to handle the variety of data formats.

The elements of BDIS are described below:

Computing Clouds: On-demand provisioning of compute resources, which can expand or shrink based on the analytics requirements.

Storage Clouds: Large volumes of storage offered over the network, including file systems, block storage, and object-based storage. Storage Clouds allow users to create a file system of choice and are elastically scalable. They can be accessed based on pricing models, which are usually tied to data volume or data transfer. The services provided in this regard are raw, block, and object-based storage.

Data Clouds: Similar to Storage Clouds, but instead of delivering storage space, they offer data as a service. Data Clouds offer tools and techniques to publish the data, tag the data, discover the data, and process the data of interest. Data Clouds operate on domain-specific data, leveraging the Storage Clouds to serve data as a service based on the four steps of the standard scientific model: data collection, analysis, analyzed reports, and long-term preservation of the data.

2. Big Data Platform Services (BDPS): This layer offers schedulers, query mechanisms for data retrieval, and data-intensive programming models to address several big data analytic problems.

3. Big Data Analytics Services (BDAS): Big data analytics as services over big data-cloud infrastructure.
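As a rough illustration of the basic storage service described under BDIS, the following Python sketch defines a minimal create/read/update/delete interface over a unified record model, with a trivial in-memory backend standing in for physical or virtual storage. It is a sketch of the described operations only, not any vendor's actual API.

```python
from abc import ABC, abstractmethod

class BasicStorageService(ABC):
    """Minimal CRUD interface mirroring the operations described for BDIS."""

    @abstractmethod
    def create(self, key: str, record: dict) -> None: ...

    @abstractmethod
    def read(self, key: str) -> dict: ...

    @abstractmethod
    def update(self, key: str, record: dict) -> None: ...

    @abstractmethod
    def delete(self, key: str) -> None: ...

class InMemoryStorage(BasicStorageService):
    """A trivial backend standing in for physical or virtual infrastructure."""

    def __init__(self):
        self._store = {}

    def create(self, key, record):
        self._store[key] = record

    def read(self, key):
        return self._store[key]

    def update(self, key, record):
        self._store[key].update(record)

    def delete(self, key):
        del self._store[key]

store = InMemoryStorage()
store.create("sensor-1", {"kind": "air-quality", "pm25": 12.3})
store.update("sensor-1", {"pm25": 14.1})
print(store.read("sensor-1"))
```

A real BDIS layer would add the data organization and access service (querying, aggregation, semantic selection) and the processing service (scheduling, data transfer to compute nodes) on top of this storage primitive.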

Urban operating centers and strategic planning and policy offices

The consequence of the evolving and soaring data deluge is that data-driven urbanism is changing how we know, operate, regulate, manage, plan, and govern city systems, both within particular domains and across them (e.g., [6, 8, 9, 11, 12, 36, 38, 44, 52]). Indeed, one of the implications of such urbanism is that urban systems are becoming much more tightly interlinked and integrated and urban domains highly coordinated, especially in the context of sustainability [12]. New data streams from such domains are changing how data science is used to extract and analyze these data to make a real impact.

There has recently been a marked tendency, supported by practical endeavors, to draw all the kinds of analytics associated with the city in terms of its urban domains into a single hub, supported by broader public and open data analytics. This entails creating a city-wide instrumented or centralized system that draws together data streams from many agencies (across city domains) for large-scale analytics. For example, urban operating systems explicitly link together multiple urban technologies to enable greater coordination of urban systems and domains [36], especially for the purpose of advancing sustainability [9]. Similarly, urban operations centers attempt to draw together and interlink urban big data to provide integrated and holistic views and synoptic city intelligence [36, 38] through processing, analyzing, visualizing, and monitoring the vast deluge of urban data that is used for real-time decision-making using advanced data analytics techniques. A notable example is the Centro De Operacoes Prefeitura Do Rio, an urban operations center staffed by 400 professionals who monitor the operational functioning of the city [36]. Here, the aim is to knock down silos between different urban departments and to combine each one's data to help the whole enterprise [51], a complex endeavor. Indeed, this urban operations center draws together real-time data streams from 30 agencies, including public transport and traffic, mobility, power grid, municipal and utility services, emergency services, weather feeds, information sent in by the public via smartphones, and social media networks, into a single data analytics center [33, 36]. Urban operations centers provide a powerful means for making sense of, managing, and living in the city in the here-and-now, as well as for planning the city in terms of envisioning and predicting future scenarios, which is of value for those developing and using integrated, real-time city data analytics [33]. Examples of city operating systems or control rooms include Microsoft's CityNext, Urbiotica's City Operating System, IBM's Smarter City, and PlanIT's Urban Operating System, with the latter representing Enterprise Resource Planning (ERP) systems, as intended to operate and coordinate the activities of large companies, repurposed for cities [34].

There has been a transformation in the attributes of the data being collected, stored, and organized in datasets. This transformation has been enabled by new networked, digital technologies embedded into the fabric of urban environments that underpin the drive to create smart sustainable cities. In this context, many different initiatives in collecting data from new varieties of digital access are being fashioned, such as satellite-enabled GPS in vehicles and on citizens, from social media sites, from transactions, and from access to numerous kinds of websites. Satellite remote-sensing is increasingly widely deployed, in addition to a variety of scanning technologies associated with the IoT [9]. Other technologies include digital cameras, sensors, transponders, meters, actuators, and transduction loops that monitor various phenomena and continually send data to an array of control and management systems, such as urban operations centers, centralized control rooms, intelligent transport systems, logistics management systems, energy grids, and building management systems, which can process and respond in real time to the data flow [25, 33, 36].

For example, data on traffic flow generated by sensors, cameras, transponders, and transduction loops in public transport systems can be produced in real time and fed back to a control room, where analysts can monitor traffic levels using advanced software applications and alter traffic light sequencing and road speeds to try to maintain traffic flow [36]. This relates to smart traffic lights and signals (see [9] for a descriptive account). The big data application for traffic also involves the possibility of determining travel patterns across times of the day and days of the week concerning all nodes on the network, such as bus stops, sensor locations, and junctions, as well as creating and improving models and simulations to guide future urban development (e.g., to simulate what might happen to travel patterns by closing a road on the network). For a detailed account of diverse big data applications for environmental sustainability in the context of smart sustainable cities, covering, in addition to traffic, mobility, energy, power grid, environment, buildings, infrastructure, and large-scale deployment, the reader can be directed to Bibri [9].
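As a rough illustration of the kind of real-time aggregation such a control room performs, the following Python sketch computes a sliding-window average speed per road segment from a hypothetical sensor feed and flags congested segments; the feed, the window size, and the alert rule are all invented for illustration.

```python
from collections import defaultdict, deque
from statistics import mean

WINDOW = 5  # keep the last 5 readings per road segment (illustrative)

# Hypothetical real-time feed: (road_segment, vehicle_speed_kmh) tuples.
feed = [
    ("A-12", 54), ("A-12", 31), ("B-03", 62),
    ("A-12", 18), ("B-03", 59), ("A-12", 15),
]

windows = defaultdict(lambda: deque(maxlen=WINDOW))

for segment, speed in feed:
    windows[segment].append(speed)
    avg = mean(windows[segment])
    # A crude control-room rule: flag segments whose average speed drops.
    if avg < 30:
        print(f"ALERT {segment}: average speed {avg:.1f} km/h, "
              "consider re-sequencing traffic lights")
```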

In addition, the Policy and Strategic Planning Office in New York City has sought to create a data analytic hub to weave together data from a diverse set of city agencies in order to try to manage, regulate, plan, and govern the city more efficiently and effectively [33]. Data amounting to petabytes stream through the office on a daily basis for analysis in terms of cross-referencing data, spotting patterns, and identifying and solving city problems [24, 33]. A more ambitious endeavor in this direction would be to realize joined-up planning, which entails an integration that enables system-wide effects to be understood, analyzed, tracked, and built into the very designs and responses that characterize the operations, functions, and services of the city. This involves connection, networks, and data integration with regard to urban agencies or domains.

A team of data analysts and other data operatives, aided by various data analytics software, monitors, manages, processes, analyzes, and visualizes the vast deluge of urban data, alongside data aggregated over time and huge volumes of other kinds of data of lower velocity, i.e., released on a more periodic basis, often mashing the datasets together to investigate particular aspects of city life and changes over time, and to build predictive models with respect to city management, planning, design, and development in the context of sustainability. The outcome is used for real-time decision-making and problem solving pertaining to urban operations and functions, as well as to other urban practices. In this respect, the data-driven city makes it possible to take decisions by assessing what is happening at any one time and by responding and planning appropriately with respect to sustainability. Such assessment entails interlinking diverse forms of data, which provides a deeper, more holistic and robust analysis. This therefore allows for developing, running, regulating, and planning the city on the basis of strong, rational evidence.

The implication and prospect of the above endeavors is a new form of highly responsive urbanism in which big data technologies and their systems are prefiguring and setting the urban agenda for sustainable development and influencing and controlling how city systems respond to and perform as to the goals of sustainable development.

Living labs

Smart sustainable cities revolve around the idea of a living laboratory for new technologies that can handle all the major systems a city requires and the key domains it involves. There are several descriptions and definitions of a living lab, according to different sources (e.g., [43, 47, 50]). In the context of this paper, a living lab as a research concept (e.g., [2, 19, 56]) refers to a user-centered, open-innovation ecosystem operating in the city and targeted at improving sustainability through data-driven smart solutions and approaches, integrating innovation processes and concurrent research within a partnership involving public and private organizations and institutions, as well as citizens and communities. As such, it brings together interdisciplinary and transdisciplinary scholars, researchers, experts, and practitioners to develop, deploy, implement, and test in actual urban environments new technologies and strategies for design that respond to the long-term goals of sustainability. The endeavor here spans the city scale from the physical to the social and ecological, and addresses challenges related to the built environment in the context of sustainable urban forms. In particular, the effects of such forms are compatible with the goals of sustainable development in terms of transport provision, mobility and accessibility, travel behavior, energy conservation and efficiency, pollution and waste reduction, public health and safety, economic viability, and life quality [12]. In addition, in terms of the living lab process, the act of co-creating, exploring, experimenting with, testing, and discovering breakthrough scenarios, visions, ideas, concepts, and related technological artefacts in real-life settings in terms of urban design and services can generate scientific and practical innovations of high potential for advancing sustainability. This approach allows all the involved city stakeholders to concurrently consider both the global performance of data-driven smart sustainability solutions and their potential adoption by cities on different spatial scales. In all, the concept of the living lab this paper is concerned with brings planners, scholars, researchers, scientists, experts, policymakers, and citizens together for co-designing, exploring, experiencing, and refining new urban functions, services, strategies, policies, and regulations in real-life scenarios for evaluating their potential impacts on sustainability before their implementation.

Today, new technologies are giving citizens more opportunities to participate in the functioning, design, and governance of the city, which is being increasingly leveraged in the transition towards the needed sustainable development. Changes driven by digital technologies can happen without heavy infrastructure, as they can arise from bottom-up actions instead of being necessarily determined by city governments. City governments should therefore develop knowledge-sharing platforms that get citizens engaged as much as possible and excited about smart sustainable urban transformations through open innovation and participatory research. Indeed, citizens can really be the ones to bring about such transformations, if the right platforms can be created and the installation and control of hardware can be aligned with what citizens wish their city to become and how they aspire to see it evolve in the future.

An example of a living lab is the multipurpose experimental facility built by Zero Emission Buildings (ZEB) at the Faculty of Architecture and Fine Arts, the Norwegian University of Science and Technology (NTNU). As a test facility occupied by real persons using the building as their home, it focuses on the occupants and their use of innovative building technologies. This living laboratory is used to study various technologies and design strategies in a real-world living environment:

User-centered development of new and innovative solutions: The test facility is used within a comprehensive design process focusing on user needs and experiences.

Performance testing of new and existing solutions: Exploring building performance in a context of realistic usage scenarios.

Detailed monitoring of the physical behaviour of the building and its installations, as well as the users' influence on them.

ZEB researchers within the fields of architecture, social science, materials science, building technologies, energy technologies, and indoor climate jointly study the interaction between the physical environment and the users.

This living lab and other similar initiatives related to different areas of sustainability are at the core of smart sustainable cities in terms of their specific structural components. Examples of such initiatives relate to the design concepts and typologies characterizing the compact city and the eco-city as combined landscapes and approaches, notably compactness, density, mixed land use, diversity, sustainable transport, passive solar design, and ecological design. Specifically, the multipurpose experimental facilities the proposed model is concerned with will focus on the significant themes evident in the current debates on various strategies and their effects and benefits in the context of sustainable urban forms. See Bibri [8] for a detailed list of these themes and strategies.

Innovation labs

Exploring the notion of smart sustainable cities as an innovation lab is about evolving urban intelligence functions associated with optimizing and enhancing operations, functions, services, designs, strategies, and policies across various urban domains in line with the goals of sustainable development. This can take the form of laboratories. In particular, building models of cities functioning in real time from routinely sensed data is becoming increasingly achievable and deployable (e.g., [6, 12, 33]). Although innovation labs are springing up everywhere, becoming commonplace across industries, most of such initiatives still relate to the business domain.

In the context of this paper, an innovation lab denotes a working space designed to optimize and enhance sustainability innovation in the form of urban intelligence functions. It is a unique environment devoted to, or exclusively intended for, sharing and building new and expert knowledge, creating new ideas and alignment, and developing comprehensive solutions for sustainability in response to the needs, aspirations, and goals of the city and its stakeholders and citizens. An innovation lab also serves as an environment where a team of researchers, scientists, practitioners, and professionals can gather and where design thinking for innovation can directly happen in relation to sustainability solutions, meaning it is designed to host innovation workshops. The key strengths lie in the team's multidisciplinary knowledge and skills, long-standing experience, international know-how, and access to global networks in the sphere of urban sustainability and related technologies. Further, the areas that an innovation lab for urban sustainability involves include transport and traffic, mobility, energy, power grid, environment, buildings, infrastructures, design and planning, scientific research, governance, healthcare, public safety, and big data technology. This implies that such a lab should host many interdisciplinary and transdisciplinary teams concerned with different city domains or sub-domains and the associated solutions. Applicable solutions for various areas of sustainability should be developed considering the interests of city stakeholders as well as citizens. The positioning of such a lab should make it possible to offer a platform where the many scientifically excellent research initiatives of the city in these areas can cooperate even more strongly with each other. The idea is to make a scientific contribution to the social discourse of the data-driven smart sustainable city within the framework of the innovation lab for urban sustainability. One way to support innovation within smart sustainable cities involves a set of strategic and goal-focused units, focused on specific areas that link big data technology to sustainability, tasked with creating anything from a new solution to a new method, model, or technology. Another innovation initiative, which may not be physically co-located, can involve setting up a group to collaborate with industry and academia.

Setting up an innovation lab involves significant challenges, which pertain to the many questions that the smart sustainable city stakeholders need to ask themselves in the course of creating an innovation lab for sustainability. These questions involve what roles should be filled, what types and combinations of people make the best innovators, what governance model or framework should be applied, which projects should be prioritized, how to establish synergies with the rest of city projects, what kind of infrastructure should be in place, how can ideas and models be tested, and so on.

ICT is being developed to increase the efficiency of energy systems and the delivery of public and social services, to improve transportation and mobility, and to enhance the quality of life, among others. This reflects the notion of the smart sustainable city as a laboratory for innovation or research center. For example, the Research Center on Zero Emission Neighbourhoods in Smart Cities (ZEN) at NTNU, which was established in 2017 by the Research Council of Norway, is a research center for environmentally friendly energy. More specifically, it conducts research on zero emission neighbourhoods in smart cities. Its goal is to develop solutions for future buildings and neighbourhoods with no greenhouse gas (GHG) emissions and thereby contribute to a low-carbon society. Its main objective is to develop products and processes that will lead to the realization of sustainable neighbourhoods as to their production, operation, and transformation. In line with the goals of smart sustainable cities, the ZEN research is driven by the vision that future communities and cities should ensure optimal energy use and be good places for people to live and work in. The main question the ZEN research center is concerned with, which indeed is at the core of how sustainable urban forms should be monitored, understood, and analyzed to improve, advance, and maintain their contribution to the goals of sustainable development, is how the sustainable neighbourhoods of the future should be designed, built, transformed, and managed to reduce their GHG emissions towards zero.

As with most innovation centers, the idea of the ZEN research center is to bring together like-minded people to share ideas and create the future. The partners of this center cover the entire value chain and include representatives from municipal and regional governments, property owners, developers, consultants and architects, ICT companies, contractors, energy companies, manufacturers of materials and products, and governmental organizations. NTNU is the Center's host and leads it together with SINTEF Building and Infrastructure and SINTEF Energy. In order for the ZEN research center to achieve its high ambitions, the process of strategizing and planning is done together with these partners to:

Develop neighborhood design and planning instruments while integrating science-based knowledge on GHG emissions.

Create new business models, roles, and services that address the lack of flexibility towards markets and catalyze the development of innovations for a broader public use.

Create cost effective and resource and energy efficient buildings by developing low carbon technologies and construction systems based on lifecycle design strategies.

Develop technologies and solutions for the design and operation of energy flexible neighborhoods.

Develop a decision-support tool for optimizing local energy systems and their interaction with the larger system.

Create and manage a series of neighbourhood-scale living labs, which will act as innovation hubs and a testing ground for the solutions developed in the ZEN Research Center.

Similar to ZEB, this research center and other similar initiatives related to different areas of sustainability are at the core of smart sustainable cities in terms of their specific components. Examples of such initiatives relate to the design concepts and typologies characterizing the compact city and the eco-city as combined landscapes and approaches, notably compactness, density, mixed land use, diversity, sustainable transport, passive solar design, and ecological design. Specifically, the research centers the proposed model is concerned with will focus on the significant themes evident in the current debates on various strategies and their effects and benefits in the context of sustainable urban forms. For example, cleaner modes of transportation, such as bike-sharing systems, are a potential area for research and innovation, based on mapping how and when people travel so as to know where to invest in such modes, and hence mobilize and align stakeholders.

Urban intelligence functions

In the context of this paper, the concept of urban intelligence refers to the planning, development, integration, and deployment of big data computing and underpinning technologies as an ecosystem (both physical and virtual assets) to support the interoperability between resources and technologies and hence the integration of urban systems and the coordination of urban domains to serve the city and its stakeholders and citizens with respect to sustainability dimensions. In short, urban intelligence entails the use of big data analytics and the underlying core enabling technologies to address and overcome the problems and challenges facing cities in the context of sustainability.

As an advanced form of decision support, urban intelligence functions integrate, synthesize, and analyze data flows for the purpose of improving the sustainability, efficiency, resilience, equity, and quality of life in cities. This relates in this context to exploring the notion of smart sustainable cities as innovation labs. Accordingly, the kinds of urban intelligence functions that such a city should evolve in the form of laboratories that enable its monitoring, planning, design, and development include, but are not limited to, the following:

The efficiency of energy systems.

The improvement of transportation and communication systems.

The improvement of water, power, and sewage systems.

The enhancement of urban metabolism.

The effectiveness of distribution systems.

The robustness and resilience of urban infrastructures in terms of their ability to withstand adverse conditions and to quickly recover from difficulties.

The efficiency and scalability of urban design in terms of forms, structures, and spatial organizations.

The optimal use and accessibility of facilities.

The efficiency of social and public services delivery.

The optimization of ecosystem services provision.

The dynamic, continuous, and short-term forms of planning.

Urban intelligence functions represent new conceptions of how smart sustainable cities function and utilize and combine complexity science and urban science in fashioning new powerful forms of urban simulation models and optimization and prediction methods that can generate urban structures and forms, as well as spatial organizations and scale stabilizations, that improve sustainability, efficiency, resilience, equity, and the quality of life [12]. These functions are best realized in the form of laboratories for scientific and social research and innovation directed primarily at improving, advancing, and maintaining the contribution of such cities to sustainability. Urban intelligence labs are intended to work directly with various urban entities (e.g., government agencies, public authorities, organizations, institutions, companies, communities, citizens, etc.) to acquire, process, and analyze data and then derive useful knowledge and insight in the form of applied intelligence. Their core aim is to solve tangible and significant problems of city planning and design through data-driven decision-making. This involves delivering problem-oriented research that serves the dual purpose of advancing the scientific understanding of cities in terms of sustainability and urbanization and how they intertwine with and affect one another, as well as having a direct impact on decision-making and action taking in the sense of enhancing and advancing planning practices. In this light, the sort of intelligence functions envisaged for smart sustainable cities would be woven into the fabric of institutions whose mandate is to promote, improve, and advance sustainability and create a better quality of life for the citizenry. However, the decision support systems associated with new urban intelligence functions and related simulation models and optimization and prediction methods are still in their infancy [6, 12], and much needs to be done to provide the raw material for the development and implementation of such functions across multiple urban domains.
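As a toy illustration of how a prediction method might feed such a decision support function, the following Python sketch forecasts district energy demand with a naive moving average and triggers a demand-response action when the forecast nears an assumed supply capacity; the figures, the forecasting method, and the rule are all hypothetical simplifications of the far more sophisticated simulation and optimization models discussed above.

```python
from statistics import mean

def forecast_next(history, window=3):
    """Naive moving-average forecast over the most recent readings."""
    return mean(history[-window:])

# Hypothetical hourly district energy demand (MWh).
demand_history = [41.0, 44.5, 47.2, 52.8, 58.1, 61.4]

predicted = forecast_next(demand_history)
capacity = 60.0  # assumed local supply capacity (MWh)

# A toy decision-support rule: act before predicted demand nears capacity.
if predicted > 0.9 * capacity:
    print(f"Predicted demand {predicted:.1f} MWh approaches capacity "
          f"{capacity} MWh: activate demand-response measures")
```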

In addition, with the projected advancements and innovations in big data computing and underpinning technologies, the process of building intelligence functions will shift from top-down (expert and professional organizations) to engaging citizens with experts, owing to the complexity underlying urban planning, design, development, and governance in the context of sustainability. This entails integrating databases and models from across various urban domains to support the development of this sort of integrated intelligence function, in new or refashioned ways at different levels, including the visualization of data and urban sustainability problems, the use of tools for informing and predicting the impacts of future sustainability scenarios, and the engagement of citizens and their useful, relevant recommendations, all within a holistic system that operates in accordance with sustainability requirements at various spatial and temporal scales [12].

Bibri [13] examines and discusses this evolving approach to urbanism in terms of computerized decision support and making, intelligence functions, simulation models, and optimization and prediction methods. That work also highlights the potential of the integration of these advanced technologies for facilitating the synergy between the operational functioning, planning, design, and development of smart sustainable cities for the primary purpose of improving, advancing, and maintaining their contribution to the goals of sustainable development. Indeed, at the core of smart sustainable urbanism is the interaction or cooperation of these urban practices to produce a combined effect greater than the sum of their separate effects in the context of sustainability. In this respect, urban planning determines the way urban structures and forms should be designed, which shapes urban operational functioning, which in turn drives urban development. This entails using advanced technologies, notably big data computing, as an enabler of such synergy as well as a determinant of its outcomes, owing to the underlying powerful engineering solutions as a set of novel applications and sophisticated approaches. Big data analytics and related simulation models and optimization and prediction methods might completely redefine urban problems, as well as offer entirely innovative opportunities to tackle them on the basis of new urban intelligence and planning functions, thereby doing more than merely enhancing existing urban practices.

Big data applications and related issues

Key practical and analytical applications for urban systems and domains

Smart sustainable cities are increasingly being permeated with big data technologies and their novel applications across their systems and domains [ 9 , 12 , 15 ]. The smart dimension of such cities can be seen as a new ethos added to the era of sustainable urbanism in response to the rise of ICT and the spread of urbanization as major global shifts at play today. The characteristic spirit of the era of smart sustainable urbanism is manifested in the behavior and aspiration of smart sustainable cities towards embracing what big data computing has to offer in order to bring about sustainable development and achieve sustainability. This is due to the tremendous potential of this advanced form of ICT for adding a whole dimension to sustainability in an increasingly technologized, computerized, and urbanized world. The range of emerging big data applications that can be utilized as novel analytical and practical solutions in this regard is potentially huge: as numerous as the situations where big data analytics may be of relevance to enhance some sort of decision or insight in connection with urban systems and domains. In what follows, the most common big data applications are identified and enumerated in relation to the key systems and domains of smart sustainable cities, and their sustainability effects, which are associated with the underlying functionalities pertaining to urban operations, functions, services, designs, strategies, and policies, are elucidated, as illustrated in Table 1 . However, they are by no means, nor intended to be, exhaustive. Moreover, they are synthesized and distilled from the technical literature on smart cities and smart sustainable/sustainable smart cities [ 8 , 9 , 11 , 12 ]. As to the technical processes, tools, and other details underpinning the functioning of big data applications, the interested reader can be directed to Bahga and Madisetti [ 4 ] for a detailed account from a general perspective, and to Bibri [ 16 ] for an overview focusing mainly on smart sustainable cities.

Relevant policy and technology issues

Big data analytics and related applications provide a very rich nexus of possibilities for enhancing urban operations, functions, services, strategies, and policies in terms of sustainability, efficiency, and resilience, which comes with benefits for the quality of life and well-being of citizens. These benefits are associated with smart sustainable cities. One of the core ideas underlying the development and implementation of big data applications in such cities is to harness solutions, improve services, integrate approaches, and enhance outcomes with respect to urban practices and city life [ 9 ]. One way of achieving this is through integrating urban systems, coordinating urban domains, and coupling socio-economic networks using more effective ways of monitoring, understanding, analyzing, planning, and governing modern cities. Overall, exposing big data via a socially synergistic and environmentally substantive, as well as evolvable, extensible, dynamic, scalable, and reliable, big data ecosystem in smart sustainable cities offers a wide range of opportunities with regard to sustainability dimensions and their integration. The advanced forms of ICT and the underlying computational and data analytics are as essential as interdisciplinary and transdisciplinary knowledge in sustainable smart urban development as a complex area of study.

There is huge potential for using big data analytics to address many of the pressing issues and wicked problems involved in smart sustainable cities through innovative solutions, sophisticated approaches, and new practices of decision-making informed by the high levels of intelligence enabled by the analytical outcome of the urban data deluge. Thus, this advanced form of ICT offers such cities more capabilities and resources that can allow them to realize their full potential for meaningful progress as urban development models in response to the upcoming Exabyte Age and urbanization era. In this regard, understanding the characteristics of such cities, identifying the complex sustainability and urbanization issues, and acknowledging the potential of big data analytics and its application facilitate the process of putting in place and maintaining what is technologically and socio-politically required to develop, apply, and mainstream the needed smart applications.

The main components that policymakers can explore as to how to plan and construct smart sustainable cities are: the construction of public infrastructure, the construction of a public platform for such cities, the construction of application systems based on big data analytics, the construction of innovation labs, and the construction of participatory governance models. These all involve issues and challenges that constitute future fields of study. These components are to be addressed, and relevant solutions devised, as existing plans evolve and new ones are developed in response to new urgencies requiring swift action, and as more R&D activities and efforts are directed at city development in terms of implementing cutting-edge technologies together with sustainable design concepts and typologies. This requires clear, reliable, strategic, and astute plans for city development and realization, rather than piecemeal initiatives, scattered projects, or standalone programs. In this regard, the requirements and objectives of smart sustainable cities for technological, physical, and social infrastructures must be taken into account in such plans, instead of treating each part as its own silo. This holistic approach to city development provides a clearer and more focused perspective on what is needed and what should be prioritized, and will result in more rounded solutions (well developed in all aspects, complete, and balanced for the city), rather than isolated islands of components and applications that could hardly connect with each other. Hence, the efforts poured into the development of smart sustainable cities should concentrate on creating a roadmap for success that covers several phases, including, but not limited to, the following:

Create a mission statement that can guide the development of a smart sustainable city and help fulfill its long-term goal.

Set the direction of such a city by crafting its vision and identifying its strategic and operational objectives, particularly in relation to technological innovation and sustainable development.

Establish policies, regulations, and rules, as well as determine resources and expertise required to govern big data usage and the use of other advanced forms of ICT.

Build public infrastructures and platforms based on big data analytics and its application to support innovative smart applications. This entails analyzing and assessing the current situation and determining the necessary transformations or changes to reach the desired outcomes in terms of technology and design in line with the vision of sustainability.

Identify priorities with regard to different technology and sustainability dimensions and use them to determine the most important and relevant city components and applications that would offer the greatest effects with the smallest investment possible.

Integrate city infrastructures and activities in terms of operations, functions, services, strategies, and policies with big data applications to develop a more efficient urban life and a more effective urban environment.

Continuously optimize the operating and organizing processes of urban life and the urban environment based on new advances in big data analytics and its application to identify needed improvements or changes.

Stimulate and realize new opportunities for R&D by monitoring current progress, its effects, and any potentially arising issues and challenges, thereby creating new requirements and objectives.

As the urban world increasingly becomes fully technologized and computerized on the basis of big data, the prospect has become clear that smart sustainable cities will be enabled and developed using the core enabling technologies of big data analytics, and hence related novel applications, to effectively and efficiently cater for the needs of diverse urban constituents as well as meet their aspirations in an unsustainable and rapidly urbanizing world. This might well call for funneling huge investments into the kinds of resources, infrastructures, platforms, and expertise required to support the construction and deployment of the core enabling technologies of big data analytics throughout the various design and development stages of smart sustainable cities. This strategic move is deemed essential to reap the sustainability benefits in terms of environmental and socio-economic gains in such cities. To help optimize city design and development as an endeavor and minimize its costs, it is recommended to include several important activities in the process, some of which are presented below:

Developing advanced modeling and simulation systems to help predict potential problems and forecast possible changes, with the primary purpose of mitigating or avoiding risks that might arise, as well as reducing implementation and testing costs following city design and development. Simulation models and prediction methods have great potential to modernize smart sustainable city design and development in the future [ 8 , 10 ]. Indeed, using simulations is generally cheaper, safer, and faster than studying real-time processes or conducting real-world experiments. Also, simulations allow a flexible configuration of the parameters within the different sub-processes found in the operational application field of smart sustainable cities as complex systems and dynamically changing environments (see the sketch following this list).

Learning and benefitting from previous experiences in sustainable smart urban planning and development to adopt best practices and follow successful models and avoid problematic approaches.

Benefitting from eminent experts, scholars, and researchers in the field to investigate new possibilities for more advanced technological systems suited to the objectives of smart sustainable cities with regard to sustainability.

Investigating the relevance of big data applications to such cities in this direction, an understanding that will help incorporate the right data into the right applications so as to make accurate, knowledge-driven decisions and implement them to enhance and optimize urban operations, functions, services, designs, strategies, and policies in line with the goals of sustainable development.
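
To make the simulation point in the first activity above concrete, below is a minimal, self-contained sketch in Python of the kind of parameterized what-if run such systems rest on. Everything in it is an illustrative assumption: the district size, demand figures, solar shares, and the function simulate_peak_demand are invented for this example rather than drawn from any actual city model.

```python
import random
import statistics

def simulate_peak_demand(n_households, mean_kwh, solar_fraction, n_runs=200, seed=42):
    """Toy Monte Carlo estimate of a district's evening peak electricity
    demand (kWh) under a configurable share of solar-equipped homes."""
    rng = random.Random(seed)
    peaks = []
    for _ in range(n_runs):
        total = 0.0
        for _ in range(n_households):
            demand = max(rng.gauss(mean_kwh, 0.25 * mean_kwh), 0.0)
            if rng.random() < solar_fraction:
                demand *= rng.uniform(0.5, 0.9)  # solar offsets 10-50% of the draw
            total += demand
        peaks.append(total)
    return statistics.mean(peaks), statistics.stdev(peaks)

# Re-running the model with different parameters stands in for the costly
# real-world experiment of actually rolling out more rooftop solar.
for solar_share in (0.1, 0.4):
    mean_peak, sd = simulate_peak_demand(500, mean_kwh=2.0, solar_fraction=solar_share)
    print(f"solar share {solar_share:.0%}: peak demand ~ {mean_peak:.0f} +/- {sd:.0f} kWh")
```

The point the sketch makes is that re-running the model under a different parameter configuration is essentially free, whereas the corresponding real-world experiment would be slow, costly, and potentially risky.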

A novel architecture and typology of data-driven smart sustainable cities

Specialized constituents for making up a whole

There exists a range of city architectures that essentially aim to provide the appropriate infrastructure for big data systems and applications for steering urban processes and enhancing urban practices, and whose components serve to form, compose, or make up a whole. These architectures typically influence the relationship between their components and urban constituents and entities. The architecture of the data-driven smart sustainable city illustrated in Fig. 2 entails specialized urban, technological, organizational, and institutional elements dedicated to improving, advancing, and maintaining the contribution of such a city to the goals of sustainable development. It is derived from the outcome of the above thematic analysis and the technical literature, an outcome that justifies the relationship between the different layers of the architecture. It is worth pointing out that the layered approach to this architecture is motivated by the scientific literature on smart cities, sustainable cities, and smart sustainable cities. However, a layered approach is only one among several approaches to consider in this regard.

Figure 2. An architecture of the data-driven smart sustainable city

Furthermore, underlying the idea of the data-driven smart sustainable city is the process of drawing all the kinds of analytics associated with urban life into a single hub, supported by broader public and open data analytics. This involves creating a city-wide instrumented or centralized system that draws together data streams from many agencies (across city domains) for large-scale analytics and then directs them to different centers and labs. Urban operating systems, as part of the cloud computing infrastructure, explicitly link together multiple urban technologies to enable greater coordination of urban systems and domains. Urban operations centers attempt to draw together and interlink urban big data to provide integrated and holistic views and synoptic city intelligence through processing, analyzing, visualizing, and monitoring the vast deluge of urban data that is used for real-time decision-making pertaining to sustainability using big data ecosystems. Strategic planning and policy centers serve as data analytic hubs that weave together data from many diverse agencies to control, manage, regulate, and govern urban life more efficiently and effectively in relation to sustainability. This entails an integration that enables system-wide effects to be understood, analyzed, tracked, and built into the very designs and responses that characterize urban operations, functions, and services. As far as research centers and innovation labs are concerned, they are associated with research and innovation for the purpose of developing and disseminating urban intelligence functions.
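
As a rough illustration of the hub pattern just described (many agency feeds merged into one stream and then routed onward to centers and labs), here is a minimal asynchronous sketch in Python. The agency names, message format, and counts are invented for the example; a production urban operating system would sit on dedicated streaming infrastructure rather than an in-process queue.

```python
import asyncio
import json
import random

async def agency_feed(name: str, hub: asyncio.Queue, n_readings: int = 5) -> None:
    """One agency (transport, energy, ...) pushing readings into the shared hub."""
    for seq in range(n_readings):
        await asyncio.sleep(random.uniform(0.01, 0.05))
        await hub.put({"agency": name, "seq": seq, "value": round(random.random(), 3)})

async def operations_center(hub: asyncio.Queue, n_expected: int) -> dict:
    """Central hub: merge every feed into one stream, then route by agency."""
    dashboards: dict[str, list[float]] = {}
    for _ in range(n_expected):
        msg = await hub.get()
        dashboards.setdefault(msg["agency"], []).append(msg["value"])
    return dashboards

async def main() -> None:
    hub: asyncio.Queue = asyncio.Queue()
    agencies = ("transport", "energy", "waste")
    feeds = [agency_feed(a, hub) for a in agencies]
    center = operations_center(hub, n_expected=5 * len(agencies))
    dashboards, *_ = await asyncio.gather(center, *feeds)
    print(json.dumps(dashboards, indent=2))

asyncio.run(main())
```

The design choice the sketch mirrors is decoupling: producers (the agencies) only know about the shared hub, while the operations center decides how the merged stream is partitioned across dashboards and downstream consumers.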

Typological dimensions and functions

As a leading paradigm of, and holistic approach to, urbanism, data-driven smart sustainable cities represent a class of cities that are composed of and monitored by ICT of ubiquitous and pervasive computing and underpinned by big data technology and its novel applications, which aim at harnessing physical, economic, and social infrastructures as well as leveraging knowledge and conserving resources through enhanced and optimized operational functioning, planning, design, development, and governance. This occurs in ways that ensure environmental integration, social justice, and economic regeneration as fundamental goals of sustainable development towards achieving sustainability. Smart sustainable cities as an integrated approach to urbanism take multiple forms of combining the strengths of sustainable cities and smart cities, depending on how the concept of smart sustainable cities is conceptualized and operationalized, as well as on the multiple processes of, and pathways towards achieving, their status. As a corollary of this, there is a host of opportunities yet to be explored towards new approaches to smart sustainable urbanism, which will result in a multiplicity of models of smart sustainable cities in the future. Below is an exemplar of a model of data-driven smart sustainable cities (Table 2 ) encompassing nine distinct dimensions and functions. This model also shows how various urban systems and domains might connect up as shaped by the use of big data technology and its novel applications.

Challenges and concerns

While there is a growing consensus among urban scholars and planners and urban and data scientists that big data analytics and its application will be a salient factor in the operational functioning, management, planning, design, and development of smart sustainable cities, there are still significant scientific and intellectual challenges, as well as concerns, that need to be addressed and overcome in order to build such cities on the basis of big data computing and its underpinning technologies, and then to accomplish the desired outcomes related to sustainability. Such challenges and issues pose interesting and complex research questions, and constitute fertile areas of investigation awaiting interdisciplinary and transdisciplinary teams of scholars, academics, scientists, and experts working in the field of smart sustainable urbanism.

The rising demand for big data analytics and its core enabling technologies, coupled with the growing awareness of the associated potential to transform the way the city can function in the context of sustainability, comes with major challenges and concerns related to the design, engineering, development, implementation, and maintenance of data-driven applications in smart sustainable cities. The challenges are mostly computational, analytical, and technical in nature, and sometimes logistic in terms of the detailed organization and implementation of the complex technical operations involving the installation and deployment of the big data ecosystem and its components as part of the ICT infrastructure of such cities. They include, but are not limited to, the challenges compiled in Table 3 from Bibri [ 12 ].

For a detailed discussion of the above challenges as well as a number of open issues, the interested reader can be directed to Bibri [ 8 , 11 ]. Of particular importance to highlight is that smart sustainable urbanism, urban science, and big data computing and its underpinning technologies give rise to a number of potential privacy harms. Kitchin [ 36 ] addresses five reasons for this, each of which raises significant challenges to existing approaches to protecting privacy (privacy laws and fair information practice principles), namely:

1. Datafication, dataveillance and geosurveillance.

2. Inferencing and predictive privacy harms.

3. Anonymization and re-identification.

4. Obfuscation and reduced control.

5. Notice and consent is an empty exercise or is absent.

Adding to the above primarily technological challenges are the financial, organizational, institutional, social, political, regulatory, and ethical ones, which are associated with the implementation, retention, and dissemination of big data across the domains of smart sustainable cities [ 11 ]. As an example, controversies over the benefits of big data analytics and its application involve limited access and related digital divides, along with other ethical concerns about accessibility (see Bibri [ 11 ] for an overview). Kitchin [ 33 ] provides a critical reflection on the implications of big data and smart urbanism, examining five emerging concerns, namely:

1. The politics of big urban data.

2. Technocratic governance and city development.

3. Corporatization of city governance and technological lock-ins.

4. Buggy, brittle and hackable cities.

5. The panoptic city.

Building smart sustainable cities based on big data computing is of strategic value for solving many of the complex challenges and pressing issues of sustainability and urbanization. Many sustainable cities and smart cities across the globe have already started to exploit the potential of big data applications in relation to diverse urban systems and domains. We stand at the threshold of a new era in which big data science and analytics is drastically changing the way smart sustainable cities are studied, understood, planned, designed, developed, and governed. The ultimate goal is to improve, advance, and maintain their contribution to sustainability by employing more effective and innovative ways to monitor, understand, probe, and plan them. However, there are currently numerous challenges and concerns that need to be addressed and overcome in this new area of science and technology in relation to smart sustainable urbanism if the desired outcomes are to be achieved.

This paper examined how data-driven smart sustainable cities are being instrumented, datafied, and computerized so as to improve, advance, and maintain their contribution to the goals of sustainable development through enhanced practices. In this respect, different topics were identified and discussed, namely the integration of data-driven smart cities and sustainable cities; digital instrumentation; living labs and innovation labs; urban intelligence functions; urban operations centers and strategic planning and policy offices; data types and the role of open data; and data-driven urbanism and urban science and how they relate to one another from a scientific and scholarly perspective. The essence of the idea of data-driven smart sustainable cities revolves around the need to harness and leverage big data technologies that have hitherto been mostly associated with smart cities, but that have clear synergies with the functioning of sustainable cities and tremendous potential for enhancing their performance, and that need to be steered or directed for this purpose so that many new opportunities can be enabled and realized. From a societal standpoint, big data computing and its technological applications are socio-culturally constructed to have a determinant role in instigating major social changes on multiple scales, owing to the transformational power residing or embodied in their disruptive, synergistic, and substantive effects on different forms of social organization.

Also, this paper highlighted and substantiated the real potential of big data technology for improving, advancing, and maintaining the contribution of smart sustainable cities to the goals of sustainable development by identifying, synthesizing, distilling, and enumerating the key practical and analytical applications of this advanced technology in relation to multiple urban systems and domains with respect to urban operations, functions, services, designs, strategies, and policies. The most common data-driven applications identified include: transport and traffic, mobility, energy, power grid, environment, buildings, infrastructures, urban planning, urban design, academic and scientific research, governance, healthcare, education, and public safety. The potential of big data technology lies in enabling smart sustainable cities to harness and leverage their informational landscape in effectively understanding, monitoring, probing, and planning their systems and environments in ways that enable them to reach the optimal level of sustainability. To put it differently, the use of big data analytics is projected to play a significant role in realizing the key characteristic features of such cities, namely the efficiency of operations and functions, the prudent utilization of natural/environmental resources, the intelligent management of infrastructures and facilities, the improvement of the quality of life and well-being of citizens, and the enhancement of mobility and accessibility.

Moreover, this paper proposed, illustrated, and described an architecture and typology of data-driven smart sustainable cities. Their unique features lie in the new ingredients they bring, the way these are integrated, and how they affect and shape the relationships between the urban entities specific to smart sustainable cities in light of the use of big data technology and its applicability to sustainability. The proposed architecture and typology were developed in response to the need for improving, advancing, and maintaining the contribution of such cities to the goals of sustainable development.

Concerning the value of this work, the outcome will help strategic city stakeholders understand what they can do, and where to invest more, to advance smart sustainable urbanism on the basis of data-driven solutions and approaches, and it will give policymakers an opportunity to identify areas for further improvement while leveraging areas of strength with regard to the future form of such urbanism. In addition, it will enable researchers and scholars to direct their future work to the emerging paradigm of data-driven smart sustainable urbanism, and practitioners and experts to identify common problems and potential ways to solve them, all as part of future research and practical endeavors, respectively.

Lastly, this paper provides grounding for further debate over the disruptive, synergistic, and transformational effects of big data computing and its underpinning technologies on the forms of operational functioning, management, planning, design, development, and governance of smart sustainable cities in the future. It also presents a basis for stimulating more in-depth research, in the form of both qualitative analyses and quantitative investigations, focused on establishing, uncovering, substantiating, and/or challenging the assumptions underlying the relevance of big data technology and its advancements to accelerating sustainable development.

Availability of data and materials

Not applicable.

Abbreviations

AMR: Automatic Meter Reading

BDAS: Big Data Analytics Services

BDIS: Big Data Infrastructure Services

BDPS: Big Data Platform Services

GPS: Global Positioning System

ERP: Enterprise Resource Planning

IaaS: Infrastructure-as-a-Service

NTNU: Norwegian University of Science and Technology

PaaS: Platform-as-a-Service

SDG: Sustainable Development Goal

SaaS: Software-as-a-Service

UN: United Nations

ZEB: Zero Emission Buildings

ZEN: Zero Emission Neighborhoods

Ahvenniemi H, Huovila A, Pinto-Seppä I, Airaksinen M. What are the differences between sustainable and smart cities? Cities. 2017;60:234–45.

Almirall E, Wareham J. Living labs: arbiters of mid- and ground-level innovation. Technol Anal Strateg Manage. 2011;23(1):87–102.

Al Nuaimi E, Al Neyadi H, Nader M, Al-Jaroodi J. Applications of big data to smart cities. J Internet Serv Appl. 2015;6(25):1–15.

Bahga A, Madisetti V. Big data science and analytics: a hands-on approach. VPT; 2016.

Batty M. Big data, smart cities and city planning. Dialogues Hum Geogr. 2013;3(3):274–9.

Batty M, Axhausen KW, Giannotti F, Pozdnoukhov A, Bazzani A, Wachowicz M, Ouzounis G, Portugali Y. Smart cities of the future. Eur Phys J. 2012;214:481–518.

Bettencourt LMA. The uses of big data in cities. Santa Fe: Santa Fe Institute; 2014.

Bibri SE. Smart sustainable cities of the future: the untapped potential of big data analytics and context aware computing for advancing sustainability. Berlin: Springer; 2018.

Bibri SE. The IoT for smart sustainable cities of the future: an analytical framework for sensor-based big data applications for environmental sustainability. Sustain Cities Soc. 2018;38:230–53.

Bibri SE. A foundational framework for smart sustainable city development: theoretical, disciplinary, and discursive dimensions and their synergies. Sustain Cities Soc. 2018;38:758–94.

Bibri SE. On the sustainability of smart and smarter cities in the era of big data: an interdisciplinary and transdisciplinary literature review. J Big Data. 2019;6:25. https://doi.org/10.1186/s40537-019-0182-7 .

Bibri SE. Big data science and analytics for smart sustainable urbanism: unprecedented paradigmatic shifts and practical advancements. Berlin: Springer; 2019. https://doi.org/10.1007/978-3-030-17312-8 .

Bibri SE. Data-driven smart sustainable urbanism: intelligence functions, simulation models, and complexity sciences. Augmented Human Res. 2019 (in press).

Bibri SE, Krogstie J. The core enabling technologies of big data analytics and context-aware computing for smart sustainable cities: a review and synthesis. J Big Data. 2017;4(38):1–50.

Bibri SE, Krogstie J. Smart sustainable cities of the future: an extensive interdisciplinary literature review. Sustain Cities Soc. 2017;31:183–212.

Bibri SE, Krogstie J. ICT of the new wave of computing for sustainable urban forms: their big data and context-aware augmented typologies and design concepts. Sustain Cities Soc. 2017;32:449–74.

Bibri SE, Krogstie J. The big data deluge for transforming the knowledge of smart sustainable cities: a data mining framework for urban analytics. In: Proceedings of the 3rd annual international conference on smart city applications, ACM, Oct 11–12, 2018, Tetouan, Morocco.

Burton E. Measuring urban compactness in UK towns and cities. Environ Plann. 2002;29:219–50.

Chesbrough HW. Open innovation: the new imperative for creating and profiting from technology. Boston: Harvard Business School Press; 2003.

Chourabi H, Nam T, Walker S, Gil-Garcia JR, Mellouli S, Nahon K, Pardo TA, Scholl HJ. Understanding smart cities: an integrative framework. In: The 45th Hawaii international conference on system sciences (HICSS), Maui, HI; 2012. p. 2289–97.

Cukier K, Mayer-Schoenberger V. The rise of big data. Foreign Affairs. 2013;(May/June):28–40.

Dempsey N. Revisiting the compact city? Built Environ. 2010;36(1):5–8.

Dodge M, Kitchin R. The automatic management of drivers and driving spaces. Geoforum. 2007;38(2):264–75.

Feuer A. The Mayor’s geek squad. New York Times, March 23rd 2013. http://www.nytimes.com/2013/03/24/nyregion/mayor-bloombergs-geek-squad.html . Accessed 9 May 2013.

Graham S, Marvin S. Splintering urbanism: networked infrastructures, technological mobilities and the urban condition. New York: Routledge; 2001.

Hashem IAT, Chang V, Anuar NB, Adewole K, Yaqoob I, Gani A, Ahmed E, Chiroma H. The role of big data in smart city. Int J Inf Manage. 2016;36:748–58.

Hollands RG. Will the real smart city please stand up? City. 2008;12(3):303–20.

Jabareen YR. Sustainable urban forms: their typologies, models, and concepts. J Plann Educ Res. 2006;26:38–52.

Jenks M, Dempsey N. Future forms and design for sustainable cities. Oxford: Elsevier; 2005.

Jenks M, Jones C, editors. Dimensions of the sustainable city, vol 2. London: Springer; 2010.

Khan Z, Anjum A, Soomro K, Tahir MA. Towards cloud based big data analytics for smart future cities. J Cloud Comput Adv Syst Appl. 2015;4:2.

Khan Z, Pervaiz Z, Abbasi AG. Towards a secure service provisioning framework in a smart city environment. Future Gener Comput Syst. 2017;77:112–35.

Kitchin R. The real-time city? Big data and smart urbanism. Geo J. 2014;79:1–14.

Kitchin R. Data-driven, networked urbanism. 2015. https://doi.org/10.2139/ssrn.2641802 .

Kitchin R. Making sense of smart cities: addressing present shortcomings. Camb J Reg Econ Soc. 2015;8(1):131–6. https://doi.org/10.1093/cjres/rsu027 .

Kitchin R. The ethics of smart cities and urban science. Philos Trans R Soc A. 2016;374:20160115.

Kitchin R. Reframing, reimagining and remaking smart cities; 2016. (The Programmable City Working Paper 20).

Kitchin R, Lauriault TP, McArdle G. Knowing and governing cities through urban indicators, city benchmarking & real-time dashboards. Reg Stud Reg Sci. 2015;2:1–28.

Kitchin R, Dodge M. Code/space: software and everyday life. Cambridge: MIT Press; 2011.

Kitchin R, Coletta C, Evans L, Heaphy L, MacDonncha D. Smart cities, urban technocrats, epistemic communities and advocacy coalitions (The programmable city working paper 26); 2017. http://progcity.maynoothuniversity.ie/2017/03/new-paper-smart-cities-urban-technocrats-epistemic-communities-and-advocacy-coalitions/ .

Kourtit K, Nijkamp P, Arribas-Bel D. Smart cities perspective—a comparative European study by means of self-organizing maps. Innovation. 2012;25(2):229–46.

Konugurthi PK, Agarwal K, Chillarige RR, Buyya R. The anatomy of big data computing. Softw Pract Exp SPE. 2016;46(1):79–105.

Kusiak A. Innovation: the living laboratory perspective. Comput Aided Des Appl. 2007;4(6):863–76.

Marvin S, Luque-Ayala A, McFarlane C, editors. Smart urbanism: Utopian vision or false dawn? London: Routledge; 2016.

Miles MB, Huberman AM. Qualitative data analysis: an expanded sourcebook. 2nd ed. Newbury Park: Sage; 1994.

Mishler E. Validation in inquiry-guided research: the role of exemplars in narrative studies. Harvard Educ Rev. 1990;60:415–41.

Niitamo V-P, Kulkki S, Eriksson M, Hribernik KA. State-of-the-art and good practice in the field of living labs. In: Proceedings of the 12th international conference on concurrent enterprising: innovative products and services through collaborative networks, Milan, Italy; 2006. p. 349–57.

Rapoport E, Vernay AL. Defining the eco-city: a discursive approach. Paper presented at the management and innovation for a sustainable built environment conference, International eco-cities initiative, Amsterdam, The Netherlands; 2011. p. 1–15.

Rathore MM, Paul A, Hong W-H, Seo HC, Awan I, Saeed S. Exploiting IoT and big data analytics: defining smart digital city using real-time urban data. Sustain Cities Soc. 2018;40:600–10.

Schumacher J, Feurstein K. Living labs: a new multi-stakeholder approach to user integration. Presented at the 3rd international conference on interoperability of enterprise systems and applications (I-ESA'07), Funchal, Madeira, Portugal; 2007.

Singer N. Mission control, built for cities: IBM takes ‘smarter cities’ concept to Rio de Janeiro. New York Times, 3 March 2012. http://www.nytimes.com/2012/03/04/business/ibm-takes-smarter-cities-concept-to-rio-de-janeiro.html . Accessed 9 May 2013.

Townsend A. Smart cities—big data, civic hackers and the quest for a new utopia. New York: Norton & Company; 2013.

United Nations. Transforming our world: the 2030 agenda for sustainable development, New York, NY. 2015. https://sustainabledevelopment.un.org/post2015/transformingourworld .

United Nations. Habitat III Issue Papers, 21—Smart cities (V2.0), New York, NY. 2015. https://collaboration.worldbank.org/docs/DOC-20778 . Accessed 2 May 2017.

United Nations. Big Data and the 2030 agenda for sustainable development. Prepared by A. Maaroof. 2015. http://www.unescap.org/events/call-participants-big-data-and-2030-agendasustainable-development-achieving-development .

Von Hippel E. Lead users: a source of novel product concepts. Manage Sci. 1986;32:791–805.

Acknowledgements

The study is an integral part of a Ph.D. research endeavor being undertaken at NTNU.

Author information

Authors and affiliations

Department of Computer Science, The Norwegian University of Science and Technology, Saelands veie 9, NO-7491, Trondheim, Norway

Simon Elias Bibri

Department of Architecture and Planning, The Norwegian University of Science and Technology, Alfred Getz vei 3, Sentralbygg 1, 5th floor, NO-7491, Trondheim, Norway

Contributions

The author read and approved the final manuscript.

Author's Information

Simon Elias Bibri is a Ph.D. scholar in the area of data-driven smart sustainable cities of the future and an Assistant Professor at the Norwegian University of Science and Technology (NTNU), Department of Computer Science and Department of Urban Planning and Design, Trondheim, Norway. He holds the following degrees:

1. Bachelor of Science in computer engineering with a major in software development and computer networks

2. Master of Science—research focused—in computer science with a major in Ambient Intelligence

3. Master of Science in computer science with a major in informatics

4. Master of Science in computer and systems sciences with a major in decision support and risk analysis

5. Master of Science in entrepreneurship and innovation with a major in new venture creation

6. Master of Science in strategic leadership toward sustainability

7. Master of Science in sustainable urban development

8. Master of Science in environmental science with a major in ecotechnology and sustainable development

9. Master of Social Science with a major in business administration (MBA)

10. Master of Arts in communication and media for social change

11. Postgraduate degree (one year of Master courses) in management and economics

12. PhD in computer science and urban planning with a major in data-driven smart sustainable cities of the future

Bibri has earned all his Master’s degrees from different Swedish universities, namely Lund University, West University, Blekinge Institute of Technology, Malmö University, Stockholm University, and Mid-Sweden University.

Before embarking on his long academic journey, Bibri had served as a sustainability and ICT strategist, business engineer, project manager, researcher, and consultant. His current research interests include smart sustainable cities, sustainable cities, smart cities, urban science, sustainability science, complexity science, data-intensive science, data-driven and scientific urbanism, as well as big data computing and its core enabling and driving technologies, namely sensor technologies, data processing platforms, big data applications, cloud and fog computing infrastructures, and wireless communication networks.

Bibri has authored four academic books whose titles are as follows:

1. The Human Face of Ambient Intelligence: Cognitive, Emotional, Affective, Behavioral and Conversational Aspects (525 pages), Springer, 07/2015.

2. The Shaping of Ambient Intelligence and the Internet of Things: Historico-epistemic, Socio-cultural, Politico-institutional and Eco-environmental Dimensions (301 pages), Springer, 11/2015.

3. Smart Sustainable Cities of the Future: The Untapped Potential of Big Data Analytics and Context-Aware Computing for Advancing Sustainability (660 pages), Springer, 03/2018.

4. Big Data Science and Analytics for Smart Sustainable Urbanism: Unprecedented Paradigmatic Shifts and Practical Advancements (505 pages), Springer 06/2019.

Corresponding author

Correspondence to Simon Elias Bibri.

Ethics declarations

Competing interests

The author declares no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Bibri, S.E. The anatomy of the data-driven smart sustainable city: instrumentation, datafication, computerization and related applications. J Big Data 6, 59 (2019). https://doi.org/10.1186/s40537-019-0221-4

Received : 21 March 2019

Accepted : 17 June 2019

Published : 04 July 2019

DOI : https://doi.org/10.1186/s40537-019-0221-4

Keywords

  • Data-driven smart sustainable urbanism
  • Big data analytics
  • Big data applications
  • Urban science
  • Urban sustainability
  • Sustainable development
  • Innovation labs
  • Urban operation centers

Grad Coach

Research Topics & Ideas: Data Science

50 Topic Ideas To Kickstart Your Research Project

If you’re just starting out exploring data science-related topics for your dissertation, thesis or research project, you’ve come to the right place. In this post, we’ll help kickstart your research by providing a hearty list of data science and analytics-related research ideas , including examples from recent studies.

PS – This is just the start…

We know it’s exciting to run through a list of research topics, but please keep in mind that this list is just a starting point . These topic ideas provided here are intentionally broad and generic , so keep in mind that you will need to develop them further. Nevertheless, they should inspire some ideas for your project.

To develop a suitable research topic, you'll need to identify a clear and convincing research gap, and a viable plan to fill that gap. If this sounds foreign to you, check out our free research topic webinar that explores how to find and refine a high-quality research topic, from scratch. Alternatively, consider our 1-on-1 coaching service.

Data Science-Related Research Topics

  • Developing machine learning models for real-time fraud detection in online transactions.
  • The use of big data analytics in predicting and managing urban traffic flow.
  • Investigating the effectiveness of data mining techniques in identifying early signs of mental health issues from social media usage.
  • The application of predictive analytics in personalizing cancer treatment plans.
  • Analyzing consumer behavior through big data to enhance retail marketing strategies.
  • The role of data science in optimizing renewable energy generation from wind farms.
  • Developing natural language processing algorithms for real-time news aggregation and summarization.
  • The application of big data in monitoring and predicting epidemic outbreaks.
  • Investigating the use of machine learning in automating credit scoring for microfinance.
  • The role of data analytics in improving patient care in telemedicine.
  • Developing AI-driven models for predictive maintenance in the manufacturing industry.
  • The use of big data analytics in enhancing cybersecurity threat intelligence.
  • Investigating the impact of sentiment analysis on brand reputation management.
  • The application of data science in optimizing logistics and supply chain operations.
  • Developing deep learning techniques for image recognition in medical diagnostics.
  • The role of big data in analyzing climate change impacts on agricultural productivity.
  • Investigating the use of data analytics in optimizing energy consumption in smart buildings.
  • The application of machine learning in detecting plagiarism in academic works.
  • Analyzing social media data for trends in political opinion and electoral predictions.
  • The role of big data in enhancing sports performance analytics.
  • Developing data-driven strategies for effective water resource management.
  • The use of big data in improving customer experience in the banking sector.
  • Investigating the application of data science in fraud detection in insurance claims.
  • The role of predictive analytics in financial market risk assessment.
  • Developing AI models for early detection of network vulnerabilities.

Data Science Research Ideas (Continued)

  • The application of big data in public transportation systems for route optimization.
  • Investigating the impact of big data analytics on e-commerce recommendation systems.
  • The use of data mining techniques in understanding consumer preferences in the entertainment industry.
  • Developing predictive models for real estate pricing and market trends.
  • The role of big data in tracking and managing environmental pollution.
  • Investigating the use of data analytics in improving airline operational efficiency.
  • The application of machine learning in optimizing pharmaceutical drug discovery.
  • Analyzing online customer reviews to inform product development in the tech industry.
  • The role of data science in crime prediction and prevention strategies.
  • Developing models for analyzing financial time series data for investment strategies.
  • The use of big data in assessing the impact of educational policies on student performance.
  • Investigating the effectiveness of data visualization techniques in business reporting.
  • The application of data analytics in human resource management and talent acquisition.
  • Developing algorithms for anomaly detection in network traffic data.
  • The role of machine learning in enhancing personalized online learning experiences.
  • Investigating the use of big data in urban planning and smart city development.
  • The application of predictive analytics in weather forecasting and disaster management.
  • Analyzing consumer data to drive innovations in the automotive industry.
  • The role of data science in optimizing content delivery networks for streaming services.
  • Developing machine learning models for automated text classification in legal documents.
  • The use of big data in tracking global supply chain disruptions.
  • Investigating the application of data analytics in personalized nutrition and fitness.
  • The role of big data in enhancing the accuracy of geological surveying for natural resource exploration.
  • Developing predictive models for customer churn in the telecommunications industry.
  • The application of data science in optimizing advertisement placement and reach.

Recent Data Science-Related Studies

While the ideas we’ve presented above are a decent starting point for finding a research topic, they are fairly generic and non-specific. So, it helps to look at actual studies in the data science and analytics space to see how this all comes together in practice.

Below, we’ve included a selection of recent studies to help refine your thinking. These are actual studies,  so they can provide some useful insight as to what a research topic looks like in practice.

  • Data Science in Healthcare: COVID-19 and Beyond (Hulsen, 2022)
  • Auto-ML Web-application for Automated Machine Learning Algorithm Training and evaluation (Mukherjee & Rao, 2022)
  • Survey on Statistics and ML in Data Science and Effect in Businesses (Reddy et al., 2022)
  • Visualization in Data Science VDS @ KDD 2022 (Plant et al., 2022)
  • An Essay on How Data Science Can Strengthen Business (Santos, 2023)
  • A Deep study of Data science related problems, application and machine learning algorithms utilized in Data science (Ranjani et al., 2022)
  • You Teach WHAT in Your Data Science Course?!? (Posner & Kerby-Helm, 2022)
  • Statistical Analysis for the Traffic Police Activity: Nashville, Tennessee, USA (Tufail & Gul, 2022)
  • Data Management and Visual Information Processing in Financial Organization using Machine Learning (Balamurugan et al., 2022)
  • A Proposal of an Interactive Web Application Tool QuickViz: To Automate Exploratory Data Analysis (Pitroda, 2022)
  • Applications of Data Science in Respective Engineering Domains (Rasool & Chaudhary, 2022)
  • Jupyter Notebooks for Introducing Data Science to Novice Users (Fruchart et al., 2022)
  • Towards a Systematic Review of Data Science Programs: Themes, Courses, and Ethics (Nellore & Zimmer, 2022)
  • Application of data science and bioinformatics in healthcare technologies (Veeranki & Varshney, 2022)
  • TAPS Responsibility Matrix: A tool for responsible data science by design (Urovi et al., 2023)
  • Data Detectives: A Data Science Program for Middle Grade Learners (Thompson & Irgens, 2022)
  • MACHINE LEARNING FOR NON-MAJORS: A WHITE BOX APPROACH (Mike & Hazzan, 2022)
  • COMPONENTS OF DATA SCIENCE AND ITS APPLICATIONS (Paul et al., 2022)
  • Analysis on the Application of Data Science in Business Analytics (Wang, 2022)

As you can see, these research topics are a lot more focused than the generic topic ideas we presented earlier. So, for you to develop a high-quality research topic, you'll need to get specific and laser-focused on a particular context with clearly defined variables of interest. Beyond that, there are several other important factors to consider when crafting your research topic.

Get 1-On-1 Help

If you’re still unsure about how to find a quality research topic, check out our Research Topic Kickstarter service, which is the perfect starting point for developing a unique, well-justified research topic.


Graduate Thesis or Dissertation

Data-driven model development and identification of dynamical systems

In recent years, data-driven model discovery has become increasingly popular due to rapid advances in computational power and in data processing and storage procedures. This has fostered the development of new algorithms to identify complex systems from data. However, the performance and robustness of present techniques deteriorate significantly when the data is contaminated with noise. This dissertation considers modern sparse regression techniques to robustly recover governing equations of nonlinear dynamical systems from noisy state measurements. Organized into three main chapters, it investigates convex ℓ1-regularized least squares methods, denoising strategies to enhance the performance and accuracy of identification algorithms, and non-convex optimization procedures for dynamical system identification. We begin by exploring an iteratively reweighted version of ℓ1-regularized least squares to mitigate noise effects on measurements and conclude that a reweighted approach enhances the accuracy of the dynamical identification process. We also propose a method to recover dynamical constraints given by implicit functions of the state variables. Next, we compare and assess local and global measurement denoising strategies, as well as model selection techniques, as a pre-processing step to improve the robustness and performance of sparse identification algorithms. We empirically show that global methods outperform local methods, and that Pareto curves generally yield better regularization parameters than generalized cross-validation. Finally, we present a promising non-convex formulation and suitable optimization algorithms for sparse dynamical system identification that avoid errors arising from numerical differentiation of noisy data. We conclude by discussing potential improvements for non-convex dynamical system identification approaches and provide further research directions.
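
To make the abstract's core idea concrete, here is a minimal, self-contained Python sketch of sparse identification via ℓ1-regularized least squares over a library of candidate terms. The toy two-state system, the polynomial library, the regularization value, and the ISTA solver are all illustrative assumptions for this example, not the dissertation's actual methods, data, or results.

```python
import numpy as np

# Simulate noisy measurements of an invented sparse system:
#   dx/dt = -0.1*x + 2.0*y,   dy/dt = -2.0*x - 0.1*y
dt, n_steps = 0.01, 1000
X = np.zeros((n_steps, 2))
X[0] = [2.0, 0.0]
for k in range(n_steps - 1):  # forward-Euler integration of the true dynamics
    x, y = X[k]
    X[k + 1] = X[k] + dt * np.array([-0.1 * x + 2.0 * y, -2.0 * x - 0.1 * y])
X += 1e-3 * np.random.default_rng(0).standard_normal(X.shape)  # measurement noise

# Numerical differentiation of noisy data: exactly the error source the
# dissertation's non-convex formulation is designed to avoid.
dXdt = np.gradient(X, dt, axis=0)

# Candidate library Theta(X) of polynomial terms.
x, y = X[:, 0], X[:, 1]
Theta = np.column_stack([np.ones_like(x), x, y, x * x, x * y, y * y])
names = ["1", "x", "y", "x^2", "xy", "y^2"]

def lasso_ista(A, b, lam, n_iter=3000):
    """Minimize 0.5*||A w - b||^2 + lam*||w||_1 via iterative soft-thresholding."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the gradient
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        w = w - step * (A.T @ (A @ w - b))                         # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - lam * step, 0.0)   # soft threshold
    return w

for i, lhs in enumerate(["dx/dt", "dy/dt"]):
    w = lasso_ista(Theta, dXdt[:, i], lam=1.0)
    rhs = " + ".join(f"{c:+.2f}*{n}" for c, n in zip(w, names) if abs(c) > 1e-2)
    print(f"{lhs} ~ {rhs}")
```

An iteratively reweighted variant of the kind studied in the dissertation would wrap lasso_ista in an outer loop, re-solving with per-coefficient weights that shrink as the corresponding coefficients grow; likewise, denoising X before differentiation is precisely the pre-processing step whose local and global strategies the abstract compares.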

  • Cortiella, Alexandre
  • Aerospace Engineering
  • Doostan, Alireza
  • Scheeres, Daniel
  • Becker, Stephen
  • Maute, Kurt
  • Hussein, Mahmoud
  • University of Colorado Boulder
  • Machine Learning
  • Applied mathematics
  • Engineering
  • Aerospace engineering
  • Sparse Regression
  • System Identification
  • Data-driven modeling
  • Dissertation
  • In Copyright
  • English [eng]

Machine Learning - CMU

PhD Dissertations

[all are .pdf files].

Learning Models that Match Jacob Tyo, 2024

Improving Human Integration across the Machine Learning Pipeline Charvi Rastogi, 2024

Reliable and Practical Machine Learning for Dynamic Healthcare Settings Helen Zhou, 2023

Automatic customization of large-scale spiking network models to neuronal population activity (unavailable) Shenghao Wu, 2023

Estimation of BV^k functions from scattered data (unavailable) Addison J. Hu, 2023

Rethinking object categorization in computer vision (unavailable) Jayanth Koushik, 2023

Advances in Statistical Gene Networks Jinjin Tian, 2023

Post-hoc calibration without distributional assumptions Chirag Gupta, 2023

The Role of Noise, Proxies, and Dynamics in Algorithmic Fairness Nil-Jana Akpinar, 2023

Collaborative learning by leveraging siloed data Sebastian Caldas, 2023

Modeling Epidemiological Time Series Aaron Rumack, 2023

Human-Centered Machine Learning: A Statistical and Algorithmic Perspective Leqi Liu, 2023

Uncertainty Quantification under Distribution Shifts Aleksandr Podkopaev, 2023

Probabilistic Reinforcement Learning: Using Data to Define Desired Outcomes, and Inferring How to Get There Benjamin Eysenbach, 2023

Comparing Forecasters and Abstaining Classifiers Yo Joong Choe, 2023

Using Task Driven Methods to Uncover Representations of Human Vision and Semantics Aria Yuan Wang, 2023

Data-driven Decisions - An Anomaly Detection Perspective Shubhranshu Shekhar, 2023

Applied Mathematics of the Future Kin G. Olivares, 2023

METHODS AND APPLICATIONS OF EXPLAINABLE MACHINE LEARNING Joon Sik Kim, 2023

NEURAL REASONING FOR QUESTION ANSWERING Haitian Sun, 2023

Principled Machine Learning for Societally Consequential Decision Making Amanda Coston, 2023

Long term brain dynamics extend cognitive neuroscience to timescales relevant for health and physiology Maxwell B. Wang, 2023

Long term brain dynamics extend cognitive neuroscience to timescales relevant for health and physiology Darby M. Losey, 2023

Calibrated Conditional Density Models and Predictive Inference via Local Diagnostics David Zhao, 2023

Towards an Application-based Pipeline for Explainability Gregory Plumb, 2022

Objective Criteria for Explainable Machine Learning Chih-Kuan Yeh, 2022

Making Scientific Peer Review Scientific Ivan Stelmakh, 2022

Facets of regularization in high-dimensional learning: Cross-validation, risk monotonization, and model complexity Pratik Patil, 2022

Active Robot Perception using Programmable Light Curtains Siddharth Ancha, 2022

Strategies for Black-Box and Multi-Objective Optimization Biswajit Paria, 2022

Unifying State and Policy-Level Explanations for Reinforcement Learning Nicholay Topin, 2022

Sensor Fusion Frameworks for Nowcasting Maria Jahja, 2022

Equilibrium Approaches to Modern Deep Learning Shaojie Bai, 2022

Towards General Natural Language Understanding with Probabilistic Worldbuilding Abulhair Saparov, 2022

Applications of Point Process Modeling to Spiking Neurons (Unavailable) Yu Chen, 2021

Neural variability: structure, sources, control, and data augmentation Akash Umakantha, 2021

Structure and time course of neural population activity during learning Jay Hennig, 2021

Cross-view Learning with Limited Supervision Yao-Hung Hubert Tsai, 2021

Meta Reinforcement Learning through Memory Emilio Parisotto, 2021

Learning Embodied Agents with Scalably-Supervised Reinforcement Learning Lisa Lee, 2021

Learning to Predict and Make Decisions under Distribution Shift Yifan Wu, 2021

Statistical Game Theory Arun Sai Suggala, 2021

Towards Knowledge-capable AI: Agents that See, Speak, Act and Know Kenneth Marino, 2021

Learning and Reasoning with Fast Semidefinite Programming and Mixing Methods Po-Wei Wang, 2021

Bridging Language in Machines with Language in the Brain Mariya Toneva, 2021

Curriculum Learning Otilia Stretcu, 2021

Principles of Learning in Multitask Settings: A Probabilistic Perspective Maruan Al-Shedivat, 2021

Towards Robust and Resilient Machine Learning Adarsh Prasad, 2021

Towards Training AI Agents with All Types of Experiences: A Unified ML Formalism Zhiting Hu, 2021

Building Intelligent Autonomous Navigation Agents Devendra Chaplot, 2021

Learning to See by Moving: Self-supervising 3D Scene Representations for Perception, Control, and Visual Reasoning Hsiao-Yu Fish Tung, 2021

Statistical Astrophysics: From Extrasolar Planets to the Large-scale Structure of the Universe Collin Politsch, 2020

Causal Inference with Complex Data Structures and Non-Standard Effects Kwhangho Kim, 2020

Networks, Point Processes, and Networks of Point Processes Neil Spencer, 2020

Dissecting neural variability using population recordings, network models, and neurofeedback (Unavailable) Ryan Williamson, 2020

Predicting Health and Safety: Essays in Machine Learning for Decision Support in the Public Sector Dylan Fitzpatrick, 2020

Towards a Unified Framework for Learning and Reasoning Han Zhao, 2020

Learning DAGs with Continuous Optimization Xun Zheng, 2020

Machine Learning and Multiagent Preferences Ritesh Noothigattu, 2020

Learning and Decision Making from Diverse Forms of Information Yichong Xu, 2020

Towards Data-Efficient Machine Learning Qizhe Xie, 2020

Change modeling for understanding our world and the counterfactual one(s) William Herlands, 2020

Machine Learning in High-Stakes Settings: Risks and Opportunities Maria De-Arteaga, 2020

Data Decomposition for Constrained Visual Learning Calvin Murdock, 2020

Structured Sparse Regression Methods for Learning from High-Dimensional Genomic Data Micol Marchetti-Bowick, 2020

Towards Efficient Automated Machine Learning Liam Li, 2020

LEARNING COLLECTIONS OF FUNCTIONS Emmanouil Antonios Platanios, 2020

Provable, structured, and efficient methods for robustness of deep networks to adversarial examples Eric Wong, 2020

Reconstructing and Mining Signals: Algorithms and Applications Hyun Ah Song, 2020

Probabilistic Single Cell Lineage Tracing Chieh Lin, 2020

Graphical network modeling of phase coupling in brain activity (unavailable) Josue Orellana, 2019

Strategic Exploration in Reinforcement Learning - New Algorithms and Learning Guarantees Christoph Dann, 2019

Learning Generative Models using Transformations Chun-Liang Li, 2019

Estimating Probability Distributions and their Properties Shashank Singh, 2019

Post-Inference Methods for Scalable Probabilistic Modeling and Sequential Decision Making Willie Neiswanger, 2019

Accelerating Text-as-Data Research in Computational Social Science Dallas Card, 2019

Multi-view Relationships for Analytics and Inference Eric Lei, 2019

Information flow in networks based on nonstationary multivariate neural recordings Natalie Klein, 2019

Competitive Analysis for Machine Learning & Data Science Michael Spece, 2019

The When, Where and Why of Human Memory Retrieval Qiong Zhang, 2019

Towards Effective and Efficient Learning at Scale Adams Wei Yu, 2019

Towards Literate Artificial Intelligence Mrinmaya Sachan, 2019

Learning Gene Networks Underlying Clinical Phenotypes Under SNP Perturbations From Genome-Wide Data Calvin McCarter, 2019

Unified Models for Dynamical Systems Carlton Downey, 2019

Anytime Prediction and Learning for the Balance between Computation and Accuracy Hanzhang Hu, 2019

Statistical and Computational Properties of Some "User-Friendly" Methods for High-Dimensional Estimation Alnur Ali, 2019

Nonparametric Methods with Total Variation Type Regularization Veeranjaneyulu Sadhanala, 2019

New Advances in Sparse Learning, Deep Networks, and Adversarial Learning: Theory and Applications Hongyang Zhang, 2019

Gradient Descent for Non-convex Problems in Modern Machine Learning Simon Shaolei Du, 2019

Selective Data Acquisition in Learning and Decision Making Problems Yining Wang, 2019

Anomaly Detection in Graphs and Time Series: Algorithms and Applications Bryan Hooi, 2019

Neural dynamics and interactions in the human ventral visual pathway Yuanning Li, 2018

Tuning Hyperparameters without Grad Students: Scaling up Bandit Optimisation Kirthevasan Kandasamy, 2018

Teaching Machines to Classify from Natural Language Interactions Shashank Srivastava, 2018

Statistical Inference for Geometric Data Jisu Kim, 2018

Representation Learning @ Scale Manzil Zaheer, 2018

Diversity-promoting and Large-scale Machine Learning for Healthcare Pengtao Xie, 2018

Distribution and Histogram (DIsH) Learning Junier Oliva, 2018

Stress Detection for Keystroke Dynamics Shing-Hon Lau, 2018

Sublinear-Time Learning and Inference for High-Dimensional Models Enxu Yan, 2018

Neural population activity in the visual cortex: Statistical methods and application Benjamin Cowley, 2018

Efficient Methods for Prediction and Control in Partially Observable Environments Ahmed Hefny, 2018

Learning with Staleness Wei Dai, 2018

Statistical Approach for Functionally Validating Transcription Factor Bindings Using Population SNP and Gene Expression Data Jing Xiang, 2017

New Paradigms and Optimality Guarantees in Statistical Learning and Estimation Yu-Xiang Wang, 2017

Dynamic Question Ordering: Obtaining Useful Information While Reducing User Burden Kirstin Early, 2017

New Optimization Methods for Modern Machine Learning Sashank J. Reddi, 2017

Active Search with Complex Actions and Rewards Yifei Ma, 2017

Why Machine Learning Works George D. Montañez, 2017

Source-Space Analyses in MEG/EEG and Applications to Explore Spatio-temporal Neural Dynamics in Human Vision Ying Yang, 2017

Computational Tools for Identification and Analysis of Neuronal Population Activity Pengcheng Zhou, 2016

Expressive Collaborative Music Performance via Machine Learning Gus (Guangyu) Xia, 2016

Supervision Beyond Manual Annotations for Learning Visual Representations Carl Doersch, 2016

Exploring Weakly Labeled Data Across the Noise-Bias Spectrum Robert W. H. Fisher, 2016

Optimizing Optimization: Scalable Convex Programming with Proximal Operators Matt Wytock, 2016

Combining Neural Population Recordings: Theory and Application William Bishop, 2015

Discovering Compact and Informative Structures through Data Partitioning Madalina Fiterau-Brostean, 2015

Machine Learning in Space and Time Seth R. Flaxman, 2015

The Time and Location of Natural Reading Processes in the Brain Leila Wehbe, 2015

Shape-Constrained Estimation in High Dimensions Min Xu, 2015

Spectral Probabilistic Modeling and Applications to Natural Language Processing Ankur Parikh, 2015

Computational and Statistical Advances in Testing and Learning Aaditya Kumar Ramdas, 2015

Corpora and Cognition: The Semantic Composition of Adjectives and Nouns in the Human Brain Alona Fyshe, 2015

Learning Statistical Features of Scene Images Wooyoung Lee, 2014

Towards Scalable Analysis of Images and Videos Bin Zhao, 2014

Statistical Text Analysis for Social Science Brendan T. O'Connor, 2014

Modeling Large Social Networks in Context Qirong Ho, 2014

Semi-Cooperative Learning in Smart Grid Agents Prashant P. Reddy, 2013

On Learning from Collective Data Liang Xiong, 2013

Exploiting Non-sequence Data in Dynamic Model Learning Tzu-Kuo Huang, 2013

Mathematical Theories of Interaction with Oracles Liu Yang, 2013

Short-Sighted Probabilistic Planning Felipe W. Trevizan, 2013

Statistical Models and Algorithms for Studying Hand and Finger Kinematics and their Neural Mechanisms Lucia Castellanos, 2013

Approximation Algorithms and New Models for Clustering and Learning Pranjal Awasthi, 2013

Uncovering Structure in High-Dimensions: Networks and Multi-task Learning Problems Mladen Kolar, 2013

Learning with Sparsity: Structures, Optimization and Applications Xi Chen, 2013

GraphLab: A Distributed Abstraction for Large Scale Machine Learning Yucheng Low, 2013

Graph Structured Normal Means Inference James Sharpnack, 2013 (Joint Statistics & ML PhD)

Probabilistic Models for Collecting, Analyzing, and Modeling Expression Data Hai-Son Phuoc Le, 2013

Learning Large-Scale Conditional Random Fields Joseph K. Bradley, 2013

New Statistical Applications for Differential Privacy Rob Hall, 2013 (Joint Statistics & ML PhD)

Parallel and Distributed Systems for Probabilistic Reasoning Joseph Gonzalez, 2012

Spectral Approaches to Learning Predictive Representations Byron Boots, 2012

Attribute Learning using Joint Human and Machine Computation Edith L. M. Law, 2012

Statistical Methods for Studying Genetic Variation in Populations Suyash Shringarpure, 2012

Data Mining Meets HCI: Making Sense of Large Graphs Duen Horng (Polo) Chau, 2012

Learning with Limited Supervision by Input and Output Coding Yi Zhang, 2012

Target Sequence Clustering Benjamin Shih, 2011

Nonparametric Learning in High Dimensions Han Liu, 2010 (Joint Statistics & ML PhD)

Structural Analysis of Large Networks: Observations and Applications Mary McGlohon, 2010

Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy Brian D. Ziebart, 2010

Tractable Algorithms for Proximity Search on Large Graphs Purnamrita Sarkar, 2010

Rare Category Analysis Jingrui He, 2010

Coupled Semi-Supervised Learning Andrew Carlson, 2010

Fast Algorithms for Querying and Mining Large Graphs Hanghang Tong, 2009

Efficient Matrix Models for Relational Learning Ajit Paul Singh, 2009

Exploiting Domain and Task Regularities for Robust Named Entity Recognition Andrew O. Arnold, 2009

Theoretical Foundations of Active Learning Steve Hanneke, 2009

Generalized Learning Factors Analysis: Improving Cognitive Models with Machine Learning Hao Cen, 2009

Detecting Patterns of Anomalies Kaustav Das, 2009

Dynamics of Large Networks Jurij Leskovec, 2008

Computational Methods for Analyzing and Modeling Gene Regulation Dynamics Jason Ernst, 2008

Stacked Graphical Learning Zhenzhen Kou, 2007

Actively Learning Specific Function Properties with Applications to Statistical Inference Brent Bryan, 2007

Approximate Inference, Structure Learning and Feature Estimation in Markov Random Fields Pradeep Ravikumar, 2007

Scalable Graphical Models for Social Networks Anna Goldenberg, 2007

Measure Concentration of Strongly Mixing Processes with Applications Leonid Kontorovich, 2007

Tools for Graph Mining Deepayan Chakrabarti, 2005

Automatic Discovery of Latent Variable Models Ricardo Silva, 2005


Dissertation

TITLE: ADVANCING DATA-DRIVEN ENVIRONMENTAL DECISION-MAKING AND GOVERNANCE IN CHINA 

My dissertation comprises four articles that seek to understand the strengths and limitations of applying empirical approaches to environmental policy evaluation and decision-making, with a particular focus on China. The first paper examines whether national environmental performance indicators developed using publicly available data can be used to evaluate countries’ progress toward Millennium Development Goal 7 (MDG7) – to ensure sustainability – and other global environmental goals. I argue that while global environmental movements such as the landmark 1992 Rio Earth Summit, which put sustainable development at the fore of national policy agendas, had a profound effect in spurring multilateral cooperation on global environmental issues, they overlooked the data requirements and monitoring infrastructure, particularly in the global South, needed to gauge whether targets were being achieved. Moreover, goals like MDG7 had the effect of generating a path dependency for indicators, constraining flexibility in the design of better, more policy-relevant indicators that reflect advancements in the scientific understanding of environmental problems.

The remaining three chapters of my dissertation developed in-depth case studies of China’s environmental governance structure, using indicators as a lens by which to understand institutions, politics, and actors. In collaboration with the Chinese Academy for Environmental Planning, a government think-tank affiliated with the Ministry of Environmental Protection, the second paper in my dissertation systematically evaluated baseline environmental data at the provincial level in China and provided an assessment of the challenges – scientific and political – to developing aggregate indices of environmental performance in China. Confronting political barriers to information access and data inconsistencies, for my third article I conducted a series of semi-structured interviews with provincial and municipal government officials in environmental protection bureaus and monitoring centers across nine provinces and two municipalities to understand center-local challenges with respect to environmental data collection and reporting. This paper revealed major implementation challenges in China’s vertical environmental governance structure, highlighting the competing local economic incentives, variable institutional capacity, and uneven public awareness that affect high-quality data collection. Recognizing these data limitations, the last paper applied satellite data to develop air quality indicators for fine particulate matter (PM2.5) at the provincial level in China – a first look at subnational, long-term average exposure to PM2.5 prior to the public release of ground-level data in major cities in China.

Publications associated with this dissertation:

Emerson, J., A. Hsu, M. Levy, A. de Sherbinin, V. Mara, D. Esty, and M. Jaiteh. 2012. 2012 Environmental Performance Index and Pilot Trend Environmental Performance Index. New Haven: Yale Center for Environmental Law and Policy. Available: http://epi.yale.edu/downloads

Hsu, A. 2013. Limitations and Challenges of Provincial Environmental Protection Bureaus in China’s Environmental Monitoring, Reporting, and Verification. Environmental Practice, 15(3): 280-292. http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=9051523

Hsu, A., A. de Sherbinin, and H. Shi. 2012. Seeking truth from facts: the challenge of environmental indicator development in China. Environmental Development, 3: 39-51. http://dx.doi.org/10.1016/j.envdev.2012.05.001

Hsu, A., A. Lloyd, and J.W. Emerson. 2013. What progress have we made since Rio? The 2012 Environmental Performance Index (EPI) and Pilot Trend EPI. Environmental Science and Policy, 33: 171-185. http://dx.doi.org/10.1016/j.envsci.2013.05.011

Hsu, A., A. Reuben, D. Shindell, A. de Sherbinin, and M. Levy. 2013. Toward the Next Generation of Air Quality Monitoring. Atmospheric Environment, 80: 561-570. http://www.sciencedirect.com/science/article/pii/S1352231013005578

Yale Center for Environmental Law and Policy (YCELP), the Center for International Earth Science Information Network (CIESIN), Chinese Academy for Environmental Planning, and City University of Hong Kong. 2011. Towards a China Environmental Performance Index. Available at: http://envirocenter.yale.edu/chinaepi


Data Consolidation: The Key To Unlocking AI's Transformative Power In Organizations

Forbes Technology Council


Afif Khoury, CEO of SOCi, has forged a career spanning over 25 years at the forefront of technology and data-driven innovation.

In today's rapidly evolving business landscape, the integration of artificial intelligence (AI) presents a significant opportunity for organizations to gain a competitive edge through innovation and efficiency. However, the full potential of AI remains largely untapped, hindered by the fragmented nature of data across various channels and systems. To harness the transformative capabilities of AI, data consolidation and management emerge as a pivotal step.

Currently, for many organizations, data is dispersed across multiple platforms and networks, creating silos that impede holistic insights and strategic decision-making. This fragmented data landscape leads to inefficiencies and obscures valuable insights that could drive organizational growth and competitive advantage.

AI, like any learning system, thrives on access to comprehensive yet consolidated and structured data. The depth and breadth of information available to AI directly impact its ability to generate accurate predictions, identify patterns, derive actionable insights and improve ROI. This necessity for comprehensive data is akin to how we understand human learning, where exposure to a diverse range of information shapes intelligence and understanding.

To address this challenge, organizations must prioritize integrating disparate data sources, creating a unified data repository that serves as a foundation for AI-driven insights and decision-making.


Here are practical steps to empower organizations with AI-infused analytics.

Step 1: Implement A Centralized Data Management Platform

Invest in a platform that can aggregate data from all channels and networks, creating a unified repository for AI-driven analytics and insights. Beware of "point solutions" that offer limited data access, as they may restrict the scope and effectiveness of AI-driven strategies. For example, a social media management platform analyzing only content posted by your organization could lead to narrow recommendations.

However, an AI with access to a comprehensive dataset—including keyword traffic, sentiment analysis of reviews, chatbot interactions, survey responses, revenue data, CRM leads and loyalty program insights—provides a holistic view of customer behaviors, enabling more informed decision-making. Consolidated platforms with broad data access are better suited for leveraging AI in data-driven decision-making.
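As a rough illustration of what such consolidation can look like in practice, here is a minimal sketch that outer-joins three hypothetical channel extracts on a shared customer key. The file names and columns are assumptions made for the example, not features of any particular platform.

import pandas as pd

# All file names and columns below are hypothetical examples.
reviews = pd.read_csv("reviews.csv")    # customer_id, rating, review_text
crm = pd.read_csv("crm_leads.csv")      # customer_id, lead_stage, revenue
surveys = pd.read_csv("surveys.csv")    # customer_id, nps_score

# Outer joins keep every customer seen on any channel, so no signal is dropped.
unified = (
    reviews
    .merge(crm, on="customer_id", how="outer")
    .merge(surveys, on="customer_id", how="outer")
)

# One consolidated repository for downstream AI-driven analytics.
unified.to_parquet("unified_customer_view.parquet")

The design choice here is the shared key: once every channel's extract can be joined on one identifier, the "unified repository" stops being a slogan and becomes a single table that AI models can actually consume.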

Step 2: Standardize Data Formats

Standardizing data formats is crucial for unleashing AI's full potential. Data inconsistency can hinder AI-driven strategies, making it essential to ensure uniformity across all data types. For example, customer interaction data, like tweets and comments, requires standardization for effective analysis. Techniques like NLP for text and CNNs for images depend on data pre-processing to interpret information accurately.

Moreover, Gartner warns that 85% of AI projects may fail due to data biases, underscoring the need for representative and unbiased data. To achieve this, companies should implement a common data model across all channels, supported by automation tools for efficient and accurate data transformation.
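To make the idea of a common data model concrete, the sketch below normalizes records from two hypothetical channels into one shared schema. The channel names, field names, and timestamp formats are invented for illustration only.

from datetime import datetime, timezone

def to_common_model(record: dict, channel: str) -> dict:
    """Normalize one raw record into a shared schema (all field names hypothetical)."""
    if channel == "twitter":
        text, ts = record["tweet"], record["created_at"]    # e.g. "2024-04-10T14:02:00Z"
        when = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    elif channel == "reviews":
        text, ts = record["body"], record["posted"]         # e.g. "04/10/2024"
        when = datetime.strptime(ts, "%m/%d/%Y").replace(tzinfo=timezone.utc)
    else:
        raise ValueError(f"unknown channel: {channel}")
    return {
        "channel": channel,
        "text": text.strip(),
        "timestamp": when.isoformat(),   # one timestamp format for every channel
    }

print(to_common_model({"tweet": "Great store!", "created_at": "2024-04-10T14:02:00Z"}, "twitter"))

In a real pipeline this mapping layer would be generated or enforced by automation tools, but the principle is the same: every downstream model sees one schema, regardless of which channel produced the record.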

Step 3: Foster A Data-Driven Culture

Encouraging a data-driven culture is essential for successful AI deployment. Companies that base their strategies on data-driven insights can significantly improve their ROI. However, it's not enough to simply promote data-driven decision-making; organizations must cultivate a culture that values and utilizes data effectively. This involves providing resources such as workshops and access to analytical tools to empower employees.

For instance, Google Analytics Academy offers free courses to help marketing teams leverage data efficiently. By fostering a data-driven mindset and providing the necessary support, organizations can enhance their analytical capability and strategic agility, leading to greater success in AI-infused analytics.

Step 4: Leverage AI For Personalization At Scale

AI's capability to analyze large datasets allows for hyper-localized personalization, enabling businesses to tailor marketing efforts to individual preferences at scale. For instance, a retail chain with multiple storefronts can utilize AI to review each store's marketing data, gaining precise insights into local customer behavior.

This level of personalization has been shown to significantly enhance customer engagement and satisfaction, with studies indicating that 80% of consumers are more likely to make a purchase when offered personalized experiences. By leveraging AI to sift through massive datasets and identify trends, businesses can drive higher conversion rates and foster customer loyalty in a manner previously unattainable.

Step 5: Continuously Monitor And Optimize

AI systems excel in the real-time identification of patterns and shifts in consumer behavior, enabling swift adjustments to marketing campaigns to maintain relevance and effectiveness in a rapidly changing market. Regularly updating AI models with new data and adjusting strategies based on performance analytics ensures that marketing efforts remain efficient and aligned with business objectives. McKinsey emphasizes the significance of real-time optimization in marketing, citing potential increases in marketing ROI ranging from 15% to 40%.

In conclusion, as we continue to navigate the complexities of unlocking AI’s transformative power, let us not forget that at the heart of every advanced technology—from AI to machine learning—lies a simple truth: Knowledge is power. And in the context of technology adoption, that power is derived from the comprehensive, consolidated data we choose to feed into our systems.

By breaking down data silos and creating a unified data repository, organizations can harness AI's transformative capabilities to drive innovation, efficiency and success across all aspects of their operations.



Inflation runs hot for third straight month, driven by gas prices and rent

By Aimee Picchi

Edited By Alain Sherter

Updated on: April 10, 2024 / 8:01 PM EDT / CBS News

Inflation remains the stickiest of problems for the U.S. economy, with the March consumer price index coming in hotter than expected — the third straight month that prices have accelerated. Gasoline prices and rent contributed over half the monthly increase, the government said  on Wednesday.

Prices in March rose 3.5% on an annual basis, higher than the 3.4% expected by economists polled by financial data services company FactSet. It also represents a jump from February's increase of 3.2% and January's bump of 3.1% on a year-over-year basis. 

The latest acceleration in prices complicates the picture for the Federal Reserve, which has been monitoring economic data to determine whether inflation is cool enough to allow it to cut interest rates. But inflation, which measures the rate of price changes in goods and services bought by consumers, has remained stubborn in 2024, stalling the progress made last year to bring down the annual growth rate to the Fed's goal of 2%.

"This marks the third consecutive strong reading and means that the stalled disinflationary narrative can no longer be called a blip," said Seema Shah, chief global strategist at Principal Asset Management, in an email.  

Shah added, "In fact, even if inflation were to cool next month to a more comfortable reading, there is likely sufficient caution within the Fed now to mean that a July cut may also be a stretch, by which point the U.S. election will begin to intrude with Fed decision making."

Stocks fell on the report, with the S&P 500 down 45 points, or 0.9%, to 5,164.96. The Dow Jones Industrial Average slumped 1% while the tech-heavy Nasdaq slipped 0.9%.

What does this mean for the Federal Reserve?

The higher inflation measures threaten to torpedo the prospect of multiple interest rate cuts this year. Fed officials have made clear that with the economy healthy,  they're in no rush  to cut their benchmark rate despite their earlier projections that they would reduce rates three times this year.

At the start of 2024, Wall Street traders had projected that the Fed would cut its key rate up to six or seven times this year. In March, Fed officials signaled that they envisioned three rate cuts. But elevated inflation readings for January and February — along with signs that economic growth remains healthy — led several Fed officials to suggest that  fewer rate cuts  may occur this year.

On Thursday, a Federal Reserve official raised the possibility the central bank may not cut interest rates at all in 2024, deflating Wall Street's expectations that several reductions could be in store later this year. 

"If we continue to see inflation moving sideways, it would make me question whether we needed to do those rate cuts at all," said Federal Reserve Bank of Minneapolis President Neel Kashkari last week.

Where inflation is spiking

Gas prices surged 1.7% from February to March, and clothing costs rose 0.7%. The average cost of auto insurance jumped 2.6% last month and is up a dramatic 22% from a year ago, partly reflecting purchases of higher-priced vehicles.

A report earlier this year found U.S. drivers are paying an average of $2,543 annually, or $212 per month, for car insurance — an increase of 26% from last year. Rates are also rising due to the impact of severe weather events, which have become more frequent due to climate change.

Grocery costs, though, were unchanged last month and are 2.2% higher than they were a year ago, providing some relief to consumers after the huge spikes in food prices in 2022 and early 2023.

The surge in inflation that followed the pandemic jacked up the cost of food, gas, rent and many other items. Though inflation has since plummeted from its peak of 9.1% in June 2022, average prices are still well above where they were before the pandemic.

—With reporting by the Associated Press.

Aimee Picchi is the associate managing editor for CBS MoneyWatch, where she covers business and personal finance. She previously worked at Bloomberg News and has written for national news outlets including USA Today and Consumer Reports.


In Battle Over Health Care Costs, Private Equity Plays Both Sides

As medical practices owned by private equity firms fuel overbilling, a payment tool also backed by such investors helps insurers boost their profits.

Andrew Faehnle and his teenager pose on a couch, their arms around each other’s shoulders.

By Chris Hamby

Insurance companies have long blamed private-equity-owned hospitals and physician groups for exorbitant billing that drives up health care costs. But a tool backed by private equity is helping insurers make billions of dollars and shift costs to patients.

The tool, Data iSight, is the premier offering of a cost-containment firm called MultiPlan that has attracted round after round of private equity investment since positioning itself as a central player in the lucrative medical payments field. Today Hellman & Friedman, the California-based private equity giant, and the Saudi Arabian government’s sovereign wealth fund are among the firm’s largest investors.

The evolution of Data iSight, which recommends how much of each medical bill should be paid, is an untold chapter in the story of private equity’s influence on American health care.

A New York Times investigation of insurers’ relationship with MultiPlan found that countering predatory billing is just one aspect of the collaboration. Low payments have burdened patients with unexpectedly large bills, slashed pay for doctors and other medical professionals and left employers that fund health plans with high, often unanticipated fees — all while making the country’s biggest health insurance companies a lot of money.

Often, when someone gets insurance through an employer and sees a doctor outside the plan’s network, the insurer routes the bill to MultiPlan to recommend an amount to pay. Both MultiPlan and the insurer receive processing fees from the employer, usually based on the size of the final payment: the smaller the payout, the bigger the fees.

This business model has made Data iSight a cash cow. Of the handful of tools MultiPlan offers insurers, Data iSight consistently makes the most frugal recommendations, typically resulting in the highest fees.

MultiPlan, which has been publicly traded since 2020, did not respond to detailed questions about Data iSight. A statement issued by an outside public relations firm said MultiPlan’s payment recommendations were fair and “widely accepted.” It said the company was “committed to lowering out-of-network costs,” including by using “data-driven tools to determine fair reimbursements.”

In recent years, concern over private equity’s investments in medical practices has grown, as studies have documented rising bills. Insurers and MultiPlan say that Data iSight is a necessary counterweight.

Caught between these moneyed interests are patients, who are mostly in the dark. If they encounter Data iSight’s name, it is typically in the fine print of dense paperwork. Those who have complained said they got little more than assurances that the calculations were rigorous and fair.

For Mary Lavigne, who has chronic pain, chiropractor appointments near Irvine, Calif., almost doubled in cost. Nadia Salim’s Boston-area therapy appointments also became almost twice as expensive. And Andrew Faehnle was on the hook for more than two-thirds of an ambulance bill after his 14-year-old was rushed to an emergency room in Anaheim, Calif. In each case, insurance statements cited Data iSight.

“I thought, ‘Who the heck are these people?’” Mr. Faehnle said. “I started Googling, ‘What’s Data iSight?’”

‘The Time Seemed Right’

MultiPlan’s business model is based on simple math: Take the amount a doctor charges, subtract MultiPlan’s recommended payout, and you have what the firm identifies as a savings or discount. Usually, MultiPlan and the insurer each collect a percentage of that declared savings as a processing fee.

This arrangement helps insurers profit from the most common way Americans get health coverage: through an employer that pays medical claims with its own money, using an insurer only as an administrator. Using MultiPlan, insurers cut medical bills, then charge employers for doing so.
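As a toy illustration of the incentive this structure creates, consider the arithmetic below. Every number is invented; only the structure (processing fees keyed to declared savings) comes from the description above.

# Toy numbers only - not actual MultiPlan or insurer figures.
billed = 10_000.00        # what the out-of-network provider charges
recommended = 2_500.00    # payout recommended by the pricing tool
declared_savings = billed - recommended   # the "savings" the fees are keyed to

fee_rate = 0.07           # hypothetical processing-fee share for each party
multiplan_fee = fee_rate * declared_savings
insurer_fee = fee_rate * declared_savings

print(declared_savings, multiplan_fee, insurer_fee)  # 7500.0 525.0 525.0
# The smaller the recommended payout, the larger the declared savings,
# and therefore the larger both processing fees.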

For decades, MultiPlan determined payments primarily through negotiations. The discounts were modest but came with an agreement not to collect more from patients.

After MultiPlan’s founder, Donald Rubin, sold it in 2006, the company’s new private equity owners began a move toward automated pricing that executives would later call “MultiPlan 2.0.”

In 2010, it bought Viant, an Illinois-based firm that used algorithms to recommend reimbursements. But for some types of care, Viant’s calculations used a database of billed amounts. So if medical providers charged more over time, the recommended payments were also likely to rise.

A small firm in Grapevine, Texas, had developed an alternative strategy. Rather than start with a bill and negotiate it down, Tom Galas, a former insurance executive, wanted to calculate the cost of care and negotiate it up.

Mr. Galas bought an analytics firm called Data Advantage in 2005 and assigned a team at his company, National Care Network, to execute his vision. The result was Data iSight.

It drew on data that medical facilities submitted to the federal government and techniques developed by Medicare to estimate treatment costs. It then threw in some extra money, meant to allow a fair profit. The goal was to save insurers and employers money without paying so little that providers would sue them or go after patients for the balance.

In 2011, Mr. Galas sold to MultiPlan.

“The industry was condensing,” he said. “The time seemed right.”

Though he considered Data iSight revolutionary, he said, even he didn’t anticipate what it would become.

‘MultiPlan Is Magic’

Executives from the country’s major insurers gathered in Laguna Beach, Calif., in 2019 and heard from Dale White, a MultiPlan executive vice president.

He presented a slide showing the cover of a self-help book, “Life Is Magic,” that had been digitally altered to show Mr. White’s face and to read “MultiPlan Is Magic.” The slide added: “We have a few things up our sleeve, too.”

The firm’s annual revenues had reached about $1 billion, and three sets of private equity investors had cashed in. After buying MultiPlan for just over $3 billion in 2010 from the Carlyle Group, the firms BC Partners and Silver Lake sold it for a reported $4.4 billion in 2014 to Starr Investment Holdings and Partners Group, which sold it two years later to Hellman & Friedman for a reported $7.5 billion.

Hellman & Friedman, which owned the company when it went public in 2020, declined to comment.

Fueling the growth was Data iSight. The annual revenue it brought MultiPlan grew from $23 million in 2012 to more than $323 million in 2019, according to an investor presentation in 2020. The next year, the chief executive, Mark Tabak, told investors that Data iSight was MultiPlan’s top moneymaker among its biggest insurance customers.

While the company continued to offer other tools, it pitched Data iSight as an “industry-leading” and “state-of-the-art” way to “maximize savings.”

For insurers, the tool came with trade-offs: lower payments but potentially more patient complaints. They rolled it out gradually. The nation’s largest insurer by revenue, UnitedHealthcare, began using it in 2016 for certain plans and treatments, documents show.

As Data iSight spread, patients, doctors and medical facilities began receiving unwelcome surprises. Some practices that had negotiated contracts with MultiPlan found that they no longer received their agreed-upon rate, and patients were no longer protected from big bills.

Brett Lockhart had spine surgery at a facility near Cocoa, Fla., that had a negotiated rate with MultiPlan. When his insurer used Data iSight, he found himself on the hook for nearly $300,000. The bill is the subject of litigation and remains unpaid.

‘Crazy Low’ Payments

There was more to MultiPlan’s rising fortunes than just an increase in the number of claims. The average fee from each claim also grew, executives told investors.

In a presentation shortly before it became a publicly traded company in 2020, MultiPlan stressed that its tools were “scalable”: Reducing payments by just half a percent could yield an additional $10 million in profits, the company said.

After MultiPlan fell short of a revenue target in 2022, Mr. White, who had become chief executive, assured investors that the company had an “action plan” that included “aggressively implementing new initiatives with our customers to help them cope with accelerating health care costs.”

A change to Data iSight’s methodology, he said, should produce an additional $6 million in revenue.

MultiPlan has told investors it plans further “enhancements” to the tools, including use of artificial intelligence.

As patients and providers have demanded an explanation for declining payments, MultiPlan has fought to keep details about Data iSight confidential, contending in lawsuits that the information is proprietary.

Interviews and documents, some obtained after The Times petitioned federal courts, offer some insights.

Data iSight starts by using Medicare’s methods for setting rates. But subsequent calculations are less transparent. MultiPlan says it applies multipliers that allow for a fair profit for hospitals and something approximating a fair market rate for physicians. The documents show that MultiPlan allows insurers to cap prices and set what they consider fair profit margins for medical facilities.

MultiPlan has pitched Data iSight as an alternative to simply paying marked-up Medicare rates, an option some insurers offer. Paying around 120 percent of the government-set rate “sounds fair, maybe even generous,” one MultiPlan document said, but this is “inherently misleading” because “the average consumer does not understand just how low Medicare rates are.”

Interviews and documents, however, indicate that Data iSight’s recommended prices are sometimes about 160 to 260 percent of Medicare rates — amounts former MultiPlan employees described as “ridiculously low” and “crazy low.”

Even rates that may sound reasonable can strain medical practices. For example, UnitedHealthcare, citing Data iSight, offered Dr. Darius Kohan roughly 350 percent of the Medicare rate for a surgery to repair a patient’s eardrum. It amounted to $3,855.36.

Dr. Kohan, who has a small practice in Manhattan, said skimpy payments were forcing him to consider joining a large hospital system or private-equity-backed group.

“I am a dinosaur, but my patients like that,” he said. “I may not be able to sustain it.”
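The figures in Dr. Kohan's example imply a simple marked-up-Medicare relationship. The sketch below back-solves the implied base rate; it illustrates percentage-of-Medicare pricing in general, not a reconstruction of Data iSight's proprietary formula.

# Illustration of "percentage of Medicare" pricing, using the example above.
def marked_up_payment(medicare_rate: float, multiplier: float) -> float:
    """Payment offered as a multiple of the government-set Medicare rate."""
    return multiplier * medicare_rate

offered = 3855.36      # the Data iSight-based offer cited in the article
multiplier = 3.5       # "roughly 350 percent of the Medicare rate"
implied_base = offered / multiplier
print(round(implied_base, 2))                  # ~1101.53, the implied Medicare rate
print(marked_up_payment(implied_base, 1.2))    # ~1321.84, what a "120 percent" plan would pay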

Chris Hamby is an investigative reporter for The Times, based in Washington.

Hallucinations are the bane of AI-driven insights. Here’s what search can teach us about trustworthy responses, according to Snowflake’s CEO

Sridhar Ramaswamy is the CEO of Snowflake.

Businesses are eager to capitalize on the power of generative AI, but they are wrestling with the question of trust: How do you build a generative AI application that provides accurate responses and doesn’t hallucinate? This issue has vexed the industry for the past year, but it turns out that we can learn a lot from an existing technology: search.

By looking at what search engines do well (and what they don’t), we can learn to build more trustworthy generative AI applications. This is important because generative AI can bring immense improvements in efficiency, productivity, and customer service–but only when enterprises can be sure their generative AI apps provide reliable and accurate information.

In some contexts, the level of accuracy required from AI is lower. If you’re building a program that decides which ad to display next on a web page, an AI program that’s mostly accurate is still valuable. But if a customer asks your AI chatbot how much their invoice is this month or an employee asks how many PTO days they have left, there is no margin for error.

Search engines have long sought to provide accurate answers from vast troves of data, and they are successful in some areas and weaker in others. By taking the best aspects of search and combining them with new approaches that are better suited for generative AI in business, we can solve the trust problem and unlock the power of generative AI for the workplace.

Sorting the wheat from the chaff

One area where search engines perform well is sifting through large volumes of information and identifying the highest-quality sources. For example, by looking at the number and quality of links to a web page, search engines return the web pages that are most likely to be trustworthy. Search engines also favor domains that are known to be trustworthy, such as federal government websites, or established news sources such as the BBC.

In business, generative AI apps can emulate these ranking techniques to return reliable results. They should favor the sources of company data that have been most frequently accessed, searched, or shared. And they should strongly favor sources that are known to be trustworthy, such as corporate training manuals or a human resources database, while disfavoring less reliable sources.
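A minimal sketch of what such search-style ranking might look like inside a retrieval layer is shown below. The fields, weights, and example sources are assumptions for illustration, not a description of any production system.

from dataclasses import dataclass

@dataclass
class Source:
    name: str
    access_count: int   # how often employees open, search, or share it
    trusted: bool       # e.g., an official training manual or HR database

def score(src: Source) -> float:
    # Popularity signal, capped, plus a strong bonus for vetted sources.
    popularity = min(src.access_count / 1000, 1.0)
    return popularity + (2.0 if src.trusted else 0.0)

sources = [
    Source("hr_policy_db", 800, True),
    Source("random_wiki_page", 5000, False),
]
for s in sorted(sources, key=score, reverse=True):
    print(s.name, round(score(s), 2))   # hr_policy_db ranks first (2.8 vs 1.0)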

LLMs are an interlocutor, not an oracle

Many foundational large language models (LLMs) have been trained on the wider Internet, which as we all know contains both reliable and unreliable information. This means that they’re able to address questions on a wide variety of topics, but they have yet to develop the more mature, sophisticated ranking methods that search engines use to refine their results. That’s one reason why many reputable LLMs can hallucinate and provide incorrect answers.

One of the learnings here is that developers should think of LLMs as a language interlocutor, rather than a source of truth. In other words, LLMs are strong at understanding language and formulating responses, but they should not be used as a canonical source of knowledge. To address this problem, many businesses train their LLMs on their own corporate data and on vetted third-party data sets, minimizing the presence of bad data. By adopting the ranking techniques of search engines and favoring high-quality data sources, AI-powered applications for businesses become far more reliable.

The humility to say ‘I don’t know’

Search has also gotten quite good at understanding context to resolve ambiguous queries. For example, a search term like “swift” can have multiple meanings–the author, the programming language, the banking system, the pop sensation, and so on. Search engines look at factors like geographic location and other terms in the search query to determine the user’s intent and provide the most relevant answer.

However, when a search engine can't provide the right answer, because it lacks sufficient context or because a page with the answer doesn't exist, it will often try to do so anyway. For example, if you ask a search engine, "What will the economy be like 100 years from now," or "How will the Kansas City Chiefs perform next season," there may be no reliable answer available. But search engines are based on a philosophy that they should provide an answer in almost all cases, even if they lack a high degree of confidence.

This is unacceptable for many business use cases, and so generative AI applications need a layer between the search (or prompt) interface and the LLM that studies the possible contexts and determines if it can provide an accurate answer or not. If this layer finds that it cannot provide the answer with a high degree of confidence, it needs to disclose this to the user. This greatly reduces the likelihood of a wrong answer, helps to build trust with the user, and can provide them with an option to provide additional context so that the gen AI app can produce a confident result.

This layer between the user interface and the LLM can also employ a technique called Retrieval Augmented Generation, or RAG, to consult an external source of trusted data that exists outside of the LLM.
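Putting these ideas together, here is a minimal sketch of a confidence-gated answer layer. The toy corpus, threshold, and function names are invented stand-ins for a real retriever and model API.

# Sketch only: a confidence-gated layer between the user and an LLM,
# in the spirit of the RAG approach described above.
TRUSTED_CORPUS = {
    "How many PTO days do new hires get?":
        ("Employees accrue 15 PTO days per year.", 0.92),
    "What will the economy be like in 100 years?":
        ("", 0.10),   # nothing reliable on file
}

CONFIDENCE_THRESHOLD = 0.75

def retrieve(question: str) -> tuple[str, float]:
    """Return (best passage, relevance score) from the trusted corpus (toy lookup)."""
    return TRUSTED_CORPUS.get(question, ("", 0.0))

def answer(question: str) -> str:
    passage, relevance = retrieve(question)
    if relevance < CONFIDENCE_THRESHOLD:
        # The humility to say "I don't know" rather than guess.
        return "I can't answer that confidently. Can you provide more context?"
    # A real system would now prompt the LLM to answer only from `passage`
    # and cite it ("show your work").
    return f"{passage} (source: trusted corpus)"

print(answer("How many PTO days do new hires get?"))
print(answer("What will the economy be like in 100 years?"))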

Show your work

Explainability is another weak area for search engines, but one that generative AI apps must employ to build greater trust. Just as high school teachers tell their students to show their work and cite sources, generative AI applications must do the same. By disclosing the sources of information, users can see where information came from and why they should trust it. Some of the public LLMs have started to provide this transparency and it should be a foundational element of generative AI-powered tools used in business.

Going in with our eyes open

Despite every effort, it will be challenging to build AI applications that make very few mistakes. And yet the benefits are too significant to sit on the sidelines and hope that competitors don’t surge ahead. That puts an onus on business users to approach AI tools with their eyes open. Just as the internet has changed how people relate to news and news sources, business users must develop an educated skepticism and learn to look for signs of trustworthy AI. That means demanding transparency from the AI applications we use, seeking explainability, and being conscious of potential biases.

We’re on an exciting journey to a new class of applications that will transform our work and careers in ways that we can’t yet anticipate. But to be valuable in business, these applications must be reliable and trustworthy. Search engines laid some of the groundwork for surfacing accurate responses from large volumes of data, but they are designed with different use cases in mind. By taking the best of search and adding new techniques to ensure greater accuracy, we can unlock the full potential of generative AI in business.

Sridhar Ramaswamy is the CEO of Snowflake.


The opinions expressed in Fortune.com commentary pieces are solely the views of their authors and do not necessarily reflect the opinions and beliefs of Fortune.


AEP Ohio predicts power demands to double around Columbus, driven by data center needs


Data centers in central Ohio are gobbling up vast amounts of electricity so fast that American Electric Power expects demand for power to double between 2018 and 2028.

"We are seeing unprecedented growth in the demand for power from data centers in central Ohio," the power company said in response to questions from The Dispatch. "We expect the total demand for power in central Ohio to double between 2018 and 2028, mostly because of new data centers."

Data centers are basically warehouses that hold rows and rows of computer equipment storing photos, videos, emails and other information used by consumers and businesses.


But the surge in demand for electricity is being driven by the rise of artificial intelligence that businesses and governments depend on to operate, said Kenny McDonald, president and CEO of the Columbus Partnership.

Running ChatGPT, for example, requires far more power than a Google search, he said.

"The energy needed to get an answer back on that query is a lot more," he said.

The most recent forecast from the Public Utilities Commission of Ohio in 2021 predicted future demand for electricity to be mostly flat in the state. The next forecast that is due out in a few months likely will tell a different story.

PJM Interconnect, the operator of the grid that oversees the flow of electricity in all or parts of 13 states, including Ohio, and the District of Columbia, is warning of potential shortfalls in electricity in the future caused by the retirement of big power plants and growing demand for electricity from data centers, new manufacturing and the growth of electric vehicles.

The growth in data centers in the region is quickly taking up capacity in the region, according to AEP.

"We are working to build new infrastructure to add capacity, but this process will take time. For now, we are carefully managing electric demand on our lines and equipment, and power usage from data centers does not pose a threat to the electric grid in central Ohio," the company said.

"We are committed to working with our customers, stakeholders, and regulators to find ways to serve new data centers while maintaining safe, reliable, and affordable electric service to all our other central Ohio customers."

AEP said it is seeing a small amount of data center development in other parts of its Ohio service territory, but most of its data center growth is here.

"Utilities across the country are experiencing similar circumstances," the company said. "The industry is focused on providing reliable power for all customers while supporting economic growth."



COMMENTS

  1. PDF The Rapid Adoption of Data-Driven Decision Making

    Better data creates opportunities to make better decisions. New digital technologies have vastly increased the scale and scope of data available to managers. We find that between 2005 and 2010, the share of manufacturing plants that adopted data-driven decision-making nearly tripled to 30 percent.

  2. PDF Data-driven Optimization Under Uncertainty in The Era of Big

    This dissertation deals with the development of fundamental data-driven optimization under uncertainty, including its modeling frameworks, solution algorithms, and a wide variety of applications. Specifically, three research aims are proposed, including data-driven distributionally robust optimization for hedging against distributional

  3. PDF Distributed and Data-Driven Decision-Making for Sustainable Power Systems

Distributed and Data-Driven Decision-Making for Sustainable Power Systems Citation Chen, Xin. 2022. Distributed and Data-Driven Decision-Making for Sustainable Power Systems. Doctoral dissertation, Harvard University Graduate School of Arts and Sciences. Permanent link ... Distributed and Data-Driven Decision-Making for Sustainable Power Systems

  4. CREATING A CULTURE OF DATA-DRIVEN DECISION-MAKING Kevin Rogers Doctoral

    are many times unable to become data-driven despite their technical abilities. The study aims to provide greater insight into these phenomena by analyzing an organization in the transportation industry in the United States. Trucking organizations provide a great backdrop for studying data-driven cultures (Alameen et al., 2016; Roth, 2016).

  5. Distributed and Data-Driven Decision-Making for Sustainable Power Systems

    Accordingly, this dissertation develops data-driven and distributed decision-making algorithms to tackle these two challenges. In particular, this dissertation studies four key problems in sustainable power systems, including frequency regulation, voltage control, power flexibility aggregation, and demand response in Chapter 2 - Chapter 5 ...

  6. Full article: DECAS: a modern data-driven decision theory for big data

    6. DECAS as a modern data-driven decision theory. The proposed theory was named DECAS, or the theory encompassing the Decision-making process, dEcision maker, deCision, dAta, and analyticS. DECAS is an incremental qualitative theory which aims to add to the previous concepts of classical decision making.

  7. PDF Journey From Data Into Instruction: How Teacher Teams Engage in Data

    2008). In this manner, policy makers and school leaders argue that data-driven instruction supports both student and teacher learning and thus can drive improvement efforts and organizational learning in schools (Goldring & Berends, 2008). The use of interim assessments is one of the most common forms of data-driven

  8. Data-driven sequential decision making by understanding and adopting

    In this dissertation, we are interested in, perhaps, one of the most natural forms of learning that humans engage in: learning from observations. We would like to focus on algorithms that enable data-driven learning of sequential decision making policies by observing optimal behavior demonstrated by other rational agents.

  9. Early Childhood Educators' use of Students' Assessments for Data-driven

    Data-driven decision making (DDDM) is a process of making decisions based on data rather than intuition or observation alone (Miller, 2019). DDDM is the systematic analysis of student data from internal and/or external sources of a school to drive teachers' educational planning and practices (Prenger & Schildkamp, 2018). In alignment

  10. Data-Driven Learning and Resource Allocation in Healthcare Operations

    This dissertation is broadly about sequential decision-making and statistical learning under limited resources. In this area, we treat sequentially arriving individuals, each of which should be assigned to the most appropriate resource. ... We provide data-driven and personalized methodologies for this class of problems. Our data-driven methods ...

  11. (PDF) Data-Driven Decisions in Smart Cities: A Digital ...

    Finding innovative ways to use this data helps improve city management and urban development. A data-driven city utilizes datafication to optimize its operations, functions, services, strategies ...

  12. Marketing in a data-driven digital world: Implications for the role and

How data-driven marketing helped expand the scope and role of marketing? 2.1. Creativity. In 1879, one of the pioneer advertising agencies, N.W. Ayer & Son, ... Shah's research has been a finalist or winner of six best paper awards and three dissertation-based awards. He is a 2015 MSI Young Scholar, 2018 recipient of Vardarajan Early Career ...

  13. (PDF) Data-driven decision making

The data driven decision making could be defined as "the practice of basing decisions on the analysis of the data rather than purely on intuition" [16]. In this sense, we can quickly note ...

  14. PDF Linguistic Knowledge in Data-Driven Natural Language Processing

The central goal of this thesis is to bridge the divide between theoretical linguistics—the scientific inquiry of language—and applied data-driven statistical language processing, to provide deeper insight into data and to build more powerful, robust models. To corroborate the practi-

  15. Data-Driven Requirements Elicitation: A Systematic Literature Review

  16. The anatomy of the data-driven smart sustainable city: instrumentation

    'Data-driven smart sustainable cities' is a term that has recently gained traction in academia, government, and industry to describe cities that are increasingly composed and monitored by ICT of ubiquitous and pervasive computing, and thus allow advanced technologies to be used by city operations centers, planning and policy offices, research ...

  17. Research Topics & Ideas: Data Science

    If you're just starting out exploring data science-related topics for your dissertation, thesis or research project, you've come to the right place. In this post, we'll help kickstart your research by providing a hearty list of data science and analytics-related research ideas, including examples from recent studies. PS - This is just the start…

  18. Graduate Thesis Or Dissertation

    In recent years, data-driven model discovery has become increasingly popular due to rapid advances in computational power, and data processing and storage procedures. ... This dissertation considers modern sparse regression techniques to robustly recover governing equations of nonlinear dynamical systems from noisy state measurements.
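
    A common concrete form of sparse-regression model discovery is to regress measured derivatives onto a library of candidate terms and prune small coefficients. The sketch below assumes that reading; the one-dimensional system, the noise level, and the hard-threshold step are made up and not necessarily the techniques developed in this dissertation.

      # Sparse-regression sketch for model discovery: fit dx/dt against
      # a library of candidate functions of x, then zero out small terms.
      import numpy as np

      rng = np.random.default_rng(1)
      x = np.linspace(-2, 2, 200)
      # Noisy "measurements" of an assumed system: dx/dt = -2x + 0.5x^3
      dxdt = -2.0 * x + 0.5 * x**3 + 0.05 * rng.normal(size=x.size)

      # Candidate library of terms: [1, x, x^2, x^3]
      library = np.column_stack([np.ones_like(x), x, x**2, x**3])

      coef, *_ = np.linalg.lstsq(library, dxdt, rcond=None)
      coef[np.abs(coef) < 0.1] = 0.0   # hard threshold enforces sparsity

      terms = ["1", "x", "x^2", "x^3"]
      recovered = " + ".join(f"{c:.2f}*{t}" for c, t in zip(coef, terms) if c)
      print("dx/dt ~", recovered)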

  19. PhD Dissertations

    PhD Dissertations [All are .pdf files]: Probabilistic Reinforcement Learning: Using Data to Define Desired Outcomes, and Inferring How to Get There (Benjamin Eysenbach, 2023); Data-driven Decisions - An Anomaly Detection Perspective (Shubhranshu Shekhar, 2023); Methods and Applications of Explainable Machine Learning (Joon Sik Kim, 2023); Applied Mathematics of the Future (Kin G. Olivares, 2023).

  20. Dissertation

    Title: Advancing Data-Driven Environmental Decision-Making and Governance in China. My dissertation comprises four articles that seek to understand the strengths and limitations of applying empirical approaches to environmental policy evaluation and decision-making, with a particular focus on China.

  21. Data Consolidation: The Key To Unlocking AI's Transformative ...

    Step 2: Standardize Data Formats. Standardizing data formats is crucial for unleashing AI's full potential. Data inconsistency can hinder AI-driven strategies, making it essential to ensure ...

  22. Q1 2024 PitchBook-NVCA Venture Monitor

    The PitchBook-NVCA Venture Monitor, sponsored by J.P. Morgan, Dentons, and Deloitte, presents this data and more, diving into the themes and trends of the current US venture market. The Q1 2024 PitchBook-NVCA Venture Monitor presents a data-driven overview of the key trends defining the US venture capital landscape.

  23. US-Europe Gripes on China Overcapacity Aren't All Backed by Data

  24. Inflation runs hot for third straight month, driven by gas prices and

    Inflation remains the stickiest of problems for the U.S. economy, with the March consumer price index coming in hotter than expected — the third straight ...

  25. In Battle Over Health Care Costs, Private Equity Plays Both Sides

    Insurance companies have long blamed private-equity-owned hospitals and physician groups for exorbitant billing that drives up health care costs. But a tool backed ...

  26. Hallucinations are the bane of AI-driven insights. Here's what search

    By adopting the ranking techniques of search engines and favoring high-quality data sources, AI-powered applications for businesses become far more reliable. The humility to say 'I don't know' ...

  27. AEP Ohio sees surging demand for electricity driven by data centers

    Data centers in central Ohio are gobbling up vast amounts of electricity so fast that American Electric Power expects demand for power to double between 2018 and 2028. "We are ...