• Open access
  • Published: 29 May 2021

Big data quality framework: a holistic approach to continuous quality management

  • Ikbal Taleb 1,
  • Mohamed Adel Serhani 2 (ORCID: 0000-0001-7001-3710),
  • Chafik Bouhaddioui 3 &
  • Rachida Dssouli 4

Journal of Big Data, volume 8, Article number: 76 (2021)


Big Data is an essential research area for governments, institutions, and private agencies to support their analytics decisions. Big Data encompasses everything about data: how it is collected, processed, and analyzed to generate value-added, data-driven insights and decisions. Degradation in data quality may result in unpredictable consequences; in this case, confidence in the data and its source, and their worthiness, are lost. In the Big Data context, data characteristics such as volume, multiple heterogeneous data sources, and fast data generation increase the risk of quality degradation and require efficient mechanisms to check data worthiness. However, ensuring Big Data Quality (BDQ) is a very costly and time-consuming process, since excessive computing resources are required. Maintaining quality through the Big Data lifecycle requires quality profiling and verification before any processing decision. A BDQ management framework for enhancing the pre-processing activities while strengthening data control is proposed. The proposed framework uses a new concept called the Big Data Quality Profile, which captures the quality outline, requirements, attributes, dimensions, scores, and rules. Using the Big Data profiling and sampling components of the framework, a fast and efficient data quality estimation is initiated before and after an intermediate pre-processing phase. The exploratory profiling component of the framework plays an initial role in quality profiling: it uses a set of predefined quality metrics to evaluate important data quality dimensions, and it generates quality rules by applying various pre-processing activities and their related functions. These rules feed the Data Quality Profile and result in quality scores for the selected quality attributes. The framework implementation and the dataflow management across the various quality management processes are discussed; ongoing work on framework evaluation and deployment to support quality evaluation decisions concludes the paper.

Introduction

Big Data is universal [ 1 ]; it consists of large volumes of data of unconventional types, which may be structured, unstructured, or in continuous motion. Whether it is used by industry and governments or by research institutions, new ways of handling Big Data, from the technology perspective to research approaches in its management, are highly required to support data-driven decisions. The expectations from Big Data analytics vary from trend finding to pattern discovery in different application domains such as healthcare, business, and scientific exploration; the aim is to extract significant insights and decisions. Extracting this precious information from large datasets is not an easy task: dedicated planning and an appropriate selection of tools and techniques are needed to optimize the exploration of Big Data.

Owning a huge amount of data does not often lead to valuable insights and decisions since Big Data does not necessarily mean Big insights. In fact, it can complicate the processes involved in fulfilling such expectations. Also, a lot of resources may be required, in addition to adapting the existing analytics algorithms to cope with Big Data requirements. Generally, data is not ready to be processed as it is. It should go through many stages, including cleansing and pre-processing, before undergoing any refining, evaluation, and preparation treatment for the next stages along its lifecycle.

Data Quality (DQ) is a very important aspect of Big Data for assessing the aforementioned pre-processing data transformations. This is because Big Data is mostly obtained from the web, social networks, and the IoT, where it may be found in structured or unstructured form, with no schema and possibly with no quality properties. Exploring data profiling, and more specifically DQ profiling, is essential before data preparation and pre-processing for both structured and unstructured data. Also, a DQ assessment should be conducted for all data-related content, including attributes and features. An analysis of the assessment results can then provide the necessary elements to enhance, control, monitor, and enforce DQ along the Big Data lifecycle; for example, maintaining high data quality (conforming to its requirements) in the processing phase.

Data quality has been an active and attractive research area for several years [ 2 , 3 ]. In the context of Big Data, quality assessment processes are hard to implement, since they are time- and cost-consuming, especially for the pre-processing activities. These issues are intensified because the available quality assessment techniques were initially developed for well-structured data and are not fully appropriate for Big Data. Consequently, new data quality processes must be carefully developed to assess the data origin, domain, format, and type, and an appropriate DQ management scheme is critical when dealing with Big Data. Furthermore, Big Data architectures do not incorporate quality assessment practices throughout the Big Data lifecycle apart from pre-processing, and the new initiatives that exist are still limited to specific applications [ 4 , 5 , 6 ]. However, the evaluation and estimation of Big Data quality should be handled in all phases of the Big Data lifecycle, from data inception to analytics, thus supporting data-driven decisions.

The work presented in this paper is related to Big Data quality management through the Big Data lifecycle. The objective of such a management perspective is to provide users or data scientists with a framework capable of managing DQ from its inception to analytics and visualization, thereby supporting decisions. The definition of acceptable Big Data quality depends largely on the type of application and the Big Data requirements. The need for a Big Data quality evaluation before engaging in any Big Data related project is pressing, because the high costs of processing useless data at an early stage of the lifecycle can thus be prevented. Further challenges to the data quality evaluation process may occur when dealing with unstructured, schema-less data collected from multiple sources. Moreover, a Big Data Quality Management Framework can provide quality management mechanisms to handle and ensure data quality throughout the Big Data lifecycle by:

Improving the processes of the Big Data lifecycle to be quality-driven, in a way that it integrates quality assessment (built-in) at every stage of the Big Data architecture.

Providing quality assessment and enhancement mechanisms to support cross-process data quality enforcement.

Introducing the concept of Big Data Quality Profile (DQP) to manage and trace the whole data pre-processing procedures from data source selection to final pre-processed data and beyond (processing and analytics).

Supporting profiling of data quality and quality rules discovery based on quantitative quality assessments.

Supporting deep quality assessment using qualitative quality evaluations on data samples obtained using data reduction techniques.

Supporting data-driven decision making based on the latest data assessments and analytics results.

The remainder of this paper is organized as follows. In Sect. " Overview and background ", we provide background on Big Data and data quality and introduce the problem statement and the research objectives. The research literature related to Big Data quality assessment approaches is presented in Sect. " Related research studies ". The components of the proposed framework and an explanation of their main functionalities are described in Sect. " Big data quality management framework ". Finally, the implementation discussion and dataflow management are detailed in Sect. " Implementations: Dataflow and quality processes development ", whereas Sect. " Conclusion " concludes the paper and points to our ongoing research developments.

Overview and background

An exponential increase in global inter-network activities and data storage has triggered the Big Data era. Application platforms and devices, including Facebook, Amazon, Twitter, YouTube, Internet of Things sensors, and mobile smartphones, are the main players and data generators. The amount of data generated daily is around 2.5 quintillion bytes (2.5 exabytes; 1 EB = \(10^{18}\) bytes).

According to IBM, Big Data is a high-volume, high-velocity, and high-variety information asset that demands cost-effective, innovative forms of information processing for enhanced insights and decision-making. It is used to describe a massive volume of both structured and unstructured data; therefore, Big Data processing using traditional database and software tools is a difficult task. Big Data also refers to the technologies and storage facilities required by an organization to handle and manage large amounts of data.

Originally, in [ 7 ], the McKinsey Global Institute identified three Big Data characteristics, commonly known as the ''3Vs'': Volume, Variety, and Velocity [ 1 , 7 , 8 , 9 , 10 , 11 ]. These characteristics have since been extended to more dimensions, with up to 10 Vs proposed (e.g., Volume, Velocity, Variety, Veracity, Value, Vitality, Viscosity, Visualization, and Vulnerability) [ 12 , 13 , 14 ].

In [ 10 , 15 , 16 ], the authors define important Big Data system architectures. The data in Big Data comes from (1) heterogeneous data sources (e-Gov: census data; social networking: Facebook; Web: Google page-rank data), (2) data in different formats (video, text), and (3) data of various forms (unstructured: raw text data with no schema; semi-structured: metadata, graph structure as text). Moreover, the data travels through different stages, composing the Big Data lifecycle. Many aspects of Big Data architectures were compiled from the literature; our enhanced design contributions are illustrated in Fig.  1 and described as follows:

Data generation: this is the phase of data creation. Many data sources can generate this data such as electrophysiology signals, sensors used to gather climate information, surveillance devices, posts to social media sites, videos and still images, transaction records, stock market indices, GPS location, etc.

Data acquisition: it consists of data collection, data transmission, and data pre-processing [ 1 , 10 ]. Due to the exponential growth and availability of heterogeneous data production sources, an unprecedented amount of structured, semi-structured, and unstructured data is available. Therefore, the Big Data Pre-Processing consists of typical data pre-processing activities: integration, enhancements and enrichment, transformation, reduction, discretization, and cleansing .

Data storage: it consists of the data center infrastructure, where the data is stored and distributed among several clusters and data centers, spread geographically around the world. The software storage is supported by the Hadoop ecosystem to ensure a certain degree of fault tolerance storage reliability and efficiency through replication. The data storage stage is responsible for all input and output data that circulates within the lifecycle.

Data analysis: (Processing, Analytics, and Visualization); it involves the application of data mining and machine learning algorithms to process the data and extract useful insights for better decision making. Data scientists are the most valuable users of this phase since they have the expertise to apply what is needed, on what must be analyzed.

Figure 1: Big data lifecycle value chain

Data quality, quality dimensions, and metrics

The majority of studies in the area of DQ originate from the database [ 2 , 3 ] and management research communities. According to [ 17 ], DQ is not an easy concept to define; its definition depends on awareness of the data domain. There is a consensus that data quality always depends on the quality of the data source [ 18 ], which highlights that enormous quality issues may be hidden inside the data and its values.

In the following, the definitions of data quality, data quality dimensions, and quality metrics and their measurements are given:

Data quality: It has many meanings, related to the context, domain, area, and field in which the data is used [ 19 , 20 ]. Academia interprets DQ differently from industry. In [ 21 ], data quality is reduced to "The capability of data to satisfy stated and implied needs when used under specified conditions"; DQ is also defined as "fitness for use". Similarly, [ 20 ] defines data quality as the property corresponding to quality management, i.e., data that is appropriate for use or meets user needs.

Data quality dimensions: DQDs are used to measure, quantify, and manage DQ [ 20 , 22 , 23 ]. Each quality dimension has a specific metric that measures its performance. There are several DQDs; according to [ 24 , 25 ], they can be organized into four categories: intrinsic, contextual, accessibility, and representational [ 14 , 15 , 22 , 24 , 26 , 27 ]. Two important categories (intrinsic and contextual) are illustrated in Fig.  2 , and examples of intrinsic quality dimensions are given in Table 1 .

Metrics and measurements: Once the data is generated, its quality should be measured, which means that a data-driven strategy is considered to act on the data. Hence, it is mandatory to measure and quantify the DQDs. Structured or semi-structured data is available as a set of attributes represented in columns or rows, whose values are recorded accordingly. In [ 28 ], a quality metric is defined as a quantitative or categorical representation of one or more attributes. Any data quality metric should define whether the values of an attribute respect a targeted quality dimension. The author of [ 29 ] noted that data quality measurement metrics tend to produce binary results (correct or incorrect) or a value between 0 and 100 (with 100% representing the highest quality); this applies to quality dimensions such as accuracy, completeness, consistency, and currency. Examples of DQD metrics are illustrated in Table 2 .

Figure 2: Data quality dimensions

DQDs must be relevant to the data quality problems that have been identified. Thus, a metric measures whether attributes comply with the defined DQDs. These measurements are performed for each attribute, given its type and the ranges of values collected from the data profiling process, and they produce DQD scores for the designed metrics of all attributes [ 30 ]. Specific metrics need to be defined to estimate the quality dimensions of other data types, such as images, videos, and audio [ 5 ].
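As a minimal illustration of such metrics, the sketch below scores two common DQDs (completeness and accuracy) per attribute on a small pandas DataFrame; the metric definitions and the 0–120 age range are illustrative assumptions, not the paper's metric catalogue.

```python
import pandas as pd

def completeness(s: pd.Series) -> float:
    """Ratio of non-missing values in [0, 1] for one attribute."""
    return 1.0 - s.isna().mean()

def accuracy_in_range(s: pd.Series, lo: float, hi: float) -> float:
    """Ratio of non-missing values falling inside the expected range [lo, hi]."""
    return s.dropna().between(lo, hi).mean()

df = pd.DataFrame({"age": [25, None, 41, 230]})
print({"completeness": completeness(df["age"]),
       "accuracy": accuracy_in_range(df["age"], 0, 120)})
# -> completeness 0.75, accuracy ~0.667
```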

Big data characteristics and data quality

The main Big Data characteristics, commonly named the V's, were initially Volume, Velocity, Variety, and Veracity. Since the inception of Big Data, 10 V's have been defined, and new V's will probably be adopted [ 12 ]. For example, veracity expresses the trustworthiness of data, mostly known as data quality; accuracy is often related to precision, reliability, and veracity [ 31 ]. Our tentative mapping among these characteristics, data, and data quality is shown in Table 3 . It is based on the studies in [ 5 , 32 , 33 ], in which the authors attempted to link the V's to the data quality dimensions. In another study [ 34 ], the authors addressed the mapping of the DQD accuracy to the Big Data characteristic Volume and showed that the data size has an impact on DQ.

Big data lifecycle: where quality matters?

According to [ 21 , 35 ], data quality issues may appear in each phase of the Big Data value chain. Addressing data quality may follow different strategies, since each phase has its own features: improving the quality of existing data, and/or refining, reassessing, and redesigning the whole processes that generate and collect data, with the aim of improving their quality.

Big Data quality issues were addressed by many studies in the literature [ 36 , 37 , 38 ]. These studies generally elaborated on the issues and proposed generic frameworks with no comprehensive approaches and techniques to manage quality across the Big Data lifecycle. Among these, generic frameworks are presented in [ 5 , 39 , 40 ].

Figure 3 illustrates where data quality can and must be addressed in the Big Data value chain, across phases/stages (1) to (7).

In the data generation phase, there is a need to define how and what data is generated.

In the data transmission phase, the data distribution scheme relies on the underlying networks. Unreliable networks may affect data transfer. Its quality is expressed by data loss and transmission errors.

Data collection refers to where, when, and how the data is collected and handled. Well-defined, structured constraint verification must be established on the collected data.

The pre-processing phase is one of the main focus points of the proposed work. It follows a data-driven strategy, focused largely on the data itself. An evaluation process provides the necessary means to ensure the quality of data for the next phases; evaluating the DQ before (pre) and after (post) pre-processing on data samples is necessary to strengthen the DQP.

In the Big Data storage phase, some aspects of data quality, such as storage failure, are handled by replicating data on multiple storages. The latter is also valid for data transmission when a network fails to transmit data.

In the Data Processing and Analytics phases, the quality is influenced by both the applied process and data quality itself. Among the various data mining and machine learning algorithms and techniques suitable for Big Data, those that converge rapidly and consume fewer cloud resources will be highly adopted. The relation between DQ and the processing methods is substantial. A certain DQ requirement on these methods or algorithms might be imposed to ensure efficient performance.

For an ongoing, iterative value chain, the visualization phase may seem to be only a representation of the data in a fashionable way, such as a dashboard; it nevertheless helps decision-makers obtain a clear picture of the data and its valuable insights. Finally, in this work, Big Data is transformed into useful small data, which is easy to visualize and interpret.

Figure 3: Where quality matters in the big data lifecycle

Data quality issues

Data quality issues generally appear when the quality requirements are not met on the data values [ 41 ]. These issues are due to several factors or processes having occurred at different levels:

Data source level: unreliability, trust, data copying, inconsistency, multi-sources, and data domain.

Generation level: human data entry, sensors’ readings, social media, unstructured data, and missing values.

Process level (acquisition: collection, transmission).

In [ 21 , 35 , 42 ], many causes of poor data quality were enumerated, and a list of elements, which affect the quality and DQD’s was produced. This list is illustrated in Table 4 .

Related research studies

Research directions on Big Data differ between industry and academia. Industry scientists mainly focus on the technical implementations, infrastructures, and solutions for Big Data management, whereas researchers from academia tackle theoretical issues of Big Data. Academia’s efforts mainly include the development of new algorithms for data analytics, data replication, data distribution, and optimization of data handling. In this section, the literature review is classified into 3 categories, which are described in the following sub-sections.

Data quality assessment approaches

Existing studies have approached data quality from different perspectives. In the majority of the papers, the authors agree that data quality is related to the phases or processes of its lifecycle [ 8 ]; specifically, data quality is highly related to the data generation phases and/or to its origin. The methodologies adopted to assess data quality are based on traditional data strategies and should be adapted to Big Data. Moreover, the application domain and the type of information (content-based, context-based, or rating-based) affect the way the quality evaluation metrics are designed and applied. In content-based quality metrics, the information itself is used as a quality indicator, whereas in context-based metrics, metadata is used as the quality indicator.

There are two main strategies to improve data quality according to [ 20 , 23 ]: data-driven and process-driven. The first handles data quality in the pre-processing phase by applying pre-processing activities (PPAs) such as cleansing, filtering, and normalization; these PPAs are important and occur before the data processing stage, preferably as early as possible. The process-driven quality strategy, in contrast, is applied at each stage of the Big Data value chain.
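To make the data-driven strategy concrete, here is a minimal sketch chaining a few typical PPAs (cleansing, filtering, normalization) on a pandas DataFrame; the column names, thresholds, and function boundaries are illustrative assumptions, not steps prescribed by the framework.

```python
import pandas as pd

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    # Drop exact duplicates and rows missing the key attribute.
    return df.drop_duplicates().dropna(subset=["id"])

def filter_outliers(df: pd.DataFrame, col: str, lo: float, hi: float) -> pd.DataFrame:
    # Keep only observations whose value lies in a plausible range.
    return df[df[col].between(lo, hi)]

def normalize(df: pd.DataFrame, col: str) -> pd.DataFrame:
    # Min-max normalization of one attribute to [0, 1].
    df = df.copy()
    c = df[col]
    df[col] = (c - c.min()) / (c.max() - c.min())
    return df

raw = pd.DataFrame({"id": [1, 1, 2, None], "temp": [21.5, 21.5, 400.0, 19.0]})
clean = normalize(filter_outliers(cleanse(raw), "temp", -50, 60), "temp")
```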

Data quality assessment was discussed early in the literature [ 10 ]; it is divided into two main categories: subjective and objective. An approach combining these two categories to provide organizations with usable data quality metrics for evaluating their data was also proposed; however, it was not developed to deal with Big Data.

In summary, Big Data quality should be addressed early in the pre-processing stage during the data lifecycle. The aforementioned Big Data quality challenges have not been investigated in the literature from all perspectives. There are still many open issues, which must be addressed especially at the pre-processing stage.

Rule-based quality methodologies

Since the data quality concept is context-driven, it may differ from one application domain to another. The definition of quality rules involves establishing a set of constraints on data generation, entry, and creation. Poor data can always exist, and rules are created or discovered to correct or eliminate it. Rules themselves are only one part of the data quality assessment approach; establishing a consistent process for creating, discovering, and applying the quality rules should consider the following:

Characterize the quality of data being good or bad from its profile and quality requirements.

Select the data quality dimensions that apply to the data quality assessment context.

Generate quality rules based on data quality requirements, quantitative, and qualitative assessments.

Check, filter, optimize, validate, run, and test rules on data samples for efficient rules’ management.

Generate a statistical quality profile with quality rules. These rules represent an overview of successful valid rules with the expected quality levels.

Hereafter, the data quality rules are discovered from data quality evaluation. These rules will be used in Big Data pre-processing activities to improve the quality of data. The discovery process reveals many challenges, which should consider different factors, including data attributes, data quality dimensions, data quality rules discovery, and their relationship with pre-processing activities.

In (Lee et al. 2003), the authors concluded that data quality problems depend on data, time, and context. Quality rules are applied to the data to solve and/or avoid quality problems; accordingly, the rules must be continuously assessed, updated, and optimized.

Most studies on the discovery of data quality rules come from the database community. These studies are often based on conditional functional dependencies (CFDs) to detect inconsistencies in data. CFDs are used to formulate data quality rules, which are generally expressed manually and discovered automatically using several CFD approaches [ 3 , 43 ].
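To illustrate how a CFD can act as a data quality rule, the following hedged sketch encodes a dependency of the form "within a given country code, zip determines city" and flags violating tuples; the columns and values are hypothetical, and the check is a simplification of how CFD-based systems detect inconsistencies.

```python
import pandas as pd

# CFD: for records with country_code == "44", the dependency zip -> city must hold.
df = pd.DataFrame({
    "country_code": ["44", "44", "44", "01"],
    "zip": ["EH1", "EH1", "EH1", "10001"],
    "city": ["Edinburgh", "Edinburgh", "London", "NYC"],
})

scope = df[df["country_code"] == "44"]
# A zip value mapping to more than one city violates the dependency.
cities_per_zip = scope.groupby("zip")["city"].nunique()
bad_zips = cities_per_zip[cities_per_zip > 1].index
violations = scope[scope["zip"].isin(bad_zips)]
print(violations)  # tuples inconsistent under the CFD
```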

Data quality assessment in Big Data has been addressed in several studies. In [ 32 ], a Data Quality-in-Use model was proposed to assess the quality of Big Data. Business rules for data quality are used to decide on which data these rules must meet the pre-defined constraints or requirements. In [ 44 ], a new quality assessment approach was introduced and involved both the data provider and the data consumer. The assessment was mainly based on data consistency rules provided as metadata.

The majority of research studies on data quality and the discovery of data quality rules are based on CFDs and databases. In Big Data quality, the size, variety, and veracity of the data are key characteristics that must be considered; they should be handled before the pre-processing phase so as to reduce the quality assessment time and resources. Regarding quality rules, it is fundamental to use these rules to eliminate poor data and enforce quality on existing data, while following a data-driven quality context.

Big data pre-processing frameworks

Pre-processing data before performing any analytics is paramount. However, several challenges have emerged at this crucial phase of the Big Data value chain [ 10 ]. Data quality is one of these challenges and must be carefully considered in the Big Data context.

As pointed out in [ 45 ], data quality problems arise when dealing with multiple data sources, which significantly increases the need for data cleansing. Additionally, the large size of datasets, which arrive at an uncontrolled speed, places an overhead on the cleansing processes. In [ 46 , 47 , 48 ], NADEEF, an extensible data cleaning system, was proposed, and an extension of NADEEF for cleaning streaming Big Data was presented in [ 49 ]. The system addresses data quality through the data cleaning activity, using data quality rules and functional dependency rules [ 14 ].

Numerous other studies on Big Data management frameworks exist. In these studies, the authors surveyed and proposed Big Data management models dealing with storage, pre-processing, and processing [ 50 , 51 , 52 ]. An up-to-date review of the techniques and methods for each process involved in the management processes is also included.

The importance of quality evaluation in Big Data management has generally not been addressed. In some studies, Big Data characteristics are the only recommendations for quality; however, no mechanisms have been proposed to map or handle quality issues that might be a consequence of these Big Data V's. A Big Data management framework that includes data quality management must be developed to cope with end-to-end quality management across the Big Data lifecycle.

Finally, it is worth mentioning that research initiatives and solutions on Big Data quality are still in their preliminary phase; there is much to do on the development and standardization of Big Data quality. Big Data quality is a multidisciplinary, complex, and multi-variant domain, where new evaluation techniques, processing and analytics algorithms, storage and processing technologies, and platforms will play a key role in the development and maturity of this active research area. We anticipate that researchers from academia will contribute to the development of new Big Data quality approaches, algorithms, and optimization techniques, which will advance beyond the traditional approaches used in databases and data warehouses. Additionally, industries will lead development initiatives of new platforms, solutions, and technologies optimized to support end-to-end quality management within the Big Data lifecycle.

Big data quality management framework

The purpose of proposing a Big Data Quality Management Framework (BDQMF) is to address the quality at all stages of the Big Data lifecycle. This can be achieved by managing data quality before and after the pre-processing stage while providing feedback at each stage and loop back to the previous phase, whenever possible. We also believe that data quality must be handled at data inception. However, this is not considered in this work.

To overcome the limitations of the existing Big Data architectures for managing data quality, a Big Data Quality pre-processing approach is proposed: a Quality Framework [ 53 ]. In our framework, the quality evaluation process tends to extract the actual quality status of Big Data and proposes efficient actions to avoid, eliminate, or enhance poor data, thus improving its quality. The framework features the creation and management of a DQP and its repository. The proposed scheme deals with data quality evaluation before and after the pre-processing phase. These practices are essential to ensure a certain quality level for the next phases while maintaining the optimal cost of the evaluation.

In this work, a quantitative approach is used. It consists of an end-to-end data quality management system that deals with DQ through the execution of tasks ahead of pre-processing (pre-pre-processing) to evaluate BDQ on the data. It starts with data sampling, data and DQ profiling, and the gathering of user DQ requirements; it then proceeds to DQD evaluation and the discovery of quality rules from the quality scores and requirements. Each data quality rule is realized by one or many Pre-Processing Functions (PPFs) under a specific Pre-Processing Activity (PPA); a PPA, such as cleansing, aims at increasing data quality. Pre-processing is first applied to Big Data samples, whose quality is re-evaluated to update and certify that the quality profile is complete; it is then applied to the whole Big Dataset, not only to the data samples. Before this full pre-processing, the DQP is tuned and revisited by quality experts for endorsement, based on an equivalent data quality report that states the quality scores of the data, not the rules.

Framework description

The BDQM framework is illustrated in Fig.  4 , where all the components cooperate, relying on the Data Quality Profile. It is initially created as a Data Profile and is progressively extended from the data collection phase to the analytics phase to capture important quality-related information. For example, it contains quality requirements, targeted data quality dimensions, quality scores, and quality rules.

Figure 4: Big data quality management framework (BDQMF)

Data lifecycle stages are part of the BDQMF. The feedback generated at all stages is analyzed and used to correct and improve data quality and to detect any DQ-management-related failures. The key components of the proposed BDQMF include:

Big Data Quality Project (Data Sources, Data Model, User/App Quality Requirements, Data domain),

Data Quality Profile and its Repository,

Data Preparation (Sampling and Profiling),

Exploratory Quality Profiling,

Quality Parameters and Mapping,

Quantitative Quality Evaluation,

Quality Control,

Quality Rules Discovery,

Quality Rules Validation,

Quality Rules Optimization,

Big Data Pre-Processing,

Data Processing,

Data Visualization, and

Quality Monitoring.

A detailed description of each of these components is provided hereafter.

Framework key components

In the following sub-sections, each component is described. Its input(s) and output(s), its main functions, and its roles and interactions with the other framework’s components, are also described. Consequently, at each Big Data stage, the Data Quality Profile is created, updated, and adapted until it achieves the quality requirements already set by the users or applications at the beginning of the Big Data Quality Project.

Big data quality project module

The Big Data Quality Project Module contains all the elements that define the data sources and the quality requirements set by either the Big Data users or Big Data applications, representing the quality foundations of the Big Data project. As illustrated in Fig. 5 , any Big Data Quality Project should specify a set of quality requirements as targeted quality goals.

It represents the first module of the framework. The Big Data quality project is the starting point of the BDQMF, where specifications of the data model, the data sources, and the targeted quality goals for DQDs and data attributes are defined. These requirements are represented as data quality scores/ratios, which express the acceptance level of the evaluated data quality dimensions. For example, 80% data accuracy, 60% data completeness, and 85% data consistency may be judged by quality experts as accepted levels (or tolerance ratios). These levels can be relaxed using a range of values, depending on the context, the application domain, and the targeted processing algorithm's requirements.

Let us denote by BDQP(DS, DS', Req) a Big Data Quality Project request that initiates several automatic processes:

A data sampling and profiling process.

An exploratory quality profiling process, which is included in many quality assessment procedures.

A pre-processing phase, which is considered if the required quality scores are not met.

The BDQP contains the input dataset DS, the output dataset DS', and the quality requirements Req, presented as a tuple of sets Req = (D, L, A), where:

D represents the set of data quality dimensions (DQDs) to be evaluated (e.g., accuracy, consistency): \(D=\{d_{0},\dots,d_{i},\dots,d_{m}\}\),

L is the set of DQD acceptance (tolerance) level ratios (%), set by the user or the application related to the quality project, one per DQD: \(L=\{l_{0},\dots,l_{i},\dots,l_{m}\}\),

A is the set of targeted data attributes. If it is not specified, the DQDs are assessed for the whole dataset, i.e., all possible attributes; since some dimensions need more detailed requirements to be assessed, this depends on the DQD and the attribute type: \(A=\{a_{0},\dots,a_{i},\dots,a_{m}\}\)

The data quality requirements might be updated with further aspects once the profiling component provides well-detailed information about the data ( DQP Level 0 ). This update is performed within the quality mapping component, which interfaces with expert users to refine, reconfirm, and restructure their data quality parameters over the data attributes.

Data sources: There are multiple Big Data sources. Most of them are generated from the new media (e.g., social media) on the Internet, while other sources arise from new technologies such as the cloud, sensors, and the IoT.

Data users, data applications, and quality requirements: This module identifies and specifies the input sources of the quality requirement parameters for the data sources. These sources include users' quality requirements (e.g., domain experts, researchers, analysts, and data scientists) or application quality requirements (applications may vary from simple data processing to machine learning or AI-based applications). For users, a dashboard-like interface captures the users' data requirements and other quality information. This interface can be enriched with information from the data sources, such as attributes and their types, if available; this can efficiently guide users to the inputs and ensure the right data is used. This phase can be initiated after sample profiling or exploratory quality profiling; otherwise, a general quality request is entered in the form of targeted data quality dimensions and their expected quality scores after the pre-processing phase. All the quality requirement parameters and settings are recorded in the Data Quality Profile ( DQP 0 ); DQP Level 0 is created when the quality project is set.

The quality requirements are specifically set as quality score ratios, goals, or targets to be achieved by the BDQMF. They are expressed as targeted DQDs in the Big Data Quality Project.

Let us denote by Req a set of quality requirements, Req = \(\{r_{0},\dots,r_{i},\dots,r_{m}\}\), constructed from the tuple (D, L, A). Each element of the Req list is a quality requirement \(r_{i}=(d_{i},l_{i},a_{i})\), where \(r_{i}\) targets the DQD \(d_{i}\) with a minimum accepted ratio level \(l_{i}\) for all attributes or a sub-list of selected attributes \(a_{i}\).

The initial DQP originating from this module is a DQP Level 0, containing the following tuple, as illustrated in Fig.  6 : BDQP(DS, DS', Req) with Req = (D, L, A)
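As a concrete, non-authoritative reading of these definitions, the sketch below shows one way a BDQP request and its Req = (D, L, A) tuple could be represented in code; the class and field names are our own illustrative assumptions, and the example levels reuse the 80%/60%/85% figures quoted above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class QualityRequirement:
    dimension: str                           # d_i, e.g. "completeness"
    min_level: float                         # l_i, accepted ratio in [0, 1]
    attributes: Optional[List[str]] = None   # a_i; None means all attributes

@dataclass
class BDQProject:
    input_dataset: str                       # DS
    output_dataset: str                      # DS'
    requirements: List[QualityRequirement] = field(default_factory=list)

req = [
    QualityRequirement("accuracy", 0.80),
    QualityRequirement("completeness", 0.60, ["age", "zip"]),
    QualityRequirement("consistency", 0.85),
]
project = BDQProject("hdfs://raw/ds", "hdfs://curated/ds", req)
```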

Data models and data domains

Data models: If the data is structured, a schema is provided to add more detailed quality settings for all attributes. Otherwise, when no such attributes or types exist, the data is considered unstructured, and its quality evaluation relies on a set of general Quality Indicators (QIs). In our framework, these QIs are provided especially for cases where a direct identification of DQDs is not available for an easy quality assessment.

Data domains: Each data domain has a unique set of default quality requirements; some are very sensitive to accuracy and completeness, while others prioritize data currency and timeliness. This module adds value for users or applications when it comes to quality requirements elicitation.

Figure 6: BDQP and quality requirements settings

Figure 7: Exploratory quality profiling modules

Data quality profile creation: Once the Big Data Quality Project (BDQP) is initiated, the DQP level 0 (DQP0) is created and consists of the following elements, as illustrated in Fig. 7 :

Data sources information, which may include datasets, location, URL, origin, type, and size.

Information about data that can be created or extracted from metadata if available, such as database schema, data attributes names and types, data profile, or basic data profile.

Data domains such as business, health, commerce, or transportation.

Data users, which may include the names and positions of each member of the project, security credentials, and data access levels.

Data application platforms, software, programming languages, or applications that are used to process the data. These may include R, Python, Java, Julia, Orange, Rapid Miner, SPSS, Spark, and Hadoop.

Data quality requirements: for each dataset, the expected quality ratios and the tolerance levels within which the data is accepted; otherwise, the data is discarded or repaired. These can also be set as a range of quality tolerance levels. For example, if the DQD completeness is required to be equal to or higher than 67%, then the accepted ratio of missing values is equal to or less than 33% (100% − 67%).
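Pulling the preceding elements together, a DQP Level 0 might be serialized as the structure below. This is only a hedged sketch: the field names and values are invented for illustration and do not reproduce the framework's actual profile format.

```python
dqp0 = {
    "level": 0,
    "data_sources": [{"name": "sensors_2021", "type": "csv",
                      "url": "hdfs://datalake/sensors", "size_gb": 120}],
    "schema": {"attributes": [{"name": "temp", "type": "float"},
                              {"name": "zip", "type": "string"}]},
    "domain": "transportation",
    "users": [{"name": "analyst_1", "role": "data scientist",
               "access_level": "read"}],
    "platforms": ["Spark", "Python"],
    "quality_requirements": [
        # completeness >= 67%  <=>  missing values <= 33%
        {"dimension": "completeness", "min_level": 0.67},
    ],
}
```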

Data quality profile (DQP) and repository (DQPREPO)

We describe hereafter the content of DQP and the DQP repository and the DQP levels captured through the lifecycle of framework processes.

  • Data quality profile

The data quality profile is generated once a Big Data Quality Project is created. It contains, for example, information about the data sources, domain, attributes, or features. This information may be retrieved from metadata, data provenance, schemas, or the dataset itself. If it is not available, data preparation (sampling and profiling) is needed to collect and extract the important information that supports the upcoming processes; at this point, the Data Profile (DP) is created.

Exploratory quality profiling then generates a quality rules proposal list. The DP is updated with these rules and converted into a DQP. This helps the user obtain an overview of some DQDs and make a better attribute selection based on this first quality approximation, with a ready-to-use list of rules for pre-processing.

The user/application quality requirements (quality tolerance levels, DQDs, and targeted attributes) are set and added to the DQP. An update and tune-up of the previously proposed quality rules is most likely; alternatively, a complete redefinition of the quality requirement parameters is performed.

The mapping and selection phase will update the DQP with a DQES, which contains the set of attributes to be evaluated for a set of DQDs, using a set of metrics from the DQP repository.

The Quantitative Quality Evaluation component assesses the DQ and updates the DQES with DQD Scores.

The DQES scores then pass through quality control; if they are validated, the DQP is executed in the pre-processing stage and confirmed in the repository.

If the scores (checked against the quality requirements) are not valid, quality rules discovery, validation, and optimization add to or update the DQP configuration to obtain valid DQD scores that satisfy the quality requirements.

Continuous quality monitoring is performed; an eventual DQ failure triggers a DQP update.

The DQP Repository: The DQPREPO contains detailed data quality profiles per data source and dataset. The information managed by the repository includes:

Data Quality User/App requirements.

Data Profiles, Metadata, and Data Provenance.

Data Quality Profiles (e.g. Data Quality Evaluation Schemes, and Data Quality Rules).

Data Quality Dimensions and related Metrics (metrics formulas and aggregate functions).

Data Domains (DQD’s, BD Characteristics).

DQD’s vs BD Characteristics.

Pre-processing Activities (e.g. Cleansing, and Normalizing) and functions (to replace missing values).

DQD’s vs DQ Issues vs PPF: Pre-processing Functions.

DQD’s priority processing in Quality Rules.

At every stage, module, task, or process, the DQP repository is incrementally updated with quality-related information. This includes, for example, quality requirements, DQES, DQD scores, data quality rules, Pre-Processing activities, activity functions, DQD metrics, and Data Profiles. Moreover, the DQP’s are organized per Data Domain and datatype to allow reuse. Adaptation is performed in the case of additional Big Datasets.

In Table 5 , an example of DQP Repository managed information along with its preprocessing activities (PPA) and their related functions (PPAF), is presented.

DQP lifecycle (Levels) : The DQP goes through the complete process flow of the proposed BDQMF. It starts with the specification of the Big Data Quality Project and ends with quality monitoring as an ongoing process that closes the quality enforcement loop and triggers other processes, which handle DQP adaptation, upgrade, or reuse. In Table 6 , the various DQP levels and their interaction within the BDQM Framework components are described. Each component involves process operations applied to the DQP.

Data preparation: sampling and profiling

Data preparation generates representative Big Data samples that serve as an entry for profiling, quality evaluation, and quality rules validation.

Sampling: Several sampling strategies can be applied to Big Data, as surveyed in [ 54 , 55 ], where the authors evaluated the effect of sampling methods on Big Data and concluded that sampling large datasets reduces the run-time and computational footprint of link prediction algorithms while maintaining adequate prediction performance. In statistics, the bootstrap technique evaluates the sampling distribution of an estimator by resampling with replacement from the original sample. In the Big Data context, bootstrap sampling has been studied in several works [ 56 , 57 ]. In the proposed data quality evaluation scheme, we use the Bag of Little Bootstraps (BLB) [ 58 ], which combines the results of bootstrapping multiple small subsets of a Big Data dataset. The BLB algorithm draws small samples from the original Big Dataset without replacement; for each such sample, another set of samples is created by resampling with replacement.
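The following is a minimal sketch of a BLB-style estimation of a quality score, assuming the data fits in a NumPy array; the subset counts, sizes, and the completeness statistic are illustrative choices, and the scheme is simplified with respect to the full BLB algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def blb_estimate(data, stat, n_subsets=5, subset_size=1000, n_boot=50):
    """BLB-style estimate: average a statistic over bootstrapped small subsets."""
    estimates = []
    for _ in range(n_subsets):
        # Small subset drawn WITHOUT replacement from the big dataset.
        subset = rng.choice(data, size=subset_size, replace=False)
        # Bootstrap WITH replacement inside the subset.
        boots = [stat(rng.choice(subset, size=subset_size, replace=True))
                 for _ in range(n_boot)]
        estimates.append(np.mean(boots))
    return float(np.mean(estimates))

data = rng.normal(size=1_000_000)
data[rng.random(data.size) < 0.1] = np.nan   # inject ~10% missing values

completeness = blb_estimate(data, lambda s: 1.0 - np.isnan(s).mean())
print(f"estimated completeness ~ {completeness:.3f}")
```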

Profiling: The data profiling module performs data quality screening based on statistics and information summaries [ 59 , 60 , 61 ]. Since profiling is meant to discover data characteristics from data sources, it is considered a data assessment process that provides a first summary of data quality, reported in the data profile. Such information includes, for example, the data format description, the different attributes, their types and values, basic quality dimension evaluations, data constraints (if any), and data ranges (max and min, a set of specific values, or subsets).

More precisely, the information about the data is of two types: technical and functional. This information can be extracted from the data itself without any additional representation, using metadata or a descriptive header file, or by parsing the data with analysis tools. This task may become very costly with Big Data; therefore, to avoid the costs generated by the data size, the same BLB-based sampling process is used, reducing the data to a representative population sample and combining the profiling results. A data profile in the proposed framework is represented as a data quality profile of the first level ( DQP1 ), generated after the profiling phase. Moreover, data profiling provides useful information that leads to significant data quality rules, usually called data constraints; these rules are mostly equivalent to a structured-data schema, represented as technical and functional rules.

According to [ 61 ], there are many activities and techniques used to profile the data. These may range from online, incremental, and structural, to continuous profiling. Profiling tasks aim at discovering information about the data schema. Some data sources are already provided with their data profiles, sometimes with minimal information. In the following, some other techniques are introduced. These techniques can enrich and bring value-added information to a data profile:

Data provenance inquiry : it tracks the data origin and provides information about data transformations, data copying, and its related data quality through the data lifecycle [ 62 , 63 , 64 ].

Metadata : it provides descriptive and structural information about the data. Many data types, such as images, videos, and documents, use metadata to provide deep information about their contents. Metadata can be represented in many formats, including XML, or it can be extracted directly from the data itself without any additional representation.

Data parsing (supervised/manual/automatic) : data parsing is required since not all the data has a provenance or metadata that describes the data. The hardest way to gather extra information about the data is to parse it. Automatic parsing can be initially applied. Then, it is tuned and supervised manually by a data expert. This task may become very costly when Big Data is concerned, especially in the case of unstructured data. Consequently, a data profile is generated to represent only certain parts of the data that make sense. Therefore, multiple data profiles for multiple data partitions must be taken into consideration.

Data profile : it is generated early in the Big Data project as DQP Level 0 (the data profile in its early form) and upgraded to a data quality profile within the data preparation component as DQP Level 1. From DQP Level 2 onwards, it is updated and extended through all the components of the Big Data Quality Management Framework. The DQP Level 8 is the profile applied to the data in the pre-processing phase, with its quality rules and related activities, to output pre-processed data that conforms to the quality requirements.
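As a hedged illustration of the kind of technical profile the preparation component might emit, the sketch below summarizes a pandas DataFrame into types, missing ratios, distinct counts, and numeric ranges; the selected statistics are a plausible minimum, not the framework's complete profile.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> dict:
    """Build a basic technical data profile: types, missing ratios, ranges."""
    prof = {}
    for col in df.columns:
        s = df[col]
        entry = {
            "dtype": str(s.dtype),
            "missing_ratio": float(s.isna().mean()),
            "distinct": int(s.nunique()),
        }
        if pd.api.types.is_numeric_dtype(s):
            entry.update(min=float(s.min()), max=float(s.max()))
        prof[col] = entry
    return prof

df = pd.DataFrame({"age": [25, None, 41], "city": ["Paris", "Lyon", "Paris"]})
print(profile(df))
```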

Exploratory quality profiling

Since we follow a data-driven approach that evaluates quality dimensions quantitatively from the data itself, two evaluation steps are adopted: quantitative quality evaluation based on user requirements, and exploratory quality profiling.

The exploratory quality profiling component is responsible for the automatic exploration of data quality dimensions, without user intervention. The Quality Rules Proposals module, which produces a list of actions to raise data quality, is based on elementary DQDs that fit all varieties and data types.

A list of quality rule propositions, based on the quality evaluation of the most commonly considered DQDs (e.g., completeness, accuracy, and uniqueness), is produced. This preliminary assessment is performed on the data itself using predefined scenarios, which are meant to increase data quality for some basic DQDs. Figure 7 depicts the steps involved in exploratory quality profiling for generating quality rules proposals. DQP1 is extended to DQP2 after adding the Data Quality Rules Proposal ( DQRP ), which is generated by the quality rules proposals process.

This module is part of the DQ profiling process; it varies the DQD tolerance levels from minimum to maximum scores and applies a systematic list of predefined quality rules. These predefined rules are sets of actions applied to the data when the measured DQD scores are not within the tolerance level defined by the min and max value scores. The actions vary from deleting only attributes, to discarding only observations, to a combination of both. After these actions, re-evaluating the new DQD scores leads to a quality rules proposal (DQRP) with known DQD target scores, obtained after analysis. Table 7 describes examples of these predefined rule scenarios for the DQD completeness ( dqd  =  Comp ), with an execution priority for each set of grouped actions. The DQD levels are set to vary from a 5% to a 95% tolerance score with a granularity step of 5%; they can be set differently according to the chosen DQD and its sensitivity to the data model and domain. The selection of the best-proposed data quality rules is based on the KNN algorithm using the Euclidean distance (Deng et al. 2016; [ 65 ]); it gives the closest quality rule parameters that achieve (by default) high completeness with little data reduction, as sketched below. The process can be refined by specifying other quality parameters.
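The sketch below mimics that selection step: each candidate rule scenario is summarized as a point (achieved completeness, data reduction ratio), and the nearest neighbour to an ideal target is proposed using the Euclidean distance; the candidate values are invented for illustration.

```python
import numpy as np

# Candidate predefined rule scenarios, each re-evaluated after its actions:
# (achieved completeness %, data reduction %) -- illustrative values.
candidates = np.array([
    [95.0, 40.0],   # aggressive row/column dropping
    [85.0, 15.0],   # moderate dropping
    [70.0,  2.0],   # minimal dropping
])
target = np.array([100.0, 0.0])  # ideal: full completeness, no reduction

# 1-NN by Euclidean distance, as in the rule-proposal selection step.
dists = np.linalg.norm(candidates - target, axis=1)
best = candidates[np.argmin(dists)]
print(f"proposed rule scenario: completeness={best[0]}%, reduction={best[1]}%")
```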

The modules involved in exploratory quality profiling for quality rules proposal generation are illustrated in Fig.  8 .

Figure 8: Quality rules proposals with exploratory quality profiling

Quality mapping and selection

The quality mapping and selection module of the BDQM framework is responsible for mapping data features or attributes to DQDs, targeting the pre-required quality evaluation scores. It generates a Data Quality Evaluation Scheme ( DQES ) and then adds it to (updates) the DQP. The DQES identifies the DQDs of the appropriate attributes to be evaluated using adequate metric formulas. As part of the DQP, the DQES contains, for each of the selected data attributes, the following elements, considered essential for the quantitative quality evaluation:

The attributes: all or a selected list,

The data quality dimensions (DQD’s) to be evaluated for each selected attribute,

Each DQD has a metric that returns the quality score, and

The quality requirement scores for each DQD needed in the score’s validation.

These requirements are general and target many global quality levels. The mapping component acts as a refinement of the global settings with precise quality goals. Therefore, a mapping must be performed between the data quality dimensions and the targeted data features/attributes before proceeding with the quality assessment. Each DQD is measured for each attribute and sample. The mapping generates a DQES , which contains Quality Evaluation Requests ( QER ) \(Q_{x}\); each QER \(Q_{x}\) targets a data quality dimension (DQD) for one attribute, all attributes, or a set of selected attributes, where x indexes the requests.

Quality mapping: Many approaches are available to accomplish an efficient mapping process, including automatic, interactive, manual, and quality-rules-proposal-based techniques:

Automatic : it completes the alignment and comparison of the data attributes (from DQP) with the data quality requirements (either per attribute type, or name). A set of DQDs is associated with each attribute for quality evaluation. It results in a set of associations to be executed and evaluated in the quality assessment component.

Interactive : it relies on experts’ involvement to refine, amend, or confirm the previous automated associations.

Manual : it uses a dashboard similar to the one used to capture the quality requirements, but more advanced and more detailed at the attribute level.

Quality rules proposals : the proposal list collected from the DQP2 is used to understand the impact of a DQD level and the data reduction ratio. These quality insights help decide which DQD is best when compared with the quality requirements.

Quality selection (of DQDs, metrics, and attributes): It consists of selecting an appropriate quality metric to evaluate a data quality dimension for an attribute of a Big Data sample set; the metric returns a count of the correct values, i.e., those that comply with the metric formula. Each metric is computed according to whether the attribute values reflect the DQD constraints. For example, accuracy can be defined as a count of correct attribute values within a certain range of values [v 1 , v 2 ]; similarly, it can be defined as satisfying a certain number of constraints related to the type of data, such as zip codes, emails, social security numbers, dates, or addresses.
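As a sketch of such per-value metrics, the validators below return 1 for a correct value and 0 otherwise, matching the counting convention used in the next paragraphs; the range bounds and the zip-code pattern are illustrative assumptions.

```python
import re

def accuracy_range(value, v1=0, v2=100) -> int:
    """1 if the value lies in the expected range [v1, v2], else 0."""
    try:
        return int(v1 <= float(value) <= v2)
    except (TypeError, ValueError):
        return 0

ZIP_RE = re.compile(r"^\d{5}$")   # illustrative US-style zip constraint

def accuracy_zip(value) -> int:
    """1 if the value matches the zip-code pattern, else 0."""
    return int(bool(ZIP_RE.match(str(value))))

print(accuracy_range(42), accuracy_range(230), accuracy_zip("10001"))
# -> 1 0 1
```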

Let us define the tuple DQES(S, D, A, M). Most of its information is provided by the BDQP(DS, DS', Req) parameters, with Req = (D, L, A). The profiling information is used to select the appropriate quality metric \(m_{l}\) to evaluate the data quality dimension \(q_{l}\) for an attribute \(a_{k}\) with a weight \(w_{j}\). In addition to the previous settings, let \(S(DS, N, n, R) \to s_{i}\) be a sampling strategy producing samples \(s_{i}\).

Let us denote by M a set of quality metrics \(M=\{m_{1},\dots,m_{l},\dots,m_{d}\}\), where \(m_{l}\) is a quality metric that measures and evaluates a DQD \(q_{l}\) for each value of an attribute \(a_{k}\) in the sample \(s_{i}\), returning 1 if the value is correct and 0 otherwise. Each metric \(m_{l}\) is computed according to whether the attribute value satisfies the \(q_{l}\) constraint; for example, the accuracy of an attribute may be defined as its value lying in the range 0 to 100, any other value being incorrect. If the same DQD \(q_{l}\) is evaluated for a set of attributes with equal weights, a simple mean is computed. The metric \(m_{l}\) is evaluated for each instance (cell or row) of the sample \(s_{i}\).

Let us denote by \(M_{l}^{(i)}, i=1,\dots,N\), the total for metric \(m_{l}\), i.e., the number of observations in sample \(s_{i}\) that satisfy the metric for a DQD \(q_{l}\) of an attribute \(a_{k}\), over the N samples drawn from the dataset DS.

The proportion of observations satisfying the adequacy rule in a sample \(s_{i}\) of size n is given by:

$$p_{l}^{(i)}=\frac{M_{l}^{(i)}}{n},\quad i=1,\dots,N$$

The total proportion of observations satisfying the adequacy rule over all samples is given by:

$$M_{l}=\frac{1}{N}\sum_{i=1}^{N}p_{l}^{(i)}$$

where \(M_{l}\) characterizes the mean \(q_{l}\) score for the whole dataset.
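A small numeric illustration of the two formulas above, with invented sample counts:

```python
# N = 3 samples of size n = 100; M_l_i counts correct values per sample.
n, M_l_i = 100, [90, 85, 95]

p = [m / n for m in M_l_i]    # per-sample proportions p_l^(i)
M_l = sum(p) / len(p)         # mean q_l score for the dataset
print(p, M_l)                 # [0.9, 0.85, 0.95] 0.9
```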

Let \(Q_{x}(a_{k},q_{l},m_{l})\) represent a request for a quality evaluation, which results in the mean quality score of a DQD \(q_{l}\) for a measurable attribute \(a_{k}\), calculated by \(M_{l}\). Big Data samples are evaluated for a DQD \(q_{l}\) of an attribute \(a_{k}\) with a metric \(m_{l}\), providing a score \(q_{l}s_{i}\) for each sample \(s_{i}\) (described below in Quantitative Quality Evaluation ); the sample mean of \(q_{l}\) is then the final score for \(a_{k}\).

Let us denote a process that sorts and combines the quality evaluation requests (QER) by DQD or by attribute, re-arranging the \(Q_x(a_k, q_l, m_l)\) tuples into two types, depending on the evaluation selection group parameter:

Per DQD, identified as \(Q_x(AList(a_z), q_l, m_l)\), where AList( \(a_z\) ) represents the attributes \(a_z\) ( z: 1…R ) to be evaluated for the DQD \(q_l\).

Per attribute, identified as \(Q_x(a_k, DList(q_l, m_l))\), where DList( \(q_l\), \(m_l\) ) represents the data quality dimensions \(q_l\) ( l: 1…d ), with their metrics, to be evaluated for the attribute \(a_k\).

In some cases, the type of combination is automatically selected for a certain DQD, such as consistency, when all the attributes are constrained towards specific conditions. The combination is either based on attributes or DQD’s, and the DQES will be constructed as follows:

DQES ( \({{\varvec{Q}}}_{{\varvec{x}}}\left({\varvec{A}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}\left({{\varvec{a}}}_{{\varvec{z}}}\right),{{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}}\right)\) ,…,…) or.

DQES ( \({{\varvec{Q}}}_{{\varvec{x}}}\left({{\varvec{a}}}_{{\varvec{k}}},{\varvec{D}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}({{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}})\right)\) ,…,…)

The completion of the quality mapping process updates the DQP Level 2 with a DQES set as follows:

DQES ( \({{\varvec{Q}}}_{{\varvec{x}}}\left({{\varvec{a}}}_{{\varvec{k}}},{{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}}\right)\) ,…,…) , where x ranges from 1 to a defined number of evaluation requests. Each Q x element is a quality evaluation request of an attribute \({{\varvec{a}}}_{{\varvec{k}}}\) for a quality dimension \({{\varvec{q}}}_{{\varvec{l}}}\) , with a DQD metric m l .

The output of this phase is a DQES, which will hold the mean score of each DQ dimension for one or many attributes. The mapping and selection data flow, initiated using the Big Data quality project (BDQP) settings, is illustrated in Fig.  9 . This is accomplished either by using the same BDQP Req or by defining more detailed and refined quality parameters and a sampling strategy. Two types of DQES can be produced:

Data Quality Dimension-wise evaluation of a list of attributes or

Attribute-wise evaluation of many DQDs. As described before, the quality mapping and selection component generates a DQES evaluation scheme for the dataset, identifying which DQD and attribute tuples to evaluate using a specific quality metric. A more detailed and refined set of parameters can also be set, as described in the previous sections. In the following, the steps that construct the DQES in the mapping component are depicted, followed by a minimal code sketch:

The QMS function extracts the Req parameters from BDQP as (D, L, A) .

A quality evaluation request \(\left({a}_{k},{q}_{l},{m}_{l}\right)\) is generated from the (D, A) tuple.

A list is constructed with these quality evaluation requests.

The list is sorted either by DQD or by attribute, producing two types of lists:

A combination of requests per DQD generates quality requests for a set of attributes \(\left(AList\left({a}_{z}\right),{q}_{l},{m}_{l}\right)\) .

A combination of requests per attribute generates quality requests for a set of DQD’s \(\left({a}_{k},DList({q}_{l},{m}_{l})\right)\) .

A DQES is returned based on the evaluation selection group parameter (per DQD, per attribute).
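The following minimal Python sketch (hypothetical structures throughout) mirrors these steps: evaluation requests are built from the Req = (D, L, A) settings and then grouped per DQD or per attribute to form the DQES.

```python
# A minimal sketch of the mapping steps above; the request and DQES formats
# are illustrative assumptions, not the framework's exact encoding.

from collections import defaultdict

def build_requests(dqd_metrics, attributes):
    """One (attribute, dqd, metric) request per mapped pair."""
    return [(a, d, m) for a in attributes for d, m in dqd_metrics.items()]

def build_dqes(requests, group_by="dqd"):
    """Group per DQD -> (AList(a_z), q_l, m_l), or per attribute -> (a_k, DList(q_l, m_l))."""
    grouped = defaultdict(list)
    if group_by == "dqd":
        for a, d, m in requests:
            grouped[(d, m)].append(a)
        return [{"dqd": d, "metric": m, "attributes": alist}
                for (d, m), alist in grouped.items()]
    for a, d, m in requests:
        grouped[a].append((d, m))
    return [{"attribute": a, "dlist": dl} for a, dl in grouped.items()]

reqs = build_requests({"completeness": "non_null_ratio", "accuracy": "range_check"},
                      ["age", "email"])
dqes = build_dqes(reqs, group_by="dqd")   # or group_by="attribute"
```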

figure 9

DQES parameters settings

Quantitative quality evaluation

The authors in [ 66 ] addressed how to evaluate a set of DQDs over a set of attributes. According to this study, the evaluation of Big Data quality is applied iteratively to many samples, and the DQD scores are aggregated and combined after each iteration. The evaluation scores are added to the DQES, which in turn updates the DQP. We proposed an algorithm that computes the quality scores for a dataset based on a given quality mapping and quality metrics.

This algorithm evaluates the quality metrics to produce scores, validates these scores against the quality requirements, and generates quality rules from the violating scores [ 66 , 67 ]. There are rules related to each pre-processing activity: data cleaning rules eliminate data, while data enrichment rules replace or add data. Other activities, such as data reduction, decrease the data size by removing features or attributes with certain characteristics, such as low variance or high correlation with other features.

In this phase, all the information collected from previous components (profiling, mapping, DQES) is included in the data quality profile level 3. The important elements are the set of samples and the data quality evaluation scheme, which are executed on each sample to evaluate its quality attributes for a specific DQD.

DQP Level 3 provides all the information needed about the settings represented by the DQES to proceed with the quality evaluation. The DQES contains the following:

The selected DQDs and their related metrics.

The selected attributes with the DQD to be evaluated.

The DQD selection is based on the Big Data quality requirements expressed early, when initiating a Big Data quality project.

The attribute selection is set in the quality selection and mapping component (3).

The quantitative quality evaluation methodology is described as follows:

The selected DQD quality metrics measure and evaluate the DQD for each attribute observation in each sample from the sample set. For each attribute observation, the metric returns 1 if correct and 0 if incorrect.

Each metric is computed over all the sample observations whose attribute values are checked against the constraints. For example, the accuracy metric of an attribute may define the range of values between 20 and 70 as valid; any value outside it is invalid. The count of correct values out of the total sample observations is the DQD ratio, expressed as a percentage (%). This is performed for all selected attributes and their selected DQDs.

The sample mean from all samples for each evaluated DQD represents a Data Quality Score (DQS) estimation \(\left(\overline{DQS }\right)\) of a data quality dimension of the data source.

DQP Level 4 : an update to the DQP level 3 includes a data quality evaluation scheme (DQES) with the quality scores per DQD and per attribute ( DQES  +  Scores ).

In summary, the quantitative quality evaluation starts with sampling, the selection of DQDs and their metrics, mapping to data attributes, quality measurements, and finally the sample-mean DQD ratios.

Let us denote by \(Q_x\) Score (quality score) the evaluation result of each quality evaluation request \(Q_x\) in the DQES . Depending on the evaluation type, two kinds of result scores can be identified: scores organized per DQD across all attributes, or per attribute across all DQDs:

\({{\varvec{Q}}}_{{\varvec{x}}}\left({\varvec{A}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}\left({{\varvec{a}}}_{{\varvec{z}}}\right),{{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}}\right)\to\) \({{\varvec{Q}}}_{{\varvec{x}}}\) ScoreList \(\left({\varvec{A}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}\left({{\varvec{a}}}_{{\varvec{z}}},{\varvec{S}}{\varvec{c}}{\varvec{o}}{\varvec{r}}{\varvec{e}}\right),{{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}}\right)\) or.

\({{\varvec{Q}}}_{{\varvec{x}}}\left({{\varvec{a}}}_{{\varvec{z}}},{\varvec{D}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}({{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}})\right)\) \(\to\) Q x ScoreList \(\left({{\varvec{a}}}_{{\varvec{z}}},{\varvec{D}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}\left({{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}},{\varvec{S}}{\varvec{c}}{\varvec{o}}{\varvec{r}}{\varvec{e}}\right)\right)\)

where \({\varvec{z}}=1,\dots ,{\varvec{r}},\boldsymbol{ }{\varvec{r}}\) is the number of selected attributes, and \({\varvec{l}}=1,\dots ,{\varvec{d}},\) \({\varvec{d}}\) is the number of selected DQD’s.

The quality evaluation generates quality scores \(Q_x\) Score . A quality scoring model, provided in the form of quality requirements, is used to assess these results; the requirements are expressed as quality acceptance-level percentages needed to interpret the resulting scores. They might be a set of values, an interval within which values are accepted or rejected, or a single score ratio percentage. The analysis of these scores against the quality requirements leads to the discovery and generation of quality rules for the attributes violating the requirements.

The quantitative quality evaluation process follows the steps described below for the case of the evaluation of a DQD’s list among several attributes ( \({{\varvec{Q}}}_{{\varvec{x}}}\left({{\varvec{a}}}_{{\varvec{z}}},{\varvec{D}}{\varvec{L}}{\varvec{i}}{\varvec{s}}{\varvec{t}}({{\varvec{q}}}_{{\varvec{l}}},{{\varvec{m}}}_{{\varvec{l}}})\right)\) ):

N samples (of size n ) are generated from the dataset DS using a BLB-based bootstrap sampling approach.

For each sample \({{\varvec{s}}}_{{\varvec{i}}}\) generated in step 1, and

For each selected attribute \(a_z\) ( \(z=1,\dots,r\) ) in the DQES, evaluate all the DQDs in the DList using their related metrics to obtain \(Q_x\) ScoreList \(\left({a}_{z},DList\left({q}_{l},{m}_{l},Score\right),{s}_{i}\right)\) for each sample \(s_i\).

For all the sample scores, evaluate the sample mean over the N samples for each attribute \(a_z\) and its \(q_l\) evaluation scores:

$$\bar{q}_{zl} = \frac{1}{N}\sum_{i=1}^{N} q_{zl}s_i\,Score$$

For the dataset DS , evaluate the quality score mean \(\bar{q}_l\) for each DQD over all attributes \(a_z\), as follows:

$$\bar{q}_l = \frac{1}{r}\sum_{z=1}^{r} \bar{q}_{zl}$$

The illustration in Fig.  10 shows that \(q_{zl}s_i\,Score\) is the evaluation of the DQD \(q_l\) on the sample \(s_i\) for an attribute \(a_z\) with a metric \(m_l\), and \(\bar{q}_{zl}\) represents the quality score sample mean for the attribute \(a_z\). A minimal code sketch of this evaluation loop follows Fig. 10.

figure 10

Big data sampling and quantitative quality evaluation
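A simplified sketch of steps 1 to 4 above follows; plain bootstrap resampling stands in for the paper's BLB-based approach, and the request structures are the hypothetical ones from the earlier sketches.

```python
# N samples of size n are drawn, each selected attribute's DQDs are scored
# per sample, the per-sample scores are averaged per attribute (q_bar_zl),
# and finally averaged per DQD over all attributes (q_bar_l).
import random

def evaluate_dataset(dataset, dqes, metrics, N=30, n=1000):
    """dqes: per-attribute requests [{'attribute': a, 'dlist': [(dqd, metric_name), ...]}]."""
    per_pair = {}                                   # (attribute, dqd) -> per-sample scores
    for _ in range(N):
        sample = random.choices(dataset, k=n)       # s_i (resampling with replacement)
        for req in dqes:                            # Q_x(a_z, DList(q_l, m_l))
            a = req["attribute"]
            for dqd, metric_name in req["dlist"]:
                m = metrics[metric_name]
                score = sum(m(row[a]) for row in sample) / n
                per_pair.setdefault((a, dqd), []).append(score)
    q_zl = {k: sum(v) / len(v) for k, v in per_pair.items()}   # sample means per attribute
    by_dqd = {}
    for (a, dqd), s in q_zl.items():
        by_dqd.setdefault(dqd, []).append(s)
    q_l = {dqd: sum(s) / len(s) for dqd, s in by_dqd.items()}  # dataset-level DQD means
    return q_zl, q_l
```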

Quality control

Quality control is initiated when the quality evaluation results are available and reported in the DQES of DQP Level 4 . During quality control, all the quality scores are checked against the quality requirements of the Big Data project. If any anomaly or non-conformance is detected, the quality control component forwards a DQP Level 5 to the data quality rules discovery component.

At this point, various cases arise. An iteration process is performed until the required quality levels are satisfied, or until the experts decide to stop the quality evaluation process and re-evaluate their requirements. At each phase, a form of quality control is present within each quality process, even when not explicitly specified.

The quality control acts in the following cases:

Case 1: This case applies when the quality is estimated and no rules are yet included in the DQP Level 4 (the DQP is considered a report: the data quality is still being inspected, and only reports are generated, with no actions performed yet).

In the case of accepted quality scores, no quality actions need to be applied to the data. The DQP Level 4 remains unchanged and acts as a full data quality report, updated with a positive validation of the data per quality requirement. However, it might include some simple pre-processing, such as attribute selection and filtering. According to the data analytics requirements and the expected results planned in the Big Data project, more specific data pre-processing actions may be performed, but they are not quality-related in this case.

In the case when quality scores are not accepted, the DQP Level 4 DQES scores are analyzed, and the DQP is updated with a quality error report about the related DQD scores and their data attributes. A DQP Level 5 is created; it will be analyzed by the quality rules discovery component to determine the pre-processing activities to be executed on the data.

Case 2: In the presence of a DQP Level 6 that contains a quality evaluation request for the pre-processed samples with the discovered quality rules, the following situations may occur:

When the quality control checks that the DQP Level 6 rules are valid and satisfy the quality requirements, the DQP Level 6 is upgraded to DQP Level 7 and confirmed as the final data quality profile, which will be applied to the data in the pre-processing phase. DQP Level 7 is considered important because it contains validated quality rules.

When the quality control is not fully or only partially satisfied, the DQP Level 6 is sent back to the quality selection and mapping component for adaptation, together with the valid and invalid quality rules, quality scores, and error reports. These reports flag, with an unacceptable score interval, the quality rules that did not satisfy the quality requirements. The quality selection and mapping component provides automatic or manual analysis and assessment of the unsatisfied quality rules with respect to their targeted DQDs, attributes, and quality requirements. An adaptation of the quality requirements is needed to re-validate these rules. Finally, the expert users have the final word on whether to continue, or to break the process and proceed to the pre-processing phase with the valid rules only. As part of the framework's reuse specification, the invalid rules are kept within the DQP for future re-evaluation.

Case 3: The control component always proceeds based on the quality scores and quality requirements, for both input and pre-processed data. Continuous control and monitoring are responsible for initiating DQP updates and adaptations if the quality requirements are relaxed.

Quality rules: discovery, validation, optimization, and execution

In [ 67 ], it was reported that if the DQD scores do not conform to the quality requirements, the failed scores are used to discover data quality rules. When executed on the data, these rules enhance its quality. They are based on known pre-processing activities such as data cleansing. Each activity has a set of functions targeting different types of data in order to increase the DQD ratio and the overall quality of the data source or dataset(s).

When quality rules ( QR ) are applied to a sample set S , a pre-processed sample set S' is generated. A quality evaluation process is invoked on S' , generating DQD scores for S' . A score comparison between S and S' is then conducted to retain only the qualified and valid rules with a higher percentage of success on the data. Finally, an optimization scheme is applied to the list of valid quality rules before their application to production data. The predefined optimization schemes vary from (1) rules priority to (2) rules redundancy, (3) rules removal, (4) rules grouping per attribute, (5) per DQD, or (6) per duplicate rules.

Quality rules discovery: The discovery is based on the DQP Level 5 received from the quality control component. An analysis of the quality scores is initiated, and an error report is extracted. If the DQD scores do not conform to the quality requirements, the failed scores are used to discover data quality rules; when executed on the data, these rules enhance its quality. They are based on known pre-processing activities such as data cleansing. The discovery component comprises several modules: the analysis of the DQES DQD scores against the requirements, the combination of the attributes' pre-processing activities for each targeted DQD, and the rules generation.

For example, an attribute with a 50% missing-data score is not acceptable when the required score is 20% or less. This triggers the generation of a quality rule consisting of a data cleansing activity for the observations that do not satisfy the quality requirement. The data cleansing or data enrichment activity is selected from the Big Data quality profile repository. The quality rule targets all the related attributes marked for pre-processing in order to reduce the missing ratio from 50% to 20% for the DQD completeness. Moreover, in the case of completeness, cleansing of missing values is not the only option: several alternative pre-processing activities are available, such as missing-value replacement, with functions implementing several replacement methods like the mean, the mode, and the median.
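A minimal Python sketch of this discovery step follows, assuming a simple rule format and an activity catalog keyed by DQD (both illustrative, not the framework's exact encoding).

```python
# When a DQD error ratio violates its requirement -- e.g. 50% missing values
# against a 20% ceiling -- a quality rule is generated that binds a
# pre-processing activity from the catalog to the offending attribute.

def discover_rules(dqes_scores, requirements, activity_catalog):
    """dqes_scores: {(attribute, dqd): error_ratio}; requirements: {dqd: max_error_ratio}."""
    rules = []
    for (attribute, dqd), error_ratio in dqes_scores.items():
        required = requirements.get(dqd)
        if required is not None and error_ratio > required:
            rules.append({
                "attribute": attribute,
                "dqd": dqd,
                "observed": error_ratio,
                "target": required,
                # e.g. completeness -> missing-value replacement (mean/mode/median)
                "activity": activity_catalog[dqd],
            })
    return rules

rules = discover_rules({("income", "completeness"): 0.50},
                       {"completeness": 0.20},
                       {"completeness": "replace_missing_with_median"})
```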

The pre-processing activities are provided by the repository to achieve the required data quality. Several modes are available for selecting pre-processing activities:

Automatic , by discovering and suggesting a set of activities or DQ rules.

Predefined , by selecting ready-to-use quality rule proposals from the exploratory quality profiling component, or predefined pre-processing activity functions from the repository, indexed by DQDs.

Manual, giving the expert the ability to query the exploratory quality profiling results for the best rules, achieving the required quality using KNN-based filtering.

Quality rules validation: The quality rules generated by the discovery component are set in the DQP Level 6. The rules validation process starts when the DQR list is applied to the sample set S , resulting in a pre-processed sample set S' generated by the related pre-processing activities. Then, a quality evaluation process is invoked on S' , generating DQD scores for S' . A score comparison between S and S' is conducted to retain only the qualified and valid rules with a higher percentage of success on the data. After analyzing these scores, two sets of rules are identified: successful and failed rules. A minimal sketch of this loop is given below.
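In the following Python sketch, apply_rule and score stand in for the repository's pre-processing and evaluation functions; they are assumptions, not the framework's API.

```python
# Apply the DQR list to the sample set S, re-evaluate the pre-processed S',
# and keep only rules that improve the targeted DQD score.

def validate_rules(samples, rules, apply_rule, score):
    successful, failed = [], []
    for rule in rules:
        before = [score(s, rule["attribute"], rule["dqd"]) for s in samples]
        pre_processed = [apply_rule(s, rule) for s in samples]        # S -> S'
        after = [score(s, rule["attribute"], rule["dqd"]) for s in pre_processed]
        # a rule is valid when the mean DQD score of S' exceeds that of S
        if sum(after) / len(after) > sum(before) / len(before):
            successful.append(rule)
        else:
            failed.append(rule)
    return successful, failed
```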

Quality rules optimization: After the set of discovered valid quality rules is selected, an optimization process is activated to reorganize and filter the rules. This is needed because of the nature of the evaluation parameters set in the mapping component and the refinement of the quality requirements. These choices, together with the rules validation process, produce a list of individual quality rules that, if applied as generated, might have the following consequences:

Redundant rules.

Ineffective rules due to the order of execution.

Multiple rules, which target the same DQD with the same requirements.

Multiple rules, which target the same attributes for the same DQD and requirements.

Rules that drop attributes or rows must be applied first, or given a higher priority, to avoid applying rules to data items that are meant to be dropped (Table 8 ).

The quality rules optimization component applies an optimization scheme to the list of valid quality rules before their application to production data in the pre-processing phase. The predefined optimization schemes vary according to the following:

Rules execution priority per attribute or DQD, per pre-processing activity, or pre-processing function.

Rules redundancy removal per attributes or DQDs.

Rules grouping or combination per activity, per attribute, per DQD, or per duplicates.

For invalid rules, the component provides several actions, including rules removal or rules adaptation based on previously generated proposals from the exploratory quality profiling component for the same targeted tuple (attributes, DQDs). A sketch of the optimization step follows.
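A minimal Python sketch of three of these schemes: duplicate removal, priority ordering so that drop rules run first, and grouping per attribute. The priority table and rule fields are illustrative assumptions.

```python
from itertools import groupby

# Assumed priorities: destructive rules run before repairing/enriching ones.
PRIORITY = {"drop_attribute": 0, "drop_rows": 0, "cleansing": 1, "enrichment": 2}

def optimize_rules(rules):
    # 1. remove exact duplicates (same attribute, DQD, and activity)
    unique = {(r["attribute"], r["dqd"], r["activity"]): r for r in rules}.values()
    # 2. drop rules run first so later rules never touch data meant to be dropped
    ordered = sorted(unique, key=lambda r: PRIORITY.get(r["activity"], 3))
    # 3. group per attribute so each attribute is pre-processed in one pass;
    #    Python's stable sort preserves the priority order inside each group
    keyfn = lambda r: r["attribute"]
    return {a: list(g) for a, g in groupby(sorted(ordered, key=keyfn), key=keyfn)}
```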

Quality rules execution: The quality rules execution consists of pre-processing the data using the DQP, which embeds the data quality rules that enhance the quality to reach the agreed requirements. As part of the monitoring module, a sample set drawn from the pre-processed data is used to re-assess the quality and detect eventual failures.

Quality monitoring

Quality monitoring is a continuous quality control process that relies on the DQP. The purpose of monitoring is to validate the DQP across all the Big Data lifecycle processes. The QP repository is updated during and after the complete lifecycle, as well as upon the user's feedback on data, quality requirements, and mapping.

As illustrated in Fig.  11 , the monitoring process takes scheduled snapshots of the pre-processed Big Data all along the BDQMF for the BDQ project. Each data snapshot is a set of samples whose quality is evaluated by BDQMF component (4). Quality control is then conducted on the quality scores, and the DQP is updated. The quality report may highlight quality failures and the evolution of their ratios across multiple sampling snapshots of the data.

figure 11

Quality monitoring component
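A minimal Python sketch of one monitoring cycle follows, with snapshot, evaluate, and control standing in for the corresponding framework components (all names illustrative).

```python
import time

def monitor(data_source, dqp, snapshot, evaluate, control, interval_s=3600):
    """Snapshot pre-processed data, re-assess its DQDs, and report into the DQP."""
    while True:
        samples = snapshot(data_source)              # scheduled data snapshot
        scores = evaluate(samples, dqp["dqes"])      # re-assess quality on samples
        report = control(scores, dqp["requirements"])
        dqp.setdefault("monitoring_reports", []).append(report)
        if report.get("failures"):                   # quality regression detected:
            return report                            # hand back to rules discovery
        time.sleep(interval_s)
```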

The monitoring process strengthens and enforces the quality across the Big Data value chain using the BDQM framework while reusing the data quality profile information. For each quality monitoring iteration on the datasets from the data source, quality reports are added to the data quality profile, updating it to a DQP Level 10 .

Data processing, analytics, and visualization

This process involves the application of algorithms or methodologies that extract insights from the ready-to-use, quality-enhanced data. The value of the processed data is then projected visually, as dashboards and graphically enhanced charts, so that decision-makers can act on it. Big Data visualization approaches are of high importance for the final exploitation of the data.

Implementations: Dataflow and quality processes development

In this section, we give an overview of the dataflow across the various processes of the framework and highlight the implemented quality management processes, along with the application interfaces developed to support the main processes. Finally, we describe the ongoing process implementations and evaluations.

Framework dataflow

In Fig.  12 , we illustrate the whole process flow of the framework, from the inception of the quality project, with its specification and requirements, to the quality monitoring phase. As an ongoing process, monitoring is part of the quality enforcement loop and may trigger other processes that handle several quality profile operations, such as DQP adaptation, upgrade, or reuse.

figure 12

Big data quality management framework data flow

In Table 9 , we enumerate and detail the multiple processes and their interactions within the BDQM Framework components, including their inputs and outputs after executing the related activities on the quality profile (DQP), as detailed in the previous section.

Quality management processes’ implementation

In this section, we describe the implementation of our framework's most important components and processes, and their contribution to the quality management of Big Data across its lifecycle.

Core processes implementation

As depicted above, the core framework processes have been implemented and evaluated. In the following, we describe how these components were implemented and evaluated.

Quality profiling : one of the central components of our framework is the data quality profile (DQP). Initially, the DQP implements a simple data profile of a Big Data set as an XML file (a DQP sample is illustrated in Fig.  13 ).

figure 13

Example of data quality profile

After traversing the processes of several framework components, it is updated to a full data quality profile. The data quality evaluation process is one of the activities that updates the DQP with quality scores, which are later used to discover data quality rules. These rules, when applied to the original data, ensure an output dataset of higher quality. The DQP is finally executed by the pre-processing component. By the end of the lifecycle, the DQP contains all the relevant information: the data quality rules that target a set of data sources with multiple datasets, the data attributes, the data quality dimensions such as accuracy, and the pre-processing activities like data cleansing, data integration, and data normalization. In other words, the DQP holds all the information about the data, its quality, the user quality requirements, the DQDs, quality levels, attributes, the data quality evaluation scheme (DQES), quality scores, and the data quality rules. The DQP is stored in the DQP repository, which performs many DQP-related tasks; the repository itself is described below (QPREPO).
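To make the profile structure concrete, the following minimal Python sketch parses a hypothetical DQP XML file in the spirit of Fig. 13; the schema, element, and attribute names here are illustrative assumptions, not the framework's exact format.

```python
import xml.etree.ElementTree as ET

# Hypothetical DQP fragment: a level, a dataset source, a requirement,
# and one DQES evaluation request.
DQP_XML = """
<dqp level="3" project="bdq-demo">
  <dataset source="hdfs://data/customers.csv"/>
  <requirement dqd="completeness" min_score="0.80"/>
  <dqes>
    <request attribute="email" dqd="completeness" metric="non_null_ratio"/>
  </dqes>
</dqp>
"""

root = ET.fromstring(DQP_XML)
requests = [(r.get("attribute"), r.get("dqd"), r.get("metric"))
            for r in root.iter("request")]
print(root.get("level"), requests)   # -> 3 [('email', 'completeness', 'non_null_ratio')]
```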

Quality requirement dashboard : developed as a web-based application, as shown in Fig.  14 below, to capture the user's requirements and other quality information. Such requirements include, for instance, the specification of data quality dimension requirements. This application can be extended with extra information about data sources, such as attributes and their types. The user is guided through the interface to specify the right attribute values and is also given the option to upload an XML file containing the relationships between attributes. The recorded requirements are finally saved to a data quality profile Level 0, which is used in the next stage of the quality management process.

figure 14

Quality requirements dashboard

Data preparation and sampling : The framework's operations start when the quality project's minimal specifications are set. It initiates and provides a data quality summary, named the data quality profile (DQP), by running an exploratory quality profiling assessment on data samples (using the BLB sampling algorithm). The DQP is projected to be the core component of the framework, and every update and result regarding quality is recorded in it. The DQP is stored in a quality repository and registered in the Big Data's provenance to keep track of data changes due to quality enhancements.

Data quality mapping and rule discovery components : data quality mapping simplifies the whole data quality assessment process and adds more data quality control to it. The implemented mapping links and categorizes all the elements required by the quality project, from Big Data quality characteristics, pre-processing activities, and their related technique functions, to data quality rules, dimensions, and their metrics. The implementation of data quality rules discovery from the evaluation results reveals the actions and transformations that, when applied to the dataset, accomplish the targeted quality level. These rules are the main ingredients of the pre-processing activities. The role of a DQ rule is to address the sources of bad quality by defining a list of actions related to each quality score. The DQ rules are the result of a systematic and planned data quality assessment analysis.

Quality profile repository (QPREPO) : finally, our framework implements the QPREPO to manage the data quality profiles for different data types and domains and to adapt or optimize existing profiles. This repository manages the data quality dimensions with their related metrics, as well as the pre-processing activities and their activity functions. A QPREPO entry is created for each Big Data quality project, with the related DQP containing information about each dataset, data source, data domain, and data user. This information is essential for DQP reuse, adaptation, and enhancement for the same or different data sources.

Implemented approaches for quality assessment

The framework uses various approaches for quality assessment: (1) exploratory quality profiling; (2) a quantitative quality assessment approach using DQD metrics; and (3) a qualitative quality assessment, anticipated as a new component.

Exploratory quality profiling implements an automatic quality evaluation that is performed systematically on all data attributes for the basic DQDs. The resulting calculated scores are used to generate quality rules for each quality tolerance ratio variation. These rules are then applied to other data samples, and the quality is reassessed. An analysis of the results provides an interactive, quality-based rules search using several ranking algorithms (maximization, minimization, applying weights).
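A minimal Python sketch of such a ranking follows, assuming each candidate rule carries an estimated score gain and data loss (illustrative fields, not the framework's exact encoding).

```python
# Candidate rules from exploratory profiling are ranked by maximizing the
# quality score gain, minimizing data loss, or a user-weighted combination.

def rank_rules(candidates, w_gain=0.7, w_loss=0.3):
    """candidates: [{'rule': ..., 'score_gain': float, 'data_loss': float}]"""
    return sorted(candidates,
                  key=lambda c: w_gain * c["score_gain"] - w_loss * c["data_loss"],
                  reverse=True)

ranked = rank_rules([
    {"rule": "drop_rows_missing_email", "score_gain": 0.30, "data_loss": 0.12},
    {"rule": "impute_email_placeholder", "score_gain": 0.25, "data_loss": 0.00},
])
print(ranked[0]["rule"])   # the imputation rule wins under these weights
```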

The Quantitative Quality Assessment implements a quick data quality evaluation strategy supported through sampling and profiling processes for Big Data. The evaluation is conducted by measuring the data quality dimensions (DQDs) for attributes using specific metrics to calculate a quality score.

The qualitative quality assessment approach implements a deep quality assessment to discover hidden quality aspects and their impact on the outputs of the Big Data lifecycle. These quality aspects must be quantified into scores and mapped to the related attributes and DQDs. This quantification is achieved by applying several feature selection strategies and algorithms to data samples. These qualitative insights are combined with those previously obtained from the quantitative quality evaluation early in the quality management process.

Framework development, deployment, and evaluation

Development, deployment, and evaluation of our BDQMF framework follow a systematic modular approach, where the various components of the framework are developed and tested independently and then integrated with the other components to compose the integrated solution. Most of the components are implemented in R and Python using the SparkR and PySpark libraries, respectively. The supporting files, such as the DQP, DQES, and configuration files, are written in XML and JSON formats. Big Data quality project requests and constraints, including the data sources and the quality expectations, are implemented within the solution, where more than one module might be involved. The BDQMF components are deployed following the Apache Hadoop and Spark ecosystem architecture.

The deployed BDQMF modules and the developed APIs are described in the following:

Quality settings mapper (QSM): it implements an interface for the automatic selection and mapping of DQDs and dataset attributes from the initial DQP.

Quality settings parser (QSP): responsible for parsing and loading parameters to the execution environment from DQP settings to data files. It is also used to extract quality rules and scores from the DQES in the DQP.

Data loader (DL): implements the filtering, selection, and loading of all types of data files required by the BDQMF, including datasets from data sources, into the Spark environment (e.g., DataFrames, tables); the loaded data is either used directly by the various processes or persisted in the database for further reuse. For data selection, the loader uses SQL to retrieve only the attributes set in the DQP settings, as sketched below.
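As an illustration of the loader's SQL-based attribute selection, here is a minimal PySpark sketch; the file path and column names are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bdqmf-data-loader").getOrCreate()

# Load the dataset and register it as a view for SQL-based selection.
df = spark.read.option("header", True).csv("hdfs://data/customers.csv")
df.createOrReplaceTempView("dataset")

# Retrieve only the attributes named in the DQP settings (hypothetical list).
dqp_attributes = ["email", "age", "income"]
selected = spark.sql(f"SELECT {', '.join(dqp_attributes)} FROM dataset")
```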

Data samples generator (DSG): it generates data samples from multiple data sources.

Quality inspector and profiler (QIP): it is responsible for all qualitative and quantitative quality evaluations on data samples across all the BDQMF lifecycle phases. The inspector assesses all the default and required DQDs, and all quality evaluation results are set in the DQES within the DQP file.

Preprocessing activities and functions execution engine (PPAF-E): all the repository pre-processing activities, along with their related functions, are implemented as APIs in Python and R. When requested, this library loads the necessary methods and executes them within the pre-processing activities for rules validation and for rules execution in phase 9.

Quality rules manager (QRM): it is one of the most important modules of the framework. It implements and delivers the following features:

Analyzes quality results.

Discovers and generates quality rule proposals.

Validates quality rules against the requirement settings.

Refines and optimizes quality rules.

Performs quality rules' ACID operations on the DQP files and the repository.

Quality monitor (QM): it is responsible for monitoring, triggering, and reporting any quality change over the whole Big Data lifecycle, to ensure the effectiveness of the quality improvement achieved by the discovered data quality rules.

BDQMF-Repo: the repository where all the quality-related files, settings, requirements, and results are stored. The repository uses HBase or MongoDB to fulfill the requirements of Big Data ecosystem environments and the scalability needed for intensive data updates.

Conclusion

Big Data quality has attracted the attention of researchers as it is considered the key differentiator leading to high-quality insights and data-driven decisions. In this paper, a Big Data Quality Management Framework addressing end-to-end quality in the Big Data lifecycle was proposed. The framework is based on a Data Quality Profile, which is augmented with valuable information while traveling across the different stages of the framework, starting from the Big Data project parameters, quality requirements, quality profiling, and quality rules proposals. The exploratory quality profiling feature, which extracts quality information from the data, helps build a robust DQP with quality rules proposals and eases the configuration of the data quality evaluation scheme. Moreover, the extracted quality rules proposals are of high benefit to the quality dimensions mapping and attribute selection component. This supports the users with quality data indicators characterized by their profile.

The framework dataflow shows that the quality of any Big Data set is evaluated through the exploratory quality profiling component and through quality rules extraction and validation, leading to an improvement in its quality. It is of great importance to ensure the right selection of a combination of targeted DQD levels, observations (rows), and attributes (columns) for reliable quality results, while not sacrificing vital data by considering only one DQD. The resulting quality profile, based on the quality assessment results, confirms that the quality information it contains significantly improves the quality of Big Data.

In future work, we plan to extend the quantitative quality profiling with qualitative evaluation. We also plan to extend the framework to cope with unstructured Big Data quality assessment.

Availability of data and materials

Data used in this work is available from the first author and can be provided upon request. The data includes sampling data, pre-processed data, etc.

Chen M, Mao S, Liu Y. Big data: A survey. Mobile Netw Appl. 2014;19:171–209. https://doi.org/10.1007/s11036-013-0489-0 .


Chiang F, Miller RJ. Discovering data quality rules. Proceed VLDB Endowment. 2008;1:1166–77.

Yeh, P.Z., Puri, C.A., 2010. An Efficient and Robust Approach for Discovering Data Quality Rules, in: 2010 22nd IEEE International Conference on Tools with Artificial Intelligence (ICTAI). Presented at the 2010 22nd IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pp. 248–255. https://doi.org/10.1109/ICTAI.2010.43

Ciancarini, P., Poggi, F., Russo, D., 2016. Big Data Quality: A Roadmap for Open Data, in: 2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService). Presented at the 2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService), pp. 210–215. https://doi.org/10.1109/BigDataService.2016.37

Firmani D, Mecella M, Scannapieco M, Batini C. On the meaningfulness of “big data quality” (Invited Paper). Data Sci Eng. 2016;1:6–20. https://doi.org/10.1007/s41019-015-0004-7 .

Rivas, B., Merino, J., Serrano, M., Caballero, I., Piattini, M., 2015. I8K|DQ-BigData: I8K Architecture Extension for Data Quality in Big Data, in: Advances in Conceptual Modeling, Lecture Notes in Computer Science. Presented at the International Conference on Conceptual Modeling, Springer, Cham, pp. 164–172. https://doi.org/10.1007/978-3-319-25747-1_17

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., Byers, A.H., 2011. Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute 1–137.

Chen CP, Zhang C-Y. Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Inf Sci. 2014;275:314–47.

Hashem IAT, Yaqoob I, Anuar NB, Mokhtar S, Gani A, Ullah Khan S. The rise of “big data” on cloud computing: Review and open research issues. Inf Syst. 2015;47:98–115. https://doi.org/10.1016/j.is.2014.07.006 .

Hu H, Wen Y, Chua T-S, Li X. Toward scalable systems for big data analytics: a technology tutorial. IEEE Access. 2014;2:652–87. https://doi.org/10.1109/ACCESS.2014.2332453 .

Wielki J. The Opportunities and Challenges Connected with Implementation of the Big Data Concept. In: Mach-Król M, Olszak CM, Pełech-Pilichowski T, editors. Advances in ICT for Business. Springer International Publishing: Industry and Public Sector, Studies in Computational Intelligence; 2015. p. 171–89.


Ali-ud-din Khan, M., Uddin, M.F., Gupta, N., 2014. Seven V’s of Big Data understanding Big Data to extract value, in: American Society for Engineering Education (ASEE Zone 1), 2014 Zone 1 Conference of The. Presented at the American Society for Engineering Education (ASEE Zone 1), 2014 Zone 1 Conference of the, pp. 1–5. https://doi.org/10.1109/ASEEZone1.2014.6820689

Kepner, J., Gadepally, V., Michaleas, P., Schear, N., Varia, M., Yerukhimovich, A., Cunningham, R.K., 2014. Computing on masked data: a high performance method for improving big data veracity, in: 2014 IEEE High Performance Extreme Computing Conference (HPEC). Presented at the 2014 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6. https://doi.org/10.1109/HPEC.2014.7040946

Saha, B., Srivastava, D., 2014. Data quality: The other face of Big Data, in: 2014 IEEE 30th International Conference on Data Engineering (ICDE). Presented at the 2014 IEEE 30th International Conference on Data Engineering (ICDE), pp. 1294–1297. https://doi.org/10.1109/ICDE.2014.6816764

Gandomi A, Haider M. Beyond the hype: Big data concepts, methods, and analytics. Int J Inf Manage. 2015;35:137–44.

Pääkkönen P, Pakkala D. Reference architecture and classification of technologies, products and services for big data systems. Big Data Research. 2015;2:166–86. https://doi.org/10.1016/j.bdr.2015.01.001 .

Oliveira, P., Rodrigues, F., Henriques, P.R., 2005. A Formal Definition of Data Quality Problems., in: IQ.

Maier, M., Serebrenik, A., Vanderfeesten, I.T.P., 2013. Towards a Big Data Reference Architecture. University of Eindhoven.

Caballero, I., Piattini, M., 2003. CALDEA: a data quality model based on maturity levels, in: Third International Conference on Quality Software, 2003. Proceedings. Presented at the Third International Conference on Quality Software, 2003. Proceedings, pp. 380–387. https://doi.org/10.1109/QSIC.2003.1319125

Sidi, F., Shariat Panahy, P.H., Affendey, L.S., Jabar, M.A., Ibrahim, H., Mustapha, A., 2012. Data quality: A survey of data quality dimensions, in: 2012 International Conference on Information Retrieval Knowledge Management (CAMP). Presented at the 2012 International Conference on Information Retrieval Knowledge Management (CAMP), pp. 300–304. https://doi.org/10.1109/InfRKM.2012.6204995

Chen, M., Song, M., Han, J., Haihong, E., 2012. Survey on data quality, in: 2012 World Congress on Information and Communication Technologies (WICT). Presented at the 2012 World Congress on Information and Communication Technologies (WICT), pp. 1009–1013. https://doi.org/10.1109/WICT.2012.6409222

Batini C, Cappiello C, Francalanci C, Maurino A. Methodologies for data quality assessment and improvement. ACM Comput Surv. 2009;41:1–52. https://doi.org/10.1145/1541880.1541883 .

Glowalla, P., Balazy, P., Basten, D., Sunyaev, A., 2014. Process-Driven Data Quality Management–An Application of the Combined Conceptual Life Cycle Model, in: 2014 47th Hawaii International Conference on System Sciences (HICSS). Presented at the 2014 47th Hawaii International Conference on System Sciences (HICSS), pp. 4700–4709. https://doi.org/10.1109/HICSS.2014.575

Wand Y, Wang RY. Anchoring data quality dimensions in ontological foundations. Commun ACM. 1996;39:86–95. https://doi.org/10.1145/240455.240479 .

Wang, R.Y., Strong, D.M., 1996. Beyond accuracy: What data quality means to data consumers. Journal of management information systems 5–33.

Cappiello, C., Caro, A., Rodriguez, A., Caballero, I., 2013. An Approach To Design Business Processes Addressing Data Quality Issues.

Hazen BT, Boone CA, Ezell JD, Jones-Farmer LA. Data quality for data science, predictive analytics, and big data in supply chain management: An introduction to the problem and suggestions for research and applications. Int J Prod Econ. 2014;154:72–80. https://doi.org/10.1016/j.ijpe.2014.04.018 .

Caballero, I., Verbo, E., Calero, C., Piattini, M., 2007. A Data Quality Measurement Information Model Based On ISO/IEC 15939., in: ICIQ. pp. 393–408.

Juddoo, S., 2015. Overview of data quality challenges in the context of Big Data, in: 2015 International Conference on Computing, Communication and Security (ICCCS). Presented at the 2015 International Conference on Computing, Communication and Security (ICCCS), pp. 1–9. https://doi.org/10.1109/CCCS.2015.7374131

Woodall P, Borek A, Parlikad AK. Data quality assessment: The hybrid approach. Inf Manage. 2013;50:369–82. https://doi.org/10.1016/j.im.2013.05.009 .

Goasdoué, V., Nugier, S., Duquennoy, D., Laboisse, B., 2007. An Evaluation Framework For Data Quality Tools., in: ICIQ. pp. 280–294.

Caballero, I., Serrano, M., Piattini, M., 2014. A Data Quality in Use Model for Big Data, in: Indulska, M., Purao, S. (Eds.), Advances in Conceptual Modeling, Lecture Notes in Computer Science. Springer International Publishing, pp. 65–74. https://doi.org/10.1007/978-3-319-12256-4_7

Cai L, Zhu Y. The challenges of data quality and data quality assessment in the big data era. Data Sci J. 2015. https://doi.org/10.5334/dsj-2015-002 .

Philip Woodall, A.B., 2014. An Investigation of How Data Quality is Affected by Dataset Size in the Context of Big Data Analytics.

Laranjeiro, N., Soydemir, S.N., Bernardino, J., 2015. A Survey on Data Quality: Classifying Poor Data, in: 2015 IEEE 21st Pacific Rim International Symposium on Dependable Computing (PRDC). Presented at the 2015 IEEE 21st Pacific Rim International Symposium on Dependable Computing (PRDC), pp. 179–188. https://doi.org/10.1109/PRDC.2015.41

Liu, J., Li, J., Li, W., Wu, J., 2016. Rethinking big data: A review on the data quality and usage issues. ISPRS Journal of Photogrammetry and Remote Sensing, Theme issue “State-of-the-art in photogrammetry, remote sensing and spatial information science” 115, 134–142. https://doi.org/10.1016/j.isprsjprs.2015.11.006

Rao, D., Gudivada, V.N., Raghavan, V.V., 2015. Data quality issues in big data, in: 2015 IEEE International Conference on Big Data (Big Data). Presented at the 2015 IEEE International Conference on Big Data (Big Data), pp. 2654–2660. https://doi.org/10.1109/BigData.2015.7364065

Zhou, H., Lou, J.G., Zhang, H., Lin, H., Lin, H., Qin, T., 2015. An Empirical Study on Quality Issues of Production Big Data Platform, in: 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering (ICSE). Presented at the 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering (ICSE), pp. 17–26. https://doi.org/10.1109/ICSE.2015.130

Becker, D., King, T.D., McMullen, B., 2015. Big data, big data quality problem, in: 2015 IEEE International Conference on Big Data (Big Data). Presented at the 2015 IEEE International Conference on Big Data (Big Data), IEEE, Santa Clara, CA, USA, pp. 2644–2653. https://doi.org/10.1109/BigData.2015.7364064

Maślankowski, J., 2014. Data Quality Issues Concerning Statistical Data Gathering Supported by Big Data Technology, in: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (Eds.), Beyond Databases, Architectures, and Structures, Communications in Computer and Information Science. Springer International Publishing, pp. 92–101. https://doi.org/10.1007/978-3-319-06932-6_10

Fürber, C., Hepp, M., 2011. Towards a Vocabulary for Data Quality Management in Semantic Web Architectures, in: Proceedings of the 1st International Workshop on Linked Web Data Management, LWDM ’11. ACM, New York, NY, USA, pp. 1–8. https://doi.org/10.1145/1966901.1966903

Corrales DC, Corrales JC, Ledezma A. How to address the data quality issues in regression models: a guided process for data cleaning. Symmetry. 2018;10:99.

Fan, W., 2008. Dependencies revisited for improving data quality, in: Proceedings of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM, pp. 159–170.

Kläs, M., Putz, W., Lutz, T., 2016. Quality Evaluation for Big Data: A Scalable Assessment Approach and First Evaluation Results, in: 2016 Joint Conference of the International Workshop on Software Measurement and the International Conference on Software Process and Product Measurement (IWSM-MENSURA). Presented at the 2016 Joint Conference of the International Workshop on Software Measurement and the International Conference on Software Process and Product Measurement (IWSM-MENSURA), pp. 115–124. https://doi.org/10.1109/IWSM-Mensura.2016.026

Rahm E, Do HH. Data cleaning: Problems and current approaches. IEEE Data Eng Bull. 2000;23:3–13.

Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A., Ilyas, I.F., Ouzzani, M., Tang, N., 2013. NADEEF: A Commodity Data Cleaning System, in: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD ’13. ACM, New York, NY, USA, pp. 541–552. https://doi.org/10.1145/2463676.2465327

Ebaid A, Elmagarmid A, Ilyas IF, Ouzzani M, Quiane-Ruiz J-A, Tang N, Yin S. NADEEF: A generalized data cleaning system. Proceed VLDB Endowment. 2013;6:1218–21.

Elmagarmid, A., Ilyas, I.F., Ouzzani, M., Quiané-Ruiz, J.-A., Tang, N., Yin, S., 2014. NADEEF/ER: generic and interactive entity resolution. ACM Press, pp. 1071–1074. https://doi.org/10.1145/2588555.2594511

Tang N. Big Data Cleaning. In: Chen L, Jia Y, Sellis T, Liu G, editors. Web Technologies and Applications. Lecture Notes in Computer Science: Springer International Publishing; 2014. p. 13–24.


Ge M, Dohnal V. Quality management in big data informatics. 2018;5:19. https://doi.org/10.3390/informatics5020019 .

Jimenez-Marquez JL, Gonzalez-Carrasco I, Lopez-Cuadrado JL, Ruiz-Mezcua B. Towards a big data framework for analyzing social media content. Int J Inf Manage. 2019;44:1–12. https://doi.org/10.1016/j.ijinfomgt.2018.09.003 .

Siddiqa A, Hashem IAT, Yaqoob I, Marjani M, Shamshirband S, Gani A, Nasaruddin F. A survey of big data management: Taxonomy and state-of-the-art. J Netw Comput Appl. 2016;71:151–66. https://doi.org/10.1016/j.jnca.2016.04.008 .

Taleb, I., Dssouli, R., Serhani, M.A., 2015. Big Data Pre-processing: A Quality Framework, in: 2015 IEEE International Congress on Big Data (BigData Congress). Presented at the 2015 IEEE International Congress on Big Data (BigData Congress), pp. 191–198. https://doi.org/10.1109/BigDataCongress.2015.35

Cormode, G., Duffield, N., 2014. Sampling for Big Data: A Tutorial, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14. ACM, New York, NY, USA, pp. 1975–1975. https://doi.org/10.1145/2623330.2630811

Gadepally, V., Herr, T., Johnson, L., Milechin, L., Milosavljevic, M., Miller, B.A., 2015. Sampling operations on big data, in: 2015 49th Asilomar Conference on Signals, Systems and Computers. Presented at the 2015 49th Asilomar Conference on Signals, Systems and Computers, pp. 1515–1519. https://doi.org/10.1109/ACSSC.2015.7421398

Liang F, Kim J, Song Q. A bootstrap metropolis-hastings algorithm for bayesian analysis of big data. Technometrics. 2016. https://doi.org/10.1080/00401706.2016.1142905 .


Satyanarayana, A., 2014. Intelligent sampling for big data using bootstrap sampling and chebyshev inequality, in: 2014 IEEE 27th Canadian Conference on Electrical and Computer Engineering (CCECE). Presented at the 2014 IEEE 27th Canadian Conference on Electrical and Computer Engineering (CCECE), IEEE, Toronto, ON, Canada, pp. 1–6. https://doi.org/10.1109/CCECE.2014.6901029

Kleiner, A., Talwalkar, A., Sarkar, P., Jordan, M., 2012. The big data bootstrap. arXiv preprint

Dai, W., Wardlaw, I., Cui, Y., Mehdi, K., Li, Y., Long, J., 2016. Data Profiling Technology of Data Governance Regarding Big Data: Review and Rethinking, in: Latifi, S. (Ed.), Information Technolog: New Generations. Springer International Publishing, Cham, pp. 439–450. https://doi.org/10.1007/978-3-319-32467-8_39

Loshin, D., 2010. Rapid Data Quality Assessment Using Data Profiling 15.

Naumann F. Data profiling revisited. ACM. SIGMOD Record. 2014;42:40–9.

Buneman, P., Davidson, S.B., 2010. Data provenance–the foundation of data quality.

Glavic, B., 2014. Big Data Provenance: Challenges and Implications for Benchmarking, in: Specifying Big Data Benchmarks. Springer, pp. 72–80.

Wang, J., Crawl, D., Purawat, S., Nguyen, M., Altintas, I., 2015. Big data provenance: Challenges, state of the art and opportunities, in: 2015 IEEE International Conference on Big Data (Big Data). Presented at the 2015 IEEE International Conference on Big Data (Big Data), pp. 2509–2516. https://doi.org/10.1109/BigData.2015.7364047

Hwang W-J, Wen K-W. Fast kNN classification algorithm based on partial distance search. Electron Lett. 1998;34:2062–3.

Taleb, I., Kassabi, H.T.E., Serhani, M.A., Dssouli, R., Bouhaddioui, C., 2016. Big Data Quality: A Quality Dimensions Evaluation, in: 2016 Intl IEEE Conferences on Ubiquitous Intelligence Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld). Presented at the 2016 Intl IEEE Conferences on Ubiquitous Intelligence Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld), pp. 759–765. https://doi.org/10.1109/UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld.2016.0122

Taleb, I., Serhani, M.A., 2017. Big Data Pre-Processing: Closing the Data Quality Enforcement Loop, in: 2017 IEEE International Congress on Big Data (BigData Congress). Presented at the 2017 IEEE International Congress on Big Data (BigData Congress), pp. 498–501. https://doi.org/10.1109/BigDataCongress.2017.73

Deng, Z., Zhu, X., Cheng, D., Zong, M., Zhang, S., n.d. Efficient kNN classification algorithm for big data. Neurocomputing. https://doi.org/10.1016/j.neucom.2015.08.112

Firmani, D., Mecella, M., Scannapieco, M., Batini, C., 2015. On the Meaningfulness of “Big Data Quality” (Invited Paper), in: Data Science and Engineering. Springer Berlin Heidelberg, pp. 1–15. https://doi.org/10.1007/s41019-015-0004-7

Lee YW. Crafting rules: context-reflective data quality problem solving. J Manag Inf Syst. 2003;20:93–119.


Acknowledgements

Not applicable.

This work is supported by fund #12R005 from ZCHS at UAE University.

Author information

Authors and affiliations

College of Technological Innovation, Zayed University, P.O. Box 144534, Abu Dhabi, United Arab Emirates

Ikbal Taleb

College of Information Technology, UAE University, P.O. Box 15551, Al Ain, United Arab Emirates

Mohamed Adel Serhani

Department of Statistics, College of Business and Economics, UAE University, P.O. Box 15551, Al Ain, United Arab Emirates

Chafik Bouhaddioui

Concordia Institute for Information Systems Engineering, Concordia University, Montreal, QC, H4B 1R6, Canada

Rachida Dssouli


Contributions

IT conceived the main conceptual ideas related to the Big Data quality framework and the proof outline. He designed the framework and its main modules, and he also worked on the implementation and validation of some of the framework's components. MAS supervised the study and was in charge of direction and planning; he also contributed to several sections, including the abstract, introduction, framework, implementation, and conclusion. CB contributed to data preparation, sampling, and profiling; he also reviewed and validated all the formulations and statistical modeling included in this work. RD contributed to the review and discussion of the core contributions and their validation. All authors read and approved the final manuscript.

Authors’ information

Dr. Ikbal Taleb is currently an Assistant Professor, College of Technological Information, Zayed University, Abu Dhabi, U.A.E. He got his Ph.D. in information and systems engineering from Concordia University in 2019, and MSc. in Software Engineering from the University of Montreal, Canada in 2006. His research interests include data and Big data quality, quality profiling, quality assessment, cloud computing, web services, and mobile web services.

Prof. M. Adel Serhani is currently a Professor, and Assistant Dean for Research and Graduate Studies College of Information Technology, U.A.E University, Al Ain, U.A.E. He is also an Adjunct faculty in CIISE, Concordia University, Canada. He holds a Ph.D. in Computer Engineering from Concordia University in 2006, and MSc. in Software Engineering from University of Montreal, Canada in 2002. His research interests include: Cloud for data intensive e-health applications, and services; SLA enforcement in Cloud Data centers, and Big data value chain, Cloud federation and monitoring, Non-invasive Smart health monitoring; management of communities of Web services; and Web services applications and security. He has a large experience earned throughout his involvement and management of different R&D projects. He served on several organizing and Technical Program Committees and he was the program Co-Chair of International Conference in Web Services (ICWS’2020), Co-chair of the IEEE conference on Innovations in Information Technology (IIT´13), Chair of IEEE Workshop on Web service (IWCMC´13), Chair of IEEE workshop on Web, Mobile, and Cloud Services (IWCMC´12), and Co-chair of International Workshop on Wireless Sensor Networks and their Applications (NDT´12). He has published around 130 refereed publications including conferences, journals, a book, and book chapters.

Dr. Chafik Bouhaddioui is an Associate Professor of Statistics in the College of Business and Economics at UAE University. He got his Ph.D. from University of Montreal in Canada. He worked as lecturer at Concordia University for 4 years. He has a rich experience in applied statistics in finance in private and public sectors. He worked as assistant researcher in Finance Ministry in Canada. He worked as Senior Analyst in National Bank of Canada and developed statistical methods used in stock market forecasting. He joined in 2004 a team of researchers in finance group at CIRANO in Canada to develop statistical tools and modules in finance and risk analysis. He published several papers in well-known journals in multivariate time series analysis and their applications in economics and finance. His area of research is diversified and includes modeling and prediction in multivariate time series, causality and independence tests, biostatistics, and Big Data.

Prof. Rachida Dssouli is a full professor and Director of Concordia Institute for Information Systems Engineering, Faculty of Engineering and Computer Science, Concordia University. Dr. Dssouli received a Master (1978), Diplome d'études Approfondies (1979), Doctorat de 3eme Cycle in Networking (1981) from Université Paul Sabatier, Toulouse, France. She earned her PhD degree in Computer Science (1987) from Université de Montréal, Canada. Her research interests are in Communication Software Engineering a sub discipline of Software Engineering. Her contributions are in Testing based on Formal Methods, Requirements Engineering, Systems Engineering, Telecommunication Service Engineering and Quality of Service. She published more than 200 papers in journals and referred conferences in her area of research. She supervised/ co-supervised more than 50 graduate students among them 20 PhD students. Dr. Dssouli is the founding Director of Concordia Institute for Information and Systems Engineering (CIISE) June 2002. The Institute hosts now more than 550 graduate students and 20 faculty members, 4 master programs, and a PhD program.

Corresponding author

Correspondence to Mohamed Adel Serhani .

Ethics declarations

Ethics approval and consent to participate, consent for publication, competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article.

Taleb, I., Serhani, M.A., Bouhaddioui, C. et al. Big data quality framework: a holistic approach to continuous quality management. J Big Data 8 , 76 (2021). https://doi.org/10.1186/s40537-021-00468-0


Received : 06 February 2021

Accepted : 15 May 2021

Published : 29 May 2021

DOI : https://doi.org/10.1186/s40537-021-00468-0


Keywords

  • Big data quality
  • Quality assessment
  • Quality metrics and scores
  • Pre-processing


ORIGINAL RESEARCH article

Analysis on the Spatiotemporal Evolutions of Groundwater Hydrochemistry and Water Quality Caused by Over-Extraction and Seawater Intrusion in Eastern Coastal China (provisionally accepted).

  • 1 China Institute of Water Resources and Hydropower Research, China


The over-extraction of groundwater has resulted in seawater intrusion and the southward migration of the saltwater interface, gradually deteriorating the groundwater quality in the Weibei Plain. In this research, groundwater samples were gathered from 46 monitoring wells for shallow groundwater during the years 2006, 2011, 2016, and 2021. The hydrochemical features of regional groundwater and the factors influencing the issue were subjected to statistical analysis. Additionally, the assessment of spatiotemporal variations in groundwater quality was conducted using the customized entropy-weighted water quality index (EWQI) method. The relationship between groundwater over-extraction and the southward intrusion of the saltwater interface was compared. The results of this paper revealed that the Weibei Plain has been in a state of long-term over-extraction of groundwater from 2000 to 2021, with an average annual over-extraction of 118.49 million m³. The groundwater depression cone areas in the northern part of the study area increased from 3247.37 km² to 4581.34 km² from 2006 to 2021, with the center of the cone experiencing a drop in groundwater level from −22 m to −85 m. The saltwater interface shifted southward over 711.71 km² from 2006 to 2021. In groundwater, the high concentrations of TH, TDS, and Cl− were primarily related to the seawater intrusion, while higher concentrations of NO3− were mainly determined by frequent agricultural production. The groundwater hydrochemical types in the study area transitioned from predominantly the HCO3·Ca-Mg type in 2006 to the HCO3-Na type and SO4·Cl-Ca·Mg type in 2021 due to seawater intrusion. The results of PCA and HCA show the effects of seawater intrusion, human activities, and rock weathering on groundwater hydrochemistry. The evaluation results based on the EWQI revealed that the average value of the samples in 2021 was 101.36, which belonged to Class IV water quality standards, representing the poorest water quality among the four years. The southward migration of the saltwater interface led to the deterioration of groundwater quality in groundwater depression cone areas, which gradually worsened from 2006 to 2021. The maximum increase in EWQI value was 174.68 during the period, shifting from Class III water quality to Class V water quality.
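The abstract does not spell out the EWQI computation; a common entropy-weighted water quality index formulation (the paper's customized variant may differ) can be sketched as:

```python
import numpy as np

def ewqi(C, S):
    """Entropy-weighted water quality index, a standard formulation.
    C: (n_samples, n_indicators) measured concentrations
    S: (n_indicators,) permissible limits for each indicator
    """
    # Min-max normalize each indicator across samples
    Y = (C - C.min(axis=0)) / (C.max(axis=0) - C.min(axis=0) + 1e-12)
    P = (Y + 1e-12) / (Y + 1e-12).sum(axis=0)          # probability matrix
    e = -(P * np.log(P)).sum(axis=0) / np.log(len(C))  # information entropy
    w = (1 - e) / (1 - e).sum()                        # entropy weights
    q = 100.0 * C / S                                  # quality rating per indicator
    return (w * q).sum(axis=1)                         # EWQI per sample
```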

Keywords: Seawater Intrusion, Groundwater over-extraction, Water Quality, hydrochemistry, spatiotemporal evolutions

Received: 25 Feb 2024; Accepted: 27 Mar 2024.

Copyright: © 2024 Chen, Wu, Pan and Shi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Mx. Chu Wu, China Institute of Water Resources and Hydropower Research, Beijing, China


New Journal of Chemistry

Evaluating the consistency of rice and paddy quality using four-dimensional fingerprint analysis †

* Corresponding authors

a School of Pharmacy, Shenyang Pharmaceutical University, Shenyang, Liaoning 110016, China E-mail: [email protected] , [email protected]

b School of Life Science and Biopharmaceutics, Shenyang Pharmaceutical University, Shenyang, Liaoning 110016, China E-mail: [email protected]

The aim of this paper is to evaluate the quality of 13 batches of rice and 17 batches of paddy. Differential scanning calorimetry curves, Fourier transform infrared spectra, ultraviolet spectra and electrochemical curves of the 30 batches of rice and paddy were collected. The quantitative fingerprints were evaluated using the systematic quantitative fingerprinting method, with macro qualitative similarity (Sm) and macro quantitative similarity (Pm) reflecting the internal differences among the samples. A new quality assessment method was explored to quantify the obtained fingerprint profiles and to compare the correlation between the original fingerprints and the quantum fingerprints using a t-test. Using the parameters Sm and Pm as analysis criteria, hierarchical clustering analysis (HCA), orthogonal partial least squares discriminant analysis (OPLS-DA) and the equal-weight integrated evaluation method were used to evaluate the quality of the four analysis methods.
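The abstract does not give the similarity formulas; in the systematic quantitative fingerprinting literature, Sm is commonly a cosine-type similarity between a sample fingerprint and the reference, and Pm the sample's total signal relative to the reference, expressed as a percentage. A hedged sketch under those assumed definitions:

```python
import numpy as np

def macro_similarity(sample, reference):
    """Illustrative S_m / P_m computation; the paper's exact
    formulation may differ from these assumed definitions."""
    s = np.asarray(sample, dtype=float)
    r = np.asarray(reference, dtype=float)
    S_m = s @ r / (np.linalg.norm(s) * np.linalg.norm(r))  # macro qualitative similarity
    P_m = 100.0 * s.sum() / r.sum()                        # macro quantitative similarity (%)
    return S_m, P_m
```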

Graphical abstract: Evaluating the consistency of rice and paddy quality using four-dimensional fingerprint analysis

Supplementary files

  • Supplementary information PDF (116K)



Y. Ren, G. Li, T. Yang and G. Sun, New J. Chem. , 2024, Advance Article , DOI: 10.1039/D3NJ05593K



International Conference Interdisciplinarity in Engineering

Inter-ENG 2023: The 17th International Conference Interdisciplinarity in Engineering, pp. 173–182

Analysis of Defects in Quality Management in a Company from the Automotive Industry

  • Petruța Blaga   ORCID: orcid.org/0000-0001-6405-8509 11  
  • Conference paper
  • First Online: 29 March 2024

Part of the book series: Lecture Notes in Networks and Systems ((LNNS,volume 928))

The paper addresses a topic related to quality management in a company in the automotive industry. The purpose of the paper is to investigate a customer complaint handled by the analysis team in the production department, highlighting the defect analysis procedures, the way defects are treated within the company, the relationship established with the customer who complains about certain defects identified in the products, and defect management within the company.

  • quality management
  • defect handling procedures
  • defect analysis


Author information

Authors and Affiliations

George Emil Palade University of Medicine, Pharmacy, Science, and Technology of Targu Mures, Gh. Marinescu Street 38, 540142, Târgu Mureș, România

Petruța Blaga


Corresponding author

Correspondence to Petruța Blaga .

Editor information

Editors and affiliations.

Faculty of Engineering and Information Technology, “George Emil Palade” University of Medicine, Pharmacy, Science and Technology of Targu Mures, Târgu Mureș, Romania

Liviu Moldovan

Adrian Gligor

Rights and permissions

Reprints and permissions

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper.

Blaga, P. (2024). Analysis of Defects in Quality Management in a Company from the Automotive Industry. In: Moldovan, L., Gligor, A. (eds) The 17th International Conference Interdisciplinarity in Engineering. Inter-ENG 2023. Lecture Notes in Networks and Systems, vol 928. Springer, Cham. https://doi.org/10.1007/978-3-031-54671-6_13

Download citation

DOI : https://doi.org/10.1007/978-3-031-54671-6_13

Published : 29 March 2024

Publisher Name : Springer, Cham

Print ISBN : 978-3-031-54670-9

Online ISBN : 978-3-031-54671-6

eBook Packages: Engineering (R0)




Milk Source Identification and Milk Quality Estimation Using an Electronic Nose and Machine Learning Techniques

1 School of Artificial Intelligence, Hebei University of Technology, Tianjin 300130, China; mufanglin111@163.com (F.M.); guyu@mail.buct.edu.cn (Y.G.)

2 School of Engineering, Merz Court, Newcastle University, Newcastle upon Tyne NE1 7RU, UK; [email protected]

In this study, an electronic nose (E-nose) consisting of seven metal oxide semiconductor sensors is developed to identify milk sources (dairy farms) and to estimate the content of milk fat and protein, which are indicators of milk quality. The developed E-nose is a low-cost and non-destructive device. For milk source identification, the odor features from the E-nose, the composition features (Dairy Herd Improvement, DHI analytical data) from DHI analysis, and their fusion features are analyzed by principal component analysis (PCA) and linear discriminant analysis (LDA) for dimension reduction, and then three machine learning algorithms, logistic regression (LR), support vector machine (SVM), and random forest (RF), are used to construct the milk source (dairy farm) identification models. The results show that the SVM model based on the fusion features after LDA has the best performance, with an accuracy of 95%. Estimation models of milk fat and protein content from E-nose features are constructed using gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), and random forest (RF). The results show that the RF models give the best performance (R² = 0.9399 for milk fat; R² = 0.9301 for milk protein) and indicate that the proposed method can improve the estimation accuracy of milk fat and protein, providing a technical basis for predicting the quality of milk.

1. Introduction

Milk contains more than 100 chemical ingredients such as water, fat, phospholipids, proteins, lactose, inorganic salts, and other primary compounds [1, 2]. The composition of milk is very complex. A mixture of lower fatty acids, acetones, acetaldehydes, carbon dioxide, and other volatile substances shapes the odor of milk; among them, sulfide is the main component of fresh milk odor. The flavor substances in milk are influenced by many factors and arise mainly through four routes, one of which is the reaction of milk fat, milk protein, carbonic acid, and related compounds. Triacylglycerols, fatty acids, diacylglycerides, saturated/polyunsaturated fatty acids, and phospholipids in milk fat are directly related to the flavor of milk [3, 4]. The degradation products of protein, fat, and lactose in milk are fatty acids, sulfur-containing amino acids, thiamine, etc., and the decomposition of these substances produces volatile compounds [5, 6, 7]. Because the feed and growth environment of the cows differ between dairy farms, the odor of the milk produced differs considerably [8]. The content of milk protein and milk fat plays a significant role in milk quality evaluation. The degradation of milk fat and milk protein, as well as interactions between their derivatives, can affect the milk's odor compounds [9]. Therefore, the establishment of a milk detection model is of considerable significance for the identification of milk sources and the improvement of milk quality.

The traditional method of identifying milk's geographical origin is through physical tracking, such as recording by experimenters. In recent years, many chemical analysis methods have been used to distinguish the origin of milk, such as stable isotope ratio analysis [10, 11], trace element content analysis, and nuclear magnetic resonance [12]. At present, research based on near-infrared spectroscopy [13], microbiological and physicochemical analysis [14, 15], and DHI laboratory testing has achieved excellent results in the quantitative detection of milk components [16]. However, these methods still suffer from high cost, low detection efficiency, and vulnerability to damage, and they cannot realize real-time detection of milk products. Therefore, it is essential to find a fast and efficient non-destructive testing method.

As a new gas detection and analysis technology, E-nose has reliable portability and simple operation, making food non-destructive testing easier [ 17 , 18 , 19 ]. The E-nose is a low cost digital electronic device that can mimic human olfaction. It can quickly evaluate complex, volatile gas mixtures and has been used in milk recognition, differentiation, and detection [ 20 , 21 ]. Bougrini et al. [ 22 ] used a hybrid E-nose and a voltammetric E-tongue to distinguish different pasteurized milk brands and their storage day. Tong et al. [ 23 ] analyzed the concentration of volatile substances in pre-heated skimmed milk using an E-nose and found that there was a good relationship between volatile compounds and sensory attributes through partial least squares regression (PLSR) model analysis. Although E-nose has been applied to the detection of dairy products, its performance still needs improvement. Analyzing E-nose signals using advanced machine learning techniques would enhance detection and estimation performance [ 24 ].

Therefore, this study proposes a fast identification method based on E-nose technology and machine learning techniques for milk source (dairy farm) identification and milk quality estimation. The developed E-nose system is mainly composed of a gas sensor array consisting of seven metal oxide semiconductor (MOS) sensors (FIGARO, Osaka, Japan) and a data acquisition module consisting of Arduino hardware and software modules. The collected gas information is transmitted to the PC through analog-to-digital conversion. After the data are preprocessed, pattern recognition algorithms are used for modeling to achieve the detection target. Based on three different classification algorithms, logistic regression (LR), support vector machine (SVM), and random forest (RF), the milk source identification models are developed and compared. Gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), and RF are used to construct models that estimate the content of milk fat and milk protein by utilizing historical data from E-nose measurements and DHI analytical measurements.

2. Materials and Methods

2.1. The Developed E-Nose

The developed E-nose is an electronic system that mimics an animal's olfactory organs and uses sensor array responses to identify odors. The working process is as follows: first, the gas sensors react with the sample gas; the response signal is then transmitted to a PC through an analog-to-digital (A/D) converter; after data preprocessing, a model is built with a pattern recognition algorithm to complete the detection. The E-nose developed in this study (length: 30 cm, width: 20 cm, height: 20 cm) is composed of a gas sensor array module, a signal and data acquisition module, and a signal processing and pattern recognition module, as shown in Figure 1.

Figure 1. Structure diagram of the E-nose system.

The E-nose device designed in this study is divided into two layers. The upper layer is for gas collection and the gas-sensor reaction; it includes transmission pipes, filter devices, intake pumps, exhaust pumps, and a gas chamber containing the gas sensor array. The upper wall is provided with two power ports and two air holes. The power ports provide power to the air pumps. The air holes are divided into intake holes and exhaust holes connected to the sampling test tube or the external environment. The lower layer is for the collection of response signals and data, including the Arduino development board and expansion board, and a USB connection port is set in the lower-layer wall to handle power supply and data transmission.

The sensitivity of each sensor in the array to the measured gas is different, so the system uses the response resistance values to identify the odor. In this study, metal oxide semiconductor (MOS) sensors are selected as the E-nose gas sensors because of their fast response speed, high sensitivity, and strong stability. Ghasemi et al. [25] selected an E-nose device composed of TGS (Taguchi gas sensor) 2600, TGS2610, and TGS2620 (FIGARO, Osaka, Japan) sensors to classify different types of cheeses. Sivalingam et al. [26] developed an E-nose prototype with an array of TGS 2620, 822, and 813 (FIGARO, Osaka, Japan) sensors for real-time quality analysis of raw milk. Based on the characteristics of the sensors and the above applications, seven MOS sensors (FIGARO, Osaka, Japan) were built into the E-nose gas sensor array. Table 1 shows the names of the gas sensors and the corresponding sensitive substances.

Table 1. Gas sensor information in the E-nose system.

The sensitive element of the Figaro sensor is composed of SnO2 semiconductors. When the sample volatiles enter the collection system from the sampling tube and contact the heated metal oxide sensor array, the sensor resistivity G changes, and its ratio to the initial resistivity G0, G/G0 (relative resistivity), changes accordingly. When the gas concentration becomes larger, G/G0 deviates from 1 (greater than or less than 1). If the gas concentration is lower than the detection limit or there is no induction gas, it is close or equal to 1.
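For illustration, the steady-state relative resistivity feature used later in the paper could be extracted from a recorded trace as follows (the function and variable names are hypothetical):

```python
import numpy as np

def steady_state_feature(trace, g0, tail=10):
    """Steady-state G/G0 for one sensor: average the last `tail`
    points of the capture window and divide by the clean-air value G0."""
    g = np.asarray(trace, dtype=float)
    return g[-tail:].mean() / float(g0)
```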

The signal and data acquisition module is designed using Arduino hardware and software. In the developed E-nose system, the Arduino functions are: (1) obtaining the response values of the sensors; (2) processing data and communicating with the computer. The microcontroller on the development board is programmed using the Arduino programming language, and the program is compiled into a binary file and passed to the microcontroller. Each sensor in the array has its response to different volatile substances digitally converted through a multiplexed analog-to-digital converter (ADC), and the obtained data are stored for subsequent computer analysis, identification, and extraction of related features. The processed digital signal is transmitted to the host computer through the serial port and finally presented on the serial port monitor.
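On the host side, such serial output could be read with pyserial; the port name, baud rate, and comma-separated line format below are assumptions for illustration, not specifications from the paper:

```python
import serial  # pyserial

# One line per second, seven comma-separated sensor readings per line (assumed format)
with serial.Serial("/dev/ttyUSB0", 9600, timeout=2) as port:
    readings = []
    for _ in range(90):  # roughly the 90 s capture window at 1 Hz
        line = port.readline().decode("ascii", errors="ignore").strip()
        if line:
            readings.append([float(v) for v in line.split(",")])
```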

The signal processing and pattern recognition module in the E-nose system plays the decision-making role. The original data from the sensors contain a great deal of complex, high-dimensional information, much of which is useless. Therefore, before inputting the data into the pattern recognition system, it is necessary to preprocess the original data, which mainly involves standardization, feature selection, and feature dimensionality reduction, so as to retain the valid information. The pattern recognition system can solve classification and regression problems by selecting different machine learning algorithms for modeling. With reasonable model parameters set according to the detection target, both binary and multi-class classification problems can be handled.

In addition, sample gas collection and cleaning are realized by the gas collection system, which is composed of three parts: a filter, a gas chamber, and air pumps. Activated carbon is used as a gas desiccant to achieve gas filtration. The gas chamber, which is strongly sealed to maintain the gas concentration, is where the sample gas contacts and reacts with the sensors. The air pumps provide power for gas transmission. The following parameters were used: a cleaning time of 60 s, a gas capture time of 90 s, and a gas flow rate of 200 mL/min (range: 10 mL/min–1.1 L/min).

2.2. Milk Samples

Raw milk samples were collected from 10 cattle farms in Hebei Province. The test cows to which the samples belong are lactating cows from 6 days postpartum to 6 days before the end of lactation. These samples were initially screened: samples with low liquid levels and sub-standard temperatures were rejected. DHI (Dairy Herd Improvement) machine detection can produce null values, so interfering values needed to be eliminated before the experiment. Finally, 100 milk samples from each of the 10 cattle farms were taken during the same time period for DHI analysis and E-nose measurement. In this study, three measurements were taken for each milk sample and averaged to reduce measurement error. The DHI test samples and the E-nose test samples are the same.

2.2.1. DHI Analytical Data

The milk composition features (DHI analytical data) were obtained from the DHI laboratory analysis using imported biochemical detection equipment, including a milk composition analyzer, a somatic cell counter, a fresh-keeping cabinet, and other facilities. The milk sample test temperature was 40 ± 2 °C. Six indicators were tested: milk fat rate, protein rate, lactose rate, total solids, somatic cell count (SCC), and urea nitrogen.

Milk fat contains linolenic acid, arachidonic acid, and various fat-soluble vitamins and phospholipids, which are needed by the human body [27]. The content of fat and protein is an essential indicator for evaluating milk quality. In regular milk, the ratio of milk fat to milk protein ranges from 1.12 to 1.30; if the value is too low, the cow may have rumen acidosis. The content of lactose in milk is generally between 4.6% and 5%; its value not only affects milk production but also relates to rumen function. Somatic cells are a collective term for the macrophages, lymphocytes, and polymorphonuclear neutrophils in milk, and the somatic cell count (SCC) is an indicator of the extent of cow mastitis infection [28]. The SCC indicates the health status of the cows' milk and is usually less than 50 × 10⁴/mL. Urea nitrogen in milk is derived from the blood and typically ranges from 10 mg/dL to 18 mg/dL [29]. An excessive urea nitrogen content indicates that cows are more likely to suffer from acidosis [30].

2.2.2. E-Nose Measurements

An amount of 20 mL of each milk sample was placed in a 40 mL test tube, sealed, heated in a 40 °C water bath, and left for 10 min to ensure that the milk sample's volatile matter filled the entire test tube before the sample gas was drawn. Before volatile gas capture, the airway and gas chamber of the E-nose were cleaned with fresh air to eliminate interfering gases.

The measuring probe of the E-nose and the pressure-balancing tube were simultaneously extended into the headspace of the test tube. During gas capture, the filtered headspace gas of the milk sample is drawn into the gas chamber by the gas collection system, where it contacts and reacts with the sensors. The response value then increases and tends toward a steady state; this process lasts 90 s at a gas flow rate of 200 mL/min. During the cleaning process, the filtered air gradually removes the volatile gas, and the response value decreases and stabilizes at a constant value, completing one sample measurement; this process lasts 60 s. Each sample was measured three times, and the results were averaged to reduce experimental error. The E-nose detection process is shown in Figure 2. The obtained milk odor data are the relative resistivity ratios (G/G0) of the sensor array under the sample gas and the pure air environment in the steady state.

Figure 2. E-nose detection structure.

2.3. Data Analysis

In this study, 10 different sources of milk (dairy farms) were selected, and the volatile gas in the milk samples was measured using the E-nose. A total of 1000 sample data points, normalized with the z-score transform of Equation (1), were used for model development; 800 samples were used as the training set and the remaining 200 samples as the testing set.

$$z = \frac{X - \bar{X}}{\sigma} \quad (1)$$

where X is the original data, X̄ is the mean of the original data, and σ is the standard deviation.
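For illustration, Equation (1) corresponds to scikit-learn's StandardScaler; a minimal sketch on stand-in data (the array shape mirrors the 1000-sample, 7-sensor dataset, but the values here are synthetic):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 7))  # synthetic stand-in for E-nose data

scaler = StandardScaler()        # z-score: (X - mean) / standard deviation, per feature
X_std = scaler.fit_transform(X)

# Each column now has mean ~0 and unit variance
assert np.allclose(X_std.mean(axis=0), 0.0, atol=1e-9)
assert np.allclose(X_std.std(axis=0), 1.0, atol=1e-9)
```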

For the development of cattle farm classification models (or milk source identification models), principal component analysis (PCA) and linear discriminant analysis (LDA) are used to reduce the dimensions of model inputs. Three machine learning algorithms, LR, SVM, and RF, are then used to construct the classification models.

For the development of milk fat content and protein content estimation models using E-nose and DHI data, three nonlinear modeling algorithms, including GBDT, XGBoost, and RF, were used and compared.

2.3.1. SVM

SVM is a supervised learning model that can perform pattern recognition, classification, and regression analysis [31, 32]. The principle of SVM is to find the separating hyperplane that correctly divides the classes in the training set with the largest geometric margin. The soft-margin SVM objective function can be written as

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\xi_{i} \quad \text{s.t.}\quad y_{i}\left(w^{\top}\phi(x_{i}) + b\right) \geq 1 - \xi_{i},\quad \xi_{i} \geq 0, \quad (2)$$

where w and b are the SVM parameters, ξᵢ is the classification loss of the i-th sample point, φ(xᵢ) is the mapping function, C is the penalty parameter, xᵢ is the i-th input sample with label yᵢ, and n is the number of training samples.

For nonlinear classification problems, the kernel (mapping) function in SVM maps samples from the original space to a high-dimensional space, making the samples linearly separable in the new space. The most commonly used and most effective kernel is the radial basis function (RBF) kernel:

$$K(x_{1}, x_{2}) = \exp\left(-\gamma\,\|x_{1} - x_{2}\|^{2}\right) \quad (3)$$

where x₁ and x₂ are sample points of the training set, and the parameter γ (gamma) defines the range of influence of a single training example, with low values meaning 'far' and high values meaning 'close'.

2.3.2. RF

Random forest is a crucial bagging-based ensemble learning method composed of many decision trees (CARTs). It can be used to solve classification and regression problems, has strong anti-noise ability, and can avoid overfitting. The procedure for developing an RF model is as follows: first, m sample points are extracted from the training sample set S to form a new training subset; second, a classification or regression decision tree is constructed for each training subset by randomly selecting k of all the features as split nodes; the output of the model is the category with the highest number of votes (classification) or the average output of the decision trees (regression) [33].

2.3.3. LR

Logistic regression is a supervised machine learning algorithm for solving classification problems. Its principle is to minimize a loss function so that the prediction function becomes more accurate, thereby solving the classification problem. The penalty term is a vital hyperparameter of the LR model, and the solver parameter determines how the loss function is optimized [34].

2.3.4. GBDT

Gradient boosting decision tree is an ensemble boosting algorithm based on CART learners [35]. In each round of iteration, it minimizes the loss function of the current learner so that the loss always decreases along its gradient direction; through continuous iteration, the final residuals approach 0, and all the tree outputs are summed to obtain the final prediction.

2.3.5. XGBoost

The extreme gradient boosting algorithm is an improved version of GBDT that is not sensitive to input requirements and is widely used in industry. Compared with the general GBDT algorithm, XGBoost uses the second derivative of the loss function with respect to the function being learned, adds a regularization term to prevent overfitting, and samples the attributes when constructing each tree. It offers fast training speed, high accuracy, and a good fitting effect [36].

3. Results and Discussion

3.1. Response Curve and Radar Chart Analysis of the E-Nose

Figure 3 a–c shows the sample’s response curves in 90 s sampling for three measurements. During the contact between the gas and the sensor surface, the ratio G/G 0 (relative resistivity) keeps rising, and finally reaches a steady state in about 60 s. Among the seven sensors, the responses of S1, S2, and S4 are significant.

Figure 3. Response curves and radar chart for E-nose data: (a–c) response curves of the E-nose; (d) radar chart of the E-nose.

Steady-state values of the E-nose sensor responses (collected at 90 s) for one sample randomly selected from each farm are used to produce the radar chart shown in Figure 3d, where each vertical axis represents a sensor. It can be seen from Figure 3d that the response values of sensor 1, sensor 2, and sensor 4 vary significantly with the cattle farm. Judging from the response curves and radar chart, the samples of different farms are distinguishable. Therefore, milk from different cattle farms could be identified based on E-nose measurement data.

3.2. Milk Source (Dairy Farm) Identification

In this study, the steady-state (90 s) value of the E-nose response is selected as the E-nose feature parameter, and the DHI analytical data represent the composition of the milk. A feature fusion method based on the milk composition features and odor features is proposed for identifying different cattle farms. During model construction, the DHI analytical data, E-nose measurements, and fusion features are used as model inputs, respectively, to evaluate and compare the classification results. During data preprocessing, PCA and LDA are used to reduce the dimensions of the different features while retaining the valid information. Milk source identification models are then developed using the support vector machine (SVM), random forest (RF), and logistic regression (LR) algorithms. The models are developed on the 800 training samples and tested on the remaining 200 testing samples to verify the developed models.

3.2.1. Results of Data Dimensionality Reduction

The original DHI analytical data (six dimensions), the E-nose measurements (seven dimensions), and the fused DHI and E-nose measurements (13 dimensions) were analyzed by PCA. The cumulative variance explanation rates of the first two principal components (PCs) for these three cases are 99.908%, 95.96%, and 94.81%, respectively. Among them, PC1 and PC2 of the DHI analytical data represent 99.9% and 0.008% of the data variation, respectively; PC1 and PC2 of the E-nose measurements represent 88.38% and 7.58%, respectively; and PC1 and PC2 of the fusion data represent 55.72% and 39.09%, respectively.

Figure 4 a–c shows the scatter plots in the principal component subspace, where the ten farms are color-coded. It can be seen from Figure 4 a that the farms are randomly distributed and cannot be distinguished from the first two PCs of the DHI analytical data. Compared with Figure 4 a, the first two PCs of the E-nose measurements in Figure 4 b show more grouping of the farms, but it is still impossible to distinguish them. In Figure 4 c, the first two PCs of the fusion data show more separations than the other two cases.

Figure 4. Visualization of data dimensionality reduction: (a) Dairy Herd Improvement (DHI) data dimension reduction results by principal component analysis (PCA); (b) E-nose data dimension reduction results by PCA; (c) fusion data reduction results by PCA; (d) DHI data dimension reduction results by linear discriminant analysis (LDA); (e) E-nose data dimension reduction results by LDA; (f) fusion data reduction results by LDA.

The LDA method was used to reduce the dimensionality of the original data, and the cumulative variance of the linear discriminant function in the three cases was 99.53%, 93.11%, and 91.5% ( Figure 4 d–f). In particular, LD1 and LD2 of DHI analytical data represent 98.84% and 0.69% of data variance respectively; LD1 and LD2 of E-nose measurements represent 84.63% and 8.48% of data variance respectively; LD1 and LD2 of the fusion data represent 51.93% and 39.57% of data variance respectively.

Although the original data after PCA dimensionality reduction are more comprehensive, the differences in data distribution between cattle farms are more significant after LDA dimensionality reduction. In particular, the dimensionality reduction results for the fusion data achieve rapid differentiation, which indicates that the samples are sufficiently representative and that the LDA dimensionality reduction method is well suited to the milk sample data.
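As an illustration of the two reduction methods, a minimal scikit-learn sketch on stand-in data (synthetic values; only the shapes match the 13-dimensional fusion features and 10 farm labels):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 13))     # stand-in fusion features (DHI + E-nose)
y = rng.integers(0, 10, size=1000)  # stand-in labels for the 10 dairy farms

pca = PCA(n_components=2).fit(X)    # unsupervised: ignores the labels y
X_pca = pca.transform(X)
print("PCA explained variance:", pca.explained_variance_ratio_)

lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)  # supervised: uses y
X_lda = lda.transform(X)
```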

3.2.2. Model Validation and Analysis

From each cattle farm, 80 samples are used for training and 20 for testing, giving a total of 800 training samples and 200 testing samples across the ten cattle farms. The SVM, RF, and LR methods are used to classify the milk sources after PCA and LDA dimension reduction. The classification accuracy is based on the 200-sample testing set and is calculated as

$$\text{Accuracy} = \frac{TP}{TP + FP} \quad (4)$$

where TP (true positive) is the number of samples whose dairy farm was correctly classified and FP (false positive) is the number of samples whose dairy farm was incorrectly classified.

In the SVM classification model, the radial basis function (RBF) is used as the kernel, and the penalty parameter C and kernel parameter γ are set to 10 and 0.1, respectively, which give the best classification results.

The number of decision trees (N) is an important parameter of the RF-based model classification model. The larger N is, the better the model tends to perform. However, a high N value leads to longer training time and more memory consumption. It is found that the classification performance is best when the value of N is 4 in this study.

In the LR-based model, the L2 penalty term, which corresponds to a Gaussian prior, is selected to avoid overfitting the model and to obtain results with stronger generalization capability. The loss function is iteratively optimized with a solver that uses the second-order derivative (Hessian) matrix.

The milk source identification models are constructed with different milk features, including the odor features obtained from the E-nose, the composition features obtained from DHI analysis, and the fusion features; the SVM, RF, and LR models based on these features after PCA or LDA dimensionality reduction are compared, as shown in Table 2. During training, the five-fold cross-validation method is used to prevent overfitting: the training set is randomly divided into five subsets, each time a different subset is used as the validation set to obtain an accuracy rate, and the mean accuracy over the subsets is reported. The classification performance with PCA dimensionality reduction is significantly worse than that with LDA dimensionality reduction. The reason is that PCA does not consider the category labels during dimensionality reduction, whereas LDA is a supervised learning method with category output [37]; each sample of the dataset for LDA has a category output, so the LDA dimension reduction method is more geared toward classification than the PCA method. A sketch of this training setup in code is given below.
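A minimal sketch of the training setup, assuming scikit-learn and using the parameter values reported above (C = 10, γ = 0.1 for SVM; N = 4 trees for RF); the data here are synthetic stand-ins, not the paper's measurements:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(800, 2))      # stand-in for LDA-reduced training features
y = rng.integers(0, 10, size=800)  # stand-in farm labels

models = {
    "SVM": SVC(kernel="rbf", C=10, gamma=0.1),     # C and gamma as reported
    "RF": RandomForestClassifier(n_estimators=4),  # N = 4 found best here
    "LR": LogisticRegression(penalty="l2", solver="newton-cg", max_iter=1000),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)    # five-fold cross-validation
    print(name, scores.mean())
```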

Table 2. Accuracy (mean of five-fold cross-validation) in milk source identification based on PCA and LDA (%).

When the input is the fusion features after LDA reduction, the models have the best classification performance, with accuracies of 95% for SVM, 94% for RF, and 92.5% for LR (based on the testing set). For the E-nose features after LDA, the SVM model performs best, with an accuracy of 85.75% on the training set and 85% on the testing set. For the DHI features after LDA, the accuracy ranged from 53.5% to 58.5% (based on the testing set); neither the training set nor the testing set achieved a satisfactory classification effect. The results indicate that for milk source identification, SVM performs better than the RF and LR models. The feature fusion method effectively solves the problem of missing information in a single feature: the fusion features contain both the composition and odor information of the milk, which effectively improves the classification effect of the model.

3.3. Estimation Models of Milk Fat Content and Protein Content by E-Nose

3.3.1. Model Performance Indicators

In order to explore the established model, the following indicators are used to comprehensively evaluate the developed models.

  • (1) Mean absolute error (MAE):

$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_{i} - \hat{y}_{i}\right| \quad (5)$$

  • (2) Mean squared error (MSE):

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_{i} - \hat{y}_{i}\right)^{2} \quad (6)$$

  • (3) Coefficient of determination (R²):

$$R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(y_{i} - \hat{y}_{i}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i} - \bar{y}\right)^{2}} \quad (7)$$

In the above equations, n is the number of samples, yᵢ is the actual value, ŷᵢ is the predicted value, and ȳ is the mean of the actual values.
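These three indicators can also be taken from sklearn.metrics (mean_absolute_error, mean_squared_error, r2_score); a literal transcription of Equations (5)–(7):

```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))       # Equation (5)

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)        # Equation (6)

def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot            # Equation (7)
```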

3.3.2. Comparison of Different Models

Based on the above evaluation indicators, the three models developed using GBDT, XGBoost, and RF are evaluated and compared. The milk fat content and protein content are used as the model outputs, and the seven sensor outputs from the E-nose are used as inputs to establish the milk quality estimation models. The performance indices of the developed models on the training set and testing set are shown in Table 3 and Table 4, and the model errors on the testing set are shown in Figure 5.
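A minimal sketch of this model comparison, assuming scikit-learn and the xgboost package are installed; the data are synthetic stand-ins, not the paper's measurements:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from xgboost import XGBRegressor  # assumes xgboost is installed

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 7))                       # stand-in: 7 E-nose sensor features
y = 3.5 + 0.2 * X[:, 0] + rng.normal(0, 0.05, 1000)  # stand-in milk fat rate

for model in (GradientBoostingRegressor(), XGBRegressor(), RandomForestRegressor()):
    model.fit(X[:800], y[:800])                      # 800 training samples
    print(type(model).__name__, r2_score(y[800:], model.predict(X[800:])))
```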

Figure 5. Model estimation error: (a) model errors for fat; (b) model errors for protein.

Table 3. Estimation models for fat content based on three algorithms.

Table 4. Estimation models for protein content based on three algorithms.

As can be seen from Table 3, Table 4, and Figure 5, among the different modeling methods, the RF method provides the best performance. Compared with the GBDT and XGBoost models, the RF model has the smallest estimation error; it gives smaller MAE and MSE values and larger R² values than the other two modeling methods. In fact, both the XGBoost and RF models estimate very well, with differences of only 0.02 for milk fat and 0.01 for milk protein in their R² values. The results prove the effectiveness of the E-nose method for estimating the milk fat and protein rates.

4. Conclusions

Milk source identification and the estimation of milk fat content and protein content using an E-nose with machine learning techniques are studied in this paper. The developed E-nose is composed of a gas sensor array module, a signal and data acquisition module, and a signal processing and pattern recognition module. For the rapid identification of milk sources, LR, SVM, and RF are used, in conjunction with PCA and LDA dimension reduction, to construct the classification models. Classification models using DHI features, E-nose features, and fusion features are investigated and compared. It is shown that milk source identification models using LDA-extracted features as inputs perform better than those using PCA-extracted features, because, in contrast to PCA, LDA is a supervised learning method that considers the categories of the data samples. The results show that the SVM model based on the fusion features after LDA has the best performance, with an accuracy of 95%. Therefore, the feature fusion method can effectively improve the classification effect of the model.

For the estimation of milk fat content and protein content using E-nose measurement data, the GBDT, XGBoost, and RF algorithms were used to establish the estimation models. The RF model has the best fitting performance, with R² values of 0.9399 and 0.9301 for fat and protein content, respectively. The experimental results show that milk quality can be accurately estimated from E-nose measurements using machine learning techniques. Further work on enhancing model accuracy and reliability will be carried out.

Author Contributions

All authors contributed extensively to the study presented in this manuscript. L.Z. proposed the research concept of this paper, provided experimental samples and supervised the work. F.M. designed the model and performed the experiments. Y.G., J.Z. and L.Z. revised the paper. All authors have read and agreed to the published version of the manuscript.

The Key R&D Program of Hebei Province (19226613D and 20326602D) and the Key R&D Program of Shijiazhuang (201500522A) supported this work.

Conflicts of Interest

All the authors declare no conflict of interest.


  • Open access
  • Published: 11 September 2023

Analysis of the research progress on the deposition and drift of spray droplets by plant protection UAVs

  • Qin Weicai 1 , 2 &
  • Chen Panyang 3  

Scientific Reports volume  13 , Article number:  14935 ( 2023 ) Cite this article

1153 Accesses

1 Citation

Metrics details

  • Plant sciences

Plant protection unmanned aerial vehicles (UAVs), which are highly adapted to terrain and capable of efficient low-altitude spraying, will be extensively used in agricultural production. In this paper, the single or several independent factors influencing the deposition characteristics of droplets sprayed by plant protection UAVs, as well as the experimental methods and related mathematical analysis models used to study droplet deposition and drift, are systematically reviewed. A research method based on farmland environmental factors is proposed to simulate the deposition and drift characteristics of spray droplets. Moreover, the impacts of multiple factors on the droplet deposition characteristics are further studied by using an indoor simulation test system that reproduces the spraying flow field of plant protection UAVs together with temperature, humidity and natural wind. By integrating the operation parameters, environmental conditions, crop canopy characteristics and rotor airflow, the main effects and interactive effects of the factors influencing the deposition of spray droplets can be explored. A mathematical model that can reflect the internal relations of multiple factors and evaluate and analyze the droplet deposition characteristics is established. A scientific and effective method for determining the optimal spray droplet deposition is also proposed. In addition, this research method can provide a necessary scientific basis for the formulation of operating standards for plant protection UAVs, the inspection and evaluation of operating tools at the same scale, and the improvement and upgrading of spraying systems.


Introduction

In agriculture, aerial spraying is widely used to apply fertilizers, herbicides, fungicides and other materials used for crop protection 1. Compared with large fixed-wing agricultural aircraft, small unmanned aerial vehicles (UAVs) are particularly advantageous because they are highly maneuverable and do not need an airport for taking off or landing 2. In recent years, aerial machinery for plant protection, especially aerial spraying by small plant protection UAVs, has developed rapidly 3. Small plant protection UAVs have great application prospects in agricultural production because of their better terrain adaptability and low-altitude spraying capability (Figs. 1 and 2) 4,5,6,7. However, as an emerging technology, UAV spraying for agricultural pest control is not yet common due to the lack of operational standards and uncertainty about the best spraying parameters, which leads to a series of problems, such as poor uniformity of droplet deposition distribution and low levels of droplet deposition.

figure 1

Single-rotor UAV spraying.

figure 2

Multirotor UAV spraying.

Some studies have shown that if the aerial spraying parameters are not set scientifically, the result will be not only repeated spraying and missed spraying, degrading the effect of pest control, but also pesticide drift 8. The use of new pesticide additives and the innovative research and development of precise spraying equipment for plant protection UAVs, along with their safe and efficient use in the prevention and control of diseases, pests and weeds, are indispensable means of increasing the amount of pesticide deposition and reducing drift. Studying the deposition characteristics of spray droplets is not only of scientific significance for the development of new pesticide formulations and precise spraying equipment for plant protection UAVs but also of practical guiding significance for the safe and efficient use of pesticides in farmland. Because of the many factors involved, such as the natural environment, pesticide characteristics, crop canopy characteristics, and plant protection UAV operating parameters, studying the uniformity and penetration of spray droplets is a complicated process. To improve the spraying effect and reduce drift, scientific and technological staff all over the world have carried out a large number of exploratory studies on the deposition and drift characteristics of spray droplets through field experiments, wind tunnel experiments and mathematical model analysis 9,10,11,12,13. The main and secondary factors influencing the characteristics of droplet deposition and drift have been identified from among the many influencing factors (nozzle, droplet, aircraft type, weather factors, etc.), and the functional relationships between the deposition and drift of droplets and their influencing factors have been determined. However, there are not sufficient deposition models for plant protection rotor UAVs, and the existing models consider only a few influencing factors, so they need to be further modified.

With the development of UAV technology, there are an increasing number of studies on droplet deposition rules, operation parameter optimization and evaluation methods for pesticides applied by plant protection UAVs in rice and maize fields 14,15,16,17. However, these studies have defects: the meteorological factors in the farmland environment are unstable and uncontrollable; the UAV track easily deviates, resulting in poor uniformity of the droplet deposition distribution (the coefficient of variation may be above 40% 16, while it is usually below 10% for spraying by ground equipment); the test results cannot be well repeated; and different types of UAVs cannot easily be evaluated at the same scale. Thus, it is difficult to evaluate the droplet deposition characteristics of different types of UAVs scientifically. Some research has established mathematical models to study the impact of plant protection UAV operating parameters (operating height, operating speed, and spraying flow rate) on droplet deposition and drift characteristics 18,19,20 and determined the main effects influencing droplet deposition. However, because the assumptions of these models do not conform to farmland practice, they neglect the influence of crop canopy characteristics and the interaction of multiple factors, such as the environment, crops, and operating parameters of application equipment, on the droplet deposition characteristics (uniformity of distribution and penetration), making the results obtained with the existing mathematical models deviate greatly from practice.

In this paper, the current status of and problems in research on the deposition and drift of spray droplets from plant protection drones are introduced, and the importance of research in this area for improving the effectiveness of pesticide application and reducing drift hazards is emphasized. The need for more in-depth, comprehensive and systematic research on the deposition and drift of spray droplets from plant protection drones is highlighted, and the problems and challenges of current research are pointed out, providing important guidance and references for future research.

Research on the influencing factors of spray droplet deposition characteristics

Studying droplet deposition characteristics (uniformity and penetration) is always a major subject in pesticide application technology research 21 . The deposition characteristics of spray droplets are influenced by application techniques and equipment, crops, the environment, etc. Detailed influencing factors include the wind speed, wind direction, leaf area index, target crop canopy structure, leaf inclination, leaf surface characteristics, and characteristics of the spray droplet population (release height, release rate, application liquid volume, spray droplet particle size spectrum) 22 , 23 , 24 .

Several studies have investigated the influence of various factors on droplet deposition characteristics in plant protection UAV spraying. Diepenbrock noted that plant leaf characteristics, such as size, inclination angle, drooping degree, and spatial arrangement, impact the composition quantity and distribution quality within the crop canopy structure, subsequently affecting droplet penetration and deposition 25 . Song et al. found that altering the initial velocity of droplets increases deposition amounts on horizontal and vertical targets. Factors like flying altitude and speed of different aircraft types have been extensively studied for their influence on droplet deposition and drift 26 . Qiu et al. used an orthogonal experimental method to study the deposition distribution rules of droplets sprayed by unmanned helicopters at different flying heights and speeds under field conditions. They established a relationship model that clarifies the interactions between deposition concentration, uniformity, flying speed, and flying height, providing valuable insights for optimizing spray operation parameters 18 . Chen et al. investigated the pattern of aerial spray droplet deposition in the rice canopy using a small unmanned helicopter. They explored the effects of different spraying parameters on droplet distribution, specifically analyzing the deposition of growth regulator spraying 27 . Wang et al. proposed a method for testing the spatial mass balance of UAV-applied droplets and conducted field experiments on three types of UAVs to accurately determine the spatial distribution of the droplets and the downdraft field. They also conducted an experimental study on the droplet deposition pattern of hovering UAV variable spraying and highlighted the significant impact of downward swirling airflow on droplet deposition distribution 14 . Qin et al. focused on the influence of spraying parameters, such as operation height and velocity, of the UAV on droplet deposition on the rice canopy and protection efficacy against plant hoppers, using water-sensitive paper to collect droplets and statistically analyzing their coverage rates. The findings indicated that UAV spraying exhibited a low-volume and highly concentrated spray pattern 19 .
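As an illustration of the kind of relationship model referenced above, one could fit a quadratic response surface linking deposition to flying height and speed by least squares; the data and coefficients below are purely illustrative and are not taken from any cited study:

```python
import numpy as np

# Hypothetical trial data: flying height h (m), flying speed v (m/s), deposition d
h = np.array([1.5, 1.5, 2.5, 2.5, 3.5, 3.5, 2.0, 3.0, 2.5])
v = np.array([2.0, 4.0, 2.0, 4.0, 2.0, 4.0, 3.0, 3.0, 3.0])
d = np.array([1.10, 0.85, 0.95, 0.70, 0.78, 0.55, 0.88, 0.72, 0.80])

# Design matrix for d = b0 + b1*h + b2*v + b3*h*v + b4*h^2 + b5*v^2
A = np.column_stack([np.ones_like(h), h, v, h * v, h**2, v**2])
coef, *_ = np.linalg.lstsq(A, d, rcond=None)  # least-squares fit of the surface
print(dict(zip(["b0", "b1", "b2", "b3", "b4", "b5"], coef.round(3))))
```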

In summary, many factors influence the deposition characteristics (uniformity and penetration) of spray droplets. However, most current research on spraying by plant protection UAVs considers only the influence of factors such as flying height and flying speed on droplet deposition in the field environment. Regarding the influence of the interaction between environmental factors, crop canopy characteristics (growth stage, leaf area index, leaf inclination angle) and plant protection UAV spraying parameters on droplet deposition characteristics, there is neither in-depth understanding nor relevant reporting, especially under controllable environmental conditions (Fig. 3). To promote high-efficiency spraying technology for plant protection UAVs, targeted basic research should be carried out on the influencing factors of plant protection UAV spraying and the optimal deposition of droplets.

figure 3

Description of the deposition and drift with rotor UAV spraying.

Research on the experimental means and testing methods of droplet deposition and drift

At present, the deposition and drift of droplets are mainly studied through field tests and wind tunnel tests 28 , 29 , 30 , 31 , 32 . Field tests on pesticide deposition and drift reflect actual operating conditions, but it is difficult to acquire reliable data because meteorological factors such as the wind speed, wind direction, temperature, and humidity change constantly. In addition, Hilz and Vermeer noted that the terrain and plant morphology also influence the wind flow and droplet deposition, leading to considerable deviation among repeated test results 33 . It is therefore difficult to accurately determine the total amount and distribution of pesticides drifting in the air 34 . A wind tunnel laboratory, by contrast, provides a controllable environment that simulates external spraying conditions, in which the wind speed and direction can be easily adjusted. It is thus an important means of studying the drift characteristics of spraying components while avoiding many of the shortcomings of field tests 10 , 35 . Typical wind tunnels widely used in agricultural aviation spraying research are shown in Table 1 36 , 37 .

Internationally renowned pesticide application research institutions, such as the Julius Kuehn Institute-Federal Research Centre for Cultivated Plants (JKI, formerly BBA) and the USDA Agricultural Research Service, Application Technology Research Unit (USDA-ARS-ATRU), operate circular closed low-speed standard wind tunnels (Fig.  4 ). These wind tunnels are widely used to assess the distribution, degradation, and drift of pesticide sprays while simulating real crop and environmental conditions. Their advantages are accurate measurement of pesticide distribution and drift and the ability to reproduce wind field conditions found in realistic environments. However, circular low-speed wind tunnels have limitations with respect to parameters such as spray particle size, density, and flow rate for different pesticides.

The Silsoe Research Institute (SRI) in the UK has a standard linear low-speed wind tunnel, which can be used to test the performance of agricultural mechanized sprayers and to support sprayer design. Its advantage is that it can simulate actual operating conditions and accurately test sprayer performance and flow rate. However, linear low-speed wind tunnels are typically more expensive than circular wind tunnels and can only simulate a single environmental condition.

The Center for Pesticide Application and Safety (CPAS) of the University of Queensland in Australia has an open-path wind tunnel (Fig.  5 ), which can be used to test the drift and particle distribution of agricultural sprayers. Its advantages are ease of operation, low cost, and the ability to reproduce wind fields under different environmental conditions. However, open-path wind tunnels cannot simulate realistic crop environments and suffer from unstable wind speeds.

In 2014, the Nanjing Institute of Agricultural Mechanization, Ministry of Agriculture and Rural Affairs, built the NJS-1 plant protection direct-flow closed wind tunnel (Fig.  6 ), which is mainly used to evaluate different sprayers in terms of performance and droplet distribution. Its advantages are the ability to simulate a realistic farm environment with high accuracy and to test different types and brands of sprayers. However, direct-flow closed wind tunnels are only suitable for small equipment and small-scale trials and are costly.

In 2018, the National Center for International Collaboration Research on Precision Agricultural Aviation Pesticide Spraying Technology of South China Agricultural University built a high- and low-speed composite wind tunnel for agricultural aviation research (Fig.  7 ). This wind tunnel can simulate spraying at different heights and wind speeds, allowing the effects of pesticide spraying under these conditions to be tested accurately and improving the efficiency and accuracy of agricultural aerial spraying. However, high- and low-speed composite wind tunnels are relatively costly and demand advanced technology and equipment.

As basic research infrastructure, these wind tunnels have contributed greatly to the study of pesticide deposition and drift rules, product testing, and product optimization 38 , 39 , 40 , 41 , 42 .
However, for studying spray droplet deposition and drift under the wind field disturbance of plant protection UAVs, single-direction wind tunnel tests remain insufficient to simulate the combined effect of the downward swirling flow beneath the rotor and the natural wind. In addition, existing agricultural wind tunnels are limited in size, so plant protection UAVs cannot be placed inside them. In the military domain, scaled models are used to put UAVs into wind tunnels for research 43 , 44 , but this approach is not suitable for research on pesticide spraying with plant protection UAVs, and the airflow would rebound from the tunnel wall.

figure 4

Circular closed low-speed wind tunnel.

figure 5

Open-path wind tunnel.

figure 6

NJS-1 direct-flow closed wind tunnel.

figure 7

High- and low-speed composite wind tunnel.

Another important test technique for drift research is the sampling and analysis of droplet drift. Studies on the drift of aerially applied sprays in developed countries such as the United States and Germany are carried out with advanced test instruments, including automatic air samplers, gas or liquid chromatography, fluorescence analyzers, and electronic scanners, to collect and analyze the droplet deposition amount, the number of droplets, the coverage density of droplets, and the content of substances, and to study the correlation between additive concentration, spraying height, and drift 4 , 45 , 46 . However, these traditional methods involve a long collection and processing cycle, samples must be processed in the laboratory, and it is difficult to capture the dynamics of droplets in the air. Particle image velocimetry (PIV) and LIDAR scanning can address these problems, and each has its own advantages. PIV can obtain the three-dimensional spatial velocity vector and size of droplets with high sampling accuracy but over a limited spatial measurement scale 47 , 48 , 49 ; the LIDAR scanning method, realized by layered scanning, can quickly and accurately obtain large-scale spatial droplet point cloud data and invert the three-dimensional distribution and spatiotemporal change of the droplets, but it cannot capture the spatial velocity vector of the droplets 50 . The advantages, disadvantages, and applications of droplet deposition and drift measurement methods are shown in Table 2 51 .

Overall, the sampling and analysis of droplet drift, along with techniques such as PIV and LIDAR scanning, play a crucial role in studying and understanding the behavior of droplets during aerial spraying. These methods provide valuable insights into droplet deposition, drift patterns, and the effects of various factors, enabling researchers to optimize spray practices, minimize drift, and enhance the efficiency and effectiveness of plant protection UAV applications.

Research on the mathematical analysis model of spray droplet deposition characteristics

In the development of spraying equipment and the determination of the optimal deposition conditions for spray, a large amount of data and information are needed to explain the influence of different factors on the spraying performance and the relationship between variables. At present, spraying drift modeling can be divided into models based on mechanics and models based on statistics 52 , 53 , 54 .

One class of mechanics-based models analyzes the movement of a single droplet in the airflow field using the Lagrangian trajectory tracking method. Teske et al. established the AGDISP model with the analytical Lagrangian method to describe aerial spraying under the condition of ignoring the influence of the aircraft wake and atmospheric turbulence 46 . This model takes into consideration not only the aircraft type, environmental conditions, and droplet properties but also the nozzle model. The user can input the parameters of the nozzle, droplet spectrum, aircraft type, and weather factors from an internal database and predict the drift potential. The model can effectively predict drift over a range of up to 20 km but is generally used for fixed-wing aircraft. Duga et al. and Gregorio et al. also studied the deposition distribution of aerial spray in orchards with the Lagrangian discrete phase model; their numerical results showed that the prediction error of total deposition on the fruit tree canopy exceeds 30% 48 , 51 . Dorr et al. developed a spray deposition model for whole plants based on L-studio, which takes into account plant leaf wettability, impact angle, droplet break-up and rebound behavior, and the number of sub-droplets produced 55 . In 2020, Zabkiewicz et al. used an updated version of the software based on this model, developing a new user interface and refining the droplet fragmentation model 56 .
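To make the Lagrangian trajectory idea concrete, the sketch below integrates the equation of motion of a single droplet under gravity and aerodynamic drag with a simple forward-Euler scheme. It is a minimal illustration of the technique, not AGDISP or any of the models cited above; the droplet diameter, release conditions, uniform crosswind, and Schiller-Naumann drag correlation are all illustrative assumptions.

```python
import numpy as np

# Minimal Lagrangian trajectory sketch for one spray droplet.
# All values are illustrative assumptions, not calibrated model inputs.
RHO_AIR = 1.2                      # air density, kg/m^3
RHO_LIQ = 1000.0                   # droplet (water) density, kg/m^3
MU_AIR = 1.8e-5                    # air dynamic viscosity, Pa*s
G = np.array([0.0, 0.0, -9.81])    # gravity, m/s^2

def drag_coefficient(re):
    """Schiller-Naumann sphere drag, reasonable for Re below ~800."""
    re = max(re, 1e-6)
    return 24.0 / re * (1.0 + 0.15 * re**0.687)

def land_point(d=150e-6, wind=(2.0, 0.0, 0.0),
               release=(0.0, 0.0, 3.0), v0=(0.0, 0.0, -15.0), dt=1e-4):
    """Track one droplet of diameter d until it reaches the ground."""
    wind = np.asarray(wind, float)
    pos = np.asarray(release, float)
    vel = np.asarray(v0, float)
    mass = RHO_LIQ * np.pi * d**3 / 6.0       # droplet mass
    area = np.pi * d**2 / 4.0                 # frontal area
    while pos[2] > 0.0:
        rel = wind - vel                      # air velocity seen by droplet
        speed = np.linalg.norm(rel)
        re = RHO_AIR * speed * d / MU_AIR     # droplet Reynolds number
        drag = 0.5 * RHO_AIR * drag_coefficient(re) * area * speed * rel
        vel += dt * (G + drag / mass)         # forward-Euler update
        pos += dt * vel
    return pos

print("landing point (m):", land_point())
```

In a full model, one would sample many droplets from the measured droplet size spectrum and couple the local air velocity to the rotor downwash or aircraft wake rather than assuming a uniform crosswind.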

Other mechanics-based models are realized with the CFD (computational fluid dynamics) method 57 , 58 , but large errors remain between the simulated and measured values of some models owing to various factors. When studying the droplet deposition and drift model of ground boom sprayers, Holterman et al. carried out a series of crosswind single-nozzle field experiments, taking into account the traveling speed, entrained airflow, geometric parameters of the farmland, sprayer system settings, and environmental factors, to calibrate the mathematical model. The results showed that when the height above the crop canopy is 0.7 m or less, the error between the test and the model simulation is within 10%, but the error in droplet deposition and drift prediction gradually increases as the height of the spray boom increases 59 , 60 , 61 .

Chinese researchers have conducted experimental research and numerical analysis on the simulation and mathematical modeling of droplet deposition and drift prediction for ground plant protection equipment, concluding that physical quantities such as the operating speed, droplet size, and crosswind affect the droplet deposition and drift process (Figs. 8 and 9 ) 62 , 63 . Zhu et al. developed DRIFTSIM, based on CFD and Lagrangian methods, with a CFD simulation database for ground drift prediction and a user interface to access drift-related data 64 . Hong et al. constructed an integrated computational fluid dynamics model to predict the deposition and transport of pesticide sprays under the canopy in apple orchards during different growth periods 65 .

figure 8

Rotor wind field test platform based on a wind tunnel.

figure 9

Layout scene of droplet drift.

The above research demonstrates that computer simulation technologies are widely applicable to predicting droplet deposition under various complicated air-assisted flow conditions 66 . The existing AGDISP model is relatively mature but suitable only for fixed-wing aircraft, which differ greatly from plant protection UAVs. Current spraying prediction models for plant protection UAVs still suffer from large relative errors between the experimental and simulated values of deposition and drift at each measurement point. The prediction accuracy of numerical models for the spray droplet deposition of plant protection UAVs therefore remains low and needs to be improved, and in-depth basic research on analyzing the rotor flow field and establishing mathematical models of droplet deposition is lacking 67 .

The rotor wind field test platform and droplet drift

The use of UAVs for crop spraying has become increasingly popular because of its efficiency and effectiveness. However, accurately analyzing the spraying process is challenging owing to the complex flow field of the droplets in the air and the multitude of factors that can affect their deposition characteristics. Current testing systems rely on simple methods such as static targets or trays, which do not capture the dynamic and complex nature of the real environment. To better study the UAV spraying flow field, a corresponding indoor simulation test system is needed. The indoor simulation system proposed in this study combines a natural wind simulation system and a rotor simulation system, which together can reproduce several factors in the natural environment that affect droplet deposition characteristics. The natural wind simulation system can effectively replicate variations in wind speed, a key factor influencing droplet dispersion and deposition; by adjusting its settings, researchers can reproduce the range of wind speeds encountered in the field and study their effects on droplet behavior and deposition. Likewise, by adjusting the settings of the rotor simulation system, the magnitude of the downwash airflow at different UAV rotor speeds can be reproduced. It is important to note, however, that other factors, such as wind direction and turbulence, may be difficult to replicate accurately in an indoor simulation system and may require further development of simulation techniques. Nevertheless, the inclusion of natural wind and rotor simulation systems in indoor setups provides a valuable tool for studying the effects of wind speed.

The fluorescence tracer method involves adding a fluorescent dye or tracer to the liquid spray mixture used in the UAV spraying process. When these droplets containing the fluorescent tracer are released into the air, they can be illuminated with a specific wavelength of light, typically ultraviolet (UV) light. The fluorescent dye absorbs this UV light and re-emits it at a longer wavelength, usually in the visible range.

The high-speed camera is synchronized with the UV light source and captures the emitted fluorescent signals from the droplets. By analyzing the recorded video footage, researchers can precisely track the movement and behavior of the fluorescent droplets in the air. The high-speed camera captures images at a rapid frame rate, allowing for the visualization and analysis of the droplet flow field in detail.
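As a rough illustration of how such synchronized footage might be processed, the sketch below extracts droplet centroids from a single fluorescence frame by thresholding and contour detection with OpenCV. The file name, intensity threshold, and minimum blob area are placeholder assumptions; a real pipeline would also calibrate the pixel-to-millimetre scale and match centroids across consecutive frames to estimate droplet velocities.

```python
import cv2  # OpenCV (pip install opencv-python)

THRESH = 60     # assumed intensity cutoff: bright droplets on a dark background
MIN_AREA = 3.0  # assumed minimum blob area in pixels, to reject sensor noise

def droplet_centroids(path):
    """Return (x, y) pixel centroids of fluorescent droplets in one frame."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(gray, THRESH, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    centroids = []
    for c in contours:
        m = cv2.moments(c)
        if m["m00"] >= MIN_AREA:  # m00 is the blob area in pixels
            centroids.append((m["m10"] / m["m00"], m["m01"] / m["m00"]))
    return centroids

# "frame_0001.png" is a hypothetical frame from the high-speed camera.
print(droplet_centroids("frame_0001.png"))
```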

The proposed indoor simulation test system for the spraying flow field of plant protection UAVs combines the fluorescence tracer method and the high-speed camera method to accurately track dynamic changes in the local droplet flow field in the air. The system also includes a natural wind simulation system, which allows a more realistic simulation of the actual environment and thus more accurately reproduces the complex factors that affect droplet deposition characteristics. This represents a significant improvement over existing testing systems, as it provides a more accurate and comprehensive analysis of droplet deposition under multiple interacting factors, enabling researchers to study the flow field more effectively and optimize the spraying process for plant protection UAVs. The proposed system therefore has the potential to transform the study of UAV spraying flow fields, and in our view it is superior to the methods currently in use (Fig.  10 ).

figure 10

Diagram of the rotor wind field test platform and droplet drift.

In conclusion, existing studies on plant protection UAV spraying have focused primarily on isolated factors, such as the flying height, flying speed, and nozzle flow, without considering the interaction effects among other influential factors. This limitation underscores the need for experimental research that combines spray droplet deposition characteristics with crop canopy characteristics in a controllable environment, encompassing both environmental conditions and operating parameters. The proposed research addresses this gap by developing an indoor simulation system that incorporates a natural wind simulation system. This system enables the study of droplet deposition characteristics influenced by multiple factors in a realistic environment. By statistically analyzing the factors affecting droplet deposition and establishing a multivariable relationship model, the optimal droplet deposition suitable for field operation decision-making of plant protection UAVs can be quantified and evaluated. This research presents an effective technical pathway for understanding the deposition patterns of droplets sprayed by plant protection UAVs and supports the formulation of relevant pesticide application standards.

Data availability

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Lan, Y. B., Thomson, S. J., Huang, Y. B., Hoffmann, W. C. & Zhang, H. H. Current status and future directions of precision aerial application for site-specific crop management in the USA. Comput. Electron. Agric. 74 (1), 34–38 (2010).

Chen, T. H. & Lu, S. H. Autonomous navigation control system of agricultural mini-unmaned aerial vehicles based on DSP. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE). 28 (21), 164–169 (2012) ( (in Chinese with English abstract) ).

Zhou, W. Application and popularization of agricultural unmanned plant protection helicopter. Agric. Eng. 3 (S1), 56–58 (2013).

Lan, Y. B., Hoffmann, W. C., Fritz, B. K., Martin, D. E. & Lopez, J. D. Spray drift mitigation with spray mix adjuvants. Appl. Eng. Agric. 24 (1), 5–10 (2008).

Zhang, D. Y., Lan, Y. B., Chen, L. P., Wang, X. & Liang, D. Current status and future trends of agricultural aerial spraying technology in China. Trans. Chin. Soc. Agric. Mach. 45 (10), 53–59 (2014).

Faical, B. S., Costa, F. G., Pessin, G., Ueyama, J. & Freitas, H. The use of unmanned aerial vehicles and wireless sensor networks for spraying pesticides. J. Syst. Architect. 60 (4), 393–404 (2014).

Xue, X. Y. & Lan, Y. B. Agricultural aviation applications in USA. Trans. Chin. Soc. Agric. Mach. 44 (5), 194–201 (2013).

Fritz, B. K., Hoffmann, W. C. & Lan, Y. B. Evaluation of the EPA drift reduction technology (DRT) low-speed wind tunnel protocol. J. ASTM Int. 4 (6), 1–11 (2009).

Liu, H. S., Lan, Y. B., Xue, X. Y., Zhou, Z. Y. & Luo, X. W. Development of wind tunnel test technologies in agricultural aviation spraying. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 31 (Supp. 2), 1–10 (2015) ( (in English) ).

Ru, Y., Zhu, C. Y. & Bao, R. Spray drift model of droplets and analysis of influencing factors based on wind tunnel. Trans. Chin. Soc. Agric. Mach. 45 (10), 66–72 (2014).

Lebeau, F. & Verstraete, A. RTDrift: A real time model for estimating spray drift from ground applications. Comput. Electron. Agric. 77 (2), 161–174 (2012).

Fritz, B. K. Meteorological effects on deposition and drift of aerially applied sprays. Trans. ASABE 49 (5), 1295–1301 (2006).

Zeng, A. J., He, X. K., Chen, Q. Y., Herbst, A. & Liu, Y. J. Spray drift potential evaluation of typical nozzles under wind tunnel conditions. Trans. CSAE. 21 (10), 78–81 (2005) ( (in Chinese with English abstract) ).

Wang, C. L. et al. Testing method of spatial pesticide spraying deposition quality balance for unmanned aerial vehicle. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 32 (11), 54–61 (2016) ( (in Chinese with English abstract) ).

Wang, L. et al. Design of Variable spraying system and influencing factors on droplets deposition of small UAV. Trans. Chin. Soc. Agric. Mach. 47 (1), 1–8 (2016).

Qin, W. C., Xue, X. Y., Zhou, L. X. & Wang, B. K. Effects of spraying parameters of unmanned aerial vehicle on droplets deposition distribution of maize canopies. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 30 (5), 50–56 (2014) ( (in Chinese with English abstract) ).

Gao, Y. Y. et al. Primary studies on spray droplet distribution and control effects of aerial spraying using unmanned aerial vehicle (UAV) against the corn borer. Plant Prot. 39 (2), 152–157 (2013).

Qiu, B. J. et al. Effects of flight altitude and speed of unmanned helicopter on spray deposition uniform. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 29 (24), 25–32 (2013) ( (in Chinese with English abstract) ).

Qin, W. C., Qiu, B. J., Xue, X. Y. & Wang, B. K. Droplet deposition and control effect of insecticides sprayed with an unmanned aerial vehicle against plant hoppers. Crop Prot. 85 , 79–88 (2016).

Hewitt, A. J. Droplet size spectra classification categories in aerial application scenarios. Crop Prot. 27 (9), 1284–1288 (2008).

Gil, E., Llorens, J., Llop, J., Fàbregas, X. & Gallart, M. Use of a terrestrial LIDAR sensor for drift detection in vineyard spraying. Sensors (14248220) 13 (1), 516–534. https://doi.org/10.3390/s130100516 (2013).

Huang, Y., Hoffmann, W. C., Lan, Y., Wu, W. & Fritz, B. K. Development of a spray system for an unmanned aerial vehicle platform. Appl. Eng. Agric. 25 (6), 803–809 (2009).

Gaskin, R. E., Steele, K. D. & Foster, W. A. Characterizing plant surfaces for spray adhesion and retention. N. Z. Plant Prot. 58 , 179–183 (2009).

Zhu, J. W., Zhou, G. J., Cao, Y. B., Dai, Y. Y. & Zhu, G. N. Characteristics of fipronil solution deposition on paddy rice leaves. Chin. J. Pestic. Sci. 11 (2), 250–254 (2009).

Diepenbrock, W. Yield analysis of winter oilseed rape ( Brassica napus L.): A review. Field Crops Res. 67 , 35–49 (2000).

Song, J. L., He, X. K. & Yang, X. L. Influence of nozzle orientation on spray deposits. Trans. CSAE 22 (6), 96–99 (2006) ( (in Chinese with English abstract) ).

Chen, S. D. et al. Effect of spray parameters of small unmanned helicopter on distribution regularity of droplet deposition in hybrid rice canopy. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 32 (17), 40–46 (2016) ( (in Chinese with English abstract) ).

Zou, X., Xu, R., Li, J. & Liu, Z. Particle kinematics analysis of droplet drift in spraying operation of plant protection UAV. Plant Dis. Pests. 13 (2), 17–23. https://doi.org/10.19579/j.cnki.plant-d.p.2022.02.006 (2022).

Gil, E. et al. Influence of wind velocity and wind direction on measurements of spray drift potential of boom sprayers using drift test bench. Agric. For. Meteorol. 202 , 94–101 (2015).

Ferreira, M. C., Miller, P. C. H., Tuck, C. R., O’Sullivan, C. M., Balsari, P., Carpenter, P. I., Cooper, S. E. & Magri B. (2010). Comparison of sampling arrangements to determine airborne spray profiles in wind tunnel conditions. Asp. Appl. Biol. Int. Adv. Pest. Appl. 291–296.

Qi, L. J., Hu, J. R., Shi, Y. & Fu, Z. T. Correlative analysis of drift and spray parameters. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 5 (20), 122–125 (2004).

Zhang, R. R. et al. Spraying atomization performance by pulse width modulated variable and droplet deposition characteristics in wind tunnel. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE). 35 (3), 42–51 (2019) ( (in Chinese with English abstract) ).

Hilz, E. & Vermeer, A. W. Spray drift review: The extent to which a formulation can contribute to spray drift reduction. Crop Prot. 44 , 75–83 (2013).

Bai, G. et al. Characteristics and classification of Japanese nozzles based on relative spray drift potential. Crop Prot. 46 , 88–93 (2013).

Jiao, Y. et al. Experimental study of the droplet deposition characteristics on an unmanned aerial vehicle platform under wind tunnel conditions. Agronomy 12 (12), 3066. https://doi.org/10.3390/agronomy12123066 (2022).

Liu, H., Lan, Y., Xue, X., Zhou, Z. & Luo, X. Development of wind tunnel test technologies in agricultural aviation spraying. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 31 (Supp. 2), 1–10 (2015).

Fu, Z. T. & Qi, L. J. Wind tunnel spraying drift measurements. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE). 15 (1), 115–118 (1999) ( (in Chinese with English abstract) ).

Wang, Z. et al. Stereoscopic test method for low-altitude and low-volume spraying deposition and drift distribution of plant protection UAV. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 36 (4), 54–62. https://doi.org/10.11975/j.issn.1002-6819.2020.04.007 (2020) ( (in Chinese with English abstract) ).

Ding, S. M., Xue, X. Y. & Lan, Y. B. Design and experiment of NJS-1 type open-circuit closed wind tunnel for plant protection. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE). 31 (4), 76–84 (2015) ( (in Chinese with English abstract) ).

Wang, M. X. & Zhuang, K. L. Review on helicopter rotor model wind tunnel test. Aerodyn. Exp. Meas. Control. 5 (3), 9–16 (1991).

Chen, Z., Guo, Y. C. & Gao, C. Principle and technology of three-dimensional PIV. J. Exp. Fluid Mech. 20 (4), 77–82 (2006).

Wang, X., Qi, P., Yu, C. & He, X. Research and development of atomization, deposition and drift of pesticide droplets. Chin. J. Pestic. Sci. 24 (5), 1065–1079. https://doi.org/10.16801/j.issn.1008-7303.2022.0111 (2022).

Wolters, A., Linnemann, V., van de Zande, J. C. & Vereecken, H. Field experiment on spray drift: Deposition and airborne drift during application to a winter wheat crop. Sci. Total Environ. 405, 269–277 (2008).

Wang, C. et al. Testing method and distribution characteristics of spatial pesticide spraying deposition quality balance for unmanned aerial vehicle. Int. J. Agric. Biol. Eng. 11 (2), 18–26. https://doi.org/10.25165/j.ijabe.20181102.3187 (2018).

Clement, M., Arzel, S., Le Bot, B., Seux, R. & Millet, M. Adsorption/thermal desorption-GC/MS for the analysis of pesticides in the atmosphere. Chemosphere 40 (1), 49–56 (2000).

Teske, M. E., Miller, P. C. H., Thistle, H. W. & Birchfield, N. B. Initial development and validation of a mechanistic spray drift model for ground boom sprayers. Trans. ASABE 52 (4), 1089–1097 (2009).

Chen, S., Lan, Y., Zhou, Z., Ouyang, F. & Wang, G. Effect of droplet size parameters on droplet deposition and drift of aerial spraying by using plant protection UAV. J. Agron. 10 , 195 (2020) ( (in Chinese with English abstract) ).

Duga, A. T. et al. Numerical analysis of the effects of wind and sprayer type on spray distribution in different orchard training systems. Bound. Layer Meteorol. 157 (3), 517–535 (2015).

Liu, X. et al. Research progress on spray drift of droplets of plant protection machinery. Chin. J. Pestic. Sci. 24 (2), 232–247. https://doi.org/10.16801/j.issn.1008-7303.2021.0166 (2022).

Gil, E., Gallart, M., Balsari, P., Marucco, P. & Liop, J. Influence of wind velocity and wind direction on measurements of spray drift potential of boom sprayers using drift test bench. Agric. For. Meteorol. 202 , 94–101 (2015).

Gregorio, L. E. et al. LIDAR as an alternative to passive collectors to measure pesticide spray drift. Atmos. Environ. 82 , 83–93 (2014).

Feng, K. et al. Research progress and prospect of pesticide droplet deposition characteristics. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 37 (20), 1–14. https://doi.org/10.11975/j.issn.1002-6819.2021.20.001 (2021) ( (in Chinese with English abstract) ).

Kruckeberg, J. P. et al. The relative accuracy of DRIFTSIM when used as a real-time spray drift predictor. Trans. ASABE 55 (4), 1159–1165 (2012).

Li, H., Zhu, H., Jiang, Z. & Lan, Y. Performance characterization on downwash flow and spray drift of multirotor unmanned agricultural aircraft system based on CFD. Int. J. Agric. Biol. Eng. 15 (3), 1–8. https://doi.org/10.25165/j.ijabe.20221503.7315 (2022).

Dorr, G. J. et al. Spray retention on whole plants: Modelling, simulations and experiments. Crop Prot. 88 , 118–130 (2016).

Zabkiewicz, J. A. et al. Simulating spray droplet impaction outcomes: Comparison with experimental data. Pest Manag. Sci. 76 (10), 3469–3476 (2020).

Miller, P. C. H. & Hadfield, D. J. A simulation model of the spray drift from hydraulic nozzles. J. Agric. Eng. Res. 42 (2), 135–147 (1989).

Zhang, B., Tang, Q., Chen, L., Zhang, R. & Xu, M. Numerical simulation of spray drift and deposition from a crop spraying aircraft using a CFD approach. Biosyst. Eng. 166 , 184–199. https://doi.org/10.1016/j.biosystemseng.2017.11.017 (2018).

Holterman, H. J., Van De Zande, J. C., Porskamp, H. A. J. & Huijsmans, J. F. M. Modeling spray drift from boom sprayers. Comput. Electron. Agric. 19 (1), 1–22 (1997).

Zhang, D. et al. Numerical simulation and analysis of the deposition shape of the droplet jetting collision. J. Xi’an Polytech. Univ. 30 (1), 112–117 (2016).

Tang, Q., Zhang, R., Chen, L., Li, L. & Xu, G. Research progress of key technologies and verification methods of numerical modeling for plant protection unmanned aerial vehicle application. Smart Agric. 3 (3), 1–21 (2021) ( (in Chinese with English abstract) ).

Zhang, R. et al. Fluorescence tracer method for analysis of droplet deposition pattern characteristics of the sprays applied via unmanned aerial vehicle. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 36 (6), 47–55. https://doi.org/10.11975/j.issn.1002-6819.2020.06.006 (2020) ( (in Chinese with English abstract) ).

Na, G., Liu Siyao, Xu., Hui, T. S. & Tianlai, Li. Improvement on image detection algorithm of droplets deposition characteristics. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 34 (17), 176–182 (2018) ( (in Chinese with English abstract) ).

Zhu, H. et al. DRIFTSIM, A program to estimate drift distances of spray droplets. Appl. Eng. Agric. 11 (3), 365–369 (1995).

Hong, S., Zhao, L. & Zhu, H. CFD simulation of pesticide spray from air-assisted sprayers in an apple orchard: Tree deposition and off-target losses. Atmos. Environ. 175 , 109–119 (2018).

Xiahou, B., Sun, D., Song, S., Xue, X. & Dai, Q. Simulation and experimental research on droplet flow characteristics and deposition in airflow field. Int. J. Agric. Biol. Eng. 13 (6), 16–24. https://doi.org/10.25165/j.ijabe.20201306.5455 (2020).

Yang, W., Li, X., Li, M. & Hao, Z. Droplet deposition characteristics detection method based on deep learning. Comput. Electron. Agric. 198 , 107038. https://doi.org/10.1016/j.compag.2022.107038 (2022).

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 31971804); the Independent Innovation Project of Agricultural Science and Technology in Jiangsu Province (CX(21)3091); the Suzhou Agricultural Independent Innovation Project (SNG2022061); and the Suzhou Agricultural Vocational and Technical College Landmark Achievement Cultivation Project (CG[2022]02).

Author information

Authors and Affiliations

Suzhou Polytechnic Institute of Agriculture, Suzhou, 215008, China

Nanjing Institute of Agricultural Mechanization, Ministry of Agriculture and Rural Affairs, Nanjing, 210014, China

Nanjing Institute of Technology, Nanjing, 211167, China

Chen Panyang

Contributions

Q.W. conceived and designed the study. Q.W. and C.P. performed most experiments. Q.W. analyzed the data and wrote the first draft of the manuscript. C.P. revised the manuscript. Q.W. supervised the project and reviewed the manuscript.

Corresponding author

Correspondence to Qin Weicai.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Weicai, Q., Panyang, C. Analysis of the research progress on the deposition and drift of spray droplets by plant protection UAVs. Sci Rep 13 , 14935 (2023). https://doi.org/10.1038/s41598-023-40556-0

Received : 20 February 2023

Accepted : 12 August 2023

Published : 11 September 2023

DOI : https://doi.org/10.1038/s41598-023-40556-0


Computer Science > Computation and Language

Title: Uni-SMART: Universal Science Multimodal Analysis and Research Transformer

Abstract: In scientific research and its application, scientific literature analysis is crucial as it allows researchers to build on the work of others. However, the fast growth of scientific knowledge has led to a massive increase in scholarly articles, making in-depth literature analysis increasingly challenging and time-consuming. The emergence of Large Language Models (LLMs) has offered a new way to address this challenge. Known for their strong abilities in summarizing texts, LLMs are seen as a potential tool to improve the analysis of scientific literature. However, existing LLMs have their own limits. Scientific literature often includes a wide range of multimodal elements, such as molecular structure, tables, and charts, which are hard for text-focused LLMs to understand and analyze. This issue points to the urgent need for new solutions that can fully understand and analyze multimodal content in scientific literature. To answer this demand, we present Uni-SMART (Universal Science Multimodal Analysis and Research Transformer), an innovative model designed for in-depth understanding of multimodal scientific literature. Through rigorous quantitative evaluation across several domains, Uni-SMART demonstrates superior performance over leading text-focused LLMs. Furthermore, our exploration extends to practical applications, including patent infringement detection and nuanced analysis of charts. These applications not only highlight Uni-SMART's adaptability but also its potential to revolutionize how we interact with scientific literature.


Published on 26.3.2024 in Vol 26 (2024)

Youth is Prized in Medicine, Old Age is Valued in Law: Analysis of Media Narratives Over 200 Years

Authors of this article:

Original Paper

  • Reuben Ng 1, 2 , PhD   ; 
  • Nicole Indran 1 , BSocSci (Hons)  

1 Lee Kuan Yew School of Public Policy, National University of Singapore, Singapore, Singapore

2 Lloyd’s Register Foundation Institute for the Public Understanding of Risk, National University of Singapore, Singapore, Singapore

Corresponding Author:

Reuben Ng, PhD

Lee Kuan Yew School of Public Policy

National University of Singapore

469C Bukit Timah Road

Singapore, 259772

Phone: 65 66013967

Email: [email protected]

Background: This is the first study to explore how age has influenced depictions of doctors and lawyers in the media over the course of 210 years, from 1810 to 2019. The media represents a significant platform for examining age stereotypes and possesses tremendous power to shape public opinion. Insights could be used to improve depictions of older professionals in the media.

Objective: This study aims to understand how age shapes the portrayals of doctors and lawyers. Specifically, it compares the difference in sentiments toward younger and older doctors as well as younger and older lawyers in the media over 210 years.

Methods: Leveraging a 600-million-word corpus of American media publications spanning 210 years, we compiled top descriptors (N=478,452) of nouns related to youth × occupation (eg, younger doctor or physician) and old age × occupation (eg, older lawyer or attorney). These descriptors were selected using well-established criteria including co-occurrence frequency and context relevance, and were rated on a Likert scale from 1 (very negative) to 5 (very positive). Sentiment scores were generated for “doctor/physician,” “young(er) doctor/physician,” “old(er) doctor/physician,” “lawyer/attorney,” “young(er) lawyer/attorney,” and “old(er) lawyer/attorney.” The scores were calculated per decade for 21 decades from 1810 to 2019. Topic modeling was conducted on the descriptors of each occupation in both the 1800s and 1900s using latent Dirichlet allocation.

Results: As hypothesized, the media placed a premium on youth in the medical profession, with portrayals of younger doctors becoming 10% more positive over 210 years, and those of older doctors becoming 1.4% more negative. Meanwhile, a premium was placed on old age in law. Positive portrayals of older lawyers increased by 22.6% over time, while those of younger lawyers experienced a 4.3% decrease. In the 1800s, narratives on younger doctors revolved around their participation in rural health care. In the 1900s, the focus shifted to their mastery of new medical technologies. There was no marked change in narratives surrounding older doctors from the 1800s to the 1900s, though less attention was paid to their skills in the 1900s. Narratives on younger lawyers in the 1800s referenced their limited experience. In the 1900s, there was more focus on courtroom affairs. In both the 1800s and 1900s, narratives on older lawyers emphasized their prestige, especially in the 1900s.

Conclusions: Depending on the occupation, one’s age may either be seen as an asset or a liability. Efforts must be expended to ensure that older professionals are recognized for their wealth of knowledge and skills. Failing to capitalize on the merits of an older workforce could ultimately be a grave disservice not only to older adults but to society in general.

Introduction

Due to medical advances, older adults today are much healthier than before and make up a substantial portion of the labor market. The share of individuals aged 55 years and older either working or actively seeking employment has increased significantly since the 1990s, with approximately 37 million older Americans in the workforce as of March 2021 [ 1 ]. Never in history have older people been more integral to the workforce, and projections indicate a continuation of this trend [ 2 ]. Exploring how older workers have been portrayed in the media is essential for fostering a society that values its older population. In this study, we compare the portrayals of younger and older practitioners in the legal and medical spheres over the last 210 years.

The significance of our study lies in both conceptual and practical domains. From a conceptual standpoint, this study is one of the first to examine how portrayals of older workers in the media have changed over the span of 210 years. Existing research has looked only at how older adults in general are stereotyped in the media [ 3 ]. By analyzing the portrayals of older workers over such an extensive period, this study facilitates a more comprehensive understanding of how various social, cultural, or industry-related factors may have influenced these portrayals. This will in turn allow for better contextualization of current attitudes toward older workers. From a practical standpoint, insights from this study can inform interventions aimed at combating negative age stereotypes of older workers in the media.

The theory of social constructionism maintains that reality is neither objective nor fixed, but rather is constructed through social and cultural processes [ 4 ]. As a key agent for transmitting and disseminating information, the media plays a salient role in shaping the way reality is interpreted. According to the agenda-setting theory, the media molds public opinion by determining which issues are considered important and worthy of attention [ 5 , 6 ]. Similarly, cultivation theory posits that repeated exposure to certain messages in the media can alter people’s attitudes and beliefs over time [ 7 ]. Given the power of the media to influence the collective conscience, it is critical to inspect the ways older workers have been depicted in the media.

Several studies have looked at how older people are portrayed in the media. Ng et al [ 3 ] discovered that age stereotypes in the United States have become more negative over the last 2 centuries. Meanwhile, some have observed that depictions of older persons in visual media have become more positive in Europe and America since the 1950s [ 8 , 9 ]. To date, there is a paucity of literature on how different subgroups of older people are stereotyped in the media. The existing studies have focused primarily on the portrayal of grandparents [ 10 ], with little attention dedicated to the stereotypes of older workers.

Over the decades, scholars have endeavored to dissect the nature of age stereotypes. Although there is copious evidence of a general negative bias in societal perceptions of older adults [ 3 , 11 - 15 ], age stereotypes are widely accepted as being multifaceted [ 16 , 17 ]. Negative age stereotypes include being frail and unfriendly, while positive ones include being warm and amiable. Psychologists have argued that the stereotypes applied to a given target tend to shift across contexts as only contextually meaningful information will be used to evaluate the individual [ 18 , 19 ]. This contextual malleability of stereotypes renders it a scholarly imperative to pinpoint the various circumstances in which positive and negative age stereotypes emerge.

Presently, most studies pertaining to older workers focus on age discrimination as well as stereotypes in the workplace [ 20 , 21 ]. Positive stereotypes of older workers class them as warm, reliable, and committed to the job [ 22 , 23 ], while negative ones include being less flexible and adaptable [ 23 , 24 ]. Meanwhile, even as younger workers are highly regarded in terms of physical ability, productivity, and creativity [ 23 ], they are often branded as being inexperienced, unreliable, and unmotivated [ 25 ].

The ways in which older workers are represented in the media can affect the public’s attitudes toward old age and in turn the health of older persons. According to stereotype embodiment theory, the assimilation of age stereotypes into one’s self-concept can affect one’s health [ 26 ]. Negative age stereotypes are linked to poorer health outcomes such as a reduced sense of self-efficacy, a higher risk of depression as well as poorer immune or cardiovascular health [ 26 - 29 ]. Conversely, positive age stereotypes are associated with improved functional health, well-being, and longevity [ 26 , 28 , 29 ].

Research on how society evaluates older doctors is lacking. The few studies that have been carried out examined patients’ preferences regarding the age of their physicians [ 30 - 33 ]. Although older physicians are commonly deemed more patient and reassuring [ 30 , 31 ], some believe they are more susceptible to dispensing a lower quality of care than their younger colleagues [ 32 ]. A recent study found that patients cared for by older practitioners had higher mortality rates than those treated by younger ones [ 34 ]. In the legal sphere, minimal attention has been devoted to examining the ways in which older lawyers are stereotyped. That said, prior scholarship has hinted at both the opportunities and challenges posed by older attorneys. On one hand, there has been discourse on the need to leverage the skills of older attorneys, be it by allowing them to register for emeritus pro bono status [ 35 ] or by offering them opportunities to train younger attorneys [ 36 ]. On the other hand, there have been fears that older attorneys’ ability to continue their practice may be hampered by cognitive changes [ 35 ].

We test 2 hypotheses. First, to determine how age affects portrayals of doctors and lawyers in the media, we compare the difference in sentiments toward older and younger doctors, as well as older and younger lawyers. We hypothesize that the media will place a premium on youth in the medical profession where physical dexterity—a trait typically associated with younger people [ 37 ]—is prized across a range of specialties [ 38 , 39 ]. Specifically, we hypothesize that sentiments toward younger doctors will be more positive than toward older doctors over 21 decades in the media (hypothesis 1). Second, we hypothesize that the media will place a premium on old age in the legal profession due to its inherent association with experience—a quality of paramount importance in the legal domain [ 35 ]. Specifically, we hypothesize that sentiments toward older lawyers will be more positive than toward younger lawyers over the same period (hypothesis 2).

Methods

Following earlier work [ 10 , 40 - 42 ], we created the largest historical corpus—comprising 600 million words—of American media publications spanning 210 years (1810 to 2019) by merging the Corpus of Historical American English (COHA) from 1810 to 2009 with the Corpus of Contemporary American English (COCA) from 2010 to 2019 [ 43 ]. The combination of both corpora formed the largest structured historical English corpus with over 150,000 texts collected from newspapers, magazines, fiction, and nonfiction. Publications were extracted from major news outlets such as the New York Times , Wall Street Journal , and USA Today , as well as smaller ones like the Atlantic Journal Constitution and the San Francisco Chronicle . Material that was published at some point throughout the 210 years but that has ceased publication was included in the data set. The media represents a significant platform for examining age stereotypes as it possesses tremendous power to shape public opinion [ 4 - 6 ].

Target Nouns to Measure Occupation and Age × Occupation Stereotypes

The Harris Poll [ 44 ] listed doctors (physicians) and lawyers (attorneys) as some of the most respected professions. Prior studies have also found that doctors (physicians) and lawyers (attorneys) are seen as some of the most functionally significant occupations in society [ 45 ]. To ensure the relevance and applicability of our findings, the terms we selected for analysis are those frequently used in American media to describe professionals in the legal and medical fields. These terms are “doctor,” “physician,” “lawyer,” and “attorney.”

Old Age × Occupation

The descriptors or adjectives for the following terms were compiled: “old doctors,” “older doctors,” “old physicians,” “older physicians,” “old lawyers,” “older lawyers,” “old attorneys,” and “older attorneys.” We took into consideration the fact that older workers may be referred to by other adjectives such as “aging,” “elder,” “elderly,” and “aged.” However, we used “old” and “older” as they evidenced the highest prevalence in the data set.

Youth × Occupation

The descriptors or adjectives for the following terms were compiled: “young doctors,” “younger doctors,” “young physicians,” “younger physicians,” “young lawyers,” “younger lawyers,” “young attorneys,” and “younger attorneys.” Although younger workers may be referred to by other adjectives including “junior,” we used “young” and “younger” to exclude words describing workers of a lower status.

Selection of Descriptors and Sentiment Scoring

The top descriptors that co-occurred most frequently—referred to as “collocates”—with each term were compiled per decade for 210 years based on the following inclusion criteria: (1) The collocate was present within 6 words before or after the target word (lexical proximity). Articles such as “the” and “a” were not included in the 6-word lexical span. If the target noun was the first word of a sentence, collocates from the preceding sentence were excluded. (2) The collocate referred to an older or younger person specifically (relevant context). (3) There was a mutual information score of 1.5 and above, which suggests semantic bonding, meaning that the collocate has a stronger association with the particular synonym than other words in the corpus [ 46 ]. We use the following formula to calculate the mutual information score:

MI = log((C × SizeCorpus) / (A × B × Span)) / log(2)

Here, A is the frequency of the target word A in the corpus, B is the frequency of the collocate B, and C is the frequency with which collocate B appears near the target word A. SizeCorpus refers to the size of the corpus, that is, the number of words it contains. Span is the window of words considered: with 6 words to the left and 6 words to the right of the target word, span = 12; log(2) = 0.30103. This is an application of concordance analysis called “psychomics”, which has been used in past literature to analyze societal stereotypes [ 10 , 47 - 50 ]. This rigorous process culminated in 478,452 collocates (descriptors or adjectives).
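A minimal Python sketch of this score, assuming the raw frequency counts are already available (the counts in the example are invented for illustration), might read:

```python
import math

def mutual_information(freq_target, freq_collocate, freq_together,
                       corpus_size, span=12):
    """MI = log((C * SizeCorpus) / (A * B * Span)) / log(2), as above."""
    a, b, c = freq_target, freq_collocate, freq_together
    return math.log((c * corpus_size) / (a * b * span)) / math.log(2)

# Hypothetical counts for a collocate appearing near a target noun.
mi = mutual_information(freq_target=1_200, freq_collocate=45_000,
                        freq_together=30, corpus_size=600_000_000)
print(round(mi, 2))  # collocates with MI >= 1.5 were retained
```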

To test both hypotheses, each collocate was rated using a sentiment engine on a scale from 1 (very negative) to 5 (very positive). This has proven to be a valid and reliable method of measuring words associated with age stereotypes [ 51 ] and follows previous corpus-based analyses [ 52 ]. Very negative collocates were rated 1 (eg, “frail” and “burden”), neutral collocates were rated 3 (eg, “transport” and “society”), and very positive collocates were rated 5 (eg, “venerable” and “cheerful”). For every noun per decade, we tabulated a mean score that was then weighted (by the number of times the respective noun appeared in that decade) to determine the respective sentiment score.
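The frequency-weighted scoring could be sketched as follows; the table below is a hypothetical stand-in for the rated collocates, with invented column names and values:

```python
import pandas as pd

# Hypothetical rated collocates: one row per (decade, collocate).
df = pd.DataFrame({
    "decade": [1810, 1810, 1820, 1820],
    "collocate": ["venerable", "frail", "cheerful", "burden"],
    "rating": [5, 1, 5, 1],   # sentiment rating, 1 (very negative) to 5
    "freq": [12, 3, 7, 9],    # co-occurrence frequency in that decade
})

# Mean rating per decade, weighted by how often each collocate appeared.
scores = (df.assign(weighted=df["rating"] * df["freq"])
            .groupby("decade")
            .apply(lambda g: g["weighted"].sum() / g["freq"].sum()))
print(scores)
```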

Analytic Strategy

Hypothesis 1 states that sentiments toward younger doctors in the media are more positive than toward older ones, while hypothesis 2 states that sentiments toward older lawyers in the media are more positive than toward younger lawyers. Both hypotheses were tested by analyzing the respective sentiment trends over 210 years (1810 to 2019) and thereafter determining whether the respective slopes were significantly different. Topic modeling was conducted on the descriptors of each occupation (eg, older doctor) in both the 1800s and 1900s using latent Dirichlet allocation (LDA) [ 53 ]. By probabilistically grouping words into topics, LDA identifies latent topics and clusters of words that co-occur frequently [ 53 ]. All data preprocessing, text analytics, and statistical analyses were done in Python 3.7 and OriginLab Corporation’s OriginPro 2019b (OriginLab Corporation).
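As an illustration of the topic-modeling step, the sketch below fits a small LDA model with scikit-learn to a few invented bags of descriptors; the documents, vocabulary, and number of topics are assumptions for demonstration rather than the study's actual pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical descriptor bags for one group (e.g., "older lawyer" collocates).
docs = [
    "eminent respectable distinguished senior experience successful",
    "judge defendant trial court evidence prosecute witness",
    "hire represent advice opinion counsel fee",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)  # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the highest-weight words for each latent topic.
terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = terms[weights.argsort()[::-1][:4]]
    print(f"topic {k}:", ", ".join(top))
```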

Ethical Considerations

No ethical approval was sought for this study due to the publicly available and nonidentifiable nature of the data. Moreover, this study was exempted from ethics review as it involved a secondary analysis of publicly available material.

Results

Youth Premium in Depictions of Doctors in the Media (Hypothesis 1)

We tested whether there was a youth premium in depictions of doctors in the media. As hypothesized, younger doctors or physicians evidenced more positive societal sentiments than older doctors or physicians over 21 decades from 1810 to 2019. Younger doctors enjoyed a 10% increase in positive portrayals over 210 years (β=0.01385, P =.04), while older doctors experienced a 1.4% decline in sentiments over the same period, though this trend was not statistically significant ( Figure 1 ). The difference across both slopes achieved statistical significance, F 1,36 =4.5602, P =.04, providing support for hypothesis 1.
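One conventional way to test whether two sentiment trends differ, consistent with the slope comparison reported here, is an ordinary least squares model with a decade × group interaction term, where a significant interaction indicates different slopes. The sketch below uses statsmodels on invented data and is not the study's code or data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
decades = np.arange(1810, 2020, 10)

# Invented sentiment series: one rising trend, one slightly declining.
df = pd.DataFrame({
    "decade": np.tile(decades, 2),
    "group": ["younger"] * len(decades) + ["older"] * len(decades),
    "sentiment": np.concatenate([
        3.0 + 0.0010 * (decades - 1810) + rng.normal(0, 0.05, len(decades)),
        3.0 - 0.0002 * (decades - 1810) + rng.normal(0, 0.05, len(decades)),
    ]),
})

# The decade:group coefficient captures the difference between slopes.
fit = smf.ols("sentiment ~ decade * group", data=df).fit()
print(fit.params["decade:group[T.younger]"],
      fit.pvalues["decade:group[T.younger]"])
```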

Old Age Premium in Depictions of Lawyers in the Media (Hypothesis 2)

We tested whether there was a premium of old age in depictions of lawyers in the media. As hypothesized, older lawyers or attorneys evidenced more positive sentiments compared with younger lawyers or attorneys across 21 decades from 1810 to 2019. Older lawyers experienced a 22.6% increase in positive portrayals over 21 decades (β=0.0302, P =.02). Conversely, younger lawyers experienced a 4.3% decline in sentiments over the same period (β=–0.00673, P ≤.001). These results reflect an age premium for lawyers where older lawyers are portrayed more positively relative to younger ones ( Figure 2 ). The difference across both slopes reached statistical significance, F 1,34 =7.085, P =.01, supporting hypothesis 2.

Summary of Insights from Topic Modeling

Topics were generated for old age × occupation and youth × occupation framing in the 1800s and 1900s. In the 1800s, narratives on younger doctors revolved around their participation in rural health care. In the 1900s, the focus shifted to their mastery of new medical technologies. Unlike younger doctors, there was no marked change in narratives surrounding older doctors from the 1800s to 1900s, though less attention was paid to their skills in the 1900s. Narratives on younger lawyers in the 1800s described them as having limited experience. In the 1900s, there was more focus on courtroom affairs. In both the 1800s and 1900s, narratives on older lawyers emphasized their prestige. This emphasis was more apparent in the 1900s as their seniority and experience became increasingly valuable in navigating the growing complexity of court cases. These results are summarized in Table 1 .

Youth × Occupation Framing (Doctors or Physicians) in the 1800s

Topic 1 consists of terms pertaining to obtaining a “medical degree” (“college,” “degree,” and “professor”). Topic 2 touches on “interactions with patients” (“laugh,” “smile,” and “boy”). Collocates alluding to “religion and medicine” are in topic 3 (“reverend,” “pray,” and “advise”). Topic 4 is about “sickness and death in the hospital” (“sick,” “dead,” and “fever”) while topic 5 is about “diagnosis and prescription” (“prescribe,” “medicine,” and “care”). Topic 6 focuses on “rural healthcare” (“village,” “horse,” “ride,” and “inquire”).

Youth × Occupation Framing (Doctors or Physicians) in the 1900s

Collocates in topic 1 paint younger physicians as having up-to-date “medical knowledge” and being proficient in the use of newer medical technologies (“treatment,” “medical,” “informatics,” and “technology”). Those in topic 2 hint at how younger doctors dispense “high-quality care” (“better” and “care”). Topic 3 is about “sickness and death in the hospital” (“sick,” “disease,” and “die”). Topic 4 dwells on “treatment and recovery” (“treatment,” “cure,” and “best”). Topic 5 looks at the “financial woes” confronted by some patients (“shoulder,” “worry,” “bill,” and “enough”) and topic 6 at “consultations and medical appointments” (“consult,” “advise,” and “visit”).

Old Age × Occupation Framing (Doctors or Physicians) in the 1800s

Topic 1 describes the “skill and expertise” of older doctors (“skill” and “experience”). Topic 2 covers “visits to the hospital” (“visit,” “consult,” and “opinion”). Topic 3 makes reference to “religion and medicine” (“divinity,” “village,” and “medicine”) and topic 4 to “sickness and death in the hospital” (“die” and “trouble”).

Old Age × Occupation Framing (Doctors or Physicians) in the 1900s

Topic 1 deals with the wealth of “experience” of older doctors (“experience” and “practitioner”). Comparisons of doctors with “other occupations” (“lawyer,” “surgeon,” and “dentist”) are in topic 2. Topic 3 is about “diagnosis and prescription” (“prescription,” “drug,” and “diagnosis”) and topic 4 is about “sickness and death in the hospital” (“disease” and “dead”). Topic 5 pertains to “surgery” (“operation,” “perform,” and “abortion”). Topic 6 looks at “bedside manner” (“laugh” and “smile”) and topic 7 involves “royal visits” (“royal,” “visit,” and “news”).

Youth × Occupation Framing (Lawyers or Attorneys) in the 1800s

Words in topic 1 portray the legal profession as one of “prestige” (“eminent,” “distinguished,” and “prominent”). Those in topic 2 concern “law and religion” (“clergyman,” “judge,” and “divine”). Topic 3 features terms related to the “judicial district” (“judicial,” “district,” “jury,” and “statesman”) and topic 4 features terms related to “prosecution” (“prosecute” and “defend”). Collocates in topic 5 imply that younger attorneys have “limited experience” (“less,” “experience,” and “knowledge”).

Youth × Occupation Framing (Lawyers or Attorneys) in the 1900s

Topic 1 comprises terms related to “courtroom affairs” (“affairs,” “drama,” “judge,” and “court”). Topic 2 circles around issues regarding “trial and prosecution” (“plead,” “charge,” and “criminal”). Matters regarding “politics” dominate topic 3 (“republican,” “democratic,” and “senate”). Specific “lawsuits against politicians” are in topic 4 (“settlement,” “lawsuit,” “allegation,” and “complaint”). Topic 5 focuses on “legal fees” (“legal,” “insurance,” “interest,” and “cost”) and topic 6 compares the legal profession to “other occupations” (“accountant,” “banker,” “doctor,” and “teacher”).

Old Age × Occupation Framing (Lawyers or Attorneys) in the 1800s

Topic 1 depicts older lawyers as having a certain level of “prestige” (“eminent” and “respectable”). Topic 2 revolves around “legal guidance” (“opinion” and “advice”). Topic 3 covers “prosecution” (“prosecute,” “evidence,” and “judge”) and topic 4 contains terms related to “legal fees” (“interest,” “dollar,” and “fee”).

Old Age × Occupation Framing (Lawyers or Attorneys) in the 1900s

The idea that older attorneys have more “seniority and experience” is in topic 1 (“senior,” “experience,” “successful,” and “complex”). Topic 2 discusses matters regarding “courtroom proceedings” (“judge,” “defendant,” and “trial”). Topic 3 is related to “legal guidance” (“hire” and “represent”). Topic 4 is about the “bar” (“association” and “conference”) and topic 5 is about “lawsuits against politicians” (“file,” “governor,” and “republican”). Topic 6 involves “family law” (“divorce,” “witness,” and “son”).

In this study, we compared how sentiments toward older and younger doctors and lawyers have changed in the media over the last 2 centuries. Findings reveal that media outlets have placed a premium on youth in the medical enterprise but on old age in the legal fraternity.

Narratives on younger doctors in the media are more positive than on older doctors. Additionally, while narratives on older doctors have become more negative, narratives on younger doctors are experiencing the reverse trend. One possible reason for these trends is the rise of new medical technologies and techniques that younger doctors are usually assumed to be more conversant with. Moreover, medical practitioners are said to be the most proficient in the years immediately after completing residency training. There is a common belief that doctors further from training may rely on out-of-date clinical evidence and not adhere as rigidly to evidence-based guidelines as younger doctors [ 33 ]. As seen in our LDA results, from the 1900s, the media began emphasizing the merits of younger physicians by highlighting their up-to-date medical knowledge. Although there were portrayals of older doctors as being skilled in the 1800s, this changed in the 1900s, during which more attention was dedicated to their experience and bedside manner. Thus, there may have been a subconscious association of youth with medical competence in the media.

Narratives on older lawyers in the media are more positive than those on younger lawyers. While narratives on older lawyers have become more positive, the trend is the opposite for younger lawyers. Younger attorneys may be seen as more enthusiastic and willing to learn, but they may also be viewed as lacking the expertise of their older and more seasoned counterparts. The increasing specialization of the legal profession over time [ 54 ] may have resulted in age becoming more highly valued in the field. Older attorneys may be seen as possessing a deeper understanding of legal precedents and a greater ability to navigate the complexities and intricacies of legal matters [ 35 ]. Our LDA results uncovered that since the 1900s, older lawyers have been depicted in the media as having a wealth of expertise, which may render them better able to provide valuable legal guidance to clients and colleagues alike.

The issue of youth-directed ageism is beyond the ambit of this study. Nevertheless, it cannot go unacknowledged that narratives on younger lawyers have become increasingly negative. Over the years, clarion calls have been sounded for bullying in the legal profession to end. In 2018, the International Bar Association surveyed 7000 lawyers from 135 countries on the topic of bullying and sexual harassment in the profession [ 55 ]. Results from the survey lent empirical support to claims about the rampancy of bullying and sexual harassment in the legal enterprise, which may not be particularly shocking in view of the adversarial, hierarchical, and hypercompetitive nature of the job [ 55 ]. Younger legal professionals were found to be disproportionately affected by bullying, and respondents from the United States reported higher rates of both bullying as well as sexual harassment than the global average [ 55 ]. This may account for the increase in negativity associated with narratives on younger attorneys.

In line with social constructionism [ 4 ], agenda-setting [ 5 , 6 ], and cultivation theories [ 7 ], the media is instrumental in shaping public perceptions of older adults. Depictions of younger physicians as being well acquainted with the newest medical technologies, though particularly important as they begin their medical careers, may come at the expense of older doctors. The public may be conditioned to view younger doctors as superior to older ones, which could consequently foment ageist stereotypes. The fact that depictions of older lawyers have grown more positive over time is heartening as it shows that older lawyers are being celebrated for their seniority and expertise.

The finding that an age premium exists in depictions of lawyers but not in depictions of doctors is interesting. It may be that the age of a physician is perceived as more important as work in the medical setting may have life-threatening consequences. Indeed, scholars have contended that older physicians’ waning physical skills may pose major surgical risks [ 38 , 39 ]. Such risks may be thought to outweigh the potential lessons to be imparted by older doctors. Physical dexterity has comparatively little—if any—bearing on whether one is able to perform one’s legal duties effectively, which may explain the differing results between portrayals of older doctors and lawyers.

This study yields some important insights. First, it is important that narratives on older adults in the media accurately mirror the contributions of older adults to society. More attention should be dedicated to the immense value they add to the workplace and economy so that negative stereotypes are progressively removed from the public's cognitive repertoire. Just as the media has highlighted the strengths of older lawyers, it may be worthwhile to increase media coverage of the unique strengths of older doctors and what they can bring to patient care. Media campaigns could be held to promote positive images of older doctors. Such campaigns could feature interviews with these doctors, details of their achievements, and testimonials from patients who have received exceptional care from them. Additionally, media outlets could offer training programs for journalists to sensitize them to the issue of ageism and to equip them with guidelines on writing about older doctors tactfully. To avoid perpetuating ageist stereotypes and to promote better psychological well-being among older persons [26], it is crucial to ensure accurate depictions of older professionals in the media. As only a small fraction of older professionals are covered in the media, the few depictions that do exist hold significant influence in shaping public attitudes toward aging.

Second, our results indicate that depending on the occupation, old age may be deemed more valuable than youth. There is therefore a need to explore how the merits of old age—skill, knowledge, and experience—in these occupational contexts could be extended to the way the older population is viewed as a whole. To this end, it may be useful to encourage more interaction between older and younger people in the workplace. Age diversity may translate into advantages in both human and social capital [ 56 ] as knowledge is passed from young to old and old to young [ 57 ]. Relatedly, it is imperative that older persons are eased into the retirement phase and not tossed into it unceremoniously. Opportunities to continue contributing to the workplace—such as through mentorship programs—should be provided so as to harness the treasury of skills and knowledge of this cohort. This will also allow older persons to transition more seamlessly into the next phase of life.

This study has various limitations. First, we acknowledge that an emphasis on occupational roles may wind up perpetuating the notion that older people are to be valued only if they remain economically productive [58]. However, our findings reveal that old age may evoke positive stereotypes in certain occupational contexts, which has implications for how negative stereotypes of older adults can be eliminated. Second, as only 2 occupational roles were used for analysis, the differences in portrayals of older and younger workers in other occupational settings remain an uncharted area of research. Furthermore, the occupations examined in this study are generally considered prestigious. Future scholarship could examine whether old age is valued in the context of blue-collar occupations. Third, our search only included the 2 professions as umbrella terms without specifying the various specialties within each profession, such as cardiologists, ophthalmologists, criminal defense prosecutors, and tax attorneys. Thus, our study only lays out in broad strokes how older lawyers and doctors are depicted in the media. Future studies could address this limitation by delving into the various subcategories.

Fourth, the data set used in this study comprises only American sources. The meanings ascribed to age and to occupations are likely to differ across cultural contexts. For instance, the idea of the American dream may have had an impact on the portrayal of the legal profession since it could be tied closely to the notion of wealth and success. A study that investigates how portrayals of older doctors and lawyers vary across cultures may hence be worth pursuing. Finally, it is important to acknowledge that LDA may not account for the context in which words are used, making it difficult to discern whether ageist portrayals are propagated by journalists or other sources. In addition, LDA may not detect nuances in language such as tone, sarcasm, and irony. Further research is therefore required to determine the generalizability of our findings. Other directions for future study include an analysis of how narratives pertaining to select occupations have unfolded since the 2000s. It is likely that topics related to artificial intelligence and automation will emerge in these narratives. Surveys [59], interviews [60], and big data analytics [61,62] could also be used to explore the types of stereotypes linked to professionals of different age groups.

Conclusions

In seeking to eradicate ageism, perhaps a pressing question is not so much whether this phenomenon exists, but rather in what situations it manifests itself and to what degree. This study has demonstrated that depending on the occupation, one’s age may either be seen as an asset or a liability. Moving forward, effort must be expended to ensure that older professionals are recognized for their wealth of knowledge and skills. Failing to capitalize on the merits of an aging workforce could ultimately be a grave disservice not only to older adults but to society in general.

Acknowledgments

We are grateful to W. Yang for preprocessing the data. We gratefully acknowledge the support of the Commonwealth Fund's Harkness Fellowship in Healthcare Policy and Practice and the Social Science Research Council SSHR Fellowship (MOE2018-SSHR-004). The funders had no role in study design, data collection, analysis, writing, or the decision to publish.

Data Availability

Data are publicly available at English-Corpora.org [ 63 ].

Authors' Contributions

RN designed the study, developed the methodology, analyzed the data, wrote the paper, and acquired the funding. NI cowrote the paper.

Conflicts of Interest

None declared.

References

1. U.S. Bureau of Labor Statistics. FRED. URL: https://fred.stlouisfed.org/series/LNS11024230 [accessed 2021-12-21]
2. United States Census Bureau. 2020 census will help policymakers prepare for the incoming wave of aging boomers. Census.gov. 2019. URL: https://www.census.gov/library/stories/2019/12/by-2030-all-baby-boomers-will-be-age-65-or-older.html [accessed 2021-10-20]
3. Ng R, Allore HG, Trentalange M, Monin JK, Levy BR. Increasing negativity of age stereotypes across 200 years: evidence from a database of 400 million words. PLoS One. 2015;10(2):e0117086.
4. Berger PL, Luckmann T. The Social Construction of Reality: A Treatise in the Sociology of Knowledge. 1st edition. Garden City, NY: Penguin Books; 1966.
5. McCombs ME, Shaw DL. The agenda-setting function of mass media. Public Opin Q. 1972;36(2):176-187.
6. Lippmann W. Public Opinion. 1st edition. New York, NY: Macmillan; 1922.
7. Gerbner G, Gross L, Morgan M, Signorielli N, Shanahan J. Growing up with television: cultivation processes. In: Bryant J, Zillmann D, editors. Media Effects: Advances in Theory and Research. 2nd edition. Mahwah, NJ: Lawrence Erlbaum Associates; 2002:43-67.
8. Loos E, Ivan L. Visual ageism in the media. In: Ayalon L, Tesch-Römer C, editors. Contemporary Perspectives on Ageism. Cham: Springer International Publishing; 2018:163-176.
9. Ylänne V. Representations of ageing in the media. In: Twigg J, Martin W, editors. Routledge Handbook of Cultural Gerontology. London: Routledge; 2015:369-376.
10. Ng R, Indran N. Role-based framing of older adults linked to decreased ageism over 210 years: evidence from a 600-million-word historical corpus. Gerontologist. 2022;62(4):589-597.
11. Ng R, Indran N, Liu L. Ageism on Twitter during the COVID-19 pandemic. J Soc Issues. 2022;78(4):842-859.
12. Ng R, Indran N. Hostility toward baby boomers on TikTok. Gerontologist. 2022;62(8):1196-1206.
13. Ng R, Indran N. Innovations for an aging society through the lens of patent data. Gerontologist. 2024;64(2):gnad015.
14. Ng R, Indran N. Does age matter? tweets about gerontocracy in the United States. J Gerontol B Psychol Sci Soc Sci. 2023;78(11):1870-1878.
15. Ng R, Indran N. Questions about aging and later life on Quora. Gerontologist. 2024. (forthcoming)
16. Hummert ML. Multiple stereotypes of elderly and young adults: a comparison of structure and evaluations. Psychol Aging. 1990;5(2):182-193.
17. Kite ME, Stockdale GD, Whitley BE, Johnson BT. Attitudes toward younger and older adults: an updated meta-analytic review. J Soc Issues. 2005;61(2):241-266.
18. Bodenhausen GV, Todd AR, Richeson JA. Controlling prejudice and stereotyping: antecedents, mechanisms, and contexts. In: Nelson TD, editor. Handbook of Prejudice, Stereotyping, and Discrimination. New York, NY: Psychology Press; 2009:111-135.
19. Casper C, Rothermund K, Wentura D. The activation of specific facets of age stereotypes depends on individuating information. Soc Cogn. 2011;29(4):393-414.
20. Kleissner V, Jahn G. Implicit and explicit measurement of work-related age attitudes and age stereotypes. Front Psychol. 2020;11:579155.
21. Ng TWH, Feldman DC. Evaluating six common stereotypes about older workers with meta-analytical data. Pers Psychol. 2012;65(4):821-858.
22. Hassell BL, Perrewe PL. An examination of beliefs about older workers: do stereotypes still exist? J Organ Behav. 2006;16(5):457-468.
23. Van Dalen HP, Henkens K, Schippers J. Dealing with older workers in Europe: a comparative survey of employers' attitudes and actions. J Eur Soc Policy. 2009;19(1):47-60.
24. Gringart E, Helmes E, Speelman CP. Exploring attitudes toward older workers among Australian employers: an empirical study. J Aging Soc Policy. 2005;17(3):85-103.
25. Finkelstein LM, Ryan KM, King EB. What do the young (old) people think of me? content and accuracy of age-based metastereotypes. Eur J Work Organ Psychol. 2013;22(6):633-657.
26. Levy B. Stereotype embodiment: a psychosocial approach to aging. Curr Dir Psychol Sci. 2009;18(6):332-336.
27. Auman C, Bosworth HB, Hess TM. Effect of health-related stereotypes on physiological responses of hypertensive middle-aged and older men. J Gerontol B Psychol Sci Soc Sci. 2005;60(1):P3-P10.
28. Levy BR, Hausdorff JM, Hencke R, Wei JY. Reducing cardiovascular stress with positive self-stereotypes of aging. J Gerontol B Psychol Sci Soc Sci. 2000;55(4):P205-P213.
29. Levy BR, Slade MD, Kunkel SR, Kasl SV. Longevity increased by positive self-perceptions of aging. J Pers Soc Psychol. 2002;83(2):261-270.
30. Mckinstry B, Yang SY. Do patients care about the age of their general practitioner? a questionnaire survey in five practices. Br J Gen Pract. 1994;44(385):349-351.
31. Usta J, Antoun J, Ambuel B, Khawaja M. Involving the health care system in domestic violence: what women want. Ann Fam Med. 2012;10(3):213-220.
32. Choudhry NK, Fletcher RH, Soumerai SB. Systematic review: the relationship between clinical experience and quality of health care. Ann Intern Med. 2005;142(4):260-273.
33. MacRae H. Not too old, not too young: older women's perceptions of physicians. Can J Aging. 2015;34(4):545-560.
34. Tsugawa Y, Newhouse JP, Zaslavsky AM, Blumenthal DM, Jena AB. Physician age and outcomes in elderly patients in hospital in the US: observational study. BMJ. 2017;357:j1797.
35. Frye KA, Oten B. Senior moments: an examination of the ethical and practical considerations of our aging bar. J Am Acad Matrim Lawyers. 2019;31:371-404.
36. Barnes C. Time to go: helping lawyers retire with dignity. Am Bar Assoc. 2008;32(6).
37. Constansia RDN, Hentzen JEKR, Buis CI, Klaase JM, de Meijer VE, Meerdink M. Is surgical subspecialization associated with hand grip strength and manual dexterity? a cross-sectional study. Ann Med Surg (Lond). 2022;73:103159.
38. Blasier RB. The problem of the aging surgeon: when surgeon age becomes a surgical risk factor. Clin Orthop Relat Res. 2009;467(2):402-411.
39. Greenfield LJ, Proctor MC. When should a surgeon retire? Adv Surg. 1999;32:385-393.
40. Ng R, Indran N. Reframing aging: foregrounding familial and occupational roles of older adults is linked to decreased ageism over two centuries. J Aging Soc Policy. 2023:1-18.
41. Ng R, Indran N. Impact of old age on an occupation's image over 210 years: an age premium for doctors, lawyers, and soldiers. J Appl Gerontol. 2023;42(6):1345-1355.
42. Ng R, Indran N. Role-based framing of older adults linked to decreased ageism over 210 years: evidence from a 600-million-word historical corpus. Gerontologist. 2022;62(4):589-597.
43. Davies M. Expanding horizons in historical linguistics with the 400-million word corpus of historical American English. Corpora. 2012;7(2):121-157.
44. Doctors, military officers, firefighters, and scientists seen as among America's most prestigious occupations. The Harris Poll. 2014. URL: https://tinyurl.com/4e72ez3u [accessed 2021-09-10]
45. Thielbar G, Feldman SD. Occupational stereotypes and prestige. Soc Forces. 1969;48(1):64-72.
46. Church KW, Hanks P. Word association norms, mutual information, and lexicography. Comput Linguist. 1990;16(1):22-29.
47. Ng R, Indran N, Suarez P. Communicating risk perceptions through batik art. JAMA. 2023;330(9):790-791.
48. Ng R, Indran N, Yang W. Portrayals of older adults in over 3000 films around the world. J Am Geriatr Soc. 2023;71(9):2726-2735.
49. Ng R, Tan YW. Diversity of COVID-19 news media coverage across 17 countries: the influence of cultural values, government stringency and pandemic severity. Int J Environ Res Public Health. 2021;18(22):11768.
50. Ng R, Chow TYJ, Yang W. The impact of aging policy on societal age stereotypes and ageism. Gerontologist. 2022;62(4):598-606.
51. Levy B, Langer E. Aging free from negative stereotypes: successful memory in China and among the American deaf. J Pers Soc Psychol. 1994;66(6):989-997.
52. Ng R, Indran N. Reframing aging during COVID-19: familial role-based framing of older adults linked to decreased ageism. J Am Geriatr Soc. 2022;70(1):60-66.
53. Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res. 2003;3:993-1022.
54. Banks CP. The State and Federal Courts: A Complete Guide to History, Powers, and Controversy. Santa Barbara, CA: ABC-CLIO; 2017.
55. Pender K. Us too? bullying and sexual harassment in the legal profession. International Bar Association. 2019. URL: https://apo.org.au/node/248266 [accessed 2021-12-25]
56. Li Y, Gong Y, Burmeister A, Wang M, Alterman V, Alonso A, et al. Leveraging age diversity for organizational performance: an intellectual capital perspective. J Appl Psychol. 2021;106(1):71-91.
57. Froidevaux A, Alterman V, Wang M. Leveraging aging workforce and age diversity to achieve organizational goals: a human resource management perspective. In: Czaja S, Sharit J, James J, editors. Current and Emerging Trends in Aging and Work. Cham: Springer International Publishing; 2020:33-58.
58. van Dyk S. The appraisal of difference: critical gerontology and the active-ageing-paradigm. J Aging Stud. 2014;31:93-103.
59. Sima LC, Ng R, Elimelech M. Modeling risk categories to predict the longitudinal prevalence of childhood diarrhea in Indonesia. Am J Trop Med Hyg. 2013;89(5):884-891.
60. Yu CC, Tan L, Tang B, Liaw SY, Tierney T, Ho YY, et al. The development of empathy in the healthcare setting: a qualitative approach. BMC Med Educ. 2022;22(1):245.
61. Ng R, Levy B. Pettiness: conceptualization, measurement and cross-cultural differences. PLoS One. 2018;13(1):e0191252.
62. Giest S, Ng R. Big data applications in governance and policy. Politics and Governance. 2018;6(4):1-4.
63. English-Corpora.org. URL: https://www.english-corpora.org/

Abbreviations

LDA: latent Dirichlet allocation

Edited by T Leung, G Eysenbach; submitted 19.01.23; peer-reviewed by L Allen, G Myreteg; comments to author 27.02.23; revised version received 13.03.23; accepted 31.08.23; published 26.03.24.

©Reuben Ng, Nicole Indran. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 26.03.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
