Method for Data Quality Assessment of Synthetic Industrial Data

Sometimes it is difficult, or even impossible, to acquire the real sensor and machine data needed for research. Modern industrial platforms, for example, are frequently reluctant to share data. In such situations, the only option is to work with synthetic data obtained by si...

Bibliographic Details
Main Authors: László Barna Iantovics, Călin Enăchescu
Format: Article
Language: English
Published: MDPI AG 2022-02-01
Series: Sensors
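Subjects: smart sensor; sensor data; smart factory; Industry 4.0; data-quality assessment; prediction problem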
Online Access: https://www.mdpi.com/1424-8220/22/4/1608
author László Barna Iantovics
Călin Enăchescu
collection DOAJ
description Sometimes it is difficult, or even impossible, to acquire the real sensor and machine data needed for research. Modern industrial platforms, for example, are frequently reluctant to share data. In such situations, the only option is to work with synthetic data obtained by simulation. A limitation of simulated data is that they may be unsuitable for research because of poor quality or limited quantity; algorithms designed and tested on such data do not give credible results. To avoid such situations, we consider that mathematically grounded data-quality assessments should be designed according to the specific type of problem that must be solved. In this paper, we address a multivariate prediction problem whose results can ultimately be used for binary classification. We propose a mathematically grounded data-quality assessment that includes, among other things, an analysis of the predictive power of the independent variables used for prediction. We present the assumptions that the synthetic data should satisfy; the associated threshold values are established by a human assessor. If all the assumptions are satisfied, we can consider the data appropriate for research, including research that applies other methods to the same type of problem. The applied method ultimately delivers a classification table to which any indicator of classification quality can be applied, such as sensitivity, specificity, accuracy, F1 score, area under the curve (AUC), receiver operating characteristic (ROC), true skill statistic (TSS) and the Kappa coefficient. The values of these indicators make it possible to compare the results of the considered method with those of any other method applied to the same type of problem. For evaluation and validation purposes, we performed an experimental case study on a novel synthetic dataset provided by the well-known UCI data repository.
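As a rough illustration of the indicators named in the description above, the following Python sketch computes several of them from a binary classification table using their standard textbook definitions. The names `ClassificationTable` and `quality_indicators` and the example counts are hypothetical and are not taken from the paper. AUC and ROC are left out because they require continuous prediction scores rather than a single table of counts.

```python
from dataclasses import dataclass
from math import nan


@dataclass(frozen=True)
class ClassificationTable:
    """Counts of a binary classification table (hypothetical helper type)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else nan


def quality_indicators(t: ClassificationTable) -> dict[str, float]:
    """Compute standard classification-quality indicators from the counts."""
    total = t.tp + t.fp + t.tn + t.fn
    sensitivity = _ratio(t.tp, t.tp + t.fn)      # true positive rate
    specificity = _ratio(t.tn, t.tn + t.fp)      # true negative rate
    accuracy = _ratio(t.tp + t.tn, total)
    f1 = _ratio(2 * t.tp, 2 * t.tp + t.fp + t.fn)
    tss = sensitivity + specificity - 1          # true skill statistic

    # Cohen's kappa: observed agreement corrected for chance agreement.
    chance = _ratio(
        (t.tp + t.fp) * (t.tp + t.fn) + (t.tn + t.fn) * (t.tn + t.fp),
        total * total,
    )
    kappa = _ratio(accuracy - chance, 1 - chance)

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "tss": tss,
        "kappa": kappa,
    }


if __name__ == "__main__":
    # Made-up counts, used only to show the call.
    example = ClassificationTable(tp=40, fp=10, tn=45, fn=5)
    for name, value in quality_indicators(example).items():
        print(f"{name}: {value:.4f}")
```

Because every indicator depends only on the four counts, two methods applied to the same data can be compared by feeding each method's classification table through the same function.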
first_indexed 2024-03-09T21:05:37Z
format Article
id doaj.art-dfa5b1b961f64b5a93181629284fa9e9
institution Directory Open Access Journal
issn 1424-8220
language English
last_indexed 2024-03-09T21:05:37Z
publishDate 2022-02-01
publisher MDPI AG
record_format Article
series Sensors
spelling doaj.art-dfa5b1b961f64b5a93181629284fa9e9
2023-11-23T22:02:16Z
eng
MDPI AG
Sensors
1424-8220
2022-02-01
vol. 22, no. 4, article 1608
10.3390/s22041608
Method for Data Quality Assessment of Synthetic Industrial Data
László Barna Iantovics (0)
Călin Enăchescu (1)
(0) Department of Electrical Engineering and Information Technology, George Emil Palade University of Medicine, Pharmacy, Science and Technology of Targu Mures, 540142 Targu Mures, Romania
(1) Department of Electrical Engineering and Information Technology, George Emil Palade University of Medicine, Pharmacy, Science and Technology of Targu Mures, 540142 Targu Mures, Romania
https://www.mdpi.com/1424-8220/22/4/1608
smart sensor
sensor data
smart factory
Industry 4.0
data-quality assessment
prediction problem
title Method for Data Quality Assessment of Synthetic Industrial Data
topic smart sensor
sensor data
smart factory
Industry 4.0
data-quality assessment
prediction problem
url https://www.mdpi.com/1424-8220/22/4/1608