A data value metric for quantifying information content and utility

Abstract: Data-driven innovation is propelled by recent scientific advances, rapid technological progress, substantial reductions in manufacturing costs, and significant demand for effective decision support systems. This has led to efforts to collect massive amounts of heterogeneous, multisource data; however, not all data are of equal quality or equally informative. Previous methods to capture and quantify the utility of data include value of information (VoI), quality of information (QoI), and mutual information (MI). This manuscript introduces a new measure to quantify whether larger volumes of increasingly complex data enhance, degrade, or alter their information content and utility with respect to specific tasks. We present a new information-theoretic measure, called the Data Value Metric (DVM), that quantifies the useful information content (energy) of large and heterogeneous datasets. The DVM formulation is based on a regularized model balancing data analytical value (utility) against model complexity. DVM can be used to determine whether appending, expanding, or augmenting a dataset may be beneficial in specific application domains. Subject to the choice of data-analytic, inferential, or forecasting techniques employed to interrogate the data, DVM quantifies the information boost, or degradation, associated with increasing the data size or expanding the richness of its features. DVM is defined as a mixture of a fidelity term and a regularization term. The fidelity term captures the usefulness of the sample data in the context of the inferential task; the regularization term represents the computational complexity of the corresponding inferential method. Inspired by the concept of the information bottleneck in deep learning, the fidelity term depends on the performance of the corresponding supervised or unsupervised model. We tested the DVM method on several alternative supervised and unsupervised regression, classification, clustering, and dimensionality reduction tasks. Both real and simulated datasets, with weak and strong signal information, were used in the experimental validation. Our findings suggest that DVM effectively captures the balance between analytical value and algorithmic complexity. Changes in the DVM expose the tradeoffs between algorithmic complexity and data analytical value in terms of the sample size and feature richness of a dataset. DVM values may be used to determine the size and characteristics of the data needed to optimize the relative utility of various supervised or unsupervised algorithms.
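The abstract describes the DVM only schematically, as a fidelity term (model performance on the inferential task) offset by a regularization term (computational complexity); it does not give the closed-form definition, which appears in the full paper. The sketch below is therefore purely illustrative: it scores a toy hold-out classifier by accuracy (a stand-in for the fidelity term) minus a sample-size-dependent penalty (a stand-in for the regularization term). All function names, the nearest-centroid model, and the logarithmic penalty form are assumptions for illustration, not the authors' actual formulation.

```python
import math
import random

def make_data(n, seed=0):
    """Toy 1-D two-class data: class 0 centered near -1, class 1 near +1."""
    rng = random.Random(seed)
    xs, ys = [], []
    for i in range(n):
        y = i % 2
        xs.append((1.0 if y else -1.0) + rng.gauss(0.0, 1.0))
        ys.append(y)
    return xs, ys

def fidelity(xs, ys):
    """Hold-out accuracy of a nearest-centroid rule -- a stand-in for the
    abstract's fidelity term (performance of the fitted model)."""
    half = len(xs) // 2
    tr_x, tr_y = xs[:half], ys[:half]
    te_x, te_y = xs[half:], ys[half:]
    c0 = sum(x for x, y in zip(tr_x, tr_y) if y == 0) / max(1, tr_y.count(0))
    c1 = sum(x for x, y in zip(tr_x, tr_y) if y == 1) / max(1, tr_y.count(1))
    preds = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in te_x]
    return sum(p == y for p, y in zip(preds, te_y)) / len(te_y)

def dvm_like(n, lam=0.1, n_max=1024):
    """Hypothetical DVM-style score: fidelity minus a complexity penalty
    that grows with sample size. The penalty form is an assumption."""
    xs, ys = make_data(n)
    penalty = lam * math.log(n) / math.log(n_max)  # normalized cost term
    return fidelity(xs, ys) - penalty

# Scanning the score across sample sizes mimics how the paper uses DVM to
# ask whether more data is worth its added cost for a given task.
for n in (32, 128, 512):
    print(n, round(dvm_like(n), 3))
```

Under this toy setup, the score rises while extra samples still sharpen the centroids faster than the penalty grows, then flattens or declines, which is the qualitative tradeoff the abstract attributes to DVM.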


Bibliographic Details
Main Authors: Morteza Noshad, Jerome Choi, Yuming Sun, Alfred Hero, Ivo D. Dinov
Format: Article
Language: English
Published: SpringerOpen, 2021-06-01
Series: Journal of Big Data
ISSN: 2196-1115
Author Affiliations: Department of Electrical Engineering and Computer Science, University of Michigan (Noshad, Hero); Statistics Online Computational Resource, University of Michigan (Choi, Sun, Dinov)
Subjects: Data energy; Artificial intelligence; Machine learning; Data utility; Information content
Online Access: https://doi.org/10.1186/s40537-021-00446-6