Complexity curve: a graphical measure of data complexity and classifier performance

We describe a method for assessing data set complexity based on the estimation of the underlining probability distribution and Hellinger distance. In contrast to some popular complexity measures, it is not focused on the shape of a decision boundary in a classification task but on the amount of avai...

Full description

Bibliographic Details
Main Authors: Julian Zubek, Dariusz M. Plewczynski
Format: Article
Language:English
Published: PeerJ Inc. 2016-08-01
Series:PeerJ Computer Science
Subjects:
Online Access:https://peerj.com/articles/cs-76.pdf
_version_ 1818984774317899776
author Julian Zubek
Dariusz M. Plewczynski
author_facet Julian Zubek
Dariusz M. Plewczynski
author_sort Julian Zubek
collection DOAJ
description We describe a method for assessing data set complexity based on the estimation of the underlining probability distribution and Hellinger distance. In contrast to some popular complexity measures, it is not focused on the shape of a decision boundary in a classification task but on the amount of available data with respect to the attribute structure. Complexity is expressed in terms of graphical plot, which we call complexity curve. It demonstrates the relative increase of available information with the growth of sample size. We perform theoretical and experimental examination of properties of the introduced complexity measure and show its relation to the variance component of classification error. We then compare it with popular data complexity measures on 81 diverse data sets and show that it can contribute to explaining performance of specific classifiers on these sets. We also apply our methodology to a panel of simple benchmark data sets, demonstrating how it can be used in practice to gain insights into data characteristics. Moreover, we show that the complexity curve is an effective tool for reducing the size of the training set (data pruning), allowing to significantly speed up the learning process without compromising classification accuracy. The associated code is available to download at: https://github.com/zubekj/complexity_curve (open source Python implementation).
first_indexed 2024-12-20T18:24:21Z
format Article
id doaj.art-b7c16d04b6f745e2b03666b3bfcbc926
institution Directory Open Access Journal
issn 2376-5992
language English
last_indexed 2024-12-20T18:24:21Z
publishDate 2016-08-01
publisher PeerJ Inc.
record_format Article
series PeerJ Computer Science
spelling doaj.art-b7c16d04b6f745e2b03666b3bfcbc9262022-12-21T19:30:12ZengPeerJ Inc.PeerJ Computer Science2376-59922016-08-012e7610.7717/peerj-cs.76Complexity curve: a graphical measure of data complexity and classifier performanceJulian Zubek0Dariusz M. Plewczynski1Centre of New Technologies, University of Warsaw, Warsaw, PolandCentre of New Technologies, University of Warsaw, Warsaw, PolandWe describe a method for assessing data set complexity based on the estimation of the underlining probability distribution and Hellinger distance. In contrast to some popular complexity measures, it is not focused on the shape of a decision boundary in a classification task but on the amount of available data with respect to the attribute structure. Complexity is expressed in terms of graphical plot, which we call complexity curve. It demonstrates the relative increase of available information with the growth of sample size. We perform theoretical and experimental examination of properties of the introduced complexity measure and show its relation to the variance component of classification error. We then compare it with popular data complexity measures on 81 diverse data sets and show that it can contribute to explaining performance of specific classifiers on these sets. We also apply our methodology to a panel of simple benchmark data sets, demonstrating how it can be used in practice to gain insights into data characteristics. Moreover, we show that the complexity curve is an effective tool for reducing the size of the training set (data pruning), allowing to significantly speed up the learning process without compromising classification accuracy. The associated code is available to download at: https://github.com/zubekj/complexity_curve (open source Python implementation).https://peerj.com/articles/cs-76.pdfLearning curvesData complexityData pruningHellinger distanceBias-variance decompositionPerformance measures
spellingShingle Julian Zubek
Dariusz M. Plewczynski
Complexity curve: a graphical measure of data complexity and classifier performance
PeerJ Computer Science
Learning curves
Data complexity
Data pruning
Hellinger distance
Bias-variance decomposition
Performance measures
title Complexity curve: a graphical measure of data complexity and classifier performance
title_full Complexity curve: a graphical measure of data complexity and classifier performance
title_fullStr Complexity curve: a graphical measure of data complexity and classifier performance
title_full_unstemmed Complexity curve: a graphical measure of data complexity and classifier performance
title_short Complexity curve: a graphical measure of data complexity and classifier performance
title_sort complexity curve a graphical measure of data complexity and classifier performance
topic Learning curves
Data complexity
Data pruning
Hellinger distance
Bias-variance decomposition
Performance measures
url https://peerj.com/articles/cs-76.pdf
work_keys_str_mv AT julianzubek complexitycurveagraphicalmeasureofdatacomplexityandclassifierperformance
AT dariuszmplewczynski complexitycurveagraphicalmeasureofdatacomplexityandclassifierperformance