A Formalization of Multilabel Classification in Terms of Lattice Theory and Information Theory: Concerning Datasets

Multilabel classification is a recently conceptualized task in machine learning. Contrary to most of the research that has so far focused on classification machinery, we take a data-centric approach and provide an integrative framework that blends qualitative and quantitative descriptions of multila...


Bibliographic Details
Main Authors: Francisco J. Valverde-Albacete, Carmen Peláez-Moreno
Format: Article
Language: English
Published: MDPI AG 2024-01-01
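Series: Mathematics
Subjects: multilabel classification; multilabel datasets; information sources; formal concept analysis; entropy balances; meta-analysis
Online Access: https://www.mdpi.com/2227-7390/12/2/346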
author Francisco J. Valverde-Albacete
Carmen Peláez-Moreno
collection DOAJ
description Multilabel classification is a recently conceptualized task in machine learning. Contrary to most of the research that has so far focused on classification machinery, we take a data-centric approach and provide an integrative framework that blends qualitative and quantitative descriptions of multilabel data sources. By combining lattice theory, in the form of formal concept analysis, and entropy triangles, obtained from information theory, we explain from first principles the fundamental issues of multilabel datasets such as the dependencies of the labels, their imbalances, or the effects of the presence of hapaxes. This allows us to provide guidelines for resampling and new data collection and their relationship with broad modelling approaches. We have empirically validated our framework using 56 open datasets, challenging previous characterizations and showing that our formalization brings useful insights into the task of multilabel classification. Further work will consider the extension of this formalization to understand the relationship between the data sources, the classification methods, and ways to assess their performance.
first_indexed 2024-03-08T10:41:38Z
format Article
id doaj.art-c7875c159a3a49419a5c5295c9f5b792
institution Directory Open Access Journal
issn 2227-7390
language English
last_indexed 2024-03-08T10:41:38Z
publishDate 2024-01-01
publisher MDPI AG
record_format Article
series Mathematics
spelling doaj.art-c7875c159a3a49419a5c5295c9f5b792; 2024-01-26T17:34:19Z; eng; MDPI AG; Mathematics; 2227-7390; 2024-01-01; volume 12, issue 2, article 346; 10.3390/math12020346; A Formalization of Multilabel Classification in Terms of Lattice Theory and Information Theory: Concerning Datasets; Francisco J. Valverde-Albacete (Department of Signal Theory and Communications, Telematic Systems and Computation, Universidad Rey Juan Carlos, 28942 Fuenlabrada, Madrid, Spain); Carmen Peláez-Moreno (Department of Signal Theory and Communications, Universidad Carlos III de Madrid, 28911 Leganés, Madrid, Spain); https://www.mdpi.com/2227-7390/12/2/346; multilabel classification; multilabel datasets; information sources; formal concept analysis; entropy balances; meta-analysis
title A Formalization of Multilabel Classification in Terms of Lattice Theory and Information Theory: Concerning Datasets
topic multilabel classification
multilabel datasets
information sources
formal concept analysis
entropy balances
meta-analysis
url https://www.mdpi.com/2227-7390/12/2/346