A survey on text classification: Practical perspectives on the Italian language.

Bibliographic Details
Main Authors: Andrea Gasparetto, Alessandro Zangari, Matteo Marcuzzo, Andrea Albarelli
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2022-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0270904
collection DOAJ
description Text Classification methods have been improving at an unparalleled speed in the last decade thanks to the success brought about by deep learning. Historically, state-of-the-art approaches have been developed for and benchmarked against English datasets, while other languages have had to catch up and deal with inevitable linguistic challenges. This paper offers a survey with practical and linguistic connotations, showcasing the complications and challenges tied to the application of modern Text Classification algorithms to languages other than English. We approach this subject from the perspective of the Italian language, and we discuss in detail the issues related to the scarcity of task-specific datasets, as well as those posed by the computational cost of modern approaches. We substantiate this by providing an extensively researched list of available datasets in Italian and comparing it with a similarly compiled list for French. In order to simulate a real-world practical scenario, we apply a number of representative methods to custom-tailored multilabel classification datasets in Italian, French, and English. We conclude by discussing results, future challenges, and research directions from a linguistically inclusive perspective.
first_indexed 2024-04-13T10:36:40Z
id doaj.art-0e1aef4d4ad24cc6a4f1ceecfb88a3ca
institution Directory Open Access Journal
issn 1932-6203
last_indexed 2024-04-13T10:36:40Z
spelling doaj.art-0e1aef4d4ad24cc6a4f1ceecfb88a3ca 2022-12-22T02:50:02Z eng Public Library of Science (PLoS) PLoS ONE 1932-6203 2022-01-01 17 7 e0270904 10.1371/journal.pone.0270904 A survey on text classification: Practical perspectives on the Italian language. Andrea Gasparetto, Alessandro Zangari, Matteo Marcuzzo, Andrea Albarelli https://doi.org/10.1371/journal.pone.0270904
title A survey on text classification: Practical perspectives on the Italian language.
url https://doi.org/10.1371/journal.pone.0270904