An Evaluation of Multilingual Offensive Language Identification Methods for the Languages of India

The pervasiveness of offensive content in social media has become an important reason for concern for online platforms. With the aim of improving online safety, a large number of studies applying computational models to identify such content have been published in the last few years, with promising...

Bibliographic Details
Main Authors:	Tharindu Ranasinghe, Marcos Zampieri
Format:	Article
Language:	English
Published:	MDPI AG 2021-07-01
Series:	Information
Subjects:	offensive language identification; deep learning; multilingual learning
Online Access:	https://www.mdpi.com/2078-2489/12/8/306

author	Tharindu Ranasinghe Marcos Zampieri
collection	DOAJ
description	The pervasiveness of offensive content in social media has become an important reason for concern for online platforms. With the aim of improving online safety, a large number of studies applying computational models to identify such content have been published in the last few years, with promising results. The majority of these studies, however, deal with high-resource languages such as English due to the availability of datasets in these languages. Recent work has addressed offensive language identification from a low-resource perspective, exploring data augmentation strategies and trying to take advantage of existing multilingual pretrained models to cope with data scarcity in low-resource scenarios. In this work, we revisit the problem of low-resource offensive language identification by evaluating the performance of multilingual transformers in offensive language identification for languages spoken in India. We investigate languages from different families such as Indo-Aryan (e.g., Bengali, Hindi, and Urdu) and Dravidian (e.g., Tamil, Malayalam, and Kannada), creating important new technology for these languages. The results show that multilingual offensive language identification models perform better than monolingual models and that cross-lingual transformers show strong zero-shot and few-shot performance across languages.
first_indexed	2024-03-10T08:43:36Z
format	Article
id	doaj.art-1a8a60cbb06d4a66b451e690e08806a6
institution	Directory Open Access Journal
issn	2078-2489
language	English
last_indexed	2024-03-10T08:43:36Z
publishDate	2021-07-01
publisher	MDPI AG
record_format	Article
series	Information
spelling	doaj.art-1a8a60cbb06d4a66b451e690e08806a6; 2023-11-22T08:05:51Z; eng; MDPI AG; Information; 2078-2489; 2021-07-01; vol. 12, no. 8, art. 306; doi:10.3390/info12080306; An Evaluation of Multilingual Offensive Language Identification Methods for the Languages of India; Tharindu Ranasinghe (Research Group in Computational Linguistics, University of Wolverhampton, Wolverhampton WV1 1LY, UK); Marcos Zampieri (Language Technology Group, Rochester Institute of Technology, Rochester, NY 14623, USA); https://www.mdpi.com/2078-2489/12/8/306; offensive language identification; deep learning; multilingual learning
title	An Evaluation of Multilingual Offensive Language Identification Methods for the Languages of India
topic	offensive language identification; deep learning; multilingual learning
url	https://www.mdpi.com/2078-2489/12/8/306

Similar Items