Offensive language identification in Dravidian languages using MPNet and CNN

Social media has effectively replaced traditional forms of communication and marketing. As these platforms allow for the free expression of ideas and facts through text, images, and videos, there exists a significant need to screen them to safeguard people and organisations from objectionable inform...

Bibliographic Details
Main Authors: Bharathi Raja Chakravarthi, Manoj Balaji Jagadeeshan, Vasanth Palanikumar, Ruba Priyadharshini
Format: Article
Language: English
Published: Elsevier 2023-04-01
Series: International Journal of Information Management Data Insights
Subjects: Offensive language identification; Dravidian languages; Code-mixing; Deep learning; MPNet; CNN
Online Access: http://www.sciencedirect.com/science/article/pii/S2667096822000945
_version_ 1797847711228100608
author Bharathi Raja Chakravarthi
Manoj Balaji Jagadeeshan
Vasanth Palanikumar
Ruba Priyadharshini
collection DOAJ
description Social media has effectively replaced traditional forms of communication and marketing. Because these platforms allow the free expression of ideas and facts through text, images, and videos, there is a significant need to screen them to safeguard people and organisations from objectionable information directed at them. Our work aims to categorise code-mixed social media comments and posts in Tamil, Malayalam, and Kannada as offensive or not offensive at different levels. We present a multilingual MPNet and CNN fusion model for detecting offensive language content directed at an individual (or group) in low-resource Dravidian languages at different levels. Our model is capable of handling code-mixed data, such as text that mixes Tamil and Latin scripts. The model was validated on the datasets, achieving weighted average F1-scores of 0.85, 0.98, and 0.76 and outperforming the baseline models EWDT and EWODT by 0.02, 0.02, and 0.04 for Tamil, Malayalam, and Kannada respectively.
first_indexed 2024-04-09T18:15:48Z
format Article
id doaj.art-8a406a67cc4b4f8ea591d777248aee9d
institution Directory Open Access Journal
issn 2667-0968
language English
last_indexed 2024-04-09T18:15:48Z
publishDate 2023-04-01
publisher Elsevier
record_format Article
series International Journal of Information Management Data Insights
spelling doaj.art-8a406a67cc4b4f8ea591d777248aee9d
2023-04-13T04:27:21Z
eng
Elsevier
International Journal of Information Management Data Insights
2667-0968
2023-04-01
Volume 3, Issue 1, Article 100151
Offensive language identification in Dravidian languages using MPNet and CNN
Bharathi Raja Chakravarthi (Corresponding author; School of Computer Science, University of Galway, Ireland)
Manoj Balaji Jagadeeshan (Birla Institute of Technology and Science Pilani, India)
Vasanth Palanikumar (Chennai Institute of Technology, Chennai, India)
Ruba Priyadharshini (The Gandhigram Rural Institute - Deemed University, India)
http://www.sciencedirect.com/science/article/pii/S2667096822000945
Offensive language identification
Dravidian languages
Code-mixing
Deep learning
MPNet
CNN
title Offensive language identification in Dravidian languages using MPNet and CNN
topic Offensive language identification
Dravidian languages
Code-mixing
Deep learning
MPNet
CNN
url http://www.sciencedirect.com/science/article/pii/S2667096822000945