Data-Augmentation for Bangla-English Code-Mixed Sentiment Analysis: Enhancing Cross Linguistic Contextual Understanding

In today’s digital world, automated sentiment analysis from online reviews can contribute to a wide variety of decision-making processes. One example is examining typical perceptions of a product based on customer feedback to better understand consumer expectations, which ca...


Bibliographic Details
Main Authors:	Mohammad Tareq, Md. Fokhrul Islam, Swakshar Deb, Sejuti Rahman, Abdullah Al Mahmud
Format:	Article
Language:	English
Published:	IEEE 2023-01-01
Series:	IEEE Access
Subjects:	Code mixed; sentiment analysis; Bangla-English corpus; bi-lingual; zero-shot learning
Online Access:	https://ieeexplore.ieee.org/document/10129187/

author	Mohammad Tareq; Md. Fokhrul Islam; Swakshar Deb; Sejuti Rahman; Abdullah Al Mahmud
author_sort	Mohammad Tareq
collection	DOAJ
description	In today’s digital world, automated sentiment analysis from online reviews can contribute to a wide variety of decision-making processes. One example is examining typical perceptions of a product based on customer feedback to better understand consumer expectations, which can help enhance everything from customer service to product offerings. Online review comments, however, frequently mix different languages, use non-native scripts, and do not adhere to strict grammar norms. For a low-resource language like Bangla, the lack of annotated code-mixed data makes automated sentiment analysis more challenging. To address this, we collect online reviews of different products and construct an annotated Bangla-English code-mixed (BE-CM) dataset (the dataset and other resources are available at https://github.com/fokhruli/CM-seti-anlysis). We also compare several alternative models from the existing literature on our sentiment corpus. We present a simple but effective data augmentation method that can be used with existing word embedding algorithms, without the need for a parallel corpus, to improve cross-lingual contextual understanding. Our experimental results suggest that training word embedding models (e.g., Word2vec, FastText) with our data augmentation strategy can help the model capture the cross-lingual relationship in code-mixed sentences, thereby improving the overall performance of existing classifiers in both supervised learning and zero-shot cross-lingual adaptability. Through extensive experimentation, we found that XGBoost with FastText embeddings trained using our proposed data augmentation method outperforms the alternative models in automated sentiment analysis on the code-mixed Bangla-English dataset, with a weighted F1 score of 87%.
first_indexed	2024-03-13T07:44:31Z
format	Article
id	doaj.art-193822f7daa74fb7b09d51aeabf0d872
institution	Directory Open Access Journal
issn	2169-3536
language	English
last_indexed	2024-03-13T07:44:31Z
publishDate	2023-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-193822f7daa74fb7b09d51aeabf0d872 | 2023-06-02T23:00:32Z | eng | IEEE | IEEE Access | ISSN 2169-3536 | 2023-01-01 | vol. 11, pp. 51657-51671 | DOI 10.1109/ACCESS.2023.3277787 | document 10129187 | Data-Augmentation for Bangla-English Code-Mixed Sentiment Analysis: Enhancing Cross Linguistic Contextual Understanding | Mohammad Tareq (Department of Accounting and Information Systems, University of Dhaka, Dhaka, Bangladesh); Md. Fokhrul Islam (https://orcid.org/0000-0002-0031-4937; Department of Robotics and Mechatronics Engineering, University of Dhaka, Dhaka, Bangladesh); Swakshar Deb (Department of Robotics and Mechatronics Engineering, University of Dhaka, Dhaka, Bangladesh); Sejuti Rahman (https://orcid.org/0000-0001-6226-2434; Department of Robotics and Mechatronics Engineering, University of Dhaka, Dhaka, Bangladesh); Abdullah Al Mahmud (https://orcid.org/0000-0003-1140-4505; Department of Banking and Insurance, University of Dhaka, Dhaka, Bangladesh) | https://ieeexplore.ieee.org/document/10129187/ | Code mixed; sentiment analysis; Bangla-English corpus; bi-lingual; zero-shot learning
title	Data-Augmentation for Bangla-English Code-Mixed Sentiment Analysis: Enhancing Cross Linguistic Contextual Understanding
topic	Code mixed; sentiment analysis; Bangla-English corpus; bi-lingual; zero-shot learning
url	https://ieeexplore.ieee.org/document/10129187/


Similar Items