Analysis of Deep Learning Model Combinations and Tokenization Approaches in Sentiment Classification
Sentiment classification is a natural language processing task to identify opinions expressed in texts such as product or service reviews. In this work, we analyze the effects of different deep-learning model combinations, embedding methods, and tokenization approaches in sentiment classification. W...
Main Authors: | Ali Erkan, Tunga Gungor |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2023-01-01 |
Series: | IEEE Access |
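Subjects: | Machine learning; deep neural networks; natural language processing; sentiment classification; word embedding; tokenization |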
Online Access: | https://ieeexplore.ieee.org/document/10332170/ |
_version_ | 1797376339841384448 |
---|---|
author | Ali Erkan; Tunga Gungor |
author_facet | Ali Erkan; Tunga Gungor |
author_sort | Ali Erkan |
collection | DOAJ |
description | Sentiment classification is a natural language processing task to identify opinions expressed in texts such as product or service reviews. In this work, we analyze the effects of different deep-learning model combinations, embedding methods, and tokenization approaches in sentiment classification. We feed non-contextualized (Word2Vec and GloVe) and contextualized (BERT and RoBERTa/XLM-RoBERTa) embeddings and also the output of the pretrained BERT and RoBERTa/XLM-RoBERTa models as input to neural models. We make a comprehensive analysis of eleven different tokenization approaches, including the commonly used subword methods and morphologically motivated segmentations. The experiments are conducted on three English and two Turkish datasets from different domains. The results show that BERT- and RoBERTa-/XLM-RoBERTa-based and contextualized embeddings outperform other neural models. We also observe that using words in raw or preprocessed form, stemming the words, and applying WordPiece tokenizations give the most promising results in the sentiment analysis task. We ensemble the models to find out which tokenization approaches produce better results together. |
first_indexed | 2024-03-08T19:37:07Z |
format | Article |
id | doaj.art-3319d906644d4c0b90dd52623fca0003 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-08T19:37:07Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-3319d906644d4c0b90dd52623fca0003; 2023-12-26T00:06:30Z; eng; IEEE; IEEE Access; 2169-3536; 2023-01-01; vol. 11, pp. 134951-134968; 10.1109/ACCESS.2023.3337354; 10332170; Analysis of Deep Learning Model Combinations and Tokenization Approaches in Sentiment Classification; Ali Erkan (0), https://orcid.org/0000-0003-0125-8110; Tunga Gungor (1), https://orcid.org/0000-0001-9448-9422; (0) Department of Computer Engineering, Boğaziçi University, Istanbul, Turkey; (1) Department of Computer Engineering, Boğaziçi University, Istanbul, Turkey; https://ieeexplore.ieee.org/document/10332170/; Machine learning; deep neural networks; natural language processing; sentiment classification; word embedding; tokenization |
spellingShingle | Ali Erkan; Tunga Gungor; Analysis of Deep Learning Model Combinations and Tokenization Approaches in Sentiment Classification; IEEE Access; Machine learning; deep neural networks; natural language processing; sentiment classification; word embedding; tokenization |
title | Analysis of Deep Learning Model Combinations and Tokenization Approaches in Sentiment Classification |
title_full | Analysis of Deep Learning Model Combinations and Tokenization Approaches in Sentiment Classification |
title_fullStr | Analysis of Deep Learning Model Combinations and Tokenization Approaches in Sentiment Classification |
title_full_unstemmed | Analysis of Deep Learning Model Combinations and Tokenization Approaches in Sentiment Classification |
title_short | Analysis of Deep Learning Model Combinations and Tokenization Approaches in Sentiment Classification |
title_sort | analysis of deep learning model combinations and tokenization approaches in sentiment classification |
topic | Machine learning; deep neural networks; natural language processing; sentiment classification; word embedding; tokenization |
url | https://ieeexplore.ieee.org/document/10332170/ |
work_keys_str_mv | AT alierkan analysisofdeeplearningmodelcombinationsandtokenizationapproachesinsentimentclassification AT tungagungor analysisofdeeplearningmodelcombinationsandtokenizationapproachesinsentimentclassification |