Analysis of Deep Learning Model Combinations and Tokenization Approaches in Sentiment Classification

Bibliographic Details
Main Authors: Ali Erkan, Tunga Gungor
Format: Article
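Language: English
Published: IEEE 2023-01-01
Series: IEEE Access
Subjects: Machine learning; deep neural networks; natural language processing; sentiment classification; word embedding; tokenization
Online Access: https://ieeexplore.ieee.org/document/10332170/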
collection DOAJ
description Sentiment classification is a natural language processing task to identify opinions expressed in texts such as product or service reviews. In this work, we analyze the effects of different deep-learning model combinations, embedding methods, and tokenization approaches in sentiment classification. We feed non-contextualized (Word2Vec and GloVe) and contextualized (BERT and RoBERTa/XLM-RoBERTa) embeddings and also the output of the pretrained BERT and RoBERTa/XLM-RoBERTa models as input to neural models. We make a comprehensive analysis of eleven different tokenization approaches, including the commonly used subword methods and morphologically motivated segmentations. The experiments are conducted on three English and two Turkish datasets from different domains. The results show that BERT- and RoBERTa-/XLM-RoBERTa-based and contextualized embeddings outperform other neural models. We also observe that using words in raw or preprocessed form, stemming the words, and applying WordPiece tokenizations give the most promising results in the sentiment analysis task. We ensemble the models to find out which tokenization approaches produce better results together.
first_indexed 2024-03-08T19:37:07Z
id doaj.art-3319d906644d4c0b90dd52623fca0003
institution Directory Open Access Journal
issn 2169-3536
last_indexed 2024-03-08T19:37:07Z
Record ID: doaj.art-3319d906644d4c0b90dd52623fca0003
Record timestamp: 2023-12-26T00:06:30Z
Language code: eng
Publisher: IEEE
Journal: IEEE Access
ISSN: 2169-3536
Publication date: 2023-01-01
Volume: 11
Pages: 134951-134968
DOI: 10.1109/ACCESS.2023.3337354
IEEE document number: 10332170
Author: Ali Erkan (https://orcid.org/0000-0003-0125-8110), Department of Computer Engineering, Boğaziçi University, Istanbul, Turkey
Author: Tunga Gungor (https://orcid.org/0000-0001-9448-9422), Department of Computer Engineering, Boğaziçi University, Istanbul, Turkey
topic Machine learning
deep neural networks
natural language processing
sentiment classification
word embedding
tokenization