Topic Extraction: BERTopic’s Insight into the 117th Congress’s Twitterverse

Bibliographic Details
Main Authors: Margarida Mendonça, Álvaro Figueira
Format: Article
Language: English
Published: MDPI AG 2024-02-01
Series: Informatics
Subjects: Topic Mining; BERTopic; 117th Congress; Twitter; short-text data
Online Access: https://www.mdpi.com/2227-9709/11/1/8
author Margarida Mendonça
Álvaro Figueira
collection DOAJ
description As social media (SM) becomes increasingly prevalent, its impact on society is expected to grow accordingly. While SM has brought positive transformations, it has also amplified pre-existing issues such as misinformation, echo chambers, manipulation, and propaganda. A thorough comprehension of this impact, aided by state-of-the-art analytical tools and by an awareness of societal biases and complexities, enables us to anticipate and mitigate the potential negative effects. One such tool is BERTopic, a novel deep-learning algorithm developed for Topic Mining, which has been shown to offer significant advantages over traditional methods like Latent Dirichlet Allocation (LDA), particularly in terms of its high modularity, which allows for extensive personalization at each stage of the topic modeling process. In this study, we hypothesize that BERTopic, when optimized for Twitter data, can provide a more coherent and stable topic modeling. We began by conducting a review of the literature on topic-mining approaches for short-text data. Using this knowledge, we explored the potential for optimizing BERTopic and analyzed its effectiveness. Our focus was on Twitter data spanning the two years of the 117th US Congress. We evaluated BERTopic’s performance using coherence, perplexity, diversity, and stability scores, finding significant improvements over traditional methods and the default parameters for this tool. We discovered that improvements are possible in BERTopic’s coherence and stability. We also identified the major topics of this Congress, which include abortion, student debt, and Judge Ketanji Brown Jackson. Additionally, we describe a simple application we developed for a better visualization of Congress topics.
first_indexed 2024-04-24T18:10:32Z
format Article
id doaj.art-ac1bda2881174f96bd7bc58446941e26
institution Directory Open Access Journal
issn 2227-9709
language English
last_indexed 2024-04-24T18:10:32Z
publishDate 2024-02-01
publisher MDPI AG
record_format Article
series Informatics
spelling doaj.art-ac1bda2881174f96bd7bc58446941e26
2024-03-27T13:46:49Z
eng
MDPI AG
Informatics
2227-9709
2024-02-01
11
1
8
10.3390/informatics11010008
Topic Extraction: BERTopic’s Insight into the 117th Congress’s Twitterverse
Margarida Mendonça (0)
Álvaro Figueira (1)
(0) Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal
(1) CRACS-INESCTEC and Faculty of Sciences, University of Porto, 4169-007 Porto, Portugal
https://www.mdpi.com/2227-9709/11/1/8
Topic Mining
BERTopic
117th Congress
Twitter
short-text data
title Topic Extraction: BERTopic’s Insight into the 117th Congress’s Twitterverse
topic Topic Mining
BERTopic
117th Congress
Twitter
short-text data
url https://www.mdpi.com/2227-9709/11/1/8