Twitter popularity prediction based on text mining

Bibliographic Details
Main Author: Weng, Quanchi
Other Authors: Mao Kezhi
Format: Thesis
Language: English
Published: 2016
Subjects:
Online Access: http://hdl.handle.net/10356/68958
Description
Summary: Twitter, one of the most popular social media platforms on the internet, generates a large amount of text data every day. Because of the sheer volume of data on Twitter, it can be difficult or even impossible for users to find useful and meaningful information. Automatic detection of potentially popular tweets can therefore help surface and recommend important tweets to users in a timely manner. This thesis attempts to automatically predict the popularity of a tweet from its content, that is, its text. Here, the popularity of a tweet is quantified by its number of favorites and retweets. Our system consists of two parts. The first part is text representation learning. A good representation of text data is vital for robust, strong performance. We investigated two methods for learning text representations: one based on the traditional bag-of-words (BoW) model, the other on a recently popular technique, word embeddings. The second part concerns classification algorithms. Three classical classifiers are compared: SVM, Naive Bayes, and logistic regression. Extensive experiments are conducted over 3000 tweets from the official account of the Cable News Network (CNN). The task is defined as a classification problem in which tweets with high numbers of favorites and retweets are labeled as popular and tweets with low numbers of favorites and retweets are labeled as unpopular. The experiments show that it is possible to predict the popularity of tweets from their content. In particular, the combination of BoW and SVM achieves the best performance.
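
The pipeline described in the summary (label tweets by engagement, represent the text as a bag of words, train a classifier) can be illustrated with a minimal sketch. This is not the thesis's implementation: the tokenizer, the threshold rule that combines favorites and retweets, the class and function names, and the toy tweets are all assumptions made for illustration. The sketch uses multinomial Naive Bayes, one of the three classifiers the summary compares, because it fits in a few lines of standard-library Python; the summary reports that SVM, not Naive Bayes, performed best with BoW.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple, assumed tokenizer)."""
    return text.lower().split()


def label_by_popularity(tweets, threshold):
    """Label a tweet popular (1) if favorites + retweets >= threshold, else unpopular (0).

    The threshold rule is an assumption; the summary does not state how the
    two classes were separated.
    """
    return [1 if t["favorites"] + t["retweets"] >= threshold else 0 for t in tweets]


class BagOfWordsNaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        # Bag-of-words: per-class counts of each token, ignoring word order.
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_class, best_score = None, -math.inf
        for c in self.classes:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[c] / total_docs)
            total_words = sum(self.word_counts[c].values())
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # skip words never seen during training
                score += math.log(
                    (self.word_counts[c][word] + 1) / (total_words + vocab_size)
                )
            if score > best_score:
                best_class, best_score = c, score
        return best_class


# Toy tweets invented for illustration; they are not from the thesis dataset.
tweets = [
    {"text": "breaking news major earthquake hits city", "favorites": 900, "retweets": 1200},
    {"text": "breaking election results announced tonight", "favorites": 700, "retweets": 800},
    {"text": "weather update light rain expected", "favorites": 20, "retweets": 5},
    {"text": "traffic update minor delays downtown", "favorites": 10, "retweets": 3},
]
labels = label_by_popularity(tweets, threshold=500)
model = BagOfWordsNaiveBayes().fit([t["text"] for t in tweets], labels)
prediction = model.predict("breaking news tonight")
```

With real data, the same structure would apply to the roughly 3000 CNN tweets mentioned in the summary, with a held-out test split to measure accuracy; a library classifier such as a linear SVM could replace the hand-written Naive Bayes.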