Ranking user generated content using topic models

Bibliographic Details
Main Author: Ma, Zongyang
Other Authors: Sun, Aixin
Format: Thesis
Language: English
Published: 2015
Online Access: https://hdl.handle.net/10356/65539
Description
Summary: With the popularity of Web 2.0, more and more users express and share opinions through online platforms. Examples include news websites that support user comments (e.g., Yahoo! News), social network sites that let users post messages (e.g., Facebook and Twitter), and community-based question answering sites where users ask and answer questions. As a result, a huge amount of User Generated Content (UGC) has accumulated online in the form of comments, tweets, question and answer posts, and more. Depending on the platform on which it is created, UGC may be associated with different attributes such as its creator, time, location, text, and the creator's social connections. At the same time, UGC from different platforms shares common characteristics: large volume, free writing style, and heterogeneous nature. More importantly, UGC data often exhibits a master-slave relationship: a comment is associated with a news article, a hashtag annotates the tweet in which it is embedded, and an answer does not exist without a question. Here, news articles, tweets, and questions are master documents, while comments, hashtags, and answers are slave documents. Although topic modeling (e.g., LDA and PLSA) has been widely used to model text collections, discovering fine-grained topics from UGC while taking the master-slave relationship into account remains an open and challenging problem. In this research, the generative process of UGC data is simulated using topic models in order to rank the slave documents of a given master document, with the aim of reducing information overload. Depending on the platform on which the UGC is created, three sub-problems are defined and addressed: (i) comment ranking for news articles, (ii) hashtag ranking for tweets, and (iii) answer ranking for questions.
Comment ranking is essential for identifying the important comments that summarize the user discussion of a news article. In this task, we assume that the topics of slave documents cover the topics of their corresponding master document as well as topics discussed solely in comments. For this problem, we propose two LDA-style topic models: the Master-Slave Topic Model (MSTM) and the Extended Master-Slave Topic Model (EXTM). MSTM constrains the topics discussed in comments to be derived from the news article being commented on. EXTM allows the words of comments to be generated from both the topics derived from the news article and the topics derived from all the comments themselves. Evaluated on Yahoo! News, the proposed models outperform baseline methods.
Hashtag ranking is important for tweet annotation and retrieval. Here, we assume that the topics of slave documents are a topical summary of their corresponding master documents. For this problem, we propose two PLSA-style topic models of hashtag annotation behavior. The Content-Pivoted Model (CPM) assumes that tweet content guides the generation of hashtags, while the Hashtag-Pivoted Model (HPM) assumes that hashtags guide the generation of tweet content. The experimental results demonstrate that CPM is the most effective at ranking the most relevant hashtags for tweets.
Answer ranking enables users to easily pick out the best answers to questions. In this task, we assume that the topics of slave documents and of their corresponding master documents are similar, but that the words of slave topics and master topics are drawn from different vocabularies. For this problem, we propose a PLSA-style topic model, the Tri-Role Topic Model (TRTM), to model the three roles of users (askers, answerers, and voters) and the activities of each role, including composing questions, selecting questions to answer, contributing answers, and voting on answers. Evaluated on Stack Overflow data, TRTM outperforms state-of-the-art methods for ranking high-quality answers to given questions.
These three problems all concern ranking UGC data from different platforms using topic models, and the proposed topic models are extended according to the master-slave structure of the UGC data. For comment ranking, the slave documents (comments) are much shorter than their corresponding master document (news article). Our main concern is discovering topics from comments that reflect the topics of their news article while retaining topics discussed only among comments. For hashtag ranking, the slave documents (hashtags) are extremely short, and sometimes a hashtag is just an abbreviation of one or a few words. Hashtag ranking is therefore more difficult than comment ranking, and we introduce more factors (e.g., user and time) to enrich the hashtag representation. Lastly, for answer ranking, answers carry an important feature: votes. Modeling the voting behavior of users in a generative model is challenging. To address this task, we focus on modeling the relationships between questions, answers, askers, and answerers using the exponential KL-divergence function.
In this research, we define three ranking problems for User Generated Content. To address them, we propose several extended topic models that fit the characteristics and structure of UGC data from different platforms. From Yahoo! News to Twitter and then to Stack Overflow, the features of the data used in our research become increasingly complex, and the designed topic models include more features and relationships to simulate the generation process of UGC data more accurately. Experimental results show that our methods outperform baseline methods on all three problems.
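The summary refers to ranking slave documents against a master document and to an exponential KL-divergence function, without giving details. As an illustration only, the following minimal sketch ranks candidate answers for one question by exp(-KL) between topic distributions. It is not the thesis's TRTM model: the scoring form exp(-KL(master || slave)), the function names `kl_divergence` and `rank_slaves`, and the toy topic distributions are all assumptions made for this example, and the topic distributions are taken as already inferred by some topic model.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same topics.

    eps is a small smoothing constant (an assumption of this sketch)
    that avoids division by zero when q has zero-probability topics.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same number of topics")
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def rank_slaves(master_topics, slave_topics):
    """Rank slave documents by exp(-KL(master || slave)), highest first.

    The score lies in (0, 1] and equals 1 when the two topic
    distributions are identical.
    """
    scored = [
        (slave_id, math.exp(-kl_divergence(master_topics, dist)))
        for slave_id, dist in slave_topics.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical topic distributions over three topics.
question = [0.7, 0.2, 0.1]
answers = {
    "a1": [0.1, 0.1, 0.8],  # mostly about a different topic
    "a2": [0.6, 0.3, 0.1],  # close to the question
    "a3": [0.7, 0.2, 0.1],  # identical topic mix
}

ranking = rank_slaves(question, answers)
for answer_id, score in ranking:
    print(f"{answer_id}: {score:.4f}")
```

Exponentiating the negative divergence turns a distance (lower is better) into a bounded similarity (higher is better), which is convenient for ranking. A full model along the lines described in the summary would also need to account for user roles and votes, which this sketch leaves out.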