Enhancing document clustering by integrating semantic background knowledge and syntactic features into the bag of words representation

The basic Bag of Words (BOW) representation generally used in text document clustering or categorization loses important syntactic and semantic information contained in the documents. This can be particularly problematic when the texts contain many stop words or are short...


Bibliographic Details
Main Authors: Rayner Alfred, Suraya Alias, Asni Tahir
Format: Research Report
Language: English
Published: Universiti Malaysia Sabah 2011
Subjects: QA Mathematics
Online Access: https://eprints.ums.edu.my/id/eprint/22890/1/Enhancing%20document%20clustering%20by%20integrating%20semantic%20background%20knowledge%20and%20syntactic%20features%20into%20the%20bag%20of%20words%20representation.pdf
_version_ 1825713672822980608
author Rayner Alfred
Suraya Alias
Asni Tahir
collection UMS
description The basic Bag of Words (BOW) representation generally used in text document clustering or categorization loses important syntactic and semantic information contained in the documents. This can be particularly problematic when the texts contain many stop words or are short. In this research, we study the contribution of incorporating syntactic features and semantic background knowledge into the representation when clustering a text corpus. We investigate the quality of the clusters produced when syntactic and semantic information is incorporated into the representation of text documents by analyzing the internal structure of the clusters using the Davies-Bouldin index (DBI). We compare the quality of the clusters produced when four different text representations are used to cluster the corpus: the standard BOW representation, the standard BOW representation integrated with syntactic features, the standard BOW representation integrated with semantic background knowledge, and the standard BOW representation integrated with both syntactic features and semantic background knowledge. This research helps clarify how the quality of document clustering can be improved by enriching the classic bag of words representation with additional background information.
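As a rough illustration of the evaluation idea in the description, the sketch below computes the Davies-Bouldin index for a plain BOW representation and for a BOW representation enriched with concept features. It is not the authors' implementation: the toy documents, the hand-assigned cluster labels, the `CONCEPT_MAP` lookup standing in for semantic background knowledge, and all function names are assumptions made for this example. No clustering algorithm is run, and the numbers say nothing about the report's actual results.

```python
import math
from collections import Counter

# Hypothetical concept lookup standing in for semantic background knowledge.
CONCEPT_MAP = {"car": "VEHICLE", "automobile": "VEHICLE",
               "banana": "FOOD", "apple": "FOOD"}

def enrich(tokens, concept_map):
    """Append one concept feature for every token found in the lookup."""
    return tokens + [concept_map[t] for t in tokens if t in concept_map]

def bow_matrix(docs):
    """Build term-count vectors over the sorted vocabulary of all docs."""
    vocabulary = sorted({t for doc in docs for t in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([float(counts[term]) for term in vocabulary])
    return rows

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def davies_bouldin_index(vectors, labels):
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid distance.

    Lower values indicate more compact, better separated clusters.
    Assumes at least two clusters with distinct centroids.
    """
    clusters = {}
    for vec, label in zip(vectors, labels):
        clusters.setdefault(label, []).append(vec)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")
    centroids = {k: [sum(col) / len(v) for col in zip(*v)]
                 for k, v in clusters.items()}
    scatter = {k: sum(euclidean(x, centroids[k]) for x in v) / len(v)
               for k, v in clusters.items()}
    total = 0.0
    for i in clusters:
        total += max((scatter[i] + scatter[j])
                     / euclidean(centroids[i], centroids[j])
                     for j in clusters if j != i)
    return total / len(clusters)

# Toy corpus with cluster labels assigned by hand for illustration.
docs = [["car", "engine"], ["automobile", "engine"],
        ["banana", "fruit"], ["apple", "fruit"]]
labels = [0, 0, 1, 1]

plain_dbi = davies_bouldin_index(bow_matrix(docs), labels)
enriched_docs = [enrich(doc, CONCEPT_MAP) for doc in docs]
enriched_dbi = davies_bouldin_index(bow_matrix(enriched_docs), labels)

print("plain BOW DBI:   ", round(plain_dbi, 4))
print("enriched BOW DBI:", round(enriched_dbi, 4))
```

On this toy input the enriched representation yields the lower index (about 0.632 against about 0.816), because the shared concept features push the two centroids apart without changing within-cluster scatter. Syntactic features could be appended in the same way as extra tokens.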
first_indexed 2024-03-06T03:00:11Z
format Research Report
id ums.eprints-22890
institution Universiti Malaysia Sabah
language English
last_indexed 2024-03-06T03:00:11Z
publishDate 2011
publisher Universiti Malaysia Sabah
record_format dspace
spelling ums.eprints-22890 2019-07-22T04:34:13Z https://eprints.ums.edu.my/id/eprint/22890/ Enhancing document clustering by integrating semantic background knowledge and syntactic features into the bag of words representation Rayner Alfred Suraya Alias Asni Tahir QA Mathematics Universiti Malaysia Sabah 2011 Research Report NonPeerReviewed text en https://eprints.ums.edu.my/id/eprint/22890/1/Enhancing%20document%20clustering%20by%20integrating%20semantic%20background%20knowledge%20and%20syntactic%20features%20into%20the%20bag%20of%20words%20representation.pdf Rayner Alfred and Suraya Alias and Asni Tahir (2011) Enhancing document clustering by integrating semantic background knowledge and syntactic features into the bag of words representation. (Unpublished)
title Enhancing document clustering by integrating semantic background knowledge and syntactic features into the bag of words representation
topic QA Mathematics