Bag-of-Concepts representation for document classification based on automatic knowledge acquisition from probabilistic knowledge base

Text representation, a crucial step for text mining and natural language processing, concerns transforming unstructured textual data into structured numerical vectors to support various machine learning and data mining algorithms. For document classification, one classical and commonly adopted...

Bibliographic Details
Main Authors: Li, Pengfei; Mao, Kezhi; Xu, Yuecong; Li, Qi; Zhang, Jiaheng
Other Authors: School of Electrical and Electronic Engineering
Format: Journal Article
Language: English
Published: 2020
Subjects: Engineering::Computer science and engineering; Natural Language Processing; Text Representation
Online Access: https://hdl.handle.net/10356/137227
author Li, Pengfei
Mao, Kezhi
Xu, Yuecong
Li, Qi
Zhang, Jiaheng
author2 School of Electrical and Electronic Engineering
collection NTU
description Text representation, a crucial step for text mining and natural language processing, concerns transforming unstructured textual data into structured numerical vectors to support various machine learning and data mining algorithms. For document classification, one classical and commonly adopted text representation method is the Bag-of-Words (BoW) model. BoW represents a document as a fixed-length vector of terms, where each term dimension holds a numerical value such as a term frequency or tf-idf weight. However, BoW looks only at the surface form of words. It ignores the semantic, conceptual and contextual information of texts, and also suffers from high dimensionality and sparsity. To address these issues, we propose a novel document representation scheme called Bag-of-Concepts (BoC), which automatically acquires useful conceptual knowledge from an external knowledge base, then conceptualizes words and phrases in the document into higher-level semantics (i.e. concepts) in a probabilistic manner, and finally represents a document as a distributed vector in the learned concept space. By utilizing background knowledge from the knowledge base, the BoC representation provides more semantic and conceptual information about texts, as well as better interpretability for human understanding. We also propose the Bag-of-Concept-Clusters (BoCCl) model, which clusters semantically similar concepts together and performs entity sense disambiguation to further improve the BoC representation. In addition, we combine the BoCCl and BoW representations using an attention mechanism to effectively utilize both concept-level and word-level information and achieve optimal performance for document classification.
first_indexed 2024-10-01T03:45:45Z
format Journal Article
id ntu-10356/137227
institution Nanyang Technological University
language English
last_indexed 2024-10-01T03:45:45Z
publishDate 2020
record_format dspace
spelling ntu-10356/137227 2021-01-29T02:36:45Z
Accepted version
2020-03-09T08:42:21Z
Journal Article
Li, P., Mao, K., Xu, Y., Li, Q., & Zhang, J. (2020). Bag-of-Concepts representation for document classification based on automatic knowledge acquisition from probabilistic knowledge base. Knowledge-Based Systems, 193, 105436. doi:10.1016/j.knosys.2019.105436
0950-7051
https://hdl.handle.net/10356/137227
2-s2.0-85077743747
en
© 2020 Elsevier. All rights reserved. This paper was published in Knowledge-Based Systems and is made available with permission of Elsevier.
application/pdf
title Bag-of-Concepts representation for document classification based on automatic knowledge acquisition from probabilistic knowledge base
topic Engineering::Computer science and engineering
Natural Language Processing
Text Representation
url https://hdl.handle.net/10356/137227
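The description field of this record contrasts a term-count BoW vector with a concept-level BoC vector. The sketch below illustrates that contrast on a toy scale and is not the authors' implementation: the `CONCEPT_PROBS` table, its probability values, and the helper names `bow_vector` and `boc_vector` are all hypothetical, and the simple normalized sum stands in for the learned concept space described in the abstract. Concept clustering, entity sense disambiguation, and the attention-based combination are left out.

```python
from collections import Counter

# Hypothetical toy knowledge base: P(concept | term).
# All entries and probabilities are invented for illustration only.
CONCEPT_PROBS = {
    "apple": {"fruit": 0.6, "company": 0.4},
    "banana": {"fruit": 1.0},
    "microsoft": {"company": 1.0},
}


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def bow_vector(tokens, vocabulary):
    """Bag-of-Words: one term-frequency entry per vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def boc_vector(tokens, concepts, concept_probs):
    """Toy Bag-of-Concepts: accumulate P(concept | term) over all tokens,
    then normalize so the concept weights sum to 1."""
    weights = dict.fromkeys(concepts, 0.0)
    for token in tokens:
        for concept, prob in concept_probs.get(token, {}).items():
            if concept in weights:
                weights[concept] += prob
    total = sum(weights.values())
    if total == 0:
        # No token could be mapped to any known concept.
        return [0.0 for _ in concepts]
    return [weights[concept] / total for concept in concepts]


doc = "Apple and Microsoft sell software while apple pie needs no banana"
tokens = tokenize(doc)
vocabulary = sorted(set(tokens))
concepts = ["company", "fruit"]

bow = bow_vector(tokens, vocabulary)
boc = boc_vector(tokens, concepts, CONCEPT_PROBS)

print(dict(zip(vocabulary, bow)))
print(dict(zip(concepts, boc)))
```

In this toy example the BoW vector has one dimension per distinct token, while the BoC vector has one dimension per concept, which hints at the dimensionality and sparsity difference the abstract mentions. Ambiguous terms such as "apple" spread their weight across several concepts because no sense disambiguation is attempted here.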