Toolkit development for high-dimensional data pre-processing, clustering and analysis

In this report, the author documents a software project that designs and implements a high-dimensional data processing toolkit. The toolkit, called WordTagger, automatically labels a vocabulary of computer science words to provide the categorical information of the word-space by u...

Bibliographic Details
Main author:	Hu, Yao.
Other authors:	Chen Lihui
Format:	Thesis
Language:	English
Published:	2014
Subjects:	DRNTU::Engineering::Electrical and electronic engineering
Online access:	http://hdl.handle.net/10356/55243

_version_	1826115888206577664
author	Hu, Yao.
author2	Chen Lihui
author_facet	Chen Lihui Hu, Yao.
author_sort	Hu, Yao.
collection	NTU
description	In this report, the author documents a software project that designs and implements a high-dimensional data processing toolkit. The toolkit, called WordTagger, automatically labels a vocabulary of computer science words to provide categorical information about the word-space, using the ACM taxonomy as reference [1]. This word categorical information can serve as a further source of prior knowledge, to be incorporated together with that from the document-space into existing semi-supervised co-clustering algorithms. The author has successfully implemented WordTagger and conducted tests to evaluate its effectiveness and efficiency. Some preliminary experiments have also been conducted to show that the WordTagger-labeled words could be used as an additional word-space prior knowledge source. This is done by modifying an existing semi-supervised approach, SS-HFCR, to accept prior knowledge from both the document-space and the word-space; the modified approach is referred to as dual SS-HFCR. However, the report shows that dual SS-HFCR does not perform as well as expected with the categorical information from the word-space provided by WordTagger. The limitations of the current integration of WordTagger and dual SS-HFCR are identified and discussed. Future work is suggested and summarized at the end of the report.
first_indexed	2024-10-01T04:02:28Z
format	Thesis
id	ntu-10356/55243
institution	Nanyang Technological University
language	English
last_indexed	2024-10-01T04:02:28Z
publishDate	2014
record_format	dspace
spelling	ntu-10356/552432023-07-04T15:35:17Z Toolkit development for high-dimensional data pre-processing, clustering and analysis Hu, Yao. Chen Lihui School of Electrical and Electronic Engineering DRNTU::Engineering::Electrical and electronic engineering In this report, the author documents a software project that designs and implements a high-dimensional data processing toolkit. The toolkit, called WordTagger, automatically labels a vocabulary of computer science words to provide categorical information about the word-space, using the ACM taxonomy as reference [1]. This word categorical information can serve as a further source of prior knowledge, to be incorporated together with that from the document-space into existing semi-supervised co-clustering algorithms. The author has successfully implemented WordTagger and conducted tests to evaluate its effectiveness and efficiency. Some preliminary experiments have also been conducted to show that the WordTagger-labeled words could be used as an additional word-space prior knowledge source. This is done by modifying an existing semi-supervised approach, SS-HFCR, to accept prior knowledge from both the document-space and the word-space; the modified approach is referred to as dual SS-HFCR. However, the report shows that dual SS-HFCR does not perform as well as expected with the categorical information from the word-space provided by WordTagger. The limitations of the current integration of WordTagger and dual SS-HFCR are identified and discussed. Future work is suggested and summarized at the end of the report. Master of Science (Communication Software and Networks) 2014-01-06T08:36:00Z 2014-01-06T08:36:00Z 2012 2012 Thesis http://hdl.handle.net/10356/55243 en 70 p. application/pdf
spellingShingle	DRNTU::Engineering::Electrical and electronic engineering Hu, Yao. Toolkit development for high-dimensional data pre-processing, clustering and analysis
title	Toolkit development for high-dimensional data pre-processing, clustering and analysis
title_full	Toolkit development for high-dimensional data pre-processing, clustering and analysis
title_fullStr	Toolkit development for high-dimensional data pre-processing, clustering and analysis
title_full_unstemmed	Toolkit development for high-dimensional data pre-processing, clustering and analysis
title_short	Toolkit development for high-dimensional data pre-processing, clustering and analysis
title_sort	toolkit development for high dimensional data pre processing clustering and analysis
topic	DRNTU::Engineering::Electrical and electronic engineering
url	http://hdl.handle.net/10356/55243
work_keys_str_mv	AT huyao toolkitdevelopmentforhighdimensionaldatapreprocessingclusteringandanalysis
