Dimension reduction of high-dimensional dataset with missing values

Nowadays, datasets containing a very large number of variables or features are routinely generated in many fields. Dimension reduction techniques are usually performed prior to statistically analyzing these datasets in order to avoid the effects of the curse of dimensionality. Principal component an...

Bibliographic Details
Main Authors:	Ran Zhang, Bin Ye, Peng Liu
Format:	Article
Language:	English
Published:	SAGE Publishing 2019-08-01
Series:	Journal of Algorithms & Computational Technology
Online Access:	https://doi.org/10.1177/1748302619867440

author	Ran Zhang Bin Ye Peng Liu
collection	DOAJ
description	Nowadays, datasets containing a very large number of variables or features are routinely generated in many fields. Dimension reduction techniques are usually applied before these datasets are analyzed statistically, in order to avoid the effects of the curse of dimensionality. Principal component analysis is one of the most important techniques for dimension reduction and data visualization. However, datasets with missing values, which arise in almost every field, produce biased estimates and are difficult to handle, especially in high-dimension, low-sample-size settings. By exploiting a Lasso estimator of the population covariance matrix, we propose to regularize principal component analysis to reduce the dimensionality of datasets with missing data. The Lasso estimator of the covariance matrix can be computed tractably by solving a convex optimization problem. To illustrate the effectiveness of our method for dimension reduction, the principal component directions are evaluated using the Frobenius norm and cosine distance. Its performance is compared with that of other incomplete-data handling methods such as mean substitution and multiple imputation. Simulation results also show that our method is superior to other incomplete-data handling methods in the context of discriminant analysis of real-world high-dimensional datasets.
first_indexed	2024-12-20T17:20:52Z
format	Article
id	doaj.art-a47a61c0686343e095a0f076621fb5c0
institution	Directory of Open Access Journals
issn	1748-3026
language	English
last_indexed	2024-12-20T17:20:52Z
publishDate	2019-08-01
publisher	SAGE Publishing
record_format	Article
series	Journal of Algorithms & Computational Technology
title	Dimension reduction of high-dimensional dataset with missing values
url	https://doi.org/10.1177/1748302619867440

