Dimension reduction of high-dimensional dataset with missing values

Bibliographic Details
Main Authors: Ran Zhang, Bin Ye, Peng Liu
Format: Article
Language: English
Published: SAGE Publishing 2019-08-01
Series: Journal of Algorithms & Computational Technology
Online Access: https://doi.org/10.1177/1748302619867440
Collection: DOAJ
Description: Datasets containing a very large number of variables or features are now routinely generated in many fields. Dimension reduction techniques are usually applied before statistical analysis of these datasets in order to avoid the effects of the curse of dimensionality. Principal component analysis is one of the most important techniques for dimension reduction and data visualization. However, missing values, which arise in datasets from almost every field, produce biased estimates and are difficult to handle, especially in high-dimension, low-sample-size settings. By exploiting a Lasso estimator of the population covariance matrix, we propose a regularized principal component analysis to reduce the dimensionality of datasets with missing data. The Lasso estimator of the covariance matrix is computationally tractable because it is obtained by solving a convex optimization problem. To illustrate the effectiveness of our method for dimension reduction, the principal component directions are evaluated using the Frobenius norm and cosine distance. Its performance is compared with that of other incomplete-data handling methods such as mean substitution and multiple imputation. Simulation results also show that our method is superior to other incomplete-data handling methods in the context of discriminant analysis of real-world high-dimensional datasets.
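
The pipeline summarized in the description can be illustrated with a minimal Python sketch. It is not the authors' estimator: the record does not state the optimization problem, so the sketch assumes an elementwise L1-penalized least-squares fit to a pairwise-complete sample covariance, whose solution is soft-thresholding of the off-diagonal entries. All function names, the penalty `lam`, and the simulated data are hypothetical choices made for illustration only.

```python
import numpy as np

def pairwise_covariance(X):
    """Covariance from pairwise-available entries; NaN marks a missing value."""
    mask = ~np.isnan(X)
    means = np.nanmean(X, axis=0)
    Xc = np.where(mask, X - means, 0.0)           # centered, zeros where missing
    counts = mask.T.astype(float) @ mask.astype(float)  # observed pairs per (i, j)
    return (Xc.T @ Xc) / np.maximum(counts, 1.0)

def soft_threshold_offdiag(S, lam):
    """Assumed Lasso-type step: shrink off-diagonal entries toward zero by lam.
    The result is sparse but is not guaranteed to be positive semidefinite."""
    T = np.sign(S) * np.maximum(np.abs(S) - lam, 0.0)
    T[np.diag_indices_from(T)] = np.diag(S)       # leave variances untouched
    return T

def leading_direction(S):
    """Eigenvector of the largest eigenvalue (first principal direction)."""
    _, vecs = np.linalg.eigh(S)                   # eigenvalues in ascending order
    return vecs[:, -1]

def cosine_distance(u, v):
    """Sign-invariant cosine distance between two directions."""
    return 1.0 - abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical simulation: n < p, one sparse leading direction, 20% missing.
rng = np.random.default_rng(0)
n, p = 40, 100
v_true = np.zeros(p)
v_true[:5] = 1.0 / np.sqrt(5.0)
X = rng.standard_normal((n, p)) + 3.0 * rng.standard_normal((n, 1)) * v_true
sigma_true = np.eye(p) + 9.0 * np.outer(v_true, v_true)

X_miss = X.copy()
X_miss[rng.random((n, p)) < 0.2] = np.nan

# Regularized estimate versus the mean-substitution baseline.
S_reg = soft_threshold_offdiag(pairwise_covariance(X_miss), lam=0.5)
X_mean = np.where(np.isnan(X_miss), np.nanmean(X_miss, axis=0), X_miss)
S_mean = np.cov(X_mean, rowvar=False, bias=True)

for name, S in [("thresholded", S_reg), ("mean substitution", S_mean)]:
    frob = np.linalg.norm(S - sigma_true, "fro")
    cos = cosine_distance(leading_direction(S), v_true)
    print(f"{name}: Frobenius error {frob:.3f}, cosine distance {cos:.3f}")
```

The two printed metrics correspond to the Frobenius norm and cosine distance named in the description. The numbers depend on the simulated data and on `lam`, so they should not be read as a reproduction of the article's results.
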
ID: doaj.art-a47a61c0686343e095a0f076621fb5c0
Institution: Directory of Open Access Journals
ISSN: 1748-3026