Variable Selection for Sparse Data with Applications to Vaginal Microbiome and Gene Expression Data
Sparse data with a high proportion of zeros arise in various disciplines. Modeling sparse high-dimensional data is a challenging and growing research area. In this paper, we provide statistical methods and tools for analyzing sparse data in a fairly general and complex context. We utilize two real scie...
Main Authors: | Niloufar Dousti Mousavi, Jie Yang, Hani Aldirawi |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG 2023-02-01 |
Series: | Genes |
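Subjects: | zero-inflated model, hurdle model, longitudinal data, model selection, vaginal microbiome, gene expression |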
Online Access: | https://www.mdpi.com/2073-4425/14/2/403 |
_version_ | 1797620802437251072 |
---|---|
author | Niloufar Dousti Mousavi; Jie Yang; Hani Aldirawi |
author_sort | Niloufar Dousti Mousavi |
collection | DOAJ |
description | Sparse data with a high proportion of zeros arise in various disciplines. Modeling sparse high-dimensional data is a challenging and growing research area. In this paper, we provide statistical methods and tools for analyzing sparse data in a fairly general and complex context. We use two real scientific applications as illustrations: a longitudinal vaginal microbiome dataset and a high-dimensional gene expression dataset. We recommend zero-inflated model selection and significance tests to identify the time intervals in which the pregnant and non-pregnant groups of women differ significantly in terms of Lactobacillus species. We apply the same techniques to select the best 50 genes out of 2426 in the sparse gene expression data. The classification based on our selected genes achieves 100% prediction accuracy. Furthermore, the first four principal components based on the selected genes explain as much as 83% of the model variability. |
first_indexed | 2024-03-11T08:46:47Z |
format | Article |
id | doaj.art-4ee4e1ef30f047c485c41bbabf6103d1 |
institution | Directory Open Access Journal |
issn | 2073-4425 |
language | English |
last_indexed | 2024-03-11T08:46:47Z |
publishDate | 2023-02-01 |
publisher | MDPI AG |
record_format | Article |
series | Genes |
spelling | doaj.art-4ee4e1ef30f047c485c41bbabf6103d1; 2023-11-16T20:42:30Z; eng; MDPI AG; Genes; ISSN 2073-4425; 2023-02-01; vol. 14, no. 2, art. 403; DOI 10.3390/genes14020403; Variable Selection for Sparse Data with Applications to Vaginal Microbiome and Gene Expression Data; Niloufar Dousti Mousavi (Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA); Jie Yang (Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA); Hani Aldirawi (Department of Mathematics, California State University, San Bernardino, San Bernardino, CA 92407, USA); https://www.mdpi.com/2073-4425/14/2/403; keywords: zero-inflated model, hurdle model, longitudinal data, model selection, vaginal microbiome, gene expression |
title | Variable Selection for Sparse Data with Applications to Vaginal Microbiome and Gene Expression Data |
topic | zero-inflated model; hurdle model; longitudinal data; model selection; vaginal microbiome; gene expression |
url | https://www.mdpi.com/2073-4425/14/2/403 |