Variable Selection for Sparse Data with Applications to Vaginal Microbiome and Gene Expression Data

Sparse data with a high portion of zeros arise in various disciplines. Modeling sparse high-dimensional data is a challenging and growing research area. In this paper, we provide statistical methods and tools for analyzing sparse data in a fairly general and complex context. We utilize two real scie...

Bibliographic Details
Main Authors: Niloufar Dousti Mousavi, Jie Yang, Hani Aldirawi
Format: Article
Language: English
Published: MDPI AG 2023-02-01
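Series: Genes
Subjects: zero-inflated model; hurdle model; longitudinal data; model selection; vaginal microbiome; gene expression
Online Access: https://www.mdpi.com/2073-4425/14/2/403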
_version_ 1797620802437251072
author Niloufar Dousti Mousavi
Jie Yang
Hani Aldirawi
collection DOAJ
description Sparse data with a high portion of zeros arise in various disciplines. Modeling sparse high-dimensional data is a challenging and growing research area. In this paper, we provide statistical methods and tools for analyzing sparse data in a fairly general and complex context. We use two real scientific applications as illustrations: a longitudinal vaginal microbiome dataset and a high-dimensional gene expression dataset. We recommend zero-inflated model selection and significance tests to identify the time intervals in which the pregnant and non-pregnant groups of women differ significantly in terms of Lactobacillus species. We apply the same techniques to select the best 50 genes out of 2426 in a sparse gene expression dataset. The classification based on our selected genes achieves 100% prediction accuracy. Furthermore, the first four principal components based on the selected genes can explain as much as 83% of the model variability.
first_indexed 2024-03-11T08:46:47Z
format Article
id doaj.art-4ee4e1ef30f047c485c41bbabf6103d1
institution Directory Open Access Journal
issn 2073-4425
language English
last_indexed 2024-03-11T08:46:47Z
publishDate 2023-02-01
publisher MDPI AG
record_format Article
series Genes
spelling doaj.art-4ee4e1ef30f047c485c41bbabf6103d1
2023-11-16T20:42:30Z
eng
MDPI AG
Genes
2073-4425
2023-02-01
vol. 14, no. 2, art. 403
10.3390/genes14020403
Variable Selection for Sparse Data with Applications to Vaginal Microbiome and Gene Expression Data
Niloufar Dousti Mousavi, Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
Jie Yang, Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
Hani Aldirawi, Department of Mathematics, California State University—San Bernardino, San Bernardino, CA 92407, USA
https://www.mdpi.com/2073-4425/14/2/403
zero-inflated model
hurdle model
longitudinal data
model selection
vaginal microbiome
gene expression
title Variable Selection for Sparse Data with Applications to Vaginal Microbiome and Gene Expression Data
topic zero-inflated model
hurdle model
longitudinal data
model selection
vaginal microbiome
gene expression
url https://www.mdpi.com/2073-4425/14/2/403