Model selection and robust inference of mutational signatures using Negative Binomial non-negative matrix factorization


Bibliographic Details
Main Authors: Marta Pelizzola, Ragnhild Laursen, Asger Hobolth
Format: Article
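Language: English
Published: BMC 2023-05-01
Series: BMC Bioinformatics
Subjects: Cancer genomics; Cross-validation; Model checking; Model selection; Mutational signatures; Negative Binomial
Online Access: https://doi.org/10.1186/s12859-023-05304-1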
collection DOAJ
description Abstract
Background: The spectrum of mutations in a collection of cancer genomes can be described by a mixture of a few mutational signatures. The mutational signatures can be found using non-negative matrix factorization (NMF). To extract the mutational signatures, we have to assume a distribution for the observed mutational counts and a number of mutational signatures. In most applications, the mutational counts are assumed to be Poisson distributed, and the rank is chosen by comparing the fit of several models that share the same underlying distribution but differ in rank, using classical model selection procedures. However, the counts are often overdispersed, and thus the Negative Binomial distribution is more appropriate.
Results: We propose a Negative Binomial NMF with a patient-specific dispersion parameter to capture the variation across patients, and we derive the corresponding update rules for parameter estimation. We also introduce a novel model selection procedure inspired by cross-validation to determine the number of signatures. Using simulations, we study the influence of the distributional assumption on our method together with other classical model selection procedures. We also present a simulation study with a method comparison, where we show that state-of-the-art methods substantially overestimate the number of signatures when overdispersion is present. We apply our proposed analysis to a wide range of simulated data and to two real data sets from breast and prostate cancer patients. On the real data, we describe a residual analysis to investigate and validate the model choice.
Conclusions: With our results on simulated and real data, we show that our model selection procedure is more robust at determining the correct number of signatures under model misspecification. We also show that our model selection procedure is more accurate than the available methods in the literature for finding the true number of signatures. Lastly, the residual analysis clearly emphasizes the overdispersion in the mutational count data. The code for our model selection procedure and Negative Binomial NMF is available in the R package SigMoS and can be found at https://github.com/MartaPelizzola/SigMoS.
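As a rough illustration of the Poisson baseline that the abstract contrasts with, the sketch below fits an NMF with the classical multiplicative updates for the generalized Kullback-Leibler divergence, which corresponds to maximum likelihood under a Poisson model. It is not the authors' Negative Binomial update rules and not the SigMoS API: the function names `poisson_nmf` and `gkl_divergence`, the patients-by-mutation-types orientation of `V`, and all default parameters are assumptions made for this example.

```python
import numpy as np


def poisson_nmf(V, rank, n_iter=500, seed=0, eps=1e-12):
    """Fit V ~ W @ H with multiplicative updates for the generalized KL divergence.

    V    : (patients x mutation types) matrix of non-negative counts (assumed layout)
    rank : number of signatures to extract
    Returns W (patients x rank, exposures) and H (rank x mutation types, signatures).
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    # Strictly positive random start so the multiplicative updates can move every entry.
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Update signatures H, holding exposures W fixed.
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        # Update exposures W, holding signatures H fixed.
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H


def gkl_divergence(V, WH, eps=1e-12):
    """Generalized KL divergence D(V || WH); zero counts contribute only WH."""
    mask = V > 0
    log_term = np.sum(V[mask] * np.log(V[mask] / (WH[mask] + eps)))
    return log_term - V.sum() + WH.sum()
```

Comparing `gkl_divergence` across several values of `rank` is the kind of same-distribution, different-rank comparison the abstract describes for classical model selection; under overdispersed counts, such a Poisson fit may favor too many signatures, which is the motivation given for the Negative Binomial model.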
first_indexed 2024-04-09T12:46:09Z
id doaj.art-5ac27b7730f44738a081069ae12fc69a
institution Directory Open Access Journal
issn 1471-2105
last_indexed 2024-04-09T12:46:09Z
spelling doaj.art-5ac27b7730f44738a081069ae12fc69a; 2023-05-14T11:29:46Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2023-05-01; 241124; 10.1186/s12859-023-05304-1; Model selection and robust inference of mutational signatures using Negative Binomial non-negative matrix factorization; Marta Pelizzola (Department of Mathematics, Aarhus University); Ragnhild Laursen (Department of Mathematics, Aarhus University); Asger Hobolth (Department of Mathematics, Aarhus University); https://doi.org/10.1186/s12859-023-05304-1; Cancer genomics; Cross-validation; Model checking; Model selection; Mutational signatures; Negative Binomial
topic Cancer genomics
Cross-validation
Model checking
Model selection
Mutational signatures
Negative Binomial