High throughput nonparametric probability density estimation.


Bibliographic Details
Main Authors: Jenny Farmer, Donald Jacobs
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2018-01-01
Series: PLoS ONE
Online Access:http://europepmc.org/articles/PMC5947915?pdf=render
collection DOAJ
description In high-throughput applications, such as those found in bioinformatics and finance, it is important to determine accurate probability distribution functions despite only minimal information about data characteristics, and without using human subjectivity. Such an automated process for univariate data is implemented to achieve this goal by merging the maximum entropy method with single order statistics and maximum likelihood. The only required properties of the random variables are that they are continuous and that they are, or can be approximated as, independent and identically distributed. A quasi-log-likelihood function based on single order statistics for sampled uniform random data is used to empirically construct a sample-size-invariant universal scoring function. A probability density estimate is then determined by iteratively improving trial cumulative distribution functions, where better estimates are quantified by the scoring function, which identifies atypical fluctuations. This criterion resists under- and over-fitting the data, serving as an alternative to the Bayesian or Akaike information criterion. Multiple estimates for the probability density reflect uncertainties due to statistical fluctuations in random samples. Scaled quantile residual plots are also introduced as an effective diagnostic for visualizing the quality of the estimated probability densities. Benchmark tests show that estimates for the probability density function (PDF) converge to the true PDF as sample size increases on particularly difficult test probability densities that include cases with discontinuities, multi-resolution scales, heavy tails, and singularities. These results indicate the method has general applicability for high-throughput statistical inference.
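The scaled quantile residual diagnostic described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the common construction in which, for a trial CDF F applied to sorted samples, the values u_k = F(x_(k)) should behave like uniform order statistics with mean k/(n+1), and the residuals are scaled by sqrt(n+2) so they remain O(1) regardless of sample size. The function and variable names (`scaled_quantile_residuals`, `normal_cdf`) are illustrative.

```python
import math
import numpy as np

def scaled_quantile_residuals(sorted_x, cdf):
    """Assumed form: sqrt(n+2) * (F(x_(k)) - k/(n+1)).

    If cdf is the true distribution of the data, the residuals
    fluctuate around zero with O(1) magnitude for any sample size;
    systematic trends or large excursions flag a poor trial CDF.
    """
    n = len(sorted_x)
    u = np.array([cdf(x) for x in sorted_x])        # u_k = F(x_(k))
    mu = np.arange(1, n + 1) / (n + 1.0)            # mean of k-th uniform order statistic
    return math.sqrt(n + 2) * (u - mu)

# Demo: standard-normal samples scored against the true standard-normal CDF.
rng = np.random.default_rng(0)
x = np.sort(rng.standard_normal(1000))
normal_cdf = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
sqr = scaled_quantile_residuals(x, normal_cdf)
```

Because the correct CDF is used in the demo, the residuals stay small (roughly within a few tenths to ~1 in magnitude); scoring the same data against a wrong CDF produces a visibly larger, structured residual curve, which is what makes the plot useful as a goodness-of-fit diagnostic.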
id doaj.art-c4d7c76ce65443e1af6bb2e9adbfea67
issn 1932-6203
doi 10.1371/journal.pone.0196937
citation PLoS ONE 13(5): e0196937 (2018)