Anomaly-aware summary statistic from data batches

Bibliographic Details
Main Author: Grosso, G.
Other Authors: Massachusetts Institute of Technology. Laboratory for Nuclear Science
Format: Article
Language: English
Published: Springer Berlin Heidelberg 2024
Online Access: https://hdl.handle.net/1721.1/157890
author Grosso, G.
author2 Massachusetts Institute of Technology. Laboratory for Nuclear Science
author_facet Massachusetts Institute of Technology. Laboratory for Nuclear Science
Grosso, G.
author_sort Grosso, G.
collection MIT
description Signal-agnostic data exploration based on machine learning could unveil very subtle statistical deviations of collider data from the expected Standard Model of particle physics. The beneficial impact of a large training sample on machine learning solutions motivates the exploration of increasingly large and inclusive samples of acquired data with resource-efficient computational methods. In this work, we consider the New Physics Learning Machine (NPLM), a multivariate goodness-of-fit test built on the Neyman-Pearson maximum-likelihood-ratio construction, and we address the problem of testing large samples under computational and storage resource constraints. We propose to perform parallel NPLM routines over batches of the data, and to combine them by locally aggregating over the data-to-reference density ratios learnt by each batch. The resulting data hypothesis defining the likelihood-ratio test is thus shared over the batches, and complies with the assumption that the expected rate of new physical processes is time invariant. We show that this method outperforms the simple sum of the independent tests run over the batches, and can recover, or even surpass, the sensitivity of the single test run over the full data. Besides the significant advantage for the offline application of NPLM to large samples, the proposed approach offers new prospects toward the use of NPLM to construct anomaly-aware summary statistics in quasi-online data streaming scenarios.
first_indexed 2025-02-19T04:19:04Z
format Article
id mit-1721.1/157890
institution Massachusetts Institute of Technology
language English
last_indexed 2025-02-19T04:19:04Z
publishDate 2024
publisher Springer Berlin Heidelberg
record_format dspace
spelling mit-1721.1/157890 2025-01-04T06:17:33Z Anomaly-aware summary statistic from data batches Grosso, G. Massachusetts Institute of Technology. Laboratory for Nuclear Science 2024-12-18T20:52:11Z 2024-12-18T20:52:11Z 2024-12-12 2024-12-15T04:16:56Z Article http://purl.org/eprint/type/JournalArticle https://hdl.handle.net/1721.1/157890 Grosso, G. Anomaly-aware summary statistic from data batches. J. High Energ. Phys. 2024, 93 (2024).
PUBLISHER_CC en https://doi.org/10.1007/JHEP12(2024)093 Journal of High Energy Physics Creative Commons Attribution https://creativecommons.org/licenses/by/4.0/ The Author(s) application/pdf Springer Berlin Heidelberg
title Anomaly-aware summary statistic from data batches
url https://hdl.handle.net/1721.1/157890