Big Data Analytics for the ATLAS EventIndex Project with Apache Spark

The ATLAS EventIndex was designed to provide a global event catalogue and limited event-level metadata for the ATLAS experiment at the Large Hadron Collider (LHC) and its analysis groups and users during Run 2 (2015-2018) and has been running in production since. The LHC Run 3, started in 2022, has se...

Bibliographic Details
Main Authors: Álvaro Fernández Casaní, Carlos García Montoro, Santiago González de la Hoz, José Salt, Javier Sánchez, Miguel Villaplana Pérez
Format: Article
Language: English
Published: Hindawi-Wiley 2023-01-01
Series: Computational and Mathematical Methods
Online Access: http://dx.doi.org/10.1155/2023/6900908
collection DOAJ
description The ATLAS EventIndex was designed to provide a global event catalogue and limited event-level metadata for the ATLAS experiment at the Large Hadron Collider (LHC) and its analysis groups and users during Run 2 (2015-2018) and has been running in production since. LHC Run 3, which started in 2022, has seen increased data-taking and simulation production rates. The current infrastructure can still cope with these rates but may be stretched to its limits by the end of Run 3. A new core storage service is being developed in HBase/Phoenix, and work is in progress to provide at least the same functionality as the current one at higher data ingestion and search rates and with growing volumes of stored data. In addition, new tools are being developed to cover the required access use cases within the new storage. This paper describes a new tool, built on Spark and implemented in Scala, for accessing the large volumes of EventIndex data stored in HBase/Phoenix. With this tool, we can offer data discovery capabilities at different granularities, providing Spark DataFrames that can be used or refined within the same framework. Data analytics use cases of the EventIndex project are implemented, such as the search for duplicate events within the same dataset or across different datasets. An algorithm and implementation for calculating overlap matrices of events across different datasets are presented. Our approach can be used by other higher-level tools and users to ease access to the data in a performant and standard way using Spark abstractions. The provided tools decouple data access from the actual data schema, which makes it convenient to hide complexity and possible changes in the backend storage.
first_indexed 2024-03-11T19:53:32Z
id doaj.art-9e2e9eacf95a481ea05bb0adfdd21529
institution Directory Open Access Journal
issn 2577-7408
last_indexed 2024-03-11T19:53:32Z
affiliation Institute of Corpuscular Physics-IFIC (CSIC/UV) (all authors)
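
The duplicate search and overlap-matrix calculation named in the description can be illustrated with a minimal sketch. It is written in plain Python on toy in-memory data, not in the Scala/Spark tool the paper describes, and it is not the authors' algorithm. The dataset names, the events, and the helper functions find_duplicates and overlap_matrix are hypothetical, and identifying an event by a (run number, event number) pair is an assumption made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: dataset name -> list of (run_number, event_number) pairs.
datasets = {
    "ds_A": [(1, 10), (1, 11), (1, 12), (1, 11)],  # (1, 11) appears twice
    "ds_B": [(1, 11), (1, 12), (2, 5)],
    "ds_C": [(2, 5), (3, 7)],
}


def find_duplicates(events):
    """Return the events that occur more than once, with their counts."""
    counts = Counter(events)
    return {event: n for event, n in counts.items() if n > 1}


def overlap_matrix(datasets):
    """Count the unique events shared by every pair of datasets.

    The diagonal holds the number of unique events in each dataset.
    """
    names = sorted(datasets)
    unique = {name: set(datasets[name]) for name in names}
    matrix = {a: {b: 0 for b in names} for a in names}
    for name in names:
        matrix[name][name] = len(unique[name])
    for a, b in combinations(names, 2):
        shared = len(unique[a] & unique[b])
        matrix[a][b] = shared
        matrix[b][a] = shared  # the matrix is symmetric
    return names, matrix


duplicates = find_duplicates(datasets["ds_A"])
names, matrix = overlap_matrix(datasets)

print("duplicates in ds_A:", duplicates)
for a in names:
    print(a, [matrix[a][b] for b in names])
```

At the data volumes the description refers to, counts like these would more plausibly come from joins or grouped aggregations over Spark DataFrames than from in-memory sets; the sketch only shows what the matrix entries mean.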