Big Data Analytics for the ATLAS EventIndex Project with Apache Spark
The ATLAS EventIndex was designed to provide a global event catalogue and limited event-level metadata for the ATLAS experiment of the Large Hadron Collider (LHC) and its analysis groups and users during Run 2 (2015-2018) and has been running in production since. The LHC Run 3, started in 2022, has se...
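Main Authors: | Álvaro Fernández Casaní, Carlos García Montoro, Santiago González de la Hoz, José Salt, Javier Sánchez, Miguel Villaplana Pérez |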
---|---|
Format: | Article |
Language: | English |
Published: | Hindawi-Wiley, 2023-01-01 |
Series: | Computational and Mathematical Methods |
Online Access: | http://dx.doi.org/10.1155/2023/6900908 |
_version_ | 1797666041662275584 |
---|---|
author | Álvaro Fernández Casaní; Carlos García Montoro; Santiago González de la Hoz; José Salt; Javier Sánchez; Miguel Villaplana Pérez |
author_sort | Álvaro Fernández Casaní |
collection | DOAJ |
description | The ATLAS EventIndex was designed to provide a global event catalogue and limited event-level metadata for the ATLAS experiment of the Large Hadron Collider (LHC) and its analysis groups and users during Run 2 (2015-2018) and has been running in production since. LHC Run 3, which started in 2022, has seen increased data-taking and simulation production rates, with which the current infrastructure would still cope but may be stretched to its limits by the end of Run 3. A new core storage service is being developed in HBase/Phoenix, and work is in progress to provide at least the same functionality as the current one for increased data ingestion and search rates and for increasing volumes of stored data. In addition, new tools are being developed to address the required access use cases within the new storage. This paper describes a new tool, built on Spark and implemented in Scala, for accessing the large volumes of EventIndex data stored in HBase/Phoenix. With this tool, we can offer data discovery capabilities at different granularities, providing Spark DataFrames that can be used or refined within the same framework. Data analytics use cases of the EventIndex project are implemented, such as the search for duplicate events within the same dataset or across different datasets. An algorithm and implementation for the calculation of overlap matrices of events across different datasets are presented. Our approach can be used by other higher-level tools and users to ease access to the data in a performant and standard way using Spark abstractions. The provided tools decouple data access from the actual data schema, which makes it convenient to hide complexity and possible changes in the backend storage. |
first_indexed | 2024-03-11T19:53:32Z |
format | Article |
id | doaj.art-9e2e9eacf95a481ea05bb0adfdd21529 |
institution | Directory Open Access Journal |
issn | 2577-7408 |
language | English |
last_indexed | 2024-03-11T19:53:32Z |
publishDate | 2023-01-01 |
publisher | Hindawi-Wiley |
record_format | Article |
series | Computational and Mathematical Methods |
spelling | doaj.art-9e2e9eacf95a481ea05bb0adfdd21529; 2023-10-05T00:00:02Z; eng; Hindawi-Wiley; Computational and Mathematical Methods; 2577-7408; 2023-01-01; 2023; 10.1155/2023/6900908; Big Data Analytics for the ATLAS EventIndex Project with Apache Spark; Álvaro Fernández Casaní, Carlos García Montoro, Santiago González de la Hoz, José Salt, Javier Sánchez, Miguel Villaplana Pérez (all Institute of Corpuscular Physics-IFIC (CSIC/UV)); http://dx.doi.org/10.1155/2023/6900908 |
title | Big Data Analytics for the ATLAS EventIndex Project with Apache Spark |
url | http://dx.doi.org/10.1155/2023/6900908 |
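The description above mentions two analytics cases: searching for duplicate events and calculating overlap matrices of events across datasets. The record does not reproduce the paper's Scala/Spark algorithm, so the following is only a minimal sketch of the general idea in plain Python, not the authors' implementation. It assumes each event is identified by a (run number, event number) pair; the dataset names, the toy data, and the helper names `find_duplicates` and `overlap_matrix` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: dataset name -> list of (run_number, event_number) pairs.
datasets = {
    "dsA": [(1, 10), (1, 11), (1, 12), (1, 11)],
    "dsB": [(1, 11), (1, 12), (2, 5)],
    "dsC": [(2, 5), (3, 7)],
}


def find_duplicates(events):
    """Return the event identifiers that occur more than once in one dataset."""
    counts = Counter(events)
    return sorted(event for event, n in counts.items() if n > 1)


def overlap_matrix(datasets):
    """Return (names, matrix); matrix[i][j] counts distinct events shared by
    datasets i and j, and the diagonal holds each dataset's distinct-event count."""
    names = sorted(datasets)
    unique = {name: set(datasets[name]) for name in names}
    size = len(names)
    matrix = [[0] * size for _ in range(size)]
    for i, name in enumerate(names):
        matrix[i][i] = len(unique[name])
    # The matrix is symmetric, so each unordered pair is intersected once.
    for i, j in combinations(range(size), 2):
        shared = len(unique[names[i]] & unique[names[j]])
        matrix[i][j] = matrix[j][i] = shared
    return names, matrix


names, matrix = overlap_matrix(datasets)
duplicates_in_a = find_duplicates(datasets["dsA"])
```

In a Spark setting the same quantities would typically be obtained with grouping and join operations on DataFrames rather than in-memory sets; the set-based version here only illustrates what the matrix entries mean.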