Towards an AI-based understanding of the solar wind: A critical data analysis of ACE data

All artificial intelligence models today require preprocessed and cleaned data to work properly. This crucial step depends on the quality of the data analysis being done. The Space Weather community has increased its use of AI in the past few years, but a thorough data analysis addressing all the potent...


Bibliographic Details
Main Authors: S. Bouriat, P. Vandame, M. Barthélémy, J. Chanussot
Format: Article
Language: English
Published: Frontiers Media S.A. 2022-11-01
Series: Frontiers in Astronomy and Space Sciences
Subjects: data analysis; solar wind; MAG; SWEPAM; machine learning; ACE
Online Access: https://www.frontiersin.org/articles/10.3389/fspas.2022.980759/full
collection DOAJ
description All artificial intelligence models today require preprocessed and cleaned data to work properly. This crucial step depends on the quality of the data analysis performed beforehand. The Space Weather community has increased its use of AI in the past few years, but a thorough data analysis addressing all the potential issues is not always carried out first. This paper analyzes a widely used dataset: the Level-2 SWEPAM and MAG measurements of the Advanced Composition Explorer (ACE) from 1998 to 2021, provided by the ACE Science Center. The work contains guidelines and highlights issues in the ACE data that are likely to be found in other space weather datasets: missing values, inconsistency in distributions, hidden information in statistics, etc. Among the specificities of this data, the following can seriously impact the use of algorithms:
Histograms are far from uniform; some are Gaussian or Laplacian. Algorithms will behave inconsistently across the learning samples because rare cases are underrepresented. The Gaussian shapes may be largely due to Gaussian measurement noise, and the signal-to-noise ratio is difficult to estimate.
Models will not be reproducible from year to year because histograms change strongly over time. This strong dependence on the solar cycle suggests that one should have at least 11 consecutive years of data to train the algorithm.
Ion temperature values are rounded to different orders of magnitude throughout the data (probably because measurements are coded on a fixed number of bits), which will bias the model by over-representing or under-representing some values.
There is a large number of missing values (e.g., 41.59% for ion density) that cannot be used without pre-processing. Each possible pre-processing is different and subjective, depending on one’s underlying objectives.
A linear model will not be able to accurately model the data. Our linear analysis (e.g., PCA) struggles to explain the data and their relationships. However, non-linear relationships between data seem to exist.
Data seem cyclic: the solar cycle and the synodic rotation period of the Sun appear in the autocorrelations.
Some suggestions are given to address these issues and enable usage of the dataset despite these challenges.
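Two of the checks named in the description, the share of missing values and the periodicities visible in autocorrelations, can be illustrated with a minimal Python sketch. It is not the authors' method: the series is synthetic (a 27-day sinusoid standing in for the roughly 27-day synodic rotation signal), the daily cadence is arbitrary, and `FILL_VALUE`, `missing_fraction` and `autocorrelation` are hypothetical names introduced here for illustration only.

```python
import numpy as np

# Assumed sentinel marking bad or missing samples; the real flag must be
# taken from the dataset's own documentation.
FILL_VALUE = -9999.9


def missing_fraction(values, fill_value=FILL_VALUE):
    """Fraction of samples that are NaN or equal to the assumed fill value."""
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values) | np.isclose(values, fill_value)
    return missing.mean()


def autocorrelation(values, max_lag):
    """Sample autocorrelation for lags 0..max_lag, skipping pairs with a NaN."""
    x = np.asarray(values, dtype=float)
    x = x - np.nanmean(x)
    var = np.nanmean(x * x)
    acf = np.empty(max_lag + 1)
    for lag in range(max_lag + 1):
        # Products involving a NaN are NaN and are ignored by nanmean.
        prod = x[: len(x) - lag] * x[lag:]
        acf[lag] = np.nanmean(prod) / var
    return acf


# Synthetic daily series: 27-day oscillation plus weak noise (illustrative only).
rng = np.random.default_rng(0)
n_days = 2000
t = np.arange(n_days)
raw = np.sin(2 * np.pi * t / 27) + 0.1 * rng.standard_normal(n_days)

# Knock out roughly 20% of the samples with the assumed fill value.
raw[rng.random(n_days) < 0.2] = FILL_VALUE

frac = missing_fraction(raw)

# Replace the sentinel with NaN before computing any statistics.
clean = np.where(np.isclose(raw, FILL_VALUE), np.nan, raw)
acf = autocorrelation(clean, max_lag=40)

# Strongest recurrence between lags 10 and 40 days.
best_lag = 10 + int(np.argmax(acf[10:]))

print(f"missing fraction: {frac:.2%}")
print(f"strongest recurrence at lag: {best_lag} days")
```

Skipping NaN pairs rather than interpolating is only one of the possible pre-processing choices; as the description notes, each choice is subjective and depends on the underlying objective.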
first_indexed 2024-04-11T15:41:34Z
id doaj.art-8b9b5f4b97894216883f45d6d46d0148
institution Directory Open Access Journal
issn 2296-987X
last_indexed 2024-04-11T15:41:34Z
spelling doaj.art-8b9b5f4b97894216883f45d6d46d0148 2022-12-22T04:15:46Z eng
Frontiers in Astronomy and Space Sciences, vol. 9, article 980759 (2022-11-01), ISSN 2296-987X, DOI 10.3389/fspas.2022.980759
Author affiliations:
S. Bouriat: CNRS, IPAG, University of Grenoble Alpes, Grenoble, France; CSUG, University of Grenoble Alpes, Grenoble, France; GIPSA-Lab, Grenoble INP, CNRS, University of Grenoble Alpes, Grenoble, France; SpaceAble, Paris, France
P. Vandame: GIPSA-Lab, Grenoble INP, CNRS, University of Grenoble Alpes, Grenoble, France
M. Barthélémy: CNRS, IPAG, University of Grenoble Alpes, Grenoble, France; CSUG, University of Grenoble Alpes, Grenoble, France
J. Chanussot: GIPSA-Lab, Grenoble INP, CNRS, University of Grenoble Alpes, Grenoble, France
topic data analysis
solar wind
MAG
SWEPAM
machine learning
ACE