Effect of data harmonization of multicentric dataset in ASD/TD classification

Abstract Machine Learning (ML) is nowadays an essential tool in the analysis of Magnetic Resonance Imaging (MRI) data, in particular in the identification of brain correlates in neurological and neurodevelopmental disorders. ML requires datasets of appropriate size for training, which in neuroimagin...


Bibliographic Details
Main Authors:	Giacomo Serra, Francesca Mainas, Bruno Golosio, Alessandra Retico, Piernicola Oliva
Format:	Article
Language:	English
Published:	SpringerOpen 2023-11-01
Series:	Brain Informatics
Subjects:	ABIDE; Multi-site data; Harmonization; Machine learning; Autism spectrum disorder
Online Access:	https://doi.org/10.1186/s40708-023-00210-x

author	Giacomo Serra, Francesca Mainas, Bruno Golosio, Alessandra Retico, Piernicola Oliva
collection	DOAJ
description	Abstract Machine Learning (ML) is nowadays an essential tool in the analysis of Magnetic Resonance Imaging (MRI) data, in particular in the identification of brain correlates in neurological and neurodevelopmental disorders. ML requires datasets of appropriate size for training, which in neuroimaging are typically obtained by collecting data from multiple acquisition centers. However, analyzing large multicentric datasets can introduce bias due to differences between acquisition centers. ComBat harmonization is commonly used to address batch effects, but it can lead to data leakage when the entire dataset is used to estimate model parameters. In this study, structural and functional MRI data from the Autism Brain Imaging Data Exchange (ABIDE) collection were used to classify subjects with Autism Spectrum Disorders (ASD) compared to Typical Developing controls (TD). We compared the classical approach (external harmonization), in which harmonization is performed before the train/test split, with a harmonization calculated only on the train set (internal harmonization), and with the dataset with no harmonization. The results showed that harmonization using the whole dataset achieved higher discrimination performance, while non-harmonized data and harmonization using only the train set showed similar results, for both structural and connectivity features. We also showed that the higher performance of the external harmonization is not due to the larger size of the sample used for the estimation of the model; hence, the improved performance obtained with the entire dataset may be ascribed to data leakage. To prevent this leakage, it is recommended to define the harmonization model using the train set only.
first_indexed	2024-03-09T14:48:29Z
format	Article
id	doaj.art-8ef41a4d533b4180804530fb3cd0a6df
institution	Directory Open Access Journal
issn	2198-4018 2198-4026
language	English
last_indexed	2024-03-09T14:48:29Z
publishDate	2023-11-01
publisher	SpringerOpen
record_format	Article
series	Brain Informatics
spelling	doaj.art-8ef41a4d533b4180804530fb3cd0a6df; 2023-11-26T14:37:34Z; eng; SpringerOpen; Brain Informatics; 2198-4018; 2198-4026; 2023-11-01; 101111; 10.1186/s40708-023-00210-x; Effect of data harmonization of multicentric dataset in ASD/TD classification; Giacomo Serra [0], Francesca Mainas [1], Bruno Golosio [2], Alessandra Retico [3], Piernicola Oliva [4]; Department of Physics, University of Cagliari; Department of Physics, University of Cagliari; Department of Physics, University of Cagliari; National Institute for Nuclear Physics (INFN), Pisa Division; National Institute for Nuclear Physics (INFN), Cagliari Division; https://doi.org/10.1186/s40708-023-00210-x; ABIDE; Multi-site data; Harmonization; Machine learning; Autism spectrum disorder
title	Effect of data harmonization of multicentric dataset in ASD/TD classification
topic	ABIDE; Multi-site data; Harmonization; Machine learning; Autism spectrum disorder
url	https://doi.org/10.1186/s40708-023-00210-x
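
The distinction the abstract draws between external and internal harmonization can be illustrated with a minimal sketch. This is not the authors' code and it is not ComBat: it assumes a much simpler per-site location and scale adjustment (no covariates, no empirical Bayes step), and the function names `fit_site_params` and `apply_site_params`, as well as the synthetic data, are hypothetical. It only shows where the harmonization parameters are estimated in each case.

```python
import numpy as np


def fit_site_params(X, sites):
    """Estimate pooled and per-site mean/std for each feature.

    Simplified stand-in for a harmonization model (NOT ComBat).
    """
    grand_mean = X.mean(axis=0)
    grand_std = X.std(axis=0) + 1e-8  # small constant avoids division by zero
    params = {}
    for s in np.unique(sites):
        Xs = X[sites == s]
        params[s] = (Xs.mean(axis=0), Xs.std(axis=0) + 1e-8)
    return grand_mean, grand_std, params


def apply_site_params(X, sites, model):
    """Map each site's features onto the pooled mean/std stored in `model`.

    Every site in `sites` must have been seen by fit_site_params.
    """
    grand_mean, grand_std, params = model
    out = np.empty_like(X, dtype=float)
    for s in np.unique(sites):
        mask = sites == s
        mu, sd = params[s]
        out[mask] = (X[mask] - mu) / sd * grand_std + grand_mean
    return out


# Synthetic example (assumption): 200 subjects, 5 features, 2 sites,
# with an additive offset on the second site.
rng = np.random.default_rng(0)
sites = np.arange(200) % 2
X = rng.normal(size=(200, 5)) + 3.0 * sites[:, None]

idx = rng.permutation(200)
train, test = idx[:150], idx[150:]

# External harmonization: parameters estimated on ALL subjects before the
# split, so test subjects influence the transformation (possible leakage).
external_model = fit_site_params(X, sites)
X_external = apply_site_params(X, sites, external_model)

# Internal harmonization: parameters estimated on the train set only,
# then applied unchanged to the held-out test subjects.
internal_model = fit_site_params(X[train], sites[train])
X_train_h = apply_site_params(X[train], sites[train], internal_model)
X_test_h = apply_site_params(X[test], sites[test], internal_model)
```

In the internal variant the test subjects never contribute to the estimated parameters, which is the property the abstract recommends preserving. In a cross-validation setting the fit step would be repeated inside each fold.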

