Geographical classification of malaria parasites through applying machine learning to whole genome sequence data

Abstract Malaria, caused by Plasmodium parasites, is a major global health challenge. Whole genome sequencing (WGS) of Plasmodium falciparum and Plasmodium vivax genomes is providing insights into parasite genetic diversity and transmission patterns, and can inform decision making for clinical and surv...


Bibliographic Details
Main Authors: Wouter Deelder, Emilia Manko, Jody E. Phelan, Susana Campino, Luigi Palla, Taane G. Clark
Format: Article
Language: English
Published: Nature Portfolio 2022-12-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-022-25568-6
collection DOAJ
description Abstract Malaria, caused by Plasmodium parasites, is a major global health challenge. Whole genome sequencing (WGS) of Plasmodium falciparum and Plasmodium vivax genomes is providing insights into parasite genetic diversity and transmission patterns, and can inform decision making for clinical and surveillance purposes. Advances in sequencing technologies are helping to generate large genomic datasets in a timely manner, with the prospect of applying Artificial Intelligence analytical techniques (e.g., machine learning) to support programmatic malaria control and elimination. Here, we assess the potential of applying deep learning convolutional neural network approaches to predict the geographic origin of infections (continents, countries, GPS locations) using WGS data of P. falciparum (n = 5957; 27 countries) and P. vivax (n = 659; 13 countries) isolates. Using identified high-quality genome-wide single nucleotide polymorphisms (SNPs) (P. falciparum: 750 k, P. vivax: 588 k), an analysis of population structure and ancestry revealed clustering at the country level. When predicting locations for both species, classification (compared to regression) methods had the lowest distance errors, and > 90% accuracy at a country level. Our work demonstrates the utility of machine learning approaches for geo-classification of malaria parasites. With timelier WGS data generation across more malaria-affected regions, the performance of machine learning approaches for geo-classification will improve, thereby supporting disease control activities.
first_indexed 2024-04-12T03:07:32Z
id doaj.art-ba09c2c1f0ed4880866bf3b0398756af
institution Directory Open Access Journal
issn 2045-2322
last_indexed 2024-04-12T03:07:32Z
spelling doaj.art-ba09c2c1f0ed4880866bf3b0398756af 2022-12-22T03:50:27Z eng Nature Portfolio Scientific Reports 2045-2322 2022-12-01 12111010 10.1038/s41598-022-25568-6
Author affiliations: Wouter Deelder, Emilia Manko, Jody E. Phelan, Susana Campino, Luigi Palla, Taane G. Clark (all London School of Hygiene & Tropical Medicine)