A cautionary note on the use of unsupervised machine learning algorithms to characterise malaria parasite population structure from genetic distance matrices
Genetic surveillance of malaria parasites supports malaria control programmes, treatment guidelines and elimination strategies. Surveillance studies often pose questions about malaria parasite ancestry (e.g. how antimalarial resistance has spread) and employ statistical methods that characterise par...
Main Authors: | Watson, JA; Taylor, AR; Ashley, EA; Dondorp, A; Buckee, CO; White, NJ; Holmes, CC |
---|---|
Format: | Journal article |
Language: | English |
Published: | Public Library of Science 2020 |
_version_ | 1797060070875332608 |
---|---|
author | Watson, JA Taylor, AR Ashley, EA Dondorp, A Buckee, CO White, NJ Holmes, CC |
author_sort | Watson, JA |
collection | OXFORD |
description | Genetic surveillance of malaria parasites supports malaria control programmes, treatment guidelines and elimination strategies. Surveillance studies often pose questions about malaria parasite ancestry (e.g. how antimalarial resistance has spread) and employ statistical methods that characterise parasite population structure. Many of the methods used to characterise structure are unsupervised machine learning algorithms which depend on a genetic distance matrix, notably principal coordinates analysis (PCoA) and hierarchical agglomerative clustering (HAC). PCoA and HAC are sensitive to both the definition of genetic distance and algorithmic specification. Importantly, neither algorithm infers malaria parasite ancestry. As such, PCoA and HAC can inform (e.g. via exploratory data visualisation and hypothesis generation), but not answer comprehensively, key questions about malaria parasite ancestry. We illustrate the sensitivity of PCoA and HAC using 393 Plasmodium falciparum whole genome sequences collected from Cambodia and neighbouring regions (where antimalarial resistance has emerged and spread recently) and we provide tentative guidance for the use and interpretation of PCoA and HAC in malaria parasite genetic epidemiology. This guidance includes a call for fully transparent and reproducible analysis pipelines that feature (i) a clearly outlined scientific question; (ii) a clear justification of analytical methods used to answer the scientific question along with discussion of any inferential limitations; (iii) publicly available genetic distance matrices when downstream analyses depend on them; and (iv) sensitivity analyses. To bridge the inferential disconnect between the output of non-inferential unsupervised learning algorithms and the scientific questions of interest, tailor-made statistical models are needed to infer malaria parasite ancestry. In the absence of such models speculative reasoning should feature only as discussion but not as results. |
first_indexed | 2024-03-06T20:12:27Z |
format | Journal article |
id | oxford-uuid:2b06adf9-0ea6-4b8e-a6dd-ca0369500ba7 |
institution | University of Oxford |
language | English |
last_indexed | 2024-03-06T20:12:27Z |
publishDate | 2020 |
publisher | Public Library of Science |
record_format | dspace |
spelling | oxford-uuid:2b06adf9-0ea6-4b8e-a6dd-ca0369500ba7 2022-03-26T12:28:33Z Journal article http://purl.org/coar/resource_type/c_dcae04bc uuid:2b06adf9-0ea6-4b8e-a6dd-ca0369500ba7 English Symplectic Elements Public Library of Science 2020 |
title | A cautionary note on the use of unsupervised machine learning algorithms to characterise malaria parasite population structure from genetic distance matrices |
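
The description above states that HAC is sensitive to algorithmic specification. A minimal sketch of that point, assuming a hypothetical toy distance matrix over five samples (not the study's data, and not the authors' pipeline), is given below. The function `hac` and the values in `positions` are illustrative assumptions; only the linkage rule differs between the two runs.

```python
from itertools import combinations


def hac(dist, n_clusters, linkage="single"):
    """Naive hierarchical agglomerative clustering on a symmetric distance matrix.

    dist is a list of lists; linkage is "single" (closest pair of members)
    or "complete" (farthest pair of members). Returns a sorted list of clusters,
    each a sorted list of sample indices.
    """
    agg = {"single": min, "complete": max}[linkage]
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest linkage distance.
        for a, b in combinations(range(len(clusters)), 2):
            d = agg(dist[i][j] for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        # Merge cluster b into cluster a (a < b, so deleting b is safe).
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return sorted(clusters)


# Hypothetical toy data: five samples placed on a line, with gradually
# widening gaps. The "genetic distance" here is just the absolute difference.
positions = [0, 10, 21, 33, 50]
dist = [[abs(p - q) for q in positions] for p in positions]

single = hac(dist, n_clusters=2, linkage="single")
complete = hac(dist, n_clusters=2, linkage="complete")

print("single linkage:  ", single)
print("complete linkage:", complete)
```

With the same distance matrix and the same number of clusters, single linkage chains the first four samples together and isolates the last, while complete linkage splits the samples into a group of two and a group of three. Neither partition is an inference about ancestry; the example only shows that the output depends on a choice the analyst makes.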