Domain-adaptive neural networks improve supervised machine learning based on simulated population genetic data.

Investigators have recently introduced powerful methods for population genetic inference that rely on supervised machine learning from simulated data. Despite their performance advantages, these methods can fail when the simulated training data does not adequately resemble data from the real world....


Bibliographic Details
Main Authors: Ziyi Mo, Adam Siepel
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2023-11-01
Series: PLoS Genetics
Online Access: https://journals.plos.org/plosgenetics/article/file?id=10.1371/journal.pgen.1011032&type=printable
_version_ 1797394711147708416
author Ziyi Mo
Adam Siepel
author_facet Ziyi Mo
Adam Siepel
author_sort Ziyi Mo
collection DOAJ
description Investigators have recently introduced powerful methods for population genetic inference that rely on supervised machine learning from simulated data. Despite their performance advantages, these methods can fail when the simulated training data does not adequately resemble data from the real world. Here, we show that this "simulation mis-specification" problem can be framed as a "domain adaptation" problem, where a model learned from one data distribution is applied to a dataset drawn from a different distribution. By applying an established domain-adaptation technique based on a gradient reversal layer (GRL), originally introduced for image classification, we show that the effects of simulation mis-specification can be substantially mitigated. We focus our analysis on two state-of-the-art deep-learning population genetic methods: SIA, which infers positive selection from features of the ancestral recombination graph (ARG), and ReLERNN, which infers recombination rates from genotype matrices. In the case of SIA, the domain-adaptive framework also compensates for ARG inference error. Using the domain-adaptive SIA (dadaSIA) model, we estimate improved selection coefficients at selected loci in the 1000 Genomes CEU population. We anticipate that domain adaptation will prove to be widely applicable in the growing use of supervised machine learning in population genetics.
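The gradient reversal layer named in the description can be illustrated with a minimal sketch. This is not the authors' implementation: the one-weight feature extractor, the one-weight domain classifier, and the names `GradientReversal`, `domain_branch_gradients`, `w`, `v`, and `lam` are all hypothetical, chosen only to show the mechanism. The layer is the identity on the forward pass and multiplies incoming gradients by `-lam` on the backward pass, so the domain classifier learns to tell simulated from real data while the feature extractor is pushed toward features that do not separate the two domains.

```python
import math


class GradientReversal:
    """Identity on the forward pass; scales gradients by -lam on the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # assumed trade-off weight between the two objectives

    def forward(self, x):
        return list(x)

    def backward(self, grad_output):
        return [-self.lam * g for g in grad_output]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def domain_branch_gradients(w, v, x, d, grl):
    """Gradients of a binary cross-entropy domain loss for a toy scalar model.

    w: feature-extractor weight (hypothetical), v: domain-classifier weight
    (hypothetical), x: input, d: domain label (0 = simulated, 1 = real).
    Returns (grad_v, grad_w).
    """
    h = w * x                       # feature extractor output
    (h_rev,) = grl.forward([h])     # forward pass leaves the feature unchanged
    p = sigmoid(v * h_rev)          # predicted probability of the "real" domain
    dloss_dlogit = p - d            # derivative of cross-entropy w.r.t. the logit
    grad_v = dloss_dlogit * h_rev   # domain classifier gradient is NOT reversed
    grad_h_rev = dloss_dlogit * v   # gradient arriving at the reversal layer
    (grad_h,) = grl.backward([grad_h_rev])  # sign flipped, scaled by lam
    grad_w = grad_h * x             # reversed gradient reaches the extractor
    return grad_v, grad_w


if __name__ == "__main__":
    grl = GradientReversal(lam=1.0)
    grad_v, grad_w = domain_branch_gradients(w=0.5, v=2.0, x=1.0, d=1.0, grl=grl)
    print("domain classifier gradient:", grad_v)
    print("feature extractor gradient (reversed):", grad_w)
```

In a full model the feature extractor would also receive the ordinary, unreversed gradient from the label predictor trained on simulated data; only the domain branch passes through the reversal layer.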
first_indexed 2024-03-09T00:23:42Z
format Article
id doaj.art-d1ab79f501334ce198a8c69df25996a5
institution Directory Open Access Journal
issn 1553-7390
1553-7404
language English
last_indexed 2024-03-09T00:23:42Z
publishDate 2023-11-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS Genetics
spelling doaj.art-d1ab79f501334ce198a8c69df25996a5; 2023-12-12T05:32:57Z; eng; Public Library of Science (PLoS); PLoS Genetics; 1553-7390; 1553-7404; 2023-11-01; vol. 19, no. 11, e1011032; 10.1371/journal.pgen.1011032; Domain-adaptive neural networks improve supervised machine learning based on simulated population genetic data.; Ziyi Mo; Adam Siepel; https://journals.plos.org/plosgenetics/article/file?id=10.1371/journal.pgen.1011032&type=printable
title Domain-adaptive neural networks improve supervised machine learning based on simulated population genetic data.
title_sort domain adaptive neural networks improve supervised machine learning based on simulated population genetic data
url https://journals.plos.org/plosgenetics/article/file?id=10.1371/journal.pgen.1011032&type=printable