Extensive sequence duplication in Arabidopsis revealed by pseudo-heterozygosity

Bibliographic Details
Main Authors: Benjamin Jaegle, Rahul Pisupati, Luz Mayela Soto-Jiménez, Robin Burns, Fernando A. Rabanal, Magnus Nordborg
Format: Article
Language: English
Published: BMC 2023-03-01
Series: Genome Biology
Subjects: Structural variation, Gene duplication, GWAS, SNP calling, Methylation
Online Access: https://doi.org/10.1186/s13059-023-02875-3
collection DOAJ
description Abstract. Background: It is apparent that genomes harbor much structural variation that is largely undetected for technical reasons. Such variation can cause artifacts when short-read sequencing data are mapped to a reference genome. Spurious SNPs may result from mapping of reads to unrecognized duplicated regions. Calling SNPs using the raw reads of the 1001 Arabidopsis Genomes Project, we identified 3.3 million (44%) heterozygous SNPs. Given that Arabidopsis thaliana (A. thaliana) is highly selfing, and that extensively heterozygous individuals have been removed, we hypothesize that these SNPs reflect cryptic copy number variation. Results: The heterozygosity we observe consists of particular SNPs being heterozygous across individuals in a manner that strongly suggests it reflects shared segregating duplications rather than random tracts of residual heterozygosity due to occasional outcrossing. Focusing on such pseudo-heterozygosity in annotated genes, we use genome-wide association to map the position of the duplicates. We identify 2500 putatively duplicated genes and validate them using de novo genome assemblies from six lines. Specific examples include an annotated gene and nearby transposon that transpose together. We also demonstrate that cryptic structural variation produces highly inaccurate estimates of DNA methylation polymorphism. Conclusions: Our study confirms that most heterozygous SNP calls in A. thaliana are artifacts and suggests that great caution is needed when analyzing SNP data from short-read sequencing. The finding that 10% of annotated genes exhibit copy-number variation, and the realization that neither gene- nor transposon-annotation necessarily tells us what is actually mobile in the genome, suggest that future analyses based on independently assembled genomes will be very informative.
first_indexed 2024-04-09T22:54:12Z
id doaj.art-b0efb04d114a4efd8ce0c13e14749b02
institution Directory of Open Access Journals
issn 1474-760X
last_indexed 2024-04-09T22:54:12Z
spelling doaj.art-b0efb04d114a4efd8ce0c13e14749b02; 2023-03-22T11:22:17Z; eng; BMC; Genome Biology; 1474-760X; 2023-03-01; volume 24, issue 1, pages 1-19; 10.1186/s13059-023-02875-3
affiliations Benjamin Jaegle (Gregor Mendel Institute, Austrian Academy of Sciences)
Rahul Pisupati (Gregor Mendel Institute, Austrian Academy of Sciences)
Luz Mayela Soto-Jiménez (Gregor Mendel Institute, Austrian Academy of Sciences)
Robin Burns (Gregor Mendel Institute, Austrian Academy of Sciences)
Fernando A. Rabanal (Max Planck Institute for Developmental Biology)
Magnus Nordborg (Gregor Mendel Institute, Austrian Academy of Sciences)
topic Structural variation
Gene duplication
GWAS
SNP calling
Methylation