Extensive sequence duplication in Arabidopsis revealed by pseudo-heterozygosity
Abstract Background It is apparent that genomes harbor much structural variation that is largely undetected for technical reasons. Such variation can cause artifacts when short-read sequencing data are mapped to a reference genome. Spurious SNPs may result from mapping of reads to unrecognized dupli...
Main Authors: | Benjamin Jaegle, Rahul Pisupati, Luz Mayela Soto-Jiménez, Robin Burns, Fernando A. Rabanal, Magnus Nordborg |
---|---|
Format: | Article |
Language: | English |
Published: | BMC 2023-03-01 |
Series: | Genome Biology |
Subjects: | Structural variation, Gene duplication, GWAS, SNP calling, Methylation |
Online Access: | https://doi.org/10.1186/s13059-023-02875-3 |
_version_ | 1827984076595789824 |
---|---|
author | Benjamin Jaegle Rahul Pisupati Luz Mayela Soto-Jiménez Robin Burns Fernando A. Rabanal Magnus Nordborg |
author_facet | Benjamin Jaegle Rahul Pisupati Luz Mayela Soto-Jiménez Robin Burns Fernando A. Rabanal Magnus Nordborg |
author_sort | Benjamin Jaegle |
collection | DOAJ |
description | Abstract. Background: It is apparent that genomes harbor much structural variation that is largely undetected for technical reasons. Such variation can cause artifacts when short-read sequencing data are mapped to a reference genome. Spurious SNPs may result from mapping of reads to unrecognized duplicated regions. Calling SNPs using the raw reads of the 1001 Arabidopsis Genomes Project, we identified 3.3 million (44%) heterozygous SNPs. Given that Arabidopsis thaliana (A. thaliana) is highly selfing, and that extensively heterozygous individuals have been removed, we hypothesize that these SNPs reflect cryptic copy number variation. Results: The heterozygosity we observe consists of particular SNPs being heterozygous across individuals in a manner that strongly suggests it reflects shared segregating duplications rather than random tracts of residual heterozygosity due to occasional outcrossing. Focusing on such pseudo-heterozygosity in annotated genes, we use genome-wide association to map the position of the duplicates. We identify 2500 putatively duplicated genes and validate them using de novo genome assemblies from six lines. Specific examples include an annotated gene and nearby transposon that transpose together. We also demonstrate that cryptic structural variation produces highly inaccurate estimates of DNA methylation polymorphism. Conclusions: Our study confirms that most heterozygous SNP calls in A. thaliana are artifacts and suggests that great caution is needed when analyzing SNP data from short-read sequencing. The finding that 10% of annotated genes exhibit copy-number variation, and the realization that neither gene- nor transposon-annotation necessarily tells us what is actually mobile in the genome, suggest that future analyses based on independently assembled genomes will be very informative. |
first_indexed | 2024-04-09T22:54:12Z |
format | Article |
id | doaj.art-b0efb04d114a4efd8ce0c13e14749b02 |
institution | Directory Open Access Journal |
issn | 1474-760X |
language | English |
last_indexed | 2024-04-09T22:54:12Z |
publishDate | 2023-03-01 |
publisher | BMC |
record_format | Article |
series | Genome Biology |
spelling | doaj.art-b0efb04d114a4efd8ce0c13e14749b02; 2023-03-22T11:22:17Z; eng; BMC; Genome Biology; 1474-760X; 2023-03-01; 241119; 10.1186/s13059-023-02875-3; Extensive sequence duplication in Arabidopsis revealed by pseudo-heterozygosity; Benjamin Jaegle (0), Rahul Pisupati (1), Luz Mayela Soto-Jiménez (2), Robin Burns (3), Fernando A. Rabanal (4), Magnus Nordborg (5); Gregor Mendel Institute, Austrian Academy of Sciences; Gregor Mendel Institute, Austrian Academy of Sciences; Gregor Mendel Institute, Austrian Academy of Sciences; Gregor Mendel Institute, Austrian Academy of Sciences; Max Planck Institute for Developmental Biology; Gregor Mendel Institute, Austrian Academy of Sciences; Abstract. Background: It is apparent that genomes harbor much structural variation that is largely undetected for technical reasons. Such variation can cause artifacts when short-read sequencing data are mapped to a reference genome. Spurious SNPs may result from mapping of reads to unrecognized duplicated regions. Calling SNPs using the raw reads of the 1001 Arabidopsis Genomes Project, we identified 3.3 million (44%) heterozygous SNPs. Given that Arabidopsis thaliana (A. thaliana) is highly selfing, and that extensively heterozygous individuals have been removed, we hypothesize that these SNPs reflect cryptic copy number variation. Results: The heterozygosity we observe consists of particular SNPs being heterozygous across individuals in a manner that strongly suggests it reflects shared segregating duplications rather than random tracts of residual heterozygosity due to occasional outcrossing. Focusing on such pseudo-heterozygosity in annotated genes, we use genome-wide association to map the position of the duplicates. We identify 2500 putatively duplicated genes and validate them using de novo genome assemblies from six lines. Specific examples include an annotated gene and nearby transposon that transpose together. We also demonstrate that cryptic structural variation produces highly inaccurate estimates of DNA methylation polymorphism. Conclusions: Our study confirms that most heterozygous SNP calls in A. thaliana are artifacts and suggests that great caution is needed when analyzing SNP data from short-read sequencing. The finding that 10% of annotated genes exhibit copy-number variation, and the realization that neither gene- nor transposon-annotation necessarily tells us what is actually mobile in the genome, suggest that future analyses based on independently assembled genomes will be very informative.; https://doi.org/10.1186/s13059-023-02875-3; Structural variation; Gene duplication; GWAS; SNP calling; Methylation |
spellingShingle | Benjamin Jaegle Rahul Pisupati Luz Mayela Soto-Jiménez Robin Burns Fernando A. Rabanal Magnus Nordborg Extensive sequence duplication in Arabidopsis revealed by pseudo-heterozygosity Genome Biology Structural variation Gene duplication GWAS SNP calling Methylation |
title | Extensive sequence duplication in Arabidopsis revealed by pseudo-heterozygosity |
title_full | Extensive sequence duplication in Arabidopsis revealed by pseudo-heterozygosity |
title_fullStr | Extensive sequence duplication in Arabidopsis revealed by pseudo-heterozygosity |
title_full_unstemmed | Extensive sequence duplication in Arabidopsis revealed by pseudo-heterozygosity |
title_short | Extensive sequence duplication in Arabidopsis revealed by pseudo-heterozygosity |
title_sort | extensive sequence duplication in arabidopsis revealed by pseudo heterozygosity |
topic | Structural variation Gene duplication GWAS SNP calling Methylation |
url | https://doi.org/10.1186/s13059-023-02875-3 |
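
The abstract's central observation is that, in a highly selfing species, SNPs called heterozygous in many individuals are more plausibly artifacts of unrecognized duplications than true heterozygosity. The following is a minimal sketch of that idea only, not the authors' pipeline: the genotype coding (0, 1, 2, `None`), the function names, and the `min_fraction` threshold are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence

# Assumed genotype coding (hypothetical, not from the record):
# 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate,
# None = missing call.
HET = 1

Genotype = Optional[int]


def het_fraction_per_snp(genotypes: Sequence[Sequence[Genotype]]) -> List[float]:
    """Return, for each SNP, the fraction of non-missing calls that are heterozygous.

    genotypes[i][j] is the call for individual i at SNP j.
    """
    n_snps = len(genotypes[0]) if genotypes else 0
    fractions = []
    for j in range(n_snps):
        calls = [row[j] for row in genotypes if row[j] is not None]
        n_het = sum(1 for g in calls if g == HET)
        fractions.append(n_het / len(calls) if calls else 0.0)
    return fractions


def flag_pseudo_heterozygous(
    genotypes: Sequence[Sequence[Genotype]], min_fraction: float = 0.05
) -> List[int]:
    """Return indices of SNPs whose heterozygous fraction is at least min_fraction.

    The threshold is an illustrative assumption; a real analysis would choose
    it from the data and the expected outcrossing rate.
    """
    fractions = het_fraction_per_snp(genotypes)
    return [j for j, f in enumerate(fractions) if f >= min_fraction]


# Toy data: 4 individuals x 4 SNPs. SNP 1 is heterozygous in 3 of 4 individuals.
example = [
    [0, 1, 2, 0],
    [0, 1, 0, None],
    [2, 1, 0, 0],
    [0, 0, 2, 1],
]
print(flag_pseudo_heterozygous(example, min_fraction=0.5))
```

SNPs flagged this way are only candidates; per the abstract, the study goes further by mapping duplicate positions with genome-wide association and validating them against de novo assemblies.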