GBC: a parallel toolkit based on highly addressable byte-encoding blocks for extremely large-scale genotypes of species

Abstract Whole-genome sequencing projects of millions of subjects contain enormous numbers of genotypes, entailing a huge memory burden and long computation times. Here, we present GBC, a toolkit for rapidly compressing large-scale genotypes into highly addressable byte-encoding blocks under an optimized parall...

Bibliographic Details
Main Authors: Liubin Zhang, Yangyang Yuan, Wenjie Peng, Bin Tang, Mulin Jun Li, Hongsheng Gui, Qiang Wang, Miaoxin Li
Format: Article
Language: English
Published: BMC 2023-04-01
Series: Genome Biology
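Subjects: Large-scale genotypes; Genotype compression; Highly addressable genotype blocks; Byte-encoding genotypes; Genotype management; Parallelization algorithm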
Online Access: https://doi.org/10.1186/s13059-023-02906-z
author Liubin Zhang
Yangyang Yuan
Wenjie Peng
Bin Tang
Mulin Jun Li
Hongsheng Gui
Qiang Wang
Miaoxin Li
author_sort Liubin Zhang
collection DOAJ
description Abstract Whole-genome sequencing projects of millions of subjects contain enormous numbers of genotypes, entailing a huge memory burden and long computation times. Here, we present GBC, a toolkit for rapidly compressing large-scale genotypes into highly addressable byte-encoding blocks under an optimized parallel framework. We demonstrate that GBC is up to 1000 times faster than state-of-the-art methods at accessing and managing compressed large-scale genotypes while maintaining a competitive compression ratio. We also show that conventional analyses can be substantially sped up when built on GBC to access the genotypes of a large population. GBC’s data structure and algorithms are valuable for accelerating large-scale genomic research.
first_indexed 2024-04-09T16:24:26Z
format Article
id doaj.art-d4f6f3be3aab4bcba4e0de1f78462441
institution Directory Open Access Journal
issn 1474-760X
language English
last_indexed 2024-04-09T16:24:26Z
publishDate 2023-04-01
publisher BMC
record_format Article
series Genome Biology
spelling doaj.art-d4f6f3be3aab4bcba4e0de1f78462441
2023-04-23T11:18:56Z
eng
BMC
Genome Biology
1474-760X
2023-04-01
241122
10.1186/s13059-023-02906-z
GBC: a parallel toolkit based on highly addressable byte-encoding blocks for extremely large-scale genotypes of species
Liubin Zhang (0): Program in Bioinformatics, Zhongshan School of Medicine and The Fifth Affiliated Hospital, Sun Yat-Sen University
Yangyang Yuan (1): Program in Bioinformatics, Zhongshan School of Medicine and The Fifth Affiliated Hospital, Sun Yat-Sen University
Wenjie Peng (2): Program in Bioinformatics, Zhongshan School of Medicine and The Fifth Affiliated Hospital, Sun Yat-Sen University
Bin Tang (3): Program in Bioinformatics, Zhongshan School of Medicine and The Fifth Affiliated Hospital, Sun Yat-Sen University
Mulin Jun Li (4): The Province and Ministry Co-Sponsored Collaborative Innovation Center for Medical Epigenetics, Tianjin Medical University
Hongsheng Gui (5): Behavioral Health Services, Henry Ford Health
Qiang Wang (6): Mental Health Center, West China Hospital, Sichuan University
Miaoxin Li (7): Program in Bioinformatics, Zhongshan School of Medicine and The Fifth Affiliated Hospital, Sun Yat-Sen University
Abstract Whole-genome sequencing projects of millions of subjects contain enormous numbers of genotypes, entailing a huge memory burden and long computation times. Here, we present GBC, a toolkit for rapidly compressing large-scale genotypes into highly addressable byte-encoding blocks under an optimized parallel framework. We demonstrate that GBC is up to 1000 times faster than state-of-the-art methods at accessing and managing compressed large-scale genotypes while maintaining a competitive compression ratio. We also show that conventional analyses can be substantially sped up when built on GBC to access the genotypes of a large population. GBC’s data structure and algorithms are valuable for accelerating large-scale genomic research.
https://doi.org/10.1186/s13059-023-02906-z
Large-scale genotypes
Genotype compression
Highly addressable genotype blocks
Byte-encoding genotypes
Genotype management
Parallelization algorithm
title GBC: a parallel toolkit based on highly addressable byte-encoding blocks for extremely large-scale genotypes of species
topic Large-scale genotypes
Genotype compression
Highly addressable genotype blocks
Byte-encoding genotypes
Genotype management
Parallelization algorithm
url https://doi.org/10.1186/s13059-023-02906-z