KMC3 and CHTKC: Best Scenarios, Deficiencies, and Challenges in High-Throughput Sequencing Data Analysis
Background: K-mer frequency counting is an upstream process of many bioinformatics data analysis workflows. KMC3 and CHTKC are the representative partition-based k-mer counting and non-partition-based k-mer counting algorithms, respectively. This paper evaluates the two algorithms and presents their...
Main Authors: | Deyou Tang, Daqiang Tan, Weihao Xiao, Jiabin Lin, Juan Fu |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2022-03-01 |
Series: | Algorithms |
Subjects: | k-mer; KMC; CHTKC; next generation sequencing; algorithm evaluation; optimization |
Online Access: | https://www.mdpi.com/1999-4893/15/4/107 |
_version_ | 1797437229276069888 |
---|---|
author | Deyou Tang; Daqiang Tan; Weihao Xiao; Jiabin Lin; Juan Fu |
author_sort | Deyou Tang |
collection | DOAJ |
description | Background: K-mer frequency counting is an upstream process of many bioinformatics data analysis workflows. KMC3 and CHTKC are the representative partition-based k-mer counting and non-partition-based k-mer counting algorithms, respectively. This paper evaluates the two algorithms and presents their best applicable scenarios and potential improvements using multiple hardware contexts and datasets. Results: KMC3 uses less memory and runs faster than CHTKC on a regular configuration server. CHTKC is efficient on high-performance computing platforms with large available memory, many threads, and low IO bandwidth. When tested with various datasets, KMC3 is less sensitive to the number of distinct k-mers and is more efficient for tasks with relatively low sequencing quality and long k-mers. CHTKC performs better than KMC3 in counting tasks with large-scale datasets, high sequencing quality, and short k-mers. Both algorithms are affected by IO bandwidth, and reducing the influence of the IO bottleneck is critical: our tests show an improvement from filtering and compressing consecutive first-occurring k-mers in KMC3. Conclusions: KMC3 is more competitive for running the counter on ordinary hardware resources, and CHTKC is more competitive for counting k-mers in super-scale datasets on higher-performance computing platforms. Reducing the influence of the IO bottleneck is essential for optimizing k-mer counting algorithms, and filtering and compressing low-frequency k-mers is critical in relieving IO impact. |
first_indexed | 2024-03-09T11:16:50Z |
format | Article |
id | doaj.art-5a599a4ab9d74e0da7a0fda7759fc74f |
institution | Directory Open Access Journal |
issn | 1999-4893 |
language | English |
last_indexed | 2024-03-09T11:16:50Z |
publishDate | 2022-03-01 |
publisher | MDPI AG |
record_format | Article |
series | Algorithms |
spelling | doaj.art-5a599a4ab9d74e0da7a0fda7759fc74f; 2023-12-01T00:28:39Z; eng; MDPI AG; Algorithms; ISSN 1999-4893; 2022-03-01; vol. 15, no. 4, art. 107; DOI 10.3390/a15040107; KMC3 and CHTKC: Best Scenarios, Deficiencies, and Challenges in High-Throughput Sequencing Data Analysis; Deyou Tang, Daqiang Tan, Weihao Xiao, Jiabin Lin (School of Software Engineering, South China University of Technology, Guangzhou 510006, China); Juan Fu (School of Medicine, South China University of Technology, Guangzhou 510006, China); https://www.mdpi.com/1999-4893/15/4/107; k-mer; KMC; CHTKC; next generation sequencing; algorithm evaluation; optimization |
title | KMC3 and CHTKC: Best Scenarios, Deficiencies, and Challenges in High-Throughput Sequencing Data Analysis |
topic | k-mer; KMC; CHTKC; next generation sequencing; algorithm evaluation; optimization |
url | https://www.mdpi.com/1999-4893/15/4/107 |
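
The description contrasts partition-based and non-partition-based k-mer counting. As a rough illustration of that distinction only, the Python sketch below counts k-mers two ways: with a single in-memory hash table, and by first distributing k-mers into bins and counting each bin separately. It is not the implementation of KMC3 or CHTKC; the function names, the toy reads, the bin count, and the character-sum binning rule are all assumptions made for this example.

```python
from collections import Counter


def kmers(read, k):
    """Yield every length-k substring of a read, skipping any that contain N."""
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if "N" not in kmer:
            yield kmer


def count_in_memory(reads, k):
    """Non-partitioned sketch: one hash table holds every k-mer."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def count_partitioned(reads, k, n_bins=4):
    """Partitioned sketch: distribute k-mers into bins, then count bin by bin."""
    bins = [[] for _ in range(n_bins)]
    for read in reads:
        for kmer in kmers(read, k):
            # Hypothetical binning rule: the sum of character codes is
            # deterministic, so equal k-mers always land in the same bin.
            bins[sum(map(ord, kmer)) % n_bins].append(kmer)
    counts = Counter()
    for bucket in bins:
        # Each bin can be counted independently of the others.
        counts.update(Counter(bucket))
    return counts


# Toy reads, invented for illustration.
reads = ["ACGTACGT", "CGTACGTA", "ACGTNACG"]
in_memory = count_in_memory(reads, k=3)
partitioned = count_partitioned(reads, k=3)
```

Both functions return the same counts; they differ only in how much has to be held in one table at a time. In this sketch the bins are plain Python lists, whereas a disk-based counter would write them to temporary files, which is where the IO bandwidth discussed in the description comes into play.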