KMC3 and CHTKC: Best Scenarios, Deficiencies, and Challenges in High-Throughput Sequencing Data Analysis

Background: K-mer frequency counting is an upstream step in many bioinformatics data analysis workflows. KMC3 and CHTKC are representative partition-based and non-partition-based k-mer counting algorithms, respectively. This paper evaluates the two algorithms and presents their...

Bibliographic Details
Main Authors: Deyou Tang, Daqiang Tan, Weihao Xiao, Jiabin Lin, Juan Fu
Format: Article
Language: English
Published: MDPI AG 2022-03-01
Series: Algorithms
Subjects: k-mer, KMC, CHTKC, next generation sequencing, algorithm evaluation, optimization
Online Access: https://www.mdpi.com/1999-4893/15/4/107
author Deyou Tang
Daqiang Tan
Weihao Xiao
Jiabin Lin
Juan Fu
collection DOAJ
description Background: K-mer frequency counting is an upstream step in many bioinformatics data analysis workflows. KMC3 and CHTKC are representative partition-based and non-partition-based k-mer counting algorithms, respectively. This paper evaluates the two algorithms across multiple hardware contexts and datasets and presents their best applicable scenarios and potential improvements. Results: KMC3 uses less memory and runs faster than CHTKC on a server with a regular configuration. CHTKC is efficient on high-performance computing platforms with high available memory, multi-threading, and low IO bandwidth. When tested with various datasets, KMC3 is less sensitive to the number of distinct k-mers and is more efficient for tasks with relatively low sequencing quality and long k-mers. CHTKC performs better than KMC3 in counting tasks with large-scale datasets, high sequencing quality, and short k-mers. Both algorithms are affected by IO bandwidth, and reducing the influence of the IO bottleneck is critical: our tests show an improvement from filtering and compressing consecutive first-occurring k-mers in KMC3. Conclusions: KMC3 is more competitive for k-mer counting on ordinary hardware resources, and CHTKC is more competitive for counting k-mers in super-scale datasets on higher-performance computing platforms. Reducing the influence of the IO bottleneck is essential for optimizing k-mer counting algorithms, and filtering and compressing low-frequency k-mers is critical in relieving the IO impact.
first_indexed 2024-03-09T11:16:50Z
format Article
id doaj.art-5a599a4ab9d74e0da7a0fda7759fc74f
institution Directory Open Access Journal
issn 1999-4893
language English
last_indexed 2024-03-09T11:16:50Z
publishDate 2022-03-01
publisher MDPI AG
record_format Article
series Algorithms
spelling doaj.art-5a599a4ab9d74e0da7a0fda7759fc74f
2023-12-01T00:28:39Z
eng
MDPI AG
Algorithms
1999-4893
2022-03-01
Volume 15, Issue 4, Article 107
DOI: 10.3390/a15040107
KMC3 and CHTKC: Best Scenarios, Deficiencies, and Challenges in High-Throughput Sequencing Data Analysis
Deyou Tang, School of Software Engineering, South China University of Technology, Guangzhou 510006, China
Daqiang Tan, School of Software Engineering, South China University of Technology, Guangzhou 510006, China
Weihao Xiao, School of Software Engineering, South China University of Technology, Guangzhou 510006, China
Jiabin Lin, School of Software Engineering, South China University of Technology, Guangzhou 510006, China
Juan Fu, School of Medicine, South China University of Technology, Guangzhou 510006, China
https://www.mdpi.com/1999-4893/15/4/107
k-mer
KMC
CHTKC
next generation sequencing
algorithm evaluation
optimization
title KMC3 and CHTKC: Best Scenarios, Deficiencies, and Challenges in High-Throughput Sequencing Data Analysis
topic k-mer
KMC
CHTKC
next generation sequencing
algorithm evaluation
optimization
url https://www.mdpi.com/1999-4893/15/4/107