KMC3 and CHTKC: Best Scenarios, Deficiencies, and Challenges in High-Throughput Sequencing Data Analysis

Background: K-mer frequency counting is an upstream process of many bioinformatics data analysis workflows. KMC3 and CHTKC are the representative partition-based k-mer counting and non-partition-based k-mer counting algorithms, respectively. This paper evaluates the two algorithms and presents their...

Bibliographic Details
Main Authors:	Deyou Tang, Daqiang Tan, Weihao Xiao, Jiabin Lin, Juan Fu
Format:	Article
Language:	English
Published:	MDPI AG 2022-03-01
Series:	Algorithms
Subjects:	k-mer; KMC; CHTKC; next generation sequencing; algorithm evaluation; optimization
Online Access:	https://www.mdpi.com/1999-4893/15/4/107

collection	DOAJ
description	Background: K-mer frequency counting is an upstream step in many bioinformatics data analysis workflows. KMC3 and CHTKC are representative partition-based and non-partition-based k-mer counting algorithms, respectively. This paper evaluates the two algorithms on multiple hardware configurations and datasets and presents their best applicable scenarios and potential improvements. Results: KMC3 uses less memory and runs faster than CHTKC on a server with a regular configuration. CHTKC is efficient on high-performance computing platforms with ample available memory, multiple threads, and low IO bandwidth. Across the datasets tested, KMC3 is less sensitive to the number of distinct k-mers and is more efficient for tasks with relatively low sequencing quality and long k-mers. CHTKC performs better than KMC3 on counting tasks with large-scale datasets, high sequencing quality, and short k-mers. Both algorithms are affected by IO bandwidth, and reducing the influence of the IO bottleneck is critical: our tests show an improvement from filtering and compressing consecutive first-occurring k-mers in KMC3. Conclusions: KMC3 is more competitive for running the counter on ordinary hardware resources, and CHTKC is more competitive for counting k-mers in super-scale datasets on higher-performance computing platforms. Reducing the influence of the IO bottleneck is essential for optimizing k-mer counting algorithms, and filtering and compressing low-frequency k-mers is critical to relieving the IO impact.
first_indexed	2024-03-09T11:16:50Z
format	Article
id	doaj.art-5a599a4ab9d74e0da7a0fda7759fc74f
institution	Directory Open Access Journal
issn	1999-4893
language	English
last_indexed	2024-03-09T11:16:50Z
publishDate	2022-03-01
publisher	MDPI AG
record_format	Article
series	Algorithms
spelling	doaj.art-5a599a4ab9d74e0da7a0fda7759fc74f; 2023-12-01T00:28:39Z; eng; MDPI AG; Algorithms; ISSN 1999-4893; 2022-03-01; vol. 15, no. 4, art. 107; doi:10.3390/a15040107
	Deyou Tang, Daqiang Tan, Weihao Xiao, Jiabin Lin: School of Software Engineering, South China University of Technology, Guangzhou 510006, China
	Juan Fu: School of Medicine, South China University of Technology, Guangzhou 510006, China

Similar Items