Hash encoding on nucleotide acids for classification

Extracting meaningful information from DNA is a key element of bioinformatics research, and DNA sequence classification has a wide range of applications. In recent years, Machine Learning and Deep Learning techniques have become popular; Convolutional Neural Networks (CNN) in particular have been widely...

Bibliographic Details
Main Author: Ni, Wei
Other Authors: Kwoh Chee Keong
Format: Final Year Project (FYP)
Language: English
Published: Nanyang Technological University 2021
Subjects: Engineering::Computer science and engineering
Online Access: https://hdl.handle.net/10356/147987
_version_ 1826121361397907456
author Ni, Wei
author2 Kwoh Chee Keong
author_facet Kwoh Chee Keong
Ni, Wei
author_sort Ni, Wei
collection NTU
description Extracting meaningful information from DNA is a key element of bioinformatics research, and DNA sequence classification has a wide range of applications. In recent years, Machine Learning and Deep Learning techniques have become popular; Convolutional Neural Networks (CNN) in particular have been widely used because of their high accuracy. To employ CNN or other Machine Learning/Deep Learning techniques for DNA/RNA classification or other discovery tasks, the input sequences must be numeric. Encoding is therefore required to convert the sequences into a vector or multi-dimensional matrix. The objective of this project was to find a more suitable way of encoding DNA/RNA sequences for classification. Three encoding methods (hash encoding, one-hot encoding, and ordinal encoding) were applied to two datasets, and the encoded data were fed to different Deep Learning models, including FNN and CNN, and to Machine Learning models for classification. The performance of each encoding method was evaluated. The study suggests that hash encoding is an efficient encoding for both binary and multi-class classification problems. One-hot encoding and ordinal encoding are suitable only for the smaller dataset with data of uniform length. On the same dataset, one-hot encoding performed better than ordinal encoding.
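The three encodings named in the description can be illustrated with a minimal Python sketch. This record does not describe the project's actual encoding schemes, so everything here is an assumption made for illustration: the function names, the A/C/G/T alphabet, and in particular the hash variant, which is shown as generic feature hashing of overlapping k-mers (with hypothetical parameters `k` and `n_buckets`) and may differ from the method used in the project.

```python
import hashlib

NUCLEOTIDES = "ACGT"  # assumed DNA alphabet; any other symbol is treated as unknown


def ordinal_encode(seq):
    """Map each base to an integer (A=1, C=2, G=3, T=4); unknown symbols map to 0."""
    table = {base: i + 1 for i, base in enumerate(NUCLEOTIDES)}
    return [table.get(base, 0) for base in seq.upper()]


def one_hot_encode(seq):
    """Map each base to a 4-element indicator vector; unknown symbols map to all zeros."""
    return [[1 if base == n else 0 for n in NUCLEOTIDES] for base in seq.upper()]


def hash_encode(seq, k=3, n_buckets=16):
    """Count overlapping k-mers into a fixed number of hashed buckets.

    Illustrative feature hashing only, not the project's confirmed scheme.
    """
    vec = [0] * n_buckets
    s = seq.upper()
    for i in range(len(s) - k + 1):
        kmer = s[i:i + k]
        # md5 is used only as a stable, platform-independent hash function
        digest = hashlib.md5(kmer.encode("ascii")).digest()
        bucket = int.from_bytes(digest[:4], "big") % n_buckets
        vec[bucket] += 1
    return vec


if __name__ == "__main__":
    print(ordinal_encode("ACGT"))
    print(one_hot_encode("AC"))
    print(hash_encode("ACGTACGT"))
```

In this sketch the ordinal and one-hot outputs grow with the sequence length, while the hashed vector always has `n_buckets` entries. That is consistent with the description's remark that one-hot and ordinal encoding suit data of uniform length, although the record itself does not state the reason.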
first_indexed 2024-10-01T05:31:24Z
format Final Year Project (FYP)
id ntu-10356/147987
institution Nanyang Technological University
language English
last_indexed 2024-10-01T05:31:24Z
publishDate 2021
publisher Nanyang Technological University
record_format dspace
spelling ntu-10356/147987 2021-04-16T05:01:45Z Hash encoding on nucleotide acids for classification Ni, Wei Kwoh Chee Keong School of Computer Science and Engineering ASCKKWOH@ntu.edu.sg Engineering::Computer science and engineering Bachelor of Engineering (Computer Science) 2021-04-16T05:01:45Z 2021-04-16T05:01:45Z 2021 Final Year Project (FYP) Ni, W. (2021). Hash encoding on nucleotide acids for classification. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/147987 en PSCSE19-0040 application/pdf Nanyang Technological University
spellingShingle Engineering::Computer science and engineering
Ni, Wei
Hash encoding on nucleotide acids for classification
title Hash encoding on nucleotide acids for classification
title_sort hash encoding on nucleotide acids for classification
topic Engineering::Computer science and engineering
url https://hdl.handle.net/10356/147987
work_keys_str_mv AT niwei hashencodingonnucleotideacidsforclassification