Implementation of machine learning in DNA barcoding for determining the plant family taxonomy
The DNA barcoding approach has been used extensively in taxonomy and phylogenetics. The differences in certain DNA sequences are able to differentiate and help classify organisms into taxa. It has been used in cases of taxonomic disputes where morphology by itself is insufficient. This research aime...
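Main Authors: | Lala Septem Riza, Muhammad Iqbal Zain, Ahmad Izzuddin, Yudi Prasetyo, Topik Hidayat, Khyrina Airin Fariza Abu Samah |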
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier 2023-10-01 |
Series: | Heliyon |
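Subjects: | DNA barcoding; Unsupervised learning; Bioinformatics; Hierarchical clustering; Machine learning; Taxonomy |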
Online Access: | http://www.sciencedirect.com/science/article/pii/S2405844023073693 |
_version_ | 1797646559737806848 |
---|---|
author | Lala Septem Riza; Muhammad Iqbal Zain; Ahmad Izzuddin; Yudi Prasetyo; Topik Hidayat; Khyrina Airin Fariza Abu Samah |
author_facet | Lala Septem Riza; Muhammad Iqbal Zain; Ahmad Izzuddin; Yudi Prasetyo; Topik Hidayat; Khyrina Airin Fariza Abu Samah |
author_sort | Lala Septem Riza |
collection | DOAJ |
description | DNA barcoding has been used extensively in taxonomy and phylogenetics. Differences in certain DNA sequences can distinguish organisms and help classify them into taxa, and the approach has been applied to taxonomic disputes where morphology alone is insufficient. This research aimed to use hierarchical clustering, an unsupervised machine learning method, to resolve disputes in plant family taxonomy. The case study is Leguminosae, which some authors have historically split into three families (Fabaceae, Caesalpiniaceae, and Mimosaceae) while others treat it as a single family (Leguminosae). The study comprises four phases: (i) data collection, (ii) data preprocessing, (iii) finding the best distance method, and (iv) determining the disputed family. Data were collected from several sources, including the National Center for Biotechnology Information (NCBI), journals, and websites. The validation data, collected from NCBI, were used to determine the best distance method for differentiating families or genera. The data for the Leguminosae case study were collected from journals and a website. In our experiments, the Pearson method was the best distance method for clustering plant ITS sequences, in both accuracy and computational cost. We then applied the Pearson method to the disputed classification of Leguminosae and found that the group should be treated as one family. |
first_indexed | 2024-03-11T15:03:21Z |
format | Article |
id | doaj.art-08ac1e9924c14eb3b4394edaeba72661 |
institution | Directory Open Access Journal |
issn | 2405-8440 |
language | English |
last_indexed | 2024-03-11T15:03:21Z |
publishDate | 2023-10-01 |
publisher | Elsevier |
record_format | Article |
series | Heliyon |
spelling | doaj.art-08ac1e9924c14eb3b4394edaeba72661; 2023-10-30T06:05:26Z; eng; Elsevier; Heliyon; 2405-8440; 2023-10-01; 9; 10; e20161; Implementation of machine learning in DNA barcoding for determining the plant family taxonomy; Lala Septem Riza (0); Muhammad Iqbal Zain (1); Ahmad Izzuddin (2); Yudi Prasetyo (3); Topik Hidayat (4); Khyrina Airin Fariza Abu Samah (5); (0) Department of Computer Science Education, Universitas Pendidikan Indonesia, Bandung, Indonesia, corresponding author; (1) Department of Computer Science Education, Universitas Pendidikan Indonesia, Bandung, Indonesia; (2) Department of Computer Science Education, Universitas Pendidikan Indonesia, Bandung, Indonesia; (3) Department of Computer Science Education, Universitas Pendidikan Indonesia, Bandung, Indonesia; (4) Department of Biology Education, Universitas Pendidikan Indonesia, Bandung, Indonesia; (5) Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA Cawangan Melaka Kampus Jasin, Melaka, Malaysia; http://www.sciencedirect.com/science/article/pii/S2405844023073693; DNA barcoding; Unsupervised learning; Bioinformatics; Hierarchical clustering; Machine learning; Taxonomy |
title | Implementation of machine learning in DNA barcoding for determining the plant family taxonomy |
topic | DNA barcoding; Unsupervised learning; Bioinformatics; Hierarchical clustering; Machine learning; Taxonomy |
url | http://www.sciencedirect.com/science/article/pii/S2405844023073693 |
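
The description in this record names hierarchical clustering with a Pearson-based distance on ITS sequences, but gives no implementation details. The following is a minimal illustrative sketch of that general idea, not the authors' pipeline. Several choices are assumptions made only for illustration: the 2-mer count encoding of sequences (`kmer_profile`), the use of 1 minus the Pearson correlation as the distance, average linkage, and the toy sequences, which are invented and are not real ITS data. All function names are hypothetical.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k=2):
    """Count overlapping k-mers over ACGT (this encoding is an assumption)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip k-mers containing ambiguous bases such as N
            counts[word] += 1
    return [counts[w] for w in kmers]


def pearson_distance(x, y):
    """Return 1 - r, where r is the Pearson correlation of x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 1.0  # correlation is undefined for a constant profile
    return 1.0 - cov / (sx * sy)


def average_linkage_clusters(profiles, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    n = len(profiles)
    dist = {(i, j): pearson_distance(profiles[i], profiles[j])
            for i in range(n) for j in range(i + 1, n)}

    def linkage(a, b):
        # Average of all pairwise distances between members of a and b.
        pairs = [dist[(min(i, j), max(i, j))] for i in a for j in b]
        return sum(pairs) / len(pairs)

    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        candidates = ((p, q) for p in range(len(clusters))
                      for q in range(p + 1, len(clusters)))
        a, b = min(candidates,
                   key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy sequences (invented for illustration, not real ITS data).
sequences = ["ACACACACAC", "ACACACACAT", "GGGGGTGGGG", "GGGGGGGGTG"]
profiles = [kmer_profile(s) for s in sequences]
clusters = average_linkage_clusters(profiles, n_clusters=2)
print(clusters)  # [[0, 1], [2, 3]]
```

In a study like the one described, the resulting grouping would be compared against the known family or genus labels of the validation sequences to score each candidate distance method. For real data, an established library routine for hierarchical clustering would normally be preferable to this hand-written loop.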