Implementation of machine learning in DNA barcoding for determining the plant family taxonomy

DNA barcoding has been used extensively in taxonomy and phylogenetics. Differences in certain DNA sequences can differentiate organisms and help classify them into taxa. The approach has been applied to taxonomic disputes where morphology alone is insufficient. This research aime...

Bibliographic Details
Main Authors: Lala Septem Riza, Muhammad Iqbal Zain, Ahmad Izzuddin, Yudi Prasetyo, Topik Hidayat, Khyrina Airin Fariza Abu Samah
Format: Article
Language: English
Published: Elsevier 2023-10-01
Series: Heliyon
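Subjects: DNA barcoding, Unsupervised learning, Bioinformatics, Hierarchical clustering, Machine learning, Taxonomy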
Online Access: http://www.sciencedirect.com/science/article/pii/S2405844023073693
author Lala Septem Riza
Muhammad Iqbal Zain
Ahmad Izzuddin
Yudi Prasetyo
Topik Hidayat
Khyrina Airin Fariza Abu Samah
author_sort Lala Septem Riza
collection DOAJ
description DNA barcoding has been used extensively in taxonomy and phylogenetics. Differences in certain DNA sequences can differentiate organisms and help classify them into taxa. The approach has been applied to taxonomic disputes where morphology alone is insufficient. This research aimed to use hierarchical clustering, an unsupervised machine learning method, to resolve disputes in plant family taxonomy. We take as a case study the Leguminosae, which some authors have historically classified into three families (Fabaceae, Caesalpiniaceae, and Mimosaceae) and others into a single family (Leguminosae). The study is divided into four phases: (i) data collection, (ii) data preprocessing, (iii) finding the best distance method, and (iv) determining the disputed family. The data were collected from several sources, including the National Center for Biotechnology Information (NCBI), journals, and websites. The data for validating the methods were collected from NCBI and were used to determine the best distance method for differentiating families or genera. The data for the Leguminosae case study were collected from journals and a website. In our experiments, the Pearson method was the best distance method for clustering plant internal transcribed spacer (ITS) sequences, in both accuracy and computational cost. We then used the Pearson method to resolve the disputed family classification within Leguminosae. Our results indicate that the Leguminosae should be grouped into one family.
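The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the record does not say how sequences were encoded or which linkage was used, so the k-mer count encoding (`kmer_vector`), the average linkage (`average_linkage`), and the toy sequences are all assumptions made for illustration. Only the use of a Pearson-based distance (here 1 minus the Pearson correlation) with hierarchical clustering comes from the abstract.

```python
from itertools import product
from math import sqrt


def kmer_vector(seq, k=2):
    """Count overlapping k-mers over the alphabet ACGT (an assumed encoding)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {km: 0 for km in kmers}
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in counts:  # skip k-mers containing ambiguous bases such as N
            counts[km] += 1
    return [counts[km] for km in kmers]


def pearson_distance(x, y):
    """Return 1 - r, where r is the Pearson correlation of x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 1.0  # correlation is undefined for a constant vector
    return 1.0 - cov / (sx * sy)


def average_linkage(vectors, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(vectors))]

    def dist(a, b):
        # Average pairwise Pearson distance between members of two clusters.
        pairs = [(i, j) for i in a for j in b]
        total = sum(pearson_distance(vectors[i], vectors[j]) for i, j in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        a, b = min(
            ((p, q) for p in range(len(clusters))
             for q in range(p + 1, len(clusters))),
            key=lambda pq: dist(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical toy sequences, not real ITS data.
toy = {
    "at_rich_1": "ATATATATTATAATATATAT",
    "at_rich_2": "TATATAATATATTATATATA",
    "gc_rich_1": "GCGCGGCGCCGCGCGGCGCG",
    "gc_rich_2": "CGCGCCGCGGCGCGCCGCGC",
}
names = list(toy)
vectors = [kmer_vector(toy[n]) for n in names]
groups = average_linkage(vectors, n_clusters=2)
labelled = sorted(sorted(names[i] for i in g) for g in groups)
print(labelled)
```

In a real analysis the toy dictionary would be replaced by aligned or k-mer-encoded ITS sequences, and the number of clusters would be compared against the competing hypotheses (one family versus three).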
first_indexed 2024-03-11T15:03:21Z
format Article
id doaj.art-08ac1e9924c14eb3b4394edaeba72661
institution Directory of Open Access Journals
issn 2405-8440
language English
last_indexed 2024-03-11T15:03:21Z
publishDate 2023-10-01
publisher Elsevier
record_format Article
series Heliyon
spelling doaj.art-08ac1e9924c14eb3b4394edaeba72661 2023-10-30T06:05:26Z eng Elsevier Heliyon 2405-8440 2023-10-01 vol. 9, no. 10, e20161
Implementation of machine learning in DNA barcoding for determining the plant family taxonomy
Lala Septem Riza (0), Muhammad Iqbal Zain (1), Ahmad Izzuddin (2), Yudi Prasetyo (3), Topik Hidayat (4), Khyrina Airin Fariza Abu Samah (5)
(0) Department of Computer Science Education, Universitas Pendidikan Indonesia, Bandung, Indonesia; Corresponding author.
(1) Department of Computer Science Education, Universitas Pendidikan Indonesia, Bandung, Indonesia
(2) Department of Computer Science Education, Universitas Pendidikan Indonesia, Bandung, Indonesia
(3) Department of Computer Science Education, Universitas Pendidikan Indonesia, Bandung, Indonesia
(4) Department of Biology Education, Universitas Pendidikan Indonesia, Bandung, Indonesia
(5) Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA Cawangan Melaka Kampus Jasin, Melaka, Malaysia
http://www.sciencedirect.com/science/article/pii/S2405844023073693
DNA barcoding; Unsupervised learning; Bioinformatics; Hierarchical clustering; Machine learning; Taxonomy
title Implementation of machine learning in DNA barcoding for determining the plant family taxonomy
topic DNA barcoding
Unsupervised learning
Bioinformatics
Hierarchical clustering
Machine learning
Taxonomy
url http://www.sciencedirect.com/science/article/pii/S2405844023073693