Improving ISOMAP Efficiency with RKS: A Comparative Study with t-Distributed Stochastic Neighbor Embedding on Protein Sequences

Data visualization plays a crucial role in gaining insights from high-dimensional datasets. ISOMAP is a popular algorithm that maps high-dimensional data into a lower-dimensional space while preserving the underlying geometric structure. However, ISOMAP can be computationally expensive, especially f...


Bibliographic Details
Main Authors: Sarwan Ali, Murray Patterson
Format: Article
Language: English
Published: MDPI AG 2023-10-01
Series: J
Subjects: t-SNE, ISOMAP, data visualization, COVID-19
Online Access: https://www.mdpi.com/2571-8800/6/4/38
_version_ 1827574586514866176
author Sarwan Ali
Murray Patterson
collection DOAJ
description Data visualization plays a crucial role in gaining insights from high-dimensional datasets. ISOMAP is a popular algorithm that maps high-dimensional data into a lower-dimensional space while preserving the underlying geometric structure. However, ISOMAP can be computationally expensive, especially for large datasets, due to the computation of the pairwise distances between data points. The motivation behind this study is to improve efficiency by leveraging an approximate method, which is based on random kitchen sinks (RKS). This approach provides a faster way to compute the kernel matrix. Using RKS significantly reduces the computational complexity of ISOMAP while still obtaining a meaningful low-dimensional representation of the data. We compare the performance of the approximate ISOMAP approach using RKS with the traditional t-SNE algorithm. The comparison involves computing the distance matrix using the original high-dimensional data and the low-dimensional data computed from both t-SNE and ISOMAP. The quality of the low-dimensional embeddings is measured using several metrics, including mean squared error (MSE), mean absolute error (MAE), and explained variance score (EVS). Additionally, the runtime of each algorithm is recorded to assess its computational efficiency. The comparison is conducted on a set of protein sequences, used in many bioinformatics tasks. We use three different embedding methods based on k-mers, minimizers, and position weight matrix (PWM) to capture various aspects of the underlying structure and the relationships between the protein sequences. By comparing different embeddings and by evaluating the effectiveness of the approximate ISOMAP approach using RKS and comparing it against t-SNE, we provide insights into the efficacy of our proposed approach. Our goal is to retain the quality of the low-dimensional embeddings while improving the computational performance.
first_indexed 2024-03-08T20:40:16Z
format Article
id doaj.art-ddbfe7f4b15b43f9bcd10369cfccb09f
institution Directory Open Access Journal
issn 2571-8800
language English
last_indexed 2024-03-08T20:40:16Z
publishDate 2023-10-01
publisher MDPI AG
record_format Article
series J
spelling doaj.art-ddbfe7f4b15b43f9bcd10369cfccb09f
2023-12-22T14:16:37Z
eng
MDPI AG
J, ISSN 2571-8800, 2023-10-01, vol. 6, no. 4, pp. 579-591
10.3390/j6040038
Improving ISOMAP Efficiency with RKS: A Comparative Study with t-Distributed Stochastic Neighbor Embedding on Protein Sequences
Sarwan Ali (0): Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA
Murray Patterson (1): Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA
https://www.mdpi.com/2571-8800/6/4/38
t-SNE
ISOMAP
data visualization
COVID-19
title Improving ISOMAP Efficiency with RKS: A Comparative Study with t-Distributed Stochastic Neighbor Embedding on Protein Sequences
topic t-SNE
ISOMAP
data visualization
COVID-19
url https://www.mdpi.com/2571-8800/6/4/38
work_keys_str_mv AT sarwanali improvingisomapefficiencywithrksacomparativestudywithtdistributedstochasticneighborembeddingonproteinsequences
AT murraypatterson improvingisomapefficiencywithrksacomparativestudywithtdistributedstochasticneighborembeddingonproteinsequences
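
Illustrative note: the description above mentions k-mer based embeddings, an RKS approximation of the kernel matrix, and the MSE, MAE, and EVS metrics. The following is a minimal sketch of those three ideas in plain Python, not the authors' implementation. The toy peptide strings, the Gaussian kernel, the `gamma` value, the feature count, and all function names are assumptions made for illustration. The article applies the metrics to distance matrices of the original and embedded data; here they are applied to exact versus approximated kernel entries only to keep the example short.

```python
import math
import random
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def kmer_spectrum(seq, k=2):
    """Count-based k-mer vector (length 20**k) for a protein sequence."""
    index = {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=k))}
    vec = [0.0] * len(index)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # ignore k-mers containing non-standard residues
            vec[j] += 1.0
    return vec

def gaussian_kernel(x, y, gamma):
    """Exact kernel value exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rks_features(X, n_features=1000, gamma=0.05, seed=0):
    """Random Fourier features z(x) such that dot(z(x), z(y)) approximates
    the Gaussian kernel above (assumed kernel; weights ~ N(0, 2*gamma))."""
    rng = random.Random(seed)
    d = len(X[0])
    sigma = math.sqrt(2.0 * gamma)
    W = [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(n_features)]
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)
    return [[scale * math.cos(dot(w, x) + bj) for w, bj in zip(W, b)] for x in X]

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def explained_variance(y_true, y_pred):
    """EVS = 1 - Var(y_true - y_pred) / Var(y_true)."""
    def var(v):
        m = sum(v) / len(v)
        return sum((x - m) ** 2 for x in v) / len(v)
    resid = [a - b for a, b in zip(y_true, y_pred)]
    return 1.0 - var(resid) / var(y_true)

if __name__ == "__main__":
    # Hypothetical toy peptides, not taken from the article's dataset.
    seqs = ["MKTAYIAKQR", "MKTAYIAKQL", "GAVLIPFWMC"]
    X = [kmer_spectrum(s) for s in seqs]
    Z = rks_features(X)
    pairs = [(i, j) for i in range(len(X)) for j in range(i, len(X))]
    exact = [gaussian_kernel(X[i], X[j], 0.05) for i, j in pairs]
    approx = [dot(Z[i], Z[j]) for i, j in pairs]
    print("MSE:", mse(exact, approx))
    print("MAE:", mae(exact, approx))
    print("EVS:", explained_variance(exact, approx))
```

The random features replace an explicit n-by-n kernel computation with an n-by-D feature matrix, which is where the efficiency gain discussed in the description would come from; a larger feature count trades runtime for a closer approximation.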