OsamorSoft: clustering index for comparison and quality validation in high throughput dataset

Abstract Differences in the results obtained from different k-means clustering algorithms call for a simplified approach to validating cluster quality. This is partly because the algorithms differ in how they select their first seed or centroid...


Bibliographic Details
Main Authors: Ifeoma Patricia Osamor, Victor Chukwudi Osamor
Format: Article
Language: English
Published: SpringerOpen 2020-07-01
Series: Journal of Big Data
Subjects: Clustering index, Algorithms, OsamorSoft, Validation, Rand, Automation
Online Access: http://link.springer.com/article/10.1186/s40537-020-00325-6
author Ifeoma Patricia Osamor
Victor Chukwudi Osamor
collection DOAJ
description Abstract Differences in the results obtained from different k-means clustering algorithms call for a simplified approach to validating cluster quality. This is partly because the algorithms differ in how they select their first seed or centroid, whether randomly, sequentially, or by some other principle, and this choice tends to influence the final result. Popular external cluster quality validation and comparison models require the computation of various clustering indexes such as Rand, Jaccard, Fowlkes and Mallows, the Morey and Agresti Adjusted Rand Index (ARIMA), and the Hubert and Arabie Adjusted Rand Index (ARIHA). In the literature, ARIHA has been adjudged a good measure of cluster validity. Based on ARIHA as a popular clustering quality index, we developed OsamorSoft, which comprises DNA_Omatrix and OsamorSpreadSheet, as a tool for cluster quality validation in high throughput analysis. The proposed method will help to bridge the gap created by the small number of user-friendly tools available to externally evaluate the ever-increasing number of clustering algorithms. Our implementation was tested on clusters created with four k-means algorithms using malaria microarray data. Furthermore, our results yielded a compact 4-stage OsamorSpreadSheet statistic that OsamorSoft, our easy-to-use Java GUI and spreadsheet-based tool, uses for cluster quality comparison. It is recommended that a framework be developed to facilitate the simplified integration and automation of several other cluster validity indexes for comparative analysis of big data problems.
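The external indexes named in the abstract can all be derived from pair counts over two partitions of the same items. The following is a minimal Python sketch of that computation, not the OsamorSoft implementation: the function names `pair_counts` and `external_indexes` and the result keys are hypothetical, and the formulas are the standard textbook definitions of the Rand, Jaccard, Fowlkes and Mallows, and Hubert and Arabie adjusted Rand indexes.

```python
from collections import Counter
from math import comb, sqrt


def pair_counts(labels_u, labels_v):
    """Count item pairs by agreement between two partitions.

    Returns (a, b, c, d): pairs together in both partitions, together
    only in U, together only in V, and separated in both.
    """
    if len(labels_u) != len(labels_v):
        raise ValueError("both partitions must label the same items")
    n = len(labels_u)
    # Contingency table n_ij plus its row and column sums.
    cells = Counter(zip(labels_u, labels_v))
    rows = Counter(labels_u)
    cols = Counter(labels_v)
    a = sum(comb(k, 2) for k in cells.values())
    b = sum(comb(k, 2) for k in rows.values()) - a
    c = sum(comb(k, 2) for k in cols.values()) - a
    d = comb(n, 2) - a - b - c
    return a, b, c, d


def external_indexes(labels_u, labels_v):
    """Return Rand, Jaccard, Fowlkes-Mallows and ARI (Hubert and Arabie)."""
    a, b, c, d = pair_counts(labels_u, labels_v)
    total = a + b + c + d
    if total == 0:
        raise ValueError("need at least two items")
    # Chance-expected and maximum values of the pair-agreement count a.
    expected = (a + b) * (a + c) / total
    maximum = ((a + b) + (a + c)) / 2
    # Assumption: degenerate cases with a zero denominator are reported
    # as perfect agreement (1.0); other conventions exist.
    return {
        "rand": (a + d) / total,
        "jaccard": a / (a + b + c) if (a + b + c) else 1.0,
        "fowlkes_mallows": a / sqrt((a + b) * (a + c)) if (a + b) * (a + c) else 1.0,
        "ari_ha": (a - expected) / (maximum - expected) if maximum != expected else 1.0,
    }


if __name__ == "__main__":
    u = [0, 0, 0, 1, 1, 1]
    v = [0, 0, 1, 1, 2, 2]
    for name, value in external_indexes(u, v).items():
        print(f"{name}: {value:.4f}")
```

For the two example partitions the adjusted Rand index is about 0.24 while the plain Rand index is about 0.67, which illustrates why a chance-corrected index is often preferred when comparing clusterings.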
first_indexed 2024-12-13T08:48:15Z
format Article
id doaj.art-2865039358994fc3b2dc3eef4e3ec539
institution Directory of Open Access Journals
issn 2196-1115
language English
last_indexed 2024-12-13T08:48:15Z
publishDate 2020-07-01
publisher SpringerOpen
record_format Article
series Journal of Big Data
spelling doaj.art-2865039358994fc3b2dc3eef4e3ec539 (2022-12-21T23:53:25Z). Ifeoma Patricia Osamor (Department of Accounting, Faculty of Management Sciences, Lagos State University); Victor Chukwudi Osamor (Department of Computer and Information Sciences, College of Science and Technology, Covenant University). OsamorSoft: clustering index for comparison and quality validation in high throughput dataset. Journal of Big Data, SpringerOpen, ISSN 2196-1115, 2020-07-01, vol. 7, no. 1, pp. 1-13. doi:10.1186/s40537-020-00325-6
title OsamorSoft: clustering index for comparison and quality validation in high throughput dataset
topic Clustering index
Algorithms
OsamorSoft
Validation
Rand
Automation
url http://link.springer.com/article/10.1186/s40537-020-00325-6