KNN-Based Algorithm of Hard Case Detection in Datasets for Classification

Machine learning models for classification are designed to find the best way to separate two or more classes. When classes overlap, there is no way to separate the data cleanly. Any ML algorithm will fail to correctly classify a certain set of data points, which are surrounded b...


Bibliographic Details
Main Authors: Anton Okhrimenko, Nataliia Kussul
Format: Article
Language: English
Published: Anhalt University of Applied Sciences 2023-03-01
Series: Proceedings of the International Conference on Applied Innovations in IT
Subjects: knn, dataset quality assessment, imbalanced datasets, hard cases
Online Access: https://icaiit.org/paper.php?paper=11th_ICAIIT_1/2_8
author Anton Okhrimenko
Nataliia Kussul
collection DOAJ
description Machine learning models for classification are designed to find the best way to separate two or more classes. When classes overlap, there is no way to separate the data cleanly. Any ML algorithm will fail to correctly classify a certain set of data points, namely those surrounded in the feature space by a significant number of data points of another class. However, being able to find such hard cases in a dataset makes it possible to apply a different set of rules to them than to normal data samples. In this work, we introduce a KNN-based algorithm for detecting data points and subspaces for which the classification decision is ambiguous. The algorithm is described in detail, along with a demonstration on an artificially generated dataset. Possible use cases are also discussed, including dataset quality assessment, a custom ensemble strategy, and data sampling modifications. The proposed algorithm can be used throughout the full cycle of machine learning model development, from forming the training dataset to real-case model inference.
first_indexed 2024-04-09T14:39:49Z
format Article
id doaj.art-1a1946802ab84980943396673413a4a2
institution Directory Open Access Journal
issn 2199-8876
language English
last_indexed 2024-04-09T14:39:49Z
publishDate 2023-03-01
publisher Anhalt University of Applied Sciences
record_format Article
series Proceedings of the International Conference on Applied Innovations in IT
spelling doaj.art-1a1946802ab84980943396673413a4a2
2023-05-03T09:05:03Z
eng
Anhalt University of Applied Sciences
Proceedings of the International Conference on Applied Innovations in IT
2199-8876
2023-03-01
Volume 11, Issue 1, pp. 113-118
DOI: 10.25673/101926
KNN-Based Algorithm of Hard Case Detection in Datasets for Classification
Anton Okhrimenko (https://orcid.org/0009-0004-8520-0278): Institute of Physics and Technology, Igor Sikorsky Kyiv Polytechnic Institute, Peremohy Avenue 37, Kyiv, Ukraine
Nataliia Kussul (https://orcid.org/0000-0002-9704-9702): Institute of Physics and Technology, Igor Sikorsky Kyiv Polytechnic Institute, Peremohy Avenue 37, Kyiv, Ukraine / Department of Space Information Technologies and System, Space Research Institute National Academy of Science of Ukraine and State Space Agency of Ukraine, Glushkov Avenue 40, Kyiv, Ukraine / Anhalt University of Applied Sciences, Bernburger Str. 57, Kothen, Germany
https://icaiit.org/paper.php?paper=11th_ICAIIT_1/2_8
knn; dataset quality assessment; imbalanced datasets; hard cases
topic knn
dataset quality assessment
imbalanced datasets
hard cases
url https://icaiit.org/paper.php?paper=11th_ICAIIT_1/2_8
work_keys_str_mv AT antonokhrimenko knnbasedalgorithmofhardcasedetectionindatasetsforclassification
AT nataliiakussul knnbasedalgorithmofhardcasedetectionindatasetsforclassification
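
Illustrative sketch (not part of the catalog record): the description above mentions a KNN-based way of detecting data points that are surrounded by points of another class. This record does not give the authors' exact procedure, so the code below is only a minimal sketch of the general idea under stated assumptions: each point is scored by the fraction of its k nearest neighbours that carry a different label, and it is flagged as a hard case when that fraction reaches a threshold. The function names, the parameters `k` and `threshold`, the Euclidean distance, and the brute-force neighbour search are all hypothetical choices made for illustration.

```python
import math


def knn_disagreement_scores(points, labels, k=5):
    """Return, for each point, the fraction of its k nearest neighbours
    (excluding the point itself) whose label differs from its own.

    Assumption: Euclidean distance and brute-force search; this is an
    illustrative scoring rule, not the published algorithm.
    """
    scores = []
    for i, p in enumerate(points):
        # Distance from point i to every other point, nearest first.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        if not neighbours:
            # A single-point dataset has no neighbours to disagree with.
            scores.append(0.0)
            continue
        differing = sum(1 for j in neighbours if labels[j] != labels[i])
        scores.append(differing / len(neighbours))
    return scores


def find_hard_cases(points, labels, k=5, threshold=0.5):
    """Return indices of points whose disagreement score is >= threshold."""
    scores = knn_disagreement_scores(points, labels, k)
    return [i for i, s in enumerate(scores) if s >= threshold]


if __name__ == "__main__":
    # Tiny hand-made dataset: one class-1 point sits inside a class-0 cluster.
    points = [
        (0, 0), (0, 1), (1, 0), (1, 1),        # class 0 cluster
        (0.5, 0.5),                            # class 1 point inside it
        (10, 10), (10, 11), (11, 10), (11, 11) # class 1 cluster
    ]
    labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]
    print(find_hard_cases(points, labels, k=3, threshold=0.5))
```

In this toy dataset only the class-1 point placed inside the class-0 cluster is flagged. The score itself could feed the use cases listed in the description (for example, as a per-sample weight during sampling), but how the authors actually do this is not stated in this record.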