KNN-Based Algorithm of Hard Case Detection in Datasets for Classification
Machine learning models for classification are designed to find the best way to separate two or more classes. When classes overlap, such data cannot be cleanly separated. Any ML algorithm will fail to correctly classify a certain set of data points, which are surrounded b...
Main Authors: | Anton Okhrimenko, Nataliia Kussul |
---|---|
Format: | Article |
Language: | English |
Published: | Anhalt University of Applied Sciences, 2023-03-01 |
Series: | Proceedings of the International Conference on Applied Innovations in IT |
Subjects: | knn; dataset quality assessment; imbalanced datasets; hard cases |
Online Access: | https://icaiit.org/paper.php?paper=11th_ICAIIT_1/2_8 |
_version_ | 1797834487429595136 |
---|---|
author | Anton Okhrimenko, Nataliia Kussul |
author_sort | Anton Okhrimenko |
collection | DOAJ |
description | Machine learning models for classification are designed to find the best way to separate two or more classes. When classes overlap, such data cannot be cleanly separated: any ML algorithm will fail to correctly classify data points that are surrounded in the feature space by a significant number of data points of another class. However, being able to find such hard cases in a dataset makes it possible to apply a different set of rules to them than to normal data samples. In this work, we introduce a KNN-based algorithm for detecting data points and subspaces for which the classification decision is ambiguous. The algorithm is described in detail and demonstrated on an artificially generated dataset. Possible use cases are also discussed, including dataset quality assessment, a custom ensemble strategy, and modifications to data sampling. The proposed algorithm can be used throughout the full cycle of machine learning model development, from forming the training dataset to real-world model inference. |
first_indexed | 2024-04-09T14:39:49Z |
format | Article |
id | doaj.art-1a1946802ab84980943396673413a4a2 |
institution | Directory Open Access Journal |
issn | 2199-8876 |
language | English |
last_indexed | 2024-04-09T14:39:49Z |
publishDate | 2023-03-01 |
publisher | Anhalt University of Applied Sciences |
record_format | Article |
series | Proceedings of the International Conference on Applied Innovations in IT |
spelling | doaj.art-1a1946802ab84980943396673413a4a2; 2023-05-03T09:05:03Z; eng; Anhalt University of Applied Sciences; Proceedings of the International Conference on Applied Innovations in IT; ISSN 2199-8876; 2023-03-01; vol. 11, no. 1, pp. 113-118; DOI 10.25673/101926; KNN-Based Algorithm of Hard Case Detection in Datasets for Classification; Anton Okhrimenko (https://orcid.org/0009-0004-8520-0278), Institute of Physics and Technology, Igor Sikorsky Kyiv Polytechnic Institute, Peremohy Avenue 37, Kyiv, Ukraine; Nataliia Kussul (https://orcid.org/0000-0002-9704-9702), Institute of Physics and Technology, Igor Sikorsky Kyiv Polytechnic Institute, Peremohy Avenue 37, Kyiv, Ukraine / Department of Space Information Technologies and System, Space Research Institute National Academy of Science of Ukraine and State Space Agency of Ukraine, Glushkov Avenue 40, Kyiv, Ukraine / Anhalt University of Applied Sciences, Bernburger Str. 57, Kothen, Germany; https://icaiit.org/paper.php?paper=11th_ICAIIT_1/2_8; keywords: knn, dataset quality assessment, imbalanced datasets, hard cases |
title | KNN-Based Algorithm of Hard Case Detection in Datasets for Classification |
topic | knn; dataset quality assessment; imbalanced datasets; hard cases |
url | https://icaiit.org/paper.php?paper=11th_ICAIIT_1/2_8 |
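
The abstract describes a KNN-based way of finding data points whose classification is ambiguous because they are surrounded by points of another class. This record does not give the algorithm's details, so the following is only a minimal sketch of that general idea, not the authors' method. It assumes Euclidean distance, scores each point by the fraction of its k nearest neighbours that carry a different label, and flags points whose score exceeds a threshold. The function names `hard_case_scores` and `find_hard_cases`, the default `k=5`, and the default `threshold=0.5` are all hypothetical choices made for illustration.

```python
import math
import random


def hard_case_scores(points, labels, k=5):
    """For each point, return the fraction of its k nearest neighbours
    (Euclidean distance, the point itself excluded) with a different label."""
    if not 0 < k < len(points):
        raise ValueError("k must satisfy 0 < k < number of points")
    scores = []
    for i, p in enumerate(points):
        # Distance from point i to every other point, paired with that point's label.
        dists = sorted(
            (math.dist(p, q), labels[j])
            for j, q in enumerate(points)
            if j != i
        )
        other = sum(1 for _, lab in dists[:k] if lab != labels[i])
        scores.append(other / k)
    return scores


def find_hard_cases(points, labels, k=5, threshold=0.5):
    """Indices of points whose neighbourhood is dominated by other classes.
    The threshold is an illustrative assumption, not a value from the paper."""
    scores = hard_case_scores(points, labels, k)
    return [i for i, s in enumerate(scores) if s > threshold]


if __name__ == "__main__":
    rng = random.Random(0)
    # Two well-separated 2-D Gaussian blobs, one per class.
    points = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(30)]
    points += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(30)]
    labels = [0] * 30 + [1] * 30
    # One class-1 point planted in the middle of the class-0 blob.
    points.append((0.0, 0.0))
    labels.append(1)
    print(find_hard_cases(points, labels, k=5, threshold=0.5))
```

The brute-force neighbour search is quadratic in the number of points, which is acceptable for a small artificial dataset like the one in the demo; a spatial index would be the usual replacement for larger data.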