Uncertainty Based Optimal Sample Selection for Big Data

In machine learning and pattern recognition, building a better predictive model is one of the key problems in the presence of big or massive data, especially if that data contains noisy and unrepresentative samples. Such samples adversely affect the learning model and may degrade its...

Bibliographic Details
Main Authors:	Saadia Ajmal, Rana Aamir Raza Ashfaq, Kashif Saleem
Format:	Article
Language:	English
Published:	IEEE 2023-01-01
Series:	IEEE Access
Subjects:	Big data; instance selection; machine learning; uncertainty
Online Access:	https://ieeexplore.ieee.org/document/10004968/

_version_	1828056010423533568
author	Saadia Ajmal; Rana Aamir Raza Ashfaq; Kashif Saleem
author_facet	Saadia Ajmal; Rana Aamir Raza Ashfaq; Kashif Saleem
author_sort	Saadia Ajmal
collection	DOAJ
description	In machine learning and pattern recognition, building a better predictive model is one of the key problems in the presence of big or massive data, especially if that data contains noisy and unrepresentative samples. Such samples adversely affect the learning model and may degrade its performance. To alleviate this problem, it sometimes becomes necessary to sample the data after eliminating unnecessary instances while keeping the underlying distribution intact. This process is called sampling or instance selection (IS). However, it involves a substantial computational cost. This paper discusses an uncertainty-based optimal sample selection (UBOSS) method that can select a subset of optimal samples efficiently. The proposed work comprises three main steps: first, it uses an IS method to identify the patterns of representative and unrepresentative samples in the original data set; then, an uncertainty-based selector is designed to obtain the fuzziness (i.e., a type of uncertainty) of those samples using a classifier whose output is a membership or fuzzy vector; finally, the process uses a divide-and-conquer strategy to obtain a subset of representative samples. Experiments are conducted on six datasets to evaluate the performance of the proposed IS method. Results show that the proposed methodology outperforms the baseline methods (i.e., CNN, IB3, and DROP3) in selection performance (i.e., optimum samples).
first_indexed	2024-04-10T20:48:45Z
format	Article
id	doaj.art-0fb33a6020fb45239d2f41d571cfdc9b
institution	Directory of Open Access Journals
issn	2169-3536
language	English
last_indexed	2024-04-10T20:48:45Z
publishDate	2023-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-0fb33a6020fb45239d2f41d571cfdc9b | 2023-01-24T00:00:59Z | eng | IEEE | IEEE Access | 2169-3536 | 2023-01-01 | vol. 11 | pp. 6284-6292 | 10.1109/ACCESS.2022.3233598 | 10004968 | Uncertainty Based Optimal Sample Selection for Big Data | Saadia Ajmal (https://orcid.org/0000-0001-7073-2686), Department of Computer Science, Bahauddin Zakariya University, Multan, Pakistan | Rana Aamir Raza Ashfaq, Department of Computer Science, Bahauddin Zakariya University, Multan, Pakistan | Kashif Saleem (https://orcid.org/0000-0001-8062-3301), Department of Computer Sciences and Engineering, College of Applied Studies and Community Service, King Saud University, Riyadh, Saudi Arabia | https://ieeexplore.ieee.org/document/10004968/ | Big data; instance selection; machine learning; uncertainty
title	Uncertainty Based Optimal Sample Selection for Big Data
topic	Big data; instance selection; machine learning; uncertainty
url	https://ieeexplore.ieee.org/document/10004968/
work_keys_str_mv	AT saadiaajmal uncertaintybasedoptimalsampleselectionforbigdata AT ranaaamirrazaashfaq uncertaintybasedoptimalsampleselectionforbigdata AT kashifsaleem uncertaintybasedoptimalsampleselectionforbigdata
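
The description field in this record says the selector obtains the fuzziness of each sample from a classifier whose output is a membership (fuzzy) vector, but the record does not give the formula. The sketch below is an illustration only. It assumes the common De Luca-Termini definition of fuzziness, and the function names (fuzziness, rank_by_fuzziness) and the sample outputs are hypothetical, not taken from the article. It is not the UBOSS procedure itself.

```python
import math


def fuzziness(membership):
    """Fuzziness of one membership vector with values in [0, 1].

    Assumed definition (De Luca-Termini), not confirmed by the record:
    F = -(1/n) * sum(mu*log(mu) + (1 - mu)*log(1 - mu))
    """
    total = 0.0
    for mu in membership:
        # Degrees of exactly 0 or 1 contribute nothing (x*log(x) -> 0).
        if 0.0 < mu < 1.0:
            total += mu * math.log(mu) + (1.0 - mu) * math.log(1.0 - mu)
    return -total / len(membership)


def rank_by_fuzziness(memberships):
    """Return sample indices ordered from least to most fuzzy."""
    scores = [fuzziness(m) for m in memberships]
    return sorted(range(len(memberships)), key=lambda i: scores[i])


# Hypothetical classifier outputs for three samples over two classes.
outputs = [[0.5, 0.5], [1.0, 0.0], [0.9, 0.1]]
order = rank_by_fuzziness(outputs)
```

Under this assumed definition, a crisp output such as [1.0, 0.0] has zero fuzziness and an evenly split output such as [0.5, 0.5] has the maximum, so ranking by the score separates confidently classified samples from ambiguous ones. How the article then partitions the ranked samples in its divide-and-conquer step is not stated in this record.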

Similar Items