Improved cost-sensitive representation of data for solving the imbalanced big data classification problem

Abstract Dimension reduction is a preprocessing step in machine learning for eliminating undesirable features and increasing learning accuracy. In order to reduce the redundant features, there are data representation methods, each of which has its own advantages. On the other hand, big data with imb...

Bibliographic Details
Main Authors: Mahboubeh Fattahi, Mohammad Hossein Moattar, Yahya Forghani
Format: Article
Language: English
Published: SpringerOpen 2022-05-01
Series: Journal of Big Data
Subjects:
Online Access: https://doi.org/10.1186/s40537-022-00617-z
author Mahboubeh Fattahi
Mohammad Hossein Moattar
Yahya Forghani
collection DOAJ
description Abstract: Dimension reduction is a preprocessing step in machine learning that eliminates undesirable features and increases learning accuracy. Several data representation methods exist for reducing redundant features, each with its own advantages. At the same time, big data with imbalanced classes is one of the most important issues in pattern recognition and machine learning. This paper proposes a method, formulated as a cost-sensitive optimization problem, that performs feature selection and feature extraction simultaneously. The feature extraction phase reduces error and preserves the geometric relationships between data points by solving a manifold learning optimization problem. The feature selection phase adopts a cost-sensitive optimization problem based on minimizing the upper bound of the generalization error. Finally, the optimization problem formed by combining these two problems is solved with an added cost-sensitive term that balances the classes without manipulating the data. To evaluate the feature reduction, a multi-class linear SVM classifier is applied to the reduced data. The proposed method is compared with several other approaches on 21 datasets from the UCI Machine Learning Repository, microarray and high-dimensional datasets, as well as imbalanced datasets from the KEEL repository. The results indicate that the proposed method is significantly more efficient than some similar approaches.
first_indexed 2024-12-12T02:45:02Z
format Article
id doaj.art-fb8cf9f78ce34f55a5a53cd9be98356e
institution Directory Open Access Journal
issn 2196-1115
language English
last_indexed 2024-12-12T02:45:02Z
publishDate 2022-05-01
publisher SpringerOpen
record_format Article
series Journal of Big Data
spelling doaj.art-fb8cf9f78ce34f55a5a53cd9be98356e
2022-12-22T00:41:04Z
eng
SpringerOpen
Journal of Big Data
2196-1115
2022-05-01
91124
10.1186/s40537-022-00617-z
Improved cost-sensitive representation of data for solving the imbalanced big data classification problem
Mahboubeh Fattahi (Department of Computer Engineering, Mashhad Branch, Islamic Azad University)
Mohammad Hossein Moattar (Department of Computer Engineering, Mashhad Branch, Islamic Azad University)
Yahya Forghani (Department of Computer Engineering, Mashhad Branch, Islamic Azad University)
https://doi.org/10.1186/s40537-022-00617-z
Feature selection
Feature extraction
Imbalanced data
Big data classification
Cost sensitive
Optimization
title Improved cost-sensitive representation of data for solving the imbalanced big data classification problem
topic Feature selection
Feature extraction
Imbalanced data
Big data classification
Cost sensitive
Optimization
url https://doi.org/10.1186/s40537-022-00617-z