Analysis of Dimensionality Reduction Techniques on Big Data

Due to digitization, a huge volume of data is being generated across several sectors such as healthcare, production, sales, IoT devices, the Web, and organizations. Machine learning algorithms are used to uncover patterns among the attributes of this data. Hence, they can be used to make predictions that ca...

Bibliographic Details
Main Authors:	G. Thippa Reddy, M. Praveen Kumar Reddy, Kuruva Lakshmanna, Rajesh Kaluri, Dharmendra Singh Rajput, Gautam Srivastava, Thar Baker
Format:	Article
Language:	English
Published:	IEEE 2020-01-01
Series:	IEEE Access
Subjects:	Cardiotocography dataset; dimensionality reduction; feature engineering; linear discriminant analysis; machine learning; principal component analysis
Online Access:	https://ieeexplore.ieee.org/document/9036908/

_version_	1831805624095080448
author	G. Thippa Reddy; M. Praveen Kumar Reddy; Kuruva Lakshmanna; Rajesh Kaluri; Dharmendra Singh Rajput; Gautam Srivastava; Thar Baker
author_sort	G. Thippa Reddy
collection	DOAJ
description	Due to digitization, a huge volume of data is being generated across several sectors such as healthcare, production, sales, IoT devices, the Web, and organizations. Machine learning algorithms are used to uncover patterns among the attributes of this data, so they can be used to make predictions that help medical practitioners and people at the managerial level make executive decisions. Not all the attributes in the generated datasets are important for training machine learning algorithms: some attributes might be irrelevant, and some might not affect the outcome of the prediction. Ignoring or removing these irrelevant or less important attributes reduces the burden on machine learning algorithms. In this work, two prominent dimensionality reduction techniques, Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA), are investigated with four popular Machine Learning (ML) algorithms, Decision Tree Induction, Support Vector Machine (SVM), Naive Bayes Classifier, and Random Forest Classifier, using the publicly available Cardiotocography (CTG) dataset from the University of California, Irvine (UCI) Machine Learning Repository. The experimental results show that PCA outperforms LDA on all measures. Also, the performance of the Decision Tree and Random Forest classifiers is not affected much by using PCA and LDA. To further analyze the performance of PCA and LDA, the experimentation is carried out on Diabetic Retinopathy (DR) and Intrusion Detection System (IDS) datasets. The results show that ML algorithms with PCA produce better results when the dimensionality of the datasets is high. When the dimensionality of the datasets is low, the ML algorithms without dimensionality reduction yield better results.
first_indexed	2024-12-22T19:27:47Z
format	Article
id	doaj.art-1e5a7e20569547b6b4b0e8c2d57a85fe
institution	Directory of Open Access Journals
issn	2169-3536
language	English
last_indexed	2024-12-22T19:27:47Z
publishDate	2020-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-1e5a7e20569547b6b4b0e8c2d57a85fe | 2022-12-21T18:15:12Z | eng | IEEE | IEEE Access | 2169-3536 | 2020-01-01 | vol. 8, pp. 54776-54788 | DOI 10.1109/ACCESS.2020.2980942 | 9036908 | Analysis of Dimensionality Reduction Techniques on Big Data | Authors: G. Thippa Reddy (0, https://orcid.org/0000-0003-0097-801X); M. Praveen Kumar Reddy (1, https://orcid.org/0000-0003-4209-2495); Kuruva Lakshmanna (2); Rajesh Kaluri (3, https://orcid.org/0000-0003-2073-9833); Dharmendra Singh Rajput (4); Gautam Srivastava (5, https://orcid.org/0000-0001-9851-4103); Thar Baker (6, https://orcid.org/0000-0002-5166-4873) | Affiliations, in record order: School of Information Technology and Engineering, VIT, Vellore, India (listed five times); Department of Mathematics and Computer Science, Brandon University, Brandon, Canada; Department of Computer Science, Liverpool John Moores University, Liverpool, U.K. | https://ieeexplore.ieee.org/document/9036908/ | Cardiotocography dataset; dimensionality reduction; feature engineering; linear discriminant analysis; machine learning; principal component analysis
title	Analysis of Dimensionality Reduction Techniques on Big Data
topic	Cardiotocography dataset; dimensionality reduction; feature engineering; linear discriminant analysis; machine learning; principal component analysis
url	https://ieeexplore.ieee.org/document/9036908/

Similar Items