Data normalization in machine learning


Bibliographic Details
Main Authors: V. V. Starovoitov, Yu. I. Golub
Format: Article
Language: Russian
Published: The United Institute of Informatics Problems of the National Academy of Sciences of Belarus, 2021-09-01
Series: Informatika
Subjects: object classification, clustering, data normalization, function normalization, sigmoid, hyperbolic tangent, random forest
Online Access: https://inf.grid.by/jour/article/view/1156
collection DOAJ
description In machine learning, input data often come in different units and on different scales. A review of the literature shows that initial data described on different types of measurement scales and in different units should be converted to a single representation by normalization or standardization, and the difference between these two operations is explained. The paper systematizes the basic operations permitted on these scales, as well as the main variants of normalization functions. A new "scale of parts" is proposed, and examples of data normalization for correct analysis are given. The analysis of publications shows that there is no universal method of data normalization, but normalizing the initial data makes it possible to increase classification accuracy. Clustering with methods based on distance functions is best performed after converting all features to a single scale. The results of classification and clustering by different methods can be compared with different scoring functions, which often have different ranges of values; to select the most accurate function, it is reasonable to normalize several such functions and compare their estimates on a single scale. The splitting rules of tree-like classifiers are invariant to the scales of quantitative features, since only the comparison operation is used. Perhaps due to this property, the random forest classifier has been recognized, across numerous experiments, as one of the best classifiers for analyzing data of different nature.
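The distinction between normalization and standardization drawn in the abstract, along with the sigmoid and hyperbolic-tangent variants listed among the subjects, can be sketched in a few lines of Python (an illustrative example only, not code from the paper; the sample data and function names are invented here):

```python
import math

# Illustrative sketch only -- not code from the paper. The sample feature
# values ("heights_cm") and function names are invented for demonstration.

def min_max(values):
    """Normalization: linear rescaling to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Standardization: shift to zero mean and unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def sigmoid_norm(values):
    """Sigmoid normalization: squashes standardized values into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in z_score(values)]

def tanh_norm(values):
    """Hyperbolic-tangent normalization: squashes standardized values into (-1, 1)."""
    return [math.tanh(z) for z in z_score(values)]

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
print(min_max(heights_cm))    # -> [0.0, 0.25, 0.5, 0.75, 1.0]
print(z_score(heights_cm))    # mean 0, unit variance

# Tree-like classifiers use only comparisons in their split rules, so any
# strictly increasing rescaling (such as min-max) preserves every split:
threshold = 172.0
scaled_threshold = (threshold - min(heights_cm)) / (max(heights_cm) - min(heights_cm))
assert [v > threshold for v in heights_cm] == \
       [s > scaled_threshold for s in min_max(heights_cm)]
```

The final assertion illustrates the abstract's closing point: because split rules reduce to comparisons, monotone rescaling of a quantitative feature leaves every tree decision unchanged.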
first_indexed 2024-04-10T02:13:37Z
id doaj.art-0f7da5371aa14a469fa1e29590b07205
institution Directory of Open Access Journals
issn 1816-0301
last_indexed 2024-04-10T02:13:37Z
spelling Informatika, vol. 18, no. 3 (2021-09-01), pp. 83-96. DOI: 10.37661/1816-0301-2021-18-3-83-96
title Data normalization in machine learning
topic object classification
clustering
data normalization
function normalization
sigmoid
hyperbolic tangent
random forest