Data irregularities in discretisation of test sets used for evaluation of classification systems: A case study on authorship attribution

Bibliographic Details
Main Authors: Urszula Stańczyk, Beata Zielosko
Format: Article
Language: English
Published: Polish Academy of Sciences 2021-06-01
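Series: Bulletin of the Polish Academy of Sciences: Technical Sciences
Subjects: discretisation, data irregularities, evaluation and test sets, rough sets, authorship attribution, stylometry
Online Access: https://journals.pan.pl/Content/119904/PDF/17_01628_Bpast.No.69(4)_27.08.21_druk.pdf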
author Urszula Stańczyk
Beata Zielosko
collection DOAJ
description When patterns to be recognised are described by features of continuous type, discretisation becomes either an optional or a necessary step in the initial data pre-processing stage. Characteristics of the data, such as the distribution of data points in the input space, can significantly influence the transformation from real-valued into nominal attributes, and the resulting performance of classification systems employing them. If the data include several separate sets, their discretisation becomes more complex, as varying numbers of intervals and different ranges can be constructed for the same variables. The paper presents research on irregularities in data distribution, observed in the context of discretisation processes. Selected discretisation methods were used and their effect on the performance of decision algorithms, induced in the classical rough set approach, was investigated. The studied input space was defined by measurable style-markers, which, exploited as characteristic features, facilitate treating the task of stylometric authorship attribution as classification.
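The remark in the description about separate sets receiving different ranges for the same variable can be illustrated with a minimal sketch. It is not the authors' procedure: equal-width binning is assumed here only as one simple unsupervised method, the helper names `equal_width_cuts` and `discretise` are hypothetical, and the numbers are invented values of a single style-marker rather than data from the paper.

```python
from bisect import bisect_right


def equal_width_cuts(values, bins):
    """Return the bins-1 interior cut points that split [min, max] evenly."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]


def discretise(values, cuts):
    """Map each value to the index of the interval it falls into."""
    return [bisect_right(cuts, v) for v in values]


# Hypothetical values of one style-marker (illustrative only).
train = [10, 20, 30, 40, 50, 70]
test = [40, 46, 52, 58, 64, 70]

train_cuts = equal_width_cuts(train, 3)  # ranges built from the training set
test_cuts = equal_width_cuts(test, 3)    # ranges built from the test set alone

# The same test samples, discretised independently of the training set
# and with the cut points transferred from the training set.
independent = discretise(test, test_cuts)
transferred = discretise(test, train_cuts)

print(train_cuts, test_cuts)
print(independent, transferred)
```

Under these assumptions the value 40 falls into interval 0 when the test set is discretised on its own but into interval 1 when the training cut points are reused, so rules induced on the training data would see different nominal values for the same sample.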
first_indexed 2024-04-13T11:07:10Z
format Article
id doaj.art-edbaf96ca3ab4a78bf82bcadaac15258
institution Directory of Open Access Journals
issn 2300-1917
language English
last_indexed 2024-04-13T11:07:10Z
publishDate 2021-06-01
publisher Polish Academy of Sciences
record_format Article
series Bulletin of the Polish Academy of Sciences: Technical Sciences
spelling doaj.art-edbaf96ca3ab4a78bf82bcadaac15258
2022-12-22T02:49:15Z
eng
Polish Academy of Sciences
Bulletin of the Polish Academy of Sciences: Technical Sciences
2300-1917
2021-06-01
Volume 69, Issue 4
https://doi.org/10.24425/bpasts.2021.137629
Urszula Stańczyk (Silesian University of Technology, ul. Akademicka 2A, 44-100 Gliwice, Poland)
Beata Zielosko (University of Silesia in Katowice, ul. Będzińska 39, 41-200 Sosnowiec, Poland)
title Data irregularities in discretisation of test sets used for evaluation of classification systems: A case study on authorship attribution
topic discretisation
data irregularities
evaluation and test sets
rough sets
authorship attribution
stylometry
url https://journals.pan.pl/Content/119904/PDF/17_01628_Bpast.No.69(4)_27.08.21_druk.pdf