Empirical performance of simple statistical inference-type models with feature extraction in comparison to modern machine learning methods concerning regression and classification problems


Bibliographic Details
Main Author: Dittgen, M
Other Authors: Graf, F
Format: Thesis
Published: 2019
Description
Summary:<p>While most statistical methods used in industry are structurally simple, such as decision trees and logistic regression, many machine learning competitions are won by modern, complex methods, which have also outperformed the simpler methods in previous benchmark studies. In those benchmark studies, the results were usually obtained on raw, unprocessed datasets. In practice, when developing a model, feature extraction is typically performed first to create non-redundant and useful covariates for the model. This is especially important for the simple methods, which often require approximately linear features without severe outliers.</p> <p>In this thesis, we investigate the performance of modern machine learning methods on raw data, without any additionally extracted features, in comparison to simpler and more interpretable models fitted on data with feature extraction. For this purpose, we apply certain preprocessing and feature extraction algorithms to approximate this task objectively across 32 different datasets.</p> <p>Our results show that, for classification problems, the simple methods perform almost as well as the more complex methods. This holds both for binary classification, measured by the <em>area under the ROC curve</em>, and for multiclass classification, measured by the <em>weighted F1 score</em>.</p> <p>For our regression problems, the modern methods outperform the simpler methods and appear superior to the more interpretable methods in terms of their respective <em>R-squared</em> or <em>explained variance</em>.</p> <p>We suspect that the difference between classification and regression problems might be due to the evaluation metrics used, specifically how they respond to the monotonicity or linearity of the extracted features, but more research is necessary to investigate this difference.</p>
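For reference, the three evaluation metrics named in the abstract can be sketched in dependency-free Python. This is an illustrative sketch only; the thesis does not specify these implementations, and standard library routines (e.g. scikit-learn's `roc_auc_score`, `f1_score`, and `r2_score`) would normally be used instead.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) formulation."""
    # Assign 1-based ranks to all scores, averaging ranks over ties.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    pos_ranks = [r for r, t in zip(ranks, y_true) if t == 1]
    n_pos, n_neg = len(pos_ranks), len(y_true) - len(pos_ranks)
    return (sum(pos_ranks) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def weighted_f1(y_true, y_pred):
    """Per-class F1 scores, averaged with weights proportional to class support."""
    total, n = 0.0, len(y_true)
    for c in set(y_true):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        total += f1 * sum(1 for t in y_true if t == c) / n
    return total

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

The rank-sum form of the AUC makes explicit that the metric depends only on the ordering of the scores, not their magnitudes, which is one reason classification metrics may be less sensitive to feature linearity than R-squared is.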