Summary: | <p>While most statistical methods used in industry have a simple structure, such as decision trees and logistic regression, many machine learning competitions are won by modern, complex methods, which have also outperformed the simpler methods in previous benchmark studies. In these benchmark studies, the results were usually obtained on raw, unprocessed datasets. In practice, when developing a model, feature extraction is performed first to create non-redundant and useful covariates for the model. This is especially important for the simple methods, which often require approximately linear features without severe outliers.</p> <p>In this thesis, we investigate the performance of modern machine learning methods on raw data, without any additionally extracted features, in comparison to simpler, more interpretable models applied to data with feature extraction. To this end, we employ a set of preprocessing and feature extraction algorithms to approximate this task objectively across 32 different datasets.</p> <p>Our results show that, for classification problems, the simple methods perform almost as well as the more complex methods. This holds for binary classification, measured by the <em>area under the ROC curve</em>, and for multiclass classification, measured by the <em>weighted F1 score</em>.</p> <p>For our regression problems, the modern methods outperform the simpler, more interpretable methods in terms of their respective <em>R-squared</em> or <em>explained variance</em>.</p> <p>We suspect that this difference between classification and regression problems may stem from the evaluation metrics used, specifically from how they respond to the monotonicity or linearity of the extracted features; however, more research is necessary to investigate this difference.</p>
|