Summary: | <p>The notion that derivational morphology is difficult to predict has been a recurring theme in morphological research over recent decades. It can be unclear whether a derivative exists at all, what exactly it means, and which affix is used to form it. The central goal of this thesis is to demonstrate that recent progress in natural language processing (NLP) allows for a fresh perspective on the (un-)predictability of derivational morphology.</p>
<p>Prior research in morphology has identified semantic and extralinguistic factors as two key challenges for predicting derivational morphology. The first set of papers in the thesis leverages novel NLP methods and applies them to large-scale, socially stratified datasets. I find that this computational approach yields substantially improved models, demonstrating that derivational morphology is predictable to a greater extent than previously thought.</p>
<p>A side result of the first part of the thesis is that tokenization (i.e., the way in which words are segmented into subwords) affects the ability of NLP systems to predict derivational morphology, raising the question of whether it degrades performance more broadly. The second set of papers in the thesis shows that this is indeed the case. As a remedy, I devise tokenization strategies that are directly informed by morphology, with beneficial effects on performance.</p>
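<p>To make the tokenization point concrete, the following minimal sketch contrasts two segmentations of the derivative "unhappiness". The greedy longest-match segmenter and both vocabularies are illustrative assumptions for this summary, not the methods developed in the thesis:</p>
<pre>
# Toy greedy longest-match subword segmenter (illustrative only; the
# vocabularies below are hypothetical, not taken from the thesis).
def segment(word, vocab):
    """Split `word` into the longest prefixes found in `vocab`."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            if word[i:j] in vocab or j == i + 1:  # fall back to single chars
                pieces.append(word[i:j])
                i = j
                break
    return pieces

# A frequency-driven vocabulary may merge across morpheme boundaries ...
bpe_like_vocab = {"unhapp", "iness", "ness", "un"}
# ... whereas a morphology-informed vocabulary respects them.
morph_vocab = {"un", "happi", "ness"}

print(segment("unhappiness", bpe_like_vocab))  # ['unhapp', 'iness']
print(segment("unhappiness", morph_vocab))     # ['un', 'happi', 'ness']
</pre>
<p>The first segmentation obscures the prefix "un-" and the suffix "-ness", the kind of morphological mismatch that can hurt a model's ability to generalize over derivatives.</p>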
<p>More broadly, the results of this thesis suggest that NLP, and deep learning more generally, can greatly benefit linguistic research, a view that is still contested by many scholars in linguistics. At the same time, the thesis shows that even, or perhaps especially, in the age of large language models, linguistic insights remain relevant for the development of human language technology.</p>