Impact of pre-training on background knowledge and societal bias

With appropriate pre-training on unstructured text, larger and more accurate neural network models can be trained. Unfortunately, unstructured pre-training data may contain undesired societal biases, which a model can mimic and amplify. This thesis focuses both on improving unsupervised pre-training and on developing diagnostics that probe the resulting pre-trained models for potentially undesired behaviour.

Pre-training and diagnostics are studied on two tasks: coreference resolution and knowledge base completion. For each task, a novel task-specific method for unsupervised pre-training is introduced, and the resulting models are then analysed for potentially undesired behaviour by evaluating them on relevant datasets, with a particular focus on gender bias.

Two novel pre-training datasets for coreference resolution are introduced, MaskedWiki and WikiCREM. Fine-tuning on these datasets yields state-of-the-art performance on multiple benchmarks, including the Winograd Schema Challenge, a commonsense reasoning benchmark that requires substantial background knowledge. The resulting pre-trained models are then evaluated on the GAP benchmark, where potentially problematic patterns in the test set are demonstrated. To remove these undesired patterns, a novel test-sample weighting method is introduced, together with a proof of its correctness.

For knowledge base completion, a pre-training method, the first of its kind, is introduced, significantly improving results on multiple smaller datasets; the resulting models outperform much larger models trained on more general language-modelling tasks. To better understand the behaviour of these models, the first diagnostic dataset for pre-trained knowledge base completion models is introduced, demonstrating how stereotypes in the pre-training data can affect a model's predictions on the target knowledge base.

Finally, future developments of both task-specific pre-training and bias detection are discussed, motivating further research directions in the field.

Bibliographic Details
Main Author: Kocijan, V
Other Authors: Lukasiewicz, T; Camburu, O-M; Sallinger, E; Cho, K
Format: Thesis
Language: English
Institution: University of Oxford
Published: 2021
Subjects: Natural Language Processing; Knowledge Base Completion