Impact of pre-training on background knowledge and societal bias

With appropriate pre-training on unstructured text, larger and more accurate neural network models can be trained. Unfortunately, unstructured pre-training data may contain undesired societal biases, which a model can mimic and amplify. This thesis focuses both on improving unsupervised pre-training and on developing diagnostics that probe the resulting pre-trained models for potentially undesired behaviour.

Pre-training and diagnostics are studied on two tasks: coreference resolution and knowledge base completion. For each task, a novel task-specific method for unsupervised pre-training is introduced, and the resulting models are then analysed for potentially undesired behaviour by evaluating them on relevant datasets, with a particular focus on gender bias.

Two novel pre-training datasets for coreference resolution are introduced, MaskedWiki and WikiCREM. Fine-tuning on these datasets yields state-of-the-art performance on multiple benchmarks, including the Winograd Schema Challenge, a commonsense reasoning benchmark that requires substantial background knowledge. The resulting pre-trained models are then evaluated on the GAP benchmark, where potentially problematic patterns in the test set are demonstrated. To remove these undesired patterns, a novel test-sample weighting method is introduced, together with a proof of its correctness.

For knowledge base completion, a pre-training method, the first of its kind, is introduced, significantly improving results on multiple smaller datasets; the resulting models outperform much larger models trained on more general language-modelling tasks. To better understand the behaviour of these models, the first diagnostic dataset for pre-trained knowledge base completion models is introduced, demonstrating how stereotypes in the pre-training data can affect a model's predictions on the target knowledge base.

Finally, future developments of both task-specific pre-training and bias detection are discussed, motivating further research directions in the field.

Bibliographic Details
Main Author: Kocijan, V
Other Authors: Lukasiewicz, T; Camburu, O-M; Sallinger, E; Cho, K
Format: Thesis
Language: English
Institution: University of Oxford
Published: 2021
Subjects: Natural Language Processing; Knowledge Base Completion