Impact of pre-training on background knowledge and societal bias
Main Author: | Kocijan, V |
---|---|
Other Authors: | Lukasiewicz, T; Camburu, O-M; Sallinger, E; Cho, K |
Format: | Thesis |
Language: | English |
Published: | 2021 |
Subjects: | Natural Language Processing; Knowledge Base Completion |
_version_ | 1797107199596560384 |
---|---|
author | Kocijan, V |
author2 | Lukasiewicz, T |
author_facet | Lukasiewicz, T Kocijan, V |
author_sort | Kocijan, V |
collection | OXFORD |
description | With appropriate pre-training on unstructured text, larger and more accurate neural network models can be trained. Unfortunately, unstructured pre-training data may contain undesired societal biases, which a model may mimic and amplify. This thesis focuses both on improving unsupervised pre-training and on developing diagnostics that probe the resulting pre-trained models for undesired behaviour.

Pre-training and diagnostics are carried out on two tasks: coreference resolution and knowledge base completion. For each task, a novel task-specific method for unsupervised pre-training is introduced. The resulting models are then analysed for undesired behaviour by evaluating them on relevant datasets, with a particular focus on gender bias.

Two novel pre-training datasets for coreference resolution are introduced: MaskedWiki and WikiCREM. Fine-tuning on these datasets yields state-of-the-art performance on multiple benchmarks, including the Winograd Schema Challenge, a commonsense reasoning benchmark that requires substantial background knowledge. The pre-trained models are then evaluated on the GAP benchmark, where potentially problematic patterns in the test set are demonstrated. To remove these undesired patterns, a novel test-sample weighting method is introduced, together with a proof of its correctness.

A pre-training method for knowledge base completion is introduced, the first of its kind, significantly improving results on multiple smaller datasets. The resulting models outperform much larger models trained on more general language-modelling tasks. To better understand the behaviour of these models, the first diagnostic dataset for pre-trained knowledge base completion models is introduced; it demonstrates how stereotypes in the pre-training data can affect a model's predictions on the target knowledge base.

Finally, future developments of both task-specific pre-training and bias detection are discussed, motivating further research directions in the field. |
first_indexed | 2024-03-07T07:12:37Z |
format | Thesis |
id | oxford-uuid:a3197e60-53f2-4271-82d8-280b2f44c125 |
institution | University of Oxford |
language | English |
last_indexed | 2024-03-07T07:12:37Z |
publishDate | 2021 |
record_format | dspace |
spelling | oxford-uuid:a3197e60-53f2-4271-82d8-280b2f44c125; 2022-07-08T10:56:00Z; Impact of pre-training on background knowledge and societal bias; Thesis; http://purl.org/coar/resource_type/c_db06; uuid:a3197e60-53f2-4271-82d8-280b2f44c125; Natural Language Processing; Knowledge Base Completion; English; Hyrax Deposit; 2021; Kocijan, V; Lukasiewicz, T; Camburu, O-M; Sallinger, E; Cho, K; abstract as given in the description field above |
spellingShingle | Natural Language Processing Knowledge Base Completion Kocijan, V Impact of pre-training on background knowledge and societal bias |
title | Impact of pre-training on background knowledge and societal bias |
title_full | Impact of pre-training on background knowledge and societal bias |
title_fullStr | Impact of pre-training on background knowledge and societal bias |
title_full_unstemmed | Impact of pre-training on background knowledge and societal bias |
title_short | Impact of pre-training on background knowledge and societal bias |
title_sort | impact of pre training on background knowledge and societal bias |
topic | Natural Language Processing Knowledge Base Completion |
work_keys_str_mv | AT kocijanv impactofpretrainingonbackgroundknowledgeandsocietalbias |