Open Coding for Machine Learning

Data-driven decisions have an unavoidable influence on people’s lives [5], and despite being marketed as fair decision-making tools, predictive models can easily perpetuate the same biases they hope to counteract. Some approaches to reducing this bias include incorporating interactive machine learni...

Bibliographic Details
Main Author: Price, Magdalena
Other Authors: Hadfield-Menell, Dylan
Format: Thesis
Published: Massachusetts Institute of Technology 2022
Online Access: https://hdl.handle.net/1721.1/145142
_version_ 1811084224747274240
author Price, Magdalena
author2 Hadfield-Menell, Dylan
author_facet Hadfield-Menell, Dylan
Price, Magdalena
author_sort Price, Magdalena
collection MIT
description Data-driven decisions have an unavoidable influence on people’s lives [5], and despite being marketed as fair decision-making tools, predictive models can easily perpetuate the same biases they hope to counteract. Some approaches to reducing this bias include incorporating interactive machine learning techniques, modifying the input features of the algorithm, or improving the pre-processing of the dataset [35]. However, even if the prediction model is fair and the raw dataset is fair, unfair labels still present the possibility of adding bias to the system [25]. In particular, predictive models for subjective observations are trained on correlative metrics that may not accurately reflect the nuanced nature of what is being predicted; such a phenomenon may be understood as goal misspecification. Large datasets in particular can fall victim to this phenomenon [35], as the time and cost required demand alternative, less thorough methods of labeling. Thus, we take an approach that analyzes current methods of labeling big data, looking to reduce goal misspecification by modifying the process of labeling big data. Grounded coding theory [12] presents a modern approach to effectively labeling data from a human perspective, dividing the exploratory process into several stages that encourage thoughtful interaction with text corpora. In order to support effective data labeling, we draw explicit inspiration from some of the methodologies presented. Then, we build on these methodologies by augmenting them with machine learning techniques, providing support for effective and scalable data labeling. Thus, by providing a space for qualified individuals to effectively and efficiently create custom labels, our research better enables quality correlative goals for predictive models. Combining social science methodology with semi-supervised learning, we present a scalable annotation interface that serves as an effective alternative to current data labeling practices.
first_indexed 2024-09-23T12:47:16Z
format Thesis
id mit-1721.1/145142
institution Massachusetts Institute of Technology
last_indexed 2024-09-23T12:47:16Z
publishDate 2022
publisher Massachusetts Institute of Technology
record_format dspace
spelling mit-1721.1/145142 2022-08-30T03:33:23Z Open Coding for Machine Learning Price, Magdalena Hadfield-Menell, Dylan Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science M.Eng. 2022-08-29T16:36:09Z 2022-08-29T16:36:09Z 2022-05 2022-05-27T16:19:29.577Z Thesis https://hdl.handle.net/1721.1/145142 In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology
spellingShingle Price, Magdalena
Open Coding for Machine Learning
title Open Coding for Machine Learning
title_full Open Coding for Machine Learning
title_fullStr Open Coding for Machine Learning
title_full_unstemmed Open Coding for Machine Learning
title_short Open Coding for Machine Learning
title_sort open coding for machine learning
url https://hdl.handle.net/1721.1/145142
work_keys_str_mv AT pricemagdalena opencodingformachinelearning