Open Coding for Machine Learning

Bibliographic Details
Main Author: Price, Magdalena
Other Authors: Hadfield-Menell, Dylan
Format: Thesis
Published: Massachusetts Institute of Technology 2022
Online Access: https://hdl.handle.net/1721.1/145142
Description
Summary: Data-driven decisions have an unavoidable influence on people’s lives [5], and despite being marketed as fair decision-making tools, predictive models can easily perpetuate the same biases they hope to counteract. Some approaches to reducing this bias include incorporating interactive machine learning techniques, modifying the input features of the algorithm, or improving the pre-processing of the dataset [35]. However, even if the prediction model is fair and the raw dataset is fair, unfair labels still present the possibility of adding bias to the system [25]. In particular, predictive models for subjective observations are trained on correlative metrics that may not accurately reflect the nuanced nature of what is being predicted; such a phenomenon may be understood as goal misspecification. Large datasets in particular can fall victim to this phenomenon [35], as the time and cost required demand alternative, less thorough methods of labeling. Thus, we take an approach that analyzes current methods of labeling big data, looking to reduce goal misspecification by modifying the process of labeling big data. Grounded coding theory [12] presents a modern approach to effectively labeling data from a human perspective, dividing the exploratory process into several stages that encourage thoughtful interaction with text corpora. In order to support effective data labeling, we draw explicit inspiration from some of the methodologies presented. Then, we build on these methodologies by augmenting them with machine learning techniques, providing support for effective and scalable data labeling. Thus, by providing a space for qualified individuals to effectively and efficiently create custom labels, our research better enables quality correlative goals for predictive models. Combining social science methodology with semi-supervised learning, we present a scalable annotation interface that serves as an effective alternative to current data labeling practices.