Open Coding for Machine Learning
Data-driven decisions have an unavoidable influence on people’s lives [5], and despite being marketed as fair decision-making tools, predictive models can easily perpetuate the same biases they hope to counteract. Some approaches to reducing this bias include incorporating interactive machine learni...
Main Author: | Price, Magdalena |
---|---|
Other Authors: | Hadfield-Menell, Dylan |
Format: | Thesis |
Published: | Massachusetts Institute of Technology, 2022 |
Online Access: | https://hdl.handle.net/1721.1/145142 |
_version_ | 1811084224747274240 |
---|---|
author | Price, Magdalena |
author2 | Hadfield-Menell, Dylan |
author_facet | Hadfield-Menell, Dylan Price, Magdalena |
author_sort | Price, Magdalena |
collection | MIT |
description | Data-driven decisions have an unavoidable influence on people’s lives [5], and despite being marketed as fair decision-making tools, predictive models can easily perpetuate the same biases they hope to counteract. Some approaches to reducing this bias include incorporating interactive machine learning techniques, modifying the input features of the algorithm, or improving the pre-processing of the dataset [35]. However, even if the prediction model is fair and the raw dataset is fair, unfair labels still present the possibility of adding bias to the system [25].
In particular, predictive models for subjective observations are trained on correlative metrics that may not accurately reflect the nuanced nature of what is being predicted; such a phenomenon may be understood as goal misspecification. Large datasets are especially prone to this phenomenon [35], as the time and cost required to label them demand alternative, less thorough methods of labeling. Thus, we take an approach that analyzes current methods of labeling big data, looking to reduce goal misspecification by modifying the labeling process.
Grounded coding theory [12] presents a modern approach to effectively labeling data from a human perspective, dividing the exploratory process into several stages that encourage thoughtful interaction with text corpora. In order to support effective data labeling, we draw explicit inspiration from some of the methodologies presented. Then, we build on these methodologies by augmenting them with machine learning techniques, providing support for effective and scalable data labeling.
Thus, by providing a space for qualified individuals to effectively and efficiently create custom labels, our research better enables quality correlative goals for predictive models. Combining social science methodology with semi-supervised learning, we present a scalable annotation interface that serves as an effective alternative to current data labeling practices. |
first_indexed | 2024-09-23T12:47:16Z |
format | Thesis |
id | mit-1721.1/145142 |
institution | Massachusetts Institute of Technology |
last_indexed | 2024-09-23T12:47:16Z |
publishDate | 2022 |
publisher | Massachusetts Institute of Technology |
record_format | dspace |
spelling | mit-1721.1/145142 2022-08-30T03:33:23Z Open Coding for Machine Learning Price, Magdalena Hadfield-Menell, Dylan Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science M.Eng. 2022-08-29T16:36:09Z 2022-08-29T16:36:09Z 2022-05 2022-05-27T16:19:29.577Z Thesis https://hdl.handle.net/1721.1/145142 In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology |
spellingShingle | Price, Magdalena Open Coding for Machine Learning |
title | Open Coding for Machine Learning |
title_full | Open Coding for Machine Learning |
title_fullStr | Open Coding for Machine Learning |
title_full_unstemmed | Open Coding for Machine Learning |
title_short | Open Coding for Machine Learning |
title_sort | open coding for machine learning |
url | https://hdl.handle.net/1721.1/145142 |
work_keys_str_mv | AT pricemagdalena opencodingformachinelearning |