Comment to U.S Copyright Office on Data Provenance and Copyright

Scholars have paid much attention to the copying of raw data to train and develop machine learning models. Many have argued that such use of raw data, derived either directly from the internet or from a dataset, is protected under fair use such that the owners of the original work may not be success...

Bibliographic Details
Main Authors:	Mahari, Robert; Shayne, Longpre; Donewald, Lisette; Polozov, Alan; Pentland, Alex 'Sandy'; Lipsitz, Ari
Other Authors:	MIT Connection Science (Research institute)
Format:	Other
Language:	en_US
Published:	U.S. Copyright Office 2024
Subjects:	Computational Law; Data Provenance; Fair Use; Copyright Law; Large Language Models; Generative AI; AI Regulation; Regulation by Design
Online Access:	https://hdl.handle.net/1721.1/154171

collection	MIT
description	Scholars have paid much attention to the copying of raw data to train and develop machine learning models. Many have argued that such use of raw data, derived either directly from the internet or from a dataset, is protected under fair use such that the owners of the original work may not be successful in a claim for copyright infringement. We refer to such compilations of data derived from another source, and repurposed for machine learning, as unsupervised datasets. Less attention, however, has been paid to supervised datasets, which we define as datasets containing data created for the sole purpose of training machine learning models (mainly for finetuning and alignment). Supervised datasets may likely contain copyrightable contributions from the dataset creators in the form of annotations. To the extent that dataset creators likely have copyright interests in their supervised datasets, model developers must either rely on fair use or a license in order to avoid infringing the work of dataset creators. However, we argue that the unauthorized use of supervised datasets is unlikely to be protected by fair use. Whereas the use of unsupervised data for training machine learning is distinct from the original purpose of the unsupervised data, the unauthorized use of supervised datasets for training machine learning is identical to its original purpose. Fair use would therefore likely not apply to the annotations, labels, and curated comments in supervised datasets. For this reason, having a valid license to a supervised dataset is perhaps particularly critical. Unfortunately, our recent research has found that the licenses attached to publicly available supervised datasets are often imprecise, inaccurate, or missing altogether. Model developers may be exposing themselves to unknown amounts of liability. We argue that this is a problem that needs to be addressed and propose a tool that might serve as a launching point for ensuring license transparency.
first_indexed	2024-09-23T16:51:43Z
id	mit-1721.1/154171
institution	Massachusetts Institute of Technology
last_indexed	2025-02-19T04:25:57Z
record_format	dspace
spelling	mit-1721.1/154171 2025-02-06T18:53:04Z 2024-04-17T17:35:01Z 2024-04-17T17:35:01Z 2023-11-01 Other https://hdl.handle.net/1721.1/154171 en_US Attribution-NonCommercial-NoDerivs 3.0 United States http://creativecommons.org/licenses/by-nc-nd/3.0/us/ application/pdf U.S. Copyright Office
