MultiHATHI: A Complete Collection of Multilingual Prose Fiction in the HathiTrust Digital Library

This dataset provides detailed metadata on ca. 10.2 million works of fiction and non-fiction written after 1799 in 521 different languages available in the HathiTrust Digital Library. The dataset bolsters the May 2022 Hathifile by supplying missing predicted fiction tags with a bespoke BERT-based mu...


Bibliographic Details
Main Authors: Sil Hamilton, Andrew Piper
Format: Article
Language: English
Published: Ubiquity Press 2023-02-01
Series: Journal of Open Humanities Data
Subjects: fiction, multilingual fiction, non-english prose, world literature
Online Access: https://account.openhumanitiesdata.metajnl.com/index.php/up/article/view/95
_version_ 1797867987774996480
author Sil Hamilton
Andrew Piper
author_facet Sil Hamilton
Andrew Piper
author_sort Sil Hamilton
collection DOAJ
description This dataset provides detailed metadata on ca. 10.2 million works of fiction and non-fiction written after 1799 in 521 different languages available in the HathiTrust Digital Library. The dataset bolsters the May 2022 Hathifile by supplying missing predicted fiction tags with a bespoke BERT-based multilingual classifier. Our classifier completes the catalogue with an additional 400,000 non-English volumes predicted to be works of fiction, capturing 95% of all works presently provided by HathiTrust. We provide each work with metadata including the work’s genre at the level of fiction or non-fiction, length in pages, original language, and the year the work was published. With a total page count of ca. 1.4 billion pages, our dataset provides researchers with a substantial source of non-English modern literature. We also present insight into how multilingual classifiers can be trained with monolingual data, itself a discovery with implications for the study of lower resource languages. We hope our provisions will accelerate empirical research into non-English prose and literature.
first_indexed 2024-04-09T23:50:00Z
format Article
id doaj.art-2ec57ea5db7e4e07b84dca1fe2461fec
institution Directory Open Access Journal
issn 2059-481X
language English
last_indexed 2024-04-09T23:50:00Z
publishDate 2023-02-01
publisher Ubiquity Press
record_format Article
series Journal of Open Humanities Data
spelling doaj.art-2ec57ea5db7e4e07b84dca1fe2461fec
2023-03-17T13:00:20Z
eng
Ubiquity Press
Journal of Open Humanities Data
2059-481X
2023-02-01
93310.5334/johd.9515
MultiHATHI: A Complete Collection of Multilingual Prose Fiction in the HathiTrust Digital Library
Sil Hamilton 0 https://orcid.org/0000-0002-6579-4628
Andrew Piper 1 https://orcid.org/0000-0001-9663-5999
Languages, Literatures, and Cultures, McGill University, Montreal
Languages, Literatures, and Cultures, McGill University, Montreal
This dataset provides detailed metadata on ca. 10.2 million works of fiction and non-fiction written after 1799 in 521 different languages available in the HathiTrust Digital Library. The dataset bolsters the May 2022 Hathifile by supplying missing predicted fiction tags with a bespoke BERT-based multilingual classifier. Our classifier completes the catalogue with an additional 400,000 non-English volumes predicted to be works of fiction, capturing 95% of all works presently provided by HathiTrust. We provide each work with metadata including the work’s genre at the level of fiction or non-fiction, length in pages, original language, and the year the work was published. With a total page count of ca. 1.4 billion pages, our dataset provides researchers with a substantial source of non-English modern literature. We also present insight into how multilingual classifiers can be trained with monolingual data, itself a discovery with implications for the study of lower resource languages. We hope our provisions will accelerate empirical research into non-English prose and literature.
https://account.openhumanitiesdata.metajnl.com/index.php/up/article/view/95
fiction
multilingual fiction
non-english prose
world literature
title MultiHATHI: A Complete Collection of Multilingual Prose Fiction in the HathiTrust Digital Library
topic fiction
multilingual fiction
non-english prose
world literature
url https://account.openhumanitiesdata.metajnl.com/index.php/up/article/view/95