MultiHATHI: A Complete Collection of Multilingual Prose Fiction in the HathiTrust Digital Library
This dataset provides detailed metadata on ca. 10.2 million works of fiction and non-fiction written after 1799 in 521 different languages available in the HathiTrust Digital Library. The dataset bolsters the May 2022 Hathifile by supplying missing predicted fiction tags with a bespoke BERT-based mu...
Main Authors: | Sil Hamilton, Andrew Piper |
---|---|
Format: | Article |
Language: | English |
Published: | Ubiquity Press, 2023-02-01 |
Series: | Journal of Open Humanities Data |
Subjects: | fiction, multilingual fiction, non-english prose, world literature |
Online Access: | https://account.openhumanitiesdata.metajnl.com/index.php/up/article/view/95 |
_version_ | 1797867987774996480 |
---|---|
author | Sil Hamilton; Andrew Piper |
author_facet | Sil Hamilton; Andrew Piper |
author_sort | Sil Hamilton |
collection | DOAJ |
description | This dataset provides detailed metadata on ca. 10.2 million works of fiction and non-fiction written after 1799 in 521 different languages available in the HathiTrust Digital Library. The dataset bolsters the May 2022 Hathifile by supplying missing predicted fiction tags with a bespoke BERT-based multilingual classifier. Our classifier completes the catalogue with an additional 400,000 non-English volumes predicted to be works of fiction, capturing 95% of all works presently provided by HathiTrust. We provide each work with metadata including the work’s genre at the level of fiction or non-fiction, length in pages, original language, and the year the work was published. With a total page count of ca. 1.4 billion pages, our dataset provides researchers with a substantial source of non-English modern literature. We also present insight into how multilingual classifiers can be trained with monolingual data, itself a discovery with implications for the study of lower resource languages. We hope our provisions will accelerate empirical research into non-English prose and literature. |
first_indexed | 2024-04-09T23:50:00Z |
format | Article |
id | doaj.art-2ec57ea5db7e4e07b84dca1fe2461fec |
institution | Directory Open Access Journal |
issn | 2059-481X |
language | English |
last_indexed | 2024-04-09T23:50:00Z |
publishDate | 2023-02-01 |
publisher | Ubiquity Press |
record_format | Article |
series | Journal of Open Humanities Data |
spelling | doaj.art-2ec57ea5db7e4e07b84dca1fe2461fec; 2023-03-17T13:00:20Z; eng; Ubiquity Press; Journal of Open Humanities Data; 2059-481X; 2023-02-01; 93310.5334/johd.9515; MultiHATHI: A Complete Collection of Multilingual Prose Fiction in the HathiTrust Digital Library; Sil Hamilton; 0; https://orcid.org/0000-0002-6579-4628; Andrew Piper; 1; https://orcid.org/0000-0001-9663-5999; Languages, Literatures, and Cultures, McGill University, Montreal; Languages, Literatures, and Cultures, McGill University, Montreal; This dataset provides detailed metadata on ca. 10.2 million works of fiction and non-fiction written after 1799 in 521 different languages available in the HathiTrust Digital Library. The dataset bolsters the May 2022 Hathifile by supplying missing predicted fiction tags with a bespoke BERT-based multilingual classifier. Our classifier completes the catalogue with an additional 400,000 non-English volumes predicted to be works of fiction, capturing 95% of all works presently provided by HathiTrust. We provide each work with metadata including the work’s genre at the level of fiction or non-fiction, length in pages, original language, and the year the work was published. With a total page count of ca. 1.4 billion pages, our dataset provides researchers with a substantial source of non-English modern literature. We also present insight into how multilingual classifiers can be trained with monolingual data, itself a discovery with implications for the study of lower resource languages. We hope our provisions will accelerate empirical research into non-English prose and literature.; https://account.openhumanitiesdata.metajnl.com/index.php/up/article/view/95; fiction; multilingual fiction; non-english prose; world literature |
spellingShingle | Sil Hamilton; Andrew Piper; MultiHATHI: A Complete Collection of Multilingual Prose Fiction in the HathiTrust Digital Library; Journal of Open Humanities Data; fiction; multilingual fiction; non-english prose; world literature |
title | MultiHATHI: A Complete Collection of Multilingual Prose Fiction in the HathiTrust Digital Library |
title_full | MultiHATHI: A Complete Collection of Multilingual Prose Fiction in the HathiTrust Digital Library |
title_fullStr | MultiHATHI: A Complete Collection of Multilingual Prose Fiction in the HathiTrust Digital Library |
title_full_unstemmed | MultiHATHI: A Complete Collection of Multilingual Prose Fiction in the HathiTrust Digital Library |
title_short | MultiHATHI: A Complete Collection of Multilingual Prose Fiction in the HathiTrust Digital Library |
title_sort | multihathi a complete collection of multilingual prose fiction in the hathitrust digital library |
topic | fiction; multilingual fiction; non-english prose; world literature |
url | https://account.openhumanitiesdata.metajnl.com/index.php/up/article/view/95 |
work_keys_str_mv | AT silhamilton multihathiacompletecollectionofmultilingualprosefictioninthehathitrustdigitallibrary AT andrewpiper multihathiacompletecollectionofmultilingualprosefictioninthehathitrustdigitallibrary |