EEBO-TCP Phase 1 Texts (XML Files TEI P3)
From 1 January 2015, the books transcribed in the first phase of EEBO-TCP (Early English Books Online – Text Creation Partnership) entered the public domain. The goal of the Text Creation Partnership is to create accurate XML/SGML-encoded electronic text editions of early printed books, transcribing and...
Main Authors: | Siefring, J; Huber, E; Blaney, J; Charles, S; Willcox, P |
---|---|
Other Authors: | Popham, M |
Format: | Dataset |
Language: | English |
Published: | University of Oxford, 2015 |
Subjects: | English literature--Early modern; Humanities--Digital libraries; Editing |
_version_ | 1797088458300194816 |
---|---|
author | Siefring, J Huber, E Blaney, J Charles, S Willcox, P |
author2 | Popham, M |
author_facet | Popham, M Siefring, J Huber, E Blaney, J Charles, S Willcox, P |
author_sort | Siefring, J |
collection | OXFORD |
description | From 1 January 2015, the books transcribed in the first phase of EEBO-TCP (Early English Books Online – Text Creation Partnership) entered the public domain. The goal of the Text Creation Partnership is to create accurate XML/SGML-encoded electronic text editions of early printed books, transcribing and encoding the page images of books from ProQuest’s Early English Books Online. The TCP's work, and hence the transcriptions it produces, are jointly funded and owned by more than 150 libraries worldwide. The TCP began in 1999 as a partnership led by the University of Michigan Library and the Bodleian Libraries of the University of Oxford, working closely with ProQuest and receiving funding in the UK from Jisc. 25,368 texts are deposited here as XML files, divided into nine zipped folders by year of creation, from EEBO-TCP Phase 1 XML Files TEI P3 (2001) to EEBO-TCP Phase 1 XML Files TEI P3 (2009), with an additional zipped folder containing three sort lists. The XML files were derived from SGML files, and the character encoding of the XML is UTF-8 Unicode. Character entities with Unicode equivalents have been converted to the appropriate code point, rendered in UTF-8; entities with no Unicode equivalent have generally been captured as text strings within curly braces, e.g. {quod} for the Latin brevigraph "quod"; in a few cases, a look-alike Unicode character has been substituted. The same is true of entities whose Unicode equivalents are poorly supported by fonts or browsers. The intent of these character transformations is to supply a text that is readily displayable, and therefore human-readable. The XML files have been combined with bibliographic headers, which are also XML in format and UTF-8 in character encoding. 
These headers derive largely, and ultimately, from MARC catalogue records for the books in question, so each file contains not only the transcription data itself but also additional metadata, both bibliographic (describing the book) and administrative (describing the process that produced the transcript). The texts have been encoded to the Text Encoding Initiative (TEI) P3 standard (though TEI P5 versions also exist). Users of the data should be aware of the process by which the TCP texts were created, and therefore of the assumptions that can be made about the data: 1) Text selection was based on the New Cambridge Bibliography of English Literature (NCBEL). If an author (or, for an anonymous work, the title) appears in NCBEL, then their works are eligible for inclusion. Selection was intended to range over a wide variety of subject areas, to reflect the true nature of the print record of the period. In general, first editions of a work in English were prioritized, although there are a number of works in other languages, notably Latin and Welsh, and sometimes a second or later edition of a work was chosen if there was a compelling reason to do so; 2) Image sets were sent to external keying companies for transcription and basic encoding. Quality assurance was then carried out by editorial teams in Oxford and Michigan. 5% (or 5 pages, whichever is the greater) of each text was proofread for accuracy, and texts that did not meet QA standards were returned for rekeying. After proofreading, the encoding was enhanced and/or corrected, and characters marked as illegible were corrected where possible, up to a limit of 100 instances per text. Any remaining illegible characters were encoded as < GAP > elements. Understanding these processes should make clear that, while the overall quality of TCP data is very good, some errors will remain and some readable characters will be marked as illegible. 
Users should bear in mind that in all likelihood such instances will never have been looked at by a TCP editor; 3) Special characters such as alchemical or astronomical symbols have been captured, but material in non-Roman alphabets has not been captured, except where it forms part or all of the title of a work, or there is another practical reason to include it. Any handwritten material has been excluded, and damage is usually indicated by a < GAP DESC="illegible" >, but no information about the reason for the damage is given; 4) Encoding of TCP texts is, in the main, structural. Textual divisions (and hierarchies) are defined and given a type. Features such as lists, tables, speeches, speakers, stage directions, verse (< l > and < lg >) and prose (< p >) are encoded, as are quotations and letters (including opening and closing elements). Features which would require more significant editorial input or specialist expertise, such as complex mathematical or musical notation, are not encoded. The presence of such complex material is indicated by e.g. < GAP DESC="math" > or < GAP DESC="music" >. Every effort has been made to be consistent across the project, but inevitably differences in the encoding of similar features will be found, especially when comparing texts created in the very early days of the project with those created years later, when editorial guidelines were more fully established. The Early English Books Online Text Creation Partnership (EEBO-TCP) ran from 1999 as an innovative collaboration between the Universities of Oxford and Michigan, funded by Jisc in the UK and by over 150 academic partner institutions worldwide. Its aim was to capture the earliest extant edition of every English-language work published during the first two centuries of printing in England, and to convert this material into fully-searchable texts. The EEBO-TCP corpus covers the period from 1473 to 1700 and is estimated to comprise more than two million pages and nearly a billion words. 
It represents a history of the printed word in England from the birth of the printing press to the reign of William and Mary, and it contains texts of incomparable significance for research across all academic disciplines, including literature, history, philosophy, linguistics, theology, music, fine arts, education, mathematics, and science. Having previously been available only to academic institutions which subscribe to ProQuest’s Early English Books Online resource, over 25,000 texts from the first phase of EEBO-TCP were made freely available as open data in the public domain from January 2015. |
first_indexed | 2024-03-07T02:50:23Z |
format | Dataset |
id | oxford-uuid:ad7da8fc-cd8e-4637-8b7c-99498436dbaa |
institution | University of Oxford |
language | English |
last_indexed | 2024-03-07T02:50:23Z |
publishDate | 2015 |
publisher | University of Oxford |
record_format | dspace |
spelling | oxford-uuid:ad7da8fc-cd8e-4637-8b7c-99498436dbaa 2022-03-27T03:35:59Z EEBO-TCP Phase 1 Texts (XML Files TEI P3) Dataset http://purl.org/coar/resource_type/c_ddb1 uuid:ad7da8fc-cd8e-4637-8b7c-99498436dbaa English literature--Early modern Humanities--Digital libraries Editing English ORA Deposit University of Oxford 2015 Siefring, J Huber, E Blaney, J Charles, S Willcox, P Popham, M Flynn, A MacCrossan, C |
spellingShingle | English literature--Early modern Humanities--Digital libraries Editing Siefring, J Huber, E Blaney, J Charles, S Willcox, P EEBO-TCP Phase 1 Texts (XML Files TEI P3) |
title | EEBO-TCP Phase 1 Texts (XML Files TEI P3) |
title_full | EEBO-TCP Phase 1 Texts (XML Files TEI P3) |
title_fullStr | EEBO-TCP Phase 1 Texts (XML Files TEI P3) |
title_full_unstemmed | EEBO-TCP Phase 1 Texts (XML Files TEI P3) |
title_short | EEBO-TCP Phase 1 Texts (XML Files TEI P3) |
title_sort | eebo tcp phase 1 texts xml files tei p3 |
topic | English literature--Early modern Humanities--Digital libraries Editing |
work_keys_str_mv | AT siefringj eebotcpphase1textsxmlfilesteip3 AT hubere eebotcpphase1textsxmlfilesteip3 AT blaneyj eebotcpphase1textsxmlfilesteip3 AT charless eebotcpphase1textsxmlfilesteip3 AT willcoxp eebotcpphase1textsxmlfilesteip3 |
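
The description in this record notes that entities with no Unicode equivalent appear as text strings in curly braces (e.g. {quod}), and that illegible or uncaptured material is marked with GAP elements carrying a DESC attribute. A minimal Python sketch of how a user might tally both in one file is given below. The sample fragment, the function name `summarize_tcp_markup`, and the case-insensitive matching of element and attribute names are assumptions made for illustration; they are not taken from the deposited files or their documentation.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment written for this sketch; it is NOT from a real TCP file.
SAMPLE = """<TEXT>
  <P>He saith {quod} the thing is <GAP DESC="illegible"/> and so forth.</P>
  <P>A figure follows: <GAP DESC="math"/> and a tune: <GAP DESC="music"/></P>
  <LG><L>One line of verse with a <GAP DESC="illegible"/> word</L></LG>
</TEXT>"""

# Matches strings such as {quod}: braces around a run of non-space characters.
BRACED_ENTITY = re.compile(r"\{([^{}\s]+)\}")


def summarize_tcp_markup(xml_text):
    """Return (gap_counts, braced_counts) for one XML document string."""
    root = ET.fromstring(xml_text)

    gap_counts = Counter()
    for elem in root.iter():
        # Assumption: compare names case-insensitively, since the case of
        # element and attribute names may differ between file versions.
        if elem.tag.lower() == "gap":
            desc = next(
                (v for k, v in elem.attrib.items() if k.lower() == "desc"),
                "unspecified",
            )
            gap_counts[desc] += 1

    # Collect all character data, then look for curly-brace entity strings.
    full_text = "".join(root.itertext())
    braced_counts = Counter(BRACED_ENTITY.findall(full_text))
    return gap_counts, braced_counts


gap_counts, braced_counts = summarize_tcp_markup(SAMPLE)
print(dict(gap_counts))
print(dict(braced_counts))
```

A count of this kind gives a rough indication of how much of a given text was left unresolved after proofreading; it does not distinguish damage from other causes, since the record states that no reason for damage is given.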