A machine reading system for assembling synthetic paleontological databases.

Many aspects of macroevolutionary theory and our understanding of biotic responses to global environmental change derive from literature-based compilations of paleontological data. Existing manually assembled databases are, however, incomplete and difficult to assess and enhance with new data types....

Bibliographic Details
Main Authors: Shanan E Peters, Ce Zhang, Miron Livny, Christopher Ré
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2014-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC4250071?pdf=render
collection DOAJ
description Many aspects of macroevolutionary theory and our understanding of biotic responses to global environmental change derive from literature-based compilations of paleontological data. Existing manually assembled databases are, however, incomplete and difficult to assess and enhance with new data types. Here, we develop and validate the quality of a machine reading system, PaleoDeepDive, that automatically locates and extracts data from heterogeneous text, tables, and figures in publications. PaleoDeepDive performs comparably to humans in several complex data extraction and inference tasks and generates congruent synthetic results that describe the geological history of taxonomic diversity and genus-level rates of origination and extinction. Unlike traditional databases, PaleoDeepDive produces a probabilistic database that systematically improves as information is added. We show that the system can readily accommodate sophisticated data types, such as morphological data in biological illustrations and associated textual descriptions. Our machine reading approach to scientific data integration and synthesis brings within reach many questions that are currently underdetermined and does so in ways that may stimulate entirely new modes of inquiry.
first_indexed 2024-12-14T11:44:48Z
format Article
id doaj.art-c8c5a154049646a8b8a734cac747c7f5
institution Directory Open Access Journal
issn 1932-6203
language English
last_indexed 2024-12-14T11:44:48Z
publishDate 2014-01-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS ONE
spelling doaj.art-c8c5a154049646a8b8a734cac747c7f5 | 2022-12-21T23:02:39Z | eng | Public Library of Science (PLoS) | PLoS ONE | 1932-6203 | 2014-01-01 | 9(12): e113523 | 10.1371/journal.pone.0113523 | A machine reading system for assembling synthetic paleontological databases. | Shanan E Peters; Ce Zhang; Miron Livny; Christopher Ré | http://europepmc.org/articles/PMC4250071?pdf=render