From Structured Document To Structured Knowledge

Structured documents, such as scientific literature and medical records, are rich resources of knowledge. However, most natural language processing techniques treat these documents as plain text, neglecting the significance of layout structure and visual signals. Modeling such structures is essentia...

Bibliographic Details
Main Author:	Qian, Yujie
Other Authors:	Barzilay, Regina
Format:	Thesis
Published:	Massachusetts Institute of Technology 2023
Online Access:	https://hdl.handle.net/1721.1/151225

_version_	1826215346259886080
author	Qian, Yujie
author2	Barzilay, Regina
author_facet	Barzilay, Regina Qian, Yujie
author_sort	Qian, Yujie
collection	MIT
description	Structured documents, such as scientific literature and medical records, are rich resources of knowledge. However, most natural language processing techniques treat these documents as plain text, neglecting the significance of layout structure and visual signals. Modeling such structures is essential for a comprehensive understanding of these documents. This thesis presents novel algorithms for extracting structured knowledge from structured documents. First, we propose GraphIE, an information extraction framework designed to model the non-local and non-sequential dependencies in structured documents. GraphIE leverages structural information through graph neural networks to enhance word-level tagging predictions. In evaluations across three extraction tasks, GraphIE consistently outperforms a sequential model that operates solely on plain text. Next, we delve into information extraction in the chemistry domain. Scientific literature often depicts molecules and reactions in the form of infographics. To extract these molecules, we develop MolScribe, a tool that translates a molecular image into its graph structure. MolScribe integrates symbolic chemistry constraints within an image-to-graph generation model, demonstrating robust performance in handling diverse drawing styles and conventions. To extract reaction schemes, we propose RxnScribe, which parses reaction diagrams through a sequence generation formulation. Despite being trained on a modest dataset, RxnScribe achieves strong performance across different types of diagrams. Finally, we introduce TextReact, a novel method that directly augments predictive chemistry with text retrieval, bypassing the intermediate information extraction step. Our experiments on reaction condition recommendation and retrosynthetic prediction demonstrate TextReact’s efficacy in retrieving relevant information from the literature and generalizing to new inputs.
first_indexed	2024-09-23T16:24:48Z
format	Thesis
id	mit-1721.1/151225
institution	Massachusetts Institute of Technology
last_indexed	2024-09-23T16:24:48Z
publishDate	2023
publisher	Massachusetts Institute of Technology
record_format	dspace
spelling	mit-1721.1/151225 2023-08-01T03:04:33Z From Structured Document To Structured Knowledge Qian, Yujie Barzilay, Regina Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Structured documents, such as scientific literature and medical records, are rich resources of knowledge. However, most natural language processing techniques treat these documents as plain text, neglecting the significance of layout structure and visual signals. Modeling such structures is essential for a comprehensive understanding of these documents. This thesis presents novel algorithms for extracting structured knowledge from structured documents. First, we propose GraphIE, an information extraction framework designed to model the non-local and non-sequential dependencies in structured documents. GraphIE leverages structural information through graph neural networks to enhance word-level tagging predictions. In evaluations across three extraction tasks, GraphIE consistently outperforms a sequential model that operates solely on plain text. Next, we delve into information extraction in the chemistry domain. Scientific literature often depicts molecules and reactions in the form of infographics. To extract these molecules, we develop MolScribe, a tool that translates a molecular image into its graph structure. MolScribe integrates symbolic chemistry constraints within an image-to-graph generation model, demonstrating robust performance in handling diverse drawing styles and conventions. To extract reaction schemes, we propose RxnScribe, which parses reaction diagrams through a sequence generation formulation. Despite being trained on a modest dataset, RxnScribe achieves strong performance across different types of diagrams. Finally, we introduce TextReact, a novel method that directly augments predictive chemistry with text retrieval, bypassing the intermediate information extraction step. Our experiments on reaction condition recommendation and retrosynthetic prediction demonstrate TextReact’s efficacy in retrieving relevant information from the literature and generalizing to new inputs. Ph.D. 2023-07-31T19:24:10Z 2023-07-31T19:24:10Z 2023-06 2023-07-13T14:26:36.062Z Thesis https://hdl.handle.net/1721.1/151225 In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology
spellingShingle	Qian, Yujie From Structured Document To Structured Knowledge
title	From Structured Document To Structured Knowledge
title_full	From Structured Document To Structured Knowledge
title_fullStr	From Structured Document To Structured Knowledge
title_full_unstemmed	From Structured Document To Structured Knowledge
title_short	From Structured Document To Structured Knowledge
title_sort	from structured document to structured knowledge
url	https://hdl.handle.net/1721.1/151225
work_keys_str_mv	AT qianyujie fromstructureddocumenttostructuredknowledge

Similar Items