From Structured Document To Structured Knowledge

Structured documents, such as scientific literature and medical records, are rich resources of knowledge. However, most natural language processing techniques treat these documents as plain text, neglecting the significance of layout structure and visual signals. Modeling such structures is essentia...

Bibliographic Details
Main Author: Qian, Yujie
Other Authors: Barzilay, Regina
Format: Thesis
Published: Massachusetts Institute of Technology 2023
Online Access: https://hdl.handle.net/1721.1/151225
_version_ 1826215346259886080
author Qian, Yujie
author2 Barzilay, Regina
author_facet Barzilay, Regina
Qian, Yujie
author_sort Qian, Yujie
collection MIT
description Structured documents, such as scientific literature and medical records, are rich resources of knowledge. However, most natural language processing techniques treat these documents as plain text, neglecting the significance of layout structure and visual signals. Modeling such structures is essential for a comprehensive understanding of these documents. This thesis presents novel algorithms for extracting structured knowledge from structured documents. First, we propose GraphIE, an information extraction framework designed to model the non-local and non-sequential dependencies in structured documents. GraphIE leverages structural information through graph neural networks to enhance word-level tagging predictions. In evaluations across three extraction tasks, GraphIE consistently outperforms a sequential model that operates solely on plain text. Next, we delve into information extraction in the chemistry domain. Scientific literature often depicts molecules and reactions in the form of infographics. To extract these molecules, we develop MolScribe, a tool that translates a molecular image into its graph structure. MolScribe integrates symbolic chemistry constraints within an image-to-graph generation model, demonstrating robust performance in handling diverse drawing styles and conventions. To extract reaction schemes, we propose RxnScribe, which parses reaction diagrams through a sequence generation formulation. Despite being trained on a modest dataset, RxnScribe achieves strong performance across different types of diagrams. Finally, we introduce TextReact, a novel method that directly augments predictive chemistry with text retrieval, bypassing the intermediate information extraction step. Our experiments on reaction condition recommendation and retrosynthetic prediction demonstrate TextReact’s efficacy in retrieving relevant information from the literature and generalizing to new inputs.
first_indexed 2024-09-23T16:24:48Z
format Thesis
id mit-1721.1/151225
institution Massachusetts Institute of Technology
last_indexed 2024-09-23T16:24:48Z
publishDate 2023
publisher Massachusetts Institute of Technology
record_format dspace
spelling mit-1721.1/151225 2023-08-01T03:04:33Z From Structured Document To Structured Knowledge Qian, Yujie Barzilay, Regina Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Structured documents, such as scientific literature and medical records, are rich resources of knowledge. However, most natural language processing techniques treat these documents as plain text, neglecting the significance of layout structure and visual signals. Modeling such structures is essential for a comprehensive understanding of these documents. This thesis presents novel algorithms for extracting structured knowledge from structured documents. First, we propose GraphIE, an information extraction framework designed to model the non-local and non-sequential dependencies in structured documents. GraphIE leverages structural information through graph neural networks to enhance word-level tagging predictions. In evaluations across three extraction tasks, GraphIE consistently outperforms a sequential model that operates solely on plain text. Next, we delve into information extraction in the chemistry domain. Scientific literature often depicts molecules and reactions in the form of infographics. To extract these molecules, we develop MolScribe, a tool that translates a molecular image into its graph structure. MolScribe integrates symbolic chemistry constraints within an image-to-graph generation model, demonstrating robust performance in handling diverse drawing styles and conventions. To extract reaction schemes, we propose RxnScribe, which parses reaction diagrams through a sequence generation formulation. Despite being trained on a modest dataset, RxnScribe achieves strong performance across different types of diagrams. Finally, we introduce TextReact, a novel method that directly augments predictive chemistry with text retrieval, bypassing the intermediate information extraction step. Our experiments on reaction condition recommendation and retrosynthetic prediction demonstrate TextReact’s efficacy in retrieving relevant information from the literature and generalizing to new inputs. Ph.D. 2023-07-31T19:24:10Z 2023-07-31T19:24:10Z 2023-06 2023-07-13T14:26:36.062Z Thesis https://hdl.handle.net/1721.1/151225 In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology
spellingShingle Qian, Yujie
From Structured Document To Structured Knowledge
title From Structured Document To Structured Knowledge
title_full From Structured Document To Structured Knowledge
title_fullStr From Structured Document To Structured Knowledge
title_full_unstemmed From Structured Document To Structured Knowledge
title_short From Structured Document To Structured Knowledge
title_sort from structured document to structured knowledge
url https://hdl.handle.net/1721.1/151225
work_keys_str_mv AT qianyujie fromstructureddocumenttostructuredknowledge