From Structured Document To Structured Knowledge
Structured documents, such as scientific literature and medical records, are rich resources of knowledge. However, most natural language processing techniques treat these documents as plain text, neglecting the significance of layout structure and visual signals. Modeling such structures is essentia...
Main Author: | Qian, Yujie |
---|---|
Other Authors: | Barzilay, Regina |
Format: | Thesis |
Published: | Massachusetts Institute of Technology 2023 |
Online Access: | https://hdl.handle.net/1721.1/151225 |
_version_ | 1826215346259886080 |
---|---|
author | Qian, Yujie |
author2 | Barzilay, Regina |
author_facet | Barzilay, Regina; Qian, Yujie |
author_sort | Qian, Yujie |
collection | MIT |
description | Structured documents, such as scientific literature and medical records, are rich resources of knowledge. However, most natural language processing techniques treat these documents as plain text, neglecting the significance of layout structure and visual signals. Modeling such structures is essential for a comprehensive understanding of these documents. This thesis presents novel algorithms for extracting structured knowledge from structured documents.
First, we propose GraphIE, an information extraction framework designed to model the non-local and non-sequential dependencies in structured documents. GraphIE leverages structural information through graph neural networks to enhance word-level tagging predictions. In evaluations across three extraction tasks, GraphIE consistently outperforms a sequential model that operates solely on plain text.
Next, we delve into information extraction in the chemistry domain. Scientific literature often depicts molecules and reactions in the form of infographics. To extract these molecules, we develop MolScribe, a tool that translates a molecular image into its graph structure. MolScribe integrates symbolic chemistry constraints within an image-to-graph generation model, demonstrating robust performance in handling diverse drawing styles and conventions. To extract reaction schemes, we propose RxnScribe, which parses reaction diagrams through a sequence generation formulation. Despite being trained on a modest dataset, RxnScribe achieves strong performance across different types of diagrams.
Finally, we introduce TextReact, a novel method that directly augments predictive chemistry with text retrieval, bypassing the intermediate information extraction step. Our experiments on reaction condition recommendation and retrosynthetic prediction demonstrate TextReact’s efficacy in retrieving relevant information from the literature and generalizing to new inputs. |
first_indexed | 2024-09-23T16:24:48Z |
format | Thesis |
id | mit-1721.1/151225 |
institution | Massachusetts Institute of Technology |
last_indexed | 2024-09-23T16:24:48Z |
publishDate | 2023 |
publisher | Massachusetts Institute of Technology |
record_format | dspace |
spelling | mit-1721.1/151225 2023-08-01T03:04:33Z From Structured Document To Structured Knowledge Qian, Yujie Barzilay, Regina Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Structured documents, such as scientific literature and medical records, are rich resources of knowledge. However, most natural language processing techniques treat these documents as plain text, neglecting the significance of layout structure and visual signals. Modeling such structures is essential for a comprehensive understanding of these documents. This thesis presents novel algorithms for extracting structured knowledge from structured documents. First, we propose GraphIE, an information extraction framework designed to model the non-local and non-sequential dependencies in structured documents. GraphIE leverages structural information through graph neural networks to enhance word-level tagging predictions. In evaluations across three extraction tasks, GraphIE consistently outperforms a sequential model that operates solely on plain text. Next, we delve into information extraction in the chemistry domain. Scientific literature often depicts molecules and reactions in the form of infographics. To extract these molecules, we develop MolScribe, a tool that translates a molecular image into its graph structure. MolScribe integrates symbolic chemistry constraints within an image-to-graph generation model, demonstrating robust performance in handling diverse drawing styles and conventions. To extract reaction schemes, we propose RxnScribe, which parses reaction diagrams through a sequence generation formulation. Despite being trained on a modest dataset, RxnScribe achieves strong performance across different types of diagrams. Finally, we introduce TextReact, a novel method that directly augments predictive chemistry with text retrieval, bypassing the intermediate information extraction step. Our experiments on reaction condition recommendation and retrosynthetic prediction demonstrate TextReact’s efficacy in retrieving relevant information from the literature and generalizing to new inputs. Ph.D. 2023-07-31T19:24:10Z 2023-07-31T19:24:10Z 2023-06 2023-07-13T14:26:36.062Z Thesis https://hdl.handle.net/1721.1/151225 In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology |
spellingShingle | Qian, Yujie From Structured Document To Structured Knowledge |
title | From Structured Document To Structured Knowledge |
title_full | From Structured Document To Structured Knowledge |
title_fullStr | From Structured Document To Structured Knowledge |
title_full_unstemmed | From Structured Document To Structured Knowledge |
title_short | From Structured Document To Structured Knowledge |
title_sort | from structured document to structured knowledge |
url | https://hdl.handle.net/1721.1/151225 |
work_keys_str_mv | AT qianyujie fromstructureddocumenttostructuredknowledge |