Using Language Models to Understand Molecular Structures

Bibliographic Details
Main Author: Fan, Vincent K.
Other Authors: Barzilay, Regina
Format: Thesis
Published: Massachusetts Institute of Technology 2024
Online Access: https://hdl.handle.net/1721.1/156795
collection MIT
description In data-rich modalities such as text and images, large foundation models have demonstrated remarkable capabilities. In the life sciences, however, datasets of comparable scale are prohibitively costly to assemble, which makes it imperative to leverage advances in language modelling to improve machine learning techniques for the life sciences. This thesis details research in two such directions: information extraction and text retrieval. Information extraction from chemistry literature is vital for constructing up-to-date reaction databases. Complete extraction requires combining information across text, tables, and figures, whereas prior work has mainly investigated extracting reactions from single modalities. In this thesis, I present OpenChemIE to address this challenge and enable the extraction of reaction data at the document level. OpenChemIE approaches the problem in two steps: it first extracts relevant information from individual modalities with specialized neural models, and then integrates the results via chemistry-informed algorithms to obtain a final list of reactions. To evaluate OpenChemIE, I meticulously annotated a challenging dataset of reaction schemes with R-groups, on which it achieves an F1 score of 69.5%. Additionally, the reaction extraction results of OpenChemIE attain an accuracy of 64.3% when directly compared against the Reaxys chemical database. OpenChemIE is best suited to information extraction from organic chemistry literature, where molecules are generally depicted as planar graphs or written in text and can be consolidated into a SMILES format. I also detail preliminary research on developing a tool to retrieve full-text documents that are relevant to specific protein sequences. I describe the dataset, which is currently under construction, as well as experiments pointing to the promise of this approach.
first_indexed 2024-09-23T09:08:39Z
format Thesis
id mit-1721.1/156795
institution Massachusetts Institute of Technology
last_indexed 2024-09-23T09:08:39Z
publishDate 2024
publisher Massachusetts Institute of Technology
record_format dspace
spelling mit-1721.1/156795 2024-09-17T03:58:13Z Using Language Models to Understand Molecular Structures Fan, Vincent K. Barzilay, Regina Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science M.Eng. 2024-09-16T13:49:44Z 2024-09-16T13:49:44Z 2024-05 2024-07-11T14:36:51.442Z Thesis https://hdl.handle.net/1721.1/156795 In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology
title Using Language Models to Understand Molecular Structures
url https://hdl.handle.net/1721.1/156795