Using Language Models to Understand Molecular Structures

In data-rich modalities such as text and images, large foundation models have demonstrated remarkable capabilities. However, in the life sciences, datasets of comparable scale are prohibitively costly to assemble, pointing to the need to leverage advances in language modelling to improve...

Bibliographic Details
Main Author:	Fan, Vincent K.
Other Authors:	Barzilay, Regina
Format:	Thesis
Published:	Massachusetts Institute of Technology 2024
Online Access:	https://hdl.handle.net/1721.1/156795

author	Fan, Vincent K.
author2	Barzilay, Regina
collection	MIT
description	In data-rich modalities such as text and images, large foundation models have demonstrated remarkable capabilities. However, in the life sciences, datasets of comparable scale are prohibitively costly to assemble, pointing to the need to leverage advances in language modelling to improve machine learning techniques for the life sciences. This thesis details research in two such directions: information extraction and text retrieval. Information extraction from chemistry literature is vital for constructing up-to-date reaction databases. Complete extraction requires combining information across text, tables, and figures, whereas prior work has mainly investigated extracting reactions from single modalities. In this thesis, I present OpenChemIE to address this challenge and enable the extraction of reaction data at the document level. OpenChemIE approaches the problem in two steps: extracting relevant information from individual modalities with specialized neural models, and then integrating the results via chemistry-informed algorithms to obtain a final list of reactions. To evaluate OpenChemIE, I annotated a challenging dataset of reaction schemes with R-groups, on which it achieves an F1 score of 69.5%. Additionally, the reaction extraction results of OpenChemIE attain an accuracy of 64.3% when directly compared against the Reaxys chemical database. OpenChemIE is best suited to information extraction from organic chemistry literature, where molecules are generally depicted as planar graphs or written in text and can be consolidated into a SMILES format. I also detail preliminary research on developing a tool to retrieve full-text documents that are relevant to specific protein sequences. I describe the dataset, which is currently under construction, as well as experiments that point to the promise of this approach.
first_indexed	2024-09-23T09:08:39Z
format	Thesis
id	mit-1721.1/156795
institution	Massachusetts Institute of Technology
last_indexed	2024-09-23T09:08:39Z
publishDate	2024
publisher	Massachusetts Institute of Technology
record_format	dspace
spelling	mit-1721.1/156795 2024-09-17T03:58:13Z Using Language Models to Understand Molecular Structures Fan, Vincent K. Barzilay, Regina Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science M.Eng. 2024-09-16T13:49:44Z 2024-09-16T13:49:44Z 2024-05 2024-07-11T14:36:51.442Z Thesis https://hdl.handle.net/1721.1/156795 In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology
title	Using Language Models to Understand Molecular Structures
url	https://hdl.handle.net/1721.1/156795

Similar Items