Augmenting Inputs using a Novel Figure-to-Text Pipeline to Assist Visual Language Models in Answering Scientific Domain Queries


Bibliographic Details
Main Author: Gupta, Sejal
Other Authors: Cafarella, Michael
Format: Thesis
Published: Massachusetts Institute of Technology, 2024
Online Access: https://hdl.handle.net/1721.1/156824
Description: Recent advancements in visual language models (VLMs) have transformed the way we interpret and interact with digital imagery, bridging the gap between visual and textual data. However, these models, such as Bard, GPT-4V, and LLaVA, often struggle in specialized fields, particularly when processing scientific imagery such as the plots and graphs found in scientific literature. In this thesis, we discuss the development of a pioneering reconstruction pipeline that extracts metadata, regenerates plot data, and filters out extraneous noise such as legends from plot images. Ultimately, the collected information is presented to the VLM in a structured, textual manner to assist in answering domain-specific queries. The efficacy of this pipeline is evaluated on a novel dataset of scientific plots extracted from battery-domain literature, alongside the existing benchmark datasets PlotQA and ChartQA. Results on component accuracy, task accuracy, and question answering with augmented inputs to a VLM show promise for the future capabilities of this work. By assisting VLMs with scientific imagery, we aim not only to enhance the capabilities of VLMs in specialized scientific areas but also to transform the performance of VLMs in domain-specific areas as a whole. This thesis provides a detailed overview of the work, encompassing a literature review, methodology, results, and recommendations for future work.

Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Degree: M.Eng.
Date Issued: 2024-05
Rights: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0); copyright retained by author(s). https://creativecommons.org/licenses/by-nc-nd/4.0/
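To make the pipeline described in the abstract concrete: below is a minimal sketch of how extracted plot metadata might be serialized into structured text and prepended to a VLM prompt. Every name here (`plot_to_text`, the metadata fields, the battery-plot example values) is a hypothetical illustration, not the thesis's actual implementation.

```python
def plot_to_text(meta: dict) -> str:
    """Serialize extracted plot metadata into structured text for a VLM prompt.

    `meta` is assumed to hold the outputs of the reconstruction steps the
    abstract describes: plot title, axis labels, and regenerated data series
    (with legend noise already filtered out).
    """
    lines = [
        f"Title: {meta['title']}",
        f"X-axis: {meta['x_label']}",
        f"Y-axis: {meta['y_label']}",
    ]
    for series in meta["series"]:
        # Render each regenerated data series as explicit (x, y) pairs.
        pts = "; ".join(f"({x}, {y})" for x, y in series["points"])
        lines.append(f"Series '{series['name']}': {pts}")
    return "\n".join(lines)


# Hypothetical example in the battery domain used for evaluation.
example = {
    "title": "Capacity vs. Cycle Number",
    "x_label": "Cycle number",
    "y_label": "Capacity (mAh/g)",
    "series": [{"name": "Cell A", "points": [(0, 150), (100, 138)]}],
}

prompt_context = plot_to_text(example)
```

The serialized text would then accompany the original image and the user's question in the VLM prompt, giving the model a textual fallback for values it cannot read reliably from pixels alone.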