Augmenting Inputs using a Novel Figure-to-Text Pipeline to Assist Visual Language Models in Answering Scientific Domain Queries

Recent advancements in visual language models (VLMs) have transformed the way we interpret and interact with digital imagery, bridging the gap between visual and textual data. However, these models, like Bard, GPT4-v, and LLava, often struggle with specialized fields, particularly when processing sc...

Full description

Bibliographic Details
Main Author: Gupta, Sejal
Other Authors: Cafarella, Michael
Format: Thesis
Published: Massachusetts Institute of Technology 2024
Online Access:https://hdl.handle.net/1721.1/156824