Augmenting Inputs using a Novel Figure-to-Text Pipeline to Assist Visual Language Models in Answering Scientific Domain Queries
Recent advancements in visual language models (VLMs) have transformed the way we interpret and interact with digital imagery, bridging the gap between visual and textual data. However, models such as Bard, GPT-4V, and LLaVA often struggle with specialized fields, particularly when processing sc...
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis |
Published: | Massachusetts Institute of Technology, 2024 |
Online Access: | https://hdl.handle.net/1721.1/156824 |