Visual dialog system

Bibliographic Details
Main Author: Luong, Hien Nga
Other Authors: Hanwang Zhang
Format: Final Year Project (FYP)
Language: English
Published: Nanyang Technological University, 2024
Online Access: https://hdl.handle.net/10356/175080
Description
Summary: In this era of Artificial Intelligence, Large Language Models (LLMs) have emerged as powerful tools, revolutionising natural language understanding and generation tasks across different domains. Innovations such as OpenAI's Generative Pretrained Transformer (GPT) series have proven their outstanding ability to comprehend and generate coherent text. The continuous evolution of Artificial Intelligence has led to advancements beyond linguistic abilities, enabling the integration of multimodal functionalities. Multimodal Large Language Models (MLLMs) represent a remarkable development, extending the capabilities of LLMs to visual and auditory information. With this modality integration, models can process and comprehend more diverse input channels, such as images, audio, or video. Among these modalities, images stand out as the most utilised means of communication, attracting a large volume of research and development in visual language models. With their proficiency in visual comprehension and reasoning, MLLMs can serve as a significant aid in practical applications, including image captioning and visual question answering.

Recognising the need to make these powerful MLLMs accessible to general users, the development of a user-friendly Visual Dialog System becomes pivotal. Serving as a bridge between users and MLLMs, such a system can facilitate seamless multi-round conversations involving images and text. Additionally, a proper instructional prompting scheme is essential to give the MLLM the contextual information needed for smooth multi-round conversation. This project aims to develop a visual dialog system that pairs a state-of-the-art MLLM with an appropriate prompting scheme and a web User Interface (UI) integrating textual and visual elements cohesively, allowing interactive conversations between the model and users. The initial step involves providing the model with historical information to ensure a smooth multi-round conversation. Subsequently, a UI is created with interactive components that allow users to submit images and queries and receive responses from the MLLM. In this project, a prompt that combines the new question with a summary of the two previous answers increases user satisfaction by nearly 50% compared to no contextual prompting, highlighting it as a promising, cost-efficient way to provide context at inference time for a visual dialog system. The outcome of this project potentially lays the groundwork for further domain-specific applications, including education, content creation, and virtual assistants, where visual dialogs play a crucial role in helping humans harness the power of AI.
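As an illustration, the sketch below shows one way the contextual prompting scheme described in the summary could work: the new question is combined with a summary of the two previous answers before being sent to the model. All names here (build_prompt, summarise, query_mllm) and the truncation-based summariser are hypothetical assumptions for illustration, not the project's actual implementation.

    from typing import List

    def summarise(answers: List[str], max_chars: int = 300) -> str:
        # Placeholder summariser: simply truncates the concatenated answers.
        # A real system might ask the MLLM itself to produce the summary.
        return " ".join(answers)[:max_chars]

    def build_prompt(question: str, history: List[str]) -> str:
        # Combine the new question with a summary of the two most recent
        # answers, giving the model conversational context without
        # resending the full dialog history.
        if not history:
            return question  # first round: no previous answers to summarise
        context = summarise(history[-2:])
        return f"Context from the previous answers: {context}\nNew question: {question}"

    # Example multi-round usage; the MLLM call itself is hypothetical.
    history: List[str] = []
    for user_question in ["What is in the image?", "What colour is the car?"]:
        prompt = build_prompt(user_question, history)
        # response = query_mllm(image, prompt)  # hypothetical MLLM API call
        # history.append(response)

Because only a short summary of recent answers is included rather than the full transcript, this kind of scheme keeps the prompt length (and therefore inference cost) roughly constant across rounds, which is what the summary's "cost-efficient contextual provision at inference time" refers to.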