AimigoTutor - tutoring application using multi-modal capabilities

Bibliographic Details
Main Author: Nguyen, Viet Hoang
Other Authors: Hanwang Zhang
Format: Final Year Project (FYP)
Language: English
Published: Nanyang Technological University, 2024
Online Access: https://hdl.handle.net/10356/175732
Description
Summary: Video captioning is an up-and-coming research topic. Thanks to recent advances in the performance of deep neural networks, especially transformers, video captioning has seen substantial improvements in accuracy and versatility. Most state-of-the-art video captioning models employ a multi-modal approach, in which both the visual information of the video frames and the accompanying audio are used to extract the semantic meaning of the video. This project explores the capability of multi-modal video captioning in a much-needed context: building a video tutoring application for students, called AimigoTutor. This report discusses the requirements, design, implementation and evaluation of the application.
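
As context for the multi-modal approach mentioned in the summary, the sketch below illustrates one plausible way frame and audio features can be fused and decoded into a caption with a transformer. It is an illustrative assumption, not the architecture described in the report; the module names, feature dimensions, and PyTorch framing are all hypothetical.

    # Minimal sketch (assumed, not from the report): frame and audio features
    # are projected into a shared space, concatenated, and attended to by a
    # transformer decoder that generates caption-token logits.
    import torch
    import torch.nn as nn

    class MultiModalCaptioner(nn.Module):
        def __init__(self, visual_dim=2048, audio_dim=128, d_model=512, vocab_size=10000):
            super().__init__()
            # Project per-frame visual features and per-segment audio features
            # into one embedding space before fusion.
            self.visual_proj = nn.Linear(visual_dim, d_model)
            self.audio_proj = nn.Linear(audio_dim, d_model)
            self.token_embed = nn.Embedding(vocab_size, d_model)
            layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=4)
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, visual_feats, audio_feats, caption_tokens):
            # visual_feats: (batch, n_frames, visual_dim)
            # audio_feats:  (batch, n_segments, audio_dim)
            # caption_tokens: (batch, seq_len) ids of the caption so far
            memory = torch.cat(
                [self.visual_proj(visual_feats), self.audio_proj(audio_feats)], dim=1
            )  # fused multi-modal memory the decoder attends to
            tgt = self.token_embed(caption_tokens)
            out = self.decoder(tgt, memory)
            return self.lm_head(out)  # per-token vocabulary logits

    if __name__ == "__main__":
        model = MultiModalCaptioner()
        logits = model(
            torch.randn(2, 16, 2048),            # 16 frame features per video
            torch.randn(2, 8, 128),               # 8 audio-segment features per video
            torch.randint(0, 10000, (2, 12)),     # 12 caption tokens so far
        )
        print(logits.shape)  # torch.Size([2, 12, 10000])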