Generalization capacity of natural language video localization (NLVL) models

Bibliographic Details
Main Author: Dhanyamraju, Harsh Rao
Other Authors: Sun Aixin
Format: Final Year Project (FYP)
Language: English
Published: Nanyang Technological University 2024
Subjects:
Online Access: https://hdl.handle.net/10356/175072
Description
Summary: Generalization is a critical feature of any machine learning model. Natural Language Video Localization (NLVL) tasks involve processing diverse video content, text queries, and timestamp distributions, making generalization a crucial aspect of model performance. Many NLVL datasets, such as Charades-STA, exhibit distributional biases in both the timestamps associated with actions in videos and the corresponding textual queries. This bias poses a significant obstacle to building robust models with strong generalization capabilities. In this study, we conducted a comprehensive evaluation of NLVL models across various perturbation scenarios to assess their robustness and sensitivities. Leveraging synthetic perturbation sets, including textual, positional, and stylistic alterations, we examined a model's performance and elucidated its strengths, weaknesses, and underlying mechanisms. Our findings revealed nuanced patterns, highlighting the model's resilience to certain perturbations, such as character swaps, and its heightened sensitivity to others, such as text style variations. Additionally, we explored the implications of dataset curation on model performance, demonstrating the effectiveness of bias mitigation techniques in reducing distributional bias within datasets. Furthermore, we introduced two new datasets, Charades-STAMerged and Charades-Ego STA, aimed at mitigating distributional bias and evaluating NLVL models' generalization on first-person video data. Through these efforts, we offer valuable insights into the performance and interpretability of NLVL models, contributing to the enhancement of model robustness, fairness, and applicability in real-world scenarios.