Verbs in action: improving verb understanding in video-language models

Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time. Recently, state-of-the-art video-language models based on CLIP have been shown to have limited verb understanding and to rely extensively on nouns, restricting their performance in real-world video applications that require action and temporal understanding. In this work, we improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive (VFC) framework. This consists of two main components: (1) leveraging pretrained large language models (LLMs) to create hard negatives for cross-modal contrastive learning, together with a calibration strategy to balance the occurrence of concepts in positive and negative pairs; and (2) enforcing a fine-grained verb phrase alignment loss. Our method achieves state-of-the-art zero-shot results on three downstream tasks that focus on verb understanding: video-text matching, video question-answering, and video classification, while maintaining performance in noun-focused settings. To the best of our knowledge, this is the first work that proposes a method to alleviate the verb understanding problem rather than simply highlighting it. Our code is publicly available at [16]: scenic/projects/verbs_in_action.
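As a rough illustration of component (1), the sketch below implements an InfoNCE-style contrastive loss in which each video embedding is scored against its original caption and K LLM-generated, verb-swapped hard negatives (e.g. "pushing a door" vs. "pulling a door"). This is a minimal sketch, not the authors' implementation (see scenic/projects/verbs_in_action for that): the function name, array shapes, and temperature value are illustrative assumptions, and the paper's in-batch negatives and calibration strategy are omitted for brevity. It is written in JAX, the framework the scenic codebase uses.

    # Hypothetical sketch of a verb-focused contrastive loss with hard
    # negatives; NOT the authors' code. Names and shapes are illustrative.
    import jax.numpy as jnp
    from jax.nn import log_softmax

    def verb_focused_contrastive_loss(video_emb, pos_text_emb,
                                      neg_text_embs, temperature=0.07):
        """InfoNCE-style loss over one positive caption and K hard negatives.

        video_emb:     (B, D)    L2-normalised video embeddings
        pos_text_emb:  (B, D)    embeddings of the original captions
        neg_text_embs: (B, K, D) embeddings of K verb-swapped hard negatives
        """
        # Similarity of each video to its positive caption: (B,)
        pos_sim = jnp.sum(video_emb * pos_text_emb, axis=-1) / temperature
        # Similarity of each video to its K hard negatives: (B, K)
        neg_sim = jnp.einsum('bd,bkd->bk', video_emb, neg_text_embs) / temperature
        # Stack into (B, 1+K) logits; the positive always sits at index 0.
        logits = jnp.concatenate([pos_sim[:, None], neg_sim], axis=-1)
        # Cross-entropy with the positive as the target class.
        return -log_softmax(logits, axis=-1)[:, 0].mean()

Because the hard negatives keep the nouns fixed and change only the verb, the positive and negative captions differ solely in the action described, so minimising this loss pushes the model to discriminate verbs rather than lean on object cues.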


Bibliographic Details
Main Authors: Momeni, L; Caron, M; Nagrani, A; Zisserman, A; Schmid, C
Format: Conference item
Language: English
Published: IEEE, 2024
Institution: University of Oxford