Verbs in action: improving verb understanding in video-language models
Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time. Recently, state-of-the-art video-language models based on CLIP have been shown to have limited verb understanding and to rely extensively on nouns, restricting their performance in real-world video applications that require action and temporal understanding. In this work, we improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive (VFC) framework. This consists of two main components: (1) leveraging pretrained large language models (LLMs) to create hard negatives for cross-modal contrastive learning, together with a calibration strategy to balance the occurrence of concepts in positive and negative pairs; and (2) enforcing a fine-grained, verb phrase alignment loss. Our method achieves state-of-the-art results for zero-shot performance on three downstream tasks that focus on verb understanding: video-text matching, video question-answering and video classification; while maintaining performance on noun-focused settings. To the best of our knowledge, this is the first work which proposes a method to alleviate the verb understanding problem, and does not simply highlight it. Our code is publicly available at [16]: scenic/projects/verbs_in_action.
Main Authors: | Momeni, L; Caron, M; Nagrani, A; Zisserman, A; Schmid, C |
---|---|
Format: | Conference item |
Language: | English |
Published: | IEEE, 2024 |
Institution: | University of Oxford |
Record ID: | oxford-uuid:861ffd96-ffe8-4c54-a3d5-4fd90629cc9e |
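To make component (1) of the abstract concrete, below is a minimal NumPy sketch of a cross-modal contrastive loss whose negative pool is extended with LLM-generated, verb-swapped captions. This is an illustration, not the authors' implementation (which lives in scenic/projects/verbs_in_action); the function name, tensor shapes, temperature value, and the assumption of pre-computed L2-normalised embeddings are all hypothetical, and the calibration strategy and verb phrase alignment loss are not modelled here. The intuition: because each verb-swapped caption shares its nouns and context with the true caption, the model can only tell them apart by attending to the verb.

```python
import numpy as np

def contrastive_loss_with_hard_negatives(video_emb, caption_emb, hard_neg_emb,
                                         temperature=0.07):
    """Video-to-text InfoNCE loss with an augmented negative set.

    Hypothetical shapes:
      video_emb:    (B, D)    L2-normalised video embeddings
      caption_emb:  (B, D)    L2-normalised embeddings of the true captions
      hard_neg_emb: (B, K, D) embeddings of K verb-swapped captions per video
    """
    B = video_emb.shape[0]
    # In-batch negatives: similarity of each video to every caption in the batch.
    batch_logits = video_emb @ caption_emb.T / temperature                  # (B, B)
    # Hard negatives: similarity of each video to its own K verb-swapped captions.
    hard_logits = np.einsum("bd,bkd->bk", video_emb, hard_neg_emb) / temperature  # (B, K)
    logits = np.concatenate([batch_logits, hard_logits], axis=1)            # (B, B + K)
    # Row-wise log-softmax; the positive for video i is caption i (column i).
    logits -= logits.max(axis=1, keepdims=True)                             # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(B), np.arange(B)].mean()

# Toy usage with random unit vectors (B=4 videos, K=3 hard negatives, D=8).
rng = np.random.default_rng(0)
unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
videos = unit(rng.normal(size=(4, 8)))
captions = unit(rng.normal(size=(4, 8)))
hard_negs = unit(rng.normal(size=(4, 3, 8)))
print(contrastive_loss_with_hard_negatives(videos, captions, hard_negs))
```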