Verbs in action: improving verb understanding in video-language models
Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time. Recently, state-of-the-art video-language models based on CLIP have been shown to have limited verb understanding and to rely extensively on nouns, restricting their performance in real-world video applications that require action and temporal understanding. In this work, we improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive (VFC) framework. This consists of two main components: (1) leveraging pretrained large language models (LLMs) to create hard negatives for cross-modal contrastive learning, together with a calibration strategy to balance the occurrence of concepts in positive and negative pairs; and (2) enforcing a fine-grained, verb phrase alignment loss. Our method achieves state-of-the-art results for zero-shot performance on three downstream tasks that focus on verb understanding: video-text matching, video question-answering and video classification; while maintaining performance on noun-focused settings. To the best of our knowledge, this is the first work which proposes a method to alleviate the verb understanding problem, and does not simply highlight it. Our code is publicly available at [16]: scenic/projects/verbs_in_action.
Main Authors: | Momeni, L; Caron, M; Nagrani, A; Zisserman, A; Schmid, C |
---|---|
Format: | Conference item |
Language: | English |
Published: | IEEE, 2024 |
Institution: | University of Oxford |
Record ID: | oxford-uuid:861ffd96-ffe8-4c54-a3d5-4fd90629cc9e |
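To make component (1) of the abstract concrete, below is a minimal NumPy sketch of a cross-modal contrastive loss whose negative pool is extended with LLM-generated, verb-swapped captions. This is an illustration, not the authors' implementation (which lives in scenic/projects/verbs_in_action); the function name, tensor shapes, temperature value, and the assumption of pre-computed L2-normalised embeddings are all hypothetical, and the calibration strategy and verb phrase alignment loss are not modelled here. The intuition: because each verb-swapped caption shares its nouns and context with the true caption, the model can only tell them apart by attending to the verb.

```python
import numpy as np

def contrastive_loss_with_hard_negatives(video_emb, caption_emb, hard_neg_emb,
                                         temperature=0.07):
    """Video-to-text InfoNCE loss with an augmented negative set.

    Hypothetical shapes:
      video_emb:    (B, D)    L2-normalised video embeddings
      caption_emb:  (B, D)    L2-normalised embeddings of the true captions
      hard_neg_emb: (B, K, D) embeddings of K verb-swapped captions per video
    """
    B = video_emb.shape[0]
    # In-batch negatives: similarity of each video to every caption in the batch.
    batch_logits = video_emb @ caption_emb.T / temperature                  # (B, B)
    # Hard negatives: similarity of each video to its own K verb-swapped captions.
    hard_logits = np.einsum("bd,bkd->bk", video_emb, hard_neg_emb) / temperature  # (B, K)
    logits = np.concatenate([batch_logits, hard_logits], axis=1)            # (B, B + K)
    # Row-wise log-softmax; the positive for video i is caption i (column i).
    logits -= logits.max(axis=1, keepdims=True)                             # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(B), np.arange(B)].mean()

# Toy usage with random unit vectors (B=4 videos, K=3 hard negatives, D=8).
rng = np.random.default_rng(0)
unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
videos = unit(rng.normal(size=(4, 8)))
captions = unit(rng.normal(size=(4, 8)))
hard_negs = unit(rng.normal(size=(4, 3, 8)))
print(contrastive_loss_with_hard_negatives(videos, captions, hard_negs))
```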