Compositional prompting video-language models to understand procedure in instructional videos

Instructional videos, which naturally contain abundant clip-narration pairs, are very useful for completing complex daily tasks. Existing work on procedure understanding focuses on pretraining various video-language models with these pairs and then fine-tuning downstream classifiers and localizers...

Bibliographic Details
Main Authors: Hu, Guyue; He, Bin; Zhang, Hanwang
Other Authors: School of Computer Science and Engineering
Format: Journal Article
Language: English
Published: 2023
Subjects: Engineering::Computer science and engineering; Prompt Learning; Instructional Videos
Online Access: https://hdl.handle.net/10356/168985
collection NTU
description Instructional videos, which naturally contain abundant clip-narration pairs, are very useful for completing complex daily tasks. Existing work on procedure understanding focuses on pretraining various video-language models with these pairs and then fine-tuning downstream classifiers and localizers in a predetermined category space. These video-language models are proficient at representing short-term actions, basic objects, and their combinations, but they are still far from understanding long-term procedures. In addition, a predetermined set of procedure categories suffers from combinatorial explosion and is inherently ill-suited to unseen procedures. Therefore, we propose a novel compositional prompt learning (CPL) framework that understands long-term procedures by prompting short-term video-language models and reformulating several classical procedure understanding tasks as general video-text matching problems. Specifically, the proposed CPL consists of one visual prompt and three compositional textual prompts (the action prompt, object prompt, and procedure prompt), which compositionally distill knowledge from short-term video-language models to facilitate long-term procedure understanding. Moreover, the task reformulation enables CPL to perform well in zero-shot, few-shot, and fully supervised settings. Extensive experiments on two widely used datasets for procedure understanding demonstrate the effectiveness of the proposed approach.
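The reformulation of procedure classification as video-text matching, as described in the abstract, can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt template, the function names (`compose_prompt`, `cosine`, `match_video_to_text`), and the toy embedding vectors are all assumptions made for illustration, and a real system would obtain the embeddings from a pretrained video-language model.

```python
import math


def compose_prompt(procedure, actions, objects):
    # Hypothetical template: combine procedure, action, and object terms
    # into one textual prompt. The record does not give the actual template.
    return (
        f"a video of {procedure}, "
        f"involving {', '.join(actions)} with {', '.join(objects)}"
    )


def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_video_to_text(video_emb, text_embs):
    # Score the video embedding against every candidate text embedding
    # and return the index of the best match together with all scores.
    scores = [cosine(video_emb, t) for t in text_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy embeddings standing in for encoder outputs (assumed values).
video_emb = [1.0, 0.0, 0.0]
text_embs = [
    [0.9, 0.1, 0.0],  # embedding of the prompt for candidate procedure 0
    [0.0, 1.0, 0.0],  # embedding of the prompt for candidate procedure 1
]
best, scores = match_video_to_text(video_emb, text_embs)
```

Because candidates are described by text rather than by a fixed label index, a previously unseen procedure can be scored by composing a new prompt, which is the property the abstract relies on for the zero-shot setting.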
id ntu-10356/168985
institution Nanyang Technological University
Citation: Hu, G., He, B. & Zhang, H. (2023). Compositional prompting video-language models to understand procedure in instructional videos. Machine Intelligence Research, 20(2), 249-262. https://dx.doi.org/10.1007/s11633-022-1409-1
ISSN: 2731-538X
DOI: 10.1007/s11633-022-1409-1
Scopus ID: 2-s2.0-85149147475
Record date: 2023-06-26T04:45:12Z
Rights: © Institute of Automation, Chinese Academy of Sciences and Springer-Verlag GmbH Germany, part of Springer Nature 2023.
topic Engineering::Computer science and engineering
Prompt Learning
Instructional Videos