Compositional prompting video-language models to understand procedure in instructional videos
Instructional videos, which naturally contain abundant clip-narration pairs, are very useful for completing complex daily tasks. Existing works on procedure understanding typically pretrain video-language models on these pairs and then fine-tune downstream classifiers and localizers...
Main Authors: | Hu, Guyue, He, Bin, Zhang, Hanwang |
---|---|
Other Authors: | School of Computer Science and Engineering |
Format: | Journal Article |
Language: | English |
Published: | 2023 |
Subjects: | Engineering::Computer science and engineering, Prompt Learning, Instructional Videos |
Online Access: | https://hdl.handle.net/10356/168985 |
_version_ | 1811690617772703744 |
---|---|
author | Hu, Guyue He, Bin Zhang, Hanwang |
author2 | School of Computer Science and Engineering |
author_facet | School of Computer Science and Engineering Hu, Guyue He, Bin Zhang, Hanwang |
author_sort | Hu, Guyue |
collection | NTU |
description | Instructional videos, which naturally contain abundant clip-narration pairs, are very useful for completing complex daily tasks. Existing works on procedure understanding typically pretrain video-language models on these pairs and then fine-tune downstream classifiers and localizers in a predetermined category space. These video-language models are proficient at representing short-term actions, basic objects, and their combinations, but they are still far from understanding long-term procedures. In addition, a predetermined procedure category space suffers from combinatorial explosion and is inherently ill-suited to unseen procedures. Therefore, we propose a novel compositional prompt learning (CPL) framework that understands long-term procedures by prompting short-term video-language models and reformulating several classical procedure understanding tasks as general video-text matching problems. Specifically, the proposed CPL consists of one visual prompt and three compositional textual prompts (the action prompt, object prompt, and procedure prompt), which compositionally distill knowledge from short-term video-language models to facilitate long-term procedure understanding. In addition, the task reformulation enables CPL to perform well in zero-shot, few-shot, and fully supervised settings. Extensive experiments on two widely used datasets for procedure understanding demonstrate the effectiveness of the proposed approach. |
first_indexed | 2024-10-01T06:06:51Z |
format | Journal Article |
id | ntu-10356/168985 |
institution | Nanyang Technological University |
language | English |
last_indexed | 2024-10-01T06:06:51Z |
publishDate | 2023 |
record_format | dspace |
spelling | ntu-10356/168985 2023-06-26T04:45:12Z Hu, G., He, B. & Zhang, H. (2023). Compositional prompting video-language models to understand procedure in instructional videos. Machine Intelligence Research, 20(2), 249-262. https://dx.doi.org/10.1007/s11633-022-1409-1 2731-538X https://hdl.handle.net/10356/168985 10.1007/s11633-022-1409-1 2-s2.0-85149147475 en © Institute of Automation, Chinese Academy of Sciences and Springer-Verlag GmbH Germany, part of Springer Nature 2023. |
title | Compositional prompting video-language models to understand procedure in instructional videos |
topic | Engineering::Computer science and engineering Prompt Learning Instructional Videos |
url | https://hdl.handle.net/10356/168985 |