Optimizing Continuous Prompts for Visual Relationship Detection by Affix-Tuning


Bibliographic Details
Main Authors: Shouguan Xiao, Weiping Fu
Format: Article
Language: English
Published: IEEE 2022-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/9815128/
Description
Summary: Visual relationship detection is crucial for understanding visual scenes and is widely used in many areas, including visual navigation, visual question answering, and machine trouble detection. Traditional detection methods often fuse multiple region modules, which takes considerable time and resources because every module must be trained on extensive samples. Because each module is independent, the overall computation is difficult to unify and lacks higher-level logical reasoning. To address these problems, we propose a novel method of affix-tuning transformers for visual relationship detection, which keeps the transformer model parameters frozen and optimizes only a small continuous task-specific vector. This not only unifies the model and reduces training cost but also preserves common-sense reasoning without multiscale training. In addition, we design a vision-and-language sentence-expression prompt template and train only a few transformer parameters for downstream tasks. Our method, Prompt Template and Affix-Tuning Transformers (PTAT), is evaluated on the Visual Relationship Detection and Visual Genome datasets. On several evaluation metrics, its results are close to or even higher than those of state-of-the-art methods.
ISSN: 2169-3536
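The core idea described in the abstract, freezing all pretrained transformer weights and training only a small continuous task-specific vector prepended to the input, can be sketched as follows. This is an illustrative toy example in the spirit of prefix/affix-tuning, not the paper's actual implementation; the class name, dimensions, and affix length are all hypothetical.

```python
# Illustrative sketch of affix-tuning (hypothetical names and sizes, not PTAT's code):
# all transformer parameters are frozen; the only trainable parameters are a small
# continuous "affix" of prompt vectors concatenated before the input sequence.
import torch
import torch.nn as nn

class AffixTunedEncoder(nn.Module):
    def __init__(self, d_model=64, n_affix=8, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Freeze every (stand-in for pretrained) transformer parameter.
        for p in self.encoder.parameters():
            p.requires_grad = False
        # The small continuous task-specific vector: the only trainable tensor.
        self.affix = nn.Parameter(torch.randn(n_affix, d_model) * 0.02)

    def forward(self, x):
        # x: (batch, seq_len, d_model); prepend the affix to every sequence.
        affix = self.affix.unsqueeze(0).expand(x.size(0), -1, -1)
        return self.encoder(torch.cat([affix, x], dim=1))

model = AffixTunedEncoder()
out = model(torch.randn(2, 5, 64))          # output: (2, 8 + 5, 64)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

Here `trainable` contains only `"affix"`, so an optimizer built over `filter(lambda p: p.requires_grad, model.parameters())` updates just `n_affix * d_model` values while the transformer stays fixed, which is what makes this family of methods cheap to train per downstream task.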