Summary: <p>The aspiration of computer vision is to enable computers to make sense of the world as it appears in a vast sea of colorful pixels, reflecting what humans see every day. Central to this goal, perhaps, is the discovery of object representations from these pixels: clusters of pixels representing the things, big and small, that constitute our physical world and whose interactions form our daily life. Visual object segmentation deals with this task: it concerns the prediction of pixel-wise masks that delineate objects of interest from visual data. Depending on the application, different types of evidence have been leveraged to segment out objects, e.g., contours, saliency, motion, extreme points, and pre-defined semantic categories. In this thesis, we investigate the exploitation of temporal correspondences and natural language descriptions for visual object segmentation, motivated by their omnipresence and efficacy, and contribute new methods that improve the state of the art for the relevant applications.</p>
<p>In the first part of the thesis, we consider the modeling of temporal dependencies for object segmentation in videos, where the continuous movement of an object offers natural correspondences in time. Conventional methods based on sequential modeling are ineffective at exploiting such cues and still perform only on a par with static segmentation models. We address this issue by introducing a technique that learns dense correspondences between pixels of frames that can be arbitrarily far apart in time. This enables the learning of long-term dependencies without conditioning on intermediate frames, which significantly improves the consistency of the object's (deep) representation over time and leads to accurate object segmentation in videos.</p>
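<p>For illustration only, the sketch below shows one common way dense correspondences between two temporally distant frames can be used to propagate an object mask: an affinity matrix between all pixel pairs is computed from backbone features and used to transport mask values. The function name <code>propagate_mask</code> and the dot-product affinity with softmax normalization are assumptions for this example, not the exact formulation developed in the thesis.</p>
<pre><code>import torch
import torch.nn.functional as F

def propagate_mask(ref_feat, ref_mask, qry_feat, temperature=0.07):
    """Propagate a reference-frame mask to a query frame via dense affinity.

    ref_feat: (C, H, W) features of the reference frame (any backbone)
    ref_mask: (K, H, W) soft masks of K objects in the reference frame
    qry_feat: (C, H, W) features of the query frame
    """
    C, H, W = ref_feat.shape
    ref = F.normalize(ref_feat.reshape(C, -1), dim=0)   # (C, HW)
    qry = F.normalize(qry_feat.reshape(C, -1), dim=0)   # (C, HW)

    # Pairwise affinity between every query pixel and every reference pixel;
    # the two frames can be arbitrarily far apart in time.
    affinity = torch.softmax(qry.t() @ ref / temperature, dim=-1)  # (HW, HW)

    # Each query pixel aggregates mask values from its corresponding pixels.
    mask = affinity @ ref_mask.reshape(ref_mask.shape[0], -1).t()  # (HW, K)
    return mask.t().reshape(-1, H, W)
</code></pre>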
<p>In the second part, we explore the use of natural language for object segmentation in images, where an object that corresponds to a natural language description is segmented out. With the aim of aligning linguistic meanings and visual cues in a common feature space, previous methods adopt the paradigm of fusing visual and linguistic features after they have been independently extracted by a vision encoder and a language encoder, respectively. In this thesis, we introduce a new paradigm that moves cross-modal feature alignment to an earlier phase, i.e., it is conducted jointly with image encoding. The main idea is to inject linguistic features into multiple intermediate layers of a vision Transformer network, so that linguistic information is embedded jointly with visual information by the Transformer layers during the forward pass. With this approach, we tap into the correlation-modeling power of the Transformer at an early stage and can obtain segmentation masks with a lightweight mask predictor.</p>
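<p>The following is a minimal sketch of how linguistic features might be injected into an intermediate vision Transformer layer so that cross-modal alignment happens during image encoding rather than after it. The module name <code>LanguageAwareViTLayer</code> and the concatenation-based token mixing are illustrative assumptions, not the specific architecture proposed in the thesis.</p>
<pre><code>import torch
import torch.nn as nn

class LanguageAwareViTLayer(nn.Module):
    def __init__(self, dim=768, lang_dim=768, heads=12):
        super().__init__()
        self.lang_proj = nn.Linear(lang_dim, dim)  # map word features to the visual width
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, vis_tokens, lang_tokens):
        # Append projected word tokens so that self-attention in this layer
        # mixes visual and linguistic information jointly.
        lang = self.lang_proj(lang_tokens)
        x = torch.cat([vis_tokens, lang], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        x = x + self.mlp(self.norm2(x))
        # Only the updated visual tokens are passed on; language tokens are
        # re-injected at every layer.
        return x[:, : vis_tokens.shape[1]]
</code></pre>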
<p>In the last part, we investigate the joint utilization of temporal and linguistic cues for visual object segmentation in the task of video object segmentation from referring expressions. We show how appearance, language, and motion features can be aligned simultaneously in a single, end-to-end trainable network by exploiting the hierarchical structure of convolutional neural networks. Compared with previous methods, the proposed method is substantially better at representing multi-modal inputs at different levels of semantic and spatial granularity, while enjoying a much simpler design.</p>
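<p>The sketch below illustrates, under assumed module names and feature dimensions, how appearance, motion, and language features could be fused at several levels of a convolutional hierarchy; it is a compact example of multi-level multi-modal fusion, not the implementation developed in the thesis.</p>
<pre><code>import torch
import torch.nn as nn

class HierarchicalMultimodalFusion(nn.Module):
    def __init__(self, stage_dims=(256, 512, 1024, 2048), lang_dim=768):
        super().__init__()
        # One 1x1 fusion block per backbone stage, assuming appearance and
        # motion streams share channel widths at each stage.
        self.fuse = nn.ModuleList(
            nn.Conv2d(2 * d + lang_dim, d, kernel_size=1) for d in stage_dims
        )

    def forward(self, app_feats, mot_feats, lang_feat):
        """
        app_feats, mot_feats: lists of per-stage feature maps (B, C_i, H_i, W_i)
        lang_feat: sentence embedding (B, lang_dim)
        """
        fused = []
        for f, a, m in zip(self.fuse, app_feats, mot_feats):
            B, _, H, W = a.shape
            # Tile the sentence embedding over the spatial grid of this stage.
            l = lang_feat[:, :, None, None].expand(B, -1, H, W)
            fused.append(f(torch.cat([a, m, l], dim=1)))
        # Multi-level multi-modal features for a decoder to predict the mask.
        return fused
</code></pre>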