Summary: | <p>Videos are an appealing source of data for training computer vision models. An almost unlimited supply of videos is available online, but exhaustive manual annotation is infeasible. The goal of this thesis is to learn strong video representations efficiently via self-supervised learning: an approach that learns from the data itself rather than from human annotations.</p>
<p>The thesis is structured around three themes: (1) self-supervised learning
for short-term videos, (2) efficient video representation learning, and (3) self-
supervised learning for long-term videos.</p>
<p>For short-term videos lasting only a few seconds, we show that predicting the future of a video provides a strong learning signal at large scale. We further show that strong video representations can be learned by taking two complementary modalities, namely RGB and optical flow, and using them to teach each other.</p>
<p>For efficient video representation learning, we show that large-scale pre-trained vision-language models can be effectively adapted via a prompt tuning technique. We also show that dropping image patches can accelerate both fine-tuning on classification tasks and the pre-training of video-language models.</p>
<p>For long-term videos that last longer than a few minutes, we show that temporal
alignment networks can be trained from the weak visual-textual correspondence
within instructional videos. The resulting networks can automatically clean up
natural videos for effective vision-language training. In addition, we show that movie description models can be trained by leveraging pre-trained vision-language models.</p>