Semantic-aware auto-encoders for self-supervised representation learning
Main Authors: | Wang, G; Tang, Y; Lin, L; Torr, PHS |
---|---|
Format: | Conference item |
Language: | English |
Published: | IEEE, 2022 |
_version_ | 1826308885467627520 |
---|---|
author | Wang, G Tang, Y Lin, L Torr, PHS |
author_sort | Wang, G |
collection | OXFORD |
description | The resurgence of unsupervised learning can be attributed to the remarkable progress of self-supervised learning, which includes generative $(\mathcal{G})$ and discriminative $(\mathcal{D})$ models. In computer vision, the mainstream self-supervised learning algorithms are $\mathcal{D}$ models. However, designing a $\mathcal{D}$ model could be over-complicated; also, some studies hinted that a $\mathcal{D}$ model might not be as general and interpretable as a $\mathcal{G}$ model. In this paper, we switch from $\mathcal{D}$ models to $\mathcal{G}$ models using the classical auto-encoder $(AE)$. Note that a vanilla $\mathcal{G}$ model is far less efficient than a $\mathcal{D}$ model in self-supervised computer vision tasks, as it wastes model capacity on overfitting semantic-agnostic high-frequency details. Inspired by perceptual learning, which can use cross-view learning to perceive concepts and semantics (following [26], we refer to semantics as visual concepts; e.g., a semantic-aware model indicates that the model can perceive visual concepts, and the learned features are effective in object recognition, detection, etc.), we propose a novel $AE$ that learns semantic-aware representations via cross-view image reconstruction. We use one view of an image as the input and another view of the same image as the reconstruction target. This kind of $AE$ has rarely been studied before, and its optimization is very difficult. To enhance learning ability and find a feasible solution, we propose a semantic aligner that uses geometric transformation knowledge to align the hidden code of the $AE$ and thereby ease optimization. These techniques significantly improve the representation learning ability of the $AE$ and make self-supervised learning with $\mathcal{G}$ models possible. Extensive experiments on many large-scale benchmarks (e.g., ImageNet, COCO 2017, and SYSU-30k) demonstrate the effectiveness of our methods. Code is available at https://github.com/wanggrun/Semantic-Aware-AE. |
first_indexed | 2024-03-07T07:25:55Z |
format | Conference item |
id | oxford-uuid:91b5368a-5150-4801-8875-8b6b0111e3c8 |
institution | University of Oxford |
language | English |
last_indexed | 2024-03-07T07:25:55Z |
publishDate | 2022 |
publisher | IEEE |
record_format | dspace |
title | Semantic-aware auto-encoders for self-supervised representation learning |
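The abstract above describes the method at the level of architecture: one augmented view of an image is encoded, the hidden code is aligned toward a second view using the known geometric transformation between the views, and the decoder is trained to reconstruct that second view rather than the input. The sketch below is a minimal, hypothetical PyTorch illustration of such a cross-view reconstruction objective; the class name `CrossViewAE`, the choice of an affine feature-map warp as a stand-in for the semantic aligner, and all sizes and hyper-parameters are assumptions made for illustration, not the authors' implementation (the official code is at the GitHub link in the description).

```python
# Toy sketch of cross-view auto-encoding (NOT the official implementation).
# Assumption: the two views are related by a known 2x3 affine transform "theta",
# and the "semantic aligner" is approximated by warping the encoder's feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossViewAE(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        # Tiny convolutional encoder/decoder; a real model would use a deep backbone.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(dim, 3, 4, stride=2, padding=1),
        )

    def align(self, h, theta):
        # Hypothetical aligner: warp the hidden code with the affine matrix
        # that relates the input view to the reconstruction-target view.
        grid = F.affine_grid(theta, h.shape, align_corners=False)
        return F.grid_sample(h, grid, align_corners=False)

    def forward(self, view_in, theta):
        h = self.encoder(view_in)   # hidden code of the input view
        h = self.align(h, theta)    # align the code toward the target view
        return self.decoder(h)      # reconstruct the *other* view


def training_step(model, view_in, view_target, theta, optimizer):
    # Cross-view reconstruction loss: input is one view, target is another view.
    recon = model(view_in, theta)
    loss = F.mse_loss(recon, view_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    model = CrossViewAE()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    view_in = torch.randn(2, 3, 64, 64)       # augmented view fed to the encoder
    view_target = torch.randn(2, 3, 64, 64)   # a different view of the same image
    theta = torch.eye(2, 3).unsqueeze(0).repeat(2, 1, 1)  # identity warp for the demo
    print(training_step(model, view_in, view_target, theta, opt))
```

The point the sketch tries to capture is the one made in the abstract: because the reconstruction target is a different view from the input, the model cannot succeed by memorizing semantic-agnostic, high-frequency pixel detail, and the alignment step supplies the geometric correspondence that makes this otherwise difficult optimization feasible.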