This paper introduces a contrastive self-supervised framework for learning generalizable representations from synthetic data, which can be obtained easily and with full controllability. Specifically, it proposes to optimize a contrastive learning task and a physical property prediction task simultaneously. In addition, a feature-level domain adaptation technique with adversarial training is applied to reduce the domain gap between real and synthetic data. Experiments demonstrate that the proposed method achieves state-of-the-art performance on several visual recognition datasets.
Convolutional neural networks (ConvNets) have made tremendous progress in the computer vision field. However, such achievements are mainly backed by supervised learning on massive collections of training data. More recently, various methods have tried to learn visual representations from large-scale unlabeled data without using any human annotation. A natural solution is self-supervised learning (SSL), which defines an annotation-free surrogate task and uses the input itself as the supervision signal. The intuition is that solving tasks like inferring geometrical configuration and recovering missing parts of images can force the ConvNets to learn semantic representations.
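To make the surrogate-task idea concrete, one well-known example (rotation prediction, not the method proposed in this paper) derives labels from the input itself: each image is rotated by 0, 90, 180, or 270 degrees, and the network must predict which rotation was applied. A minimal sketch, assuming images are given as numpy arrays:

```python
import numpy as np

def rotation_pretext_batch(images):
    """Build a rotation-prediction pretext batch.

    Each input image is rotated by k * 90 degrees for k in 0..3;
    the rotation index k serves as a 'free' supervision label,
    so no human annotation is needed.
    """
    xs, ys = [], []
    for img in images:
        for k in range(4):
            xs.append(np.rot90(img, k))  # rotate by k * 90 degrees
            ys.append(k)                 # pretext label
    return np.stack(xs), np.array(ys)
```

A classifier trained to predict `ys` from `xs` must implicitly learn object shape and orientation, which is the sense in which such surrogate tasks yield semantic features.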
Unlike existing self-supervised methods that learn representations from real-world data, this paper aims to learn general-purpose visual representations by leveraging synthetic data and its various 'free' annotations. Compared with collecting and annotating photos from the real world, synthesized data is easier and cheaper to obtain. For example, it is labor-intensive and often impractical to photograph certain objects, such as birds, from many viewpoints, whereas generating a panoramic view of a synthetic scene is straightforward. Moreover, the attributes (e.g., lighting, physics, position) of synthetic objects can be fully controlled and easily obtained, which can greatly enhance model robustness.
In this work, we present a multi-task self-supervised framework for learning general-purpose visual representations by leveraging the semantic information in synthetic data. Specifically, given a synthetic scene, the proposed framework maximizes the agreement between different views of the same scene via a contrastive loss while simultaneously predicting free physical cues, including depth, instance contour maps, and surface normals. In addition, to tackle the domain gap between synthetic and real images, we employ a feature-level domain adaptation technique with adversarial training. Experiments demonstrate that our method achieves state-of-the-art results in self-supervised learning, verifying its effectiveness.
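The view-agreement objective described above is commonly implemented as an NT-Xent (normalized temperature-scaled cross-entropy) contrastive loss, as popularized by SimCLR. The paper does not specify its exact loss form, so the following is a hedged numpy sketch of that standard formulation, assuming `z1` and `z2` are (N, D) embeddings of two views of the same N scenes:

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """SimCLR-style contrastive loss (a standard choice, assumed here).

    z1, z2: (N, D) embeddings of two augmented views; row i of z1 and
    row i of z2 come from the same scene and form a positive pair.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)                 # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # unit-normalize
    sim = (z @ z.T) / temperature                        # cosine similarities
    np.fill_diagonal(sim, -np.inf)                       # mask self-pairs
    # the positive for sample i is i+N (and vice versa)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))          # denominator per row
    loss = -(sim[np.arange(2 * n), pos] - logsumexp)     # cross-entropy per row
    return loss.mean()
```

Minimizing this loss pulls the two views of each scene together in embedding space while pushing apart views of different scenes; in the full framework it would be optimized jointly with the depth, contour, and surface-normal prediction heads.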
The rest of this paper is organized as follows. Section 2 summarizes related work on self-supervised learning methods. Section 3 introduces our proposed framework for representation learning on synthetic data. Section 4 presents experimental results on popular benchmark datasets. Finally, Section 5 concludes this paper.
Download full text:
Contrastive Self-supervised Representation Learning Using Synthetic Data
Dong-Yu She, Kun Xu