Dong-Yu She, Kun Xu. Contrastive Self-supervised Representation Learning Using Synthetic Data. International Journal of Automation and Computing, vol. 18, no. 4, pp.556-567, 2021. https://doi.org/10.1007/s11633-021-1297-9

Contrastive Self-supervised Representation Learning Using Synthetic Data

doi: 10.1007/s11633-021-1297-9
More Information
  • Author Bio:

    Dong-Yu She received the B. Eng. and M. Eng. degrees in computer science and technology from Nankai University, China in 2016 and 2019, respectively. She is a Ph. D. candidate in the Department of Computer Science and Technology, Tsinghua University, China. Her research interests include deep learning and computer vision. E-mail: shedy19@mails.tsinghua.edu.cn ORCID iD: 0000-0002-1434-562X

    Kun Xu received the B. Eng. and Ph. D. degrees in computer science and technology from Tsinghua University, China in 2005 and 2009, respectively. He is an associate professor in the Department of Computer Science and Technology, Tsinghua University, China. His research interests include realistic rendering and image/video editing. E-mail: xukun@tsinghua.edu.cn (Corresponding author) ORCID iD: 0000-0002-2671-4170

  • Received Date: 2020-10-15
  • Accepted Date: 2021-03-29
  • Publish Online: 2021-05-11
  • Publish Date: 2021-06-30
  • Abstract: Learning discriminative representations with deep neural networks often relies on massive labeled data, which is expensive and difficult to obtain in many real scenarios. As an alternative, self-supervised learning, which uses the input itself as supervision, has attracted growing interest for its strong performance on visual representation learning. This paper introduces a contrastive self-supervised framework for learning generalizable representations from synthetic data, which can be generated easily and with full controllability. Specifically, we propose to optimize a contrastive learning task and a physical property prediction task simultaneously. Given a synthetic scene, the first task maximizes agreement between a pair of synthetic images generated by our proposed view sampling module, while the second task predicts three physical property maps, i.e., depth maps, instance contour maps, and surface normal maps. In addition, a feature-level domain adaptation technique with adversarial training is applied to reduce the domain gap between realistic and synthetic data. Experiments demonstrate that our proposed method achieves state-of-the-art performance on several visual recognition datasets.
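    The abstract describes a joint objective: a contrastive task that maximizes agreement between two views sampled from a synthetic scene, plus dense prediction of depth, instance contour, and surface normal maps. The PyTorch snippet below is a minimal sketch of one plausible way to wire these together, assuming a SimCLR-style NT-Xent contrastive loss and simple per-pixel heads; the model structure, loss choices, and weights are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskSSLModel(nn.Module):
    """Shared encoder with a projection head for the contrastive task and
    dense heads for depth, instance-contour, and surface-normal prediction.
    Layer sizes and head structure are illustrative placeholders."""
    def __init__(self, encoder, feat_dim=2048, proj_dim=128):
        super().__init__()
        self.encoder = encoder  # e.g., a ResNet trunk returning a B x C x H x W feature map
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(inplace=True), nn.Linear(512, proj_dim))
        self.depth_head = nn.Conv2d(feat_dim, 1, kernel_size=1)    # depth map
        self.contour_head = nn.Conv2d(feat_dim, 1, kernel_size=1)  # instance contour map
        self.normal_head = nn.Conv2d(feat_dim, 3, kernel_size=1)   # surface normal map

    def forward(self, x):
        feat = self.encoder(x)                    # B x C x H x W
        z = self.proj(feat.mean(dim=(2, 3)))      # pooled embedding for the contrastive loss
        return z, self.depth_head(feat), self.contour_head(feat), self.normal_head(feat)

def nt_xent_loss(z1, z2, temperature=0.1):
    """SimCLR-style NT-Xent loss over a batch of paired views."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # 2N x D
    sim = z @ z.t() / temperature                        # cosine similarity matrix
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool, device=z.device), float('-inf'))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

def training_loss(model, view1, view2, depth_gt, contour_gt, normal_gt,
                  w_contrast=1.0, w_prop=1.0):
    """Joint objective: contrastive agreement between two views plus per-pixel
    property prediction on one view. Ground-truth maps are assumed pre-resized
    to the head output resolution; the individual loss terms are assumptions."""
    z1, depth, contour, normal = model(view1)
    z2, _, _, _ = model(view2)
    l_contrast = nt_xent_loss(z1, z2)
    l_prop = (F.l1_loss(depth, depth_gt)
              + F.binary_cross_entropy_with_logits(contour, contour_gt)
              + F.l1_loss(F.normalize(normal, dim=1), normal_gt))
    return w_contrast * l_contrast + w_prop * l_prop
```

    In this formulation, the other 2(N−1) samples in the batch act as negatives for each positive pair, and the temperature controls how sharply the loss concentrates on hard negatives.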

     

  • Footnote 1: https://projectchrono.org/
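    The abstract also mentions a feature-level domain adaptation step trained adversarially to reduce the gap between synthetic and realistic data. A common way to realize this is a gradient reversal layer feeding a small domain discriminator (DANN-style); the sketch below, with its illustrative DomainDiscriminator and lambd weighting, is an assumed formulation rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambd in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class DomainDiscriminator(nn.Module):
    """Classifies whether a pooled feature comes from synthetic or realistic data."""
    def __init__(self, feat_dim=2048, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 2))

    def forward(self, feat, lambd=1.0):
        return self.net(GradReverse.apply(feat, lambd))

# Usage sketch (encoder and disc are assumed objects; 0 = synthetic, 1 = real):
# feat_syn = encoder(x_syn).mean(dim=(2, 3))
# feat_real = encoder(x_real).mean(dim=(2, 3))
# logits = disc(torch.cat([feat_syn, feat_real]), lambd=0.1)
# labels = torch.cat([torch.zeros(len(feat_syn)), torch.ones(len(feat_real))]).long()
# domain_loss = nn.functional.cross_entropy(logits, labels)
```

    Training the discriminator to separate the two domains while reversing its gradient into the encoder pushes the encoder toward features that are indistinguishable across domains, which is the intended feature-level alignment.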
  • [1]
    B. Zhao, J. S. Feng, X. Wu, S. Yan. A survey on deep learning-based fine-grained object classification and semantic segmentation. International Journal of Automation and Computing, vol. 14, no. 2, pp. 119–135, 2017. DOI: 10.1007/s11633-017-1053-3.
    [2]
    V. K. Ha, J. C. Ren, X. Y. Xu, S. Zhao, G. Xie, V. Masero, A. Hussain. Deep learning based single image super-resolution: A survey. International Journal of Automation and Computing, vol. 16, no. 4, pp. 413–426, 2019. DOI: 10.1007/s11633-019-1183-x.
    [3]
    K. Aukkapinyo, S. Sawangwong, P. Pooyoi, W. Kusakunniran. Localization and classification of rice-grain images using region proposals-based convolutional neural network. International Journal of Automation and Computing, vol. 17, no. 2, pp. 233–246, 2020. DOI: 10.1007/s11633-019-1207-6.
    [4]
    X. L. Wang, A. Gupta. Unsupervised learning of visual representations using videos. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Santiago, Chile, pp. 2794−2802, 2015.
    [5]
    C. Doersch, A. Gupta, A. A. Efros. Unsupervised visual representation learning by context prediction. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Santiago, Chile, pp. 1422−1430, 2015.
    [6]
    C. Doersch, A. Zisserman. Multi-task self-supervised visual learning. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Venice, Italy, pp. 2070−2079, 2017.
    [7]
    S. Gidaris, P. Singh, N. Komodakis. Unsupervised representation learning by predicting image rotations. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada, 2018.
    [8]
    D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, A. A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, USA, pp. 2536−2544, 2016.
    [9]
    G. E. Hinton, R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, vol. 313, no. 5786, pp. 504–507, 2006. DOI: 10.1126/science.1127647.
    [10]
    P. Vincent, H. Larochelle, Y. Bengio, P. A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine learning, ACM, Helsinki, Finland, pp. 1096−1103, 2008.
    [11]
    R. Lopez, J. Regier, M. I. Jordan, N. Yosef. Information constraints on auto-encoding variational bayes. In Advances in Neural Information Processing, Montreal, Canada, pp. 6117−6128, 2018.
    [12]
    X. Liu, F. J. Zhang, Z. Y. Hou, Z. Y. Wang, L. Mian, J. Zhang, J. Tang. Self-supervised learning: Generative or contrastive. [Online], Available: https://arxiv.org/abs/2006.08218, 2020.
    [13]
    Z. Z. Ren, Y. Jae Lee. Cross-domain self-supervised multi-task feature learning using synthetic imagery. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, UT, USA, pp. 762−771, 2018.
    [14]
    R. Zhang, P. Isola, A. A. Efros. Colorful image colorization. In Proceedings of the 14th European Conference on Computer Vision, Springer, Amsterdam, The Netherlands, pp. 649−666, 2016.
    [15]
    R. Hadsell, S. Chopra, Y. LeCun. Dimensionality reduction by learning an invariant mapping. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern, IEEE, New York, USA, pp. 1735−1742, 2006.
    [16]
    A. van den Oord, Y. Z. Li, O. Vinyals. Representation learning with contrastive predictive coding. [Online], Available: https://arxiv.org/abs/1807.03748, 2018.
    [17]
    R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, Y. Bengio. Learning deep representations by mutual information estimation and maximization. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, USA, 2019.
    [18]
    N. Saunshi, O. Plevrakis, S. Arora, M. Khodak, H. Khandeparkar. A theoretical analysis of contrastive unsupervised representation learning. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, USA, pp. 5628−5637, 2019.
    [19]
    T. Nathan Mundhenk, D. Ho, B. Y. Chen. Improvements to context based self-supervised learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 9339−9348, 2018.
    [20]
    M. Noroozi, P. Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In Proceedings of the 14th European Conference on Computer Vision, Springer, Amsterdam, The Netherlands, pp. 69−84, 2016.
    [21]
    H. Y. Lee, J. B. Huang, M. Singh, M. H. Yang. Unsupervised representation learning by sorting sequences. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Venice, Italy, pp. 667−676, 2017.
    [22]
    D. Kim, D. Cho, D. Yoo, I. S. Kweon. Learning image representations by completing damaged jigsaw puzzles. In Proceedings of IEEE Winter Conference on Applications of Computer Vision, IEEE, Lake Tahoe, USA, pp. 793−802, 2018.
    [23]
    T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems, ACM, Lake Tahoe, USA, pp. 3111−3119, 2013.
    [24]
    X. H. Zhan, X. G. Pan, Z. W. Liu, D. H. Lin, C. C. Loy. Self-supervised learning via conditional motion propagation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 1881−1889, 2019.
    [25]
    Z. Y. Feng, C. Xu, D. C. Tao. Self-supervised representation learning by rotation feature decoupling. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 10364−10374, 2019.
    [26]
    X. L. Wang, K. M. He, A. Gupta. Transitive invariance for self-supervised visual representation learning. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Venice, Italy, pp. 1338−1347, 2017.
    [27]
    L. H. Zhang, G. J. Qi, L. Q. Wang, J. B. Luo. AET vs. AED: Unsupervised representation learning by auto-encoding transformations rather than data. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Long Beach, USA, pp. 2542−2550, 2019.
    [28]
    J. Donahue, K. Simonyan. Large scale adversarial representation learning. In Advances in Neural Information Processing Systems, Vancouver, Canada, pp. 10541−10551, 2019.
    [29]
    R. Zhang, P. Isola, A. A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, USA, pp. 645−654, 2017.
    [30]
    X. C. Peng, B. C. Sun, K. Ali, K. Saenko. Learning deep object detectors from 3D models. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Santiago, Chile, pp. 1278−1286, 2015.
    [31]
    O. J. Hénaff, A. Srinivas, J. De Fauw, A. Razavi, C. Doersch, S. M. A. Eslami, A. van den Oord. Data-efficient image recognition with contrastive predictive coding. [Online], Available: https://arxiv.org/abs/1905.09272, 2019.
    [32]
    P. Bachman, R. D. Hjelm, W. Buchwalter. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, Vancouver, Canada, pp. 15509−15519, 2019.
    [33]
    M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, M. Lucic. On mutual information maximization for representation learning. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.
    [34]
    K. M. He, H. Q. Fan, Y. X. Wu, S. N. Xie, R. Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 9726−9735, 2020.
    [35]
    T. Chen, S. Kornblith, M. Norouzi, G. Hinton. A simple framework for contrastive learning of visual representations. [Online], Available: https://arxiv.org/abs/2002.05709, 2020.
    [36]
    Y. L. Tian, D. Krishnan, P. Isola. Contrastive Multiview coding. In Proceedings of the 16th European Conference on Computer Vision, Springer, Glasgow, UK, pp. 776−794, 2020.
    [37]
    T. Chen, Y. Z. Sun, Y. Shi, L. J. Hong. On sampling strategies for neural network-based collaborative filtering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, Halifax, Canada, pp. 767−776, 2017.
    [38]
    J. McCormac, A. Handa, S. Leutenegger, A. J. Davison. SceneNet RGB-D: Can 5M synthetic images beat generic imagenet pre-training on indoor segmentation? In Proceedings of IEEE International Conference on Computer Vision, IEEE, Venice, Italy, pp. 2697−2706, 2017.
    [39]
    T. Hachisuka, H. W. Jensen. Parallel progressive photon mapping on GPUS. In Proceedings of ACM SIGGRAPH ASIA, Seoul, Korea, pp. 54:1, 2010.
    [40]
    S. N. Xie, Z. W. Tu. Holistically-nested edge detection. International Journal of Computer Vision, vol. 125, no. 1−3, pp. 3–18, 2017. DOI: 10.1007/s11263-017-1004-z.
    [41]
    I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems, ACM, Montreal, Canada, pp. 2672−2680, 2014.
    [42]
    Y. Ganin, V. S. Lempitsky. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, pp. 1180−1189, 2015.
    [43]
    K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, USA, pp. 3722−3731, 2017.
    [44]
    E. Tzeng, J. Hoffman, K. Saenko, T. Darrell. Adversarial discriminative domain adaptation. In Proceedings of Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, USA, pp. 7167−7176, 2017.
    [45]
    K. Sohn, W. L. Shang, X. Yu, M. Chandraker. Unsupervised domain adaptation for distance metric learning. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, USA, 2019.
    [46]
    A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, ACM, Lake Tahoe, USA, pp. 1097−1105, 2012.
    [47]
    B. L. Zhou, A. Lapedriza, A. Khosla, A. Oliva, A. Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1452–1464, 2018. DOI: 10.1109/TPAMI.2017.2723009.
    [48]
    M. Noroozi, A. Vinjimoor, P. Favaro, H. Pirsiavash. Boosting self-supervised learning via knowledge transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 9359−9367, 2018.
    [49]
    P. Krähenbühl, C. Doersch, J. Donahue, T. Darrell. Data-dependent initializations of convolutional neural networks. In Proceedings of the 4th International Conference on Learning Representations, San Juan, Puerto Rico, 2016.
    [50]
    M. Noroozi, H. Pirsiavash, P. Favaro. Representation learning by learning to count. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Venice, Italy, pp. 5899−5907, 2017.
    [51]
    B. Zhou, À. Lapedriza, J. X. Xiao, A. Torralba, A. Oliva. Learning deep features for scene recognition using places database. In Proceedings of Conference in Neural Information Processing Systems, Montreal, Canada, pp. 487−495, 2014.
    [52]
    M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, 2015. DOI: 10.1007/s11263-014-0733-5.
    [53]
    R. Girshick. Fast R-CNN. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Santiago, Chile, pp. 1440−1448, 2015.
    [54]
    J. Long, E. Shelhamer, T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, USA, pp. 3431−3440, 2015.
    [55]
    N. Silberman, D. Hoiem, P. Kohli, R. Fergus. Indoor segmentation and support inference from RGBD images. In Proceedings of the 12th European Conference on Computer Vision, Springer, Florence, Italy, pp. 746-760, 2012.
    [56]
    L. Ladicky, B. Zeisl, M. Pollefeys. Discriminatively trained dense surface normal estimation. In Proceedings of the 13th European Conference on Computer Vision, Springer, Zurich, Switzerland, pp. 468−484, 2014.