Volume 16 Number 4
August 2019
Article Contents
Cite as: Viet Khanh Ha, Jin-Chang Ren, Xin-Ying Xu, Sophia Zhao, Gang Xie, Valentin Masero and Amir Hussain. Deep Learning Based Single Image Super-resolution: A Survey. International Journal of Automation and Computing, vol. 16, no. 4, pp. 413-426, 2019. doi: 10.1007/s11633-019-1183-x

Deep Learning Based Single Image Super-resolution: A Survey

Author Biography:
  • Viet Khanh Ha received the B. Eng. degree in electrical and electronics from Le Quy Don University, Viet Nam in 2008, and the M. Eng. degree in electrical and electronics from Wollongong University, Australia in 2012. He is currently a Ph. D. candidate at the University of Strathclyde, UK. His research interests include image super-resolution using deep learning. E-mail: ha-viet-khanh@strath.ac.uk ORCID iD: 0000-0002-6965-4024

    Jin-Chang Ren received the B. Eng. degree in computer software in 1992, the M. Eng. degree in image processing in 1997, and the Ph. D. degree in computer vision in 2000, all from Northwestern Polytechnical University (NWPU), China. He was also awarded a Ph. D. in electronic imaging and media communication from Bradford University, UK in 2009. Currently, he is with the Centre for Signal and Image Processing (CeSIP), University of Strathclyde, UK. He has published over 150 peer-reviewed journal and conference papers. He acts as an associate editor for two international journals, Multidimensional Systems and Signal Processing and International Journal of Pattern Recognition and Artificial Intelligence. His research interests focus mainly on visual computing and multimedia signal processing, especially semantic content extraction for video analysis and understanding and, more recently, hyperspectral imaging. E-mail: jinchang.ren@strath.ac.uk (Corresponding author) ORCID iD: 0000-0001-6116-3194

    Xin-Ying Xu received the B. Sc. and Ph. D. degrees from the Taiyuan University of Technology, China, in 2002 and 2009, respectively. He is currently a professor with the College of Information Engineering, Taiyuan University of Technology, China. He has published more than 30 academic papers. He is a member of the Chinese Computer Society, and has been a visiting scholar in the Department of Computer Science, San Jose State University, USA. His research interests include computational intelligence, data mining, wireless networking, image processing, and fault diagnosis. E-mail: xuxinying@tyut.edu.cn

    Sophia Zhao received the B. Sc. degree in education from Henan University, China in 1999, and several qualifications from Shipley College, UK during 2003–2005. Currently, she is a research assistant with the Department of Electronic and Electrical Engineering, University of Strathclyde, UK. Her research interests include image/signal analysis, machine learning and optimisation. E-mail: sophia.zhao@strath.ac.uk

    Gang Xie received the B. S. degree in control theory and the Ph. D. degree in circuits and systems from the Taiyuan University of Technology, China, in 1994 and 2006, respectively. He has been a professor and vice principal of Taiyuan University of Science and Technology, China. He has published over 80 research papers. His research interests include rough sets, intelligent computing, image processing, automation and big data analysis. E-mail: xiegang@tyut.edu.cn

    Valentin Masero received the B. Eng. degree in computer science and business administration from University of Extremadura (UEX), Spain, and another B. Eng. degree in computer engineering specialized in software development and artificial intelligence from University of Granada, Spain. He received the Ph. D. degree in computer engineering from UEX, Spain. Now he is an associate professor at UEX. His research interests include image processing, machine learning, artificial intelligence, computer graphics, computer programming, software development, computer applications in industrial engineering, computer applications in agricultural engineering and computer applications in healthcare. E-mail: vmasero@unex.es

    Amir Hussain received the B. Eng. and Ph. D. degrees from the University of Strathclyde in Glasgow, UK, in 1992 and 1997, respectively. Following postdoctoral and academic positions at the Universities of West of Scotland (1996–1998), Dundee (1998–2000) and Stirling (2000–2018), respectively, he joined Edinburgh Napier University (UK) in 2018, as founding director of the Cognitive Big Data and Cybersecurity (CogBiD) Research Laboratory, managing over 25 academic and research staff. He has been appointed to invited visiting professorships at several Universities and Research and Innovation Centres, including at Anhui University (China) and Taibah Valley (Taibah University, Saudi Arabia). He has (co)authored three international patents and around 400 publications, including over a dozen books and 150 journal papers. He has led major multi-disciplinary research projects, funded by national and European research councils, local and international charities and industry, and supervised more than 35 Ph. D. students. He is founding Editor-in-Chief of (Springer Nature′s) Cognitive Computation journal and BMC Big Data Analytics journal. He has been appointed Associate Editor of several other world-leading journals including IEEE Transactions on Neural Networks and Learning Systems, (Elsevier′s) Information Fusion journal, IEEE Transactions on Emerging Topics in Computational Intelligence, and IEEE Computational Intelligence Magazine. Amongst other distinguished roles, he is General Chair for IEEE WCCI 2020 (the world′s largest and top IEEE technical event in computational intelligence, comprising IJCNN, FUZZ-IEEE and IEEE CEC), Vice-Chair of the Emergent Technologies Technical Committee of the IEEE Computational Intelligence Society, and Chapter Chair of the IEEE UK & Ireland Industry Applications Society Chapter. His research interests include developing cognitive data science and AI technologies to engineer the smart and secure systems of tomorrow. E-mail: A.Hussain@napier.ac.uk

  • Received: 2019-01-10
  • Accepted: 2019-04-19
  • [1] D. Glasner, S. Bagon, M. Irani. Super-resolution from a single image. In Proceedings of the 12th International Conference on Computer Vision, IEEE, Kyoto, Japan, pp. 349–356, 2009. DOI: 10.1109/ICCV.2009.5459271.
    [2] J. B. Huang, A. Singh, N. Ahuja. Single image super-resolution from transformed self-exemplars. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, pp. 5197–5206, 2015. DOI: 10.1109/CVPR.2015.7299156.
    [3] W. T. Freeman, E. C. Pasztor, O. T. Carmichael.  Learning low-level vision[J]. International Journal of Computer Vision, 2000, 40(1): 25-47. doi: 10.1023/A:1026501619075
    [4] W. T. Freeman, T. R. Jones, E. C. Pasztor.  Example-based super-resolution[J]. IEEE Computer Graphics and Applications, 2002, 22(2): 56-65. doi: 10.1109/38.988747
    [5] H. Chang, D. Y. Yeung, Y. M. Xiong. Super-resolution through neighbor embedding. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, USA, 2004. DOI: 10.1109/CVPR.2004.1315043.
    [6] C. Y. Yang, M. H. Yang. Fast direct super-resolution by simple functions. In Proceedings of IEEE International Conference on Computer Vision, Sydney, Australia, pp. 561–568, 2013. DOI: 10.1109/ICCV.2013.75.
    [7] R. Timofte, V. De Smet, L. Van Gool. Anchored neighborhood regression for fast example-based super-resolution. In Proceedings of IEEE International Conference on Computer Vision, Sydney, Australia, pp. 1920–1927, 2013. DOI: 10.1109/ICCV.2013.241.
    [8] R. Timofte, V. De Smet, L. Van Gool. A+: Adjusted anchored neighborhood regression for fast super-resolution. In Proceedings of the 12th Asian Conference on Computer Vision, Springer, Singapore, pp. 111–126, 2015. DOI: 10.1007/978-3-319-16817-3_8.
    [9] S. Schulter, C. Leistner, H. Bischof. Fast and accurate image upscaling with super-resolution forests. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, pp. 3791–3799, 2015. DOI: 10.1109/CVPR.2015.7299003.
    [10] E. Pérez-Pellitero, J. Salvador, J. Ruiz-Hidalgo, B. Rosenhahn. PSyCo: Manifold span reduction for super resolution. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, pp. 1837–1845, 2016. DOI: 10.1109/CVPR.2016.203.
    [11] J. C. Yang, J. Wright, T. S. Huang, Y. Ma.  Image super-resolution via sparse representation[J]. IEEE Transactions on Image Processing, 2010, 19(11): 2861-2873. doi: 10.1109/TIP.2010.2050625
    [12] J. C. Yang, Z. W. Wang, Z. Lin, S. Cohen, T. Huang.  Coupled dictionary training for image super-resolution[J]. IEEE Transactions on Image Processing, 2012, 21(8): 3467-3478. doi: 10.1109/TIP.2012.2192127
    [13] T. Peleg, M. Elad.  A statistical prediction model based on sparse representations for single image super-resolution[J]. IEEE Transactions on Image Processing, 2014, 23(6): 2569-2582. doi: 10.1109/TIP.2014.2305844
    [14] S. L. Wang, L. Zhang, Y. Liang, Q. Pan. Semi-coupled dictionary learning with applications to image super-resolution and photo-sketch synthesis. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Providence, USA, pp. 2216–2223, 2012. DOI: 10.1109/CVPR.2012.6247930.
    [15] L. He, H. R. Qi, R. Zaretzki. Beta process joint dictionary learning for coupled feature spaces with application to single image super-resolution. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Portland, USA, pp. 345–352, 2013. DOI: 10.1109/CVPR.2013.51.
    [16] C. Dong, C. C. Loy, K. M. He, X. O. Tang. Learning a deep convolutional network for image super-resolution. In Proceedings of the 13th European Conference on Computer Vision, Springer, Zurich, Switzerland, pp. 184–199, 2014. DOI: 10.1007/978-3-319-10593-2_13.
    [17] C. Dong, C. C. Loy, K. M. He, X. O. Tang.  Image super-resolution using deep convolutional networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 38(2): 295-307. doi: 10.1109/TPAMI.2015.2439281
    [18] J. Kim, J. Kwon Lee, K. Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, pp. 1646–1654, 2016. DOI: 10.1109/CVPR.2016.182.
    [19] J. Kim, J. Kwon Lee, K. Mu Lee. Deeply-recursive convolutional network for image super-resolution. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, pp. 1637–1645, 2016. DOI: 10.1109/CVPR.2016.181.
    [20] Y. Tai, J. Yang, X. M. Liu. Image super-resolution via deep recursive residual network. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, vol. 1, pp. 2790–2798, 2017. DOI: 10.1109/CVPR.2017.298.
    [21] X. J. Mao, C. H. Shen, Y. B. Yang. Image Restoration Using Convolutional Auto-encoders with Symmetric Skip Connections, [Online], Available: https://arxiv.org/abs/1606.08921, May, 2018.
    [22] J. Yamanaka, S. Kuwashima, T. Kurita. Fast and accurate image super resolution by deep CNN with skip connection and network in network. In Proceedings of the 24th International Conference on Neural Information Processing, Springer, Guangzhou, China, 2017. DOI: 10.1007/978-3-319-70096-0_23.
    [23] T. Tong, G. Li, X. J. Liu, Q. Q. Gao. Image super-resolution using dense skip connections. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Venice, Italy, pp. 4809–4817, 2017. DOI: 10.1109/ICCV.2017.514.
    [24] Y. L. Zhang, K. P. Li, K. Li, L. C. Wang, B. N. Zhong, Y. Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 286–301, 2018. DOI: 10.1007/978-3-030-01234-2_18.
    [25] Z. S. Zhong, T. C. Shen, Y. B. Yang, Z. C. Lin, C. Zhang. Joint sub-bands learning with clique structures for wavelet domain super-resolution. In Proceedings of the 32nd Conference on Neural Information Processing Systems, Curran Associates, Inc., Montréal, Canada, pp. 165–175, 2018.
    [26] Y. L. Zhang, Y. P. Tian, Y. Kong, B. N. Zhong, Y. Fu. Residual dense network for image super-resolution. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 2472–2481, 2018. DOI: 10.1109/CVPR.2018.00262.
    [27] J. H. Yu, Y. C. Fan, J. C. Yang, N. Xu, Z. W. Wang, X. C. Wang, T. Huang. Wide Activation for Efficient and Accurate Image Super-resolution, [Online], Available: https://arxiv.org/abs/1808.08718v1, April 8, 2019.
    [28] N. Ahn, B. Kang, K. A. Sohn. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 252–268, 2018. DOI: 10.1007/978-3-030-01249-6_16.
    [29] Z. Hui, X. M. Wang, X. B. Gao. Fast and accurate single image super-resolution via information distillation network. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 723–731, 2018. DOI: 10.1109/CVPR.2018.00082.
    [30] W. S. Lai, J. B. Huang, N. Ahuja, M. H. Yang. Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, USA, vol. 2, pp. 5835-5843, 2017. DOI: 10.1109/CVPR.2017.618.
    [31] R. S. Asamwar, K. M. Bhurchandi, A. S. Gandhi.  Interpolation of images using discrete wavelet transform to simulate image resizing as in human vision[J]. International Journal of Automation and Computing, 2010, 7(1): 9-16. doi: 10.1007/s11633-010-0009-7
    [32] B. Lim, S. Son, H. Kim, S. Nah, K. M. Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops, IEEE, Honolulu, USA, vol. 1, pp. 1132–1140, 2017. DOI: 10.1109/CVPRW.2017.151.
    [33] Y. F. Wang, F. Perazzi, B. McWilliams, A. Sorkine-Hornung, O. Sorkine-Hornung, C. Schroers. A fully progressive approach to single-image super-resolution. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, IEEE, Salt Lake City, USA, 2018. DOI: 10.1109/CVPRW.2018.00131.
    [34] M. Haris, G. Shakhnarovich, N. Ukita. Deep back-projection networks for super-resolution. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 1664–1673, 2018. DOI: 10.1109/CVPR.2018.00179.
    [35] K. Zhang, W. M. Zuo, L. Zhang. Learning a single convolutional super-resolution network for multiple degradations. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 3262–3271, 2018. DOI: 10.1109/CVPR.2018.00344.
    [36] A. Shocher, N. Cohen, M. Irani. Zero-shot super-resolution using deep internal learning. In Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 3118-3126, 2018. DOI: 10.1109/CVPR.2018.00329.
    [37] Q. L. Liao, T. Poggio. Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex, [Online], Available: https://arxiv.org/abs/1604.03640, July 10, 2018.
    [38] W. Han, S. Y. Chang, D. Liu, M. Yu, M. Witbrock, T. S. Huang. Image super-resolution via dual-state recurrent networks. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 1654–1663, 2018. DOI: 10.1109/CVPR.2018.00178.
    [39] Y. Tai, J. Yang, X. M. Liu, C. Y. Xu. MemNet: A persistent memory network for image restoration. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Venice, Italy, pp. 4539–4547, 2017. DOI: 10.1109/ICCV.2017.486.
    [40] X. L. Wang, R. Girshick, A. Gupta, K. M. He. Non-local neural networks. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Salt Lake City, USA, pp. 7794–7803, 2018. DOI: 10.1109/CVPR.2018.00813.
    [41] D. Liu, B. H. Wen, Y. C. Fan, C. C. Loy, T. S. Huang. Non-local recurrent network for image restoration. In Proceedings of the 32nd Conference on Neural Information Processing Systems, Curran Associates, Inc., Montréal, Canada, pp. 1680–1689, 2018.
    [42] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems, MIT Press, Montreal, Canada, pp. 2672–2680, 2014.
    [43] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. H. Wang, W. Z. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, USA, vol. 2, pp. 105–114, 2017. DOI: 10.1109/CVPR.2017.19.
    [44] M. S. Sajjadi, B. Schölkopf, M. Hirsch. EnhanceNet: Single image super-resolution through automated texture synthesis. In Proceedings of IEEE International Conference on Computer Vision, IEEE, Venice, Italy, pp. 4501–4510, 2017. DOI: 10.1109/ICCV.2017.481.
    [45] J. Johnson, A. Alahi, F. F. Li. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the 14th European Conference on Computer Vision, Springer, Amsterdam, The Netherlands, pp. 694–711, 2016. DOI: 10.1007/978-3-319-46475-6_43.
    [46] S. J. Park, H. Son, S. Cho, K. S. Hong, S. Lee. SRFeat: Single image super-resolution with feature discrimination. In Proceedings of the 15th European Conference on Computer Vision, Springer, Munich, Germany, pp. 439–455, 2018. DOI: 10.1007/978-3-030-01270-0_27.
    [47] M. Bevilacqua, A. Roumy, C. Guillemot, M. L. Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of British Machine Vision Conference, BMVA Press, Surrey, UK, 2012.
    [48] R. Zeyde, M. Elad, M. Protter. On single image scale-up using sparse-representations. In Proceedings of the 7th International Conference on Curves and Surfaces, Springer, Avignon, France, pp. 711–730, 2010. DOI: 10.1007/978-3-642-27413-8_47.
    [49] D. Martin, C. Fowlkes, D. Tal, J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings the 8th IEEE International Conference on Computer Vision, IEEE, Vancouver, Canada, 2001. DOI: 10.1109/ICCV.2001.937655.
    [50] V. K. Ha, J. C. Ren, X. Y. Xu, S. Zhao, G. Xie, V. M. Vargas. Deep learning based single image super-resolution: A survey. In Proceedings of the 9th International Conference on Brain Inspired Cognitive Systems, Springer, Xi′an, China, pp. 106–119, 2018. DOI: 10.1007/978-3-030-00563-4_11.
    [51] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen. Improved techniques for training GANs. In Proceedings of the 30th Conference on Neural Information Processing Systems, Curran Associates, Inc., Barcelona, Spain, pp. 2234–2242, 2016.
    [52] M. Arjovsky, L. Bottou. Towards Principled Methods for Training Generative Adversarial Networks, [Online], Available: https://arxiv.org/abs/1701.04862, April 8, 2018.
    [53] L. Metz, B. Poole, D. Pfau, J. Sohl-Dickstein. Unrolled Generative Adversarial Networks, [Online], Available: https://arxiv.org/abs/1611.02163, June 10–20, 2018.
    [54] X. T. Wang, K. Yu, C. Dong, C. C. Loy. Recovering Realistic Texture in Image Super-resolution by Deep Spatial Feature Transform, [Online], Available: https://arxiv.org/abs/1804.02815, October, 2018.


Deep Learning Based Single Image Super-resolution: A Survey

Abstract: Single image super-resolution has attracted increasing attention and has a wide range of applications in satellite imaging, medical imaging, computer vision, security surveillance, remote sensing, and object detection and recognition. Recently, deep learning techniques have emerged and blossomed, producing state-of-the-art results in many domains. Owing to their capability in feature extraction and mapping, they are very helpful for predicting the high-frequency details lost in low-resolution images. In this paper, we give an overview of recent advances in deep learning-based models and methods that have been applied to single image super-resolution tasks. We also summarize, compare and discuss various models from the past and present for a comprehensive understanding, and finally provide open problems and possible directions for future research.

Citation: Viet Khanh Ha, Jin-Chang Ren, Xin-Ying Xu, Sophia Zhao, Gang Xie, Valentin Masero and Amir Hussain. Deep Learning Based Single Image Super-resolution: A Survey. International Journal of Automation and Computing, vol. 16, no. 4, pp. 413-426, 2019. doi: 10.1007/s11633-019-1183-x
    • Single image super-resolution (SISR) aims to obtain a high-resolution (HR) image from a low-resolution (LR) image. It has practical applications in many real-world problems where certain restrictions, such as bandwidth, pixel size, scene details, and other factors, are present in images or video. Since multiple solutions exist for a given input LR image, SISR is an ill-posed inverse problem. The various techniques for solving the SISR problem can be classified into three categories, i.e., interpolation-based, reconstruction-based, and example-based methods. The interpolation-based methods are quite straightforward, but they cannot provide any additional information for reconstruction and therefore the lost high-frequency content cannot be restored. Reconstruction-based methods usually introduce certain knowledge priors or constraints into an inverse reconstruction problem. Representative priors include local structure similarity, non-local means, and edge priors. Example-based methods attempt to learn the prior knowledge from a massive amount of internal or external LR-HR patch pairs, an area in which deep learning techniques have shed new light on SISR.

      This survey focuses mainly on deep learning-based methods and aims to provide a comprehensive introduction to the field of SISR.

      The remainder of this paper is organized as follows: Section 2 provides the background and covers different types of example-based SISR algorithms, followed by recent advances in deep learning related models in Section 3. Section 4 compares convolutional neural networks (CNN)-based SISR algorithms. Section 5 presents in-depth discussions, followed by open questions for future research in Section 6. Finally, the paper is concluded in Section 7.

    • Example-based algorithms aim to enhance the resolution of LR images by learning from other LR-HR patch pair examples. The learned relationship between LR and HR patches is then applied to an unobserved LR image to recover the most likely HR version. Example-based methods can be classified into two types: internal learning and external learning-based methods.

    • Natural images have a self-similarity property: patches tend to recur many times both within the same scale and across different scales of the image.

      To determine the similarity, Glasner et al.[1] compared the original image with multiple cascades of images of decreasing resolution. A scale-space pyramid was constructed to exploit the self-similarity in the given LR image, which was then used to impose a set of constraints on the unknown HR image, as shown in Fig. 1[1]. Since the dictionary is limited to the given LR-HR patch pairs, Huang et al.[2] extended the search space to planar perspective and affine transforms of patches to exploit abundant feature similarity. However, the most important limitation lies in the fact that self-similarity based methods incur high computational complexity due to the huge number of patch searches, and the accuracy of these algorithms varies considerably with the natural properties of images.

      Figure 1.  Pyramid model[1] for SISR. From the bottom, when a similar patch is found in a down-scaled patch (yellow at level I−2), its parent (yellow at level I0) is copied to the unknown HR image with an appropriate gap in scale and support of different kernels. Color versions of the figures in this paper are available online.

    • The external learning-based methods instead attempt to search for similar information from other images or patches. This idea was first introduced to estimate an underlying scene X from given image data Y[3]. The algorithm aimed to learn the posterior probability $ P(X|Y) = \dfrac{1}{P(Y)}P(X, Y) $ by representing image patches and their corresponding scene patches as nodes in a Markov network. It was then applied to generating super-resolution images, where the input image is LR and the scene to be estimated is replaced by an HR image[4].

      Locally linear embedding (LLE) is one of the manifold learning algorithms, based on the idea that high-dimensional data may be represented as a function of a few underlying parameters. LLE begins by finding, for each point, a set of nearest neighbors that can best describe that point as a linear combination of its neighbors. It then finds a low-dimensional embedding of the points such that each point is still represented by the same linear combination of its neighbors. However, one of the disadvantages is that LLE handles non-uniform sample density poorly, because the weights that represent a feature vary across regions of different sample density. The concept of LLE was also applied in SISR neighbor embedding[5], where the reconstruction weights are learned in the LR space before being applied to estimate HR images. There have been several other studies based on local linear regression, such as ridge regression[6], anchored neighborhood regression[7, 8], random forests[9], and manifold embedding[10].
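      As a rough illustration of the neighbor-embedding idea above, the following NumPy sketch (using randomly generated, purely hypothetical patch arrays) computes LLE-style reconstruction weights for an LR patch from its k nearest LR neighbors and reuses those weights to combine the corresponding HR patches:

```python
import numpy as np

def neighbor_embedding_patch(lr_patch, lr_dict, hr_dict, k=5, reg=1e-4):
    """Estimate an HR patch from one LR patch via neighbor embedding.

    lr_dict, hr_dict: arrays of shape (N, d_lr) and (N, d_hr) holding
    corresponding LR/HR training patches (hypothetical data here).
    """
    # 1) Find the k nearest LR neighbors of the input patch.
    dists = np.linalg.norm(lr_dict - lr_patch, axis=1)
    idx = np.argsort(dists)[:k]
    neighbors = lr_dict[idx]                      # (k, d_lr)

    # 2) Solve for LLE weights w minimizing ||lr_patch - sum_j w_j * neighbor_j||^2
    #    subject to sum(w) = 1, using the regularized local covariance.
    diff = neighbors - lr_patch                   # (k, d_lr)
    C = diff @ diff.T + reg * np.eye(k)
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()

    # 3) Apply the same weights to the corresponding HR patches.
    return w @ hr_dict[idx]

# Toy usage with random "patches" (for shape checking only).
rng = np.random.default_rng(0)
lr_dict, hr_dict = rng.normal(size=(200, 9)), rng.normal(size=(200, 81))
hr_est = neighbor_embedding_patch(rng.normal(size=9), lr_dict, hr_dict)
```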

      Another group of algorithms that has received attention is sparsity-based methods. In sparse representation theory, data or images can be described as a linear combination of a few elements chosen from an appropriately over-complete dictionary. Let $ D \in {\bf R}^{n \times K} $ be an over-complete dictionary ($ K\gg n $); we can build a dictionary for most scenarios of inputs, and then any new image (patch) $ X \in {\bf R}^n $ can be represented as $ X = D \alpha $, where $ \alpha $ is a set of sparse coefficients. Hence, there are a dictionary learning problem and a sparse coding problem to optimize D and $ \alpha $, respectively. The objective function for standard sparse coding is

      $ \arg\min\limits_{D}\sum\limits_{i = 1}^{N} \arg\min\limits_{\alpha_i} \frac{1}{2}\Vert x_i - D \alpha_i\Vert^2 + \lambda\Vert \alpha_i\Vert . $

      (1)

      Unlike standard sparse coding, the SISR sparsity-based method works with two dictionaries to learn the compact representation for these patch pairs. Assuming that the observed low-resolution image Y is blurred and a down-sampled version of the high-resolution X:

      $ Y = S\cdot H \cdot X $

      (2)

      where H represents a blurring filter and S the down-sampling operation. Under mild conditions, the sparsest $ \alpha_0 $ can be unique for both dictionaries because the dictionary is over-complete or very large. Hence, the joint sparse coding problem can be represented as

      $\begin{split}& \arg\min\limits_{D_x, D_y} \sum\limits_{i = 1}^{N} \arg\min\limits_{\alpha_i} \frac{1}{2}\Vert x_i - D_x \alpha_i\Vert^2+\\ &\quad\quad\quad\quad \frac{1}{2}\Vert y_i - D_y \alpha_i\Vert^2 + \lambda\Vert \alpha_i\Vert . \end{split}$

      (3)

      The two dictionaries of high-resolution $ D_h $ and low-resolution $ D_l $ are co-trained to find the compact coefficients $ \alpha_h = \alpha_l = \alpha $[11], such that the sparse representation of a high-resolution patch is the same as that of the corresponding low-resolution patch. A dictionary $ D_l $ was first trained to best fit the LR patches, and then the $ D_h $ dictionary was trained to work best with $ \alpha_l $. When these steps were completed, $ \alpha_l $ was used to recover the high-resolution image based on the high-resolution dictionary $ D_h $.

      One of the major drawbacks of this method is that the two dictionaries are not always linearly connected. Another problem is that HR images are unknown in the testing phase, hence the equivalence constraint on the HR sparse representation is not guaranteed as it is in the training phase. Yang et al.[12] suggested a coupled dictionary learning process to impose constraints on the two spaces of LR and HR. The main disadvantage of this method is that both dictionaries are assumed to be strictly aligned, i.e., alignment between $ \alpha_h $ and $ \alpha_l $ or the simplifying assumption $ \alpha_h = \alpha_l $. To avoid this invariance assumption, Peleg and Elad[13] connected $ \alpha_h $ and $ \alpha_l $ via a statistical parametric model. Wang et al.[14] proposed semi-coupled dictionary learning, in which the two dictionaries are not fully coupled. It is based on the assumption that there exists a mapping in the sparse domain $ f(\cdot) $: $ \alpha_l \to \alpha_h $, or $ \alpha_h = f(\alpha_l) $. Therefore, the objective function has an additional error term $ \Vert \alpha_h - f(\alpha_l) \Vert^2 $ and other regularization terms. Beta process joint dictionary learning was proposed in [15], which enables the decomposition of the sparse coefficients into an element-wise multiplication of dictionary atom indicators and coefficient values, providing the much needed flexibility to fit each feature space. Nevertheless, sparsity-based algorithms have remaining limitations in feature extraction and mapping, which are not always adaptive or optimal for generating HR images.
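      To make the sparse-coding pipeline concrete, the sketch below implements only the test-time step, under the assumption that a coupled pair of dictionaries $ D_l $, $ D_h $ has already been learned (random matrices stand in for them here): the LR patch is sparsely coded over $ D_l $ with a simple ISTA solver, and the HR patch is reconstructed with $ D_h $.

```python
import numpy as np

def ista(D, x, lam=0.1, n_iter=100):
    """Solve min_a 0.5*||x - D a||^2 + lam*||a||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        a = a - grad / L
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)   # soft threshold
    return a

def sr_patch(y_lr, D_l, D_h, lam=0.1):
    """Sparse-coding SR for one patch: code y_lr over D_l, decode with D_h."""
    alpha = ista(D_l, y_lr, lam)
    return D_h @ alpha

# Toy usage with random coupled dictionaries (placeholders, not learned ones).
rng = np.random.default_rng(1)
D_l = rng.normal(size=(9, 256));  D_l /= np.linalg.norm(D_l, axis=0)
D_h = rng.normal(size=(81, 256)); D_h /= np.linalg.norm(D_h, axis=0)
x_hr_patch = sr_patch(rng.normal(size=9), D_l, D_h)
```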

    • Convolutional neural networks (CNNs) have developed rapidly in the last two decades. The first CNN model to solve the SISR problem was introduced by Dong et al.[16, 17], named the super-resolution convolutional neural network (SRCNN). Given a training set of LR images and corresponding HR images $ \{x^i, y^i\} $, $ i = 1, \cdots, N $, the objective is to find an optimal model f, which will then be applied to accurately predict Y = f(X) on unobserved examples X. The SRCNN consists of the following steps, as shown in Fig. 2[16]:

      Figure 2.  SRCNN model for SISR

      1) Preprocessing: Upscale the LR image to desired HR image using bicubic interpolation.

      2) Feature extraction: Extract a set of feature maps from the upscaled LR image.

      3) Non-linear mapping: Map the features of LR patches to those of HR patches.

      4) Reconstruction: Produce the HR image from HR patches.
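      Putting the steps above together, a minimal PyTorch sketch of the three-layer SRCNN pipeline is given below; the 9-1-5 kernel sizes and 64/32 channel widths follow the original design, while the padding (the original trains without it) is kept only to preserve the spatial size in this illustration.

```python
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    """Three-layer SRCNN: feature extraction -> non-linear mapping -> reconstruction."""
    def __init__(self, channels=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=9, padding=4),  # feature extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),                   # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, kernel_size=5, padding=2),  # reconstruction
        )

    def forward(self, x_bicubic):
        # x_bicubic: LR image already upscaled to the target size by bicubic interpolation
        return self.body(x_bicubic)

# Toy usage: one 1-channel 64x64 pre-upscaled patch.
y = SRCNN()(torch.randn(1, 1, 64, 64))
print(y.shape)  # torch.Size([1, 1, 64, 64])
```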

      Interestingly, although only three layers are used, the results significantly outperform the non-deep learning algorithms discussed previously. However, it seems that the accuracy cannot be improved much further with this simple model, which raised the question of whether "the deeper the better" holds in super resolution (SR). Inspired by the success of very deep networks, Kim et al.[18, 19] proposed two models, the very deep convolutional network (VDSR)[18] and the deeply recursive convolutional network (DRCN)[19], which both stack 20 convolutional layers, as shown in Figs. 3 (a) and 3 (b). The VDSR is trained with a very high learning rate ($ 10^{-1} $ instead of $ 10^{-4} $ in SRCNN) in order to accelerate convergence, whilst gradient clipping is used to control the gradient explosion problem.

      Figure 3.  VDSR, DRCN, DRRN model for SISR. The same color of yellow or orange indicates the sharing parameters.

      Instead of predicting the whole image as was done in SRCNN, a residual connection is used to force the model to learn only the difference between inputs and outputs. Zero-padding is applied at the borders to avoid the feature maps shrinking quickly through the deep network. In order to gain more benefit from residual learning, Tai et al.[20] used both global and local residual connections in the deeply recursive residual network (DRRN). Global residual learning is used in the identity branch and recursive learning in the local residual branch, as illustrated in Fig. 3(c). Mao et al.[21] proposed a 30-layer convolutional auto-encoder network, namely the residual encoder-decoder network (RED30). The convolutional layers work as a feature extractor and encode the image content, while the de-convolutional layers decode and recover image details. Unlike the other methods mentioned above, the encoder reduces the feature maps to encode the most important features; by doing so, noise/corruption can be efficiently eliminated. Hence, this model has been extensively tested on several image restoration tasks such as image de-noising, JPEG de-blocking, non-blind de-blurring and image in-painting[21].
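      The following sketch illustrates the global residual idea in a VDSR-like network, assuming a bicubically upscaled image as input; the depth and width here are only indicative, and during training, gradient clipping (e.g., torch.nn.utils.clip_grad_norm_) would normally accompany the high learning rate mentioned above.

```python
import torch
import torch.nn as nn

class VDSRLike(nn.Module):
    """Deep CNN with a global residual connection: output = input + predicted residual."""
    def __init__(self, depth=20, width=64, channels=1):
        super().__init__()
        layers = [nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(width, channels, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, x_bicubic):
        residual = self.body(x_bicubic)    # only the high-frequency detail is learned
        return x_bicubic + residual        # global residual (identity) connection

out = VDSRLike()(torch.randn(1, 1, 48, 48))
```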

      Recent advances in CNN architectures such as DenseNet, Network in Network, and Residual Networks have been exploited for SISR applications[22, 23]. Among them, the residual channel attention network (RCAN) and SRCliqueNet have recently been the state-of-the-art (up to 2018) in terms of pixel-wise measurement, as shown in Table 2, Section 4.

      Channel attention. Each of the learned filters operates with a local receptive field, and the interdependence between channels is entangled with spatial correlation. Therefore, the transformation output is unable to exploit information such as the interrelationship between channels outside the local region. The RCAN[24] has been the deepest model (about 400 layers) for the SISR task. It integrates a channel attention mechanism inside the residual block, as shown in Fig. 4[24]: the input of shape H×W×C is squeezed into a channel descriptor by averaging over the spatial dimensions H×W to generate an output of shape 1×1×C. This channel descriptor is passed through a sigmoid gate activation and multiplied element-wise with the input in order to control how much information from each channel is passed up to the next layer in the hierarchy.

      Figure 4.  Channel attention block[24]
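      A minimal sketch of an RCAN-style residual block with channel attention is shown below; the reduction ratio and block width are common choices used here for illustration, not necessarily the original configuration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze H x W x C features to 1 x 1 x C, gate with a sigmoid, rescale channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # spatial squeeze: H x W -> 1 x 1
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                              # per-channel weights in (0, 1)
        )

    def forward(self, x):
        return x * self.gate(self.pool(x))             # element-wise channel rescaling

class RCABlock(nn.Module):
    """Residual block with channel attention, in the spirit of RCAN."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.ca = ChannelAttention(channels)

    def forward(self, x):
        return x + self.ca(self.conv(x))               # local residual connection

y = RCABlock()(torch.randn(1, 64, 24, 24))
```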

      Joint sub-band learning with clique structure – SRCliqueNet[25]. CliqueNet is a newly proposed convolutional network architecture in which any pair of layers in the same block are connected bilaterally, as shown in Fig. 5.

      Figure 5.  Clique block with two updated stages. The four layers 1, 2, 3, 4 in a block are stacked in the order 1, 2, 3, 4, 1, 2, 3, 4 and bilaterally connected by residual shortcuts. It has more skip connections than a DenseNet block.

      The clique block encourages the features to be refined, which provides more discrimination and leads to better performance. Zhong et al.[25] proposed SRCliqueNet, which applies this architecture to jointly learn wavelet sub-bands in both the feature extraction stage and the sub-band refinement stage.

      Concatenation for feature fusion rather than summation – RDN[26]. As the model goes deeper, the features in each layer become hierarchical with different receptive fields. The information from each layer may not be fully used by earlier methods. Zhang et al.[26] proposed concatenation operations on top of DenseNet to build hierarchical features from all layers, as shown in Fig. 6.

      Figure 6.  Residual dense block[26]. All previous features are concatenated to build hierarchical features.
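      A compact sketch of such a residual dense block, with illustrative channel and growth-rate settings, shows the concatenation-based fusion: every convolution sees all previous feature maps, and a 1×1 convolution fuses them before the local residual connection.

```python
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    """Dense connections inside a block: features are concatenated, not summed."""
    def __init__(self, channels=64, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Conv2d(channels + i * growth, growth, 3, padding=1)
             for i in range(num_layers)]
        )
        self.fuse = nn.Conv2d(channels + num_layers * growth, channels, 1)  # 1x1 fusion

    def forward(self, x):
        feats = [x]
        for conv in self.layers:
            feats.append(torch.relu(conv(torch.cat(feats, dim=1))))  # concat all so far
        return x + self.fuse(torch.cat(feats, dim=1))                # local residual

y = ResidualDenseBlock()(torch.randn(1, 64, 24, 24))
```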

      Wide activation in the residual block – wide-activated deep super-resolution network (WDSR)[27]. Higher efficiency and accuracy can be achieved with fewer parameters than the enhanced deep super-resolution network (EDSR) by expanding the number of channels by a factor of $ \sqrt{r} $ before the rectified linear unit (ReLU) activation in residual blocks. Correspondingly, the residual identity mapping path is slimmed by a factor of $ \sqrt{r} $ to keep the number of output channels constant.
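      A sketch of such a wide-activation residual block follows; the expansion factor is illustrative, and the key point is that the identity path and the block output keep the original (slim) width while only the pre-activation features are widened.

```python
import torch
import torch.nn as nn

class WideActivationBlock(nn.Module):
    """Residual block that widens the channel count before ReLU and narrows it after,
    keeping the identity path (and the block's output width) unchanged."""
    def __init__(self, channels=32, expansion=4):
        super().__init__()
        wide = channels * expansion
        self.body = nn.Sequential(
            nn.Conv2d(channels, wide, 3, padding=1),   # expand before activation
            nn.ReLU(inplace=True),
            nn.Conv2d(wide, channels, 3, padding=1),   # project back down
        )

    def forward(self, x):
        return x + self.body(x)

y = WideActivationBlock()(torch.randn(1, 32, 24, 24))
```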

      Cascading residuals to incorporate the features from multiple layers – cascading residual network (CARN)[28]. An interesting finding is that MemNet (Section 3.2), RDN and CARN share similar mechanisms. On top of the ResNet architecture, they all use 1 × 1 convolutions as a fusion module to incorporate multiple features from previous layers. This boosts performance effectively and can be considered in model design.

      Information distillation network – IDN[29]. The IDN model uses a distillation block, which combines an enhancement unit with a compression unit, so that information is distilled inside the block before it passes to the next level.

      When we use neural networks to generate images, up-sampling from low resolution to high resolution is usually involved. One of the problems with interpolation-based up-sampling is that it is predefined, so there is nothing for the network to learn. It is also criticized for its high computational complexity, since subsequent computation is carried out in the HR space without any additional information. On the other hand, transposed convolution and PixelShuffle have learnable parameters for optimally up-sampling the input; they provide flexible up-sampling and can be inserted at any place in the architecture. Lai et al.[30] proposed the Laplacian pyramid super-resolution network (LapSRN) to reconstruct the image progressively. In general, the Laplacian pyramid scheme decomposes an image into a series of high-pass and low-pass bands. At each level of reconstruction, a transposed convolution is used to up-sample the image in both the high-pass branch and the low-pass branch. Besides the Laplacian decomposition, the wavelet transform (WT) has been shown to be an efficient and highly intuitive tool to represent and store images in a multi-resolution way. WT can describe the contextual and textural information of an image at different scales, and has been applied successfully to the multi-frame SR problem. However, the conventional discrete wavelet transform reduces the image size by a factor of $2^n$, which is inconvenient when test images are of arbitrary size. Asamwar et al.[31] proposed using the discrete wavelet transform to reduce the image to any (variable scale) size.
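      The two learnable up-sampling options mentioned above can be sketched as follows (channel counts and scale are illustrative): the PixelShuffle path first multiplies the channel count by r² with a convolution and then rearranges those channels into an r-times larger spatial grid, while a transposed convolution up-samples directly.

```python
import torch
import torch.nn as nn

class PixelShuffleUpsampler(nn.Module):
    """Learnable upsampling: a conv produces r*r times more channels, then
    nn.PixelShuffle rearranges them into an r-times larger spatial grid."""
    def __init__(self, channels=64, scale=2, out_channels=3):
        super().__init__()
        self.expand = nn.Conv2d(channels, out_channels * scale ** 2, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, feats):
        return self.shuffle(self.expand(feats))

# A transposed convolution is the other common learnable upsampler.
deconv = nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 64, 16, 16)
print(PixelShuffleUpsampler()(x).shape, deconv(x).shape)  # both 1 x 3 x 32 x 32
```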

      For comparison, most SISR algorithms are evaluated on LR images that were downsampled from the HR images with scaling factors of 2x, 3x and 4x; beyond these factors, the features available in the LR space do not suffice for learning. It has been suggested that a training model for high upscaling factors can benefit from a model pre-trained on lower upscaling factors[32]; in other words, this can be described as transfer learning. Wang et al.[33] proposed a progressive asymmetric pyramidal structure to adapt to multiple upscaling factors, up to a large scaling factor of 8x. Also, a deep back-projection network[34] using mutually connected up-sampling and down-sampling stages has been used to reach such high up-scaling factors. These experiments support the recommendation to use progressive up-sampling or iterative up- and down-sampling when reconstructing SR images under larger scaling factors.

      By assuming that a low-resolution image is simply downsampled from the corresponding high-resolution image, CNN-based methods ignore the true degradations, such as noise, present in real-world applications. Zhang et al.[35] proposed super-resolution for multiple degradations (SRMD), training on LR images synthesized with three kinds of degradations: a blur kernel, bicubic downsampling, and additive white Gaussian noise (AWGN). Obviously, to learn invariant features, this model had to use a large training dataset of approximately 6 000 images. Shocher et al.[36] observed strong internal data repetition in natural images, similar to that in [1]. The information for tiny objects, for example, is often better found inside the image itself than in any external database of examples. A "zero-shot" SR (ZSSR) method[36] was then proposed that does not rely on any prior image examples or prior training. It exploits cross-scale internal recurrence of image-specific information, where the test image itself is used for training before being fed again to the resulting trained network. Because little research has been focused on variant degradations in SISR, more evaluations and comparisons are required, and further investigations would be of great help.

    • A ResNet with weight sharing can be interpreted as an unrolled single-state recurrent neural network (RNN)[37]. A dual-state recurrent network (DSRN)[38] allows both the LR path and the HR path to capture information at different spaces; they are connected at every step in order to contribute jointly to the learning process, as shown in Fig. 7[38]. However, averaging all recovered SR images at each stage may deteriorate the result. Another reason is that the down-sampling operation at every stage can lead to information loss at the final reconstruction layer.

      Figure 7.  Dual state model[38]. The top branch operates on the HR space, while the bottom branch works on the LR space. A connection from LR to HR uses a de-convolution operation; a delayed feedback mechanism connects the previously predicted HR to the LR path at the next stage.

      In the view of memory in RNNs, CNNs can be interpreted as having short-term memory: the conventional plain CNN adopts a single-path feed-forward architecture, in which a latter feature is influenced only by the previous state. When skip connections are introduced, the memory becomes limited long-term: one state is influenced by the previous state and a specific prior state. To enable the latter state to see more prior states and decide whether the information should be kept or discarded, Tai et al.[39] proposed the memory network (MemNet), which uses recursive layers followed by a memory unit to allow the combination of short-term and long-term memory for image reconstruction, as shown in Fig. 8[39]. In this model, a gate unit controls information from the prior recursive units, which extract features at different levels.

      Figure 8.  Memory block in the MemNet model[39], which includes multiple recursive units and a gate unit.

      Unlike convolutional operations, which capture features by repeatedly processing local neighborhoods of pixels, a non-local operation describes a pixel as a weighted combination of all other pixels, regardless of their positional distance or channel. Non-local means provides an efficient procedure for image noise reduction; however, local and non-local based methods were treated separately, thereby not taking advantage of both. The non-local block was introduced in [40], enabling the integration of non-local operations into end-to-end training with local operation based models such as CNNs. Each pixel at position $ i $ in an image can be described as

      $ y_i = \frac{1}{C(x)}\sum\limits_{j \in \varOmega}^{}f(x_i, x_j)g(x_j) $

      (4)

      where $ f(x_i, x_j) ={\rm e}^{\Theta(x_i)^{\rm T}\varphi(x_j)} $ is a weighting function, measuring how closely related the image at point $ i $ is to the image at point $ j $. Thus, by choosing $ \Theta(x_i) = W_\Theta x_i $, $ \varphi(x_j) = W_\varphi x_j $ and $ g(x_j) = W_g x_j $, the self-similarity can be jointly learned in the embedding space by the following block, as shown in Fig. 9[40].

      Figure 9.  A non-local block[40]
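      A compact PyTorch sketch of an embedded-Gaussian non-local block in residual form, following (4) with 1×1 convolutions for $ \Theta $, $ \varphi $ and $ g $ (channel widths are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    """Each position becomes a softmax-weighted combination of features at all positions."""
    def __init__(self, channels=64, inner=32):
        super().__init__()
        self.theta = nn.Conv2d(channels, inner, 1)
        self.phi = nn.Conv2d(channels, inner, 1)
        self.g = nn.Conv2d(channels, inner, 1)
        self.out = nn.Conv2d(inner, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (b, hw, inner)
        k = self.phi(x).flatten(2)                     # (b, inner, hw)
        v = self.g(x).flatten(2).transpose(1, 2)       # (b, hw, inner)
        attn = F.softmax(q @ k, dim=-1)                # f(x_i, x_j) normalized by C(x)
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual form, as in [40]

y = NonLocalBlock()(torch.randn(1, 64, 16, 16))
```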

      For SISR tasks, Liu et al.[41] incorporated this model into an RNN by maintaining two paths: a regular path that contains convolution operations on the image, and another path that maintains non-local information at each step as an input branch in the regular RNN structure. However, non-local means has the disadvantage that its remarkable denoising results are obtained at a high computational cost, due to the enormous amount of weighting computations.

    • The generative adversarial network (GAN) was first introduced in [42], targeting a minimax game between a discriminative network D and a generative network G. The generative network G takes an input z $ \sim $ p(z) in the form of random noise, then outputs new data G(z), whose distribution $ p_g $ is supposed to be close to the data distribution $ p_{\rm data} $. The task of the discriminative network D is to distinguish a generated sample G(z) $ \sim $ $ p_g $(G(z)) from a ground truth data sample x $ \sim $ $ p_{\rm data} $(x). In other words, the discriminative network determines whether the given images are natural-looking or look artificially created. As the models are trained through alternating optimization, both networks improve until they reach a point, called a Nash equilibrium, where fake images are indistinguishable from real images. The objective function is represented as

      $\begin{split} &\min\limits_{G} \max\limits_{D} E_{x \sim p_{data}} [\log D(x)] + E_{z \sim p_z} [\log(1-D(G(z)))]=\\ &\quad \min\limits_{G} \max\limits_{D} E_{x \sim p_{data}} [\log D(x)] + E_{x \sim p_z} [\log(1-D(x))].\end{split}$

      (5)

      This concept is consistent with the problem setting in image super-resolution. Ledig et al.[43] introduced the super-resolution generative adversarial network (SRGAN) model, in which a generative network upsamples LR images to super-resolution (SR) images and the discriminative network distinguishes ground truth HR images from SR images. Pixel-wise quality assessment metrics have been criticized for correlating poorly with human perception. By incorporating the newly introduced adversarial loss, GAN-based algorithms have addressed this problem and produced highly perceptive, naturalistic images, as can be seen from Fig. 10[43].

      Figure 10.  From left to right: image reconstructed by bicubic interpolation, by a deep residual network (SRResNet) optimized for MSE, by SRGAN optimized to be more sensitive to human perception, and the original image. The corresponding PSNR and SSIM are given on top. Zoomed views of the red rectangles are shown at the bottom right.

      The GAN-based SISR model was developed further in [44, 45], resulting in an improved SRGAN through the fusion of pixel-wise loss, perceptual loss, and a newly proposed texture transfer loss. Park et al.[46] proposed SRFeat and employed an additional discriminator in the feature domain. The generator is trained in two phases: pre-training and adversarial training. In the pre-training phase, the generator is trained to obtain a high PSNR by minimizing the MSE loss. The adversarial training phase then focuses on improving perceptual quality using a perceptual similarity loss (Section 5.2.2), a GAN loss in the pixel domain and a GAN loss in the feature domain. Perhaps the most serious disadvantage of GAN-based SISR methods is the difficulty of training the models, which will be further discussed in Section 5.2.
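      As a rough illustration of how such losses are typically combined for the generator (the weighting coefficients and the VGG layer cut below are illustrative, not the exact values used in SRGAN or SRFeat):

```python
import torch
import torch.nn as nn
import torchvision

class GeneratorLoss(nn.Module):
    """Combine pixel (MSE), VGG-feature (perceptual), and adversarial losses.
    In practice an ImageNet-pretrained VGG is used as a fixed feature extractor."""
    def __init__(self, w_pixel=1.0, w_feat=6e-3, w_adv=1e-3):
        super().__init__()
        vgg = torchvision.models.vgg19()               # load pretrained weights in practice
        self.vgg = vgg.features[:36].eval()            # truncate at a deep feature layer
        for p in self.vgg.parameters():
            p.requires_grad = False
        self.mse = nn.MSELoss()
        self.bce = nn.BCEWithLogitsLoss()
        self.w = (w_pixel, w_feat, w_adv)

    def forward(self, sr, hr, d_logits_on_sr):
        pixel = self.mse(sr, hr)                                          # content loss
        feat = self.mse(self.vgg(sr), self.vgg(hr))                       # perceptual loss
        adv = self.bce(d_logits_on_sr, torch.ones_like(d_logits_on_sr))   # fool the discriminator
        return self.w[0] * pixel + self.w[1] * feat + self.w[2] * adv
```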

    • In order to provide a brief overview of the current performance of deep learning-based SISR algorithms, we compare some recent work in Tables 1 and 2. Two image quality metrics are used for performance evaluation: the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) index. The higher the PSNR and SSIM, the better the quality of the reconstructed image.

      Model | Scale | Set5 PSNR/SSIM | Set14 PSNR/SSIM | B100 PSNR/SSIM | Urban100 PSNR/SSIM
      SRCNN | 2 | 36.66/0.9542 | 32.45/0.9067 | [-] | [-]
      SRCNN | 3 | 32.75/0.9090 | 29.30/0.8215 | [-] | [-]
      SRCNN | 4 | 30.49/0.8628 | 27.50/0.7513 | [-] | [-]
      VDSR | 2 | 37.53/0.9587 | 33.03/0.9124 | 31.90/0.8960 | 30.76/0.9140
      VDSR | 3 | 33.66/0.9213 | 29.77/0.8314 | 28.82/0.7976 | 27.14/0.8279
      VDSR | 4 | 31.35/0.8838 | 28.01/0.7674 | 27.29/0.7251 | 25.18/0.7524
      DRCN | 2 | 37.63/0.9588 | 33.04/0.9118 | 31.85/0.8942 | 30.75/0.9133
      DRCN | 3 | 33.82/0.9226 | 29.76/0.8311 | 28.80/0.7963 | 27.15/0.8276
      DRCN | 4 | 31.53/0.8854 | 28.02/0.7670 | 27.23/0.7233 | 25.14/0.7510
      DRRN | 2 | 37.74/0.9591 | 33.23/0.9136 | 32.05/0.8973 | 31.23/0.9188
      DRRN | 3 | 34.03/0.9244 | 29.96/0.8349 | 28.95/0.8004 | 27.53/0.8378
      DRRN | 4 | 31.68/0.8880 | 28.21/0.7720 | 25.44/0.7634 | 25.44/0.7638
      RED30 | 2 | 37.66/0.9599 | 32.94/0.9144 | [-] | [-]
      RED30 | 3 | 33.82/0.9230 | 29.61/0.8341 | [-] | [-]
      RED30 | 4 | 31.51/0.8869 | 27.86/0.7718 | [-] | [-]
      MemNet | 2 | 37.78/0.9597 | 33.28/0.9142 | 32.08/0.8978 | 31.31/0.9195
      MemNet | 3 | 34.09/0.9248 | 30.00/0.8350 | 28.96/0.8001 | 27.56/0.8376
      MemNet | 4 | 31.74/0.8893 | 28.26/0.7723 | 27.40/0.7281 | 25.50/0.7630
      LapSRN | 2 | 37.52/0.9590 | 33.08/0.9130 | 31.80/0.8950 | 30.41/0.9100
      LapSRN | 3 | [-] | [-] | [-] | [-]
      LapSRN | 4 | 31.54/0.8850 | 28.19/0.7720 | 27.32/0.7280 | 25.21/0.7560
      Zero Shot | 2 | 37.37/0.9570 | 33.00/0.9108 | [-] | [-]
      Zero Shot | 3 | 33.42/0.9188 | 29.80/0.8304 | [-] | [-]
      Zero Shot | 4 | 31.13/0.8796 | 28.01/0.7651 | [-] | [-]
      EDSR | 2 | 38.20/0.9606 | 34.02/0.9204 | 32.37/0.9018 | 33.10/0.9363
      EDSR | 3 | 34.77/0.9290 | 30.66/0.8481 | 29.32/0.8104 | 29.02/0.8685
      EDSR | 4 | 32.62/0.8984 | 28.94/0.7901 | 27.79/0.7437 | 26.86/0.8080
      IDN | 2 | 37.83/0.9600 | 33.30/0.9148 | 32.08/0.8985 | 31.27/0.9196
      IDN | 3 | 34.11/0.9253 | 29.99/0.8354 | 28.95/0.8013 | 27.42/0.8359
      IDN | 4 | 31.82/0.8903 | 28.25/0.7730 | 27.41/0.7297 | 25.41/0.7632
      CARN | 2 | 37.76/0.9590 | 33.52/0.9166 | 32.09/0.8978 | 31.92/0.9256
      CARN | 3 | 34.29/0.9255 | 30.29/0.8407 | 29.06/0.8034 | 28.06/0.8493
      CARN | 4 | 32.13/0.8937 | 28.60/0.7806 | 27.58/0.7349 | 26.07/0.7837
      RDN | 2 | 38.30/0.9616 | 34.10/0.9218 | 32.40/0.9022 | 33.09/0.9368
      RDN | 3 | 34.78/0.9300 | 30.67/0.8482 | 29.33/0.8105 | 29.00/0.8683
      RDN | 4 | 32.61/0.9003 | 28.92/0.7893 | 26.82/0.8069 | 26.82/0.8069
      SRCliqueNet+ | 2 | 38.28/0.9630 | 34.03/0.9240 | 32.40/0.9060 | 32.95/0.9370
      SRCliqueNet+ | 3 | [-] | [-] | [-] | [-]
      SRCliqueNet+ | 4 | 32.67/0.9030 | 28.95/0.7970 | 27.81/0.7520 | 26.80/0.8100
      RCAN+ | 2 | 38.27/0.9614 | 34.23/0.9225 | 32.46/0.9031 | 33.54/0.9399
      RCAN+ | 3 | 34.85/0.9305 | 30.76/0.8494 | 29.39/0.8122 | 29.31/0.8736
      RCAN+ | 4 | 32.73/0.9013 | 28.98/0.7910 | 27.85/0.7455 | 27.10/0.8142

      Table 2.  Quantitative evaluation of state-of-the-art SR algorithms: average PSNR/SSIM for scale factors 2x, 3x and 4x. In the original article, red text indicates the best and blue text the second best performance; entries not provided by the authors are marked by [-].

      Model | Input | Type of network | Number of params | Mult-adds | Reconstruction | Train data | Loss function
      SRCNN | LR + Bicubic | Supervised | 8 K | 52.7 G | Direct | Yang91 | L2 (MSE)
      VDSR | LR + Bicubic | Supervised | 666 K | 612 G | Direct | G200+Yang91 | L2
      DRCN | LR + Bicubic | Supervised | 1 775 K | 17 974 G | Direct | Yang91 | L2
      DRRN | LR + Bicubic | Supervised | 297 K | 6 796 G | Direct | G200+Yang91 | L2
      RED30 | LR + Bicubic | Supervised | 4.2 M | [-] | Direct | BSD300 | L2
      LapSRN | LR | Supervised | 812 K | 29.9 G | Progressive | G200+Yang91 | Charbonnier
      MemNet | LR + Bicubic | Supervised | 677 K | 2 662 G | Direct | G200+Yang91 | L2
      Zero-Shot | LR + Bicubic | Unsupervised | 225 K | [-] | Direct | [-] | L1 (MAE)
      Dual State | LR + Bicubic | Supervised | 1.2 M | [-] | Progressive | Yang91 | L2
      SRGAN | LR | Supervised | [-] | [-] | Direct | ImageNet | L2 + perceptual loss
      EDSR | LR | Supervised | 43 M | 2 890 G | Direct | DIV2K | L1
      IDN | LR | Supervised | 677 K | [-] | Direct | G200+Yang91 | L1
      CARN | LR | Supervised | 1.6 M | 222 G | Direct | DIV2K+Yang91+B200 | L1
      RDN | LR | Supervised | 22.6 M | 1 300 G | Direct | DIV2K | L1
      SRCliqueNet+ | LR | Supervised | [-] | [-] | Direct | DIV2K+Flickr | L1 + L2
      RCAN+ | LR | Supervised | 16 M | [-] | Direct | DIV2K | L1

      Table 1.  Comparison of different SISR models. Missing information that was not provided by the authors is marked by [-].

      The PSNR can be described as

      $ {\rm PSNR} = 10\log_{10} {\frac{{\rm 255}^2}{{\rm MSE}}} $

      (6)

      where MSE is the mean squared error between two images $ I_1 $ and $ I_2 $:

      $ {\rm MSE} = \frac{\sum_{m, n}[I_1(m, n)-I_2(m, n)]^2}{M\times N}.$

      (7)

      Here, M and N are the number of rows and columns in the input images, respectively. Equation (6) shows that minimizing the $ L_2 $ loss tends to maximize the PSNR value.
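      For reference, a minimal NumPy implementation of (6) and (7) might look as follows. The peak value of 255 assumes 8-bit images, and common practices in SR papers such as cropping borders and evaluating on the luminance (Y) channel only are omitted here.

```python
import numpy as np

def psnr(img1, img2, peak=255.0):
    """PSNR between two images of the same size, following (6) and (7)."""
    img1 = img1.astype(np.float64)
    img2 = img2.astype(np.float64)
    mse = np.mean((img1 - img2) ** 2)        # (7): mean squared pixel error
    if mse == 0:                             # identical images
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)  # (6)

# Usage: hr is the ground-truth HR image, sr the reconstructed output (uint8 arrays).
# print(psnr(hr, sr))
# SSIM is available in scikit-image (the colour-channel keyword varies by version,
# e.g., channel_axis=-1 in recent releases):
# from skimage.metrics import structural_similarity as ssim
# print(ssim(hr, sr, channel_axis=-1))
```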

      Table 1 summarizes some typical deep learning based SISR models, including SRCNN[17], VDSR[18], DRCN[19], DRRN[20], RED30[21], RCAN[24], SRCliqueNet[25], RDN[26], CARN[28], IDN[29], LapSRN[30], EDSR[32], Zero Shot[36], and MemNet[39]. The detailed performance comparison of those models is presented in Table 2. Four standard benchmark datasets are used, SET5[47], SET14[48], B100[49] and URBAN100[2], which are popular for the comparison of SR algorithms. The down-sampling scale factors include 2x, 3x and 4x, and missing information that was not provided by the authors is marked by [–]. All quantitative results are reproduced from the original papers.

      From Table 1, Table 2 and Fig. 11, CARN stands out for its high accuracy with a small model. SRCliqueNet+ and RCAN+ achieve higher accuracy than EDSR in terms of PSNR/SSIM whilst requiring smaller model sizes. GAN-based models favour perceptual reconstruction, so they are not included in Table 2 and Fig. 11.

      Figure 11.  Comparison of the PSNR accuracy of different algorithms on the four testing datasets with a scale factor of 4x

    • Generally, when a random variable X has been observed, the aim is to predict the random variable Y as the output of the network. Let g(X) be the predictor; clearly, we would like to choose g so that g(X) tends to be close to Y. One possible criterion for closeness is to choose g to minimize $ E[(Y-g(X))^2] $, in which case the optimal predictor of Y becomes $ g(X) = E[Y|X] $, the conditional expectation of Y given X. Most objective functions originally come from maximum likelihood estimation (MLE), and we will show that the typical objective functions below are special cases of MLE.

    • By using CNNs, the mapping between a pair of corresponding LR and HR images is non-linear. The classical content loss functions for this regression problem are LAD (least absolute deviations, or $ L_1 $) and LSE (least squared errors, or $ L_2 $), defined as

      $ L_1 = \sum\limits_{i = 1}^n|\hat{y}_i - y_i| $

      (8)

      $ L_2 = \sum\limits_{i = 1}^n(\hat{y}_i - y_i)^2 $

      (9)

      where the estimate of y is given by $ y = W^{\rm T} x $ and $ {\hat{y}}$ is the ground truth. The objective is to minimize the cost function with respect to the weight matrix W. If we write the regression target as $ {\hat{y}}= {y} + \xi $ and model the regression target as a Gaussian random variable y $ \sim N(\mu,\;\sigma^2) $ with $ \mu = $ y $ = W^{\rm T} x $, the prediction model is

      $\begin{split} P(\hat{y}|x, W) & = N(\hat{y}|W^{\rm T} x, \sigma^2)= \\ & \frac{1}{\sqrt{2 \sigma^2 \pi}}\exp\left(-\frac{(\hat{y} - W^{\rm T} x)^2}{2\sigma^2}\right) \end{split}$

      (10)

      then the optimal W can be determined by maximum likelihood estimation:

      $\begin{split}W_{\rm MLE}& = \arg\max\limits_{\substack{W}} N(\hat{y}|W^{\rm T} x, \sigma^2) =\\ & \arg\max\limits_{\substack{W}}\exp\left(-\frac{(\hat{y} - W^{\rm T} x)^2}{2\sigma^2}\right).\end{split}$

      (11)

      Taking the logarithm of the likelihood function, and making use of the standard form ($ \sigma = 1 $), we obtain the objective function:

      $ W_{\rm MLE} = \arg\min\limits_{\substack{W}}\frac{1}{2}(\hat{y} - W^{\rm T} x)^2 $

      (12)

      which is equivalent to minimizing the $ L_2 $ loss in (9). In other words, the least squares estimate is the same as the maximum likelihood estimate under a Gaussian model. If we instead replace the $ L_2 $ loss with the $ L_1 $ loss, i.e., minimize $ E[|Y-g(X)|] $ as mentioned previously, the optimal predictor is $ g(X) = {\rm median}(Y|X) $, which is likewise an MLE solution (under a Laplacian noise assumption). It is important to bear in mind that these assumptions describe a uni-modal distribution with a single peak, and will not work well for predicting multi-modal distributions. Another problem with content losses is that a minor change in pixels, for example a small shift, can lead to a dramatically decreased PSNR. This problem has been discussed in our previous work[50] with experimental results.
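      The sensitivity of pixel-wise losses (and hence PSNR) to small geometric perturbations can be reproduced with a short NumPy experiment; a random image is used here only to keep the snippet self-contained, and a natural photograph shows the same, somewhat less extreme, behaviour.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(256, 256)).astype(np.float64)  # toy stand-in image

def psnr(a, b, peak=255.0):
    mse = np.mean((a - b) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

noisy = img + rng.normal(0.0, 2.0, img.shape)   # mild additive noise
shifted = np.roll(img, 1, axis=1)               # content shifted by a single pixel

print(psnr(img, noisy))    # ~42 dB: small pixel-wise error, high PSNR
print(psnr(img, shifted))  # ~8 dB on this toy image: PSNR collapses although
                           # the content is essentially unchanged
```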

    • A key relationship between images and statistics is that we can interpret images as samples from a high-dimensional probability distribution. This distribution is defined over the pixels of images and is what we use to decide whether an image looks natural or not. This is where the Kullback-Leibler (KL) divergence comes into play. It measures the difference between two probability distributions, unlike the Euclidean distances underlying the $ L_1 $ and $ L_2 $ losses. It may be tempting to treat it as a distance metric, but the KL divergence is not a true distance between two distributions because it is not symmetric. Given two distributions $ P_{data} $ and $ P_{model} $, the forward KL divergence can be computed as follows:

      $\begin{split}& D_{KL}[P_{x|data}||P_{x|model}] = E_{x \sim P_{data}}\log \frac{P_{x|data}}{P_{x|model}}=\\ &\quad E_{x \sim P_{data}}[\log P_{x|data}] - E_{x \sim P_{data}}[\log P_{x|model}].\end{split}$

      (13)

      The left term is the (negative) entropy of $ P_{x|data} $, which does not depend on the model and can thus be ignored. If we draw N samples $ x \sim P_{x|data} $ and let N go to infinity, then by the law of large numbers we have

      $-\frac{1}{N}\sum\limits_{i=1}^{N} \log P(x_i|model) \rightarrow -E_{x \sim P_{x|data}}[\log P(x|model)]$

      (14)

      where the left-hand side is the (average) negative log-likelihood of the samples. Minimizing the Kullback-Leibler divergence is therefore equivalent to maximizing the log-likelihood.
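      A small numerical check of this equivalence, using a toy three-state distribution (the probabilities are arbitrary):

```python
import numpy as np

p_data = np.array([0.1, 0.4, 0.5])    # "true" data distribution (arbitrary toy values)
p_model = np.array([0.2, 0.3, 0.5])   # model distribution

# Forward KL divergence as in (13).
kl = np.sum(p_data * (np.log(p_data) - np.log(p_model)))

# Monte-Carlo estimate of the negative log-likelihood term in (14).
rng = np.random.default_rng(0)
samples = rng.choice(len(p_data), size=200_000, p=p_data)
nll = -np.mean(np.log(p_model[samples]))

entropy = -np.sum(p_data * np.log(p_data))    # data entropy, independent of the model
print(kl, nll - entropy)   # agree up to sampling noise: minimizing KL <=> maximizing log-likelihood
```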

      When $ P_{model} = P_{data} $, the KL divergence reaches its minimum of 0. It is assumed that human observers have learnt $ p_{data} $, the distribution of natural images, as a kind of prior belief. GAN-based models encourage reconstructed images to have distributions similar to those of the ground truth images; this is the adversarial loss, part of the perceptual loss in SRGAN[43]. Adversarial learning is particularly useful when facing the complicated manifold distributions of natural images. However, training a GAN-based model is difficult due to several drawbacks:

      1) Hard to achieve a Nash equilibrium[51]: According to game theory, a GAN-based model converges when the discriminator and the generator reach a Nash equilibrium. However, updating each model without regard to the other cannot guarantee convergence, and the two models can end up in a state where the update of one no longer matters to the other.

      2) Vanishing gradient problem[52]: As given in (5), when the discriminator becomes very good we can assume that $ D(x) = 1,\;\forall x \in p_{data} $ and $ D(x) = 0,\;\forall x \in p_z $, so the loss function falls to 0 and ends up with a vanishing gradient. As a result, learning is extremely slow or even stalls (a short sketch below reproduces this effect). Conversely, when the discriminator behaves badly, it does not give the generator accurate feedback.

      3) Mode collapse[53]: The generator produces samples of limited diversity, or even the same sample regardless of the input.

      We have demonstrated that the $L_1$ and $L_2$ losses are special cases of MLE, and further that minimizing the KL divergence is equivalent to MLE. This leads to the question of whether there exists another, more effective form of MLE that is better suited to image super-resolution.
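      Returning to drawback 2) above, the vanishing-gradient behaviour can be reproduced in a few lines of PyTorch on a single discriminator logit; this is a toy sketch of the generic GAN loss, not of any particular SR implementation.

```python
import torch

# Once the discriminator confidently rejects a generated sample (D(G(z)) -> 0),
# the original saturating generator loss log(1 - D(G(z))) provides almost no
# gradient, while the commonly used non-saturating alternative -log D(G(z)) does.
logit = torch.tensor(-9.0, requires_grad=True)   # pre-sigmoid score: D(G(z)) = sigmoid(-9) ~ 1e-4

saturating = torch.log(1.0 - torch.sigmoid(logit))
saturating.backward()
print("saturating gradient:    ", logit.grad.item())   # ~ -1e-4, learning stalls

logit.grad = None
non_saturating = -torch.log(torch.sigmoid(logit))
non_saturating.backward()
print("non-saturating gradient:", logit.grad.item())   # ~ -1.0, useful signal remains
```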

    • The MSE in feature space compares two images based on high-level representations extracted from pre-trained convolutional neural networks (trained on image classification tasks, e.g., on the ImageNet dataset), as shown in Fig. 12.

      Figure 12.  Model structure for calculating perceptual loss[45]

      Given an input image x, the image transform net converts it into the output image $ \hat{y} $. Rather than matching the pixels of the output image to the pixels of the target image, the two are encouraged to have similar feature representations as measured by the loss network. The perceptual loss is defined by computing the MSE between later sets of activations, and has been applied in particular to super-resolution and style transfer. In practice, different kinds of loss functions can be combined, but each of the losses mentioned has its own particular properties; no single loss function works well for all kinds of data.
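      As a concrete illustration, a feature-space MSE of this kind is often implemented with a frozen, ImageNet-pre-trained VGG as the loss network. The sketch below is one possible PyTorch realisation; the use of VGG-19 and the particular cut-off layer (roughly relu3_3) are illustrative choices, and different papers pick different layers and weightings.

```python
import torch
import torch.nn as nn
from torchvision import models

class VGGFeatureLoss(nn.Module):
    """MSE between activations of a fixed pre-trained VGG-19 (the loss network)
    rather than between raw pixels."""
    def __init__(self, layer_index=16):   # first 16 layers of vgg19.features ~ up to relu3_3
        super().__init__()
        # Older torchvision versions use models.vgg19(pretrained=True) instead.
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features
        self.extractor = nn.Sequential(*list(vgg.children())[:layer_index]).eval()
        for p in self.extractor.parameters():
            p.requires_grad = False        # the loss network stays frozen
        self.mse = nn.MSELoss()

    def forward(self, sr, hr):
        # sr, hr: (N, 3, H, W) batches normalized the way the VGG weights expect.
        return self.mse(self.extractor(sr), self.extractor(hr))

# perceptual = VGGFeatureLoss()
# total = pixel_loss + lam * perceptual(sr_batch, hr_batch)   # lam is a tunable weight
```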

    • Despite the success of deep learning for SISR tasks, there are open research questions regarding SISR model design as discussed below:

      1) Need for light structure models: Although deeper is generally better, most recent SISR models contain no more than a hundred layers due to the overfitting problem. This is because SISR models operate at the pixel level, which requires many more parameters than image classification. As models get deeper, vanishing gradients also become more challenging. This suggests a preference for light-weight models with fewer parameters and less computation.

      2) Adapt well to unknown degradation: Most algorithms rely heavily on the predetermined assumption that LR images are simply down-sampled from HR images. They are unsuccessful in recovering SR images at large scale factors due to the lack of learnable features in LR images. If noise is present, reconstruction accuracy deteriorates further because the problem becomes even more ill-posed. A feasible way to deal with unknown degradation is to use transfer learning or a very large number of training examples; however, there has been little research on this task, so it needs to be investigated further.

      3) Requirement for different assessment criteria: No method can achieve low distortion and good perceptual quality at the same time. Traditional measurements such as the L1/L2 losses help to generate images with low distortion, but there is still considerable disagreement with human perception. In contrast, integrating perceptual assessment produces more realistic images, but these suffer from low PSNR. Therefore, it is necessary to develop additional assessment criteria for particular applications.

      4) Efficiently interpret and exploit prior knowledge to reduce ill-posed problems: Until recently, deep architectures have appeared as black boxes, and we have limited knowledge of why and how they work. Meanwhile, most SISR algorithms have introduced different structures or connections based on experiments, without explaining why the results improve. Another important approach to ill-posed problems is to combine different constraints as regularizers for prediction, for example by combining different loss functions, or by using image segmentation information to constrain reconstructed images. That is why a semantic categorical prior[54] was introduced, attempting to achieve richer and more realistic textures. Simple ways to use more prior knowledge are to use MLE as a proxy to incorporate it as a conditional probability, or to feed it directly into the network whilst forcing parameter sharing for all kinds of inputs.

    • This survey has reviewed key papers in single image super-resolution that underlie example-based learning methods. Among these, we noticed that deep learning based methods have recently achieved state-of-the-art performance. Before going into the details of each algorithm, the general background of each category was introduced. We have highlighted the important contributions of these algorithms, discussed their pros and cons, and suggested possible future work either within categories or in designated sections. Up to now, we cannot say which SISR algorithm is the state of the art, as this is highly dependent on the application. For instance, an algorithm that is good for medical imaging or face processing purposes is not necessarily effective for remote sensing images. The different constraints imposed by a problem indicate a need to generate benchmark databases that reflect the concerns of applications in different fields. Finally, there are outstanding challenges in exploiting these algorithms in practical applications, since they have mainly been applied to standard benchmark datasets and adapt poorly to different scenarios. This survey has sought to enhance the understanding of deep learning based algorithms for single image super-resolution; it can serve as a comprehensive guide for beginners and raises many questions in need of further investigation.

    • The authors would like to acknowledge the support from the Shanxi Hundred People Plan of China and colleagues from the Image Processing Group at Strathclyde University (UK), Anhui University (China) and Taibah Valley (Taibah University, Saudi Arabia) for their valuable suggestions.
