STRNet: Triple-stream Spatiotemporal Relation Network for Action Recognition

Citation: Z. W. Xu, X. J. Wu, J. Kittler. STRNet: Triple-stream spatiotemporal relation network for action recognition. International Journal of Automation and Computing. DOI: 10.1007/s11633-021-1289-9.

## STRNet: Triple-stream Spatiotemporal Relation Network for Action Recognition

###### Author Bios

Zhi-Wei Xu received the B. Eng. degree in computer science and technology from Harbin Institute of Technology, China in 2017. He is a postgraduate student at the School of Artificial Intelligence and Computer Science, Jiangnan University, China. His research interests include computer vision, video understanding and action recognition. E-mail: zhiwei_xu@stu.jiangnan.edu.cn ORCID iD: 0000-0003-1472-431X

Xiao-Jun Wu received the B. Sc. degree in mathematics from Nanjing Normal University, China in 1991, and the M. Sc. and Ph. D. degrees in pattern recognition and intelligent systems from Nanjing University of Science and Technology, China in 1996 and 2002, respectively. He is currently a professor of artificial intelligence and pattern recognition at Jiangnan University, China. His research interests include pattern recognition, computer vision, fuzzy systems, neural networks and intelligent systems. E-mail: wu_xiaojun@jiangnan.edu.cn (Corresponding author) ORCID iD: 0000-0002-0310-5778

Josef Kittler received the B. A. degree in the electrical sciences tripos, the Ph. D. degree in pattern recognition, and the D. Sc. degree from the University of Cambridge, UK in 1971, 1974 and 1991, respectively. He is a Distinguished Professor of machine intelligence at the Centre for Vision, Speech and Signal Processing, University of Surrey, UK. He conducts research in biometrics, video and image database retrieval, medical image analysis, and cognitive vision. He published the textbook Pattern Recognition: A Statistical Approach and over 700 scientific papers, which have been cited more than 66000 times (Google Scholar). He is series editor of the Springer Lecture Notes on Computer Science, and currently serves on the editorial boards of Pattern Recognition Letters, Pattern Recognition and Artificial Intelligence, and Pattern Analysis and Applications. He was a member of the editorial board of IEEE Transactions on Pattern Analysis and Machine Intelligence during 1982−1985, served on the Governing Board of the International Association for Pattern Recognition (IAPR) as one of the two British representatives during 1982−2005, and was President of the IAPR during 1994−1996. E-mail: j.kittler@surrey.ac.uk ORCID iD: 0000-0002-8110-9205
Figure 1. Architecture overview of STRNet. Our STRNet consists of three individual branches that learn appearance, motion and temporal relation information, respectively. To represent the information of the whole video comprehensively, we apply two-stage fusion and separable (2+1)D convolution to reinforce feature learning. Finally, we apply decision-level weight assignment to adjust the classification scores.
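The decision-level weight assignment described in the caption above can be sketched as a weighted sum of the three branches' class scores. This is an illustrative sketch only: the branch weights, the use of softmax, and the function names are our assumptions, not the paper's exact scheme.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_decisions(logits_app, logits_mot, logits_rel, weights=(0.4, 0.3, 0.3)):
    """Decision-level fusion sketch: each branch (appearance, motion,
    temporal relation) emits class logits; we convert them to
    probabilities and combine with fixed weights (hypothetical values
    that sum to 1, so the fused scores remain a distribution)."""
    scores = [softmax(l) for l in (logits_app, logits_mot, logits_rel)]
    fused = sum(w * s for w, s in zip(weights, scores))
    return int(np.argmax(fused)), fused
```

In practice the branch weights would be tuned on a validation split; the key property of the sketch is that fusion happens on the per-branch class scores rather than on intermediate features.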

Figure 2. Feature visualization of STRNet. The first column shows the input frames, the second the feature maps of the stem, the third the fused feature maps of stage 3, and the last the spatiotemporal-with-relation feature maps output by stage 5. We rescale the feature maps to the original frame size for easier comparison.

Figure 3. The schema of the building-relation unit, where $X$ denotes the original input of sequential feature maps and $\tilde{X}$ denotes the calculated relation maps. The function $F_{sm}(\cdot)$ computes the similarity measure, $g$ denotes the similarity weight vector, and $Y$ denotes the final relation response maps.
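The relation unit in the caption above can be sketched as follows. Everything concrete here is an assumption for illustration: we stand in for $F_{sm}(\cdot)$ with cosine similarity between consecutive frames, form $\tilde{X}$ as frame-to-frame differences, and produce $Y$ by re-weighting those differences with the softmax-normalized weight vector $g$.

```python
import numpy as np

def relation_unit(X, eps=1e-8):
    """Sketch of the relation unit over T sequential feature vectors
    X of shape (T, C) (flattened feature maps).

    X_tilde : hypothetical relation maps, here consecutive-frame differences.
    sim     : stand-in for F_sm, cosine similarity of consecutive frames.
    g       : similarity weight vector, softmax-normalized over time.
    Y       : final relation response maps, g-weighted relation maps.
    """
    X_tilde = X[1:] - X[:-1]                                  # (T-1, C)
    norms = np.linalg.norm(X, axis=1) + eps                   # (T,)
    sim = np.sum(X[1:] * X[:-1], axis=1) / (norms[1:] * norms[:-1])
    e = np.exp(sim - sim.max())
    g = e / e.sum()                                           # (T-1,)
    Y = g[:, None] * X_tilde                                  # (T-1, C)
    return X_tilde, g, Y
```

The design point the sketch captures is that the similarity scores gate the relation maps, so frame pairs judged more related contribute more to the final response.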


##### Publication History
• Received: 2020-10-30
• Accepted: 2021-02-05
• Published online: 2021-03-23
