Zun-Ran Wang, Chen-Guang Yang, Shi-Lu Dai. A Fast Compression Framework Based on 3D Point Cloud Data for Telepresence. International Journal of Automation and Computing. doi: 10.1007/s11633-020-1240-5

A Fast Compression Framework Based on 3D Point Cloud Data for Telepresence

Author Biography:
  • Zun-Ran Wang received the B. Eng. degree in automation from the South China University of Technology, China in 2017. He is currently an M. Sc. degree candidate at the South China University of Technology, China. His research interests include human-robot interaction, intelligent control and image processing. E-mail: zran.wang@qq.com ORCID iD: 0000-0002-0826-9376

    Chen-Guang Yang received the B. Eng. degree in measurement and control from Northwestern Polytechnical University, China in 2005, and the Ph. D. degree in control engineering from the National University of Singapore, Singapore in 2010. He received Best Paper Awards from IEEE Transactions on Robotics and over 10 international conferences. His research interests include robotics and automation. E-mail: cyang@ieee.org (Corresponding author) ORCID iD: 0000-0001-5255-5559

    Shi-Lu Dai received the B. Eng. degree in thermal engineering, and the M. Eng. and Ph. D. degrees in control science and engineering from Northeastern University, China in 2002, 2006 and 2010, respectively. He was a visiting student in the Department of Electrical and Computer Engineering, National University of Singapore, Singapore from November 2007 to November 2009, and a visiting scholar at the Department of Electrical Engineering, University of Notre Dame, USA from October 2015 to October 2016. Since 2010, he has been with the School of Automation Science and Engineering, South China University of Technology, China, where he is currently a professor. His research interests include adaptive and learning control, and distributed cooperative systems. E-mail: audaisl@scut.edu.cn

  • Received: 2020-05-11
  • Accepted: 2020-06-05
  • Published Online: 2020-07-31
  • [1] C. G. Yang, Y. H. Ye, X. Y. Li, R. W. Wang.  Development of a neuro-feedback game based on motor imagery EEG[J]. Multimedia Tools and Applications, 2018, 77(12): 15929-15949. doi: 10.1007/s11042-017-5168-x
    [2] F. Nagata, K. Watanabe, M. K. Habib.  Machining robot with vibrational motion and 3D printer-like data interface[J]. International Journal of Automation and Computing, 2018, 15(1): 1-12. doi: 10.1007/s11633-017-1101-z
    [3] C. G. Yang, H. W. Wu, Z. J. Li, W. He, N. Wang, C. Y. Su.  Mind control of a robotic arm with visual fusion technology[J]. IEEE Transactions on Industrial Informatics, 2018, 14(9): 3822-3830. doi: 10.1109/TII.2017.2785415
    [4] X. Y. Wang, C. G. Yang, Z. J. Ju, H. B. Ma, M. Y. Fu.  Robot manipulator self-identification for surrounding obstacle detection[J]. Multimedia Tools and Applications, 2017, 76(5): 6495-6520. doi: 10.1007/s11042-016-3275-8
    [5] J. H. Zhang, M. Li, Y. Feng, C. G. Yang.  Robotic grasp detection based on image processing and random forest[J]. Multimedia Tools and Applications, 2020, 79(3–4): 2427-2446. doi: 10.1007/s11042-019-08302-9
    [6] H. Y. Chen, H. L. Huang, Y. Qin, Y. J. Li, Y. H. Liu.  Vision and laser fused slam in indoor environments with multi-robot system[J]. Assembly Automation, 2019, 39(2): 297-307. doi: 10.1108/AA-04-2018-065
    [7] Y. Yang, F. Qiu, H. Li, L. Zhang, M. L. Wang, M. Y. Fu.  Large-scale 3D semantic mapping using stereo vision[J]. International Journal of Automation and Computing, 2018, 15(2): 194-206. doi: 10.1007/s11633-018-1118-y
    [8] J. Oyekan, A. Fischer, W. Hutabarat, C. Turner, A. Tiwari. Utilising low cost RGB-D cameras to track the real time progress of a manual assembly sequence. Assembly Automation, to be published. DOI: 10.1108/AA-06-2018-078.
    [9] G. L. Wang, X. T. Hua, J. Xu, L. B. Song, K. Chen.  A deep learning based automatic surface segmentation algorithm for painting large-size aircraft with 6-DOF robot[J]. Assembly Automation, 2019, 40(2): 199-210. doi: 10.1108/AA-03-2019-0037
    [10] J. W. Li, W. Gao, Y. H. Wu.  Elaborate scene reconstruction with a consumer depth camera[J]. International Journal of Automation and Computing, 2018, 15(4): 443-453. doi: 10.1007/s11633-018-1114-2
    [11] C. G. Yang, Z. R. Wang, W. He, Z. J. Li.  Development of a fast transmission method for 3D point cloud[J]. Multimedia Tools and Applications, 2018, 77(19): 25369-25387. doi: 10.1007/s11042-018-5789-8
    [12] S. M. Prakhya, B. B. Liu, W. S. Lin, V. Jakhetiya, S. C. Guntuku.  B-shot: A binary 3D feature descriptor for fast Keypoint matching on 3D point clouds[J]. Autonomous Robots, 2017, 41(7): 1501-1520. doi: 10.1007/s10514-016-9612-y
    [13] J. H. Hou, L. P. Chau, N. Magnenat-Thalmann, Y. He.  Compressing 3-D human motions via Keyframe-based geometry videos[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2015, 25(1): 51-62. doi: 10.1109/TCSVT.2014.2329376
    [14] J. Wingbermuehle.  Towards automatic creation of realistic anthropomorphic models for realtime 3D telecommunication[J]. Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, 1998, 20(1-2): 81-96. doi: 10.1023/A:1008018307114
    [15] J. H. Zhang, C. B. Owen.  Octree-based animated geometry compression[J]. Computers & Graphics, 2007, 31(3): 463-479. doi: 10.1016/j.cag.2006.12.002
    [16] Q. H. Yu, W. Yu, J. H. Zheng, X. Z. Zheng, Y. He, Y. C. Rong.  A high-throughput and low-complexity decoding scheme based on logarithmic domain[J]. Journal of Signal Processing Systems, 2017, 88(3): 245-257. doi: 10.1007/s11265-016-1143-4
    [17] C. Loop, C. Zhang, Z. Y. Zhang. Real-time high-resolution sparse voxelization with application to image-based modeling. In Proceedings of the 5th High-performance Graphics Conference, ACM, New York, USA, pp. 73-79, 2013. DOI: 10.1145/2492045.2492053.
    [18] R. Schnabel, R. Klein. Octree-based point-cloud compression. In Proceedings of the 3rd Symposium on Point-based Graphics, ACM, Boston, USA, pp. 111–121, 2006.
    [19] J. Kammerl, N. Blodow, R. B. Rusu, S. Gedikli, M. Beetz, E. Steinbach. Real-time compression of point cloud streams. In Proceedings of IEEE International Conference on Robotics and Automation, IEEE, Saint Paul, USA, pp. 778–785, 2012.
    [20] C. Zhang, D. Florêncio, C. Loop. Point cloud attribute compression with graph transform. In Proceedings of IEEE International Conference on Image Processing, IEEE, Paris, France, pp. 2066–2070, 2014. DOI: 10.1109/ICIP.2014.7025414.
    [21] D. Sedlacek, J. Zara. Graph cut based point-cloud segmentation for polygonal reconstruction. In Proceedings of the 5th International Symposium on Visual Computing, Springer, Las Vegas, USA, pp. 218–227, 2009. DOI: 10.1007/978-3-642-10520-3_20.
    [22] L. Landrieu, C. Mallet, M. Weinmann. Comparison of belief propagation and graph-cut approaches for contextual classification of 3D lidar point cloud data. In Proceedings of IEEE International Geoscience and Remote Sensing Symposium, IEEE, Fort Worth, USA, pp. 2768–2771, 2017. DOI: 10.1109/IGARSS.2017.8127571.
    [23] X. M. Zhang, W. G. Wan, X. D. An.  Clustering and DCT based color point cloud compression[J]. Journal of Signal Processing Systems, 2017, 86(1): 41-49. doi: 10.1007/s11265-015-1095-0
    [24] J. Euh, J. Chittamuru, W. Burleson.  Power-aware 3D computer graphics rendering[J]. Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology, 2005, 39(1-2): 15-33. doi: 10.1023/B:VLSI.0000047269.03965.e9
    [25] D. Thanou, P. A. Chou, P. Frossard.  Graph-based compression of dynamic 3D point cloud sequences[J]. IEEE Transactions on Image Processing, 2016, 25(4): 1765-1778. doi: 10.1109/TIP.2016.2529506
    [26] Y. T. Shao, Q. Zhang, G. Li, Z. Li, L. Li. Hybrid point cloud attribute compression using slice-based layered structure and block-based intra prediction. In Proceedings of the 26th ACM Multimedia Conference on Multimedia Conference, ACM, Istanbul, Turkey, pp. 1199–1207, 2018.
    [27] P. M. Djuric, J. H. Kotecha, J. Zhang, Y. F. Huang, T. Ghirmai, M. F. Bugallo, J. Miguez.  Particle filtering[J]. IEEE Signal Processing Magazine, 2003, 20(5): 19-38. doi: 10.1109/MSP.2003.1236770
    [28] Z. Chen.  Bayesian filtering: From Kalman filters to particle filters, and beyond[J]. Statistics: A Journal of Theoretical and Applied Statistics, 2003, 182(1): 1-69.
    [29] J. S. Liu, R. Chen, T. Logvinenko. A theoretical framework for sequential importance sampling with resampling. Sequential Monte Carlo Methods in Practice, A. Doucet, N, de Freitas, N. Gordon, Eds., New York, USA: Springer, pp. 225–246, 2001. DOI: 10.1007/978-1-4757-3437-9_11.
    [30] X. H. Liu, S. Payandeh. Implementation of levels-of-detail in bayesian tracking framework using single RGB-D sensor. In Proceedings of the 7th IEEE Annual Information Technology, Electronics and Mobile Communication Conference, IEEE, Vancouver, Canada, 2016. DOI: 10.1109/IEMCON.2016.7746290.
    [31] S. Salti, F. Tombari, L. Di Stefano.  SHOT: Unique signatures of histograms for surface and texture description[J]. Computer Vision and Image Understanding, 2014, 125: 251-264. doi: 10.1016/j.cviu.2014.04.011
    [32] A. Frome, D. Huber, R. Kolluri, T. Bülow, J. Malik. Recognizing objects in range data using regional point descriptors. In Proceedings of the 8th European Conference on Computer Vision, Springer, Prague, Czech Republic, pp. 224–237, 2004. DOI: 10.1007/978-3-540-24672-5_18.
    [33] F. Tombari, S. Salti, L. Di Stefano. Unique signatures of histograms for local surface description. In Proceedings of the 11th European Conference on Computer Vision, Springer, Heraklion, Greece, pp. 356–369, 2010. DOI: 10.1007/978-3-642-15558-1_26.
    [34] R. Mekuria, K. Blom, P. Cesar.  Design, implementation, and evaluation of a point cloud codec for tele-immersive video[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2017, 27(4): 828-842. doi: 10.1109/TCSVT.2016.2543039


Abstract: In this paper, a novel compression framework based on 3D point cloud data is proposed for telepresence, which consists of two parts. One part removes the spatial redundancy, i.e., a robust Bayesian framework is designed to track the human motion and the 3D point cloud data of the human body is acquired by using the tracking 2D box. The other part removes the temporal redundancy of the 3D point cloud data. The temporal redundancy between point clouds is removed by using motion vectors, i.e., the most similar cluster in the previous frame is found for each cluster in the current frame by comparing cluster features, and the cluster in the current frame is replaced by a motion vector to compress the current frame. First, the B-SHOT (binary signatures of histograms of orientations) descriptor is applied to represent the point feature for matching corresponding points between two frames. Second, the K-means algorithm is used to generate clusters because there are many unsuccessfully matched points in the current frame. The matching operation is exploited to find the corresponding clusters between the point cloud data of two frames. Finally, the cluster information in the current frame is replaced by the motion vector to compress the current frame, and the unsuccessfully matched clusters in the current frame together with the motion vectors are transmitted to the remote end. In order to reduce the calculation time of the B-SHOT descriptor, we introduce an octree structure into the B-SHOT descriptor. In particular, in order to improve the robustness of the matching operation, we design a cluster feature to estimate the similarity between two clusters. Experimental results show the better performance of the proposed method due to its lower calculation time and higher compression ratio. The proposed method achieves a compression ratio of 8.42 and a delay time of 1228 ms, compared with a compression ratio of 5.99 and a delay time of 2163 ms for the octree-based compression method under conditions of a similar distortion rate.

    • With wide applications of 3D images in games[1-3], telepresence[4-6], environmental perception[7-9], scene reconstruction[10], etc., the 3D image of real-world scenes has been widely researched for decades. In these applications, the 3D image must be transmitted to the clients. There are many challenges in transmitting and storing 3D image data, such as the very large amount of data of a 3D image, the different numbers of points between two 3D point cloud frames, the lack of an explicit correspondence between different 3D point cloud frames, and the non-uniform distribution of points in space. Thus, compressing 3D images has been an active research topic. However, state-of-the-art transmission methods depend on the support of large servers, which may result in a high cost. Therefore, a novel compression framework using the temporal and spatial redundancy is proposed in this paper. The octree structure is employed to voxelize the point cloud data for a uniform distribution and a lower calculation time. Motion estimation and clustering are applied to solve the problem of the different numbers of points between different 3D point cloud frames. In particular, we design the cluster feature to estimate the similarity between two clusters, which addresses the lack of an explicit correspondence between different 3D point cloud frames.

      In this paper, a telepresence system is built. The 3D image data of the remote operator's body is transmitted to the client and displayed on the moving robot when the customers interact with the remote operator. In the meantime, the commands from the customers are transmitted to the remote operator through the Internet. In order to enhance the customers' sense of immersion during the interaction, the remote operator's body is displayed on the moving robot in the form of a 3D image.

      Building on our previous work[11], a novel compression framework based on the temporal and spatial redundancy of 3D point cloud data is proposed. As shown in Fig. 1, the 3D point cloud data and the 2D RGB image are acquired by using the Kinect sensor. The framework consists of two parts. The first part removes the spatial redundancy: the 3D point cloud data and the 2D RGB image are fed into it, and its output is the 3D point cloud data of the remote operator's body without outliers. The second part removes the temporal redundancy of the 3D point cloud data: its input is the output of the first part and its output is the compressed point cloud data.

      Figure 1.  The proposed framework. We send three P frames between two I frames.

      In the first part, in order to remove the spatial redundancy, the 2D RGB image is first fed into a robust Bayesian framework, which tracks the remote operator's body and outputs a tracking 2D bounding box. Second, the tracking 2D box and the camera calibration parameters are exploited to convert the 2D bounding box into a 3D bounding box. Then, the 3D bounding box is used to extract the 3D point cloud data of the remote operator's body from the point cloud of the whole scene. Because there are some outliers in the 3D point cloud data of the human body, a statistical filter is applied to remove them. The statistical filter takes the 3D point cloud data of the remote operator's body as input; if the mean distance between a point and its nearest neighbors is larger than a threshold, the point is treated as an outlier. Finally, this part outputs the 3D point cloud data of the remote operator's body without outliers.

      In the second part, to further compress the 3D point cloud data of the remote operator's body, a number of predicted frames (P frames) are inserted between two I frames. The intra frame (I frame), of relatively large size, is treated as a 3D point cloud including all the visual information, while the P frame, of relatively small size, only reflects the small change of the current frame compared with the previous frame. The P frame is acquired as follows: First, the B-SHOT descriptor is computed as the point feature. The eigenvalue decomposition technique is applied to obtain the reference frame of the key points, the features of the key points are acquired by constructing the histogram over the reference frames, and the features are binarized to reduce the calculation time. Second, point-to-point correspondences between different point cloud frames are found by comparing the features of two key points. Then, a clustering operation on the key points is applied to cluster the points that cannot be matched. Third, we design the cluster feature to estimate the similarity between two clusters, and the motion vectors are acquired by calculating the transformation vector between two similar clusters. In particular, the octree structure is utilized to voxelize the 3D point cloud before computing the B-SHOT descriptor, since the octree structure retains the original distribution of the point cloud, makes the distribution more uniform and decreases the density of the points. Finally, the P frame (including the unsuccessfully matched clusters in the current frame and the motion vectors) is transmitted to the remote end. At the remote end, the information of the I frame and of the P frame is utilized to predict the new frame.

      The main contributions of this paper are summarized as follows:

      1) A new compression framework based on spatial redundancy (through the Bayesian framework) and temporal redundancy (through the motion estimation) is proposed for the telepresence.

      2) Inspired by [12], the B-SHOT descriptor is implemented to acquire the point feature. In order to further reduce the calculation time of the B-SHOT descriptor, we apply the octree structure to improve the B-SHOT descriptor, i.e., before acquiring the point feature, each frame is voxelized through using the octree structure.

      3) In order to improve the robustness of the matching operation, we design the cluster feature to estimate the similarity between two clusters.

      The structure of this paper is organized as follows. First, the existing technologies that study the problem of compression of the 3D image are reviewed in Section 2. In Section 3, the theories of the compression based on the spatial redundancy for 3D point cloud data are described while Section 4 presents the theories of compression based on the temporal correlation information for 3D point cloud data. The experimental evaluation is explained in Section 5 and Section 6 sums up our work.

    • Research on compressing 3D point cloud data has attracted the attention of the media industry in recent years, because 3D point cloud data can be captured in real time. Compressing the geometry of the mesh has received the most attention[13, 14]. In these methods, a key-frame strategy based on geometry videos is mainly used to compress 3D human motion data, e.g., Hou et al.[13] extracted the key frames and produced a reconstruction matrix. The key frames were compressed by using a video compression technique (e.g., H.264/advanced video coding), so that the spatial and temporal redundancy were significantly removed. As another way of compressing the geometry of the 3D mesh, an octree-based motion representation method was proposed to compress animated geometric data in [15]. First, the motion vectors were generated by finding correspondence relations between two consecutive frames. Then, the vertex positions for all frames were compressed by using the motion vectors. Finally, the motion vectors were exploited to predict the vertex positions. A decoder based on logarithmic binary arithmetic coding was proposed to rely only on additions and shift operations, which makes the decoding of multiple systems easy to implement. However, the compression rate of that decoder was quite low[16], and it was more computationally efficient at compressing the point cloud than the geometry of the mesh[17].

      In order to compress 3D point cloud data directly, a method based on an octree decomposition of the space was proposed in [18]. In that work, a novel prediction technique was employed to achieve higher compression rates than previous lossless algorithms. The point cloud data was encoded in terms of the occupied octree cells, and the points in the occupied octree cells were replaced by the cell centers of the octree's leaves. Moreover, the progressive decompression performance was comparable to that of previous lossy methods. However, our proposed method costs less CPU time (or processing time) than the method based on an octree decomposition of the space. Some applications of compressing 3D images based on the octree method can be found in [19, 20].

      In recent years, graph theory has been heavily employed for point cloud data due to its structure[21-24]. A graph on small neighborhoods of the point cloud is constructed by connecting nearby points, the attributes of the point cloud are regarded as signals over the graph, and the correlations of the signal are removed by using the graph transform (the Karhunen-Loève transform on such graphs)[20]. Thanou et al.[25] addressed the problem of compressing 3D point cloud data by introducing motion estimation. First, new spectral graph wavelet descriptors were acquired by the graph transform. Second, the spectral graph wavelet descriptors were used to match points between two different frames and the motion vectors were estimated on the vertices by using the matching results. Then, a dense motion field was interpolated by solving a graph-based regularization problem. Thus, the problem of estimating the motion vectors was converted into the problem of matching successive graphs. Finally, the temporal redundancy of the 3D coordinates in the predictive coding was removed by the estimated motion vectors. A point cloud compression scheme based on an original layered data structure and the graph Fourier transform was proposed in [26]. First, a four-layer structure was built by applying a slice-partition scheme and the geometry-adaptive k-dimensional-tree method. Then, the spatial correlation between adjacent points was exploited to predict new frames by employing a block-based intra prediction framework. Unfortunately, the graph transform is hard to use in real-time applications because of the repeated eigen-decompositions of the graph Laplacians. Thus, the approach based on the graph transform is infeasible for real-time processing. Therefore, a novel compression framework based on the temporal correlation and the spatial redundancy is proposed for telepresence in this paper.

    • As shown in Table 1, the particle filter is exploited to track the remote operator's body within a 2D bounding box. First, we predefine a 2D bounding box that follows the movement dynamic model of the human motion; the state of the movement dynamic model is the tuple

      $ \{x,y,s,x_p,y_p,s_p,x_0,y_0,w,h,hg\} $

      where $ (x,y) $, $ (x_0,y_0) $ and $ (x_p,y_p) $ are the centers of the 2D bounding box at times $ t_n $, $ t_0 $ and $ t_{n-1} $, respectively. $ s $ and $ s_p $ represent the scale of the 2D bounding box at times $ t_n $ and $ t_{n-1} $, respectively. $ w $ and $ h $ are the width and height of the 2D bounding box at time $ t_n $, and $ hg $ is the color histogram of the 2D bounding box at time $ t_n $.

      1) Prediction: draw particles
      $ x_\iota^i\sim p(x_\iota|x_{\iota-1}) $
      2) Transition: apply the motion model
      $ \Lambda_\iota=F_\iota \Lambda_{\iota-1}+W_\iota $
      3) Update: each particle's weight is updated and normalized
      $ w_\iota^i=w_{\iota-1}^ip(y_\iota|x_\iota^i) $
      $ \tilde{w_\iota}(X_\iota^i)=\frac{w_\iota(X_\iota^i)}{\sum_{i=1}^{N_p}w_\iota(X_\iota^i)} $
      4) Output state:
      $ \bar{x_\iota}=\sum_{i=1}^{N_p}w_\iota^ix_\iota^i $
      5) Resampling

      Table 1.  Pseudo-code of the particle filter
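
      As a concrete illustration of Table 1, the sketch below implements one predict-update-resample cycle in Python with a constant-velocity motion model. The state layout [x, y, vx, vy], the Gaussian process noise and the multinomial resampling are illustrative assumptions rather than the exact choices made in the paper.

```python
import numpy as np

def particle_filter_step(particles, weights, observation_likelihood, dt=1.0, noise_std=2.0):
    """One predict-update-resample cycle of Table 1 (illustrative sketch).

    particles: (N, 4) array of states [x, y, vx, vy] (assumed layout).
    weights:   (N,) normalized importance weights.
    observation_likelihood: callable returning p(y_t | x_t^i) for one state,
                            e.g., a color-histogram similarity (assumption).
    """
    N = len(particles)

    # 1)-2) Prediction / transition: Lambda_t = F * Lambda_{t-1} + W (constant velocity)
    F = np.array([[1.0, 0.0, dt, 0.0],
                  [0.0, 1.0, 0.0, dt],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    particles = particles @ F.T + np.random.normal(0.0, noise_std, particles.shape)

    # 3) Update: w_t^i = w_{t-1}^i * p(y_t | x_t^i), then normalize
    weights = weights * np.array([observation_likelihood(p) for p in particles])
    weights = weights / weights.sum()

    # 4) Output state: weighted mean of the particles
    state_estimate = weights @ particles

    # 5) Resampling (multinomial): keep high-weight particles, drop negligible ones
    idx = np.random.choice(N, size=N, p=weights)
    return particles[idx], np.full(N, 1.0 / N), state_estimate
```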

      Second, the particle filter is implemented to update the parameters of the movement dynamic model. Then, the tracking 2D bounding box and the camera calibration parameters are exploited to convert the 2D bounding box into a 3D bounding box, and the 3D bounding box is used to extract the 3D point cloud data of the remote operator's body from the point cloud of the whole scene. Furthermore, the statistical filter is implemented to remove the sparse outliers.

    • From the Markov property, only the current state $ x_\kappa $ of an object affects its next state $ x_{\kappa+1} $:

      $ p(x_{\kappa+1}|x_\kappa,x_{\kappa-1},\cdots,x_1) = p(x_{\kappa+1}|x_\kappa) $

      (1)
    • For a random process, the joint probability distribution can be marginalized as[27]

      $ p(x_1,\cdots,x_{k-1}) = \int_{-\infty}^{\infty}p(x_1,\cdots,x_{k}){\rm{d}}x_k $

      (2)

      Exploiting the Markov property then yields the Chapman-Kolmogorov equation[27]:

      $ p(x_3|x_1) = \int_{-\infty}^{\infty}p(x_3|x_2)p(x_2|x_1){\rm{d}}x_2 . $

      (3)
    • The particle filter is based on the Monte Carlo method, because the Monte Carlo method can approximate the true probability distribution through repeated random sampling[27]:

      $ \begin{split} E[f(x)] =\; & \int f(x)p(x){\rm{d}}x \approx E_K(f(x))=\frac{1}{K}\sum\limits_{i = 1}^K f(x_i) \end{split} .$

      (4)
    • Assume that the state function and observation function are as follows:

      $ x_\iota = F_\iota(\iota,x_{\iota-1},u_\iota,\Lambda_\iota) $

      (5)

      $ y_\iota = H_\iota(\iota,x_{\iota},u_\iota,\Psi_\iota) $

      (6)

      where $ \Lambda_\iota $ represents the noise of the state system and $ \Psi_\iota $ is the noise of the observation system. $ u_\iota $ is the input of the system, $ x_\iota $ is the state of the system at time $ \iota $, and $ y_\iota $ is the observation of the system at time $ \iota $. In particular, $X_\iota = \{x_1, x_2, \cdots, x_\iota\}$ and $Y_\iota = \{y_1, y_2, \cdots, y_\iota\}$.

      In order to obtain the posterior probability $ p(x_\iota|Y_\iota) $, according to the Bayesian formula, the derivation process is as follows[28]:

      $ \begin{split} p(x_\iota|Y_{\iota-1}) =\;& \int p(x_\iota,x_{\iota-1}|Y_{\iota-1}) {\rm{d}}x_{\iota-1}= \\ &\int p(x_\iota|x_{\iota-1},Y_{\iota-1}) p(x_{\iota-1}|Y_{\iota-1}) {\rm{d}}x_{\iota-1} .\end{split} $

      (7)

      Using the property of Markov processes[28],

      $ p(x_\iota|Y_{\iota-1}) = \int p(x_\iota|x_{\iota-1})p(x_{\iota-1}|Y_{\iota-1}) {\rm{d}}x_{\iota-1} .$

      (8)

      Then, the posterior probability $ p(x_\iota|Y_\iota) $ is represented as follows[28]:

      $ \begin{split} p(x_\iota|Y_{\iota}) =\; & \frac{p(y_\iota|x_{\iota},Y_{\iota-1})p(x_\iota|Y_{\iota-1})}{p(y_\iota|Y_{\iota-1})}= \\ & \frac{p(y_\iota|x_{\iota})p(x_\iota|Y_{\iota-1})}{p(y_\iota|Y_{\iota-1})} \end{split} $

      (9)

      where $ p(y_\iota|Y_{\iota-1}) $ represents the normalization constant and is given by[28]:

      $ p(y_\iota|Y_{\iota-1}) = \int p(y_\iota|x_{\iota})p(x_\iota|Y_{\iota-1}) {\rm {d}}x_{\iota} .$

      (10)

      In the end, the real probability distribution is acquired through exploiting the Monte Carlo method, as follows[28].

      $ \begin{split} E[f(x_\iota)|Y_\iota] =\; & \int f(x_\iota)p(x_\iota|Y_{\iota}){\rm{d}}x_{\iota} \approx \\ &\int f(x_\iota)\hat{p}{(x_\iota|Y_{\iota})}{\rm{d}}x_{\iota}=\\ & \frac{1}{N}\sum\limits_{i = 1}^{N} \int f(x_\iota)\delta (x_\iota-x_\iota^{i}){\rm{d}}x_{\iota}= \\ & \frac{1}{N}\sum\limits_{i = 1}^{N} f(x_\iota^i) \end{split} $

      (11)

      where $ \hat{p}{(x_\iota|Y_{\iota})} $ is the estimated value of the posterior probability.

    • However, in practice, it is very difficult to sample from the posterior probability directly. Thus, for easy sampling, a reference distribution is introduced as the importance probability density function, and the reference distribution is used to sample the posterior probability[27].

      $ \begin{split} E[f(x_\iota)|Y_\iota] =\; & \int f(x_\iota)p(x_\iota|Y_\iota){\rm{d}}x_\iota= \\ & \int \frac{f(x_\iota)p(x_\iota|Y_\iota)r(x_\iota|Y_\iota)}{r(x_\iota|Y_\iota)}{\rm{d}}x_\iota \end{split} $

      (12)

      where $ r(x_\iota|Y_\iota) $ represents the reference distribution.

      According to the Bayesian equation,

      $ p(x_\iota|Y_\iota) = \frac{p(Y_\iota|x_\iota)p(x_\iota)}{p(Y_\iota)} $

      (13)

      $ \begin{split} E[f(x_\iota)|Y_\iota] =\;& \int \frac{f(x_\iota)p(x_\iota|Y_\iota)r(x_\iota|Y_\iota)}{r(x_\iota|Y_\iota)}{\rm{d}}x_\iota=\\ & \int \frac{f(x_\iota)w_i(x_\iota)r(x_\iota|Y_\iota)}{p(Y_\iota)}{\rm{d}}x_\iota = \\ & \frac{\int f(x_\iota)w_i(x_\iota)r(x_\iota|Y_\iota){\rm{d}}x_\iota}{p(Y_\iota)} \end{split} $

      (14)

      where the unnormalized weight is expressed as follows:

      $ w_i(x_\iota) = \frac{p(Y_\iota|x_\iota)p(x_\iota)}{r(x_\iota|Y_\iota)} .$

      (15)

      $ p(Y_\iota) $ can be represented as follows:

      $ \begin{split} p(Y_\iota) =\;& \int p(Y_\iota,x_\iota){\rm{d}}x_\iota=\\ &\int \frac{p(Y_\iota|x_\iota)p(x_\iota)r(x_\iota|Y_\iota)}{r(x_\iota|Y_\iota)}{\rm{d}}x_\iota=\\ & \int w_i(x_\iota)r(x_\iota|Y_\iota){\rm{d}}x_\iota .\end{split} $

      (16)

      Thus, the $ E[f(x_\iota)] $ is

      $ E[f(x_\iota)|Y_\iota] = \frac{\int f(x_\iota)w_i(x_\iota)r(x_\iota|Y_\iota){\rm{d}}x_\iota}{\int w_i(x_\iota)r(x_\iota|Y_\iota){\rm{d}}x_\iota} .$

      (17)

      According to the Monte Carlo method,

      $\begin{aligned} \bar{E}[f(x_\iota)|Y_\iota] = \frac{\dfrac{1}{N_p}\sum_{i = 1}^{N_p}f(x_\iota^i)w_i(x_\iota^i)}{\dfrac{1}{N_p}\sum_{i = 1}^{N_p}w_i(x_\iota^i)} = \sum\limits_{i = 1}^{N_p}f(x_\iota^i)\tilde{w_i}(x_\iota^i)\\[-4pt]\end{aligned}$

      (18)

      where $ \tilde{w_i}(x_\iota^i) = \frac{w_i(x_\iota^i)}{\sum_{i = 1}^{N_p}w_i(x_\iota^i)} $ is the i-th normalized weight, and $ x_\iota^i $ is drawn from $ r(x_\iota|Y_\iota) $ and represents the i-th particle at time $ \iota $.

    • In practice, the particle filter algorithm is implemented on a computer. Therefore, in order to reduce the computational time, sequential importance sampling is introduced. Assume that the importance probability density function can be factorized as in (19)[29].

      $ \begin{split} r(X_\iota|Y_\iota) =\;& r(x_\iota,X_{\iota-1}|Y_\iota)=r(x_\iota|X_{\iota-1},Y_\iota)r(X_{\iota-1}|Y_{\iota-1}) .\end{split} $

      (19)

      The weight can then be represented by (20).

      $ \begin{split} w_\iota =\;& \frac{p(X_\iota|Y_\iota)p(Y_\iota)}{r(x_\iota|X_{\iota-1},Y_\iota)r(X_{\iota-1}|Y_{\iota-1})}\propto \\ & \frac{p(X_\iota|Y_\iota)}{r(x_\iota|X_{\iota-1},Y_\iota)r(X_{\iota-1}|Y_{\iota-1})} \end{split} $

      (20)

      where $ p(X_\iota|Y_\iota) $ can be given by

      $ \begin{split} p(X_\iota|Y_\iota) =\;& \frac{p(y_\iota|X_\iota,Y_{\iota-1})p(X_\iota|Y_{\iota-1})}{p(y_\iota|Y_{\iota-1})}=\\ & \frac{p(y_\iota|x_\iota)p(x_\iota|x_{\iota-1})p(X_{\iota-1}|Y_{\iota-1})}{p(y_\iota|Y_{\iota-1})} .\end{split} $

      (21)

      Finally, the weight can be represented by (22).

      $ \begin{split} w_\iota \propto\;& \frac{p(y_\iota|x_\iota)p(x_\iota|x_{\iota-1})p(X_{\iota-1}|Y_{\iota-1})}{r(x_\iota|X_{\iota-1},Y_\iota)r(X_{\iota-1}|Y_{\iota-1})}=\\ & w_{\iota-1}\frac{p(y_\iota|x_\iota)p(x_\iota|x_{\iota-1})}{r(x_\iota|X_{\iota-1},Y_\iota)} .\end{split} $

      (22)

      For the i-th particle at time $ \iota $,

      $ w_\iota^i = w_{\iota-1}^i\frac{p(y_\iota|x_\iota^i)p(x_\iota^i|x_{\iota-1}^i)}{r(x_\iota^i|x_{\iota-1}^i,Y_\iota)} .$

      (23)

      The following prior is chosen as the reference distribution of the importance probability in order to simplify $ w_\iota^i $:

      $ r(x_\iota|x_{\iota-1}^i,Y_\iota) = p(x_\iota^i|x_{\iota-1}^i) .$

      (24)

      Then,

      $ w_\iota^i = w_{\iota-1}^ip(y_\iota|x_\iota^i) $

      (25)

      where $ p(y_\iota|x_\iota^i) $ represents the observation likelihood and is acquired by comparing the histogram of the current particle with that of the previous particle, as shown in Fig. 2.

      Figure 2.  Diagram of judging the likelihood
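
      The sketch below shows one possible way to compute the likelihood $ p(y_\iota|x_\iota^i) $ from color histograms. The 8-bin-per-channel RGB histogram, the Bhattacharyya distance and the Gaussian mapping with parameter sigma are assumptions made for illustration; the paper only states that the histograms of the particles are compared.

```python
import numpy as np

def color_histogram(patch, bins=8):
    """Normalized RGB histogram of an image patch of shape (H, W, 3), values in [0, 255]."""
    hist, _ = np.histogramdd(patch.reshape(-1, 3).astype(float),
                             bins=(bins, bins, bins),
                             range=((0, 256), (0, 256), (0, 256)))
    return hist.ravel() / max(hist.sum(), 1.0)

def histogram_likelihood(hist_particle, hist_reference, sigma=0.1):
    """p(y_t | x_t^i) from the similarity of two normalized color histograms.
    The Bhattacharyya distance and the Gaussian mapping are illustrative assumptions."""
    bc = np.sum(np.sqrt(hist_particle * hist_reference))   # Bhattacharyya coefficient in [0, 1]
    dist = np.sqrt(max(1.0 - bc, 0.0))                     # Bhattacharyya distance
    return float(np.exp(-dist ** 2 / (2.0 * sigma ** 2)))
```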

      In order to estimate the velocity and the direction of the human motion, the movement dynamic model of the human motion is introduced as follows[30]:

      $ \Lambda_\iota = F_\iota \Lambda_{\iota-1}+W_\iota $

      (26)

      $ {\left[\begin{array}{c} X_\iota \\ Y_\iota\\ \dot{X_\iota} \\ \dot{Y_\iota} \end{array} \right]} = { \left[ \begin{array}{cccc} 1 & & \Delta \iota &\\ & 1& & \Delta \iota \\ & & 1 & \\ & & &1 \end{array} \right ]}\times {\left[\begin{array}{c} X_{\iota-1} \\ Y_{\iota-1}\\ \dot{X}_{\iota-1} \\ \dot{Y}_{\iota-1} \end{array} \right]}+W_\iota $

      (27)

      where the vector $ \Lambda_\iota = \{ X_\iota, Y_\iota, \dot{X_\iota}, \dot{Y_\iota}\} $ contains the center of the bounding box and the velocity of the particles, and $ W_\iota $ represents the noise.

    • In the particle filter algorithm, the weight of a particle decides whether the particle is discarded: the weight of a good particle becomes larger while the weight of a bad particle becomes smaller, so the particle set may lose diversity because particles with small weights are discarded. This degeneracy appears in the transmission process of the particles. In order to avoid the degeneracy, resampling is introduced into the transmission process of the particles. As shown in Fig. 3, new particles are drawn from the existing particles with probabilities proportional to their weights; the particles with higher weights are therefore duplicated and the particles with negligible weights are discarded.

      Figure 3.  Diagram of resampling particles
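
      A minimal sketch of the resampling step is given below. Systematic resampling is used here as one common variant; the paper does not specify which resampling scheme it adopts.

```python
import numpy as np

def systematic_resample(particles, weights):
    """Systematic resampling: particles with negligible weights are discarded and
    particles with large weights are duplicated (one common variant of the step)."""
    N = len(weights)
    positions = (np.arange(N) + np.random.uniform()) / N   # evenly spaced sampling points
    cumulative = np.cumsum(weights)
    cumulative[-1] = 1.0                                    # guard against rounding error
    idx = np.searchsorted(cumulative, positions)
    return particles[idx], np.full(N, 1.0 / N)
```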

    • The key idea of the compression of 3D point cloud data based on the temporal redundancy is to insert a number of P frames between two I frames, as illustrated in Fig. 4.

    • The octree data structure is used to store the sparse 3D point cloud data due to its special structure, and each branch node stores the information of the center point of a certain cube. For a given depth, the octree is constructed in depth-first order by traversing the tree structure. Starting from the root, eight children voxels are generated at each node. All the points are mapped to the leaf voxels at the maximum depth of the tree. The color of the center point of a cube is the mean color of all points in that cube, as shown in Fig. 5.

      Figure 4.  Principle of the compression algorithm

      Figure 5.  Octree data structure
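
      A flat voxel-grid sketch of this step is given below: it replaces each occupied cell by the centroid of its points, colored with their mean color, which corresponds to the octree leaves at the maximum depth. The fixed 0.04 m resolution matches the value reported in the experiments; using the centroid rather than the geometric cell center is an implementation choice.

```python
import numpy as np

def voxelize(points, colors, resolution=0.04):
    """Voxel-grid downsampling at a fixed resolution: every occupied cell is
    replaced by the centroid of its points, colored with their mean color
    (a flat stand-in for the octree leaves at the maximum depth)."""
    keys = np.floor(points / resolution).astype(np.int64)
    _, inverse, counts = np.unique(keys, axis=0, return_inverse=True, return_counts=True)
    inverse = inverse.reshape(-1)
    centers = np.zeros((len(counts), 3))
    mean_colors = np.zeros((len(counts), 3))
    np.add.at(centers, inverse, points)                     # sum points per occupied voxel
    np.add.at(mean_colors, inverse, colors.astype(float))   # sum colors per occupied voxel
    return centers / counts[:, None], mean_colors / counts[:, None]
```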

    • In order to generate a local reference frame that is invariant to translations and rotations, Salti et al.[31] apply a disambiguated eigenvalue decomposition to construct the reference frame, i.e., the eigenvectors of the modified covariance matrix $ \psi $ constitute the reference frame. The three unit eigenvectors of the modified covariance matrix, in decreasing eigenvalue order, are the $ x^+ $, $ y^+ $ and $ z^+ $ axes, respectively. The modified covariance matrix $ \psi $ is given by[31]

      $ \psi = \frac{1}{\sum\limits_{i:d_i\leq l}(l-d_i)}\sum\limits_{i:d_i\leq l}(l-d_i)(q_i-q)(q_i-q)^{\rm T} $

      (28)

      where $ d_i $ is the distance between the point $ q_i $ and the key point $ q $. The distance $ d_i $ is smaller than the radius $ l $.

      In order to remove the sign ambiguity, the sign of the local $ x $ and $ z $ axes is first chosen so that each axis points in the direction of the majority of the vectors it represents, and the local $ y $ axis is obtained by calculating the cross product of the local $ x $ and $ z $ axes, i.e., $ y = x\times z $. For example, the process of determining the local $ x $ axis is as follows[31]:

      $ \begin{split}& F_x^+\doteq \{ i:d_i\leq l \wedge (q_i-q)\cdot x^+ \geq 0 \}\\ & F_x^-\doteq \{ i:d_i\leq l \wedge (q_i-q)\cdot x^- > 0 \}\\ & \widetilde{F}_x^+\doteq \{ i:i\in \Omega(k) \wedge (q_i-q)\cdot x^+ \geq 0 \}\\ &\widetilde{F}_x^-\doteq \{ i:i\in \Omega(k) \wedge (q_i-q)\cdot x^- > 0 \} \end{split} $

      (29)

      $ x = \left\{ \begin{aligned} & x^+, & |F_x^+|>|F_x^-|\\ & x^-, & |F_x^+|<|F_x^-|\\ & x^+, & |F_x^+| = |F_x^-| \wedge |\widetilde{F}_x^+|>|\widetilde{F}_x^-|\\ & x^-, & |F_x^+| = |F_x^-| \wedge |\widetilde{F}_x^+|<|\widetilde{F}_x^-| \end{aligned} \right. $

      (30)

      where $ x^- $ represents the opposite unit vector of $ x^+ $. $ \Omega(k) $ is the point set of the k-nearest neighbors in the key point $ q $ and $ k $ is an odd number.
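
      A simplified Python sketch of the local reference frame construction of (28)-(30) is given below. It disambiguates the signs using the global majority only; the k-nearest-neighbor tie-break of (30) is omitted, and the function and variable names are illustrative.

```python
import numpy as np

def local_reference_frame(q, neighbors, l):
    """Local reference frame at key point q from the neighbors within radius l,
    following (28)-(30); the k-nearest-neighbor tie-break of (30) is omitted."""
    diff = neighbors - q
    d = np.linalg.norm(diff, axis=1)
    diff, d = diff[d <= l], d[d <= l]
    w = l - d                                                # weights (l - d_i)
    cov = (w[:, None, None] * np.einsum('ni,nj->nij', diff, diff)).sum(axis=0) / w.sum()
    _, eigvec = np.linalg.eigh(cov)                          # eigenvalues in ascending order
    x_axis, z_axis = eigvec[:, 2], eigvec[:, 0]              # largest / smallest eigenvalue
    # Sign disambiguation: each axis points toward the majority of the neighbors.
    if np.sum(diff @ x_axis >= 0) < np.sum(diff @ x_axis < 0):
        x_axis = -x_axis
    if np.sum(diff @ z_axis >= 0) < np.sum(diff @ z_axis < 0):
        z_axis = -z_axis
    y_axis = np.cross(x_axis, z_axis)                        # y = x cross z, as in the text
    return x_axis, y_axis, z_axis
```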

    • In order to obtain the signature structure, an isotropic spherical grid centered at the key point is first cut into 32 slices along the radial, azimuth and elevation axes. As shown in Fig. 6, this spherical grid includes 32 partitions produced from 2 elevation, 2 radial and 8 azimuth divisions. For each slice, a local histogram with 11 bins is constructed. Thus, a key point has 352 features. Second, according to the cosine of the angle $ \alpha_p $ between the local $ z $ axis at the key point $ p $ and the normal $ n_{p_i} $ of the neighborhood point $ p_i $, i.e., ${\rm{cos}}\,\alpha_p = z_p \cdot n_{p_i}$, each point is accumulated into the corresponding bin. Then, a quadrilinear interpolation technique is applied to deal with the boundary effects resulting from the histogram bins and from small perturbations in obtaining the local reference frame. Finally, in order to make the descriptor robust to point density variations[31], it is normalized.

      Figure 6.  Spherical grid for the local feature in SHOT descriptor[32]

      The aim of the quadrilinear interpolation technique is to make the distribution of the histogram more uniform. Thus, each point in the neighbourhood of the key point contributes to four bins of the histograms: the adjacent bins, the adjacent husks, the adjacent vertical volumes and the adjacent horizontal volumes, respectively.
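
      The following sketch computes one such local histogram by binning $ {\rm{cos}}\,\alpha_p $ into 11 bins; the spatial partition into 32 volumes and the quadrilinear interpolation are omitted for brevity, so it is only a partial illustration of the SHOT signature.

```python
import numpy as np

def cosine_histogram(z_axis, neighbor_normals, bins=11):
    """One local histogram of the SHOT signature: cos(alpha) between the local z
    axis and the neighbors' normals, accumulated into 11 bins and normalized."""
    cos_alpha = neighbor_normals @ z_axis                    # one cosine per neighbor
    hist, _ = np.histogram(cos_alpha, bins=bins, range=(-1.0, 1.0))
    hist = hist.astype(float)
    return hist / max(hist.sum(), 1.0)                       # normalization for density robustness
```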

    • Because the SHOT feature descriptor has 352 features from 32 histograms, the SHOT feature descriptor must be encoded to reduce the calculation time. Thus, the 352 SHOT features are encoded into a 352-bit binary descriptor. The 352-feature descriptor is cut into 88 slices and each slice has 4 features. Each slice is encoded individually, e.g., the first slice $ \{ f_0, f_1, f_2, f_3\} $. Let $f_{\rm{sum}} = f_0+f_1+f_2+f_3$; the binary bits $ \{ b_0, b_1, b_2, b_3 \} $ of the first slice are given as follows[12].

      Case 1: If $ f_i = 0 $, $i = 0, \cdots, 3$, then $ \{ b_0, b_1, b_2, b_3 \} = \{0,0,0,0\} $.

      Case 2: If Case 1 is not satisfied and $f_i > f_{\rm{sum}}\times 90\%$, then $ b_i = 1 $. For example, if $f_0 > f_{\rm{sum}}\times 90\%$, then the encoded $ \{ b_0, b_1, b_2, b_3 \} = \{1,0,0,0\} $. In this case, four values are possible, i.e., $ \{ b_0, b_1, b_2, b_3 \} $ is one of $ \{1, 0, 0, 0\} $, $ \{0, 1, 0, 0\} $, $ \{0, 0, 1, 0\} $ and $ \{0, 0, 0, 1\} $.

      Case 3: If Cases 1 and 2 are not satisfied, and $f_i+f_j > f_{\rm{sum}}\times 90\%$, $ i \neq j $, then $ b_i = 1 $, $ b_j = 1 $. For example, if $f_0+f_1 > f_{\rm{sum}}\times90\%$, then the encoded $ \{ b_0, b_1, b_2, b_3 \} = \{1,1,0,0\} $. In this case, the possible values of $ \{ b_0, b_1, b_2, b_3 \} $ are $ \{1, 1, 0, 0\} $, $ \{1, 0, 1, 0\} $, $ \{1, 0, 0, 1\} $, $ \{0, 1, 1, 0\} $, $ \{0, 1, 0, 1\} $ and $ \{0, 0, 1, 1\} $.

      Case 4: If Cases 1, 2 and 3 are not satisfied, and $f_i+f_j+f_k > f_{\rm{sum}}\times 90\%$, $ i \neq j \neq k $, then, $ b_i = 1 $, $ b_j = 1 $, $ b_k = 1 $. The possible values of $ \{ b_0, b_1, b_2, b_3 \} $ are $ \{1, 1, 1, 0\} $, $ \{0, 1, 1, 1\} $, $ \{1, 1, 0, 1\} $ and $ \{1, 0, 1, 1\} $.

      Case 5: If all the above conditions are not satisfied, then $ \{ b_0, b_1, b_2, b_3 \} = \{1,1,1,1\} $.
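
      A direct transcription of Cases 1-5 in Python is sketched below; how ties between equal feature values are broken is an implementation detail that the rules above leave open.

```python
import numpy as np

def encode_slice(f, threshold=0.9):
    """Encode one 4-value SHOT slice into 4 bits following Cases 1-5
    (b_i = 1 marks the values that dominate 90% of the slice sum)."""
    f = np.asarray(f, dtype=float)
    b = np.zeros(4, dtype=np.uint8)
    f_sum = f.sum()
    if f_sum == 0.0:                                  # Case 1: all four values are zero
        return b
    order = np.argsort(f)[::-1]                       # indices by decreasing value
    for m in (1, 2, 3):                               # Cases 2-4: do the top m values dominate?
        if f[order[:m]].sum() > threshold * f_sum:
            b[order[:m]] = 1
            return b
    b[:] = 1                                          # Case 5: no small subset dominates
    return b

def bshot_descriptor(shot_features):
    """Encode a 352-value SHOT descriptor into a 352-bit binary descriptor (88 slices of 4)."""
    return np.concatenate([encode_slice(shot_features[i:i + 4])
                           for i in range(0, len(shot_features), 4)])
```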

    • Now the problem of finding correspondences between two dynamic 3D frames is translated into finding correspondences between key points extracted from the two frames, i.e., $ \Gamma_{i} $ and $\Gamma_{j} (i \; {\rm{mod}} \; 4 = 0, j \; {\rm{mod}}\; 4\neq 0)$. The features from the previous subsections are applied to estimate the similarity between two key points. The histogram similarity is regarded as the matching score between two key points $ m\in\Gamma_i $, $ n\in\Gamma_j $. For each key point in $ \Gamma_{j} $, the matching score is used to find its best matching key point in $ \Gamma_{i} $. The best match is given as follows[33]:

      $ m_n = \mathop{\rm{argmax}}\limits_{m \in \Gamma_{i}} S(m,n) $

      (31)

      where $ S(m,n) $ represents the histogram similarity between $ m $ and $ n $.
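
      The sketch below applies (31) to binary B-SHOT descriptors, taking the fraction of matching bits as the similarity $ S(m,n) $; this Hamming-based similarity and the matching threshold are assumptions used for illustration.

```python
import numpy as np

def match_keypoints(desc_prev, desc_curr, min_similarity=0.9):
    """For each key point of the current frame, find its best match in the previous
    frame. desc_prev, desc_curr: (M, 352) and (N, 352) binary descriptors; the
    similarity S(m, n) is taken as the fraction of identical bits (assumption)."""
    similarity = 1.0 - np.abs(desc_curr[:, None, :].astype(np.int8)
                              - desc_prev[None, :, :].astype(np.int8)).mean(axis=2)
    best = similarity.argmax(axis=1)                  # argmax_m S(m, n) for each n, as in (31)
    best_score = similarity.max(axis=1)
    matched = best_score >= min_similarity            # successfully matched key points
    return best, best_score, matched
```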

    • Because the number of points differs between two 3D point cloud frames, it is impossible to find corresponding points in the previous frame for all points in the current frame. Thus, if the best matching score of a key point is bigger than a predefined threshold, the key point is regarded as the center of a cluster that includes some sparse points with poor matching scores. The K-means method is applied for clustering in the 3D coordinates of the points, where the target number of clusters is $ k $. In particular, in order to avoid the boundary effects of the clusters, all points are traversed to find the corresponding cluster subset.
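
      A sketch of this clustering step is given below. Using scikit-learn's KMeans seeded with the matched key points is an implementation choice; the paper only specifies K-means clustering on the 3D coordinates with $ k $ target clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_points(points, matched_keypoints):
    """Cluster the 3D coordinates of a frame, seeding the k clusters with the
    successfully matched key points (an implementation choice; the paper only
    specifies K-means with k target clusters)."""
    k = len(matched_keypoints)
    km = KMeans(n_clusters=k, init=np.asarray(matched_keypoints), n_init=1).fit(points)
    return km.labels_, km.cluster_centers_            # cluster index per point, refined centers
```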

    • In the previous subsection, the 3D point cloud frame was divided into $ k $ cluster subsets. Then, the similarity of the cluster subsets between two frames is calculated by using the normal and the position of their center points. The similarity criteria are the normal and the position of the center point for two reasons: they can be evaluated quickly, since the normal and position have already been saved in the previous operation; and the normal of the center point represents the rotation of the cluster subset while the position of the center point represents its translation. If the similarity is smaller than a predefined threshold, the 3D point cloud data of the cluster subset is retained. If the similarity is bigger than or equal to the predefined threshold, the 3D point cloud data of the cluster subset is not retained and a motion vector is computed to replace it. The motion vectors are given by

      $ v_{ji}^k = p_j^k-p_i^k $

      (32)

      where $ p_j^k $ represents the center point position of the k-th cluster subset in $ \Gamma_j $ frame, $ p_i^k $ represents the center point of the k-th cluster subset in $ \Gamma_i $ frame.

      The $ \Gamma_j $ frame is decoded through using some information including the motion vectors, the reference frame $ \Gamma_i $ and a small amount of the cluster subsets in the $ \Gamma_j $ frame. Note that a small amount of the cluster subsets are acquired through comparing the similarity of the cluster subsets between two frames, i.e., $ \Gamma_i $ and $ \Gamma_j $. The process in detail is as follows:

      $ S1 $: Traverse all center points in the $ \Gamma_j $ frame. If a center point has a motion vector, then $ S2 $ is performed; if the center point does not have a motion vector, then $ S3 $ is performed.

      $ S2 $: According to the motion vector, the center point's corresponding cluster subset in the $ \Gamma_i $ frame is found. Then, the center point's cluster subset is replaced by the corresponding cluster subset in the $ \Gamma_i $ frame to reconstruct the $ \Gamma_j $ frame.

      $ S3 $: The center point's cluster subset in the $ \Gamma_j $ frame is used to reconstruct the $ \Gamma_j $ frame.
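
      The sketch below summarizes the P-frame encoding of (32) and the decoding steps $ S1 $-$ S3 $. The similarity measure is left abstract, and translating the reference cluster by its motion vector during decoding is our reading of $ S2 $; the cluster data structures and names are illustrative.

```python
import numpy as np

def encode_p_frame(clusters_curr, clusters_prev, similarity, threshold):
    """Build a P frame: a cluster that is similar enough to its counterpart in the
    previous frame is replaced by the motion vector of (32); otherwise its raw
    points are retained. similarity[k] compares the k-th clusters of both frames."""
    motion_vectors, retained = {}, {}
    for k, (curr, prev) in enumerate(zip(clusters_curr, clusters_prev)):
        if similarity[k] >= threshold:
            motion_vectors[k] = curr.mean(axis=0) - prev.mean(axis=0)   # v_ji^k = p_j^k - p_i^k
        else:
            retained[k] = curr                                          # transmit the raw cluster
    return motion_vectors, retained

def decode_p_frame(clusters_prev, motion_vectors, retained):
    """Reconstruct the current frame (steps S1-S3): matched clusters are taken from
    the reference frame and translated by their motion vectors; the others are
    copied from the retained raw clusters of the P frame."""
    parts = [clusters_prev[k] + v for k, v in motion_vectors.items()]
    parts.extend(retained.values())
    return np.vstack(parts)
```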

    • In this section, the experimental results are described. The raw 3D point cloud data and 2D RGB image are obtained by using the Kinect V2 sensor. The experiments are performed on a computer with an Intel fourth-generation Core i5-4460 processor @ 3.20 GHz and 8 GB RAM.

    • In order to estimate the quality of the compressed point cloud, some evaluation metrics are used, such as the delay time, the peak signal to noise ratio (PSNR) metric (which reflects the distortion rate: the higher the PSNR, the lower the distortion) and the compression ratio. The delay time includes the coding time, the decoding time and the transmission time. The transmission time is affected by the Internet bandwidth, which is 20 Mbps in our experiment. The symmetric root mean square (RMS) distance of the geometry of the 3D point cloud data is defined in (34), and the geometric PSNR, i.e., the peak signal of the geometry over the symmetric RMS distance, is defined in (35)[34].

      $ d_{rms}(V_{o},V_{d}) = \frac{1}{M}\sum\limits_{v_l\in V_o}||v_l-v_{nn}||_2 $

      (33)

      where $ M $ is the number of points in $ V_o $, $ v_{nn} $ represents the nearest point to $ v_l $ in the decompressed 3D point cloud data, $ V_o $ represents the raw 3D point cloud data and $ V_d $ represents the decompressed 3D point cloud data.

      $ d_{sym\_rms}(V_o,V_d) = {\rm{max}}(d_{rms}(V_o,V_d),d_{rms}(V_d,V_o)) $

      (34)

      $ psnr_{ge} = 10\,{\rm{log}}_{10}\left(\frac{||{\rm{max}}_{x,y,z}(V_d)||_2^2}{(d_{sym\_rms}(V_o,V_d))^2}\right)\quad\quad\quad\quad\quad\quad\quad$

      (35)

      where ${\rm{max}}_{x,y,z}(V_d)$ represents the maximum of the $ (x,y,z) $ values in the decoded point cloud data.
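
      A compact implementation of the metrics (33)-(35) is sketched below; scipy's cKDTree is used here only as a convenient nearest-neighbor search and is not mandated by the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def d_rms(V_o, V_d):
    """Distance (33): mean distance from every point of V_o to its nearest neighbor in V_d."""
    dists, _ = cKDTree(V_d).query(V_o)
    return dists.mean()

def geometric_psnr(V_o, V_d):
    """Symmetric RMS distance (34) and geometric PSNR (35), returned in dB."""
    d_sym = max(d_rms(V_o, V_d), d_rms(V_d, V_o))
    peak = np.linalg.norm(V_d.max(axis=0))            # ||max_{x,y,z}(V_d)||_2
    return 10.0 * np.log10(peak ** 2 / d_sym ** 2)
```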

    • First, we acquired the RGB image and the point cloud data of the scene by using the Kinect V2 sensor. Second, the remote operator was tracked within the 2D bounding box by applying the color-based particle filter. Then, the intrinsic parameters of the camera calibration were exploited to project the 2D bounding box into a 3D bounding box, and we obtained the 3D point cloud data of the remote operator's body by using the 3D bounding box, as shown in Fig. 7.

      Figure 7.  Color-based particle filter was applied to acquire the remote operator body. (a)–(f) show the results of acquiring the remote operator body through the 2D image (i.e., the red bounding box), (g)–(l) show the results of acquiring the point cloud data of the remote operator body through the proposed method. Color versions of the figures are available online.

    • In this subsection, two experiments were implemented to measure the performance of the proposed method. The octree-based compression algorithm was used as the comparison method, for two reasons: first, most state-of-the-art transmission methods mainly focus on the distortion rate and require quite high computational power, e.g., the motion compensated temporal filtering wavelet-based encoder[25]; second, the octree-based method is suitable for real-time applications with good performance[15].

    • In our proposed method, we send three P frames between two I frames, as illustrated in Fig. 4. Thus, three groups of matching operations between two different frames were implemented. In order to reduce the calculation time, each frame was voxelized before the matching operations with a voxel resolution of 0.04 m, which generated approximately 1661 occupied voxels (the number of initial 3D points was 86402). In fact, the number of points depends on the size of the actual frames. As shown in Fig. 8, after voxelizing the 3D point cloud data of the remote operator's body, the distribution of the point cloud was uniform and the main structural information of the point cloud data was still retained, which makes the subsequent matching operation more likely to succeed.

      Figure 8.  Results of the voxelization. (a)–(l): Before voxelizing 3D point cloud data of the remote operator body. (m)–(x): After voxelizing 3D point cloud data of the remote operator body, obviously, distribution of point cloud was uniform and the main information of point cloud data was retained.

      First, the points in the frames at times $ t+1 $, $ t+2 $ and $ t+3 $ were used to find the matching points in the frame at time $ t $, and the matched points were regarded as the centers of clusters. As shown in Figs. 9 (a)-9 (i), the left of each image shows the frame at time $ t+1 $, $ t+2 $ and $ t+3 $, respectively, and the right of each image shows the frame at time $ t $. Then, the unmatched points were clustered and the similarity of the cluster subsets between two frames was calculated by using the normal and position of their center points. Finally, the motion vectors were generated by calculating the transformation vector between two similar clusters. In particular, only the unsuccessfully matched clusters in the current frame and the motion vectors were transmitted to the remote end. As shown in Figs. 9 (j)-9 (r), the predicted frame was almost flawless. Although the numbers of points in the two 3D point cloud frames were quite different, the prediction of the new frame was successful.

      Figure 9.  Results of the matching point and the proposed compression. (a)–(i): The results of the matching point, the content on the left of each of the images represented the frame on the time $ t+1 $, $ t+2 $, $ t+3 $ respectively. The content on the right of each of the images represented the frame on the time $ t $. If two points were matched successfully, they will be connected by a line. (j)–(r): The $ \Gamma_{t+1} $, $ \Gamma_{t+2} $, and $ \Gamma_{t+3} $ frames were decoded through using some information including the motion vectors, the reference frame $ \Gamma_t $ and a small amount of the cluster subsets in the $ \Gamma_{t+1} $, $ \Gamma_{t+2} $ or $ \Gamma_{t+3} $ frame, each image has been decoded successfully.

    • The comparison experiment was implemented to evaluate the performance of the proposed algorithm against the octree-based method. The octree-based compression method was applied to directly compress the 3D point cloud data grabbed by the Kinect V2 sensor. Fig. 10 (b) shows the performance of the octree-based compression algorithm.

      Figure 10.  Results of the octree-based compression algorithm

    • As shown in Table 2, we used 20 frames of point cloud data to calculate the quality evaluation metrics and report their means. The CPU time of compressing the 3D point cloud was 1228 ms using the proposed method, while the CPU time was 2163 ms using the octree-based compression algorithm. In terms of the compression ratio, the compression ratio of the proposed method was 8.42 while that of the octree-based method was 5.99. The PSNR of the proposed method was 42.68 dB while the PSNR of the octree-based method was 38.19 dB. As shown in Figs. 9 and 10, the main original information has been retained. In conclusion, the proposed method has better performance due to the much lower CPU time of compressing the 3D point cloud under a similar distortion rate. More importantly, the proposed method also achieves the higher compression ratio.

      Method Octree-based Our method
      Delay time 2163 ms 1228 ms
      Compression ratio 5.99 8.42
      Distortion rate (PSNR) 38.19 dB 42.68 dB

      Table 2.  Performance of the two methods

    • In this paper, a novel compression framework based on 3D point cloud data has been proposed for telepresence. The key idea of the framework is to exploit the temporal and spatial redundancy. The spatial redundancy is removed by using the robust color-based Bayesian framework, and the temporal redundancy between two consecutive frames is decreased by employing motion vector estimation. First, the B-SHOT descriptor is applied to match points between two frames and the K-means clustering algorithm is implemented to generate the clusters. Then, the similarity of the clusters between two frames is evaluated and the motion vectors are generated. Finally, the motion vectors, the information of the I frame and the unsuccessfully matched clusters are utilized to predict the new frame. Experimental results have shown that the proposed method has better performance. The proposed method achieves a compression ratio of 8.42 and a delay time of 1228 ms, compared with a compression ratio of 5.99 and a delay time of 2163 ms for the octree-based compression method under conditions of a similar distortion rate.

      This work was supported by the National Natural Science Foundation of China (Nos. 61811530281 and 61861136009), the Guangdong Regional Joint Foundation (No. 2019B1515120076), and the Fundamental Research Funds for the Central Universities.
