Fu-Qiang Liu, Zong-Yi Wang. Automatic “Ground Truth” Annotation and Industrial Workpiece Dataset Generation for Deep Learning. International Journal of Automation and Computing. DOI: 10.1007/s11633-020-1221-8

Automatic “Ground Truth” Annotation and Industrial Workpiece Dataset Generation for Deep Learning

Author Biography:
  • Fu-Qiang Liu received the M.Sc. degree in computer technology from Harbin Engineering University, China in 2013. He is currently a Ph.D. candidate in control science and engineering at the College of Automation, Harbin Engineering University, China. His research interests include computer vision, deep learning, artificial intelligence, deep neural networks, simultaneous localization and mapping (SLAM), and robotics. E-mail: fqliu@hrbeu.edu.cn (Corresponding author). ORCID iD: 0000-0002-1593-9497

    Zong-Yi Wang received the Ph.D. degree in control theory and control engineering from Harbin Engineering University, China in 2005. He is a professor at the College of Automation, Harbin Engineering University, China. He won first prize of the Heilongjiang Provincial Scientific and Technological Progress Award in 2004. His research interests include computer vision, robotics, and welding and cutting intelligence. E-mail: zywang@hrbeu.edu.cn

  • Received: 2019-11-13
  • Accepted: 2019-12-25
  • Published Online: 2020-03-05
  • In industry, it is becoming common to detect and recognize industrial workpieces using deep learning methods. In this field, the lack of datasets is a major problem, and collecting and annotating datasets is very labor intensive: researchers who build a dataset themselves must also annotate it. This is one of the factors restricting how well current deep learning based methods can scale. At present, there are very few workpiece datasets for industrial fields, and the existing datasets are generated from ideal workpiece computer aided design (CAD) models, with few actual workpiece images collected and utilized. We propose an automatic industrial workpiece dataset generation method and an automatic ground truth annotation method. Our methods include three algorithms that we propose: a point cloud based spatial plane segmentation algorithm to segment the workpieces in the real scene and to obtain the annotation information of the workpieces in images captured in the real scene; a random multiple workpiece generation algorithm to generate abundant composite datasets with random workpiece rotation angles and positions; and a tangent vector based contour tracking and completion algorithm to obtain improved contour images. With these procedures, annotation information can be obtained automatically, and upon completion of the annotation process a JSON format file is generated. Faster R-CNN (Faster R-convolutional neural network), SSD (single shot multibox detector) and YOLO (you only look once: unified, real-time object detection) are trained using the datasets proposed in this paper. The experimental results show the effectiveness and integrity of this dataset generation and annotation method.
  • [1] J. X. Xiao, K. A. Ehinger, J. Hays, A. Torralba, A. Oliva. SUN database: Exploring a large collection of scene categories. International Journal of Computer Vision, vol. 119, no. 1, pp. 3–22, 2016. DOI: 10.1007/s11263-014-0748-y.
    [2] A. Torralba, R. Fergus, W. T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958–1970, 2008. DOI: 10.1109/TPAMI.2008.128.
    [3] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010. DOI: 10.1007/s11263-009-0275-4.
    [4] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Miami, USA, pp. 248–255, 2009. DOI: 10.1109/CVPR.2009.5206848.
    [5] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the 13th European Conference on Computer Vision, Springer, Zurich, Switzerland, pp. 740–755, 2014. DOI: 10.1007/978-3-319-10602-1_48.
    [6] B. L. Zhou, A. Lapedriza, A. Khosla, A. Oliva, A. Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1452–1464, 2018. DOI: 10.1109/TPAMI.2017.2723009.
    [7] I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, S. Kamali, M. Malloci, J. Pont-Tuset, A. Veit, S. Belongie, V. Gomes, A. Gupta, C. Sun, G. Chechik, D. Cai, Z. Feng, D. Narayanan, K. Murphy. OpenImages: A public dataset for large-scale multi-label and multi-class image classification, [Online], Available: https://storage.googleapis.com/openimages/web/index.html, October 6, 2019.
    [8] J. Tremblay, T. To, A. Molchanov, S. Tyree, J. Kautz, S. Birchfield. Synthetically trained neural networks for learning human-readable plans from real-world demonstrations. In Proceedings of IEEE International Conference on Robotics and Automation, IEEE, Brisbane, Australia, pp. 5659–5666, 2018. DOI: 10.1109/ICRA.2018.8460642.
    [9] J. Tremblay, T. To, S. Birchfield. Falling things: A synthetic dataset for 3D object detection and pose estimation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, IEEE, Salt Lake City, USA, pp. 2119–21193, 2018. DOI: 10.1109/CVPRW.2018.00275.
    [10] B. Calli, A. Singh, A. Walsman, S. Srinivasa, P. Abbeel, A. M. Dollar. The YCB object and model set: Towards common benchmarks for manipulation research. In Proceedings of International Conference on Advanced Robotics, IEEE, Istanbul, Turkey, pp. 510–517, 2015. DOI: 10.1109/ICAR.2015.7251504.
    [11] M. Arsenovic, S. Sladojevic, A. Anderla, D. Stefanovic, B. Lalic. Deep learning powered automated tool for generating image based datasets. In Proceedings of the 14th IEEE International Scientific Conference on Informatics, IEEE, Poprad, Slovakia, pp. 13–17, 2017. DOI: 10.1109/INFORMATICS.2017.8327214.
    [12] J. Sun, P. Wang, Y. K. Luo, G. M. Hao, H. Qiao. Precision work-piece detection and measurement combining top-down and bottom-up saliency. International Journal of Automation and Computing, vol. 15, no. 4, pp. 417–430, 2018. DOI: 10.1007/s11633-018-1123-1.
    [13] N. Poolsawad, L. Moore, C. Kambhampati, J. G. F. Cleland. Issues in the mining of heart failure datasets. International Journal of Automation and Computing, vol. 11, no. 2, pp. 162–179, 2014. DOI: 10.1007/s11633-014-0778-5.
    [14] X. Y. Gong, H. Su, D. Xu, Z. T. Zhang, F. Shen, H. B. Yang. An overview of contour detection approaches. International Journal of Automation and Computing, vol. 15, no. 6, pp. 656–672, 2018. DOI: 10.1007/s11633-018-1117-z.
    [15] A. Aldoma, T. Fäulhammer, M. Vincze. Automation of “ground truth” annotation for multi-view RGB-D object instance recognition datasets. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, Chicago, USA, pp. 5016–5023, 2014. DOI: 10.1109/IROS.2014.6943275.
    [16] K. Lai, L. F. Bo, X. F. Ren, D. Fox. A large-scale hierarchical multi-view RGB-D object dataset. In Proceedings of IEEE International Conference on Robotics and Automation, IEEE, Shanghai, China, pp. 1817–1824, 2011. DOI: 10.1109/ICRA.2011.5980382.
    [17] M. Di Cicco, C. Potena, G. Grisetti, A. Pretto. Automatic model based dataset generation for fast and accurate crop and weeds detection. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, Vancouver, Canada, pp. 5188–5195, 2017. DOI: 10.1109/IROS.2017.8206408.
    [18] S. Greuter, J. Parker, N. Stewart, G. Leach. Real-time procedural generation of “pseudo infinite” cities. In Proceedings of the 1st International Conference on Computer Graphics and Interactive Techniques in Australasia and South East Asia, ACM, Melbourne, Australia, pp. 87–94, 2003. DOI: 10.1145/604487.604490.
    [19] R. Van Der Linden, R. Lopes, R. Bidarra. Procedural generation of dungeons. IEEE Transactions on Computational Intelligence and AI in Games, vol. 6, no. 1, pp. 78–89, 2014. DOI: 10.1109/TCIAIG.2013.2290371.
    [20] S. R. Richter, V. Vineet, S. Roth, V. Koltun. Playing for data: Ground truth from computer games. In Proceedings of the 14th European Conference on Computer Vision, Springer, Amsterdam, The Netherlands, pp. 102–118, 2016. DOI: 10.1007/978-3-319-46475-6_7.
    [21] P. Marion, P. R. Florence, L. Manuelli, R. Tedrake. Label Fusion: A pipeline for generating ground truth labels for real RGBD data of cluttered scenes. In Proceedings of IEEE International Conference on Robotics and Automation, IEEE, Brisbane, Australia, pp. 3235–3242, 2018. DOI: 10.1109/ICRA.2018.8460950.
    [22] T. Hodan, P. Haluza, Š. Obdržálek, J. Matas, M. Lourakis, X. Zabulis. T-LESS: An RGB-D dataset for 6D pose estimation of texture-less objects. In Proceedings of IEEE Winter Conference on Applications of Computer Vision, IEEE, Santa Rosa, USA, pp. 880–888, 2017. DOI: 10.1109/WACV.2017.103.
    [23] H. Hattori, V. Naresh Boddeti, K. Kitani, T. Kanade. Learning scene-specific pedestrian detectors without real data. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Boston, USA, pp. 3819–3827, 2015. DOI: 10.1109/CVPR.2015.7299006.
    [24] H. S. Koppula, A. Anand, T. Joachims, A. Saxena. Semantic labeling of 3D point clouds for indoor scenes. In Proceedings of the 24th International Conference on Neural Information Processing Systems, ACM, Red Hook, USA, pp. 244–252, 2011.
    [25] J. Xie, M. Kiefel, M. T. Sun, A. Geiger. Semantic instance annotation of street scenes by 3D to 2D label transfer. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, USA, pp. 3688–3697, 2016. DOI: 10.1109/CVPR.2016.401.
    [26] B. Zoph, E. D. Cubuk, G. Ghiasi, T. Y. Lin, J. Shlens, Q. V. Le. Learning data augmentation strategies for object detection. arXiv preprint arXiv:1906.11172, 2019.
    [27] A. Dutta, A. Zisserman. The VIA annotation software for images, audio and video. arXiv preprint arXiv:1904.10699, 2019.
    [28] L. Von Ahn, L. Dabbish. Labeling images with a computer game. In Proceedings of SIGCHI Conference on Human Factors in Computing Systems, ACM, New York, USA, pp. 319–326, 2004. DOI: 10.1145/985692.985733.
    [29] C. H. Zhang, K. Loken, Z. Y. Chen, Z. Y. Xiao, G. Kunkel. Mask Editor: An image annotation tool for image segmentation tasks. arXiv preprint arXiv:1809.06461v1, 2018.
    [30] B. C. Russell, A. Torralba, K. P. Murphy, W. T. Freeman. LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision, vol. 77, no. 1–3, pp. 157–173, 2008. DOI: 10.1007/s11263-007-0090-8.
    [31] M. Johnson-Roberson, C. Barto, R. Mehta, S. N. Sridhar, K. Rosaen, R. Vasudevan. Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks? In Proceedings of IEEE International Conference on Robotics and Automation, IEEE, Singapore, pp. 746–753, 2017. DOI: 10.1109/ICRA.2017.7989092.
    [32] B. T. Phong. Illumination for computer generated pictures. Communications of the ACM, vol. 18, no. 6, pp. 311–317, 1975. DOI: 10.1145/360825.360839.
    [33] S. Q. Ren, K. M. He, R. Girshick, J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, ACM, Cambridge, USA, pp. 91–99, 2015.
    [34] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, A. C. Berg. SSD: Single shot multibox detector. In Proceedings of the 14th European Conference on Computer Vision, Springer, Amsterdam, The Netherlands, pp. 21–37, 2016. DOI: 10.1007/978-3-319-46448-0_2.
    [35] J. Redmon, S. Divvala, R. Girshick, A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, USA, pp. 779–788, 2016. DOI: 10.1109/CVPR.2016.91.
    [36] F. Q. Liu, Z. Y. Wang. PolishNet-2d and PolishNet-3d: Deep learning-based workpiece recognition. IEEE Access, vol. 7, pp. 127042–127054, 2019. DOI: 10.1109/ACCESS.2019.2940411.


Automatic “Ground Truth” Annotation and Industrial Workpiece Dataset Generation for Deep Learning

    • Deep learning datasets play an important role in the history of object recognition. They have always been one of the most important factors in making great progress in this field. At present, a large number of images on the Internet can be accessed at any time, which makes it possible to build datasets that are rich both in the number of images and in the variety of object and scene categories.

      Early datasets, such as Caltech 101 or Caltech 256, were criticized for lacking inter-class variance. SUN[1] is built from images covering different scene categories, many of which have scene and object annotations that can support scene recognition and object detection. The Tiny Images dataset[2] provides a large number of object categories and scenes, but its annotations have not been verified manually and contain notable errors. The two benchmark datasets CIFAR10 and CIFAR100 evolved from Tiny Images and have verified labels. The PASCAL VOC dataset[3] has been used to evaluate object recognition and classification algorithms in annual competitions; starting with only four object categories in 2005, it has since grown to 20 categories of common objects in daily life. ImageNet[4] contains more than 14 million images and more than 20 000 object categories. As the main supporting dataset of the ImageNet large scale visual recognition challenge (ILSVRC), it has pushed object recognition research to a new level. ImageNet has also been criticized because its objects are large and centered in the image, which is not typical of real world situations. To address this problem, researchers created the MS COCO dataset[5], pushing the research toward richer image understanding. The images in MS COCO comprise complex daily scenes containing common objects in their natural backgrounds, which is very close to real life. The object labels provide more accurate detection and evaluation results by using full instance segmentation. The Places dataset[6] contains 10 million scene pictures annotated with scene semantic categories, giving data-hungry deep learning algorithms an opportunity to reach the human recognition level. Open Images[7] is a dataset of 9 million images labeled with image-level labels and object bounding boxes.

      At present, there are a multitude of open datasets. However, for many specific domains, there are no open datasets available. Therefore, given many ideal three-dimensional workpiece models and some actual workpiece images, the ability to automatically and quickly generate a dataset for a specific field would be of great value to researchers.

    • Dataset annotation is a very labor-intensive and time-consuming task. The annotation process is also difficult to perform, for it requires much professional knowledge. To overcome these shortcomings, in 2018, NVIDIA Corporation released a virtual dataset generator for deep learning: the NVIDIA deep learning dataset synthesizer (NDDS)[8].

      In the same year, NVIDIA also released a dataset for three-dimensional object detection and pose estimation, Falling Things[9], which contains 21 common household objects from the Yale-CMU-Berkeley (YCB) dataset[10] and a total of 60 000 annotated images. For each image in the dataset, the corresponding instance-level classes, two-dimensional object bounding boxes, three-dimensional object bounding boxes and three-dimensional poses are provided. Each element in the dataset contains a monocular RGB (red, green and blue) color image, binocular RGB color images and a registered dense depth map.

      Arsenovic et al.[11] of the University of Novi Sad presented an automated way to collect images from the Internet, together with some scattered images, into a dataset that researchers can use themselves. They used two methods to automatically annotate the collected images: 1) a traditional image retrieval method to process the images and obtain the position coordinates of the object bounding boxes, and 2) a convolutional neural network (CNN)-based detector to detect the object bounding boxes. When the bounding box of a target is detected, the coordinates of its upper left and lower right corners are stored in a JavaScript object notation (JSON) file. There are two drawbacks to using this method to automatically annotate datasets: 1) Both the traditional image retrieval method and the CNN-based object detection method produce misjudgments or omissions, and since the detection results are not cleaned manually afterwards, the annotated data are noisy. 2) The method only gives annotation results at the bounding box level, while some finer-grained applications, such as instance segmentation, need annotation information at the instance level. Sun et al.[12] combine top-down and bottom-up saliency to detect and measure workpieces. Poolsawad et al.[13] discussed issues in existing datasets, such as missing values, high dimensionality and unbalanced classes. Gong et al.[14] gave an overview of contour detection approaches.

    • Aldoma et al.[15] proposed a method for automatic “ground truth” annotation of multi-view RGB-D object instance recognition datasets. Lai et al.[16] proposed an automatic annotation method using a sliding window detector, which can annotate RGB-D three-dimensional scenes at the pixel level. Di Cicco et al.[17] proposed an automatic model based dataset generation method. Procedural generation is a commonly used technique in the graphics field for generating scenes, such as virtual cities[18] and virtual dungeons[19]. Richter et al.[20] proposed a method to obtain the ground truth of datasets from modern computer games. Although the source code and data of commercial games cannot be accessed, the authors intercept data from the communication between the game and the graphics display device, and then automatically generate the datasets with high efficiency. Combining the automatically generated dataset with $\dfrac{1}{3} $ of the training set of another dataset, they train a deep neural network that outperforms training entirely on the other dataset. Massachusetts Institute of Technology (MIT)[21] proposed Label Fusion. By reprojecting the annotation results of three-dimensional scenes, each scene can be annotated in correspondence with its RGB-D images; the tool enabled the researchers to collect more than 1 000 000 annotated objects in just a few days. Its disadvantage is that it is still a semi-automatic labeling tool assisted by humans. Hodan et al.[22] have used this tool to produce the T-LESS dataset, which contains about 49 000 RGB-D low-texture images, with each object labeled with six degrees of freedom. In addition, there have been recent efforts to extend small-scale real datasets with large amounts of simulated data[23]. Koppula et al.[24] of Cornell University proposed a semantic segmentation method for three-dimensional point clouds of indoor scenes, together with a solution to the semantic annotation of these point clouds. Hattori et al.[23] proposed a method of learning scene-specific pedestrian detectors without real data; when the amount of real data is limited, this method can outperform a model trained on the specific real scene while using only artificial data. Xie et al.[25] of the University of Washington proposed a method of annotating street scene semantic instances using a three-dimensional to two-dimensional annotation transfer, and released a new dataset containing 400 K images. Zoph et al.[26] of Google proposed a data augmentation strategy for object detection that transfers augmentation strategies across different computer vision tasks, and achieved good results in image classification, object detection, semantic segmentation and other fields. The authors predict that research on data augmentation strategies can effectively replace the acquisition of additional human-annotated data, thus saving a lot of manpower and time.

      The visual geometry group image annotator (VIA)[27], an annotation tool developed by the visual geometry group (VGG) at Oxford University, is very compact and can be opened as a web page. It provides circular, rectangular, triangular, polygonal and other annotation shapes, and in each annotated region users can define the category attributes of the region themselves, which is very convenient. Von Ahn and Dabbish[28] proposed a method to label images while people play computer games, rather than using computer vision techniques. While most existing methods only support drawing polygonal bounding boxes for annotation, Zhang et al.[29] proposed Mask Editor for generating image masks with irregular shapes. LabelMe[30], a data annotation tool released by MIT, also offers iPhone and iPad applications. The tool can annotate the polygon mask information of different object instances in an image, the attribute values of the annotated instances can be assigned manually, and after annotation a JSON format file containing the polygon information of the different object instances is obtained. Label Fusion[21] is a semi-automatic assistant annotation tool for RGB-D objects at the pixel level released by MIT; it can generate RGB-D datasets with pixel-level labels and the corresponding object pose data at the same time. Johnson-Roberson et al.[31] describe a method that uses photo-realistic computer images from a simulation engine to generate annotated data for training deep neural networks.

    • At present, the public datasets have the following shortcomings: 1) There are few workpiece datasets focused on the industrial domain; most of the datasets published on the Internet cover animals, people, everyday objects, etc., and in some specific areas, such as polishing workpieces, there are no known public datasets. 2) Even when dataset images are available, annotating them manually is very time consuming; although there are some open source methods for semi-automatic dataset assistant generation, there are no fully automatic methods.

      In view of the above problems, this paper proposes a systematic automatic dataset generation scheme: 1) Taking the polishing workpiece as an example, a complete automatic dataset generation system is introduced. 2) The transformation from a three-dimensional computer aided design (CAD) workpiece model to a series of other dataset formats is introduced, including two-dimensional multi-view workpiece datasets, two-dimensional multi-workpiece scene datasets, etc. 3) An automatic annotation algorithm is proposed, which can automatically generate annotation information for two-dimensional single workpiece images, two-dimensional multi-workpiece images, etc.

      There are two steps in automatically generating a two-dimensional workpiece dataset: image dataset generation and annotation information generation. The annotation information includes the polygon point coordinate array of the workpiece and the vertex coordinates of the two-dimensional bounding box of the workpiece. This paper introduces the two steps separately.

    • By setting the position and angle of a virtual camera and the number of images to be collected, multi-view polishing workpiece images are generated using the Phong illumination reflection model[32]. Fig. 1 shows some of the original three-dimensional workpiece models, and Fig. 2 shows some of the multi-view images generated according to the Phong illumination reflection model.

      Figure 1.  Selected original three-dimensional model of polishing workpieces

      Figure 2.  Selected multi-view images of polishing workpieces

      In addition, the Phong reflection model computes the illumination intensity of each surface point from the set of all light sources, the direction vector from the surface point to each light source, the normal vector at the surface point, the direction of the ideal reflected beam, and the direction from the point to the observer (or virtual camera).
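      For reference, the standard form of the Phong model[32] expresses the intensity at a surface point as the sum of an ambient term and, for every light source $ m $, a diffuse and a specular term, where $ \hat{N} $ is the surface normal, $ \hat{L}_m $ the direction to light $ m $, $ \hat{R}_m $ the ideal reflection direction, $ \hat{V} $ the direction to the viewer, $ k_a $, $ k_d $, $ k_s $ the material coefficients and $ \gamma $ the shininess:

      $ I = k_a i_a + \sum\limits_{m \in {\rm lights}}\left[k_d\left(\hat{L}_m \cdot \hat{N}\right)i_{m,d} + k_s\left(\hat{R}_m \cdot \hat{V}\right)^{\gamma} i_{m,s}\right] $

      where $ i_a $, $ i_{m,d} $ and $ i_{m,s} $ are the ambient, diffuse and specular intensities of the light sources.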

      1) Simulated workpiece versus real workpiece

      According to the Phong illumination model described above, a simulated image of any workpiece can be obtained from any angle of view. In addition, a robotic manipulator is used to collect real workpiece images in different poses. Afterwards, the workpieces are extracted from the collected images as backup material for real workpieces. Figs. 3 (a) and 3 (b) are schematic diagrams of a simulated workpiece and a real workpiece.

      Figure 3.  Different types of generated workpiece images: (a) simulated workpiece, (b) real workpiece, (c) single workpiece, (d) multiple workpieces, (e) blank background, and (f) real background

      2) Single workpiece versus multiple workpieces

      Different kinds of datasets are generated according to whether there is only one workpiece or more than one workpiece in the image. In the case of only one workpiece, the simulated multi-view workpiece images and the workpieces captured by camera in the actual industrial scene are rotated, zoomed and translated to augment the image dataset. Then, multiple workpieces are randomly selected and combined into an image. During the combination, collisions and occlusions between different workpieces are detected, and the positions of the workpieces are adjusted iteratively. This paper proposes a random multiple workpiece generation algorithm for this purpose; its flow chart is shown in Fig. 4, and a simplified sketch of the placement loop is given below. Figs. 3 (c) and 3 (d) show examples of a single workpiece and of multiple workpieces, all composed by the random multiple workpiece generation algorithm.

      Figure 4.  Flow chart of the random multiple workpieces generation algorithm
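      A minimal sketch of the core placement loop of this composition step is given below. It is illustrative only: the workpiece cutout objects, value ranges and the axis-aligned overlap test are assumptions, and the paper's full algorithm also checks occlusion after rotation. The loop randomly selects cutouts, samples a rotation angle, scale and position, and re-samples whenever the candidate bounding box collides with an already placed workpiece.

      import random

      def overlaps(box_a, box_b):
          # Axis-aligned boxes given as (x0, y0, x1, y1); true if they intersect.
          return not (box_a[2] <= box_b[0] or box_b[2] <= box_a[0] or
                      box_a[3] <= box_b[1] or box_b[3] <= box_a[1])

      def place_workpieces(workpieces, canvas_w, canvas_h, n_items, max_tries=100):
          # workpieces: cutout objects assumed to expose .width and .height.
          placed = []  # list of (workpiece, angle, scale, box)
          for _ in range(n_items):
              cutout = random.choice(workpieces)
              for _ in range(max_tries):
                  angle = random.uniform(0.0, 360.0)   # random rotation angle
                  scale = random.uniform(0.5, 1.5)     # random zoom factor
                  w = int(cutout.width * scale)
                  h = int(cutout.height * scale)
                  x0 = random.randint(0, max(0, canvas_w - w))
                  y0 = random.randint(0, max(0, canvas_h - h))
                  box = (x0, y0, x0 + w, y0 + h)
                  if all(not overlaps(box, other[3]) for other in placed):
                      placed.append((cutout, angle, scale, box))
                      break                            # position accepted
          return placed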

      3) Blank background versus real background

      Two kinds of datasets are generated according to whether a real environmental background is present. One kind has no background: the background in the image is pure white. The other kind has a real factory environmental background. We collected a number of environmental backgrounds in different factories, under different lighting conditions and at different locations. These backgrounds represent the rich environmental features in factories well and enable deep convolutional neural networks to learn actual, specific environmental features; after training on this dataset, good results can be obtained when inferring on real factory scenes. Fig. 5 shows the flow chart of the workpiece and background composition algorithm, and a minimal compositing sketch is given after the figure. Figs. 3 (e) and 3 (f) show an image without an environmental background and an image with a real environmental background, respectively. Fig. 3 (f) was composed by the workpiece and background composition algorithm.

      Figure 5.  Flow chart of the workpiece and background composition algorithm
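      A minimal compositing sketch, assuming the workpiece cutouts are stored as RGBA images whose alpha channel marks workpiece pixels (file names and the paste position are illustrative):

      from PIL import Image

      def compose_on_background(background_path, cutout_path, position):
          # Paste an RGBA workpiece cutout onto a factory background image;
          # the cutout's alpha channel is used as the paste mask, so only
          # workpiece pixels overwrite the background.
          background = Image.open(background_path).convert("RGB")
          cutout = Image.open(cutout_path).convert("RGBA")
          background.paste(cutout, position, mask=cutout)
          return background

      # Example (hypothetical files): place one cutout at pixel (200, 150).
      # composed = compose_on_background("factory_bg.jpg", "workpiece_rgba.png", (200, 150))
      # composed.save("composed_sample.jpg")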

    • In this paper, we propose a tangent vector based contour tracking and completion algorithm to obtain the contour image. Its pseudo code is shown as Algorithm 1. Fig. 6 shows the single workpiece contour image generation process, and Fig. 7 shows the multiple workpiece contour image generation process.

      Figure 6.  Single workpiece contour image generation process: (a) single workpiece with background, (b) single workpiece with no background, (c) binary image, and (d) contour image

      Figure 7.  Multiple workpiece contour image generation process: (a) multiple workpieces with background, (b) multiple workpieces with no background, (c) gray image, and (d) contour image

      Algorithm 1. Tangent vector based contour tracking and completion algorithm

      Input: Input image $ F = \{f_{ij}\} $;

      Output: Output image $ C = \{c_{ij}\} $;

      1) Set NBD = 1. Reset LNBD = 1 whenever a new line is scanned. For each pixel with $ f_{ij} \neq 0 $:

      a) If $ f_{ij} = 1 $ and $ f_{i, j-1} = 0 $, then NBD = NBD + 1, and $ (i_2, j_2) = (i, j-1) $;

      b) If $ f_{ij} \ge 1 $ and $ f_{i, j+1} = 0 $, then NBD = NBD + 1, and $ (i_2, j_2) = (i, j+1) $; set $ LNBD = f_{ij} $ if $ f_{ij} > 1 $;

        c) Otherwise, jump to Step 4).

      2) From the start point $ (i, j) $, track the detected contour through Steps 2)-a) to 2)-e).

      a) From $ (i_2, j_2) $ on, search clockwise around $ (i, j) $ for a non-zero pixel, and let $ (i_1, j_1) $ be the first non-zero pixel found. If no non-zero pixel is found, set $ f_{ij} $ to –NBD and go to Step 4);

      b) Update $ (i_2, j_2) $ to $ (i_1, j_1) $, and update $ (i_3, j_3) $ to $ (i, j) $;

      c) From the next element after pixel $ (i_2, j_2) $, search anti-clockwise around the current pixel $ (i_3, j_3) $ for a non-zero pixel, and set the first non-zero pixel found as $ (i_4, j_4) $;

        d) Change $ f_{i_3, j_3} $ according to the following rules:

      i) If pixel $ (i_3, j_3 + 1) $ is zero after being checked with Step 2)–c), then update $ f_{i_3, j_3} $ to –NBD;

      ii) If pixel $ (i_3, j_3 + 1) $ is non-zero after being checked with Step 2)–c), and $ f_{i_3, j_3} $ is $ 1 $, then update $ f_{i_3, j_3} $ to NBD;

          iii) Otherwise, do not change $ f_{i_3, j_3} $.

      e) If $ (i_4, j_4) = = (i, j) $, and $ (i_3, j_3) = = (i_1, j_1) $ (back to start point), then execute with Step 4); Otherwise, update $ (i_2, j_2) $ to $ (i_3, j_3) $, update $ (i_3, j_3) $ to $ (i_4, j_4) $, then go back to Step 2)–c).

      3) If $ f_{ij} \neq 1 $, update LNBD to $ |f_{ij}| $, and then resume the scan from pixel $ (i, j+1) $ until the lower right corner of the image is reached.

      4) Connect the components if the connection condition is satisfied (the details of the condition are listed in Section 3.2.1).

      5) Return $ C = \{c_{ij}\} $.
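      Steps 1)–3) form a border-following procedure for binary images. Before the completion stage described next, the raw (uncompleted) borders can also be obtained in practice with OpenCV's built-in border following, as a quick stand-in for a hand-written tracker (OpenCV 4.x assumed; the completion step is not covered by this call):

      import cv2
      import numpy as np

      def raw_contours(binary_image):
          # binary_image: uint8 array, workpiece pixels > 0, background == 0.
          contours, _ = cv2.findContours(binary_image, cv2.RETR_EXTERNAL,
                                         cv2.CHAIN_APPROX_NONE)
          # Draw the detected borders into a blank contour image.
          contour_image = np.zeros_like(binary_image)
          cv2.drawContours(contour_image, contours, -1, color=255, thickness=1)
          return contours, contour_image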

      The procedure to attach two contour connected components is detailed below:

      1) Extract the eight-neighborhood connected components, as shown in Fig. 8 (a). Fig. 8 (b) shows an eight-neighborhood diagram.

      Figure 8.  Tangent vector based contour tracking and completion process: (a) original image, (b) 8-neighborhood, (c) before connection, (d) after connection, and (e) result image

      2) Filter out the connected components in which the number of pixels is less than $ {N_{thresh}} $.

      3) Insert the connected components in which the number of pixels is larger than $ {N_{thresh}} $ into the vector.

      4) Calculate the gravity center position of every connected component in the vector.

      5) For every connected component, fit the beginning and ending line segments to its boundary points $ p_i\;(i = 0, 1,\cdots, n-1) $ using the random sample consensus (RANSAC) method, and obtain the corresponding tangent vectors.

      6) Define the slope of the obtained tangent vector as $ k_c $, and the slope of the tangent vector of the i-th of the current connected component's $ k $ nearest neighbor components as $ k_i $. The error function is given below:

      $ E = \alpha|k_c - k_i|+\beta\sqrt{(x_c - x_i)^2 + (y_c - y_i)^2}. $

      (1)

      The weights of the slope difference and of the end point distance are $ \alpha $ and $ \beta $, respectively. When $ \alpha $ is set to 0 and $ \beta $ to 1, only the distance between the end points is used to decide whether to merge and connect two connected components. When $ \alpha $ is set to 1 and $ \beta $ to 0, only the similarity between the slopes of the two tangent vectors of the two connected components is used. In this paper's experiments, $ \alpha $ is set to 0.6 and $ \beta $ to 0.4.

      7) If $ E $ is less than $ E_{thresh} $, the two connected components are connected along their tangent vectors, as shown in Figs. 8 (c) and 8 (d).

      Fig. 8(e) shows the contour image after being processed by this algorithm.
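      A minimal sketch of the connection test in Steps 6) and 7) (illustrative only; the tangent slopes and end points are assumed to come from the RANSAC fits of Step 5)):

      import math

      def connection_error(k_c, k_i, end_c, end_i, alpha=0.6, beta=0.4):
          # Error function (1): weighted sum of the tangent slope difference
          # and the distance between the two components' end points.
          slope_term = abs(k_c - k_i)
          distance_term = math.hypot(end_c[0] - end_i[0], end_c[1] - end_i[1])
          return alpha * slope_term + beta * distance_term

      def should_connect(k_c, k_i, end_c, end_i, e_thresh):
          # Step 7): connect the two components when E falls below the threshold.
          return connection_error(k_c, k_i, end_c, end_i) < e_thresh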

      After extracting the vertex coordinates of the outer edge polygon of each workpiece in the image, the annotation information is formatted according to the JSON file organization and the annotation file is output.
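      The exact schema of the annotation file is not listed here, so the sketch below only illustrates one plausible layout (all field names are assumptions): each workpiece instance is stored with its class label, outer-contour polygon and bounding box, and written with Python's json module.

      import json

      def write_annotation(image_name, instances, out_path):
          # instances: list of dicts holding a class label, the outer-contour
          # polygon [[x, y], ...] and the bounding box [x_min, y_min, x_max, y_max].
          annotation = {
              "image": image_name,
              "objects": [
                  {"label": inst["label"],
                   "polygon": inst["polygon"],
                   "bbox": inst["bbox"]}
                  for inst in instances
              ],
          }
          with open(out_path, "w") as f:
              json.dump(annotation, f, indent=2)

      # Example with hypothetical values:
      # write_annotation("workpiece_0001.png",
      #                  [{"label": "workpiece_17",
      #                    "polygon": [[120, 80], [310, 85], [305, 240], [118, 236]],
      #                    "bbox": [118, 80, 310, 240]}],
      #                  "workpiece_0001.json")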

    • Eight datasets are generated: 1) a composite dataset of a single simulated workpiece with no background; 2) a composite dataset of a single simulated workpiece with a real environmental background; 3) a composite dataset of a single real workpiece with a real environmental background; 4) a non-composite dataset of a single real workpiece with a real environmental background; 5) a composite dataset of multiple simulated workpieces with no background; 6) a composite dataset of multiple simulated workpieces with a real environmental background; 7) a composite dataset of multiple real workpieces with a real environmental background; and 8) a non-composite dataset of multiple real workpieces with a real environmental background. Fig. 9 shows typical images from the eight datasets.

      Figure 9.  Typical images in eight different datasets

      For the non-composite dataset of a single real workpiece with a real environmental background and the non-composite dataset of multiple real workpieces with a real environmental background, we need to segment the workpieces out of the images and then extract the workpiece contours in order to generate the ground truth annotation information automatically. For this purpose, we propose a point cloud based spatial plane segmentation algorithm to segment the workpieces in the scenes. We assume that all of the workpieces are located above the spatial ground plane or above the work platform, which is the case in most industrial situations. We can then fit the spatial plane parameters and segment all of the points above the plane. Fig. 10 shows the actual workpieces and their point clouds. Fig. 11 shows the intermediate results of the proposed algorithm: Fig. 11 (a) shows the complete point cloud, including the fitted spatial plane point cloud and the segmented workpiece point cloud, and after filtering out the noise in the segmented point cloud, the segmented workpieces are shown in Figs. 11 (b) and 11 (c). The inputs of this algorithm are an RGB image and a depth image. The algorithm is described as follows:

      Figure 10.  Actual workpieces and their point cloud: (a) actual workpiece 1, (b) actual workpiece 2, and (c) point cloud

      Figure 11.  Point cloud based spatial plane segmentation algorithm: (a) complete point cloud, (b) segmented workpiece 1, and (c) segmented workpiece 2

      1) Obtain the point cloud set from the depth image.

      2) Define the plane in the three-dimensional coordinate system as

      $ Ax + By + Cz + D = 0.$

      (2)

      3) Define the random generation seeds, and use them to randomly select three points, $ P_1([x_1, y_1, z_1]) $, $ P_2([x_2, y_2, z_2]) $ and $ P_3([x_3, y_3, z_3]) $. Calculate the spatial plane parameters from these coordinates:

      $ \left\{ \begin{aligned} &A = y_1(z_2-z_3) + y_2(z_3-z_1) + y_3(z_1-z_2)\\ & B = z_1(x_2-x_3)+z_2(x_3-x_1) + z_3(x_1-x_2)\\ &C = x_1(y_2-y_3)+x_2(y_3-y_1) + x_3(y_1-y_2)\\ & D = -[x_1(y_2z_3-y_3z_2)+\\ & \qquad x_2(y_3z_1-y_1z_3) + x_3(y_1z_2-y_2z_1)].\end{aligned} \right. $

      (3)

      4) Calculate the distances from the points to the fitted spatial plane. The distance between the i-th point $ P_i\{x_i, y_i, z_i\} $ and the plane is defined as:

      $ d_i = \frac{|Ax_i+By_i+Cz_i+D|}{\sqrt{A^2+B^2+C^2}} .$

      (4)

      5) Iterate $ k $ times to find the parameters of the plane which minimize the sum of $ d_i $:

      $ A, B, C, D = \arg\min\sum\limits_{i = 0}^{n}\frac{|Ax_i+By_i+Cz_i+D|}{\sqrt{A^2+B^2+C^2}} . $

      (5)

      6) Set the distance threshold value as $ D_{thresh} $, insert the points whose distances are larger than $ D_{thresh} $ into the vector.

      7) Project all of the points in the vector back to the two-dimensional coordinate system, according to the camera intrinsic parameters.

      8) Calculate the convex hull of the projected two-dimensional points to obtain the minimum circumscribed polygon of the workpiece.
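      A compact numpy-based sketch of Steps 1)–8) is given below (illustrative only; the intrinsics fx, fy, cx, cy, the iteration count and the distance threshold are placeholders, and the plane fit follows the distance-sum criterion of (4) and (5)):

      import numpy as np
      from scipy.spatial import ConvexHull

      def plane_from_points(p1, p2, p3):
          # Plane coefficients (A, B, C, D); the cross product is equivalent to (3).
          normal = np.cross(p2 - p1, p3 - p1)
          d = -np.dot(normal, p1)
          return normal[0], normal[1], normal[2], d

      def segment_above_plane(points, k=200, d_thresh=0.01):
          # points: (N, 3) array back-projected from the depth image (Step 1)).
          # Sample three points k times, keep the plane minimizing the summed
          # point-to-plane distance of (4) and (5), then keep the points farther
          # than d_thresh from that plane (Step 6)).
          best, best_cost = None, np.inf
          for _ in range(k):
              p1, p2, p3 = points[np.random.choice(len(points), 3, replace=False)]
              a, b, c, d = plane_from_points(p1, p2, p3)
              norm = np.sqrt(a * a + b * b + c * c)
              if norm < 1e-9:
                  continue  # degenerate (collinear) sample
              dist = np.abs(points @ np.array([a, b, c]) + d) / norm
              cost = dist.sum()
              if cost < best_cost:
                  best_cost, best = cost, (a, b, c, d, norm)
          a, b, c, d, norm = best
          dist = np.abs(points @ np.array([a, b, c]) + d) / norm
          return points[dist > d_thresh]

      def project_and_hull(points_3d, fx, fy, cx, cy):
          # Step 7): pinhole projection back to the image plane.
          u = fx * points_3d[:, 0] / points_3d[:, 2] + cx
          v = fy * points_3d[:, 1] / points_3d[:, 2] + cy
          pts_2d = np.stack([u, v], axis=1)
          # Step 8): convex hull as the minimum circumscribed polygon.
          hull = ConvexHull(pts_2d)
          return pts_2d[hull.vertices]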

      Statistics for the single workpiece and multiple workpiece datasets are shown in Table 1. The datasets are evenly distributed: the numbers of images in the different categories are very close.

      Description | Total | Train | Test
      Single simulated workpiece without background | 302 400 | 280 000 | 22 400
      Single simulated workpiece with real scene background | 302 400 | 280 000 | 22 400
      Single real workpiece with real scene background | 241 920 | 220 000 | 21 920
      Single real workpiece with real scene background non-composite dataset | 241 920 | 220 000 | 21 920
      Multiple simulated workpieces without background | 50 000 | 35 000 | 15 000
      Multiple simulated workpieces with real scene background | 50 000 | 35 000 | 15 000
      Multiple real workpieces with real scene background | 50 000 | 35 000 | 15 000
      Multiple real workpieces with real scene background non-composite dataset | 60 000 | 50 000 | 10 000

      Table 1.  Single workpiece and multiple workpieces datasets statistics data

    • We use Faster R-CNN (Faster R-convolutional neural network)[33], SSD (single shot multibox detector)[34] and YOLO (you only look once: unified, real-time object detection)[35] as the testing networks to train and test the datasets proposed in this paper. Faster R-CNN and SSD use VGG16 as their backbone network, and YOLO uses darknet as its backbone network. To verify the effectiveness of the proposed datasets, we train directly on them and analyze the following aspects: the training loss curve, test set accuracy, receiver operating characteristic (ROC) curve, and area under curve (AUC) value. We have also used this dataset to train deep neural networks in [36], where the training loss curves are reported as well.
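      To illustrate how the generated JSON annotations can feed an off-the-shelf detector, the sketch below uses torchvision's ResNet-50-FPN Faster R-CNN as a stand-in (not the VGG16/darknet configurations used in this paper); the annotation schema and file names follow the earlier hypothetical example.

      import json
      import torch
      import torchvision
      from PIL import Image
      from torchvision.transforms.functional import to_tensor

      def load_sample(image_path, annotation_path, label_to_id):
          # Convert one generated image and its JSON annotation into the
          # (image, target) pair expected by torchvision detection models.
          image = to_tensor(Image.open(image_path).convert("RGB"))
          with open(annotation_path) as f:
              ann = json.load(f)
          boxes = torch.tensor([obj["bbox"] for obj in ann["objects"]],
                               dtype=torch.float32)
          labels = torch.tensor([label_to_id[obj["label"]] for obj in ann["objects"]],
                                dtype=torch.int64)
          return image, {"boxes": boxes, "labels": labels}

      # 84 workpiece categories plus the background class.
      model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=85)
      model.train()
      optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

      # One training step on a single (hypothetical) sample:
      # image, target = load_sample("workpiece_0001.png", "workpiece_0001.json", label_to_id)
      # loss_dict = model([image], [target])      # dict of RPN and ROI-head losses
      # loss = sum(loss_dict.values())
      # optimizer.zero_grad(); loss.backward(); optimizer.step()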

      1) Loss curve of the training set in the training process: The loss curves of Faster R-CNN trained on the polishing workpiece dataset generated in this paper are shown in Figs. 12 and 13. Fig. 12 shows the loss curves on the four datasets with only one workpiece in the image, and Fig. 13 shows the loss curves on the four datasets with multiple workpieces in the image. The training set contains 66 000 images drawn from the eight datasets mentioned above, with 8 250 images from each dataset.

      Figure 12.  Curve of loss value in the single workpiece training process

      Figure 13.  Curve of loss value in the multiple workpieces training process

      Among them, Dataset 1 in Figs. 12 and 13 represents the composite dataset of simulated workpiece(s) without environmental background, Dataset 2 the composite dataset of simulated workpiece(s) with a real environmental background, Dataset 3 the composite dataset of real workpiece(s) with a real environmental background, and Dataset 4 the non-composite dataset of real workpiece(s) with a real environmental background. The loss of Faster R-CNN converges well on all training sets.

      2) Detection accuracy, ROC curve and AUC value: Precision is defined as the ratio of true positive samples to all samples judged positive. The ROC curve is the receiver operating characteristic curve; its abscissa is the false positive rate and its ordinate is the true positive rate. AUC (area under curve) is the area under the ROC curve.
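      In terms of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN), these are the standard definitions:

      $ {\rm Precision} = \dfrac{TP}{TP+FP}, \quad {\rm TPR} = \dfrac{TP}{TP+FN}, \quad {\rm FPR} = \dfrac{FP}{FP+TN}. $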

      This paper uses Faster R-CNN, SSD and YOLO, the current mainstream object detection neural networks, to test 2 500 single workpiece images and 2 500 multiple workpiece images. The number of workpiece categories in the testing set is 84. The images include multi-view images generated using the Phong illumination model and workpieces captured in real scenes. In the multiple workpiece testing set, the number of workpieces in each picture ranges from 1 to 15. Table 2 shows the detection accuracy of Faster R-CNN, SSD and YOLO on the polishing workpiece dataset generated in this paper. The test results are obtained after the deep neural networks are trained on all eight datasets.

      Description | Single workpiece | Multiple workpieces
      Workpiece category number | 84 | 84
      Workpiece number | 1 | 1~15
      Image number in training set | 66 000 | 66 000
      Image number in testing set | 2 500 | 2 500
      Detection accuracy of Faster R-CNN | 94.33% | 92.85%
      Detection accuracy of SSD | 91.25% | 90.60%
      Detection accuracy of YOLO | 89.75% | 88.25%

      Table 2.  Detection accuracy of Faster R-CNN, SSD and YOLO

      Fig. 14 shows the ROC curves of Faster R-CNN, SSD and YOLO trained on the dataset proposed by this paper:

      Figure 14.  ROC curve of Faster R-CNN

      The AUC values of Faster R-CNN, SSD and YOLO, corresponding to the ROC curves, are shown in Table 3.

      Network | Faster R-CNN | SSD | YOLO
      AUC | 0.865 | 0.854 | 0.8218

      Table 3.  AUC statistical data on testing dataset

      From the experimental results above, it can be seen that the dataset generated in this paper contains abundant workpiece features and enables the loss of deep convolutional neural networks trained on it to converge to a small value.

    • The method proposed in this paper can save a lot of manpower and time, and the larger the image dataset is, the more time is saved. Table 4 compares the annotation time of manual annotation and automatic annotation under different data scales.

      Number of images | Manual annotation time (1 person) | Our annotation time
      1 | 30 | 0.1

      Table 4.  Annotation time (s) comparison between manual and automatic annotation

      The experimental results show that, with the powerful computing ability of modern computers, the automatic ground truth annotation method proposed in this paper can complete the annotation in a very short time, whereas manual annotation costs a lot of manpower and time.

    • When no public dataset exists for a specific field, researchers need to generate their own datasets. The method proposed in this paper can be used to generate customized datasets quickly and efficiently. At the same time, the types of datasets generated in this paper have good integrity and meet the requirements of training deep convolutional neural networks in different scenarios.

    • To address the scarcity of industrial workpiece datasets that can be used directly to train deep neural networks, as well as the time-consuming problem of dataset annotation, this paper proposes an automatic deep learning dataset generation method and a “ground truth” annotation information generation method. Based on theoretical workpiece models, real environmental background data and images collected from individual actual workpieces, and using an illumination transformation and a multi-view/multi-source image fusion algorithm, many types of datasets are generated, namely the multi-view theoretical workpiece, the theoretical workpiece model with a real environmental background, and the real workpiece with a real environmental background. The experimental results show that the datasets generated by the proposed method satisfy the training requirements of deep neural networks and make the trained networks useful in practical industrial applications. To solve the time-consuming problem of data annotation for training deep neural networks, an annotation file in the JSON format is generated from the two-dimensional workpiece dataset using the algorithms proposed in this paper. Compared with the traditional manual annotation method, the proposed annotation method greatly reduces the annotation time while obtaining the same or higher annotation quality. The experimental results of Faster R-CNN, SSD and YOLO show that the annotation files generated by this method can meet the training and testing requirements of deep learning.
