-
Recently, with the development of artificial intelligence, computer vision and data mining, many intelligent methods have been applied to sports[1-7]. On the one hand, some methods predict the outcomes of sports games and help people analyze sports at a macroscopic level. Owramipur et al.[8] proposed using a Bayesian network to predict the results of football games. Miljkovic et al.[9] formalized the prediction of game outcomes in the National Basketball Association (NBA) league as a classification problem and used the naïve Bayes method to solve it. On the other hand, some methods analyze specific behaviors on the sports ground and help people understand sports at a microscopic level. Nakai et al.[10] used a human pose estimation method, i.e., OpenPose[11], to predict the shooting probability of basketball free throws. Liu and Carr[12] proposed using a random decision forest and a context-conditioned motion model to detect and track the players in sports games.
Object detection methods[13-17] have recently developed rapidly with the convolutional neural network (CNN)[18-22]. These methods have been widely applied in scenarios such as smart video surveillance and autonomous driving. In this paper, we introduce an object detection method into the area of basketball. We propose a camera-based basketball scoring detection (BSD) method with a you only look once (YOLO) basketball hoop detector and a frame difference scoring detector.
Basketball games need scoring tables or scorers to keep records of the points. However, it is impractical to arrange scoring tables or scorers for all basketball games, especially for spare-time games. Automatic scoring detection and counting is therefore useful for basketball courts. Moreover, in the age of social media, amateur players want to share their highlights on basketball courts, such as the moment of scoring, through basketball apps such as “QIUJI” and “REEE Camera”. Detecting basketball scoring automatically helps these apps cut, upload, store and share highlight videos. Considering the requirements on real-time performance and practicability, we adopt simple and efficient methods in the proposed BSD model. As demonstrated in Fig. 1, videos of the basketball court are taken as the input of the proposed BSD model. Afterwards, a real-time object detection method, i.e., YOLO[16, 17], is used to locate the basketball hoop in the video. Then, the frame difference method[23] is used to detect whether there is any object motion in the area of the hoop to determine the basketball scoring condition. Using the YOLO method and the frame difference method ensures that the proposed model meets the key requirement, i.e., real-time running, when applied in real scenarios. The experiments and applications in practical basketball court scenarios verify the effectiveness of the proposed BSD method.
-
In recent years, CNN based methods have been widely applied in computer vision research[13-17, 24, 25]. As a fundamental task of computer vision, object detection with CNN has developed rapidly. The regions with CNN features (RCNN)[13] method first uses the selective search method to generate image regions of interest (RoI). Then, the CNN model is used to extract the visual features of all the generated regions. Finally, the support vector machine (SVM)[26] classifier is used to determine the object categories in the image regions. Because the RCNN method needs to extract the CNN features of all generated regions, the entire process is very slow and far from applicable in real-time systems. The faster RCNN[14] method is proposed to accelerate the process and improve the detection accuracy of the RCNN method. The faster RCNN model replaces the selective search with a region proposal network. Meanwhile, the RoI pooling layers avoid extracting CNN features separately for every generated image region. Therefore, the faster RCNN method runs much more efficiently than the RCNN method. However, the faster RCNN method is still far from real-time processing. Both RCNN and faster RCNN are two-stage methods, which consist of region proposal and object classification.
In order to further accelerate the process, some one-stage methods, e.g., the single shot multi-box detector (SSD) method[15] and the YOLO method[16, 17], are proposed. In the YOLO method, the category classification and the object localization are performed by a single convolutional neural network. The processes of localization and classification are unified as one regression task. Compared to the two-stage methods, the detection accuracy of the one-stage methods is relatively inferior. However, the one-stage methods run much faster, which meets the requirement of the real-time basketball scoring detection task.
Furthermore, Lin et al.[27] proposed RetinaNet, which is a one-stage detector solving the class imbalance problem in a flexible manner. RetinaNet proposes focal loss to suppress the gradients of easy negative samples. A feature pyramid network is used to detect multi-scale objects at different levels of feature maps. Zhang et al.[28] introduced RefineDet, which is a one-stage detector. RefineDet proposes a cascaded optimization framework to refine the manually defined anchors and significantly improve the anchor quality and final prediction accuracy. Cai and Vasconcelos[29] proposed cascade RCNN, which adopts an idea similar to RefineDet by refining proposals in a cascaded manner. Law and Deng[30] proposed a novel anchor-free framework, CornerNet, which detects objects as a pair of corners. CornerNet predicts class heat-maps, pair embeddings and corner offsets on each position of the feature maps to match the objects. Another anchor-free framework is CenterNet[31], which combines the idea of center-based methods and obtains significant improvements compared to baseline methods. Ghiasi et al.[32] proposed the neural architecture search-feature pyramid network (NAS-FPN), which adopts neural architecture search to find new feature pyramid architectures. NAS-FPN consists of both top-down and bottom-up connections to fuse features with a variety of different scales. Similarly, EfficientNet[33] uses a neural architecture search to design a detection network, which carefully balances network depth, width, and resolution. Zhu et al.[34] presented a feature selection anchor free (FSAF) framework which can be plugged into one-stage detectors with FPN structure. Fan et al.[35] proposed a novel few-shot object detection network which aims at detecting objects of unseen categories with only a few annotated training examples. Dong et al.[36] proposed CentripetalNet, which uses centripetal shift to pair corner key points from the same instance. CentripetalNet adopts a cross-star deformable convolution network to perform feature adaptation, which makes the features more informative at the corners.
The frame difference method[23, 37-40] has been widely used in motion detection applications and is robust and efficient for scenarios with fixed-position cameras. When there are moving objects in the videos, the gray scales of the corresponding pixels differ between consecutive frames. Therefore, we can calculate the difference map between the consecutive frames. The pixels of stationary objects are set as 0 in the difference map. The pixels of moving objects have gray scale variations. If the variations are larger than a set threshold, we can consider the object to be moving in the video. The calculation of the frame difference method is quite fast, which meets the real-time running requirement of the proposed BSD method.
-
The framework of the proposed camera-based basketball scoring detection method is shown in Fig. 1. In order to guarantee that the BSD method is able to process the basketball video in real-time with satisfactory detection accuracy, the well developed YOLO object detection method and frame difference motion detection method are adopted.
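The two-stage flow described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `run_bsd`, `detect_hoop` and `score_fn` are hypothetical names, and the two callables stand in for the YOLO hoop detector and the frame difference scoring detector.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


def run_bsd(frames: Iterable,
            detect_hoop: Callable[[object], Box],
            score_fn: Callable[[object, object, Box], bool]) -> List[int]:
    """Return the indices of the frames flagged as scoring moments."""
    frames = iter(frames)
    base = next(frames)        # the first frame serves as the base frame
    hoop = detect_hoop(base)   # the hoop detector runs once (fixed camera)
    hits = []
    for n, frame in enumerate(frames, start=1):
        # Motion is only checked inside the hoop box found on the base frame.
        if score_fn(base, frame, hoop):
            hits.append(n)
    return hits
```

Because the hoop detector is called only once, the per-frame cost is determined by the scoring function alone.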
-
When operating the BSD method, the video clips of the basketball court are taken as inputs to the model. Afterwards, the first frame of the video is used as the base frame to determine the position of the basketball hoop. The YOLO network is implemented as the hoop detector. The hoop detection results are shown with red boxes in Fig. 1. Because the positions of the cameras are fixed on the basketball courts, the hoop areas are stationary in the videos. Therefore, the hoop detector only needs to be run once, on the base frame. When training the YOLO hoop detector, the loss function $loss$ contains four parts, i.e., the classification loss $loss_{cls}$, the center coordinate loss $loss_{xy}$, the width-height coordinate loss $loss_{wh}$ and the confidence loss $loss_{conf}$.
The network structure of the adopted YOLO based basketball hoop detector is shown in Fig. 2. For each input image, there are three scales with three default anchors for detection, i.e., 9 anchor sizes in total. When the base frame of the video is input to the YOLO hoop detector, the image is divided into a K × K grid. We assume there are M object box boundary candidates in every cell. There is only one object, i.e., the basketball hoop, that we need to detect. Therefore, the binary cross entropy loss is implemented as the classification loss. Formally,
Figure 2. Framework of the YOLO based basketball hoop detector. For each input image, there are three scales with three default anchors for detection. Colored figures are available in the online version.
$ loss_{cls} = - \sum\limits_{i = 0}^{K \times K} I_i^{hoop} \sum\limits_{c \in class} \left[ \hat p_i(c)\log \left( p_i(c) \right) + \left( 1 - \hat p_i(c) \right)\log \left( 1 - p_i(c) \right) \right] $ (1)
where $I_i^{hoop} = 1$ if the hoop appears in cell $i$, otherwise $I_i^{hoop} = 0$. The classification candidate $class$ contains only one category. $\hat p_i(c)$ denotes the conditional probability for the hoop in cell $i$.
The center coordinate loss $loss_{xy}$ is implemented to locate the center position of the predicted boundary box of the basketball hoop. Formally,
$ loss_{xy} = \sum\limits_{i = 0}^{K \times K} \sum\limits_{j = 0}^{M} I_{ij}^{hoop} \left[ \left( x_i - \hat x_i \right)^2 + \left( y_i - \hat y_i \right)^2 \right] $ (2)
where $(x_i, y_i)$ indicates the ground truth center position of the basketball hoop boundary box. $(\hat x_i, \hat y_i)$ is the predicted center position of the boundary box. $I_{ij}^{hoop} = 1$ if the $j$-th boundary box in the $i$-th cell is responsible for detecting the basketball hoop, i.e., the intersection-over-union (IoU) between the ground truth hoop boundary box and the predicted boundary box is larger than 0.5. Otherwise, $I_{ij}^{hoop} = 0$.
The width-height coordinate loss $loss_{wh}$ is implemented to determine the width and the height of the hoop boundary box. Formally,
$ loss_{wh} = \sum\limits_{i = 0}^{K \times K} \sum\limits_{j = 0}^{M} I_{ij}^{hoop} \left( 2 - w_i \times h_i \right) \times \left[ \left( w_i - \hat w_i \right)^2 + \left( h_i - \hat h_i \right)^2 \right] $ (3)
where the indicator $I_{ij}^{hoop}$ works similarly to the one in (2). $w_i$ and $h_i$ indicate the width and the height of the ground truth hoop boundary box. $\hat w_i$ and $\hat h_i$ are the predicted results. The factor $2 - w_i \times h_i$ in $loss_{wh}$ is set for hoops of small size. The smaller the hoop is, the larger the value of the loss becomes, which benefits the detection of small basketball hoops in the images.
Last but not least, the confidence loss $loss_{conf}$ is implemented to measure the confidence that a basketball hoop is in the predicted boundary box. The confidence loss is also in the form of binary cross entropy loss, as follows:
$ \begin{split} loss_{conf} = & - \sum\limits_{i = 0}^{K \times K} \sum\limits_{j = 0}^{M} I_{ij}^{hoop} \left[ \hat C_i \log \left( C_i \right) + ( 1 - \hat C_i )\log \left( 1 - C_i \right) \right] \\ & - \lambda_{nohoop} \sum\limits_{i = 0}^{K \times K} \sum\limits_{j = 0}^{M} I_{ij}^{nohoop} \left[ \hat C_i \log \left( C_i \right) + ( 1 - \hat C_i )\log \left( 1 - C_i \right) \right]. \end{split} $ (4)
The confidence loss contains two terms corresponding to two conditions, i.e., the hoop is in the predicted boundary box and the hoop is not in the predicted boundary box. Similar to (2) and (3), $I_{ij}^{hoop} = 1$ and $I_{ij}^{nohoop} = 0$ if the IoU between the ground truth hoop boundary box and the $j$-th predicted boundary box in the $i$-th cell is larger than 0.5. If the IoU between the ground truth hoop boundary box and the predicted boundary box is smaller than 0.5, $I_{ij}^{hoop} = 0$ and $I_{ij}^{nohoop} = 1$. $\hat C_i$ denotes the confidence score of the $j$-th predicted boundary box in the $i$-th cell. $\lambda_{nohoop}$ is a hyperparameter that weights the no-hoop term.
The entire loss function of the YOLO hoop detector is the weighted sum of the above four loss functions. Formally,
$ loss = \lambda_{coord}(loss_{xy} + loss_{wh}) + loss_{conf} + loss_{cls} $ (5)
where $\lambda_{coord}$ is the hyperparameter that controls the scale of $loss_{xy}$ and $loss_{wh}$.
-
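The loss terms of the hoop detector in (1)-(5) can be illustrated for a single responsible box with a short sketch. It is not the training code of the paper. The function names are hypothetical, the binary cross entropy is written in its standard target/prediction form, and box coordinates are assumed to be normalised to [0, 1] so that the factor $2 - w \times h$ stays positive.

```python
import math


def bce(target, pred, eps=1e-7):
    """Standard binary cross entropy for one value, as used in (1) and (4)."""
    pred = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def box_loss(gt, pred):
    """Return (loss_xy, loss_wh) of (2) and (3) for one responsible box.

    Boxes are (x, y, w, h); normalised coordinates are an assumption.
    """
    x, y, w, h = gt
    px, py, pw, ph = pred
    loss_xy = (x - px) ** 2 + (y - py) ** 2
    # Small ground truth boxes get a weight close to 2, large ones close to 1.
    loss_wh = (2.0 - w * h) * ((w - pw) ** 2 + (h - ph) ** 2)
    return loss_xy, loss_wh


def total_loss(loss_xy, loss_wh, loss_conf, loss_cls, lambda_coord=5.0):
    """Weighted sum of (5); lambda_coord = 5 is the value used in the experiments."""
    return lambda_coord * (loss_xy + loss_wh) + loss_conf + loss_cls
```

For the same absolute width and height error, a small ground truth box yields a larger `loss_wh` than a large one, which is the effect the factor in (3) is designed to produce.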
After locating the position of the basketball hoop, the frame difference scoring detector is operated only on the hoop area, between the base frame and the following frames. The detailed pipeline is shown in Fig. 3. There are two key gating operations on the forward images. Gate 1 binarizes the value of each pixel with a proper threshold, while Gate 2 selects the holes by ranking the sizes of all holes. After these steps, the largest hole is extracted for the final scoring prediction.
Figure 3. Pipeline of frame difference scoring detector. Gate 1 is to make the values of each pixel into binary with a proper threshold, while Gate 2 means selecting the holes through ranking the size of all holes. After these steps, the largest hole is extracted for final scoring prediction.
Here, the pixel of the base frame in the video is denoted as $B(x,y)$. The pixel of the current frame is denoted as $F_n(x,y)$. Before the frame difference operation, the video frames are converted to gray images. Then, the Gaussian filter is used for noise reduction. Afterwards, the frame difference operates as follows:
$ D_n(x,y) = \left| F_n(x,y) - B(x,y) \right|. $ (6)
Afterwards, a threshold $T$ is set to obtain the binary result of the difference image. Formally,
$ D_{n'}(x,y) = \left\{ \begin{array}{ll} 1, & {\rm{if}}\; D_n(x,y) > T\\ 0, & {\rm{otherwise}}. \end{array} \right. $ (7)
The pixels of the moving objects are marked as 1 on the binary image. The pixels of the stationary objects are marked as 0. The connected component analysis operates afterwards to obtain the final frame difference results.
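Equations (6) and (7) amount to a per-pixel absolute difference followed by a threshold. The sketch below shows this on gray images stored as nested lists. The helper names are hypothetical, the Gaussian filtering step is omitted for brevity, and the threshold value in the usage lines is made up, since the value of $T$ is not given here.

```python
def crop(image, box):
    """Cut the hoop area out of a gray image; box is (x_min, y_min, x_max, y_max)."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def frame_difference(base, frame, threshold):
    """Binary motion mask following (6) and (7) for two equally sized gray images."""
    return [
        [1 if abs(f - b) > threshold else 0 for b, f in zip(base_row, frame_row)]
        for base_row, frame_row in zip(base, frame)
    ]


# Illustrative values only: a 2 x 3 base frame and a later frame.
base = [[10, 10, 10], [10, 10, 10]]
frame = [[10, 60, 10], [12, 10, 90]]
mask = frame_difference(base, frame, threshold=25)
```

Only the two pixels whose gray value changed by more than the threshold are marked as moving; the change from 10 to 12 is treated as noise.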
Lastly, the model determines the result of the basketball shot by detecting whether the moving object, i.e., the ball, goes through the red box in Fig. 1, i.e., the basketball hoop.
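The connected component step and the final decision can be sketched as follows. The exact decision rule is not specified in the text, so the area-ratio test in `is_scoring` and its default value are assumptions made purely for illustration; 4-connectivity is likewise an assumed choice.

```python
from collections import deque


def largest_component_area(mask):
    """Pixel count of the largest 4-connected region of 1s in a binary mask."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    best = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search over one connected region.
            seen[r][c] = True
            queue = deque([(r, c)])
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            best = max(best, area)
    return best


def is_scoring(hoop_mask, min_area_ratio=0.2):
    """Hypothetical rule: the largest moving region must fill enough of the hoop box."""
    rows = len(hoop_mask)
    cols = len(hoop_mask[0]) if rows else 0
    if rows * cols == 0:
        return False
    return largest_component_area(hoop_mask) / (rows * cols) >= min_area_ratio
```

Keeping only the largest region discards small isolated responses, such as net movement or sensor noise, before the decision is made.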
-
In order to train the proposed basketball scoring detection model, we have collected a basket and hoop image dataset. The dataset contains the photos of the basketball court, the surveillance images of the basketball court and the screenshots of basketball games. The example images of the dataset are shown in Fig. 4. This dataset contains 4 000 images, in which 3 500 images are used for training the YOLO hoop detector and 500 images are used for testing. In the dataset, 3 638 images contain only one basketball hoop. Meanwhile, 362 images contain at least two basketball hoops. The images are labeled with the data format of PASCAL VOC Challenge[41]. When testing the entire BSD method, five long basketball videos which are captured in real scenarios are used. The videos are resized to 1 280 × 720. The five test videos contain 44 basketball scoring moments. The representative frames of the test videos are shown in Fig. 5.
-
During the experiments, the YOLO basketball hoop detector is pre-trained on the Microsoft COCO[42] dataset and then fine-tuned on the collected basket and hoop image dataset. The images are resized to 544 × 544 before being input to the YOLO basketball hoop detector. During the training of the hoop detector, the learning rate is set to $1 \times 10^{-4}$ for the first 20 training epochs and to $1 \times 10^{-6}$ for the following 30 training epochs. The batch size in the training process is set to 6. The grid number K in the equations is set to 13, 26 and 52 for the three detection scales in the experiments. Non-maximum suppression[43] is used to obtain the final basketball hoop detection results. The number of object box boundary candidates in every cell, i.e., M, is set to 3 in the experiments. In (5), ${\lambda _{coord}} = 5$. For the confidence loss, ${\lambda _{nohoop}} = 0.5$.
Because the YOLO method relies on predefined default anchors, the anchor size selection is crucial. To this end, we implement k-means clustering on all anchors across the training set. After clustering, there are 9 anchor groups, as shown in Fig. 6. Then, we rank the 9 cluster centers according to their sizes and set them as the default anchors for training.
Figure 6. Anchor size selection for YOLO training. We implement the k-means clustering on all anchors across the training set. After clustering, we rank the 9 cluster centers according to their sizes and set them as the default anchors for training.
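The anchor selection can be illustrated with a small k-means sketch over box sizes. Several points are assumptions rather than details from the text: the clustered items are taken to be the (width, height) pairs of the annotated boxes, the distance is 1 - IoU as is common for YOLO anchors, the initialisation is a simple deterministic choice, and the sample boxes are invented. The experiments use 9 clusters; the example uses 3 to stay readable.

```python
def wh_iou(a, b):
    """IoU of two boxes given as (w, h), assuming they share one corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iters=50):
    """Cluster (w, h) pairs and return the k centers ranked by area."""
    # Deterministic start: one box from the middle of each area quantile.
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    step = len(ordered) / k
    centers = [ordered[int((i + 0.5) * step)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for box in boxes:
            # Highest IoU is the same as smallest 1 - IoU distance.
            nearest = max(range(k), key=lambda i: wh_iou(box, centers[i]))
            groups[nearest].append(box)
        updated = [
            (sum(b[0] for b in g) / len(g), sum(b[1] for b in g) / len(g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if updated == centers:
            break
        centers = updated
    return sorted(centers, key=lambda c: c[0] * c[1])


# Invented sample of annotated hoop box sizes in pixels.
sample = [(10, 10), (12, 12), (11, 11), (50, 60), (52, 58), (100, 120), (110, 115)]
anchors = kmeans_anchors(sample, k=3)
```

Ranking the centers by area matches the step of sorting the cluster centers by size before assigning them to the detection scales.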
The experimental platform is based on the Nvidia GTX TitanX GPU and the Intel i7-9750 CPU.
-
The quantitative experimental results of the YOLO basketball hoop detector on the collected basket and hoop image dataset are shown in Table 1. The quantitative experimental results of the BSD method on the five long test videos are shown in Table 2.
Table 1. Quantitative experimental results of the YOLO basketball hoop detector
| Method | FPS | AP50 (%) |
| --- | --- | --- |
| YOLO hoop detector | 30 | 92.59 |
Table 2. Quantitative experimental results of the basketball scoring detection method
| Video | Ground truth scoring | Detected scoring | FPS | Accuracy (%) |
| --- | --- | --- | --- | --- |
| 1 | 10 | 9 | 80 | 90.00 |
| 2 | 12 | 11 | 80 | 91.67 |
| 3 | 9 | 8 | 80 | 88.89 |
| 4 | 10 | 8 | 80 | 80.00 |
| 5 | 3 | 3 | 80 | 100.00 |
| All | 44 | 39 | 80 | 88.64 |
As shown in Table 1, the YOLO basketball hoop detector processes 30 frames per second (FPS) on the Nvidia GTX TitanX GPU in the experiments. Because the detector is active only for the first frame of the input video, it is fast enough for real-time processing[44]. For the hoop detection accuracy, the average precision at IoU = 0.5 (AP50) is used as the metric. Under the AP50 metric, a detection result is considered correct if the intersection-over-union (IoU) between the detected hoop box and the ground truth hoop annotation is larger than 0.5. The AP50 of the YOLO hoop detector on the basket and hoop image dataset is 92.59%. The receiver operating characteristic (ROC) curve is shown in Fig. 7. It shows that the YOLO based basketball hoop detector is robust for practical application.
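The IoU test behind the AP50 metric can be written in a few lines. This is a generic sketch with hypothetical function names, assuming boxes are given as (x_min, y_min, x_max, y_max).

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred, gt, threshold=0.5):
    """A detection counts as correct under AP50 when its IoU exceeds 0.5."""
    return iou(pred, gt) > threshold
```

For example, a predicted box shifted by half its width against an equally sized ground truth box has an IoU of 1/3 and is therefore counted as incorrect.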
The quantitative experimental results of the entire basketball scoring detection method are demonstrated in Table 2. The BSD method is able to process 80 frames per second, which meets the demand of real-time processing. The proposed model is tested on the five basketball videos with 44 scoring moments. The scoring detection accuracy on all the videos is 88.64%.
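The accuracy column of Table 2 can be reproduced from the counts, assuming that accuracy is the ratio of detected to ground truth scoring moments (an interpretation that matches every row of the table). The per-frame time budget implied by 80 FPS is included as well.

```python
# Counts taken from Table 2, videos 1 to 5.
ground_truth = [10, 12, 9, 10, 3]
detected = [9, 11, 8, 8, 3]

# Accuracy per video and over all videos, in percent.
per_video = [round(100.0 * d / g, 2) for d, g in zip(detected, ground_truth)]
overall = round(100.0 * sum(detected) / sum(ground_truth), 2)

# At 80 frames per second, each frame has a budget of 12.5 ms.
frame_budget_ms = 1000.0 / 80
```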
Some unsuccessful basketball hoop detection examples on the collected dataset are shown in Fig. 8. As we can see from the unsuccessful examples, the background of some images is complicated and the view angle of some hoops varies considerably.
The qualitative results of the basketball hoop detection on the five test long videos are shown in Fig. 9. The basketball hoop is marked with yellow boxes by the hoop detector. As is demonstrated, the YOLO basketball hoop detector successfully detects the locations of the basketball hoops for all the real scenario videos.
The sample results of the frame difference scoring detector are shown in Fig. 10. The left images are the original frames of the test videos. The right images are the corresponding frame difference results.
-
In order to demonstrate the influence of different sizes of input images to the YOLO basketball hoop detector, we change the image size to test the model on the collected basket and hoop image dataset. The quantitative results are shown in Table 3.
Table 3. Quantitative experimental results of the YOLO basketball hoop detector with different input image size
| Input image size | AP50 (%) |
| --- | --- |
| 256 × 256 | 70.26 |
| 288 × 288 | 80.75 |
| 320 × 320 | 84.82 |
| 352 × 352 | 88.11 |
| 384 × 384 | 92.04 |
| 416 × 416 | 91.60 |
| 448 × 448 | 91.59 |
| 480 × 480 | 91.26 |
| 512 × 512 | 92.11 |
| 544 × 544 | 92.59 |
| 576 × 576 | 92.33 |
| 608 × 608 | 92.10 |
As demonstrated in Table 3, the YOLO basketball hoop detector with the input image size of 544 × 544 achieves the best performance on the basket and hoop image dataset. Therefore, we select 544 × 544 as the input image size of the basketball hoop detector and the basketball scoring detection system.
-
In order to compare the efficiency and effectiveness of the YOLO basketball hoop detector in the proposed model, the ablation study between the YOLO hoop detector and the other detection methods is implemented. The representative two-stage object detection method, i.e., faster RCNN, and another representative one-stage method, i.e., SSD, are used for the ablation study. The quantitative results are shown in Table 4.
Table 4. Ablation study of different basketball hoop detector
| Method | Detection method category | FPS | AP50 (%) |
| --- | --- | --- | --- |
| Faster RCNN detector | Two-stage | 0.80 | 89.26 |
| SSD detector | One-stage | 1.25 | 91.30 |
| YOLO detector | One-stage | 30 | 92.59 |
As demonstrated by the ablation study results, the YOLO based detector achieves higher basketball hoop detection accuracy than the faster RCNN based detector and the SSD based detector. Moreover, the YOLO based detector runs much faster than the other two detectors, which ensures that the entire model is a real-time system.
As shown in Table 4, the SSD based basketball hoop detector achieves detection accuracy close to that of the YOLO based detector (91.30% versus 92.59% AP50). However, its running speed (1.25 FPS) is far below that of the YOLO based detector (30 FPS).
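The size of the speed gap follows directly from the FPS column of Table 4, as the short calculation below shows.

```python
# FPS values taken from Table 4.
fps = {"Faster RCNN": 0.80, "SSD": 1.25, "YOLO": 30.0}

# How many times faster the YOLO detector runs than each alternative.
speedup = {name: fps["YOLO"] / value for name, value in fps.items() if name != "YOLO"}
```

The YOLO detector is about 37.5 times faster than the faster RCNN detector and 24 times faster than the SSD detector in these experiments.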
-
The proposed method has been installed at multiple basketball courts in Beijing. The examples of the BSD system user interface are shown in Fig. 11. The system is used for automatic scoring detection and acts as an intelligent scorer for the games.
Figure 11. Interface of the BSD system. The top is the scoring detection demo, while the bottom is the player activity analysis window. The whole BSD system could provide comprehensive understandings from raw videos, which could contribute to the training of the team and help the player perform better.
As shown in the bottom line of Fig. 11, the player detection function is added to the proposed BSD system, which is also a preparation for further analysis of the players on court.
-
In this paper, a basketball scoring detection method is proposed. The proposed BSD method contains a YOLO basketball hoop detector and a frame difference scoring detector. The model takes the videos of basketball game as input and detects the scoring condition automatically. The BSD method can process the basketball videos in real-time with satisfactory scoring detection accuracy.
In the future, larger scale datasets will be collected to further improve the BSD method. More efficient and effective detection models will be applied in the BSD method. Moreover, the BSD method will include more basketball analysis functions.
-
This work was supported by Research on Educational Science Planning in Zhejiang Province (No. 2019SCG195), “13th Five Year Plan” Teaching Reform Project of Zhejiang University and Shandong Provincial Key Research and Development Program (Major Scientific and Technological Innovation Project) (No. 2019JZZY010119).
Camera-based Basketball Scoring Detection Using Convolutional Neural Network
- Received: 2020-05-20
- Accepted: 2020-09-25
- Published Online: 2020-12-23
-
Key words:
- Computer vision /
- convolutional neural network (CNN) /
- frame difference /
- motion detection /
- object detection /
- real-time system
Abstract: Recently, deep learning methods have been applied in many real scenarios with the development of convolutional neural networks (CNNs). In this paper, we introduce a camera-based basketball scoring detection (BSD) method with CNN based object detection and frame difference-based motion detection. In the proposed BSD method, the videos of the basketball court are taken as inputs. Afterwards, the real-time object detection, i.e., you only look once (YOLO) model, is implemented to locate the position of the basketball hoop. Then, the motion detection based on frame difference is utilized to detect whether there is any object motion in the area of the hoop to determine the basketball scoring condition. The proposed BSD method runs in real-time with satisfactory basketball scoring detection accuracy. Our experiments on the collected real scenario basketball court videos show the accuracy of the proposed BSD method. Furthermore, several intelligent basketball analysis systems based on the proposed method have been installed at multiple basketball courts in Beijing, and they provide good performance.
Citation: Xu-Bo Fu, Shao-Long Yue, De-Yun Pan. Camera-based Basketball Scoring Detection Using Convolutional Neural Network. International Journal of Automation and Computing. doi: 10.1007/s11633-020-1259-7