1. Introduction
The past decade has witnessed great progress of deep convolutional neural networks (DCNNs) in computer vision. DCNNs have made breakthroughs in various vision tasks such as image classification[1-3], object detection[4-6], and image segmentation[7-9]. Although impressive accuracy has been achieved, the high computational cost of several to dozens of billions of floating-point operations (BFLOPs) remains a major obstacle in practical applications. Especially when dealing with visual recognition problems in mobile and embedded scenarios, such as mobile phones, robots, and unmanned vehicles, high computational cost often means high hardware requirements and high power consumption. Therefore, in recent years, how to design efficient lightweight deep models while ensuring high performance has become a hot topic, and more and more researchers have been attracted to this area.
The research on lightweight deep models is of great significance. In addition to reducing hardware requirements and saving energy, lightweight deep models have the following advantages:
1) They make edge computing possible, so visual feedback becomes more timely. This is important for scenarios that require real-time responses, such as self-driving cars.
2) Personal privacy is protected by avoiding uploading the original image data to the server.
3) They save network bandwidth because there is no need to transfer images frequently between devices.
4) They reduce the cost of deploying visual recognition models, which benefits industrial adoption.
A series of techniques for lightweight deep models have been proposed in recent years, including model pruning[10, 11], model quantization[12, 13], lightweight model structure design[14, 15], knowledge distillation[16-18], etc. Among these, lightweight model architecture design is one of the most attractive techniques. It has two obvious advantages: it is not limited by a pre-trained model and it can be easily customized for different recognition tasks. MobileNet[14], one of the representative works in this area, proposes a convolution operation called the separable convolution layer, which decouples a traditional convolution layer into a pointwise convolution layer and a depthwise convolution layer. DCNNs constructed with separable convolution layers yield compact models with comparable performance. ShuffleNet[15] explores channel shuffle operations with grouped convolutions to further improve the efficiency and accuracy of lightweight models. The latest GhostNet[19] proposes a Ghost module that concatenates the output features from a pointwise convolution layer and a following cheap depthwise convolution layer. The Ghost module can reduce the overall computational cost by reducing the usage of pointwise convolutions while retaining rich features.
Abundant and even redundant features have been shown to be essential to guarantee performance[19]. Thus, a block design with feature expansion first and feature reduction afterwards has become a standard procedure in lightweight model architecture design. Feature expansion and reduction rely on pointwise convolutions because pointwise convolutions can vary the number of output channels. Especially with the introduction of the inverted residual bottleneck[20], the pointwise convolution is extensively utilized to expand or squeeze features. As a result, pointwise convolutions usually account for more than 90% of the overall computational cost in lightweight deep models.
This paper aims at solving the problem of excessive computational cost in the feature expansion stage. We explore effective ways to reduce the computational cost while still obtaining abundant features, which results in the novel Poker module proposed here. The proposed Poker module can generate abundant effective features via groups of cheap depthwise convolutions. To be specific, a small pointwise convolution layer first performs channel correlation, then groups of depthwise convolutions are employed, and finally a self-attention mechanism is utilized to further enhance the expanded features. With the same input and output channels, the overall computational cost of the proposed Poker module is 1.5× to 3.0× lower than that of the modules in previous works. Based on the Poker module, we design efficient lightweight models, PokerNets. Experiments on a large-scale image classification dataset verify that PokerNets surpass SOTA lightweight models such as GhostNet[19], the MobileNet series[14, 20, 21] and the ShuffleNet series[15, 22], etc. PokerNets reduce the computational cost by a large margin while obtaining comparable or even higher recognition accuracy.
The rest of the paper is organized as follows. Section 2 reviews the related work in this area. Section 3 presents the proposed Poker module and PokerNets. Section 4 reports the experiments and analysis. Finally, conclusions are given in Section 5.
2. Related work
Building lightweight deep convolutional neural networks with a good balance between computational efficiency and recognition accuracy has been a hot topic in recent years. Hand-crafted architecture design was the mainstream at first. However, with the rise of deep reinforcement learning, automatic neural architecture search has attracted more and more attention. These two branches are the most related to this paper.
As far as we know, NIN[23] is the first work that adopts a large number of pointwise convolutions (also called 1 × 1 convolutions) to build a deep convolutional neural network. It achieves accuracy comparable to AlexNet[1] with far fewer parameters. Aiming directly at reducing the parameters of DCNNs, SqueezeNet[24] employs a large number of pointwise convolutions with expand and squeeze operations and obtains deep models whose parameters occupy only a few megabytes or even less than 1 MB. Recent works focus on multiply-addition operations (MAdds), a more direct and fair indicator of computing efficiency, instead of the number of parameters of the neural network. MobileNetV1[14] creatively proposes a more efficient convolution structure, the separable convolution layer. It decouples the traditional convolution layer into a pointwise convolution layer responsible for channel correlation and a depthwise convolution layer responsible for spatial correlation, as illustrated in Fig. 1. Following this work, MobileNetV2[20] and MobileNetV3[21] further improve the efficiency as well as the accuracy by optimizing the residual bottleneck design and by combining the hand-crafted method with automatic neural architecture search, respectively. ShuffleNet[15] proposes grouped convolution integrated with channel shuffle to further reduce the computational cost. ShuffleNetV2[22] drops the grouped convolution proposed in ShuffleNet[15] and optimizes the network architecture based on principles derived from practical experiments. ShiftNet[25] employs shift operations alternated with pointwise convolutions instead of expensive spatial convolutions. The latest GhostNet[19] concatenates the features from a small pointwise convolution layer and the features from a follow-up depthwise convolution layer to reduce the computational cost while obtaining abundant features.
Figure 1. Traditional convolution layer (left) and separable convolution layer consisting of a depthwise convolution layer (middle) and a pointwise convolution layer (right).
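To make the decomposition in Fig. 1 concrete, the following PyTorch sketch contrasts a traditional 3 × 3 convolution with a depthwise separable counterpart; the channel sizes and layer arrangement are illustrative assumptions rather than code from any of the cited networks.

```python
# A minimal PyTorch sketch of the decomposition in Fig. 1 (channel sizes are
# illustrative assumptions, not taken from any cited implementation).
import torch
import torch.nn as nn

m, n, d = 64, 128, 3  # input channels, output channels, kernel size

# Traditional convolution: spatial and channel correlation in one layer.
traditional = nn.Conv2d(m, n, kernel_size=d, padding=d // 2, bias=False)

# Depthwise separable convolution: depthwise (spatial) + pointwise (channel).
separable = nn.Sequential(
    nn.Conv2d(m, m, kernel_size=d, padding=d // 2, groups=m, bias=False),  # depthwise
    nn.Conv2d(m, n, kernel_size=1, bias=False),                            # pointwise
)

x = torch.randn(1, m, 56, 56)
print(traditional(x).shape, separable(x).shape)  # both: [1, 128, 56, 56]
# Parameter counts: roughly d*d*m*n versus d*d*m + m*n, a large reduction.
print(sum(p.numel() for p in traditional.parameters()),
      sum(p.numel() for p in separable.parameters()))
```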
The rise of deep reinforcement learning brings new vitality to lightweight network architecture design. Works[26-32] pioneered the use of deep reinforcement learning to automate architecture design. Limited by the exponentially growing search space, early works could only focus on cell-level structure search. MnasNet[27] makes block-level structure search possible by introducing multi-objective Pareto optimization and simplifying the search space from cell level to block level (i.e., grouping convolutional layers into a number of predefined skeletons). Then, the differentiable architecture frameworks with gradient-based optimization utilized in works[26, 33, 34] further reduce the computational cost of structure search. Automatic neural architecture design algorithms focusing especially on lightweight deep models have emerged in works[34-38]. It is worth emphasizing that MobileNetV3[21] integrates hand-crafted methods and automatic architecture design methods together, which presents a promising trend of fusing the two branches.
Our work can be treated as a hand-crafted method. We propose a novel module to reduce the MAdds of a network by decreasing the use of pointwise convolutions in the feature expansion stage, while generating abundant effective features to guarantee performance. Methods such as automatic architecture search[26-34], quantization[12, 13, 39] and network pruning[10, 11, 40-45] are orthogonal to ours and can be combined with it to further boost performance.
3. Approach
In this section, we first illustrate how the Poker module is composed and how it generates a large number of effective features via depthwise convolutions. Then, we design bottlenecks for different stride and channel configurations. Finally, an efficient PokerNet architecture with high recognition accuracy is developed.
3.1. Poker module
Recent works on building lightweight deep convolutional neural networks, such as the MobileNet series[14, 20, 21], ShuffleNet series[15, 22] and GhostNet[19], always combine pointwise convolution and depthwise convolution as the base module to develop efficient CNNs. In these works, pointwise convolutions usually take up more than 90% of the overall computational cost. Especially at the feature expansion stage, previous works mainly rely on the pointwise convolution layer because a depthwise convolution layer can only generate features whose output channels equal its input channels.
Consider a convolution layer with input features $X \in {{\bf{R}}^{m \times h \times w}}$, where m is the number of input channels and h and w are the height and width of the features, respectively. Assume the kernel size is d × d, the stride equals 1 and the number of output channels is n, where n = km (k = 1, 2, 3, ···) because we only consider the expansion stage.
Take the latest state-of-the-art (SOTA) Ghost module as an example. As illustrated in Fig. 2, the Ghost module first takes the m input features and generates the first n/2 features by utilizing pointwise convolutions. Then, it generates another n/2 features by applying depthwise convolutions to the former n/2 features. Finally, the two sets of n/2 features are concatenated to form the final n output features.
Figure 2. Ghost module from GhostNet[19]
The complexity of a Ghost module is

$ {C_g} = h \times w \times m \times \frac{n}{2} + h \times w \times {d^2} \times \frac{n}{2} = h \times w \times m \times \left(\frac{{km}}{2} + \frac{{k{d^2}}}{2}\right). $   (1)

As shown in (1), d is often taken as 3 in practice and m $\gg {d^2}$. This leads to pointwise convolutions occupying most of the overall computational cost in lightweight deep models.
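For concreteness, the following is a minimal PyTorch sketch of a Ghost-style module as described above; the kernel size d = 3 and the BN/ReLU placement are our assumptions, and this is not the official GhostNet implementation.

```python
# A minimal sketch of a Ghost-style module as described above: a pointwise
# convolution generates n/2 features and a cheap depthwise convolution derives
# the other n/2 from them.  d = 3 and the BN/ReLU placement are assumptions,
# not the official GhostNet code.
import torch
import torch.nn as nn

class GhostModuleSketch(nn.Module):
    def __init__(self, m, n, d=3):
        super().__init__()
        assert n % 2 == 0
        self.primary = nn.Sequential(            # pointwise: m -> n/2
            nn.Conv2d(m, n // 2, 1, bias=False),
            nn.BatchNorm2d(n // 2), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(              # depthwise: n/2 -> n/2
            nn.Conv2d(n // 2, n // 2, d, padding=d // 2, groups=n // 2, bias=False),
            nn.BatchNorm2d(n // 2), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)   # n output channels

print(GhostModuleSketch(40, 120)(torch.randn(1, 40, 28, 28)).shape)  # [1, 120, 28, 28]
```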
Based on the above observation, a natural question is whether there is another way to generate abundant effective features while reducing the usage of pointwise convolutions. We show that there is, and the Poker module proposed here attains this goal.
As illustrated in Fig. 3, the proposed Poker module is composed of three parts. The first part is a small pointwise convolution layer whose input and output channels are both M. The second part is the crucial expansion part: it utilizes k groups of low-cost depthwise convolutions to expand the features from M to kM channels, where k is the expansion factor as well as the number of groups and can be customized in practice. A Squeeze-and-Excite (SE) layer[46] is adopted in the third part, which is vital and responsible for selectively enhancing or suppressing the features according to the response values acquired through a self-attention mechanism.
Figure 3. An illustration of the proposed Poker module. Features are expanded from the number of M generated by pointwise convolution to the number of kM by using k groups of cheap depthwise convolutions in parallel. A SE layer[46] is followed to enhance or suppress the expanded features.
In practice, the first and second parts are followed by batch normalization (BN)[47] and ReLU activation[1]. It should be emphasized that the third part is an integral part of our module, whose importance will be further analyzed in the experimental section.
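To make the structure concrete, below is a minimal PyTorch sketch of the Poker module as described above. The 3 × 3 depthwise kernels, the SE reduction ratio of 4 and the concatenation of the k parallel branches are our assumptions; this is a sketch, not the authors' released code.

```python
# A sketch of the Poker module of Section 3.1: a small pointwise convolution
# (M -> M), k parallel depthwise convolutions whose outputs are concatenated
# (M -> kM), and an SE layer.  Kernel size, SE reduction ratio and branch
# concatenation are assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class SELayer(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))            # squeeze: global average pool
        return x * w[:, :, None, None]             # excite: channel-wise rescaling

class PokerModuleSketch(nn.Module):
    def __init__(self, M, k, d=3):
        super().__init__()
        self.pw = nn.Sequential(                   # channel correlation, M -> M
            nn.Conv2d(M, M, 1, bias=False), nn.BatchNorm2d(M), nn.ReLU(inplace=True))
        self.branches = nn.ModuleList([            # k cheap depthwise expansions
            nn.Sequential(
                nn.Conv2d(M, M, d, padding=d // 2, groups=M, bias=False),
                nn.BatchNorm2d(M), nn.ReLU(inplace=True))
            for _ in range(k)])
        self.se = SELayer(k * M)                   # enhance/suppress expanded features

    def forward(self, x):
        x = self.pw(x)
        x = torch.cat([b(x) for b in self.branches], dim=1)   # M -> kM
        return self.se(x)

print(PokerModuleSketch(40, k=3)(torch.randn(1, 40, 28, 28)).shape)  # [1, 120, 28, 28]
```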
Our proposed Poker module has three major differences from existing efficient model designs. 1) Existing methods such as the MobileNet series[14, 20, 21] and ShuffleNet series[15, 22] utilize pointwise convolutions to expand features, which is computationally expensive. In contrast, the Poker module first adopts a small pointwise convolution to perform channel correlation, and then groups of low-cost depthwise convolutions are utilized to expand features, which largely reduces the computational cost, as analyzed below. 2) Compared with the module in GhostNet[19], which concatenates the features from a pointwise convolution layer and a following depthwise convolution layer, the number of depthwise convolution groups in the Poker module can be customized. 3) The Squeeze-and-Excite layer is merely an option in existing methods, but it is an essential part of our design.
3.2. Design of bottlenecks
When it comes to bottleneck design, we distinguish three situations: 1) the stride is 1 and the input channels equal the output channels; 2) the stride is 1 and the input channels do not equal the output channels; 3) the stride is 2, regardless of the input and output channels.
Taking advantage of the proposed Poker module, we redesign the bottlenecks as shown in Fig. 4. Following the inverted residual bottleneck[20] design concept, in the main branch, we first expand the input features from the previous bottleneck with the Poker module, and then the Ghost module is adopted to narrow the features because it is more efficient under this condition. In particular, when the stride is 2, a depthwise convolution with stride 2 is inserted between the two modules to downsample the feature maps, as Fig. 4(c) shows. In the shortcut branch of situations 2) and 3), a depthwise convolution with stride 1 or 2, respectively, followed by a pointwise convolution is added to the shortcut path, as Figs. 4(b) and 4(c) illustrate. As for situation 1), the input is added to the output of the main branch directly.
Figure 4. Design of bottlenecks. (a) Poker bottleneck with stride equal to 1 and M equal to N; (b) Poker bottleneck with stride equal to 1 and M not equal to N; (c) Poker bottleneck with stride equal to 2.
With the design of different bottlenecks, a lightweight deep model can be assembled through choosing and stacking these bottlenecks according to the input channels, the output channels and the size of feature maps.
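As a concrete illustration, the following sketch assembles a stride-2 Poker bottleneck (Fig. 4(c)) from the module sketches given earlier; the shortcut details, kernel sizes and normalization placement are assumptions based on the description above, not the authors' implementation.

```python
# A sketch of the stride-2 Poker bottleneck in Fig. 4(c), reusing the
# PokerModuleSketch and GhostModuleSketch classes defined above.  Exact kernel
# sizes and normalization details are assumptions.
import torch
import torch.nn as nn

class PokerBottleneckSketch(nn.Module):
    def __init__(self, M, exp, N, k, stride=2, d=3):
        super().__init__()
        assert exp == k * M                       # expansion produced by k branches
        self.expand = PokerModuleSketch(M, k)     # M -> exp
        self.down = nn.Sequential(                # stride-2 depthwise downsampling
            nn.Conv2d(exp, exp, d, stride=stride, padding=d // 2, groups=exp, bias=False),
            nn.BatchNorm2d(exp)) if stride == 2 else nn.Identity()
        self.narrow = GhostModuleSketch(exp, N)   # exp -> N
        if stride == 1 and M == N:
            self.shortcut = nn.Identity()
        else:                                     # depthwise + pointwise shortcut
            self.shortcut = nn.Sequential(
                nn.Conv2d(M, M, d, stride=stride, padding=d // 2, groups=M, bias=False),
                nn.BatchNorm2d(M),
                nn.Conv2d(M, N, 1, bias=False), nn.BatchNorm2d(N))

    def forward(self, x):
        return self.narrow(self.down(self.expand(x))) + self.shortcut(x)

print(PokerBottleneckSketch(40, 120, 80, k=3, stride=2)(torch.randn(1, 40, 28, 28)).shape)
# -> [1, 80, 14, 14]
```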
3.3. PokerNet architecture
Following the basic architecture of GhostNet[19] and MobileNetV3[21] for their proven effectiveness, we propose PokerNet by replacing the Ghost bottlenecks with Poker bottlenecks. Table 1 shows the overall architecture of our proposed PokerNet. Some bottlenecks are kept unchanged because the computational cost of Ghost bottlenecks in these parts is lower than that of Poker bottlenecks; the selection criteria between the traditional Ghost bottleneck and the Poker bottleneck will be given in detail later. PokerNet can be divided into three parts, i.e., the head, the body, and the end. The head is a standard convolution layer with a stride of 2 and 20 channels. The body is a stack of Poker bottlenecks or Ghost bottlenecks. The last bottleneck of each stage increases the channels and downsamples the features, except for the last stage, whose last layer adopts a pointwise convolution directly to expand the features from 160 to 960 channels. It should be noted that, in the first two stages and in the second and third bottlenecks of the fourth stage, Ghost bottlenecks are adopted for their lower computational cost given the feature size and channels. The end is an average pooling layer followed by a pointwise convolution layer that transforms the feature vector to 1280 dimensions before it is fed to the final classifier. The expansion channels of the first three bottlenecks in the fourth stage differ slightly from the numbers in GhostNet and MobileNetV3 because the expansion factor k must be an integer. The architecture illustrated in Table 1 is just a reference; other hyper-parameter tuning methods can be used to further boost performance.
Table 1. Overall architecture of PokerNet-1.0×. Size, #in, #exp and #out denote the input feature map size, input channels, expansion channels and output channels of a bottleneck, respectively. P-bneck and G-bneck denote the Poker bottleneck and the Ghost bottleneck, respectively. Act denotes the activation type.

| Size | Ops | #in | #exp | #out | SE | Stride | Act |
| 224 | Conv2d 3×3 | 3 | − | 20 | − | 2 | ReLU6 |
| 112 | G-bneck | 20 | 20 | 20 | − | 1 | ReLU6 |
| 112 | G-bneck | 20 | 40 | 20 | − | 2 | ReLU6 |
| 56 | G-bneck | 20 | 60 | 20 | − | 1 | ReLU6 |
| 56 | G-bneck | 20 | 120 | 40 | − | 2 | ReLU6 |
| 28 | P-bneck | 40 | 120 | 40 | 1 | 1 | ReLU6 |
| 28 | P-bneck | 40 | 240 | 80 | 1 | 2 | ReLU6 |
| 14 | P-bneck | 80 | 160 | 80 | 1 | 1 | ReLU6 |
| 14 | G-bneck | 80 | 160 | 80 | 1 | 1 | ReLU6 |
| 14 | G-bneck | 80 | 160 | 80 | 1 | 1 | ReLU6 |
| 14 | P-bneck | 80 | 240 | 120 | 1 | 1 | ReLU6 |
| 14 | P-bneck | 120 | 480 | 120 | 1 | 1 | ReLU6 |
| 14 | P-bneck | 120 | 720 | 160 | 1 | 2 | ReLU6 |
| 7 | P-bneck | 160 | 960 | 160 | 1 | 1 | HS |
| 7 | P-bneck | 160 | 960 | 160 | 1 | 1 | HS |
| 7 | P-bneck | 160 | 960 | 160 | 1 | 1 | HS |
| 7 | P-bneck | 160 | 960 | 160 | 1 | 1 | HS |
| 7 | Conv2d 1×1 | 160 | − | 960 | − | 1 | HS |
| 7 | AvgPool 7×7 | 960 | − | 960 | − | 1 | ReLU6 |
| 1 | Conv2d 1×1 | 960 | − | 1280 | − | 1 | ReLU6 |
| 1 | Conv2d 1×1 | 1280 | − | 1000 | − | 1 | ReLU6 |
We also utilize a width multiplier α to customize the width of the network, considering the limited computing resources of hardware in practical usage. A PokerNet with width multiplier α is denoted as PokerNet-α×. With a smaller α, the computational cost of the model scales roughly as ${\alpha ^2}$, but this inevitably lowers performance. In the experiments, we adopt two values of α, i.e., 0.5 and 1.0, and denote the corresponding models as PokerNet-0.5× and PokerNet-1.0×. The overall architectures of both PokerNet-0.5× and PokerNet-1.0× will be illustrated in detail in the supplementary materials.
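As a simple illustration of how a width multiplier might be applied, the helper below scales channel counts by α; the divisor-based rounding is a common convention that we assume here, as the paper does not specify its rounding scheme.

```python
# A small sketch of applying a width multiplier alpha (the divisor-based
# rounding is an assumed convention, not taken from the paper).
def scale_channels(c, alpha, divisor=4):
    """Scale channel count c by alpha and round to a multiple of `divisor`."""
    return max(divisor, int(c * alpha + divisor / 2) // divisor * divisor)

print([scale_channels(c, 0.5) for c in (20, 40, 80, 160, 960)])
# Convolution MAdds scale with in_channels * out_channels, hence roughly alpha^2.
```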
4. Experiments
In this section, we first detail the training setup and evaluation metrics utilized in our experiments. Then, the computational cost of the proposed Poker module is analyzed in detail, and PokerNet models built with the novel Poker module are tested on a large-scale image classification benchmark. Finally, we present two ablation studies: one on the influence of the Squeeze-and-Excite layer on performance, and one verifying the effectiveness of the hard-swish nonlinearity.
4.1. Dataset and evaluation metrics
As has become standard, ImageNet[48] is adopted for all our classification experiments to verify the effectiveness of our proposed Poker module and Poker models. ImageNet is a large-scale image dataset. It contains over 1.2 M training images and 50 K validation images belonging to 1000 classes.
For the pre-processing strategy[49, 50] of the ImageNet dataset, all images are first resized so that the shorter side is 256. Then, before feeding images to the network, we randomly crop 224 × 224 patches and apply mean subtraction, random flipping and color jittering. No other data augmentation tricks, such as multi-scale training, are adopted. The final model is evaluated on the validation set with a single center crop.
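A torchvision sketch of the described pipeline is given below; the exact color-jitter strengths and normalization statistics are assumptions, since the paper does not specify them.

```python
# A sketch of the described preprocessing with torchvision; jitter strengths and
# normalization constants are assumptions (the paper does not specify them).
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.Resize(256),                       # shorter side resized to 256
    transforms.RandomCrop(224),                   # random 224x224 patches
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # mean subtraction (ImageNet stats)
                         std=[0.229, 0.224, 0.225]),
])

val_tf = transforms.Compose([                     # evaluation: single center crop
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```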
Our method is evaluated in terms of multiply-addition operations (MAdds) and recognition accuracy. As for latency, we compare our proposed PokerNet models with the current SOTA GhostNet models under different width multipliers on an Intel i7-7700K CPU @ 4.2 GHz with 32 GB RAM. It is worth emphasizing that the inference speed depends strongly on code optimization and the hardware platform. In our experiments, we simply run the k groups of depthwise convolutions in parallel using multithreading. In fact, our Poker module has excellent parallelism and all the kM convolutions can be processed in parallel, so the inference speed of PokerNet models can be further improved with better code optimization.
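For reference, a simple way to measure single-image CPU latency of the kind reported in Table 3 is sketched below; the warm-up and iteration counts are arbitrary choices, not the authors' protocol.

```python
# A sketch of how single-image CPU latency could be measured; warm-up and
# iteration counts are arbitrary choices.
import time
import torch

@torch.no_grad()
def cpu_latency_ms(model, iters=100, warmup=10, size=224):
    model.eval()
    x = torch.randn(1, 3, size, size)             # batch size 1, as in Table 3
    for _ in range(warmup):
        model(x)
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    return (time.perf_counter() - start) / iters * 1000.0
```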
4.2. Training details
In PokerNet models, all convolutional layers are followed by batch normalization layers. Both the batch normalization layer and the activation layer are adopted after the first module (Poker module or Ghost module) in each bottleneck, while the second module only uses the batch normalization layer.
Our models are trained on 8 GPUs using the standard PyTorch stochastic gradient descent (SGD) optimizer with 0.9 momentum. Most of the hyper-parameter settings in our experiments follow ShuffleNetV2[22]. We use a batch size of 1024 (128 images per GPU), 300 epochs in total, a multi-step learning rate policy (with milestones at epochs [160, 240, 260, 280]), dropout of 0.2 on the features before the classifier layer, an initial learning rate of 0.5 and label smoothing of 0.1. The weight decay of PokerNet-0.5× is set to 1×10−5 and that of PokerNet-1.0× to 4×10−5 because we find that they face different degrees of over-fitting.
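A minimal PyTorch sketch of this optimization setup is shown below; the learning-rate decay factor is an assumption, as the paper does not state it, and label smoothing via CrossEntropyLoss requires PyTorch 1.10 or later.

```python
# A sketch of the optimizer and schedule described above; model and data loader
# construction are omitted, and the decay factor gamma is an assumption.
import torch

def build_training(model):
    criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)   # PyTorch >= 1.10
    optimizer = torch.optim.SGD(model.parameters(), lr=0.5, momentum=0.9,
                                weight_decay=4e-5)                # 1e-5 for PokerNet-0.5x
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[160, 240, 260, 280], gamma=0.1)    # gamma assumed
    return criterion, optimizer, scheduler
```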
4.3. Analysis of computational cost
From Fig. 3, the computational cost of our proposed Poker module is

$ {C_p} = h \times w \times m \times m + h \times w \times {d^2} \times m \times k = h \times w \times m \times (m + k \times {d^2}). $   (2)

For simplicity, the cost of the Squeeze-and-Excite part is omitted because it is negligible compared with the other parts.
Comparing it with the latest SOTA Ghost module gives the ratio

$ \frac{{{C_g}}}{{{C_p}}} = \frac{k}{2} \times \frac{{m + {d^2}}}{{m + k{d^2}}}. $   (3)

Since d is usually taken as 3 in practice, we plot the ratio curves under different k and m values in Fig. 5.
Figure 5. Computational cost comparison between the Ghost module and the proposed Poker module under different k (expansion factor) and m (number of feature channels) values.
As Fig. 5 illustrates, the ratio is less than 1 only when k = 2 or in some cases where m and k are both small. In most cases, the Poker module achieves 1.5× to 3× lower computational cost than the Ghost module. This explains why we utilize Poker bottlenecks in most situations in PokerNet but Ghost bottlenecks in the first two stages and in the second and third bottlenecks of the fourth stage, as shown in Table 1. Fig. 5 can be treated as the selection criterion between Poker bottlenecks and other bottlenecks when building different kinds of PokerNets.
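The ratio in (3) can also be checked numerically; the short script below evaluates it for d = 3 and a few representative m and k values, mirroring the trend shown in Fig. 5.

```python
# A quick numeric check of the cost ratio in (3) for d = 3.
def cost_ratio(m, k, d=3):
    """C_g / C_p: values above 1 mean the Poker module is cheaper than the Ghost module."""
    return (k / 2) * (m + d ** 2) / (m + k * d ** 2)

for m in (20, 40, 80, 160):
    print(m, [round(cost_ratio(m, k), 2) for k in (2, 3, 4, 6)])
```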
4.4. Image classification on ImageNet
We mainly evaluate our models on the ImageNet 2012 classification dataset. Several modern lightweight network architectures are selected as competitors, including the MobileNet series[14, 20, 21], ShuffleNet series[15, 22], GhostNet[19], etc. The results are summarized in Table 2. Models such as ProxylessNAS[26], FBNet[34] and MnasNet[27] are not listed in the table because their computational cost (more than 200M MAdds) is much higher than that of our models.
Table 2. Comparison of SOTA lightweight models over parameters, MAdds and accuracy on the ImageNet dataset. M stands for millions.

| Models | Params (M) | MAdds (M) | Top-1 Acc. (%) | Top-5 Acc. (%) |
| ShuffleNetV1 0.5× (g=8)[15] | 1.0 | 40 | 58.8 | 81.0 |
| MobileNetV2 0.35×[20] | 1.7 | 59 | 60.3 | 82.9 |
| ShuffleNetV2 0.5×[22] | 1.4 | 41 | 61.1 | 82.6 |
| MobileNetV3 Small 0.75×[21] | 2.4 | 44 | 65.4 | − |
| GhostNet 0.5×[19] | 2.6 | 42 | 66.2 | 86.6 |
| PokerNet 0.5× | 2.4 | 39 | 66.8 | 87.0 |
| MobileNetV1 0.5×[14] | 1.3 | 150 | 63.3 | 84.9 |
| MobileNetV2 0.6×[20] | 2.2 | 141 | 66.7 | − |
| ShuffleNetV1 1.0× (g=3)[15] | 1.9 | 138 | 67.8 | 87.7 |
| ShuffleNetV2 1.0×[22] | 2.3 | 146 | 69.4 | 88.9 |
| MobileNetV3 Large 0.75×[21] | 4.0 | 155 | 73.3 | − |
| GhostNet 1.0×[19] | 5.2 | 141 | 73.9 | 91.4 |
| PokerNet 1.0× | 5.7 | 119 | 73.6 | 91.5 |
In terms of computational cost, we follow [19] and group these models into two levels, i.e., ~50M MAdds and ~150M MAdds. A consistent observation from the table is that larger MAdds usually mean higher accuracy among lightweight models.
As can be seen from the results in Table 2, PokerNet outperforms all the competitors at every computational cost level. Compared with the previous SOTA Ghost models, our models achieve 7.1% (PokerNet-0.5×) and 15.6% (PokerNet-1.0×) MAdds reduction with comparable or even higher recognition accuracy.
The computational cost advantage of our proposed PokerNet models can also be seen from the latency test shown in Table 3. By simply running the k groups of depthwise convolutions in parallel using multithreading, our PokerNet models run much faster than the current SOTA GhostNet models. If all the kM convolutions are optimized to run in parallel, this advantage will be even more obvious. This is due to the efficient Poker module we design: with the combination of groups of cheap depthwise convolutions and the SE layer, the Poker module ensures the generation of abundant effective features while minimizing the computational cost.
Table 3. Comparison of inference speed between PokerNet models and GhostNet models (current SOTA lightweight models) under different width multipliers. The batch size is 1.

| Models | Latency (ms) |
| PokerNet 0.5× | 32.9 |
| GhostNet 0.5× | 36.7 |
| PokerNet 1.0× | 46.2 |
| GhostNet 1.0× | 51.3 |
Fig. 6 gives a more intuitive comparison between PokerNets and different SOTA models in terms of computational cost and Top-1 accuracy.
4.5. Ablation study on the Squeeze-and-Excite layer
The core idea of the Poker module is to utilize groups of depthwise convolutions to generate a large number of features and adopt the SE layer to enhance or suppress the generated features. In Section 3.1, we point out that the SE layer is essential to guarantee the performance in our design of Poker module.
To evaluate the importance of the SE layer in the Poker module, we conduct ablation experiments on PokerNet-0.5× and PokerNet-1.0×. We compare the final recognition accuracy with the SE layer added to or removed from the Poker module, with all other settings kept exactly the same.
The results are shown in Table 4. We can see that the Poker modules with the SE layer slightly increase the MAdds (0.4M in PokerNet-0.5× and 2M in PokerNet-1.0×), but they outperform the ones without the SE layer by a large margin (3.3% in PokerNet-0.5× and 3.0% in PokerNet-1.0× in Top-1 accuracy). This strongly supports our decision to treat the SE layer as an essential part of the proposed Poker module.
Table 4. Ablation study of the Squeeze-and-Excite layer in our proposed Poker module.

| Models | SE | Params (M) | MAdds (M) | Top-1 Acc. (%) | Top-5 Acc. (%) |
| PokerNet 0.5× | Yes | 2.39 | 39.4 | 66.8 | 87.0 |
| PokerNet 0.5× | No | 1.81 | 39.0 | 63.5 | 84.4 |
| PokerNet 1.0× | Yes | 5.70 | 119 | 73.6 | 91.5 |
| PokerNet 1.0× | No | 3.41 | 117 | 70.6 | 89.5 |
4.6. Ablation study on hard-swish
In MobileNetV3[21], a nonlinearity called hard-swish was introduced and proved to significantly improve the accuracy of lightweight deep models. The hard version of swish is defined as

$ {\rm hard{\text{-}}swish}(x) = x\frac{{{\rm ReLU6}(x + 3)}}{6}. $

To study the influence of hard-swish on PokerNet models, we follow the work[21] and replace the nonlinearities of the deeper layers of PokerNet models from ReLU to hard-swish, because hard-swish provides the most benefit at the lowest computational cost in deeper layers, as claimed in [21]. In practice, hard-swish is utilized in the layers of the fifth stage, as shown in Table 1.
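For reference, this nonlinearity can be written directly in PyTorch, and torch.nn.functional.hardswish implements the same expression:

```python
# The hard-swish above written out in PyTorch; F.hardswish implements the same
# expression x * ReLU6(x + 3) / 6.
import torch
import torch.nn.functional as F

def hard_swish(x):
    return x * F.relu6(x + 3.0) / 6.0

x = torch.linspace(-4, 4, 9)
print(torch.allclose(hard_swish(x), F.hardswish(x)))  # True
```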
We compare the PokerNet models under three settings of nonlinearities, i.e., ReLU, ReLU6 and ReLU6 + hard-swish. Table 5 reports the results. The ReLU6 + hard-swish setting beats the other two settings for both PokerNet-0.5× and PokerNet-1.0×. Compared with the models using ReLU, the hard-swish versions achieve 0.4% and 0.6% Top-1 accuracy improvements, respectively. This confirms the effectiveness of the hard-swish nonlinearity in lightweight deep models.
Table 5. Ablation study of nonlinearities in our proposed Poker module.

| Models | Act | Params (M) | MAdds (M) | Top-1 Acc. (%) | Top-5 Acc. (%) |
| PokerNet 0.5× | ReLU | 2.39 | 39.4 | 66.4 | 86.2 |
| PokerNet 0.5× | ReLU6 | 2.39 | 39.4 | 66.5 | 86.5 |
| PokerNet 0.5× | ReLU6+HS | 2.39 | 39.4 | 66.8 | 87.0 |
| PokerNet 1.0× | ReLU | 5.70 | 119 | 73.0 | 90.4 |
| PokerNet 1.0× | ReLU6 | 5.70 | 119 | 73.2 | 90.8 |
| PokerNet 1.0× | ReLU6+HS | 5.70 | 119 | 73.6 | 91.5 |
5. Conclusions
Pointwise convolutions usually occupy most of the computational cost in lightweight deep models. To build more efficient deep models, this paper presents a novel Poker module that reduces the usage of pointwise convolutions. The proposed Poker module utilizes groups of cheap depthwise convolutions instead of pointwise convolutions to expand the features, and an SE layer is adopted in the Poker module to enhance or suppress the expanded features in a self-attention manner. Experiments conducted on benchmark models and datasets show that the proposed module is standardized and can be applied wherever feature expansion is needed. Moreover, PokerNets built with the proposed Poker module beat SOTA lightweight models in both computational cost and recognition accuracy.
In the future, we would like to apply our proposed PokerNets to other computer vision tasks such as object detection and image segmentation. As for the model itself, we would like to explore the customized expansion factor k and the self-attention manner to construct more efficient lightweight deep models while boosting the performance.
Acknowledgements
This work was supported by National Natural Science Foundation of China (Nos. 61525306, 61633021, 61721004, 61806194, U1803261 and 61976132), Major Project for New Generation of AI (No. 2018AAA0100400), Beijing Nova Program (No. Z201100006820079), Shandong Provincial Key Research and Development Program (No. 2019JZZY010119), and CAS-AIR.
PokerNet: Expanding Features Cheaply via Depthwise Convolutions
- Received: 2020-12-23
- Accepted: 2021-02-01
- Published Online: 2021-03-24
Key words: deep learning, depthwise convolution, lightweight deep model, model compression, model acceleration
Abstract: Pointwise convolution is usually utilized to expand or squeeze features in modern lightweight deep models. However, it takes up most of the overall computational cost (usually more than 90%). This paper proposes a novel Poker module to expand features by taking advantage of cheap depthwise convolutions. As a result, the Poker module can greatly reduce the computational cost while generating a large number of effective features to guarantee performance. The proposed module is standardized and can be employed wherever feature expansion is needed. By varying the stride and the number of channels, different kinds of bottlenecks are designed to plug the proposed Poker module into the network, so a lightweight model can be easily assembled. Experiments conducted on benchmarks reveal the effectiveness of our proposed Poker module. Our PokerNet models reduce the computational cost by 7.1%−15.6% and achieve comparable or even higher recognition accuracy than previous state-of-the-art (SOTA) models on the ImageNet ILSVRC 2012 classification dataset. Code is available at https://github.com/diaomin/pokernet.