-
Person re-identification (Re-ID) aims to match pedestrian images of the same identity across multiple non-overlapping cameras. It has been a hot research topic over the past decade, with practical applications such as video surveillance and pedestrian retrieval for public security. However, person Re-ID faces many challenges, such as cross-view changes in a person's pose, illumination, viewpoint, background clutter, occlusion, etc.
Extensive efforts have been made in the past decade to overcome these challenges. Early works were dedicated to either feature extraction[1-3] or metric learning[3, 4] schemes. Feature extraction based methods aim to learn discriminative features that remain invariant for the same person while distinguishing among different persons. Metric learning based approaches mainly train a distance measurement or a classifier to address intra-class discrepancy and inter-class similarity. With the extensive application of deep learning[5], convolutional neural networks (CNNs) have been widely used in person Re-ID to automatically learn more discriminative features[3, 6-16]. These methods mainly employ deep classification models to learn discriminative feature representations of the visual appearance of person images. Meanwhile, deep metric learning methods are also widely applied in person Re-ID[17-24] by minimizing intra-class distances and maximizing inter-class distinctions. Recently, with the blossoming of generative adversarial networks (GANs), some researchers have used generative models to relieve pose and camera style variations across cameras in person Re-ID[25-28].
Other researchers have focused on temporal information or gait information[29, 30] for video based person Re-ID. Despite the great progress on person Re-ID, most of the existing single red green blue (RGB)-modality Re-ID methods and benchmark datasets assume favorable lighting, which limits their applicability in real-life adverse environments, such as poor illumination in bad weather or at nighttime.
Thermal infrared (TIR) cameras can capture infrared radiation emitted by subjects with a temperature above absolute zero[31]. These cameras are insensitive to lighting conditions and have a strong ability to penetrate haze and smog. Therefore, the advantage of thermal images is that they are not affected by low illumination, illumination changes and shadows, as shown in Fig. 1.
Figure 1. Advantage of TIR images compared to RGB images. Due to the illumination change or background clutter, the same person may appear differently under the visible RGB cameras, while the TIR images attenuate the influence in these challenging scenarios.
Recently, as the quality of TIR images has improved and the cost of infrared cameras has decreased, TIR data has been widely used in computer vision tasks to overcome the limitations of conventional RGB settings, such as RGB-thermal (RGBT) tracking[31, 32] and RGBT object detection[33], which exploit the characteristics of the two modalities to improve performance on the corresponding tasks. Meanwhile, Nguyen et al.[34] proposed an RGBT dataset, which contains RGB and thermal infrared image pairs for pedestrian recognition. This dataset has been widely used for cross-modality person Re-ID[35-37].
Although TIR data can relieve the challenges that the conventional single RGB modality faces in adverse conditions, most surveillance systems are equipped with visible-light cameras only, which results in a lack of TIR data for the RGBT Re-ID task. Therefore, how to exploit the advantages of the thermal infrared modality for single RGB-modality Re-ID remains an open problem.
In recent years, many researchers have employed GANs in cross-modal generation to supplement single-modal information. For instance, Zhang et al.[38] used an image translation method to generate thermal images for thermal infrared tracking. Inspired by these works, we propose to use a generative adversarial network, trained on existing RGBT datasets, to translate labeled RGB person images into thermal infrared ones. The labeled RGB images and the synthetic thermal images constitute a labeled RGBT training set, which makes use of the complementary information from both visible and thermal modalities. Specifically, we employ CycleGAN[39], which was proposed to learn mappings between unpaired domains, for cross-modal generation.
After obtaining the synthetic RGBT training dataset, we can learn both RGB and thermal representations from a single RGB-modality query at test time. The next issue is how to effectively fuse the information from both modalities. The intuitive approach is to concatenate the representations from each modality. However, different modalities may contribute unequally in different scenarios. Recently, attention models[40] have drawn much interest and been successfully applied to a variety of visual tasks, such as pedestrian counting[41], action recognition[42], video summarization[43] and object detection[44]. In our task, we propose to employ a channel-spatial attention network[45] to learn more discriminative RGBT representations and further boost the performance.
Based on the above discussion, we propose a novel deep RGBT representation learning framework for single RGB person re-identification. First, we synthesize thermal person images from RGB person images via CycleGAN[39]. Then, we learn the RGBT representation based on the synthetic RGBT data. Finally, we employ an attention network to improve the representation and balance the information between the two modalities. By exploiting the multi-modal representation, the proposed method relieves the effects of illumination change and background clutter in conventional RGB Re-ID tasks while making full use of the complementary information from both RGB and thermal modalities without additional modality resources. The contributions of this paper can be summarized as follows:
1) We propose a deep RGBT representation learning framework for single RGB-modality person Re-ID. By translating the RGB query images into TIR ones, our method can take advantage of both RGB and thermal modalities without additional modality resources in RGB person Re-ID.
2) We propose to employ channel-spatial attention in our network to automatically learn the important information when fusing RGB and thermal representations for robust RGB person Re-ID.
3) Extensive experiments on prevalent RGB person Re-ID datasets, including Market1501[18], DukeMTMC-reID[8] and CUHK03[6], show the promising performance of the proposed method especially in adverse scenarios.
-
Person Re-ID has been attracting more attention in recent years. Early approaches focused on extracting hand-crafted features. Representative descriptors include histograms of oriented gradients (HOG)[1], local maximal occurrence (LOMO)[3] and local binary patterns (LBP)[2]. Meanwhile, metric learning based methods emerged for learning an optimized subspace to minimize the cross-view gap in person Re-ID, such as keep it simple and straightforward metric (KISSME)[4], cross-view quadratic discriminant analysis (XQDA)[3] and top-push[20]. The development of CNNs accelerates the recent progress in person Re-ID. Chen et al.[46] proposed CNN structures to extract characteristic features with identification loss and verification loss. Zhao et al.[47] designed a deep CNN network named Spindle Net to fuse global body features and body region features for person Re-ID. Su et al.[48] used a pose-driven deep CNN model to leverage human part cues to alleviate pose variations and learn robust features from global and local information. Sun et al.[49] proposed a part-based convolutional baseline (PCB) network which learns more discriminative feature representations at the part level. Chang et al.[19] proposed a semantic level network that factorizes the visual appearance of a person. Ding et al.[13] proposed a feature mask network to re-weight different parts of high-level and low-level features.
Some researchers integrated the idea of metric learning into deep CNNs for person Re-ID. Yi et al.[17] first combined the metric learning method (cosine distance) with a deep CNN. Chen et al.[50] designed a quadruplet loss leading to larger inter-class variation and smaller intra-class variation than the triplet loss. Yao et al.[24] improved the performance by computing the person classification loss on each part separately. Zhu et al.[23] proposed a network to learn the distance metric by designing different objective functions for hard and easy negative samples. Yuan et al.[51] proposed a fast-approximated triplet (FAT) loss to preserve the effectiveness of the triplet loss.
Recently, many researchers have paid attention to GAN-based methods for person Re-ID. Zheng et al.[8] proposed to generate unlabeled samples with a simple semi-supervised pipeline on the original training dataset, and adopted the deep convolutional generative adversarial network (DCGAN) for data generation.
Person transfer generative adversarial network (PTGAN)[26] was proposed to address the problem of pose variation in person Re-ID, where the model is trained with rich pose variations generated via transferring pose instances. Ge et al.[27] proposed the feature distilling generative adversarial network (FD-GAN) to learn pose-unrelated person features with pose guidance. Wu et al.[14] used adversarial learning to address the view discrepancy by optimizing a cross-entropy view confusion objective in person Re-ID. However, person Re-ID on a single RGB modality still faces big challenges under illumination changes, especially in dark lighting conditions in severe weather or at nighttime.
-
Recently, with the development of multi-modal vision, multi-modal person Re-ID has gained much attention. Barbosa et al.[52] proposed the pattern analysis and computer vision (PAVIS) dataset, which contains two groups of RGB and depth person images. Munaro et al.[53] proposed the BIWI dataset, which consists of RGBD data of 50 different persons. Nguyen et al.[34] proposed the RegDB dataset, which contains 4120 RGB and thermal person image pairs for person recognition.
Based on the above RGBD person Re-ID datasets, Pala et al.[54] combined clothing appearance with depth data for person Re-ID. Mogelmose et al.[55] proposed a tri-modal (RGB, depth, thermal) person Re-ID approach to combine RGB, depth and thermal features. Xu et al.[56] proposed a distance metric using RGB and depth data to improve RGB-based person Re-ID. John et al.[57] combined RGB-height histograms and gait features from depth information for person Re-ID.
Wu et al.[58] proposed a kernelized implicit feature transfer scheme to estimate the Eigen-depth feature from RGB images implicitly when a depth device is not available. Paolanti et al.[59] combined depth and RGB data with multiple k-nearest neighbor classifiers based on different distance functions. Ren et al.[60] exploited a uniform and variational deep learning method for RGBD object recognition and person Re-ID. However, most existing surveillance systems are based on single RGB camera networks, and thus how to utilize the advantages of the thermal infrared modality in single RGB-modality person Re-ID remains an open question.
-
Generative adversarial networks (GANs)[61] have achieved great success recently, especially in image generation[62, 63], image editing[64] and image-to-image translation[39, 65]. Conditional GANs (cGANs)[62] were proposed to condition the generation on a selected input variable. With the rise of cross-modal research, cross-modal image generation has attracted much attention in recent years. Xu et al.[66] proposed a method to reconstruct thermal images from the associated RGB data and learn cross-modal deep representations for detection. Zhang et al.[38] used a style-transfer generative adversarial network to generate thermal infrared images from visible images to alleviate the thermal tracking problem under weak illumination. Cross-modal generation has also performed well in other areas. Luo et al.[67] combined binocular images with monocular images to generate depth-modality images for monocular images. Qiao et al.[68] proposed a novel global-local model to generate images from texts. Chen et al.[69] exploited conditional GANs to achieve cross-modal audio-visual generation. Zhou et al.[70] enabled arbitrary-subject talking face generation by learning disentangled audio-visual representations. The above progress on cross-modal generation provides another way to exploit the advantages of the thermal infrared modality in person Re-ID with a single RGB modality.
-
Our proposed framework is shown in Fig. 2. We aim to leverage thermal information to boost traditional RGB Re-ID in challenging scenarios. There are two main parts in our approach: 1) the thermal data generation network, which translates RGB images into TIR ones via CycleGAN[39], and 2) the attentive RGBT Re-ID network, which utilizes channel attention (CA) and spatial attention (SA)[45] to highlight the meaningful information of the input RGB-TIR image pairs for the Re-ID task. We shall elaborate the details of each module in the following two subsections.
Figure 2. Framework of the proposed approach: (a) Thermal data generation network[39], generating a large TIR person image dataset. Both TIR and RGB data are used as input while training the generation model. After translating RGB data into TIR data, we acquire the RGB data together with the generated TIR data for future Re-ID task. (b) Attentive RGBT Re-ID network, learning RGBT representation based on both RGB and generated TIR data for Re-ID task. Colored figures are available in the online version.
-
Image-to-image translation has been extensively researched in recent years. Representative translation methods include pix2pix[65], CycleGAN[39], etc. As is known, pix2pix[65] requires paired input data. Due to the low-quality RGB images in the paired RGBT dataset RegDB[34], pix2pix[65] tends to generate blurry, low-quality data, as shown in Fig. 3. Therefore, we employ the more advanced unpaired generation network CycleGAN[39] for better RGB-to-TIR translation.
Figure 3. Samples of generated TIR images via CycleGAN and pix2pix for the corresponding RGB ones from Market1501[18]
CycleGAN[39] is an effective method for image translation between two domains when paired images are not available. Based on generative adversarial networks (GANs)[61], CycleGAN[39] consists of two generators and two discriminators, which mutually map an image from a source domain to a target domain with a cycle consistency loss. The generator in CycleGAN[39] contains two convolutions, six residual blocks and two fractionally-strided convolutions. The image generator G takes the encoded RGB person image features as inputs and aims to decode new TIR person images. The discriminator is a convolutional PatchGAN[71], which distinguishes the decoded TIR image patches as real or fake. Let x be an image from the source RGB domain X and y be an image from the target thermal infrared domain Y. Our target is to learn the mapping functions between the RGB domain X and the thermal domain Y. First, the adversarial loss is defined as the objective function,
$ {\cal{L}}_{GAN}(G,{D}_{Y},X,Y) = {E}_{y\sim p(y)}\left[\log {D}_{Y}(y)\right] + {E}_{x\sim p(x)}\left[\log\left(1-{D}_{Y}(G(x))\right)\right] $ (1) where the generator G generates images G(x) that transfer the style from the source RGB domain X to the thermal domain Y. The discriminator $D_Y$ tries to distinguish whether the generated thermal images G(x) are real or fake.
Additionally, the main idea of CycleGAN[39] is to introduce a cycle consistency loss, which maps the target domain Y back to the source domain X. Therefore, unlike conventional generative adversarial networks, which only contain one generator, CycleGAN[39] includes another generator F to map Y → X. The cycle consistency loss is defined as
$ {\cal{L}}_{cyc}(G,F) = {E}_{x}\left[{\left\|F(G(x))-x\right\|}_{1}\right] + {E}_{y}\left[{\left\|G(F(y))-y\right\|}_{1}\right] $ (2) The cycle consistency loss pushes the reconstructed images closer to the input images. In the same manner as the minimizing-and-maximizing game in traditional adversarial learning, the final objective function of CycleGAN[39] is defined as
$ {G}^{*},{F}^{*} = \arg\mathop{\min}\limits_{G,F}\,\mathop{\max}\limits_{{D}_{X},{D}_{Y}} \left[{\cal{L}}_{GAN}(G,{D}_{Y},X,Y) + {\cal{L}}_{GAN}(F,{D}_{X},Y,X) + \lambda {\cal{L}}_{cyc}(G,F)\right] $ (3) Fig. 3 shows several generated thermal samples for the RGB dataset Market1501[18]. Some person images in the RGB modality are disturbed by background and illumination, especially Person 2, Person 4 and Person 6, highlighted in red boxes. For instance, the third image of Person 2, captured outdoors, is significantly disturbed by background clutter compared with the first two indoor images. Similar background changes are also ubiquitous for other persons such as Person 4 and Person 6. The corresponding generated TIR images alleviate these issues. Compared with the pix2pix generation method, CycleGAN achieves more realistic synthesis with better visual quality and more detailed appearance information.
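To make the objective concrete, the sketch below shows how the adversarial loss (1), the cycle consistency loss (2) and the combined objective (3) could be assembled in PyTorch. This is a minimal sketch under our own assumptions, not the authors' code: the tiny generators and discriminators stand in for the actual CycleGAN architectures, and the log-form BCE adversarial loss follows (1) directly (the official CycleGAN implementation uses a least-squares variant).

```python
import torch
import torch.nn as nn

# Tiny stand-ins for the CycleGAN generators (2 convolutions, 6 residual blocks,
# 2 fractionally-strided convolutions) and the convolutional PatchGAN discriminators.
def tiny_generator():
    return nn.Sequential(nn.Conv2d(3, 3, 3, padding=1), nn.Tanh())

def tiny_discriminator():
    return nn.Conv2d(3, 1, 4, stride=2, padding=1)   # patch-wise real/fake logits

G, F = tiny_generator(), tiny_generator()            # G: X -> Y, F: Y -> X
D_X, D_Y = tiny_discriminator(), tiny_discriminator()

adv_loss = nn.BCEWithLogitsLoss()   # log-form adversarial loss as in (1)
l1_loss = nn.L1Loss()               # L1 cycle consistency as in (2)
lam = 10.0                          # lambda in (3)

def generator_objective(x_rgb, y_tir):
    """Generator-side evaluation of the full objective in (3)."""
    fake_tir, fake_rgb = G(x_rgb), F(y_tir)

    # Adversarial terms: each generator tries to make its discriminator
    # label the translated images as real.
    pred_tir, pred_rgb = D_Y(fake_tir), D_X(fake_rgb)
    loss_gan = (adv_loss(pred_tir, torch.ones_like(pred_tir)) +
                adv_loss(pred_rgb, torch.ones_like(pred_rgb)))

    # Cycle consistency of (2): F(G(x)) should reconstruct x, G(F(y)) should reconstruct y.
    loss_cyc = l1_loss(F(fake_tir), x_rgb) + l1_loss(G(fake_rgb), y_tir)

    return loss_gan + lam * loss_cyc

# Unpaired RGB (domain X) and TIR (domain Y) mini-batches of 128 x 128 crops.
loss = generator_objective(torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128))
```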
-
We shall elaborate the data and training details to transfer the RGB person images into TIR ones in this section.
Data preparation. To translate RGB person images into TIR ones, the first requirement is training data with both RGB and TIR person images. Currently, there are only two RGB-IR person Re-ID datasets, SYSU-MM01[72] and RegDB[34]. SYSU-MM01[72] is a prevalent RGB-NIR (near infrared) cross-modal dataset captured with six individual non-overlapping cameras, including four RGB ones and two NIR ones. However, NIR data is sensitive to illumination and contains less information than TIR data, so this dataset is not suitable for our work. RegDB[34] contains a large number of RGB-TIR image pairs captured by a binocular RGB-TIR camera set. Therefore, we train our cross-modal generation model on RegDB[34] with its TIR data.
As shown in Table 1, RegDB[34] contains 4120 RGB-TIR image pairs of 412 identities under different lighting conditions. Each identity has 10 different RGB-TIR image pairs. To improve the generation quality, we choose high-quality person TIR images with different poses captured in different environments from the RegDB dataset, amounting to 2154 high-quality TIR images for training. To relieve the influence of unclear boundaries at high temperatures, we also select some data containing this challenge to train our generation network. The purpose of our approach is to integrate the generated thermal infrared information into existing Re-ID datasets, such as Market1501[18], CUHK03[6] and DukeMTMC-reID[8]. We select 2154 RGB person images with complex backgrounds and different poses from these datasets for training, so that the generator learns more details of the single-modal RGB data. For testing, we translate the three RGB datasets into the thermal infrared style, amounting to about 10K images in total.
Table 1. Datasets used for training the generation model and the Re-ID task. We test our models on the three single-modal datasets.

Types | Datasets | RGB images | IR images
Multi-modal dataset | RegDB | 4120 | 4120
Single-modal dataset | Market1501 | 32217 | −
Single-modal dataset | CUHK03 | 13164 | −
Single-modal dataset | DukeMTMC-reID | 36411 | −
Training details. We train our generation network from scratch, initializing the weights from a Gaussian distribution with zero mean and a standard deviation of 0.02. We empirically set λ = 10 in (3), and use the Adam optimizer[73] with a batch size of 1. We use the same network architecture as in CycleGAN[39]. The input images are resized to 128 × 128 pixels. We train CycleGAN[39] for 30 epochs with a learning rate of 0.0002: the learning rate is kept constant for the first 20 epochs and decayed to zero over the next 10 epochs.
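A minimal sketch of this optimization schedule is given below, assuming PyTorch; the Adam betas and the linear shape of the decay are assumptions following common CycleGAN practice, the model is a placeholder for the generator, and the training-loop body is elided.

```python
import torch
import torch.nn as nn

# Stand-in for the CycleGAN generator parameters being optimized.
model = nn.Conv2d(3, 3, kernel_size=3, padding=1)

# Adam with the learning rate stated above; the betas are an assumption.
optimizer = torch.optim.Adam(model.parameters(), lr=0.0002, betas=(0.5, 0.999))

# Keep the learning rate constant for the first 20 epochs, then decay it
# toward zero over the last 10 epochs (a linear decay is assumed here).
def lr_lambda(epoch):
    return 1.0 if epoch < 20 else max(0.0, (30 - epoch) / 10.0)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(30):
    # ... one pass over the unpaired 128 x 128 RGB/TIR training images, batch size 1 ...
    scheduler.step()
```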
-
After obtaining the TIR data, the next step is to fuse the multi-modal information[74]. To leverage the TIR information to complement the conventional RGB Re-ID task, this section elaborates our cross-modal convolution module, which aims to learn both RGB and TIR person features. We first encode each modality into a feature map of size H × W × C, where H and W indicate the spatial dimensions of the feature map and C denotes the number of channels. As shown in Fig. 2, unlike the common convolution operation taking 3-channel RGB data as input, we input 4-channel RGBT data, consisting of 3-channel RGB data $I_{RGB}$ and 1-channel thermal data $I_T$:
$ I = {I}_{RGB} + \alpha {I}_{T} $ (4) where α is a balance parameter indicating the weight/contribution of the generated thermal data, which we empirically set to 1 in this paper, and "+" denotes the concatenation operation along the channel dimension. We utilize ResNet-50[75] as our backbone. We keep the layers of ResNet-50[75] up to the Pooling-5 layer as the base network and change the dimension of the fully connected layer to N, the number of identities in the training dataset. We add a new embedding layer composed of a linear layer followed by batch normalization[76]. The input images are randomly cropped into 256 × 128 rectangular images, each of which is flipped horizontally with a probability of 0.5. The model has an additional data loader to feed the generated TIR images. It outputs the ID prediction logits $p$ for the cross-entropy loss. Since the labels of the generated TIR images are known, we calculate the difference between the ID prediction logits $p$ and the real labels. The cross-entropy loss used to optimize the network is formulated as
$ {\cal{L}}_{id}(a,n) = -\frac{1}{n} \sum_{i=1}^{n} \log p({b}_{i}|{a}_{i}) $ (5) where n is the number of synthetic images in a training batch and $p(b_i|a_i)$ is the predicted probability of the input image $a_i$ belonging to identity $b_i$.
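A minimal sketch of this 4-channel input and identity classification loss is shown below, assuming a PyTorch/torchvision ResNet-50. It omits the embedding layer with batch normalization described above, and the class name `RGBTReID` and the random tensors are purely illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

class RGBTReID(nn.Module):
    """Sketch of a 4-channel RGBT identity classifier on a ResNet-50 backbone."""
    def __init__(self, num_identities, alpha=1.0):
        super().__init__()
        self.alpha = alpha                          # balance parameter of Eq. (4)
        backbone = models.resnet50()                # ImageNet weights could also be loaded
        # Replace the 3-channel stem with a 4-channel one to accept RGB + T input.
        backbone.conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3, bias=False)
        # Final classifier over the N training identities.
        backbone.fc = nn.Linear(backbone.fc.in_features, num_identities)
        self.backbone = backbone

    def forward(self, rgb, tir):
        # Eq. (4): stack RGB with the weighted generated thermal channel.
        x = torch.cat([rgb, self.alpha * tir], dim=1)   # N x 4 x H x W
        return self.backbone(x)                          # ID prediction logits p

model = RGBTReID(num_identities=751)                     # e.g., Market1501 training identities
criterion = nn.CrossEntropyLoss()                        # identity loss of Eq. (5)

rgb = torch.randn(8, 3, 256, 128)                        # RGB crops
tir = torch.randn(8, 1, 256, 128)                        # generated TIR crops
labels = torch.randint(0, 751, (8,))
loss = criterion(model(rgb, tir), labels)
```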
In general, the contributions of the generated thermal images and the RGB images differ across scenarios. To learn the diverse contribution of each modality, we further propose to employ channel and spatial attention mechanisms to emphasize the discriminative information.
-
The convolutional block attention module[45] aims to produce a weighting map that carries out attention computation across the feature maps. In this way, the channel attention module is oriented toward the more important channels of the RGBT feature.
Given an input RGB-T person image, we first obtain the feature map M from the first convolution layer of ResNet-50. The framework of the attention module is shown in Fig. 2. In particular, the network which embeds the channel attention module can be denoted as
$ M' = {A}_{c}(M) \otimes M $ (6) where $\otimes$ denotes the element-wise weighting operation and $A_c$ is the channel attention module. The attention map is expected to focus more on the person region than on the background. M′ indicates the channel attention features. The channel attention module exploits the channel relationship of features by choosing the more meaningful channels of an RGBT feature map. One can obtain the attention map by aggregating the input feature maps. A common way of aggregation is to use average-pooling to learn the extent of the input object. To better select discriminative features and preserve more texture information, we further introduce a max-pooling operation in this paper.
Both the average-pooled and max-pooled[45] descriptors are forwarded to a convolution block to obtain the channel attention map. Finally, we use element-wise summation to merge the output features. The objective function can be defined as
$ {A}_{c}(M) = {M}_{\rm{Avg}} + {M}_{\rm{Max}} = \sigma\left({C}_{2}({\rm{ReLU}}({C}_{1}{\rm{Avg}}(M))) + {C}_{2}({\rm{ReLU}}({C}_{1}{\rm{Max}}(M)))\right) $ (7) where M denotes the feature map of the image after different layers, σ denotes the sigmoid function, and C1 and C2 are two different convolution layers.
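A possible PyTorch realization of the channel attention in (6) and (7) is sketched below. The channel reduction ratio inside C1 and C2 and the use of 1 × 1 convolutions for them are assumptions; the text only specifies two convolution layers.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention of Eq. (7): sigma(C2(ReLU(C1 Avg(M))) + C2(ReLU(C1 Max(M))))."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)   # Avg(M): N x C x 1 x 1
        self.max_pool = nn.AdaptiveMaxPool2d(1)   # Max(M): N x C x 1 x 1
        self.c1 = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.c2 = nn.Conv2d(channels // reduction, channels, kernel_size=1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, m):
        avg_branch = self.c2(self.relu(self.c1(self.avg_pool(m))))
        max_branch = self.c2(self.relu(self.c1(self.max_pool(m))))
        attn = self.sigmoid(avg_branch + max_branch)   # A_c(M), shape N x C x 1 x 1
        return attn * m                                # Eq. (6): M' = A_c(M) (x) M

# Example: re-weight a 64-channel feature map.
features = torch.randn(2, 64, 64, 32)
weighted = ChannelAttention(64)(features)
```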
-
To capture the spatial relationship of features, we further employ the spatial attention module to emphasize the informative parts of the features, as complementary information to the channel attention. The spatial attention module in the network can be denoted as
$ M'' = {A}_{s}(M') \otimes M' $ (8) where M′ and M′′ indicate the features after the channel attention module and the final attention features, respectively, and $A_s$ is the spatial attention module.
Unlike the channel attention, we concatenate the two pooled descriptors to generate an efficient descriptor and then forward it to a convolution layer to compute the spatial attention map. The spatial attention is computed as
$ {A}_{s}(M) = {M}_{{\rm{Avg}},{\rm{Max}}} = \sigma\left({C}^{7\times 7}({\rm{Avg}}(M) + {\rm{Max}}(M))\right) $ (9) where σ denotes the sigmoid function, M indicates the feature map of the image after different layers, and $C^{7\times 7}$ denotes a convolution layer with a 7 × 7 kernel. Fig. 4 demonstrates several samples enhanced by the channel-spatial attention map.
Figure 4. Visualized feature maps of corresponding person images from the Market1501 dataset[18]. The results of baseline and CSA are achieved via ResNet-50 and ResNet-50 with the channel-spatial attention respectively on the original RGB data.
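A corresponding PyTorch sketch of the spatial attention in (8) and (9) follows. Here the two pooled maps are concatenated before the 7 × 7 convolution, following the description in the text; Eq. (9) writes this fusion with a "+", so treat the exact fusion as an assumption.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention of Eqs. (8)-(9) with a 7 x 7 convolution."""
    def __init__(self, kernel_size=7):
        super().__init__()
        # The two pooled descriptors are fused and passed through C^{7x7}.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, m):
        avg_map = m.mean(dim=1, keepdim=True)         # average over channels: N x 1 x H x W
        max_map = m.max(dim=1, keepdim=True).values   # max over channels: N x 1 x H x W
        attn = self.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return attn * m                               # Eq. (8): M'' = A_s(M') (x) M'

# Example: applied to the channel-attended feature map M'.
m_prime = torch.randn(2, 64, 64, 32)
m_double_prime = SpatialAttention()(m_prime)
```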
-
The backbone of our attentive RGBT Re-ID network is the standard ResNet-50[75]. The Re-ID network is trained with the cross-entropy loss. The learning rate is set to 0.1 and then reduced to 0.01 at the 60th epoch. For the spatial attention module, we use a convolution layer with a kernel size of 7 × 7.
We train our model for 90 epochs with a learning rate schedule that decreases as training progresses. We set the dropout rate to 0.5 to prevent overfitting. We use stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0005 to fine-tune the network. All input images are resized to 256 × 128 with horizontal flipping during training.
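A sketch of this optimization setup, assuming PyTorch, is given below; the model is a trivial placeholder for the attentive RGBT Re-ID network, and the per-epoch loop body is elided.

```python
import torch
import torch.nn as nn

# Stand-in for the attentive RGBT Re-ID network described above.
model = nn.Sequential(nn.Flatten(), nn.Dropout(p=0.5), nn.Linear(4 * 256 * 128, 751))

# SGD with the momentum and weight decay stated above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# Reduce the learning rate from 0.1 to 0.01 at the 60th epoch.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60], gamma=0.1)

for epoch in range(90):
    # ... one pass over 256 x 128 randomly flipped RGBT training batches ...
    scheduler.step()
```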
-
To verify the effectiveness of our proposed method, we evaluate the method on three large-scale person Re-ID benchmark datasets, including Market1501[18], DukeMTMC-reID[8] and CUHK03[6]. Performance is evaluated by the cumulative matching characteristic (CMC) and mean average precision (mAP).
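For reference, a minimal sketch of how Rank-k (CMC) and mAP can be computed from a query-gallery distance matrix is shown below. It assumes NumPy arrays and that every query has at least one true match in the gallery, and it omits the same-camera/junk filtering used by the standard benchmark protocols, so it is illustrative rather than the exact evaluation code.

```python
import numpy as np

def evaluate(dist, q_ids, g_ids, topk=(1, 5, 10)):
    """CMC Rank-k and mAP from a (num_query x num_gallery) distance matrix."""
    num_q = dist.shape[0]
    cmc_hits = np.zeros(max(topk))
    average_precisions = []
    for i in range(num_q):
        order = np.argsort(dist[i])                           # gallery sorted by distance
        matches = (g_ids[order] == q_ids[i]).astype(np.float32)
        # CMC: a query is a Rank-k hit if its first correct match appears within the top k.
        first_hit = int(np.flatnonzero(matches)[0])
        if first_hit < len(cmc_hits):
            cmc_hits[first_hit:] += 1
        # AP: mean of the precision values evaluated at each correct match.
        cum_hits = np.cumsum(matches)
        precision = cum_hits / (np.arange(len(matches)) + 1)
        average_precisions.append(float((precision * matches).sum() / matches.sum()))
    cmc = {k: float(cmc_hits[k - 1]) / num_q for k in topk}
    return cmc, float(np.mean(average_precisions))

# Toy usage: 3 queries against 5 gallery images.
dist = np.random.rand(3, 5)
cmc, mean_ap = evaluate(dist, q_ids=np.array([0, 1, 2]), g_ids=np.array([0, 1, 2, 1, 0]))
```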
-
Market1501[18] consists of 12936 images of 751 identities for training and 19281 images of 750 identities for testing from 6 camera views. There are on average 17.2 images per identity in the training set. In testing, 3368 images from 750 identities are used as queries to retrieve the matching persons in the dataset.
DukeMTMC-reID[8] is a subset of the tracking dataset DukeMTMC[77] for image-based person Re-ID. The dataset contains 16522 images of 702 identities for training collected from 8 cameras and 2228 query images from the other 702 identities. The evaluation metrics of this dataset are the same as those of Market1501[18].
CUHK03[6] contains 14097 images of 1467 identities captured by two cameras on the Chinese University of Hong Kong (CUHK) campus. Images of 767 identities are selected for training, and the remaining 700 identities are used for testing.
-
We first compare the performance of our method with recent state-of-the-art Re-ID methods on the three benchmark datasets.
-
Table 2 reports the comparison results on the Market1501 dataset[18]. As we can see, our method beats the state-of-the-art methods on both Rank-1 and mAP, compared with prevalent methods such as online instance matching (OIM) loss[78], k-reciprocal[9], joint learning multi-loss (JLML)[79] and deep spatial feature reconstruction (DSR)[80], although they are devoted to designing complicated network architectures or various losses for better performance. Furthermore, our method outperforms other GAN based Re-ID methods, including the deep convolutional generative adversarial network (DCGAN)[8], DeformGAN[81] and Pose-transfer[26], which aim to synthesize person images with various poses and image styles. This indicates that the complementary advantages of different modalities play a more important role than cross-view pose variation and style adaptation in person Re-ID.
Table 2. Experimental comparison of the proposed approach with state-of-the-art methods on Market1501[18] (in %)

Methods | Rank-1 | mAP | References
Bow+KISSME[18] | 44.4 | 20.7 | ICCV 2015
ReRank[9] | 77.1 | 63.6 | CVPR 2017
OIM Loss[78] | 82.1 | 60.9 | CVPR 2017
MSCAN[25] | 76.3 | 53.1 | CVPR 2017
DCA[25] | 80.3 | 57.5 | CVPR 2017
DCGAN[8] | 78.0 | 56.2 | ICCV 2017
k-reciprocal[9] | 77.1 | 63.6 | CVPR 2017
OL-MANS[82] | 60.7 | − | ICCV 2017
SVDNet[83] | 82.3 | 62.1 | ICCV 2017
PA[84] | 81.0 | 63.4 | ICCV 2017
JLML[79] | 85.1 | 65.5 | IJCAI 2017
DSR[80] | 82.7 | 61.2 | CVPR 2018
DeformGAN[81] | 80.6 | 61.3 | CVPR 2018
Pose-transfer[26] | 79.8 | 58.0 | CVPR 2018
FMN[13] | 86.0 | 67.1 | PRL 2019
Ours | 86.5 | 76.2 | −
-
The evaluation results of our method on the DukeMTMC-reID dataset[8] are shown in Table 3. Our method consistently outperforms the state-of-the-art methods, including metric learning based methods, e.g., Bow+KISSME[18] and OIM Loss[78], and GAN based methods, e.g., DCGAN[8], similarity preserving generative adversarial network (SPGAN)[85] and Pose-transfer[26]. Consistent with the results on Market1501[18], our method significantly improves the mAP metric by 9%, which verifies that our method can distinguish more challenging cases within the top rankings.
Table 3. Experimental comparison of the proposed approach with state-of-the-art methods on DukeMTMC-reID[8] (in %)
-
Table 4 shows the results of our method compared with the state-of-the-art methods on the CUHK03 dataset[6]. Note that CUHK03[6] contains two folds: one named detected, where the person bounding boxes are obtained by a pedestrian detector, and the other named labeled, where they are annotated by hand. We test our method on the labeled fold and compare it with the state-of-the-art methods. Our method achieves promising performance with 87.6% and 84.1% on Rank-1 and mAP, respectively. The improvement is not as large as on the other two datasets, Market1501[18] and DukeMTMC-reID[8]. The main reason is the limited challenges in the CUHK03 dataset[6], which contains little background clutter and few illumination changes. In other words, our method better improves person Re-ID performance in more challenging scenarios.
Table 4. Experimental comparison of the proposed approach with state-of-the-art methods on CUHK03[6] (in %)
-
Fig. 5 (a) demonstrates two ranking results of corresponding queries on the Market1501 dataset[18], CUHK03 dataset[6] and DukeMTMC-reID dataset[8] respectively. Benefitting from the thermal information generated from the RGB modality, our method can overcome the challenges of background clutter (especially for Query (i), (ii) and (vi)), pose changes (especially for Query (i) to (iv) and (vi)), occlusions (especially for Query (iv) and (v)), and huge illumination changes (especially for Query (ii) and (vi)).
Figure 5. Qualitative examples of the proposed method; (a) Ranking results of the proposed method on three benchmark person re-identification datasets, where the left column indicates the query images, and the following ten columns are the corresponding top-10 hits obtained by our method; (b) Ranking results of the proposed method comparing with the baseline on Market1501 dataset[18]. The green and the red boxes indicate the right and the wrong hits respectively.
-
To verify the contribution of each component in our method, we evaluate several variants on the three datasets in this section, as shown in Fig. 6. It is clear that: 1) Introducing the channel and spatial attention (CSA) or the TIR generation module improves the Rank-1 and especially the mAP accuracies, which verifies the contribution of both components. 2) Integrating both the CSA and TIR generation modules further boosts the performance, which verifies the effectiveness of the proposed method. 3) TIR generation contributes more on Market1501[18], while CSA plays a more important role on the other two datasets, DukeMTMC-reID[8] and CUHK03[6]. The reason is that the images of Market1501[18] contain varied backgrounds and occlusions and the complexity of the dataset is high, so the TIR data reduces the impact of these factors. The attention network selects discriminative regions of the images and important channels of the feature maps for the different modal data on the three datasets.
Fig. 5 (b) illustrates that our method outperforms the baseline, especially under challenges such as background clutter and pose changes, on Market1501[18]. We also compare our CSA module with the widely used SE-block[87] based on ResNet-50 on Market1501. As shown in Table 5, our attention module outperforms the SE-block.
-
To evaluate the generality of our method, we further evaluate it with different backbones besides ResNet-50[75], including ResNet-101[75], ResNet-34[75], Res2Net-50[88] and SeResNet-50[87]. Table 6 reports the results of our method with the various backbones. Our CSA module is more suitable for the ResNet architecture, where it achieves better performance. It is clear that our method achieves promising performance on all the backbones. Furthermore, our TIR generation module and CSA module boost the performance on each backbone, which verifies the contribution of the proposed method.
Table 6. Results of the proposed approach with different backbones on the three datasets (in %)

Backbones | Methods | Market1501 Rank-1 | Market1501 mAP | DukeMTMC-reID Rank-1 | DukeMTMC-reID mAP | CUHK03 Rank-1 | CUHK03 mAP
ResNet-50 | Baseline | 82.8 | 59.9 | 63.8 | 43.0 | 72.5 | 66.0
ResNet-50 | + CSA | 83.4 | 61.0 | 68.0 | 47.6 | 79.4 | 72.5
ResNet-50 | + TIR generation | 85.5 | 66.0 | 67.3 | 47.1 | 77.8 | 74.2
ResNet-50 | + TIR generation + CSA | 86.5 | 76.2 | 69.2 | 55.0 | 87.6 | 84.1
ResNet-101 | Baseline | 83.5 | 61.9 | 68.7 | 48.9 | 80.5 | 75.5
ResNet-101 | + CSA | 84.8 | 64.2 | 69.6 | 50.3 | 84.2 | 77.3
ResNet-101 | + TIR generation | 86.4 | 68.8 | 69.9 | 49.4 | 82.8 | 79.5
ResNet-101 | + TIR generation + CSA | 87.4 | 78.2 | 70.6 | 54.2 | 87.8 | 84.3
ResNet-34 | Baseline | 75.4 | 51.4 | 55.8 | 32.9 | 63.0 | 56.7
ResNet-34 | + CSA | 78.2 | 54.3 | 57.3 | 34.2 | 70.2 | 60.8
ResNet-34 | + TIR generation | 79.0 | 56.8 | 60.1 | 37.6 | 68.5 | 63.2
ResNet-34 | + TIR generation + CSA | 82.5 | 72.3 | 61.0 | 45.6 | 78.8 | 74.4
Res2Net-50 | Baseline | 76.0 | 61.2 | 56.2 | 33.6 | 63.6 | 57.7
Res2Net-50 | + CSA | 78.2 | 65.5 | 58.1 | 36.0 | 71.1 | 61.9
Res2Net-50 | + TIR generation | 76.8 | 62.1 | 60.3 | 37.9 | 70.5 | 65.1
Res2Net-50 | + TIR generation + CSA | 81.2 | 68.6 | 62.1 | 46.8 | 80.2 | 75.3
SeResNet-50 | Baseline | 79.1 | 66.8 | 65.1 | 45.3 | 69.0 | 63.7
SeResNet-50 | + CSA | 81.1 | 68.2 | 66.5 | 46.1 | 75.2 | 67.8
SeResNet-50 | + TIR generation | 80.1 | 67.2 | 66.1 | 45.8 | 72.5 | 65.2
SeResNet-50 | + TIR generation + CSA | 82.0 | 71.2 | 67.8 | 52.3 | 82.3 | 77.0

ResNet-101[75] slightly outperforms ResNet-50[75] owing to its deeper convolution layers, which extract more high-level features. The remaining three backbones are overshadowed: Res2Net-50 and SeResNet-50 contain a large number of parameters, leading to overfitting and more computation, while ResNet-34 is a shallow network with only moderate performance.
-
The important parameter in our method is α in (4), which balances the weights of the RGB and thermal modalities during Re-ID, as shown in Fig. 7. The larger α is, the higher the contribution of the thermal information. We analyze the impact of α by varying it over 0.2, 0.5, 0.8, 1.0, 1.2, 1.5 and 2.0, and observe that: 1) Our method with different weights consistently outperforms the baseline, which validates the effectiveness of the generated TIR information. 2) Our method achieves the best performance in the range 0.8 < α < 1.2, which indicates that the generated TIR information contributes roughly the same as the RGB information. 3) A larger weight on the TIR information may degrade the overall performance, since TIR data carries less appearance information than RGB data. In this work, we set α to 1 to balance the channel weights of the generated thermal data and the visible data in the input. Intuitively, a higher α suits scenes with lower illumination and more background clutter.
Figure 7. Evaluation with different weights of the generated thermal data on Market-1501 dataset[18]
-
In this work, we have proposed an RGBT representation learning network for person re-identification. It utilizes a generative model to obtain TIR data to handle hard backgrounds in Re-ID datasets. Benefiting from the thermal modality, it can learn more discriminative feature representations from both RGB and synthesized TIR information for person Re-ID. Furthermore, we have utilized a cross-modal attention network to adaptively integrate the multi-modal information for Re-ID. Our proposed framework achieves state-of-the-art performance on person Re-ID without additional computational cost. In the future, we will investigate more modality information to improve the robustness of single RGB modality based Re-ID tasks.
-
This work was supported by National Natural Science Foundation of China (Nos. 61976002, 61976003 and 61860206004), Natural Science Foundation of Anhui Higher Education Institutions of China (No. KJ2019A0033), and the Open Project Program of the National Laboratory of Pattern Recognition (No. 201900046).
Learning Deep RGBT Representations for Robust Person Re-identification
- Received: 2020-06-22
- Accepted: 2020-10-10
- Published Online: 2021-01-19
-
Key words:
- Person re-identification (Re-ID) /
- thermal infrared /
- generative networks /
- attention /
- deep learning
Abstract: Person re-identification (Re-ID) is the task of retrieving images of a specific person across non-overlapping camera networks, and has achieved many breakthroughs recently. However, it remains very challenging in adverse environmental conditions, especially in dark areas or at nighttime, due to the imaging limitations of a single visible light source. To handle this problem, we propose a novel deep red green blue (RGB)-thermal (RGBT) representation learning framework for single-modality RGB person Re-ID. Due to the lack of thermal data in prevalent RGB Re-ID datasets, we propose to use a generative adversarial network, trained on existing RGBT datasets, to translate labeled RGB person images into thermal infrared ones. The labeled RGB images and the synthetic thermal images make up a labeled RGBT training set, and we propose a cross-modal attention network to learn effective RGBT representations for person Re-ID in day and night by leveraging the complementary advantages of the RGB and thermal modalities. Extensive experiments on the Market1501, CUHK03 and DukeMTMC-reID datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance on all the above person Re-ID datasets.
Citation: Ai-Hua Zheng, Zi-Han Chen, Cheng-Long Li, Jin Tang, Bin Luo. Learning Deep RGBT Representations for Robust Person Re-identification. International Journal of Automation and Computing. doi: 10.1007/s11633-020-1262-z