In May, Professor Li-wei Wang from Peking University proposed a two-phase framework to recognize images from unseen fine-grained classes, i.e., zero-shot fine-grained classification (ZSFC). Experimental results demonstrate that the proposed model outperforms state-of-the-art zero-shot learning models on the task of zero-shot fine-grained classification. Details are given in the paper "Zero-shot Fine-grained Classification by Deep Feature Learning with Semantics".
Zero-shot Fine-grained Classification by Deep Feature Learning with Semantics
Ao-Xue Li, Ke-Xin Zhang, Li-Wei Wang
Fine-grained image classification, which aims to recognize subordinate-level categories, has emerged as a popular research area in the computer vision community. Different from general image recognition such as scene or object recognition, fine-grained image classification needs to explicitly distinguish images with subtle differences, which in practice involves classifying many subclasses of objects belonging to the same class, such as birds, dogs and plants.
In general, fine-grained image classification is a challenging task due to two main issues:
1) Since recognizing images in fine-grained classes is a difficult task that requires expert knowledge, annotating images in fine-grained classes is expensive, and collecting large-scale labelled data as in general image recognition (e.g., ImageNet) is thus impractical. Therefore, how to recognize images from fine-grained classes without sufficient training data for every class becomes a challenging question in computer vision.
2) Compared with general image recognition, fine-grained classification is a more challenging task, as it needs to discriminate between objects that are visually similar to each other. Therefore, we have to learn more discriminative representations for fine-grained classification than for general image classification.
Considering the lack of training data for every class in fine-grained classification, we can adopt zero-shot learning to recognize images from unseen classes without labelled training data. However, conventional zero-shot learning algorithms mainly explore the semantic relationship among classes (using textual information) and attempt to learn a match between images and their textual descriptions. In other words, few works on zero-shot learning focus on feature learning. This is a serious limitation for fine-grained classification, since it requires more discriminative features than general image recognition. Hence, we must focus on feature learning for zero-shot fine-grained image classification.
In this paper, we propose a two-phase framework to recognize images from unseen fine-grained classes, i.e., zero-shot fine-grained classification (ZSFC). The first phase of our model is to learn discriminative features. Most fine-grained classification models extract features from deep convolutional neural networks that are fine-tuned on images with extra annotations (e.g., bounding boxes of objects and part locations). However, these extra annotations are expensive to obtain. Unlike these models, our model only exploits the implicit hierarchical semantic structure among fine-grained classes for fine-tuning deep networks. The hierarchical semantic structure among classes is obtained from taxonomy, which can be easily collected from Wikipedia. In our model, we generally assume that experts recognize objects in fine-grained classes based on the discriminative visual features of images, with the hierarchical semantic structure among fine-grained classes as their prior knowledge. Under this assumption, we fine-tune deep convolutional neural networks using the hierarchical semantic structure among fine-grained classes to extract discriminative deep visual features. Meanwhile, a domain adaptation subnetwork is introduced into the proposed network to avoid the domain shift caused by the zero-shot setting.
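The hierarchical fine-tuning idea can be illustrated with a small numerical sketch (the function names, the exact loss form and the weighting below are our assumptions for illustration, not the paper's formulation): fine-class probabilities are aggregated into their taxonomic parents, and the training objective combines cross-entropy at both levels, so the learned features are encouraged to respect the taxonomy.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def hierarchical_loss(logits, fine_label, parent_of, coarse_weight=0.5):
    """Cross-entropy at the fine level plus cross-entropy at the coarse
    (taxonomy parent) level, where a parent's probability is the sum of
    its children's fine-class probabilities."""
    p_fine = softmax(logits)
    n_coarse = max(parent_of) + 1
    p_coarse = np.zeros(n_coarse)
    for fine, parent in enumerate(parent_of):
        p_coarse[parent] += p_fine[fine]
    fine_ce = -np.log(p_fine[fine_label] + 1e-12)
    coarse_ce = -np.log(p_coarse[parent_of[fine_label]] + 1e-12)
    return fine_ce + coarse_weight * coarse_ce

# Toy example: 4 fine-grained classes grouped into 2 families.
parent_of = [0, 0, 1, 1]                    # fine class -> family
logits = np.array([2.0, 0.5, -1.0, -1.0])   # network outputs for one image
loss = hierarchical_loss(logits, fine_label=0, parent_of=parent_of)
```

Note that a prediction in the wrong family is penalized at both levels, whereas confusing two siblings within the correct family is penalized only at the fine level; this asymmetry is what injects the taxonomy into feature learning.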
In the second phase, label inference, a semantic directed graph is first constructed over the attributes of fine-grained classes. Based on the semantic directed graph and the discriminative features obtained by our feature learning model, we develop a label propagation algorithm to infer the labels of images in the unseen classes. The flowchart of the proposed framework is illustrated in Fig. 1. Note that the proposed framework can be extended to a weakly supervised setting by replacing class attributes with semantic vectors extracted by word vector extractors (e.g., Word2Vec).
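A minimal sketch of attribute-based propagation from seen to unseen classes (the graph construction and edge weighting here are simplifications we assume for illustration, not the authors' exact algorithm): a test image's scores on the seen classes are pushed along edges weighted by the similarity of class attribute vectors.

```python
import numpy as np

def propagate_labels(seen_scores, seen_attrs, unseen_attrs):
    """One propagation step on a directed graph from seen to unseen
    classes: each unseen class collects the seen-class scores, weighted
    by cosine similarity between the class attribute vectors."""
    def normalize(a):
        return a / np.linalg.norm(a, axis=1, keepdims=True)
    s, u = normalize(seen_attrs), normalize(unseen_attrs)
    sim = u @ s.T                      # (n_unseen, n_seen) edge weights
    sim = np.clip(sim, 0.0, None)      # keep only positive affinities
    weights = sim / sim.sum(axis=1, keepdims=True)
    return weights @ seen_scores       # one score per unseen class

# Toy example: 3 seen classes, 2 unseen classes, 4 binary attributes.
seen_attrs = np.array([[1, 1, 0, 0],
                       [0, 1, 1, 0],
                       [0, 0, 1, 1]], float)
unseen_attrs = np.array([[1, 1, 1, 0],
                         [0, 0, 0, 1]], float)
seen_scores = np.array([0.7, 0.2, 0.1])   # classifier scores on one image
unseen_scores = propagate_labels(seen_scores, seen_attrs, unseen_attrs)
```

Here the first unseen class shares attributes with the high-scoring seen classes and therefore inherits a higher score; the image is assigned the unseen class with the largest propagated score.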
To evaluate the effectiveness of the proposed model, we conduct experiments on two benchmark fine-grained image datasets (Caltech UCSD Birds-200-2011 and Oxford Flower-102). Experimental results demonstrate that the proposed model outperforms state-of-the-art zero-shot learning models on the task of zero-shot fine-grained classification. Moreover, we further test the features extracted by our feature learning model by applying them to other zero-shot learning models; the significant gains obtained verify the effectiveness of our feature learning model.
The main contributions of this work are given as follows:
1) We have proposed a two-phase learning framework for zero-shot fine-grained classification. Unlike most previous works, which focus on the zero-shot learning step, we pay more attention to feature learning.
2) We have developed a deep feature learning method for fine-grained classification, which can learn discriminative features with hierarchical semantic structure among classes and a domain adaptation structure. More notably, our feature learning method needs no extra annotations of images (e.g., part locations and bounding boxes of objects), which means that it can be readily used for different zero-shot fine-grained classification tasks.
3) We have developed a zero-shot learning method for label inference from seen classes to unseen classes, which can help to address the issue of lack of labelled training data in fine-grained image classification.
The remainder of this paper is organized as follows. Section 2 provides related works of fine-grained classification and zero-shot learning. Section 3 gives the details of the proposed model for zero-shot fine-grained classification. Experimental results are presented in Section 4. Finally, the conclusions are drawn in Section 5.