1. Introduction
With the rapid development of agricultural information technology [1,2,3,4,5,6], semantic segmentation with models such as FCN [7], UNet [8], SegNet [9], DeepLab [10], and ASPP [11] has become one of the main technologies of agricultural intelligence. However, training these models requires large pixel-level annotated datasets, which are costly to obtain. Although weakly supervised learning can reduce this cost to some extent, it still requires a large amount of weakly annotated data. To reduce the dependence on large annotated datasets, few-shot semantic segmentation [12], which aims to learn from only a few support images and remain discriminative for new unseen classes, has been proposed and has gradually attracted attention in various fields, especially plant disease segmentation.
Most few-shot segmentation methods learn from a small number of support images and then feed the learned knowledge into a parameterized module for query segmentation. However, because feature extraction and object segmentation are performed simultaneously, this approach entangles the segmentation process with the support semantic features. Meanwhile, current few-shot semantic segmentation methods are generally implemented by comparing the similarity between prototypes and query features. For example, SG-One [13] extracts the guiding feature of the support image through VGG16 [14] and then uses cosine similarity to establish the relationship between the prototypes and the query image feature. Following this idea, PANet [15] adds a regularization that aligns support prototypes and query features to provide better generalization ability.
Although few-shot semantic segmentation has made great progress on natural images, these methods cannot handle the segmentation of plant disease images well. Because the guidance feature extracted from the support images through the backbone network is used together with the query image for foreground detection, it cannot cope with the large differences in shape and texture among different plant diseases. Therefore, simply matching the prototypes generated from the deep features of the support image against the query image loses many disease features, which can lead to predictions that miss smaller disease areas and fail to distinguish multiple diseases accurately.
To overcome the above problems of few-shot segmentation algorithms on plant disease images, a few-shot semantic segmentation network for plant disease images based on multi-scale features and multi-prototype matching is proposed. In this paper, semantic segmentation is performed on images of early leaf diseases, and the diseased areas on the leaves are delineated, providing a new method for plant disease control. Specifically, the high-scale support feature and query feature are obtained from the last layer of the feature extraction network, and their feature relationship is established to obtain a similarity map. To obtain more detailed features, this paper fuses the low-scale and high-scale features extracted by the feature extraction network VGG16 to obtain fused support features and fused query features that contain more detail. New prototypes generated from the fused support features by masked average pooling are then matched against the fused query features for similarity. Finally, the similarity maps are fused across scales; specifically, the multiple similarity maps are averaged. By matching multiple support image prototypes against multiple query feature maps, the recognition accuracy of the network model can be improved to a certain extent.
In the original algorithm, cosine similarity is used to match the prototypes of the support image with the query features. It measures the similarity between two vectors only by the angle between them in the feature space and does not consider the distance between them. Therefore, a hybrid similarity calculation is adopted, which computes both the Euclidean distance and the cosine similarity of the two vectors and then takes a weighted sum at a ratio of 9:1 (cosine similarity : Euclidean distance). In this way, the method can obtain more accurate similarity maps.
With limited computing power, it is important to allocate computing resources to the more important tasks. In deep learning, as the number of network parameters increases, information overload becomes a problem. Therefore, the CBAM (convolutional block attention module) [16] is introduced into our network after the fusion of shallow and deep features; it not only attends to important information while filtering out irrelevant information but also improves the efficiency and performance of the segmentation task.
There are few plant disease datasets suitable for few-shot semantic segmentation tasks. Therefore, this paper constructs a plant disease dataset (PDID-5i) containing ten different categories annotated at the pixel level and conducts experiments on it to verify the effectiveness of our network.
The main contributions of our method are as follows:
A multi-scale and multi-prototype match is proposed for few-shot plant disease semantic segmentation. This method generates multiple prototypes and multiple query feature maps at different scales, establishes the relationships between prototypes and query feature maps through a similarity measure, and finally fuses the relationships across scales. With this approach, our network can identify plant disease features more precisely.
A hybrid similarity is designed as the weighted sum of cosine similarity and Euclidean distance. When the directional similarity and the actual distance between two vectors are considered jointly, a more accurate similarity can be obtained.
A CBAM attention module is added to our network so that it attends to the important plant disease features and ignores interfering information, which helps improve accuracy.
To accomplish the few-shot semantic segmentation task, we constructed a plant disease dataset (PDID-5i) suited to the task. Experiments on this dataset show that the model we designed is highly effective.
3. The Proposed Method
3.1. Problem Setting
The few-shot segmentation model proposed in this paper aims to obtain guiding features from a small number of annotated support images in order to segment novel objects in the query image. This paper adopts the following strategy to train and test the model. First, we divide the dataset categories into a known category set Cknow and an unknown category set Cunknow. The training set Dtrain is drawn from Cknow, and the test set Dtest is constructed from Cunknow. Finally, we train the model on the training set and evaluate its performance on the test set.
The training set and the test set consist of several episodes, each containing an annotated support image set Si and an unannotated query image set Qi. In the k-shot semantic segmentation setting, each semantic category in the support set Si has K <image, mask> pairs. The Cknow categories are taken from the total C categories for training, and the Cunknow categories are used for testing. The query set contains N <image, mask> pairs whose categories are the same as those in the support set. The model first extracts feature knowledge of the categories from the support set and then performs the segmentation task on the query set using the extracted knowledge. By continuously training on different semantic classes, the model generalizes well to new semantic classes. Finally, the model trained on Dtrain is evaluated on Dtest; episode construction is sketched below.
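As a minimal illustration of this episodic setting, the following sketch samples one K-shot episode. The dataset structure (a mapping from class name to annotated image-mask pairs) and all names are our assumptions, not the paper's implementation.

```python
import random

def sample_episode(dataset, classes, k_shot=1, n_query=1):
    """Sample one K-shot episode: a support set and a query set.

    dataset: dict mapping class name -> list of (image, mask) pairs.
    classes: candidate classes (Cknow for training, Cunknow for testing).
    """
    c = random.choice(classes)                      # pick one episode class
    pairs = random.sample(dataset[c], k_shot + n_query)
    support = pairs[:k_shot]                        # K annotated <image, mask> pairs
    query = pairs[k_shot:]                          # query pairs (masks used only for loss/metrics)
    return support, query, c
```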
3.2. Evaluation Indicators
In this paper, mIoU and binary-IoU are used as indicators to evaluate the performance of the model. The mIoU (mean Intersection-over-Union) averages, over classes, the ratio of the intersection to the union of the predicted and ground-truth pixel sets. Binary-IoU treats all object classes as foreground and averages the IoU of foreground and background. Together, mIoU and binary-IoU provide a comprehensive evaluation of model performance; a small sketch of the computation follows.
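A minimal NumPy sketch of these two metrics (function names are ours, not from the paper):

```python
import numpy as np

def iou(pred, gt, cls):
    """IoU of one class between predicted and ground-truth label maps."""
    p, g = pred == cls, gt == cls
    union = np.logical_or(p, g).sum()
    # convention: a class absent from both maps counts as a perfect match
    return np.logical_and(p, g).sum() / union if union else 1.0

def mean_iou(pred, gt, classes):
    """mIoU: per-class IoU averaged over the given classes."""
    return np.mean([iou(pred, gt, c) for c in classes])

def binary_iou(pred, gt):
    """Binary-IoU: all object classes merged into foreground (label > 0),
    then the foreground IoU and background IoU are averaged."""
    fg = iou(pred > 0, gt > 0, True)
    bg = iou(pred > 0, gt > 0, False)
    return (fg + bg) / 2
```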
3.3. Method Overview
Unlike most current few-shot segmentation methods, the method in this paper first extracts guiding features from the support image through the feature extraction network to generate the prototypes. On this basis, we also fuse shallow features with deep features, generating prototypes with more detailed information. The query image is passed through the same feature extraction network to generate the query feature map; likewise, we fuse the shallow and deep features of the query image to generate a query feature map that contains its detailed features. Finally, we match the multiple prototypes against the multiple query feature maps and establish their relationships with the hybrid similarity.
As shown in Figure 1, the model proposed in this paper performs the segmentation task as follows. First, a shared backbone network is used to extract the feature maps of the support images and query images. Then, the feature maps of the support images are further processed by masked average pooling to obtain prototypes. Finally, the relationship between prototypes and query feature maps is established using our proposed hybrid similarity, as described in Section 3.6. To better measure this relationship, multi-scale feature maps and multi-scale prototypes are constructed, and the relationships between prototypes and query feature maps are obtained at multiple scales, as described in Section 3.4. At the feature extraction stage, we adopt VGG-16 [14] as the shared backbone to extract deep features from the support and query images. At the same time, we fuse the shallow feature after the third convolution block with the deep feature after the last convolution block and then pass the result through the CBAM attention module, as described in Section 3.5, to generate the fused feature maps of the support and query images. A masked-average-pooling sketch is given below.
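Masked average pooling, used above to turn a support feature map into a prototype, averages the feature vectors only over the pixels covered by the (downsampled) support mask. A minimal PyTorch sketch with hypothetical names:

```python
import torch
import torch.nn.functional as F

def masked_average_pooling(feat, mask):
    """Compute a class prototype from a support feature map.

    feat: support feature map of shape (B, C, H, W).
    mask: float binary support mask of shape (B, 1, h, w); it is resized
          to the feature resolution before pooling.
    Returns a prototype of shape (B, C).
    """
    mask = F.interpolate(mask, size=feat.shape[-2:], mode="bilinear", align_corners=False)
    masked = feat * mask                                  # zero out background features
    return masked.sum(dim=(2, 3)) / (mask.sum(dim=(2, 3)) + 1e-5)
```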
3.4. Multi-Scale and Multi-Prototypes Match
Currently, common few-shot semantic segmentation methods use single-scale prototypes and query features for the similarity calculation. This leads to a coarse match with the query features because such prototypes cannot sufficiently represent the details of plant disease features. To address this challenge, a multi-scale and multi-prototype match (MPM) method is proposed.
Suppose we obtain the support feature Fs1 and the query feature Fq1 from the support image and the query image, respectively, through the VGG-16 extraction network. Then, we fuse the feature after the third convolution block of VGG-16 with the feature after the last convolution block to obtain the fused support feature Fs2 and the fused query feature Fq2. In the same way, the features after the second and fourth convolution blocks of VGG-16 can be fused to obtain the fused support feature Fs3 and the fused query feature Fq3. Considering the efficiency of the model, here we take generating one additional support feature and one additional query feature as an example. First, we pass the features Fs2 and Fq2 through the CBAM module so that the model pays more attention to features with higher attention values during training. Next, we apply masked average pooling to the support features Fs1 and Fs2 to generate the prototypes P1 and P2. Finally, we calculate the similarity between the multiple prototypes and the multiple query feature maps, as sketched below.
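The following PyTorch sketch illustrates the MPM idea under our reading of this section: per-scale prototypes are matched against per-scale query features, and the resulting similarity maps are averaged. Feature extraction and CBAM are assumed to have run already, plain cosine similarity stands in for the hybrid similarity of Section 3.6, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def similarity_map(proto, query_feat):
    """Cosine similarity between a prototype (B, C) and a query feature
    map (B, C, H, W); returns a (B, H, W) similarity map.
    (The paper replaces plain cosine with the hybrid similarity of Section 3.6.)"""
    proto = proto[:, :, None, None]                    # (B, C, 1, 1)
    return F.cosine_similarity(query_feat, proto, dim=1)

def multi_scale_match(protos, query_feats, out_size):
    """Average the per-scale similarity maps (multi-scale fusion).

    protos: list of prototypes, one per scale, e.g. [P1, P2], each (B, C_i).
    query_feats: list of query feature maps, e.g. [Fq1, Fq2], each (B, C_i, H_i, W_i).
    """
    maps = []
    for p, q in zip(protos, query_feats):
        s = similarity_map(p, q)[:, None]              # (B, 1, H_i, W_i)
        maps.append(F.interpolate(s, size=out_size, mode="bilinear", align_corners=False))
    return torch.stack(maps).mean(dim=0)               # fused similarity map
```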
3.5. CBAM Module
Although the fusion of shallow and deep features yields more detailed plant disease texture features, it does not take the differences among pixel categories, channel features, and spatial features into account, and the learned weights of different features affect the segmentation of plant diseases. Introducing an attention module lets the network focus on the characteristics of plant disease areas during training and reduces unimportant learning weights, such as those of background areas. Therefore, an attention module is introduced after each fusion of shallow and deep features, allowing the network to assign different weights to different features.
The CBAM module [16] is a convolution-based attention mechanism. Inspired by SENet [30], CBAM combines channel attention and spatial attention, as shown in Figure 2. CBAM is composed of a CAM module and a SAM module, which assign weights to channels and spatial locations, respectively.
In the CAM module, max pooling and global average pooling over the spatial (height and width) dimensions are applied to the input feature map F (H × W × C) to obtain two (1 × 1 × C) feature maps. Next, each of the two feature maps is fed into a shared multi-layer perceptron (MLP), and the two outputs are added element-wise. The sigmoid activation function is then applied, yielding the channel attention map Mc. Finally, the channel attention map Mc and the input feature F are multiplied pixel by pixel to obtain the feature Fc required by the SAM module. The specific calculation is shown in Formula (1) [16]:

Mc(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))), Fc = Mc ⊗ F

where σ denotes the sigmoid function and ⊗ denotes element-wise multiplication.
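A minimal PyTorch sketch of this channel attention computation (a standard CBAM-style CAM; the class name and the reduction ratio are our assumptions):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CAM: channel attention map from pooled descriptors, as in Formula (1)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # shared MLP applied to both pooled (B, C, 1, 1) descriptors
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))   # global average pooling branch
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))    # global max pooling branch
        mc = torch.sigmoid(avg + mx)                              # Mc, shape (B, C, 1, 1)
        return x * mc                                             # Fc = Mc ⊗ F
```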
The SAM module takes the output feature Fc of the channel attention module as its input. First, channel-wise global max pooling and global average pooling are applied to Fc to obtain two (H × W × 1) feature maps. The two feature maps are then concatenated along the channel dimension, and a 7 × 7 convolution reduces the number of channels to 1. Finally, the spatial attention map Ms is obtained through the sigmoid activation function, and Ms is multiplied pixel by pixel with the input feature Fc to obtain the final feature Fs we need. The specific calculation is shown in Formula (2) [16]:

Ms(Fc) = σ(f7×7([AvgPool(Fc); MaxPool(Fc)])), Fs = Ms ⊗ Fc

where f7×7 denotes a 7 × 7 convolution and [ ; ] denotes concatenation along the channel dimension.
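And a matching sketch of the spatial attention step (again a standard CBAM-style SAM under our naming):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """SAM: spatial attention map from channel-pooled descriptors, as in Formula (2)."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, fc):
        avg = torch.mean(fc, dim=1, keepdim=True)        # (B, 1, H, W) channel-wise average
        mx, _ = torch.max(fc, dim=1, keepdim=True)       # (B, 1, H, W) channel-wise max
        ms = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # Ms
        return fc * ms                                   # Fs = Ms ⊗ Fc
```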
3.6. Hybrid Similarity
Usually, in few-shot segmentation tasks, cosine similarity is used to establish the relationship between prototypes and query features. However, cosine similarity uses the cosine of the angle between two vectors in the vector space as the measure of difference between two individuals; it therefore only distinguishes differences in direction and is not sensitive to absolute magnitudes. To make up for this defect, a Euclidean distance term is added to the original calculation, as shown in Formula (3). Euclidean distance reflects the numerical difference between two feature vectors, compensating for the insensitivity of cosine similarity to magnitude. The CES (cosine Euclidean similarity) module is the new method for similarity calculation between prototypes and query features proposed in this paper. Its principle is shown in Figure 3: in three-dimensional space, the cosine of the angle between two vectors A and B and the Euclidean distance between A and B are each multiplied by a scaling factor, and their sum is used as a new way to establish the relationship between prototypes and query features. The role of the scaling factors is to balance the effects of cosine similarity and Euclidean distance on the computed difference between the two vectors. When establishing the relationship between the prototypes and the query feature maps, the CES module considers not only the similarity of the two vectors in spatial direction but also their similarity in spatial distance.
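A minimal sketch of the CES idea as we read it, using the 9:1 weighting mentioned in the Introduction. The exact form of Formula (3) is not reproduced here, so subtracting the distance term (so that nearby vectors score higher) is our assumption:

```python
import torch
import torch.nn.functional as F

def ces_similarity(proto, query_feat, w_cos=0.9, w_dist=0.1):
    """Hybrid cosine-Euclidean similarity between a prototype and a query feature map.

    proto: (B, C) prototype; query_feat: (B, C, H, W) query features.
    The 9:1 weighting follows the paper; the sign of the distance term
    is our assumption about Formula (3).
    """
    p = proto[:, :, None, None]                          # (B, C, 1, 1)
    cos = F.cosine_similarity(query_feat, p, dim=1)      # (B, H, W), direction term
    dist = torch.norm(query_feat - p, dim=1)             # (B, H, W), Euclidean distance term
    return w_cos * cos - w_dist * dist                   # weighted combination
```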
3.7. Loss Function
As shown in Figure 1, our model is trained with two processes, Support-Query and Query-Support. In the Support-Query process, knowledge is learned from the annotated support set, the relationship with the query image is established, and the query image is predicted to obtain the segmentation result. We calculate the loss between the obtained segmentation result and the ground-truth label of the query image, as shown in Formula (4) [15]. Conversely, the Query-Support process, which is performed only during training, flows query information back to the support set: an average pooling operation is applied to the query features to obtain another set of prototypes, which are matched against the support feature maps to obtain the support prediction results. This paper then calculates the loss of these predictions according to Formula (5) [15] and backpropagates it to adjust the weights.
where Mq is the ground-truth segmentation mask of the query image, (x, y) indexes the spatial location, C indexes the current image within the support set, K represents the current category, and N is the total number of spatial locations.
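Since Formulas (4) and (5) are not reproduced here, the following sketch shows only the general shape of such a two-direction objective: a cross-entropy segmentation loss for the Support-Query direction plus an analogous term for the Query-Support direction, in the style of PANet [15]. The term weighting and all names are our assumptions.

```python
import torch
import torch.nn.functional as F

def training_loss(query_logits, query_mask, support_logits, support_mask):
    """Two-direction training objective (a sketch, not the paper's exact Formulas (4)-(5)).

    query_logits: (B, num_cls, H, W) Support-Query predictions for the query image.
    support_logits: (B, num_cls, H, W) Query-Support predictions for the support image,
                    obtained by matching query-derived prototypes to support features.
    query_mask, support_mask: (B, H, W) integer ground-truth label maps.
    """
    loss_sq = F.cross_entropy(query_logits, query_mask)      # Formula (4): query segmentation loss
    loss_qs = F.cross_entropy(support_logits, support_mask)  # Formula (5): training-only alignment loss
    return loss_sq + loss_qs
```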