1. Introduction
With the development of remote sensing technology, images acquired by satellites and aerial vehicles are widely used in land-use mapping, urban resource management, and agricultural research [1,2,3]. The world population is anticipated to exceed nine billion by 2050, which will cause a rapid escalation of food demand [2,4]. Therefore, the agricultural industry needs to be upgraded toward intelligence and automation to meet this increasing demand. In this paper, we address the semantic segmentation of crops in high-resolution agricultural remote sensing images to obtain crop growth information, which can effectively improve the level of agricultural intelligence.
Recently, deep learning has been widely used in pixel-level classification (i.e., semantic segmentation) tasks with good results. In particular, CNN- and Transformer-based deep learning methods have attracted increasing attention in crop segmentation tasks due to their excellent performance. To fully realize the potential of the CNN and Transformer, three problems in agricultural segmentation should be solved, as illustrated in Figure 1. First, the ultra-high-resolution images need to be cut into small patches because of memory limitations, which leads to a deficiency of global information and incomplete object edges. Second, the crop field and the soil inside it are both labeled with the crop category, causing misclassified predictions within the crop field because crops and soil have obviously different characteristics. Third, to recover the complete image, the patches need to be spliced together, which usually causes edge artifacts as well as small misclassified objects and holes [5,6]. To cope with these three problems and achieve good segmentation performance, global context is needed to provide the incomplete patches with more surrounding information and thus improve the object edges [7]; local details, such as the color and texture of each category, are meanwhile needed for finer area division; and, finally, well-designed methods are required when restoring the complete images.
The CNN can capture local information well, but lacks global information because of the limited receptive field of convolution. FCN [8] and ResNet [9] generate a larger receptive field through continuously stacked convolution layers. Inception [10,11,12,13] and AlexNet [14] obtain more neighborhood information through larger convolution kernels. DANet [15] and CBAM [16] establish global relations by introducing the attention mechanism. Both methods can obtain global information through attention, but this damages the CNN's original local information, so the acquired global information is very limited. Different from CNN-based methods, Transformer can obtain sufficient global information, but lacks local information because of its entirely attention-based architecture. ViT [17] is the first visual Transformer; it designs a position encoding to supplement the position information lost by the serialized input. Based on ViT, SETR [18] introduces a CNN decoder and successfully applies Transformer to the semantic segmentation task. ResT [19], VOLO [20], BEIT [21], Swin [22], and CSwin [23] adopt a hierarchical structure similar to the CNN to obtain local information and achieve SOTA performance on many semantic segmentation datasets. It can be observed that local information is also very important in the Transformer structure, but the local information generated by the attention module is not as effective as the local features obtained by the CNN.
According to the aforementioned literature, the CNN lacks global information but has sufficient local information, while Transformer lacks local information but has sufficient global information. Therefore, we combine the respective advantages of the CNN and Transformer and propose the Coupled CNN and Transformer Network (CCTNet), which fuses the features of the CNN and Transformer branches. The CCTNet keeps two independent branches, a CNN and a Transformer, to retain the advantages of their respective structures. However, it is difficult to fuse the strengths of two such different architectures. Hence, we propose the Light Adaptive Fusion Module (LAFM) and the Coupled Attention Fusion Module (CAFM) to effectively fuse the features of the two branches. In addition, to learn better features for the CNN, Transformer, and fusion branches, we supervise each of them with its own loss function. Furthermore, in the inference stage, we propose three methods to improve the performance on patches: the Overlapping Sliding Window (OSW), Testing Time Augmentation (TTA), and a Post-Processing (PP) method that corrects misclassified areas, especially small objects and holes. The code is available at https://github.com/zyxu1996/CCTNet (accessed on 14 April 2022).
To be clearer, the main contributions of this work are summarized as follows:
We propose the Coupled CNN and Transformer Network (CCTNet) to combine the local modeling advantage of the CNN and the global modeling advantage of Transformer to achieve SOTA performance on the Barley Remote Sensing Dataset.
The Light Adaptive Fusion Module (LAFM) and the Coupled Attention Fusion Module (CAFM) are proposed to efficiently fuse the dual-branch features of the CNN and Transformer.
Three methods, namely Overlapping Sliding Window (OSW), Testing Time Augmentation (TTA), and a Post-Processing (PP) method that corrects misclassified areas, are introduced to better restore complete crop remote sensing images during the inference stage.
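To make the overall design concrete, the following is a minimal PyTorch sketch of the dual-branch idea with per-stage fusion and three supervised heads. All module names, channel sizes, and the simple concatenation-based fusion are illustrative assumptions; the actual CCTNet uses ResNet/CSwin stages, the LAFM/CAFM, and an MSF decoder as described later.

```python
# Sketch of a two-branch encoder with per-stage fusion and three output heads.
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_stage(in_c, out_c):
    """Stand-in encoder stage: stride-2 conv + BN + ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_c, out_c, 3, stride=2, padding=1),
        nn.BatchNorm2d(out_c),
        nn.ReLU(inplace=True),
    )


class FusionBlock(nn.Module):
    """Placeholder for the LAFM/CAFM: fuses one CNN and one Transformer feature."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, f_cnn, f_trans):
        return self.proj(torch.cat([f_cnn, f_trans], dim=1))


class DualBranchSegNet(nn.Module):
    def __init__(self, num_classes=4, chans=(32, 64, 128, 256)):
        super().__init__()
        ins = (3,) + chans[:-1]
        # Two parallel encoders; a real model would plug in ResNet and CSwin here.
        self.cnn_stages = nn.ModuleList(conv_stage(i, o) for i, o in zip(ins, chans))
        self.trans_stages = nn.ModuleList(conv_stage(i, o) for i, o in zip(ins, chans))
        self.fusions = nn.ModuleList(FusionBlock(c) for c in chans)
        # Three heads: fused branch (main output) plus auxiliary CNN/Transformer heads.
        self.head_fused = nn.Conv2d(chans[-1], num_classes, 1)
        self.head_cnn = nn.Conv2d(chans[-1], num_classes, 1)
        self.head_trans = nn.Conv2d(chans[-1], num_classes, 1)

    def forward(self, x):
        f_c = f_t = x
        fused = None
        for stage_c, stage_t, fuse in zip(self.cnn_stages, self.trans_stages, self.fusions):
            f_c, f_t = stage_c(f_c), stage_t(f_t)
            fused = fuse(f_c, f_t)  # CT1 ... CT4; the last one feeds the main head
        size = x.shape[-2:]
        up = lambda y: F.interpolate(y, size=size, mode="bilinear", align_corners=False)
        return up(self.head_fused(fused)), up(self.head_cnn(f_c)), up(self.head_trans(f_t))


# Example: a 512x512 RGB patch produces three per-pixel class maps.
logits_fused, logits_cnn, logits_trans = DualBranchSegNet()(torch.randn(1, 3, 512, 512))
```

Keeping the two encoders and their heads separate is what allows each branch to preserve its own characteristics while still contributing to the fused prediction.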
2. Related Work
In this section, related works on State-Of-The-Art (SOTA) remote sensing applications and model designs are introduced, including CNN-based models, Transformer-based models, and CNN–Transformer fusion methods. These works provide experience for addressing the global and local information deficiencies in crop remote sensing segmentation.
CNN-based models for local information extraction: The conventional FCN [8] model consists of convolution layers and pooling layers, where the convolution operation extracts local features and the pooling operation downsamples the feature maps to obtain compact semantic representations. However, it is difficult to obtain semantic representations while preserving local details, since the downsampling operation damages the spatial information [24]. To retrieve the local information, UNet [25] proposes skip-connections to fuse shallow layers, achieving good performance on local details. HRNet [26,27] maintains high-resolution representations throughout the network to avoid the deficiency of local information.
To obtain global context information, DeepLab [28,29] and PSPNet [30] propose multi-scale pyramid-based fusion modules to aggregate global context from different receptive fields. Lin et al. proposed FPN [31] to aggregate features of different scales step by step from top to bottom and assign objects of different scales to feature maps of different resolutions. Inspired by the substantial ability of attention mechanisms at modeling global pixel relations, DANet [15] designs a dual-attention mechanism over the channel and spatial dimensions to obtain a multi-dimensional global context. HRCNet [32] proposes a lightweight dual-attention module to enhance the global information extraction ability, successfully applying it to remote sensing image segmentation tasks and achieving SOTA performance.
Despite its advantages in local feature extraction, the CNN's ability to capture global information is still insufficient, which is very important for crop remote sensing segmentation. Although DeepLab [28,29] and PSPNet [30] expand the receptive field to obtain a multi-scale global context, the global information is still limited to a local region. The attention mechanism provides a good pattern for modeling global information, but it is limited by the small number of attention modules and their huge computational burden. Consequently, a pure attention-based lightweight architecture is needed to achieve sufficient global information extraction.
Transformer-based models for global information extraction: Transformer is a pure attention-based architecture with a powerful capability of representing global relations, but it is weak at obtaining local details [33]. Vision Transformer (ViT) [17] is the first work to apply the Transformer structure to visual tasks by splitting images into patches to meet the input format of Transformer. SETR [18] improves on ViT by applying a CNN decoder to obtain the segmentation results, successfully applying Transformer to semantic segmentation tasks and achieving SOTA performance. Although the ViT architecture explores a feasible way to apply Transformer to visual tasks, it ignores local representations.
To obtain local representations in the Transformer architecture, ResT [19] designs a patch embedding layer to obtain hierarchical feature representations. VOLO [20], Swin [22], and CSwin [23] adopt a local window attention mechanism to obtain local representations, similar to the convolution operation. The above methods continuously reach new SOTA performance on semantic segmentation tasks thanks to their multi-stage architectures, such as that of ResNet [9], which are suitable for obtaining multi-resolution features [34]. Such architectures provide Transformer with local information, but it is not as good as that of the CNN. Therefore, fusing the CNN and Transformer is an intuitive idea and has become an important research direction.
CNN and Transformer fusion methods: Fusing the CNN and Transformer is intended to combine the superiority of each method, namely the local information extraction ability of the CNN and the global information extraction ability of Transformer. However, it is hard to fuse both. Therefore, Conformer [35] adopts a parallel structure to exchange features between the local and global branches to maximize the retention of local and global representations. TransFuse [36] incorporates the multi-level features of the CNN and Transformer via the BiFusion module, so that both the global dependencies and the low-level spatial details can be effectively captured. WiCoNet [37] combines a large-scale context branch and a local branch to fuse global and local information, achieving good performance on the BLU, GID, and Potsdam remote sensing datasets. Besides the aforementioned parallel fusion methods, serial fusion methods can also be used. Xiao et al. [38] revealed that early convolutions help Transformers learn better. BoTNet [39] proposes a serial architecture that replaces spatial convolutions with global self-attention, achieving strong performance while being highly efficient. CoAtNet [40] combines the large model capacity of Transformer with the right inductive bias of the CNN, matching the scores of ViT-Huge with 23× less data. ConvTransformer [41] was first proposed for video frame sequence learning and video frame synthesis; it applies a convolutional self-attention layer to encode the sequential dependence and uses a Transformer decoder to capture long-term dependence.
Although the aforementioned fusion methods achieve good performance, there are still some problems: (a) they are trained from scratch, so many existing models and pretrained weights cannot be used; (b) the integral architecture damages the respective characteristics of the CNN and Transformer. To avoid these deficiencies, we propose a new CNN and Transformer fusion and training method called the CCTNet. The CCTNet fuses the CNN and Transformer branches to generate a new branch, and the three branches are trained with different decoders and loss functions so that they keep their respective superiority. Meanwhile, the CNN and Transformer branches can be flexibly replaced by other, better architectures. To achieve better fusion performance, we also employ the LAFM and the CAFM to effectively fuse the local and global features. With the support of the above designs, the CCTNet achieved the best performance on the Barley Remote Sensing Dataset.
Table 1 provides a summary of the related work, including the pure CNN methods, the pure Transformer methods, and the CNN and Transformer fusion methods.
4. Experimental Results
In this section, we first explore the performance of the mainstream CNN and Transformer models on the Barley Remote Sensing Dataset and select the best-performing Transformer model CSwin-Tiny and the most widely used CNN model ResNet-50 as benchmarks to explore the fusion method of the CNN and Transformer. Then, we study the effects of the fusion modules, the LAFM and CAFM, as well as the auxiliary loss functions. Finally, we discuss the combination of different model sizes of the CNN and Transformer to verify the flexibility of our proposed CCTNet.
4.1. Dataset and Experimental Settings
The Barley Remote Sensing Dataset (https://tianchi.aliyun.com/dataset/dataDetail?dataId=74952, accessed on 28 February 2022) covers a rural area of Xingren City, Guizhou Province, China, containing a large number of crop fields. It was collected by an Unmanned Aerial Vehicle (UAV) flying near the ground and includes the three spectral bands of red, green, and blue. It records crops such as cured tobacco, corn, and barley rice and provides a data basis for crop classification and yield prediction. The dataset, shown in Figure 10, contains two ultra-high-resolution images, image_1 and image_2. A total of four categories, including background (white), cured tobacco (green), corn (yellow), and barley rice (red), are labeled as Classes 0, 1, 2, and 3. Apart from these four categories, the rest of the image is transparent. The background category mainly includes buildings, vegetation, and other unimportant crops. The three crops of cured tobacco, corn, and barley rice are divided by region, and the soil inside a crop field is also labeled as the corresponding crop category. The crop fields are not all regular, but the same crop is more likely to be distributed within the same area.
As can be seen from the ground truth in Figure 10, the category distributions of the two images are inconsistent. In image_1, cured tobacco and barley rice occupy the larger proportion and corn a small proportion, whereas in image_2, corn and barley rice occupy the larger proportion and cured tobacco a small proportion. This imbalance affects model training and leads to low performance. Therefore, we divided the training set and test set after cutting the two images into many equally sized patches, as shown in Figure 10. The training and test regions were interval-sampled, and the completely transparent parts were discarded. Finally, we obtained 44 samples for training and 41 samples for testing. It is worth noting that, for the remaining areas smaller than the patch size, we reversed the cutting direction to obtain a full-sized patch; see the rightmost column and the bottom row of image_1 and image_2. In fact, due to memory limitations, these patches were still too large to be used directly for training and testing, and further processing was required. As shown in Figure 11, we used an overlapping sliding window [32,42] to select training and testing data online and to discard the completely transparent parts. Finally, the window-sized patches were restored to the larger patches, and all of these patches were spliced together to recover image_1 and image_2 at their original resolutions.
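The tiling-and-stitching procedure can be illustrated with the following NumPy sketch; the window size, stride, and logit-averaging rule here are illustrative assumptions rather than the exact values used for this dataset.

```python
# Sketch of overlapping sliding-window inference and stitching for a large image.
import numpy as np


def sliding_window_predict(image, predict_fn, num_classes, window=512, stride=256):
    """Run predict_fn on overlapping windows and average logits where windows overlap."""
    h, w, _ = image.shape
    logits = np.zeros((num_classes, h, w), dtype=np.float32)
    counts = np.zeros((h, w), dtype=np.float32)
    ys = list(range(0, max(h - window, 0) + 1, stride))
    xs = list(range(0, max(w - window, 0) + 1, stride))
    # Make sure the last row/column of windows reaches the image border.
    if ys[-1] + window < h:
        ys.append(h - window)
    if xs[-1] + window < w:
        xs.append(w - window)
    for y in ys:
        for x in xs:
            patch = image[y:y + window, x:x + window]
            logits[:, y:y + window, x:x + window] += predict_fn(patch)  # (C, win, win)
            counts[y:y + window, x:x + window] += 1.0
    return np.argmax(logits / counts, axis=0)  # per-pixel class map


# Usage with a dummy "model" that returns uniform logits:
dummy = lambda patch: np.zeros((4, patch.shape[0], patch.shape[1]), dtype=np.float32)
mask = sliding_window_predict(np.zeros((1200, 900, 3), dtype=np.uint8), dummy, num_classes=4)
```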
The dataset division details, experimental platform, and training settings are listed in Table 2, where TrS, TeS, DL, CE, and LR denote Training Size, Testing Size, Deep Learning, Cross-Entropy, and Learning Rate, respectively. The evaluation metrics follow the official advice, including Precision, Recall, F1, Overall Accuracy (OA), and mean Intersection over Union (mIoU), as defined in Formulae (3) and (4). The AdamW optimizer was used to accelerate the training process. The Poly learning rate scheduler was applied to make the training process smoother. The CE loss function is the commonly used segmentation loss function to calculate the error between the prediction and the ground truth. The LR, mini-batch size, and number of epochs were manually adjusted according to the evaluation results.

$$
Precision = \frac{TP}{TP+FP}, \quad Recall = \frac{TP}{TP+FN}, \quad F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}, \quad OA = \frac{TP+TN}{TP+TN+FP+FN} \tag{3}
$$

$$
IoU = \frac{TP}{TP+FP+FN}, \quad mIoU = \frac{1}{K}\sum_{k=1}^{K} IoU_{k} \tag{4}
$$

where T, F, P, and N represent true, false, positive, and negative, respectively. TP denotes the truly predicted positive pixels, TN the truly predicted negative pixels, FP the falsely predicted positive pixels, and FN the falsely predicted negative pixels. The IoU is computed per category, and the mIoU is the average IoU over all K categories. Among these metrics, F1, OA, and mIoU are the main reference indicators.
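For reference, the metrics above can be computed from a class confusion matrix as in the following NumPy sketch (rows are ground truth, columns are predictions); the example matrix is purely illustrative.

```python
# Per-class Precision/Recall/F1/IoU plus OA and mIoU from a KxK confusion matrix.
import numpy as np


def segmentation_metrics(conf):
    conf = conf.astype(np.float64)
    tp = np.diag(conf)                      # truly predicted positives per class
    fp = conf.sum(axis=0) - tp              # falsely predicted positives per class
    fn = conf.sum(axis=1) - tp              # falsely predicted negatives per class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    oa = tp.sum() / conf.sum()              # overall accuracy over all pixels
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, oa, iou.mean()


# Example with a 4-class confusion matrix (background, cured tobacco, corn, barley rice):
conf = np.array([[50, 2, 1, 1],
                 [3, 40, 2, 0],
                 [1, 1, 30, 3],
                 [0, 1, 2, 45]])
precision, recall, f1, oa, miou = segmentation_metrics(conf)
print(f"OA={oa:.3f}, mIoU={miou:.3f}")
```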
4.2. Methods’ Comparison on the Barley Remote Sensing Dataset
This part first explores the performance of classic CNN and the latest Transformer models on the Barley Remote Sensing Dataset, including BiseNet-V2 [43], FCN [8], UNet [25], FPN [31], DANet [15], ResNet-50* [9] (* denotes the segmentation application of ResNet), PSPNet [30], and DeepLab-V3 [28]. It is worth noting that DANet [15], PSPNet [30], and DeepLab-V3 [28] all use dilated ResNet-50 as the backbone, which uses dilated convolution instead of downsampling convolution to keep high-resolution representations. FPN and ResNet-50* are based on the original ResNet-50 backbone with an MSF decoder. The Transformer-based models all use a multi-stage structure like ResNet-50* to fuse multi-scale features, which can better combine local details and deep semantic information.
The experimental results are shown in Table 3. Taking the mIoU as the main comparison indicator, it can be seen that BiseNet-V2 [43] and FCN [8], which are stacked from simple convolutions, achieved 62.18% and 64.51%, respectively. Using the encoder–decoder method of UNet [25] to fuse shallow features achieved a 0.32% improvement compared with FCN. By introducing the attention mechanism or incorporating multi-scale features, FPN, DANet, ResNet-50*, PSPNet, and DeepLab-V3 had mIoU scores close to 70%. We can conclude that the attention mechanism and multi-scale fusion methods can significantly improve segmentation performance, benefiting from the added global and local information. Transformer models that use multi-scale feature fusion, such as ResT-Tiny [19], Swin-Tiny [22], and VOLO-D1 [20], obtained good mIoU scores, but were slightly weaker than the CNN models. However, CSwin-Tiny [23], which uses Locally Enhanced Positional Encoding (LEPE), significantly outperformed the other Transformer models. As the Transformer model lacks local information, the LEPE enhances CSwin-Tiny's ability to extract local features, which shows the significance of local information.
Finally, in order to pursue the best performance while keeping the model size flexible, we adopted the best-performing CSwin as the Transformer benchmark and the flexible ResNet as the CNN benchmark. CSwin provides four model sizes, Tiny, Small, Base, and Large, and ResNet includes four model sizes, ResNet-18, ResNet-34, ResNet-50, and ResNet-101. The subsequent experiments were based on the Transformer branch of CSwin-Tiny and the CNN branch of ResNet-50, and the decoder adopts the MSF decoder in Figure 5. The experimental settings followed those in Table 2.
4.3. Study of the CNN and Transformer Fusion Modules
Because of the huge diversity between CNN and Transformer features, directly merging these two kinds of features is a big challenge and results in poor performance. Existing methods, such as Conformer [35] and TransFuse [36], fuse the CNN and Transformer at a shallow level and feed the fused features into the subsequent training of the CNN and Transformer branches. Such a method confuses the characteristics of the CNN and Transformer, making their respective advantages fade. Therefore, we propose the Light Adaptive Fusion Module (LAFM) and the Coupled Attention Fusion Module (CAFM) to fuse features without damaging their diversity. In this section, the placement of the LAFM and CAFM at the four positions that generate CT1, CT2, CT3, and CT4 is discussed in the experiments. The four positions are located behind the four stages of the CNN and Transformer branches, where the CNN and Transformer features are fused. Moreover, some detailed structural designs are also discussed.
4.3.1. The Location Settings of the LAFM and the CAFM
To generate the four fused features CT1, CT2, CT3, and CT4, the fusion module at each corresponding position is selected from either the LAFM or the CAFM. ①, ②, ③, and ④ represent the positions that generate CT1, CT2, CT3, and CT4, respectively. "∖" means that the marked module was not used. Taking the mIoU as the main reference indicator, the first line of Table 4 shows that, without fusion modules, the score was lower than using CSwin-Tiny alone, reaching only 71.12%. This shows that simply fusing the features of the CNN and Transformer does not retain their respective advantages, but makes the result worse. The second line uses the LAFM to replace the original rough fusion method, and the mIoU increased by 0.36%, proving the effectiveness of the LAFM. The third and fourth lines replace position ③ or ④ with the CAFM, which further improved the performance. When both positions ③ and ④ were replaced by the CAFM, the best mIoU score of 72.07% was obtained, which was 0.59% higher than the second experiment, indicating that the CAFM is effective and works better than the LAFM.
However, the effect of the CAFM is related to its position. For example, when set at positions ②④, the score was not as good as at position ④ alone, since the CAFM is an attention module that usually works better on semantically rich features, such as those at positions ③ and ④. Moreover, the number of CAFMs matters: placing the CAFM at positions ②③④ yielded poor performance. In addition, when the CAFM was placed at position ① or ②, it took up much memory. Consequently, the LAFM is more lightweight and suitable for positions ① and ②, whereas the CAFM performs better but has a larger computational burden and is suitable for positions ③ and ④, where it fuses semantically rich features.
4.3.2. The Structure Settings of the LAFM and the CAFM
To make the LAFM and the CAFM work well, some special structural designs are required. The LAFM includes a residual connection to fuse shallow layers and accelerate model convergence. The CAFM consists of two major parts, the global-to-local (G → L) module and the local-to-global (L → G) module, which generate two fused features. After obtaining the two fused features, there are two ways to merge them (Concat and add). The Concat operation splices the two features along the channel dimension and better retains the original information, whereas the add operation fuses the two features by elementwise summation, which may confuse the local and global information. Here, we performed an ablation study (see Table 5) by adding or removing the residual connection in the LAFM. Furthermore, the choices of the G → L and L → G modules are discussed, and both the Concat and add methods were tried when fusing the G → L and L → G outputs in the CAFM.
Table 5 shows the experimental results. Taking the mIoU as the main reference, the analysis is given below. First, looking at the top three lines, when only G → L or L → G was used, the mIoU was 71.38% or 71.55%, respectively, a significant drop compared with the 72.07% obtained by using the two modules simultaneously. This shows that the CNN and Transformer can complement each other to improve the performance. The fourth line replaces the fusion mode with add, and the mIoU decreased considerably, by 0.85%, indicating that the Concat fusion method is significantly better than add. The fifth line removes the residual connection in the LAFM, and the mIoU was reduced by 0.55%, proving that the residual structure is important in the LAFM. Through this exploration of the structural designs, we determined the appropriate structure settings for the LAFM and the CAFM and achieved good performance.
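As a rough illustration of these design choices, the sketch below contrasts a residual-style light fusion with a coupled G → L / L → G fusion merged by Concat; the gating blocks are simplified stand-ins and do not reproduce the exact LAFM/CAFM internals.

```python
# Simplified stand-ins for the two fusion styles discussed above.
import torch
import torch.nn as nn


class LightFusion(nn.Module):
    """LAFM-style fusion: project the concatenated features, keep a residual path."""
    def __init__(self, channels, residual=True):
        super().__init__()
        self.residual = residual
        self.proj = nn.Sequential(nn.Conv2d(2 * channels, channels, 1),
                                  nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, f_cnn, f_trans):
        fused = self.proj(torch.cat([f_cnn, f_trans], dim=1))
        return fused + f_cnn + f_trans if self.residual else fused


class CoupledAttentionFusion(nn.Module):
    """CAFM-style fusion: G->L and L->G reweighting, merged by Concat (or add)."""
    def __init__(self, channels, mode="concat"):
        super().__init__()
        self.mode = mode
        self.gate_g2l = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                      nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.gate_l2g = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())
        out_in = 2 * channels if mode == "concat" else channels
        self.proj = nn.Conv2d(out_in, channels, 1)

    def forward(self, f_local, f_global):
        g2l = f_local * self.gate_g2l(f_global)   # global context gates local detail
        l2g = f_global * self.gate_l2g(f_local)   # local saliency gates global context
        merged = torch.cat([g2l, l2g], dim=1) if self.mode == "concat" else g2l + l2g
        return self.proj(merged)


# Usage on two stage-3-like features with 128 channels:
f_c, f_t = torch.randn(1, 128, 32, 32), torch.randn(1, 128, 32, 32)
shallow = LightFusion(128)(f_c, f_t)                         # lightweight, residual
deep = CoupledAttentionFusion(128, mode="concat")(f_c, f_t)  # richer, Concat-merged
```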
4.4. Ablation Experiments of the Auxiliary Loss Function
The auxiliary loss function plays a vital role in learning more effective representations in semantic segmentation tasks. It works only in the training stage and can be completely discarded during inference, so it adds no inference cost. A total of three auxiliary loss functions were used in this paper: Aux Loss CT3 supervises the feature CT3, and Aux Loss C and Aux Loss T optimize the CNN and Transformer branches. In Table 6, we discuss the experimental settings with (w/) or without (w/o) these auxiliary loss functions, as well as the case where Aux Loss C and T share the same decoder. The detailed analysis is as follows:
The experiment in the first line removed Aux Loss C and T; compared with the second line, the mIoU score decreased by 0.82%. Without the extra supervision of Aux Loss C and T, the model generated poor features in the CNN and Transformer branches. Moreover, the auxiliary loss functions also synchronize the learning progress of the CNN and Transformer, achieving coupled optimization. The third experiment shared the same parameters in the decoder, and the mIoU dropped sharply: because the diversity between the CNN and Transformer is large, the same decoder parameters cannot accommodate both at the same time. The last experiment removed Aux Loss CT3, and the mIoU dropped slightly, by 0.31%. The purpose of Aux Loss CT3 is to make CT3 learn better features, so that the following CT4 can also learn good features. Therefore, in semantic segmentation tasks, it is common and effective to add an auxiliary loss function at the third output feature of the encoder. Consequently, auxiliary loss functions are very important in the fusion of the CNN and Transformer, especially for the CCTNet proposed in this paper, which relies on independent and good features from the CNN and Transformer. Of course, we could also train the CNN and Transformer branches separately and then freeze their weights, which would keep their respective characteristics. However, this practice increases the complexity of the model design and is not an end-to-end architecture like our proposed CCTNet.
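As a simple illustration, the following sketch shows one way the main loss and the three auxiliary cross-entropy losses could be combined; the loss weight and tensor shapes are assumptions for demonstration, not the paper's exact settings.

```python
# Main loss on the fused output plus auxiliary losses on the CNN branch,
# the Transformer branch, and the intermediate fused feature CT3.
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()


def training_loss(logits_fused, logits_cnn, logits_trans, logits_ct3, target,
                  aux_weight=0.4):
    main = ce(logits_fused, target)
    aux = ce(logits_cnn, target) + ce(logits_trans, target) + ce(logits_ct3, target)
    return main + aux_weight * aux


# Example with random 4-class logits for two 512x512 patches:
logits = [torch.randn(2, 4, 512, 512, requires_grad=True) for _ in range(4)]
target = torch.randint(0, 4, (2, 512, 512))
loss = training_loss(*logits, target)
loss.backward()  # at inference time, the auxiliary heads are simply discarded
```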
4.5. Results of Different CNN and Transformer Model Sizes
Because of the independent design of the CNN and Transformer branches, the CCTNet makes it easy to adjust the sizes of the CNN and Transformer models, meaning that pretrained weights can be reused and the model can more easily be adapted to different downstream tasks. This section discusses the combinations of different CNN and Transformer model sizes on the Barley Remote Sensing Dataset in Table 7. The CNN was selected from ResNet-18, ResNet-34, ResNet-50, and ResNet-101, where a larger number means a larger model. The Transformer was selected from CSwin-Tiny, CSwin-Small, and CSwin-Base, whose sizes increase in turn.
Fixing the Transformer to CSwin-Tiny, the mIoU gradually increased as the CNN changed from ResNet-18 to ResNet-50, but the mIoU of ResNet-101 decreased slightly, indicating that the CNN model size significantly affects the performance. When the CNN branch was ResNet-50, the accuracy saturated. Therefore, we fixed the CNN to ResNet-50 and changed the Transformer size; the mIoU showed a small decline. It can be seen that the Transformer size did not affect the final accuracy very much, and CSwin-Tiny can already meet the CCTNet's need for global information. Furthermore, since the Transformer branch accounts for most of the computation, using the lightweight Transformer model is the better choice in this paper.
4.6. Study on the Improvements of Each Category
In this section, we compare the performance of ResNet-50* [9], CSwin-Tiny [23], and the CCTNet on the IoU of the four categories: background, cured tobacco, corn, and barley rice. It can be seen from Table 8 that the Transformer method was more effective: the Transformer-based CSwin-Tiny was 0.25%, 0.48%, 1.57%, and 1.91% higher than the CNN-based ResNet-50* on the four categories, especially on corn and barley rice. The reason is that the background and cured tobacco categories are concentrated in continuous regions without much interference, and the background is easier to recognize, so both the CNN and Transformer classified them correctly more easily. However, the corn and barley rice categories are scattered, appear alternately, and interfere with each other; global context is therefore needed, and Transformer performed better on these two categories. Compared with ResNet-50*, the CCTNet achieved IoU gains of 1.00%, 1.28%, 2.48%, and 2.81%, respectively, and performed particularly well on corn and barley rice, indicating that the CCTNet inherits the advantages of the Transformer structure. Compared with CSwin-Tiny, the CCTNet increased the IoU scores by 0.75%, 0.80%, 0.91%, and 0.90%, respectively, which benefited from the introduction of the CNN for improving the segmentation of edge details. Meanwhile, we analyzed the inference speed of each model, measured in processed images per second (img/s). The CCTNet showed only a slight decrease in inference speed, which indicates that the mIoU improvement was not obtained at the expense of significantly increased execution time.
In Figure 12, we compare the prediction maps of ResNet-50* [9], CSwin-Tiny [23], and the CCTNet on the local details and global classification. It can be seen that the CCTNet performed better on edge and detail processing than CSwin-Tiny and better on global classification than ResNet-50*. This shows that the CCTNet successfully combines the advantages of the CNN in local modeling and Transformer in global modeling and achieved good results in the crop remote sensing segmentation task.
5. Discussion
Some methods for improving model performance at inference time are introduced here, including the Overlapping Sliding Window (OSW), Testing Time Augmentation (TTA), and Post-Processing (PP) methods to remove small objects and small holes. Due to the large image resolution and memory limitations, we cut the image into pieces before running the model, resulting in incomplete objects at the seams and degraded segmentation [44]. Therefore, the OSW method was introduced: an overlap is retained between adjacent windows during inference, so that the patch borders fall inside an overlapped region, thus reducing the influence of incomplete objects. TTA fuses the predictions of several transformed versions of the input by averaging the results to obtain better performance; in this paper, horizontal flipping and vertical flipping were used to generate the different inputs. Moreover, we observed that the category regions in the Barley Remote Sensing Dataset are mainly large, and small areas are very rare in this dataset. Therefore, we specifically propose a post-processing method to remove small objects and small holes. The specific process is as follows: first, calculate the number of pixels in each connected area; then, set a pixel-count threshold for connected areas (40,000 pixels in this paper); finally, replace the objects or holes smaller than the threshold with the surrounding pixels.
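The post-processing step can be sketched as follows with SciPy connected-component labeling; the nearest-neighbour filling rule is an assumption consistent with "replacing with surrounding pixels", not necessarily the exact implementation used here.

```python
# Remove connected regions below a pixel-count threshold and refill them from
# the nearest surviving pixel.
import numpy as np
from scipy import ndimage


def remove_small_regions(mask, min_pixels=40_000):
    """Replace connected regions smaller than min_pixels with surrounding labels."""
    keep = np.zeros(mask.shape, dtype=bool)
    for cls in np.unique(mask):
        labeled, num = ndimage.label(mask == cls)   # connected components of this class
        sizes = np.bincount(labeled.ravel())
        large = np.isin(labeled, np.nonzero(sizes >= min_pixels)[0]) & (labeled > 0)
        keep |= large
    # Fill every removed pixel with the label of its nearest kept pixel.
    _, (iy, ix) = ndimage.distance_transform_edt(~keep, return_indices=True)
    return mask[iy, ix]


# Example: a tiny blob of class 2 inside a large class-1 field gets absorbed.
demo = np.ones((300, 300), dtype=np.int64)
demo[100:105, 100:105] = 2
cleaned = remove_small_regions(demo, min_pixels=1_000)
assert (cleaned == 1).all()
```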
Table 9 and Figure 13 display the improvements from using the above three methods. Without OSW, the seams between patches had obvious edge artifacts. When using OSW, the edge parts were significantly improved, and the mIoU increased by 0.47%, which shows the usefulness of OSW for improving the edges. When using TTA, the mIoU further increased by 0.16%, but there were still some small wrongly predicted regions and holes. When PP was used, the mIoU increased by another 0.27%, and these mispredicted parts were replaced with the surrounding categories, making the results look cleaner and more complete. Compared with the modest mIoU gains, the improvement in visual quality is more important.
6. Conclusions
In this paper, we analyzed challenging problems in crop remote sensing image segmentation, such as incomplete objects at patch edges and the imbalance between global and local information, which severely damage performance. To solve these problems, the model should combine the advantages of the CNN in local modeling and Transformer in global modeling. Therefore, we proposed the CCTNet to fuse the features of the CNN and Transformer branches while keeping their respective advantages. Furthermore, to train the CCTNet end to end, auxiliary loss functions were introduced to supervise the optimization process. To smoothly fuse the features of the CNN and Transformer, we proposed the LAFM and the CAFM to selectively fuse their advantages while suppressing their drawbacks. With the above methods, our CCTNet achieved a 1.89% mIoU improvement over the CNN benchmark ResNet-50* and a 0.84% mIoU improvement over the Transformer benchmark CSwin-Tiny, which proves the importance of combining the CNN and Transformer for the crop segmentation task. In addition, three methods were introduced to further improve the performance at inference time: the overlapping sliding window to eliminate edge artifacts, testing time augmentation to enhance stability, and post-processing to remove small objects and holes and obtain clean and complete prediction maps. Applying these three methods brought a further 0.9% increase in the mIoU, finally reaching 72.97% mIoU on the Barley Remote Sensing Dataset. The ability of the current CCTNet may be limited, but thanks to the flexibility of its structure, it can be continuously improved as CNN and Transformer methods develop. In the future, we will consider introducing multi-spectral crop data to improve the classification of the challenging corn and barley rice categories.