2.2. Data and Pre-Processing
UAV remote sensing images were acquired in the field on 4 July 2022 (local time). The data were collected mainly with the drone hovering. The platform was a DJI Phantom 4, which is equipped with six vision sensors. During data collection, the relative distance between the UAV and the ground was maintained at 200 m, and each camera had a resolution of 2 million pixels, yielding clear and stable imaging. The images had a spatial resolution of 1 m and comprised three bands: red, green, and blue. In the data collection phase, we first conducted field surveys in the collection area to analyze the distribution and features of the tree species and to divide the area by tree species. We then performed preliminary data acquisition under clear, cloudless conditions, capturing a total of 1247 initial images covering all categories in the classification system. After data collection, we performed radiometric calibration on the remote sensing images using the pseudo-standard feature radiometric correction method [34] in the image processing software ENVI (version 5.31). This method achieves radiometric calibration by establishing a linear relationship between the measured image values (DN) and the actual ground reflectance. Ten standard reflectance reference plates for aerial experiments were positioned uniformly around the data acquisition point. The DN values corresponding to these plates were extracted from the UAV images. For each band, the DN values of the reference plates and their known reflectance values were used to establish an equation that converts the DN values of the UAV images into reflectance. The radiometric correction obeys the following equation:
$$R_T = \frac{DN_T}{DN_R} \times R'$$
Here, $R_T$ is the reflectance of the target feature, $DN_T$ is the mean DN value of the target feature, $DN_R$ is the mean DN value of the standard reflectance reference plate used in the aerial photography experiments, and $R'$ is the reflectance of the reference plate.
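Assuming the correction takes the ratio form R_T = (DN_T / DN_R) × R′ suggested by the symbol definitions above, the conversion can be sketched in a few lines of NumPy. The function name `dn_to_reflectance` and all numeric values are hypothetical, and a single reference plate per band is assumed for brevity, whereas the procedure described above uses ten plates.

```python
import numpy as np

def dn_to_reflectance(dn_image, dn_plate_mean, plate_reflectance):
    """Scale per-band DN values by (plate reflectance / mean plate DN)."""
    dn_image = np.asarray(dn_image, dtype=np.float64)
    scale = (np.asarray(plate_reflectance, dtype=np.float64)
             / np.asarray(dn_plate_mean, dtype=np.float64))
    return dn_image * scale  # broadcasts over the trailing band axis

# Hypothetical 2 x 2 RGB image and one reference plate (values per band).
dn = np.array([[[100, 120, 80], [50, 60, 40]],
               [[200, 240, 160], [0, 0, 0]]])
plate_dn = np.array([200.0, 240.0, 160.0])   # mean DN of the plate
plate_r = np.array([0.5, 0.5, 0.5])          # known plate reflectance
reflectance = dn_to_reflectance(dn, plate_dn, plate_r)
```

With several plates, the per-band scale would more plausibly come from a least-squares fit of known reflectance against DN rather than from a single ratio.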
To make the model more robust by increasing the number of training samples, we applied data augmentation operations such as rotations, shifts, and flips, yielding a total of 2500 images for training. Meanwhile, to facilitate the subsequent extraction of image spectral features by AMDNet, we used Labelme (version 4.60) to construct the corresponding tree species sample set by manual visual interpretation under the guidance of professionals. Finally, we randomly selected four-fifths of the samples for model training and used the remaining samples for validation. The numbers of samples for each tree species are shown in Table 1.
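The augmentation and the 4:1 split described above can be illustrated with a small NumPy sketch. The helper names `augment` and `train_val_split`, the shift range, the use of 90-degree rotations, and the wrap-around shift are all assumptions made for illustration; the exact parameters used in the study are not stated.

```python
import numpy as np

def augment(image, rng, max_shift=8):
    """Randomly rotate (multiples of 90 degrees), shift, and flip an H x W x C image."""
    out = np.rot90(image, k=int(rng.integers(0, 4)), axes=(0, 1))
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(out, shift=(int(dy), int(dx)), axis=(0, 1))  # cyclic shift
    if rng.random() < 0.5:
        out = out[:, ::-1]   # horizontal flip
    if rng.random() < 0.5:
        out = out[::-1, :]   # vertical flip
    return np.ascontiguousarray(out)

def train_val_split(n_samples, rng, train_fraction=0.8):
    """Return shuffled index arrays for the training and validation subsets."""
    idx = rng.permutation(n_samples)
    n_train = int(round(train_fraction * n_samples))
    return idx[:n_train], idx[n_train:]

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))            # stand-in for one RGB tile
augmented = augment(image, rng)
train_idx, val_idx = train_val_split(2500, rng)
```

For segmentation, the same geometric transform would also have to be applied to the label mask so that pixels and labels stay aligned.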
2.3. Methods
In this part of the paper, we present a new solution to the segmentation problem in remote sensing tree species images. Our proposed model, AMDNet, comprises several targeted modules that are optimized to extract tree species information more accurately. Considering the small sample size, high information overlap, image background aggregation, and complex scenes in the classification of remote sensing tree species, we developed a series of customized modules and model strategies tailored to the distinct characteristics of tree species information: (1) the dual-residual attention module; (2) structural re-parameterization; and (3) an improved training strategy.
2.3.1. Dual-Residual Attention Module
The final process of the model’s implementation aims to accurately predict the tree species, which entails the resolution of a pixel-level prediction problem. Addressing pixel-level prediction problems requires the integration of both local features and global context information. The local features and global dependence of tree species information inherently show a strong correlation, which allows the attention module to play a significant role in enhancing the model.
To enable the adaptive integration of both local features and global dependence, a dual-attention module is incorporated at the head of the model; that is, two modules are attached to the traditional expansion network to simulate the semantic dependencies of spatial and channel dimensions, respectively.
As shown in Figure 2, the position attention module feeds the feature map A (C × H × W) into three convolutional layers to obtain three feature maps, each of which is reshaped to C × N (N = H × W). Next, the transpose (N × C) of the first reshaped feature is multiplied by the second reshaped feature (C × N), and Softmax is applied to the result to yield a spatial attention map of size N × N, that is, (H × W) × (H × W). The third reshaped feature (C × N) is then multiplied by the transpose of this attention map, and the product is multiplied by the scaling coefficient α (initialized to 0 and gradually learned to obtain larger weights). After being reshaped back to C × H × W, the result is added to the original feature map A to produce the final output E (C × H × W).
Similarly, as illustrated in Figure 2, the channel attention module reshapes the original feature map (C × H × W) to C × N and, separately, reshapes and transposes it to N × C. The two resulting maps are multiplied, and Softmax is applied to obtain the channel attention map (C × C). Next, the transpose of this attention map is multiplied by the reshaped original map (C × N), and the product is multiplied by the scaling coefficient β (initialized to 0 and gradually learned to obtain larger weights). The resulting feature map is then reshaped back to C × H × W and added to the original feature map to generate the output E (C × H × W).
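The shape bookkeeping of the two attention modules can be checked with a minimal NumPy sketch. It is not the AMDNet implementation: the three convolutional layers are replaced by hypothetical C × C matrices (`w_1`, `w_2`, `w_3`) acting as 1 × 1 convolutions, and the channel Softmax is assumed to normalize over the first index so that the transposed map forms a weighted average of channels.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def position_attention(a, w_1, w_2, w_3, alpha):
    """a: (C, H, W). Returns the output E (C, H, W) and the (N, N) attention map."""
    c, h, w = a.shape
    flat = a.reshape(c, h * w)                       # (C, N)
    f1, f2, f3 = w_1 @ flat, w_2 @ flat, w_3 @ flat  # three (C, N) feature maps
    s = softmax(f1.T @ f2, axis=1)                   # (N, N), each row sums to 1
    out = alpha * (f3 @ s.T)                         # (C, N)
    return out.reshape(c, h, w) + a, s

def channel_attention(a, beta):
    """a: (C, H, W). Returns the output E (C, H, W) and the (C, C) attention map."""
    c, h, w = a.shape
    flat = a.reshape(c, h * w)                       # (C, N)
    x = softmax(flat @ flat.T, axis=0)               # (C, C), each column sums to 1
    out = beta * (x.T @ flat)                        # (C, N)
    return out.reshape(c, h, w) + a, x

rng = np.random.default_rng(0)
c, h, w = 4, 5, 6
a = rng.normal(size=(c, h, w))
w_1, w_2, w_3 = (rng.normal(size=(c, c)) for _ in range(3))
e_pos, s = position_attention(a, w_1, w_2, w_3, alpha=0.0)
e_ch, x = channel_attention(a, beta=0.0)
e_pos_learned, _ = position_attention(a, w_1, w_2, w_3, alpha=0.5)
```

With α and β at their initial value of 0, both modules reduce to the identity, so training starts from the unmodified feature map and the attention terms are blended in as the coefficients are learned.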
The dual-attention residual network is essentially a stack of multiple dual-attention modules. Each attention module is divided into two parts: the residual branch (referred to below as the mask branch) and the trunk branch. The trunk branch can be any current convolutional neural network model; in our case, MobileNetV2 is selected for feature processing. The mask branch uses a combination of bottom-up and top-down attention to learn an attention feature map whose dimensions match those of the trunk branch output. The feature maps of the two branches are then combined by an element-wise (dot) product to obtain the final output feature map.
If the output feature map of the trunk branch is $Q_{i,F}(x)$ and the output feature map of the mask branch is $P_{i,F}(x)$, the final output feature of the attention module, denoted here by $H_{i,F}(x)$, is:
$$H_{i,F}(x) = P_{i,F}(x) \cdot Q_{i,F}(x)$$
In the attention module, the attention mask functions as a forward feature selector and as a gradient update filter for backpropagation.
In practice, since there are background occlusions and complex scenes in the training images for remote sensing tree species classification, the fusion of multiple attentions is required to process the feature information. If the method of stacking attention modules is not used, a larger number of channels will be needed to cover the combined attention of different factors. Additionally, due to the limited capacity of a single attention module to modify the feature, the model’s fault tolerance rate is decreased, and increasing the number of training iterations may not improve the overall robustness of the model.
Although attention modules play a significant role in target segmentation, adding them to a model indiscriminately degrades the model's performance. This can be attributed to two main reasons:
- To generate a feature map with normalized weights, a Sigmoid activation function must be added to the mask branch. However, because the mask output is normalized to between 0 and 1 before the dot product with the trunk branch, the output response of the feature map is weakened. Stacking this structure over multiple layers progressively reduces the value at each point of the final output feature map.
- The feature map output by the mask branch may undermine the benefits provided by the trunk branch. For example, replacing the shortcut mechanism in the residual connection with the mask branch may cause inadequate gradient transmission in a deep network.
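To solve the problems mentioned above, we use a method similar to residual learning and add the obtained attention feature map element-wise to the trunk feature map, so that the output $H_{i,F}(x)$ of the module, with mask output $P_{i,F}(x)$ and trunk output $Q_{i,F}(x)$, is expressed as follows:
$$H_{i,F}(x) = \left(1 + P_{i,F}(x)\right) \cdot Q_{i,F}(x)$$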
We refer to this learning mechanism as dual-attention residual learning. By capturing the global feature dependencies in the spatial and channel dimensions, this module can model rich contextual dependencies over local features and thereby significantly improve the segmentation results. In addition, it has been observed that employing a decomposition structure to increase the size of the convolution kernel or introducing an effective coding layer at the top of the network can capture richer global information. The addition of more attention modules contributes to a linear improvement in the classification performance of the network. Furthermore, additional attention models can be extracted from feature maps at different depths. With this strategy, the network can easily be extended to hundreds of layers, owing to the residual structure, while still maintaining its robustness to noisy labels.
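The attenuation argument behind attention residual learning can be reproduced with a toy experiment. The sketch below assumes identity trunk branches and random mask logits, which is a deliberate simplification rather than the AMDNet architecture; it compares stacking the plain product (mask × trunk) with stacking the residual form ((1 + mask) × trunk).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def naive_attention(trunk, mask_logits):
    """Plain masking: every value is multiplied by a weight in (0, 1)."""
    return sigmoid(mask_logits) * trunk

def residual_attention(trunk, mask_logits):
    """Attention residual learning: the trunk signal is kept and then re-weighted."""
    return (1.0 + sigmoid(mask_logits)) * trunk

rng = np.random.default_rng(0)
x = rng.random((8, 16, 16)) + 0.5        # positive toy feature map
naive, residual = x.copy(), x.copy()
for _ in range(10):                      # stack ten modules (identity trunks)
    logits = rng.normal(size=x.shape)
    naive = naive_attention(naive, logits)
    residual = residual_attention(residual, logits)

print("mean response, input:   ", x.mean())
print("mean response, naive:   ", naive.mean())
print("mean response, residual:", residual.mean())
```

In this toy setting the plain product shrinks every activation with each added module, whereas the residual form never drops below the trunk signal. In a real network the trunk layers and normalization would keep the residual response from growing unchecked.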
2.3.2. Structural Re-Parameterization
The backbone is the baseline network of the entire model during training and inference and largely determines the upper limit of the model's performance. For AMDNet, MobileNetV2 was selected for its relatively good balance between speed and accuracy, achieved through the inverted residual structure. However, the overall training and inference speed of the model still falls short of expectations. To address this problem, we propose modifying MobileNetV2 using structural re-parameterization.
The specific modifications are shown in Figure 3. Firstly, during training, we introduce an identity branch and a residual branch into the original MobileNetV2, which is equivalent to adding the advantages of ResNet; these branches are later re-parameterized into a single-path structure. Moreover, we modify the location of the branches by directly connecting each layer instead of using cross-layer connections. It has also been demonstrated that both the residual branch and the 1 × 1 convolution (conv_1 × 1) can improve the network's performance compared to the native model. Finally, in the model inference stage, the residual structure used in training is transformed into a single 3 × 3 convolution layer through the OP fusion strategy to facilitate subsequent model deployment and acceleration.
The re-parameterization process in the model inference stage is essentially a process of OP fusion and OP replacement. Firstly, the Conv3 × 3 + BN layer and the Conv1 × 1 + BN layer are each fused into a single convolution, which can be expressed as follows:
$$D'_i = \frac{\gamma_i}{\sqrt{\sigma_i + \epsilon}}\, D_i, \qquad c'_i = \beta_i - \frac{\gamma_i\, \mu_i}{\sqrt{\sigma_i + \epsilon}}$$
Here, $D_i$ denotes the parameters of the convolution layer before the conversion, $\mu_i$ is the mean of the BN layer, and $\sigma_i$ is the variance of the BN layer; $\epsilon$ is the small constant that BN adds to the variance for numerical stability. The scale factor and the offset factor of the BN layer are represented by $\gamma_i$ and $\beta_i$, respectively, and the weight and bias of the convolution layer after fusion are represented by $D'_i$ and $c'_i$, respectively.
After this step, each fused branch is converted to a Conv3 × 3. For the Conv1 × 1 branch, the 1 × 1 convolution kernel is replaced with a 3 × 3 kernel by placing the value of the 1 × 1 kernel at the center of the 3 × 3 kernel and setting the remaining eight positions to 0. For the identity branch, a 3 × 3 convolution kernel is constructed whose center weight is 1 for the matching input and output channel and whose other weights are all 0, so that the convolution returns the input feature map unchanged.
Finally, the weights W and biases B of all the branches are added together to obtain a single Conv3 × 3 network layer after the fusion.
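The fusion steps above can be verified numerically. The following NumPy sketch is an illustration under stated assumptions, not the AMDNet code: it uses ordinary (dense) stride-1 convolutions without bias rather than MobileNetV2's depthwise layers, applies no BN on the identity branch, and all helper names are hypothetical. It fuses each Conv + BN pair, pads the 1 × 1 kernel to 3 × 3, expresses the identity as a 3 × 3 kernel, and sums the three branches.

```python
import numpy as np

def conv2d(x, weight, bias, pad):
    """Naive stride-1 convolution. x: (Cin, H, W); weight: (Cout, Cin, k, k)."""
    k = weight.shape[2]
    h, w = x.shape[1:]
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((weight.shape[0], h, w))
    for i in range(k):
        for j in range(k):
            out += np.einsum("oc,chw->ohw", weight[:, :, i, j], xp[:, i:i + h, j:j + w])
    return out + bias[:, None, None]

def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BN with fixed running statistics."""
    scale = gamma / np.sqrt(var + eps)
    return (x - mean[:, None, None]) * scale[:, None, None] + beta[:, None, None]

def fuse_conv_bn(weight, gamma, beta, mean, var, eps=1e-5):
    """Fold BN into the preceding bias-free convolution."""
    scale = gamma / np.sqrt(var + eps)
    return weight * scale[:, None, None, None], beta - mean * scale

def pad_1x1_to_3x3(weight):
    """Place the 1 x 1 kernel at the center of a zero 3 x 3 kernel."""
    return np.pad(weight, ((0, 0), (0, 0), (1, 1), (1, 1)))

def identity_as_3x3(channels):
    """3 x 3 kernel that copies each input channel to the same output channel."""
    kernel = np.zeros((channels, channels, 3, 3))
    kernel[np.arange(channels), np.arange(channels), 1, 1] = 1.0
    return kernel

rng = np.random.default_rng(0)
c = 4
x = rng.normal(size=(c, 6, 6))
w3, w1 = rng.normal(size=(c, c, 3, 3)), rng.normal(size=(c, c, 1, 1))

def random_bn():
    # gamma, beta, running mean, running variance (kept positive)
    return rng.normal(size=c), rng.normal(size=c), rng.normal(size=c), rng.random(c) + 0.5

bn3, bn1 = random_bn(), random_bn()
zero = np.zeros(c)

# Training-time structure: three parallel branches.
y_train = (batch_norm(conv2d(x, w3, zero, pad=1), *bn3)
           + batch_norm(conv2d(x, w1, zero, pad=0), *bn1)
           + x)

# Inference-time structure: one fused 3 x 3 convolution.
k3, b3 = fuse_conv_bn(w3, *bn3)
k1, b1 = fuse_conv_bn(w1, *bn1)
kernel = k3 + pad_1x1_to_3x3(k1) + identity_as_3x3(c)
bias = b3 + b1
y_infer = conv2d(x, kernel, bias, pad=1)
```

Because convolution is linear, summing the kernels and biases of the branches gives the same output as summing the branch outputs, which is why the merged layer can replace the multi-branch block at inference time.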
Because introducing a multi-branch residual structure adds multiple gradient flow paths to the network, training this network is similar to training multiple networks at the same time and later integrating them into a single network. This can be viewed as a form of model ensembling, which also improves the robustness of the model to a certain extent. Moreover, adding the 1 × 1 convolution branch and the identity mapping branch further enhances the benefits of multi-branch training. The simpler the network structure in the inference phase, the more effective the model acceleration. Therefore, in the inference stage, the model is transformed into a single-branch structure to improve the memory utilization of the device and thus the inference speed. In this way, the benefits of multi-branch training (high performance) and of single-path inference (high speed and low memory use) can be obtained simultaneously.
2.3.3. Model “Modernize”
In 2022, Liu et al. [31] proposed the idea of modernizing CNN-style models, that is, aligning ConvNet-style network models with Transformer-style models. The aim of this proposal was to explore the design space, test the limits of ConvNets, and break the dominance of the Transformer. It was also suggested that the performance gap between traditional ConvNets and the Vision Transformer may be largely attributable to the training strategy, but the authors did not elaborate on which designs can be used to optimize ConvNets. However, the Vision Transformer paper clearly states that the Transformer structure lacks some of the inductive biases inherent to CNNs, such as translation equivariance and locality; thus, its performance on small- and medium-sized datasets is not particularly good. In the task of remote sensing tree species recognition, owing to the limitations of the datasets and the uneven distribution of data within classes, a model structure suited to large datasets, such as the Transformer, may not be suitable for practical applications. Nevertheless, the training techniques associated with the Transformer may still offer room for improvement.
Therefore, to address the small sample size, uneven distribution of feature information, and large amount of repetitive information in tree species classification, we propose a customized and optimized training strategy. We first added HSV_AUG, Cutmix, RandomScale, and other common techniques to the original training strategy while modifying the training method of the traditional CNN-style model. Before the Transformer, CNNs were mainly trained with SGD and learning rate decay. In our experiments, we tried training the Transformer with SGD and learning rate decay, and the final results were relatively poor, which to a certain extent illustrates the difference in training techniques between CNNs and Transformers. To modernize the CNN model, it is necessary to analyze whether the Transformer's training techniques, namely learning rate decay with Warmup and AdamW-style training, are feasible for CNNs. (The improvement is not limited to the training method but, owing to space limitations, this paper emphasizes the feasibility of applying Warmup + AdamW to remote sensing tree species datasets.)
The inclusion of Warmup in the training is necessary for several reasons. Firstly, the learning rate, which determines the step size, plays a crucial role in achieving optimal performance. An incorrect learning rate can lead to divergence or slow convergence. In neural networks, if the learning rate is set too low, training may become trapped in a local optimum. To address these concerns, a conventional approach is to begin with a higher learning rate during the initial phase of training and then reduce it linearly. The specific strategies are shown in Figure 4 below:
Warmup first increases the learning rate and then decreases it linearly, as shown in Figure 4. The use of Warmup is very important in the training of the Vision Transformer. If Warmup is used with AdamW as the optimizer, training converges normally and produces the desired results. Otherwise, training is difficult to converge and the learning rate is difficult to tune. Accordingly, we conducted a comparative experiment between a group trained with Warmup + AdamW and a CNN trained with SGD; the change in loss during training is shown in Figure 5 below:
It can be seen that, because AdamW uses both first-order and second-order moment estimates of the gradient, it has a clear advantage over SGD in terms of convergence speed. However, in the experiment, the final loss of the group using AdamW alone was higher than that of the AdamW + Warmup group, and the final performance of the model was worse. This discrepancy can be attributed to the instability of AdamW at its initial learning rate. The Warmup strategy starts with a low learning rate for the first few training epochs or iterations and then increases the learning rate linearly or non-linearly towards a preset value as training progresses. In addition, AdamW adaptively adjusts the learning rate during training. If the gradient deviates from the truly optimal direction, the adaptive adjustment mitigates the deviation, and if an update is made in the wrong direction, subsequent adjustments correct it and prevent the distribution from becoming distorted.
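To make the role of the two moment estimates concrete, a single AdamW update can be written out as follows. This is a generic sketch of the standard algorithm with commonly used default coefficients, applied to a hypothetical quadratic objective; it does not reproduce the hyperparameters of the experiments.

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr,
               beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update; t is the 1-based step index."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * weight_decay * param     # decoupled weight decay
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy objective f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
first, _, _ = adamw_step(theta, theta, m, v, t=1, lr=0.05)
for t in range(1, 201):
    theta, m, v = adamw_step(theta, theta, m, v, t, lr=0.05)
```

At the first step the bias-corrected ratio has magnitude close to 1 in every coordinate, so each parameter moves by roughly the full learning rate regardless of the gradient scale. This is one way to see why a small warmup learning rate helps early in training.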
In practice, this training configuration not only helps to mitigate early over-fitting to the mini-batches in the initial stage and to keep the distribution stable, but also helps to maintain the stability of the deeper layers of the model. Compared with the SGD family, AdamW converges faster, which can greatly reduce the training time; this is of practical significance for the collection and analysis of tree species information and is therefore advantageous in forestry engineering applications. Furthermore, Warmup keeps the learning rate low during the first few epochs of training, which helps the model to stabilize. Once the model is relatively stable, training continues at the preset learning rate, which speeds up convergence and improves the final performance of the model.
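A warmup schedule of the kind described here can be sketched as a plain function of the step index. The linear shape of both phases, the function name `warmup_linear_lr`, and the step counts are assumptions for illustration; the actual settings are those listed in the hyperparameter table.

```python
def warmup_linear_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Hypothetical run: 100 steps in total, the first 10 used for warmup.
schedule = [warmup_linear_lr(s, 1e-3, 10, 100) for s in range(101)]
```

The value returned for each step would be passed to the optimizer as its learning rate, so that the early updates stay small until the moment estimates have accumulated.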
Moreover, ablation experiments were conducted on the data augmentation techniques and regularization schemes used for the Transformer, as follows:
The first is HSV color augmentation. The brightness and saturation of each species in the proposed tree species dataset carry a large amount of feature information. In the HSV color model, colors are represented not by the three RGB channels but by their hue, saturation, and value. Augmentation in HSV space can therefore improve the discriminability between different species, so that the model can more easily locate the distinguishing features and thus identify tree species more accurately.
The second is RandomScale, which essentially filters the image with a specified scale factor to construct a scale space, thereby changing the size of the image content or the degree of blur. Applied to the task of this paper, this technique increases the randomness of the images and, in effect, expands the dataset, ultimately improving the robustness of the model.
The third is Cutmix, a technique that combines mixup and cutout. Mixup is likely to generate images that are blurred and unnatural in local areas, which confuses the model and causes it to learn noise, ultimately reducing its performance. In cutout, by contrast, the deleted region is usually replaced with 0 or random noise, which wastes some information during training, reduces the amount of information available to the model, and weakens its ability. Therefore, in the experiment, we used Cutmix to augment the tree species information. Cutmix fills the region cut out of one image with the corresponding region of another image. This retains the advantages of cutout, such as enabling the model to learn the characteristics of the target from a partial view and to attend to less discriminative parts, while being more efficient than cutout. Because the removed region is filled with content from another image, the model can learn the features of two targets at the same time without the confusion caused by mixup, which improves the final performance of the model to a certain extent.
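The Cutmix operation described above can be sketched for a segmentation setting as follows. The function name `cutmix`, the Beta-distributed area ratio, and the choice to paste the label mask together with the pixels are assumptions for illustration; the source does not specify how the labels were handled.

```python
import numpy as np

def cutmix(img_a, lbl_a, img_b, lbl_b, rng, beta=1.0):
    """Paste a random rectangle of (img_b, lbl_b) into a copy of (img_a, lbl_a)."""
    h, w = lbl_a.shape
    lam = rng.beta(beta, beta)                      # fraction of image A to keep
    cut_h = int(h * np.sqrt(1.0 - lam))
    cut_w = int(w * np.sqrt(1.0 - lam))
    cy, cx = int(rng.integers(0, h)), int(rng.integers(0, w))
    y0, y1 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x0, x1 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    img, lbl = img_a.copy(), lbl_a.copy()
    img[y0:y1, x0:x1] = img_b[y0:y1, x0:x1]         # fill the removed region
    lbl[y0:y1, x0:x1] = lbl_b[y0:y1, x0:x1]         # keep labels aligned with pixels
    return img, lbl, (y0, y1, x0, x1)

rng = np.random.default_rng(0)
img_a, lbl_a = np.zeros((32, 32, 3)), np.zeros((32, 32), dtype=np.int64)
img_b, lbl_b = np.ones((32, 32, 3)), np.full((32, 32), 3, dtype=np.int64)
mixed_img, mixed_lbl, box = cutmix(img_a, lbl_a, img_b, lbl_b, rng)
```

Unlike cutout, no pixel in the mixed sample is blank, and unlike mixup, every pixel comes from exactly one source image, so each pixel keeps a single unambiguous label.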
Our complete set of hyperparameters can be found in Table 2. On the remote sensing tree species dataset used for AMDNet, most of the training configurations produced significant performance improvements for the whole model. The success of the Vision Transformer is also related to its particular training scheme, which has potential implications for future research in forestry remote sensing.
2.3.4. Implementation Details
Based on the dual-attention module and the structurally re-parameterized feature extraction network described above, we introduce AMDNet, a new model for tree species feature information processing that builds upon the DeepLabV3+ framework. The overall structure is shown in Figure 6 below. Next, the implementation process of the model is explained.
Firstly, the dual-residual-attention module is positioned at the head of the model to facilitate the identification and reinforcement of feature information. After the external data are fed into the model for training, the initial information is channeled, respectively, into two feature information enhancement paths, the position and channel. In each path, the feature information of each branch block is integrated with that of the preceding block to obtain the enhanced outputs of the position and channel, respectively. Upon merging the two outputs, they are fused with the initial up-sampling information. From a macro-perspective, this module uses two convolutional channels to process the initial feature information and weighs the unique information of different tree species in the image so that the segmentor will spend more “energy” on the unique information in the subsequent segmentation learning and improve the classification performance of the model.
Next, the feature map output by the attention module is passed to the segmentor. Although the data have entered the segmentation module, feature extraction is still needed. Here, we opted for the lighter MobileNetV2 instead of the ResNet used by DeepLabV3+, for two reasons. Firstly, adding the attention module to the head of the original encoder–decoder structure significantly improves accuracy but also induces a substantial decrease in speed. Considering subsequent model deployment and practical applications, we chose to make the model lightweight. Secondly, as the attention module at the head has already strengthened the feature information, the feature extraction in the segmentor has only a minimal influence on the final result. In extreme cases (such as those involving a small number of species, i.e., fewer than three), a straight-through network may even function better as the feature extraction network. However, as the proposed dataset has seven categories, the accuracy of a straight-through network is expected to lag behind to some degree. Consequently, taking these two reasons into account, we chose MobileNetV2 as the feature extraction network of the segmentor.
After passing through the feature extraction network, the feature map is fed into an encoder–decoder structure. The spatial pyramid pooling module in the encoder outputs both high-level and low-level semantic features. After being up-sampled once, the high-level semantic features are concatenated with the down-sampled low-level semantic features, and the result is up-sampled to produce the final prediction.
The description above covers the whole training stage of the model, from the data input to the output of the prediction results. As mentioned in Section 2.3.2, the performance of our model in the ablation experiment did not meet expectations, especially in terms of inference speed, which still lagged significantly behind that of SegFormer and other models. Therefore, we propose re-parameterizing the structure of the feature extraction network. In the training stage, the addition of the 1 × 1 convolution and the directly connected residual branches increases the flow of feature information in the model, which enhances its feature extraction ability to a certain extent and improves its robustness. Moreover, this helps to compensate for the slight drop in accuracy caused by abandoning ResNet. In the inference stage, weight fusion is performed on the residual branches to obtain a single-branch network for feature extraction, which greatly reduces the time consumed by the segmentor during inference. In this way, even with the attention mechanism in the model head, the speed of AMDNet is comparable to that of SegFormer.
Next, we turn to the training strategy. As mentioned in Section 2.3.3 above, we initially trained the network with SGD as the optimizer and the poly learning rate strategy, but this did not yield satisfactory results: metrics such as the mIoU could not surpass those of the Transformer. In the subsequent ablation experiments, by applying a series of training strategies and modifying, replacing, and debugging the optimizer, we finally obtained a training strategy more suitable for the classification of remote sensing tree species, namely AdamW + Warmup. Furthermore, several additional techniques were applied to the model (the detailed procedures and experimental data are explained in Section 3), which yielded a significant improvement over the original training strategy. Ultimately, the proposed approach outperformed the Transformer on the tree species dataset.