1. Introduction
Mangroves are vital coastal ecosystems that provide significant ecological, economic, and social benefits [1,2,3]. Extensively distributed across tropical and subtropical regions, they serve as natural coastal barriers, reducing erosion, buffering storm surges, and mitigating climate change impacts by sequestering large amounts of carbon [4,5]. Additionally, mangroves support diverse biodiversity, functioning as nurseries for marine species and habitats for various birds and wildlife [6].
However, over the past few decades, mangroves have faced increasingly severe threats from global climate change, rising sea levels, coastal development, and human activities. These pressures have resulted in a continuous decline in mangrove coverage, habitat fragmentation, and the degradation of ecological functions, with some species even at risk of extinction [7,8]. Consequently, the protection and monitoring of mangrove ecosystems, particularly through species identification and ecological assessment, have become critical from both scientific and practical perspectives [9].
Mangrove species recognition has long been a challenging task due to the significant morphological similarity between species [10] and their frequent co-existence within the same environment. Distinguishing species accurately with the naked eye or traditional measurement methods is inherently difficult. Furthermore, variations in growth stage, external morphology, and canopy structure add to the complexity of species identification [11].
Moreover, mangroves exhibit high species diversity, spatiotemporal heterogeneity, and intricate community structures [12,13]. These factors pose significant challenges to mangrove conservation and research, especially in large-scale ecological monitoring. The inability to accurately classify species directly affects assessments of mangrove health and species distribution.
With the rapid advancement of UAV and spectral technologies, traditional mangrove species classification has increasingly leveraged hyperspectral and multispectral data. For example, Cao [14] utilized UAV hyperspectral images to classify mangrove species on Qi’ao Island, Zhuhai, Guangdong Province, China. By integrating spectral, textural, and height features and applying a support vector machine (SVM), the study achieved an accuracy of 88.66%, demonstrating the effectiveness of UAV hyperspectral data in species recognition. Similarly, Zulfa [15] classified mangrove species in the Matang Mangrove Forest Reserve (MMFR), employing the spectral information divergence (SID) and spectral angle mapper (SAM) algorithms to classify 40,000 trees from 19 mangrove species in medium-resolution satellite images. The SAM algorithm achieved a classification accuracy of 85.21%, and the findings indicated that mangrove species distribution was influenced by human activities. The two algorithms demonstrated complementary strengths in enhancing mangrove species identification.
Wang [16] employed the Extremely Randomized Trees (ERT) algorithm with multi-source remote sensing data to classify mangrove species in Fucheng Town, Leizhou City. By combining optical data (Gaofen-1, Sentinel-2, Landsat-9) and fully polarized SAR data (Gaofen-3), the study achieved a high classification accuracy of 90.13% (Kappa = 0.84). The ERT algorithm outperformed other methods, such as Random Forest and K-Nearest Neighbors (KNN), demonstrating the effectiveness of multi-source and multi-temporal satellite data for mangrove species classification.
Cao [17] combined UAV-based hyperspectral and LiDAR data for fine mangrove species classification on Qi’ao Island, Zhuhai, China. The proposed method, which used the Rotation Forest (RoF) algorithm, achieved an overall accuracy of 97.22%, outperforming other algorithms such as Random Forest (RF) and the Logistic Model Tree (LMT). Incorporating LiDAR-derived canopy height improved classification accuracy by 2.43%.
Zhen [18] utilized the XGBoost algorithm with multi-source satellite remote sensing data to classify mangrove species. By integrating data from WorldView-2, Orbita HyperSpectral, and ALOS-2, they achieved a high classification accuracy of 94.02% for six mangrove species. This study highlights the value of combining spectral, texture, and polarization features to improve mangrove species mapping and support conservation efforts.
Meanwhile, Yang [19] compared the performance of four machine learning algorithms—Adaptive Boosting (AdaBoost), eXtreme Gradient Boosting (XGBoost), Random Forest (RF), and Light Gradient Boosting Machine (LightGBM)—for classifying mangrove species using both multispectral and hyperspectral data. The study found that the average recognition accuracy for multispectral data was 72%, while hyperspectral data achieved an average accuracy of 91%. The highest accuracy, 97.15%, was achieved using the LightGBM method.
Although hyperspectral data offer high spectral resolution, capture subtle species differences, and remain robust in complex environments, obtaining high-quality hyperspectral data presents numerous challenges. Variations in lighting, terrain, and occlusions in complex environments like wetlands can significantly affect data quality. Additionally, hyperspectral sensors are expensive, with high hardware and maintenance costs, which limits their widespread application in large-scale or long-term monitoring.
In recent years, significant progress has been made in the segmentation of UAV remote sensing images. With the development of deep learning [20], semantic segmentation networks have achieved remarkable advancements. However, the potential of these networks for mangrove species recognition remains largely unexplored. The Fully Convolutional Network (FCN) [21] was one of the earliest networks applied to semantic segmentation. It replaces fully connected layers with convolutional layers for end-to-end image segmentation. While FCN is simple and efficient, its performance is limited in complex scenarios, particularly when distinguishing subtle differences among mangrove species.
DeepLab v3 [22] and DeepLab v3+ [23] introduced atrous convolution and the Atrous Spatial Pyramid Pooling (ASPP) technique. These improvements significantly enhance segmentation accuracy, especially in capturing multi-scale information. However, these models still struggle with fine-grained details, such as the subtle edges and leaf structures found in mangrove images.
UperNet [24] enhances segmentation by combining global context and local details. However, it struggles with small-scale features, which impacts recognition accuracy. MobileNet [25] is a lightweight network designed for mobile devices. It offers efficient computational performance, but it has difficulty capturing both global and local details in complex mangrove environments, which degrades segmentation and recognition performance.
OCRNet [26] improves context representation through its object context module, enhancing overall scene understanding. However, OCRNet performs poorly under complex lighting conditions, with accuracy dropping under strong light and shadow. DMNet (Dynamic Multi-scale Network) [27] improves model flexibility by adjusting feature representations at different scales, but it still fails to capture the fine details of leaves and canopy structures in mangroves.
Although networks like Swin Transformer [28] capture global information effectively, they still encounter challenges when handling subtle species differences and complex backgrounds in mangrove environments.
To address the shortcomings of existing methods in mangrove species recognition, this paper proposes a segmentation network named MHAGFNet, based on UAV remote sensing images, which aims to significantly improve the recognition accuracy of mangrove species through image segmentation. On the constructed dataset, MHAGFNet achieves an average recognition accuracy of 93.87% for five mangrove species and an average segmentation accuracy of 88.46%, demonstrating excellent performance.
The main contributions of this paper are as follows:
1. Design of the Multi-Scale Feature Fusion Module (MSFFM): MSFFM combines early shallow features, such as detailed texture information, with deep features, such as high-level semantic information, to enhance the ability to distinguish visually similar mangrove species.
2. Design of the Multi-Head Attention Guide Module (MHAGM): MHAGM captures multi-scale features from the images, improving the proposed MHAGFNet’s ability to perceive both global structures and fine details.
3. Building a mangrove species dataset (MSIDBG) using a high-resolution UAV: The MSIDBG dataset was collected in the Shankou Mangrove National Nature Reserve in Beihai, Guangxi Zhuang Autonomous Region, China, and serves as an important basic resource for mangrove species recognition research.
2. Mangrove Species Selection and Challenges
Mangroves grow in the intertidal zones of tropical and subtropical coastal areas, playing important roles such as purifying seawater, breaking waves, sequestering carbon, and maintaining biodiversity.
Mangrove species typically grow in similar intertidal environments, as shown in Figure 1a,b. The similarity in background height makes it challenging to distinguish them based solely on environmental features. Additionally, variations in light angles, shadow effects, and strong reflections can affect image clarity and detail capture, increasing the complexity of image processing and species recognition.
Figure 2 shows the mangrove ecosystem in southern China. Figure 2a–e display the following species: (a) Sonneratia apetala, (b) Avicennia marina, (c) Bruguiera gymnorhiza, (d) Rhizophora stylosa, and (e) Kandelia obovata.
Figure 2a: Sonneratia apetala, an exotic mangrove species native to the Bay of Bengal and the Indochina region. Its leaves are oval-shaped, and its fruits are green and round. Sonneratia apetala adapts to humid, low-oxygen environments, often growing in mudflats and intertidal zones. However, because its leaf morphology is unremarkable and partially resembles that of other species, and because its relatively short stature makes it susceptible to tidal influences (mud coverage), recognition in complex environments can be challenging.
Figure 2b: Avicennia marina, a highly salt-tolerant mangrove species that thrives in environments with high salinity and frequent tidal inundation. Its leaves are thick with a waxy surface, and its bark is grayish-white. This species exhibits remarkable adaptability to extreme environments; however, its bark color and leaf morphology can resemble those of some terrestrial plants, leading to potential misclassification in the complex mangrove growth environment.
Figure 2c: Bruguiera gymnorhiza, with smooth, dark green leaves and slight serrations along the edges. These leaf characteristics aid its growth in the intertidal zone. However, in dense mangrove areas, the leaves of Bruguiera gymnorhiza may be confused with those of Rhizophora stylosa, complicating recognition.
Figure 2d: Rhizophora stylosa, characterized by opposite leaves with blunt tips, which are typically larger and dark green, exhibiting strong adaptability. Although the leaf morphology of Rhizophora stylosa is relatively distinctive, in certain situations, variations in shadow and light may affect recognition, leading to blurred boundaries.
Figure 2e: Kandelia obovata, whose smooth, oval-shaped leaves have serrated edges and whose seeds are elongated. The species is highly adaptable and can survive in saline-alkaline environments. While its morphology is relatively easy to identify, similarities with the leaves of Sonneratia apetala in certain features may pose recognition challenges in dense mangrove areas.
As can be seen from Figure 2, mangrove species recognition faces several challenges, the first being the morphological similarity between species. Although each species differs in leaf shape and fruit morphology, these visual differences can become blurred in images captured from a distance or at a lower resolution. For instance, as shown in Figure 2, the leaves of Sonneratia apetala and Avicennia marina are easily confused under varying conditions, as their leaf colors and shapes appear similar. This similarity in morphological features exhibits a form of symmetry that challenges the recognition model to differentiate species accurately. Likewise, distinguishing the leaves of Bruguiera gymnorhiza and Rhizophora stylosa in Figure 2 may be difficult under changing light conditions or shadows.
3. Methods
In this section, we first introduce the overall network architecture. Then, we provide a detailed description of each submodule.
3.1. Overview of the Architecture
The overall architecture of the proposed MHAGFNet is shown in Figure 3. MHAGFNet consists of the ResNet50 [29] backbone network and the Multi-Scale Feature Fusion Module (MSFFM). ResNet50, as the backbone network for feature extraction, generates six feature maps of different resolutions from the input images. These feature maps represent information from low-level to high-level features, providing a rich feature base for subsequent modules. The Multi-Scale Feature Fusion Module (MSFFM) mainly includes the Multi-Head Attention Guide Module (MHAGM). By enhancing feature maps at different levels through multi-head attention guidance, MSFFM can better extract and fuse features from various scales, thereby improving the network’s ability to recognize mangrove species, particularly in situations with complex backgrounds and varying lighting conditions.
3.2. Multi-Scale Feature Fusion Module (MSFFM)
Here, we provide a detailed explanation of the core module of MHAGFNet, the MSFFM. The structure of MSFFM is shown in Figure 4; it comprises a 1 × 1 convolution layer, three Multi-Head Attention Guide Modules (MHAGM), a feature fusion layer, an upsampling layer, and a transposed convolution layer.
MHAGM is the core component of MSFFM, consisting of three key submodules: Depthwise Separable Convolution, the Multi-Head Global Spatial Enhancement Module (MHGSEM), and the Mangrove Holistic Attention Module (MHAM). It captures key information across different scales through the multi-head attention mechanism, enhancing the model’s sensitivity to the features of various mangrove species. Each MHAGM module independently processes features at a different scale, ensuring that the model can effectively handle the complex and variable environments of mangroves and capture subtle feature differences. The feature fusion layer efficiently integrates features from various scales, reinforcing the synergy between local and global information, thereby enhancing recognition performance. The upsampling layer and transposed convolution layer restore the spatial resolution of the feature maps, ensuring that the final output segmentation results achieve higher precision. Through the extraction and fusion of multi-scale features, the MSFFM significantly improves the accuracy of mangrove species recognition in complex backgrounds and changing environments. In addition, the 1 × 1 convolution layer is utilized to reduce the channel dimensions and perform preliminary processing on the input features, preserving important information while minimizing computational complexity.
The flow of MSFFM starts with a 1 × 1 convolution applied to feature map C1, resulting in feature map F1, which reduces computational complexity. Next, feature maps C2 to C4 are input into the MHAGM modules, where they are processed using the multi-head attention mechanism, yielding enhanced feature maps F2, F3, and F4. Feature maps C5 and C6 are then processed with max pooling and 1 × 1 convolution operations, respectively, producing feature maps F5 and F6. Subsequently, feature maps F1 through F5 are fused together, followed by further compression of the fused features using a 1 × 1 convolution. This is followed by a 4× upsampling operation to enhance the resolution while retaining more detailed information. The upsampled features are concatenated with the compressed feature map F6, integrating features from all levels. Finally, the fused features undergo further processing through a 3 × 3 convolution, and a transposed convolution generates the final segmentation map. The transposed convolution restores spatial resolution and ensures the accuracy of the segmentation results.
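To make this data flow concrete, the following is a minimal PyTorch sketch of the MSFFM pipeline, not the paper’s exact implementation: the channel widths, interpolation modes, and the `MHAGM` wiring (sketched in Section 3.3) are illustrative assumptions.

```python
# Minimal PyTorch sketch of the MSFFM forward flow (illustrative assumptions:
# channel widths, bilinear resizing, and the MHAGM module from Section 3.3).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSFFM(nn.Module):
    def __init__(self, in_chs, mid_ch=256, num_classes=5):
        super().__init__()
        c1, c2, c3, c4, c5, c6 = in_chs                  # channels of C1..C6
        self.reduce_c1 = nn.Conv2d(c1, mid_ch, 1)        # C1 -> F1
        self.mhagm2 = MHAGM(c2, mid_ch)                  # C2 -> F2
        self.mhagm3 = MHAGM(c3, mid_ch)                  # C3 -> F3
        self.mhagm4 = MHAGM(c4, mid_ch)                  # C4 -> F4
        self.pool_c5 = nn.MaxPool2d(2)                   # C5 -> F5
        self.reduce_c6 = nn.Conv2d(c6, mid_ch, 1)        # C6 -> F6
        self.compress = nn.Conv2d(4 * mid_ch + c5, mid_ch, 1)
        self.refine = nn.Conv2d(2 * mid_ch, mid_ch, 3, padding=1)
        self.head = nn.ConvTranspose2d(mid_ch, num_classes, 4,
                                       stride=2, padding=1)

    def forward(self, c1, c2, c3, c4, c5, c6):
        f1 = self.reduce_c1(c1)
        f2, f3, f4 = self.mhagm2(c2), self.mhagm3(c3), self.mhagm4(c4)
        f5, f6 = self.pool_c5(c5), self.reduce_c6(c6)
        size = f1.shape[-2:]                 # fuse F1..F5 at F1's resolution
        fused = torch.cat([f1] + [
            F.interpolate(f, size, mode="bilinear", align_corners=False)
            for f in (f2, f3, f4, f5)], dim=1)
        fused = self.compress(fused)         # 1x1 compression of fused maps
        fused = F.interpolate(fused, scale_factor=4, mode="bilinear",
                              align_corners=False)       # 4x upsampling
        f6 = F.interpolate(f6, fused.shape[-2:], mode="bilinear",
                           align_corners=False)
        out = self.refine(torch.cat([fused, f6], dim=1))  # 3x3 convolution
        return self.head(out)   # transposed conv yields the segmentation map
```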
3.3. Multi-Head Attention Guide Module (MHAGM)
Here, we provide a detailed explanation of the MHAGM in MSFFM, as shown in Figure 4. The structure of MHAGM is illustrated in Figure 5. It consists of Depthwise Separable Convolution (DSC), the Multi-Head Global Spatial Enhancement Module (MHGSEM), the Mangrove Holistic Attention Module (MHAM), and skip connections.
Depthwise Separable Convolution (DSC) extracts local features from the input data, using dilated convolution kernels to expand the receptive field and improve feature extraction efficiency. This improves performance in complex mangrove environments.
The MHGSEM combines multi-head channel attention and spatial attention mechanisms. It enhances feature representation at both global and local levels, thereby improving the network’s recognition accuracy in complex backgrounds.
The MHAM module enhances the network’s ability to capture subtle mangrove feature differences. It integrates global, local, and edge features through a multi-dimensional attention mechanism.
Batch normalization and the ReLU activation function help accelerate model convergence. They also improve the ability to capture non-linear feature representations, enhancing overall performance.
(1) DSC Module
The MHAGM module first applies Depthwise Separable Convolution (DSC), which decomposes a standard convolution into a depthwise convolution and a pointwise convolution. The depthwise convolution extracts spatial features independently for each channel, while the pointwise convolution aggregates information across channels. Compared to standard convolution, depthwise separable convolution significantly reduces computational complexity. For an input feature map $X \in \mathbb{R}^{C \times H \times W}$, its computational process is as follows:
$$F_{\mathrm{dsc}} = \mathrm{PWConv}_{1 \times 1}\big(\mathrm{DWConv}_{3 \times 3}(X)\big)$$
where $\mathrm{DWConv}_{3 \times 3}$ represents the operation using a 3 × 3 convolutional kernel applied independently to each channel, while $\mathrm{PWConv}_{1 \times 1}$ denotes the operation using a 1 × 1 convolutional kernel to perform a linear transformation on the channel dimension.
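As a concrete illustration, below is a minimal PyTorch sketch of the DSC step, assuming a standard grouped convolution for the depthwise part; the dilation rate is exposed as a parameter because the text above mentions dilated kernels.

```python
# Minimal sketch of Depthwise Separable Convolution (DSC); the dilation
# parameter is an assumption based on the dilated kernels mentioned above.
import torch.nn as nn

class DSC(nn.Module):
    def __init__(self, in_ch, out_ch, dilation=1):
        super().__init__()
        # Depthwise: one 3x3 kernel per channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=dilation,
                                   dilation=dilation, groups=in_ch)
        # Pointwise: 1x1 linear transform across the channel dimension.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```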
(2) MHGSEM Module
The feature map is then fed into the MHGSEM module, which adaptively adjusts the importance of each channel. MHGSEM extracts global information using a multi-head attention mechanism and generates weights to modulate each channel’s contribution. After learning the weights for different channels, they are normalized using the Sigmoid function to obtain the enhanced features. The formula is as follows:
$$F_{\mathrm{se}} = \sigma(W_c) \otimes F_{\mathrm{dsc}}$$
where $\sigma$ represents the Sigmoid function and $W_c$ denotes the learned channel weights.
(3) MHAM Module
After obtaining the enhanced features, to better focus on the edges, shapes, and textures of the mangrove leaves, they are fed into the MHAM module to further capture important spatial information. The formula is as follows:
$$F_{\mathrm{att}} = \mathrm{MHAM}(F_{\mathrm{se}})$$
(4) Residual connection
Finally, to mitigate the vanishing gradient problem and accelerate convergence, a skip connection is used to enhance the propagation of residual information. When the input and output channels differ, the skip connection adjusts the dimensions using a 1 × 1 convolution and adds the result to the feature map after convolution, thereby maintaining the integrity of the features. The formula is as follows:
$$F_{\mathrm{out}} = \delta\big(F_{\mathrm{att}} + \mathcal{S}(X)\big)$$
where $\mathcal{S}$ represents the skip connection, and $\delta$ represents the activation function.
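The sketch below assembles these four steps (DSC, MHGSEM, MHAM, and the skip connection) under the stated batch normalization and ReLU convention; `MHGSEM` and `MHAM` refer to the sketches in Sections 3.4 and 3.5, and the exact wiring is an assumption.

```python
# Minimal sketch of the MHAGM assembly; module wiring is an illustrative
# assumption, reusing the DSC, MHGSEM, and MHAM sketches in this section.
import torch.nn as nn

class MHAGM(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.dsc = DSC(in_ch, out_ch)
        self.mhgsem = MHGSEM(out_ch)      # channel + spatial enhancement
        self.mham = MHAM(out_ch)          # holistic attention
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)
        # 1x1 projection on the skip path when channel counts differ.
        self.skip = (nn.Conv2d(in_ch, out_ch, 1)
                     if in_ch != out_ch else nn.Identity())

    def forward(self, x):
        out = self.mham(self.mhgsem(self.dsc(x)))
        return self.act(self.bn(out) + self.skip(x))   # residual connection
```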
3.4. Multi-Head Global Spatial Enhancement Module (MHGSEM)
Here, we provide a detailed explanation of the MHGSEM in the MHAGM, as shown in Figure 5. The structure of MHGSEM is illustrated in Figure 6. It consists of a multi-head self-attention mechanism, a spatial attention module, and a feature fusion module.
The multi-head self-attention fully connected layer computes relationships between channels using a multi-head mechanism. Each attention head focuses on different aspects of the feature map, improving the performance of the channel attention mechanism.
The spatial attention module enhances spatial feature extraction across the feature map. It emphasizes key spatial information, such as mangrove boundaries and textures, which aids in the recognition of small mangrove species in complex scenes with lighting variations and water reflections.
The feature fusion module combines channel and spatial attention, ensuring the fused feature map retains both global and local details. This integration strengthens the model’s ability to capture comprehensive feature representations at multiple scales.
(1) Multi-head fully connected layer
For the input feature map $X \in \mathbb{R}^{C \times H \times W}$, both adaptive global average pooling and adaptive global max pooling are applied simultaneously to reduce the dimensionality of the input features. This step extracts the channel-level global features $F_{\mathrm{avg}}$ and $F_{\mathrm{max}}$. The resulting global features are then passed into the multi-head attention mechanism, in which the outputs of the average pooling and max pooling are processed separately by four fully connected layers. The output of each head is represented as:
$$H_i = \mathrm{FC}_i(F_{\mathrm{avg}}) + \mathrm{FC}_i(F_{\mathrm{max}}), \quad i = 1, \dots, 4$$
where $i$ indexes the different attention heads, and $\mathrm{FC}_i$ is the corresponding fully connected layer. All output features are aggregated through a weighted sum to obtain the final channel attention weights:
$$W_c = \sum_{i=1}^{4} w_i H_i$$
where $w_i$ are the aggregation weights.
The choice of four attention heads is motivated by the need to balance computational efficiency and model performance. Using multiple attention heads allows the model to capture diverse and complementary information from different subspaces, improving its ability to focus on relevant features. Through empirical testing, we found that four heads strike an optimal balance between complexity and accuracy. This setup effectively extracts multi-scale features while avoiding excessive computational burden. Moreover, this number of heads captures the necessary diversity of feature interactions for the task of mangrove species classification, while maintaining processing efficiency.
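A minimal sketch of this multi-head channel attention is given below; the bottleneck width (`reduction=16`) and the learnable aggregation weights are illustrative assumptions.

```python
# Sketch of the multi-head fully connected channel attention: four heads
# score pooled descriptors, and a weighted sum yields the channel weights.
import torch
import torch.nn as nn

class MultiHeadChannelAttention(nn.Module):
    def __init__(self, channels, num_heads=4, reduction=16):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(channels, channels // reduction),
                          nn.ReLU(inplace=True),
                          nn.Linear(channels // reduction, channels))
            for _ in range(num_heads)])
        # Learnable weights w_i for the weighted sum over heads.
        self.head_weights = nn.Parameter(torch.ones(num_heads) / num_heads)

    def forward(self, x):                    # x: (B, C, H, W)
        b, c, _, _ = x.shape
        f_avg = x.mean(dim=(2, 3))           # adaptive global average pooling
        f_max = x.amax(dim=(2, 3))           # adaptive global max pooling
        # Each head processes both pooled descriptors separately.
        heads = [h(f_avg) + h(f_max) for h in self.heads]
        w_c = sum(w * h for w, h in zip(self.head_weights, heads))
        return torch.sigmoid(w_c).view(b, c, 1, 1)   # channel weights
```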
(2) Spatial Attention Module (SE)
For the input feature map $F \in \mathbb{R}^{C \times H \times W}$, average pooling and max pooling are first applied along the channel dimension to obtain two distinct spatial feature maps:
$$F_{\mathrm{avg}}^{s} = \frac{1}{C} \sum_{c=1}^{C} F_{c}, \qquad F_{\mathrm{max}}^{s} = \max_{c} F_{c}$$
where $C$ represents the number of channels, $F_{\mathrm{avg}}^{s}$ represents the features after average pooling, and $F_{\mathrm{max}}^{s}$ represents the features after max pooling.
These two distinct spatial feature maps are concatenated and passed into the SE module, followed by a Sigmoid activation function to generate the spatial attention weights:
$$M_s = \sigma\big(\mathrm{SE}\big([F_{\mathrm{avg}}^{s}; F_{\mathrm{max}}^{s}]\big)\big)$$
where $M_s$ represents the final spatial attention weights, and $\sigma$ denotes the Sigmoid activation function.
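The following sketch illustrates this spatial attention branch; a single 7 × 7 convolution stands in for the SE module as a simplifying assumption.

```python
# Sketch of the spatial attention branch: channel-wise average and max maps
# are concatenated and scored; the 7x7 convolution is a stand-in assumption
# for the SE module named in the text.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                       # x: (B, C, H, W)
        f_avg = x.mean(dim=1, keepdim=True)     # average over channels
        f_max = x.amax(dim=1, keepdim=True)     # max over channels
        m_s = self.conv(torch.cat([f_avg, f_max], dim=1))
        return torch.sigmoid(m_s)               # weights in [0, 1]
```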
(3) Feature Fusion
Finally, the input features, channel attention, and spatial attention are combined to obtain the final weighted output:
$$F = \mathrm{Fuse}\big(X \otimes \sigma(W_c) \otimes M_s\big)$$
where $F$ represents the final weighted output, $X$ denotes the initial input features, and $\mathrm{Fuse}(\cdot)$ represents the fusion operation.
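A minimal sketch of the complete MHGSEM, combining the two branches above, is shown below; the residual addition that retains the original features is an assumption about the fusion operation.

```python
# Sketch of the MHGSEM fusion: channel and spatial weights modulate the
# input, and a residual (an assumption) keeps the original features.
import torch.nn as nn

class MHGSEM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.channel_att = MultiHeadChannelAttention(channels)
        self.spatial_att = SpatialAttention()

    def forward(self, x):
        out = x * self.channel_att(x)        # channel reweighting
        out = out * self.spatial_att(out)    # spatial reweighting
        return out + x                       # retain global and local details
```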
3.5. Mangrove Holistic Attention Module (MHAM)
Here, we provide a detailed explanation of the MHAM within the MHAGM, as shown in Figure 5. The structure of MHAM is illustrated in Figure 7. It consists of attention processing layers for the height and width dimensions, local feature extraction convolution layers, edge feature extraction convolution layers, and global feature extraction layers.
The attention processing layers extract features along the height and width dimensions, capturing long-range dependencies. This helps distinguish the overall morphology of different mangrove species.
The local feature extraction convolution layer focuses on fine-texture details of leaves and branches. It enhances the ability to differentiate subtle texture variations among similar species, improving fine-detail recognition.
The edge feature extraction convolution layer enhances contour detection. It aids in recognizing species with blurry edges or complex shapes, thus improving classification accuracy.
The global feature extraction layer captures overall image information, which aids in distinguishing the general structure of each species and enhances recognition in complex backgrounds.
(1) Height dimension attention
For the input feature map $x$, independent attention computation is performed along the height dimension. This is done by applying a 1D convolution operation along the height, followed by normalization and a Sigmoid activation function. These steps generate the corresponding attention map $A_h$ for that dimension. The formula is as follows:
$$A_h = \sigma\Big(\mathrm{GN}\big(\mathrm{Conv}_h\big(\mathrm{Mean}_{\mathrm{Dim}=3}(x)\big)\big)\Big)$$
where Dim = 3 indicates that the mean is computed along the width dimension, $\mathrm{Conv}_h$ refers to the convolution operation applied along the channel dimension, which is used to calculate the attention along the height, $\mathrm{GN}$ stands for group normalization, which is applied to the feature map after the convolution, and $\sigma$ represents the Sigmoid activation function, which outputs attention weights in the range [0, 1].
(2) Width dimension attention
Next, the mean along the height dimension of the image is calculated, and a convolution operation is then applied to compute the attention map along the width dimension. The formula is expressed as:
$$A_w = \sigma\Big(\mathrm{GN}\big(\mathrm{Conv}_w\big(\mathrm{Mean}_{\mathrm{Dim}=2}(x)\big)\big)\Big)$$
where Dim = 2 indicates that the mean is computed along the height dimension, $\mathrm{Conv}_w$ refers to the convolution operation applied along the channel dimension, which is used to calculate the attention along the width, $\mathrm{GN}$ represents group normalization, applied to the feature map after the convolution, and $\sigma$ is the Sigmoid activation function, which outputs attention weights in the range [0, 1].
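Since the height and width branches differ only in the axis being averaged, a single parameterized sketch covers both; the 1 × 1 convolution across channels and the group count of the normalization are assumptions.

```python
# Sketch of the height/width axis attention: average over one spatial axis,
# then a 1x1 convolution (an assumption for the per-axis convolution),
# group normalization, and a Sigmoid.
import torch
import torch.nn as nn

class AxisAttention(nn.Module):
    def __init__(self, channels, dim):
        super().__init__()
        self.dim = dim   # 3: mean over width -> height map; 2: the reverse
        self.conv = nn.Conv2d(channels, channels, 1)
        self.gn = nn.GroupNorm(4, channels)   # channels assumed divisible by 4

    def forward(self, x):                     # x: (B, C, H, W)
        pooled = x.mean(dim=self.dim, keepdim=True)   # collapse one axis
        return torch.sigmoid(self.gn(self.conv(pooled)))

height_att = AxisAttention(256, dim=3)   # A_h with shape (B, C, H, 1)
width_att = AxisAttention(256, dim=2)    # A_w with shape (B, C, 1, W)
```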
(3) Local texture feature extraction
Local features are extracted using a local convolution operation. A 3 × 3 convolution kernel is applied to the input feature map, extracting features from each local region. The computation process is as follows:
$$F'_{\mathrm{local}} = \mathrm{Conv}_{3 \times 3}(x), \qquad F_{\mathrm{local}} = \mathrm{GN}\big(F'_{\mathrm{local}}\big)$$
where $\mathrm{Conv}_{3 \times 3}$ refers to the 3 × 3 convolution operation, $\mathrm{GN}$ stands for group normalization, $F'_{\mathrm{local}}$ represents the initially extracted local texture features, and $F_{\mathrm{local}}$ denotes the final extracted local texture features.
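A direct sketch of this two-step extraction:

```python
# Sketch of the local texture branch: a 3x3 convolution produces the initial
# features, and group normalization yields the final local texture features.
import torch.nn as nn

class LocalTexture(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.gn = nn.GroupNorm(4, channels)

    def forward(self, x):
        f_local_init = self.conv(x)     # initially extracted features
        return self.gn(f_local_init)    # final local texture features
```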
(4) Edge feature extraction
Edge features are crucial for recognizing the morphology of mangrove species, particularly when species exhibit similar shapes or contours. To capture these edge details, this paper uses the Sobel edge detection kernel, which highlights high-gradient areas such as sharp leaf edges. These edges are then processed with group normalization for further feature extraction.
While modern deep learning methods can learn edge features automatically, Sobel edge detection remains valuable in this setting due to its simplicity and efficiency. It explicitly extracts sharp, well-defined edges, which are crucial for accurately identifying mangrove species.
The MHAM module, with its multi-level attention mechanism, also focuses on edge-like features during training. This raises the question of whether Sobel becomes redundant when combined with MHAM. In fact, Sobel and MHAM complement each other rather than overlap. Sobel provides explicit, predefined edge features, which serve as a solid foundation for further refinement. Meanwhile, MHAM enhances these features by applying both local and global attention to emphasize important boundaries. This combination allows the model to better capture fine-scale details and resolve challenging species boundaries, especially in complex cases where species overlap.
After extracting the edge information, group normalization is applied to obtain the edge features. The computation process is as follows:
$$F_{\mathrm{edge}} = \mathrm{GN}\big(\mathrm{Conv}_{\mathrm{Sobel}}(x)\big)$$
where $\mathrm{Conv}_{\mathrm{Sobel}}$ refers to the convolution operation using the Sobel kernel, and $\mathrm{GN}$ is group normalization, which is applied to process the convolution results.
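The sketch below applies fixed Sobel kernels per channel via a grouped convolution; pairing the horizontal and vertical kernels and taking the gradient magnitude is an assumption, since the text only names the Sobel operator.

```python
# Sketch of the Sobel edge branch with fixed (non-learned) kernels applied
# per channel; the horizontal/vertical pair and gradient magnitude are
# assumptions beyond the text's mention of the Sobel kernel.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SobelEdge(nn.Module):
    def __init__(self, channels):
        super().__init__()
        gx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        gy = gx.t()
        # Two fixed kernels per input channel via grouped convolution.
        kernel = torch.stack([gx, gy]).repeat(channels, 1, 1).unsqueeze(1)
        self.register_buffer("kernel", kernel)   # (2C, 1, 3, 3), not trained
        self.channels = channels
        self.gn = nn.GroupNorm(4, channels)

    def forward(self, x):                         # x: (B, C, H, W)
        e = F.conv2d(x, self.kernel, padding=1, groups=self.channels)
        gx, gy = e[:, 0::2], e[:, 1::2]           # split the two responses
        return self.gn(torch.sqrt(gx ** 2 + gy ** 2 + 1e-6))
```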
(5) Global feature extraction
Global features are essential for identifying the large-scale patterns and spatial layouts of mangrove species. For instance, features such as crown size and tree height play a crucial role in distinguishing different species. Global features are extracted using adaptive pooling, which compresses the spatial dimensions to obtain a global feature vector. This vector is then passed through two fully connected layers to obtain the global attention $A_g$. The formula is as follows:
$$A_g = \sigma\Big(\mathrm{FC}_2\big(\mathrm{FC}_1\big(\mathrm{AvgPool}(x)\big)\big)\Big)$$
where $\mathrm{AvgPool}$ refers to average pooling, $\mathrm{FC}_1$ and $\mathrm{FC}_2$ represent the two fully connected layers, and $\sigma$ is the Sigmoid activation function, which outputs the global attention weights.
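A minimal sketch of the global branch; the bottleneck width is an assumption.

```python
# Sketch of the global attention branch: adaptive average pooling compresses
# the spatial dimensions, and two fully connected layers score the result.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),   # FC1
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))   # FC2

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, _, _ = x.shape
        g = F.adaptive_avg_pool2d(x, 1).view(b, c)   # global feature vector
        return torch.sigmoid(self.fc(g)).view(b, c, 1, 1)   # A_g
```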
(6) Feature fusion
After computing the attention for the height, width, and global dimensions, the three attention components are combined through element-wise multiplication to obtain the final spatial and channel attention. This refined attention map is subsequently applied to the original input feature map $x$ to generate the attention-enhanced feature map $F_{\mathrm{att}}$. The computation process is as follows:
$$F_{\mathrm{att}} = x \otimes A_h \otimes A_w \otimes A_g$$
Finally, the local features and edge features are integrated with the attention-enhanced feature map through element-wise addition to obtain the final fused features. The computation process is as follows:
$$F_{\mathrm{out}} = F_{\mathrm{att}} + F_{\mathrm{local}} + F_{\mathrm{edge}}$$
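Putting the pieces together, the sketch below wires the five MHAM branches as in the two formulas above; the module composition reuses the sketches from this section and is otherwise an illustrative assumption.

```python
# Sketch of the full MHAM: three attentions modulate the input element-wise,
# and the local and edge features are added element-wise afterwards.
import torch.nn as nn

class MHAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.height_att = AxisAttention(channels, dim=3)
        self.width_att = AxisAttention(channels, dim=2)
        self.global_att = GlobalAttention(channels)
        self.local = LocalTexture(channels)
        self.edge = SobelEdge(channels)

    def forward(self, x):
        # F_att = x * A_h * A_w * A_g (broadcast over the missing axes).
        f_att = (x * self.height_att(x) * self.width_att(x)
                   * self.global_att(x))
        # F_out = F_att + F_local + F_edge.
        return f_att + self.local(x) + self.edge(x)
```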
5. Conclusions
In this paper, we propose a novel approach, MHAGFNet, designed to enhance mangrove species recognition in complex ecological environments. MHAGFNet integrates several innovative modules that significantly improve its adaptability and accuracy in species classification. The Multi-Scale Feature Fusion Module (MSFFM) plays a crucial role in enhancing the network’s ability to capture subtle differences in leaf shape and canopy structure, making it particularly suited for precise classification tasks. The Multi-Head Attention Guide Module (MHAGM), which combines the Mangrove Holistic Attention Module (MHAM) with the Multi-Head Global Spatial Enhancement Module (MHGSEM), further enriches the extraction of local, edge, and global features, improving the model’s capacity to recognize both global structures and detailed features, especially in complex and dynamic environments.
However, despite the promising results, there are limitations to the proposed method. The performance of MHAGFNet may be compromised under extreme environmental conditions, such as intense midday sunlight causing high water reflection, or when significant wind movement affects canopy stability. Additionally, the model may struggle with challenging scenarios, such as unusual lighting angles or completely submerged roots. In these situations, the network’s stability and sensitivity could be impacted, leading to potential misclassification.
To address these limitations, future work could focus on enhancing the model’s robustness to variations in lighting conditions and complex environmental factors. One potential direction is to explore advanced techniques, such as image super-resolution (SR), which can improve the quality of low-resolution images and enable the model to capture the fine details and features essential for accurate species classification. Additionally, incorporating temporal data or sensor fusion methods could further strengthen the model’s ability to adapt to dynamic environmental conditions, such as lighting variations, water reflections, and canopy movement. Further development of adaptive mechanisms to handle challenging scenarios, such as submerged structures or species with similar visual characteristics, would be beneficial for improving model stability and accuracy under extreme conditions.