Article

GOFENet: A Hybrid Transformer–CNN Network Integrating GEOBIA-Based Object Priors for Semantic Segmentation of Remote Sensing Images

1 Southern Marine Science and Engineering Guangdong Laboratory (Guangzhou), Guangzhou 511458, China
2 Ocean College, Zhejiang University, Zhoushan 316021, China
3 State Key Laboratory of Satellite Ocean Environment Dynamics, Second Institute of Oceanography, Ministry of Natural Resources, Hangzhou 310012, China
4 School of Oceanography, Shanghai Jiao Tong University, Shanghai 200030, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(15), 2652; https://doi.org/10.3390/rs17152652
Submission received: 18 June 2025 / Revised: 20 July 2025 / Accepted: 28 July 2025 / Published: 31 July 2025

Abstract

Geographic object-based image analysis (GEOBIA) has demonstrated substantial utility in remote sensing tasks. However, its integration with deep learning remains largely confined to image-level classification. This is primarily due to the irregular shapes and fragmented boundaries of segmented objects, which limit its applicability in semantic segmentation. While convolutional neural networks (CNNs) excel at local feature extraction, they inherently struggle to capture long-range dependencies. In contrast, Transformer-based models are well suited for global context modeling but often lack fine-grained local detail. To overcome these limitations, we propose GOFENet (Geo-Object Feature Enhanced Network)—a hybrid semantic segmentation architecture that effectively fuses object-level priors into deep feature representations. GOFENet employs a dual-encoder design combining CNN and Swin Transformer architectures, enabling multi-scale feature fusion through skip connections to preserve both local and global semantics. An auxiliary branch incorporating cascaded atrous convolutions is introduced to inject information of segmented objects into the learning process. Furthermore, we develop a cross-channel selection module (CSM) for refined channel-wise attention, a feature enhancement module (FEM) to merge global and local representations, and a shallow–deep feature fusion module (SDFM) to integrate pixel- and object-level cues across scales. Experimental results on the GID and LoveDA datasets demonstrate that GOFENet achieves superior segmentation performance, with 66.02% mIoU and 51.92% mIoU, respectively. The model exhibits strong capability in delineating large-scale land cover features, producing sharper object boundaries and reducing classification noise, while preserving the integrity and discriminability of land cover categories.

1. Introduction

Advancements in sensor technology have propelled the development of high-resolution remote sensing imagery. Owing to its rich detail and spatial information, such imagery is widely used in ecological monitoring [1], land cover mapping [2,3], change detection [4], and urban planning [5]. Extracting meaningful information and identifying target objects from remote sensing images, or the intelligent interpretation of such images, remains a key research focus. Image segmentation, as a subtask of intelligent interpretation, aims to extract the desired objects from remote sensing imagery. Traditional methods in remote sensing image analysis, exemplified by object-based techniques (GEOBIA), have been widely used over recent years [6,7]. These algorithms can be broadly categorized into threshold-based [8], edge-based [9], region-based [10,11], and hybrid approaches [12]. The core principle of these algorithms is to partition the image into numerous high-homogeneity segments with high heterogeneity between segments. However, a key limitation is that these geographic entities, as components of semantic information, lack intrinsic semantic meaning [13]. While multi-scale optimization segmentation algorithms [14,15] have improved the semantic relevance of segments and created more meaningful geographic entities (geo-objects), many segments still remain in their original state, exhibiting significant over-segmentation [16]. Consequently, traditional methods face substantial challenges in optimizing segmentation parameters and addressing the complexity and large scale of remote sensing images.
The rapid advancement of deep learning in recent years has resulted in substantial breakthroughs in image segmentation, particularly through the application of convolutional neural networks (CNNs). The introduction of Fully Convolutional Networks (FCNs) [17] marked a historic breakthrough in semantic segmentation tasks. Building on this, ResNet [18] demonstrated powerful feature extraction capabilities through residual connections in deep convolutional networks. Concurrently, UNet [19] addressed the issue of spatial information loss caused by successive convolutions by employing skip connections for feature fusion. Network architectures based on ResNet or incorporating UNet’s skip connections have become widely adopted in semantic segmentation tasks [20,21].
Remote sensing imagery poses unique challenges for semantic segmentation arising from the intricate diversity of land cover types (intra-class spectral variability and inter-class spectral similarity) and scale variability. While CNNs leverage local connectivity and weight sharing to exhibit locality and translation invariance—thereby reducing model parameters and enhancing feature learning efficiency—they still have limited capability in capturing global semantic information [22]. On the one hand, DeepLabv3+ [23] and PSPNet [24] expand the receptive field to obtain multi-scale global context, but their global information is still constrained by local regions. On the other hand, attention mechanisms such as DANet [25], which uses parallel channel attention mechanism (CAM) and spatial attention mechanism (SAM), and CBAM [26], which serially combines CAM and SAM, offer efficient approaches for modeling global information. Unfortunately, they are constrained by the number of modules and computational overhead. Therefore, there is an urgent need for a purely attention-based architecture to facilitate comprehensive global information extraction.
Transformers, renowned in natural language processing for their ability to capture global relationships, offer a novel approach to semantic segmentation [27]. The Vision Transformer (ViT) was the first to employ a pure Transformer architecture for image recognition tasks [28]. Pretrained on large datasets, ViT slightly outperforms CNN-based models, demonstrating the Transformer model’s robust feature extraction capabilities and effectiveness in image processing. Zheng et al. introduced the first semantic segmentation network with a pure Transformer backbone, SETR [29]. Subsequently, Liu et al. [30] introduced the Swin Transformer, a hierarchical Transformer architecture that applies self-attention within small windows and establishes cross-window connections through shifted windows, significantly enhancing the efficiency of self-attention calculations. Additionally, the hierarchical nature of the Swin Transformer can be selectively integrated into existing semantic segmentation models. While Transformers effectively capture long-range global features, local feature information is often overlooked. Therefore, creating a model that leverages both Transformer and CNN advantages is crucial for effectively segmenting remote sensing images, allowing for the capture of both global context and fine spatial details.
To achieve the goal of intelligent interpretation of remote sensing images, we propose GOFENet (Geo-Object Feature Enhanced Network), a hybrid semantic segmentation network that integrates CNN and Transformer architectures. This approach effectively addresses the limitations of CNNs in capturing global context and the tendency of Transformer models to overlook fine-grained details. Built upon a ResNet backbone and integrated with Swin Transformer blocks, the proposed network leverages UNet-style skip connections for hierarchical feature fusion. Although multi-scale segmentation produces objects at varying semantic levels, it effectively enhances pixel connectivity and global context representation. To leverage this, we introduce an auxiliary encoding branch that extracts high-level structural cues from GEOBIA-derived segmentation maps. Experimental results on two benchmark datasets demonstrate that GOFENet achieves outstanding segmentation performance, exhibiting excellent capabilities in extracting both local and global information.
The main contributions of this study are summarized as follows:
(1) A hybrid encoder architecture with feature enhancement modules. GOFENet combines the local detail modeling strength of CNNs with the long-range dependency modeling of Swin Transformers. To further enhance feature representation, we introduce a cross-channel selection module (CSM), which applies channel-wise attention to highlight informative features and suppress irrelevant ones. Additionally, we design a feature enhancement module (FEM) that incorporates both channel and spatial attention mechanisms to integrate global contextual cues into CNN features. Together, these modules refine the feature representations and improve both the accuracy and robustness of the segmentation model.
(2) Incorporation of object-level priors from GEOBIA into a DL framework. We introduce an auxiliary encoder that employs cascaded atrous convolutions to extract global structural information from GEOBIA-derived segmentation maps. A shallow–deep feature fusion module (SDFM) is then designed to integrate feature representations from both the main and auxiliary encoders at aligned semantic levels. This design enables the network to effectively combine multi-scale object-level and pixel-level information, thereby enhancing its capability for scene understanding and contextual reasoning.

2. Related Work

2.1. Semantic Segmentation of Remote Sensing Images Based on CNN, Transformer, and Hybrid Models

2.1.1. CNN-Based Semantic Segmentation Models

Techniques such as multi-scale feature extraction, expanding network receptive fields, integrating local and global contexts, and enhancing global or detailed features have proven effective in improving segmentation performance on remote sensing images [31,32,33,34]. However, high-resolution remote sensing images often contain intricate patterns and man-made structures. CNN-based segmentation networks, which primarily extract local semantic features, exhibit limitations in modeling global information across the entire image. Consequently, relying solely on local information can make it challenging to accurately identify and segment these complex patterns and objects. To overcome this limitation, integrating global contextual understanding is crucial for improving segmentation accuracy in such intricate settings.

2.1.2. Transformer-Based Semantic Segmentation Model

Leveraging its unique self-attention mechanism, the Transformer architecture effectively captures long-range dependencies and excels in handling both the details and structure of images, rendering it especially promising for remote sensing applications. Transformer-based encoders and decoders, collectively referred to as pure Transformer structures, include notable models such as Segmenter [35], SegFormer [36], and SwinUNet [37]. These models have demonstrated significant advancements in image segmentation applications for remote sensing data. While Transformers are adept at global modeling, they often exhibit inadequate localization capabilities, which can undermine segmentation accuracy in scenarios requiring precise spatial information [38]. This lack of fine-grained spatial resolution is a critical challenge, as accurate segmentation of complex remote sensing images often demands both robust global context understanding and detailed local feature extraction.

2.1.3. CNN–Transformer Hybrid Model

Researchers have shown significant interest in combining CNNs’ proficiency in extracting local features with Transformers’ capability to capture long-range dependencies and model sequential images to create synergistic network architectures. For instance, DC-Swin employs the Swin Transformer as the encoder and adopts a convolutional decoder with dense connectivity specifically designed for the segmentation of high-resolution remote sensing imagery, achieving superior performance compared to CNN-based methods [39]. Wang et al. [40] proposed a dual-path perception network that integrates a ResT-based dependency path with a CNN-based texture path, enabling the simultaneous extraction of global contextual information and fine-grained local details. In many studies, CNNs and Transformers are configured as parallel branches [41,42,43], forming dual-encoder frameworks that effectively fuse global semantics with spatial detail. These approaches underscore the potential of hybrid architectures to harness the complementary strengths of CNNs and Transformers, enhancing segmentation accuracy in complex remote sensing scenarios. By contrast, UNetFormer [44] adopts a ResNet encoder paired with a Transformer-based decoder, also demonstrating promising segmentation performance.

2.2. Deep Learning Framework Coupled with GEOBIA

Image segmentation and classification constitute the core processes of GEOBIA, where segments contain rich spectral, textural, morphological, and remote sensing index attributes that facilitate the recognition of complex targets in remote sensing imagery. However, the irregular shapes of these segments pose significant challenges for their integration into deep learning frameworks [45]. Based on an extensive literature review, existing integration strategies can be broadly categorized as follows (see Figure 1):
(a) Integration with CNN-based Segmentation Results. This approach refines or merges segmentation outcomes produced by CNNs [46,47,48]. The major drawback of this method is the requirement for manually constructed rule sets.
(b) Object-based patch classification. In this method, irregular objects are expanded into fixed-size rectangular patches and classified using CNNs or Transformer models [49]. Two padding strategies are commonly used: inserting placeholder values (e.g., 0 or 255) or incorporating contextual information surrounding the object. The latter has been shown to improve classification accuracy [50,51]. Nevertheless, this method is highly dependent on segmentation accuracy, with under-segmentation potentially leading to significant performance degradation [52].
(c) Deep feature extraction from object attributes. Deep learning models are utilized to obtain deep and abstract features from object attributes [53,54,55]. However, such representations may disregard spatial relationships. To address this, recent work has incorporated spatial context into tabular models to reduce misclassification errors [56], or combined tabular attributes with deep features within dual-branch architectures [55].
(d) Heterogeneous patch filtering. This strategy involves preprocessing image patches—such as applying variance-based filtering—prior to CNN classification, aiming to enhance model input quality and downstream performance [50,51,57].

2.3. Attention Mechanism

Attention mechanisms enhance CNNs’ ability to capture global information by using scene- or image-level data. CAMs assess and weight the importance of each channel to enhance key features [31,58], while SAMs focus on significant spatial locations via pixel-wise weighting [25,26]. Some approaches combine both to form pixel-level global attention maps [32,43,44], and others leverage cross-attention to strengthen interactions between different feature levels or modalities [59].
The self-attention module was first introduced to computer vision by Wang et al. in a non-local fashion to extract long-range dependencies between features, known as the dot-product attention mechanism [60,61]. This concept is frequently utilized in network designs for remote sensing images [58,62]. However, its computational complexity increases quadratically with spatial resolution, resulting in a substantial rise in computational cost. To mitigate the memory consumption, approaches such as sparse attention [63,64], linear attention [65], and query-key value reduction strategies [66,67] have been proposed. Transformers extend self-attention to image patch sequences, enabling global context modeling for vision tasks.

3. Method

3.1. Multi-Scale Segmentation Optimization Algorithm—EIODA

The Edge-guided Image Object Detection Approach (EIODA) is a local-scale optimization segmentation algorithm within multi-scale optimization methods. Its key innovation lies in considering both spectral and edge information from images [14]. Specifically, during the segment growth phase, the algorithm performs multi-scale analysis while leveraging edge information as a constraint to determine the optimal segmentation stage for segments. This method aims to find the optimal segmentation of real objects of various scales in the image. The algorithm comprises four main stages: First, edge detection algorithms are used to extract the image edges. Next, a segmentation algorithm, typically Multi-Resolution Segmentation (MRS) [10], is employed for pre-segmentation of high-resolution remote sensing images, ensuring highly homogeneous over-segmented objects. Then, an R-tree and a region adjacency graph (RAG) are constructed to store the edge information and the over-segmented image objects, respectively. The final and critical step involves calculating edge completeness curves during the image growth process to identify the optimal segmentation scale. An example of EIODA segmentation is shown in Figure 2. The original image (Figure 2a) is segmented using EIODA to obtain the segmentation result (Figure 2b), and then the spectral mean values of the RGB channels in each segment are calculated to generate the segmented image (Figure 2c). This integration of spectral and edge information enhances the detection capability and reliability of geo-objects [16].
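To make the last step of Figure 2 concrete, the short sketch below (Python/NumPy; the function name and inputs are illustrative, not from the authors' code) fills each segment with the per-channel spectral mean of its pixels to render the segmented image from an integer label map:

```python
# A minimal sketch of rendering a "segmented image": every segment in the
# label map is replaced by the per-channel spectral mean of its pixels.
# `image` and `labels` are hypothetical inputs; any EIODA/MRS output that
# assigns an integer segment id to each pixel would work.
import numpy as np

def render_segment_means(image: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """image: (H, W, 3) array; labels: (H, W) integer segment ids."""
    out = np.zeros_like(image, dtype=np.float64)
    for seg_id in np.unique(labels):
        mask = labels == seg_id
        # Mean RGB value of all pixels belonging to this segment.
        out[mask] = image[mask].reshape(-1, 3).mean(axis=0)
    return out.astype(image.dtype)
```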

3.2. Overall Network Structure

The proposed GOFENet architecture is a typical encoder–decoder structure composed of three main components: the primary encoder, the auxiliary encoder, and the decoder. Overall, the network integrates a CNN with the Swin Transformer. It incorporates the effective design of UNet, utilizing skip connections between the encoder and the decoder to transfer low-level features. To preserve the network’s ability to extract global context information, Swin Transformer basic blocks (SWTB) are added, as shown in Figure 3. Notably, we also introduce an auxiliary branch based on optimal segmentation results. This branch employs a series of dilated convolutions to capture object-based image features, thereby enhancing the network’s multi-scale and global perceptual capabilities. The design of the auxiliary encoder is explored in Section 5. Additionally, we develop CSMs and FEMs to further enhance the performance of the CNN. The overall architecture of GOFENet is depicted in Figure 4.
Unlike a conventional ResNet50 main encoder, our approach applies a CSM before each down-sampling operation (except for the first layer). It strengthens the relationships between channels and improves the model’s capacity to capture and express critical information. After the first ResBlock, the original remote sensing image $R \in \mathbb{R}^{h \times w \times 3}$ has its feature-map resolution halved and its channel count increased to 64. The output feature map of the $n$-th ResBlock is denoted as $T_n \in \mathbb{R}^{h/2^n \times w/2^n \times 2^{n-1}C_1}$, where $n \geq 2$ and $C_1 = 128$. The SWTB, which is central to the Swin Transformer architecture, is introduced in the third, fourth, and fifth layers of the backbone network. For efficient modeling, the Swin Transformer employs both conventional window-based multi-head self-attention (W-MSA) and shifted window-based multi-head self-attention (SW-MSA). W-MSA partitions the feature map into non-overlapping windows of size $S \times S$ and computes self-attention within these local windows. SW-MSA, in turn, improves inter-window information flow by offsetting the windows by half their size ($\lfloor S/2 \rfloor$) and then computing attention within the shifted windows. In our experiments, the window size is $S = 7$. In the SWTB, W-MSA and SW-MSA are connected in sequence and executed alternately, with the computation process as follows:
$$\begin{aligned}
\hat{t}^{\,l} &= \mathrm{W\text{-}MSA}\big(\mathrm{LN}(t^{\,l-1})\big) + t^{\,l-1} \\
t^{\,l} &= \mathrm{MLP}\big(\mathrm{LN}(\hat{t}^{\,l})\big) + \hat{t}^{\,l} \\
\hat{t}^{\,l+1} &= \mathrm{SW\text{-}MSA}\big(\mathrm{LN}(t^{\,l})\big) + t^{\,l} \\
t^{\,l+1} &= \mathrm{MLP}\big(\mathrm{LN}(\hat{t}^{\,l+1})\big) + \hat{t}^{\,l+1}
\end{aligned}$$
where $t^{\,l-1}$ signifies the input feature of the SWTB and $t^{\,l+1}$ indicates its output feature. The feature output size of the SWTB remains consistent with the input size and is denoted as $W_n \in \mathbb{R}^{h/2^n \times w/2^n \times 2^{n-3}C_2}$, where $n \geq 3$ and $C_2 = 512$. Additionally, the FEM, guided by the SWTB, is designed to enhance the representativeness and effectiveness of the low-level CNN features passed through the skip connections by highlighting the importance of the CNN channels.
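For illustration, the sketch below reproduces the residual LayerNorm/attention/MLP structure of the equations above in PyTorch. A generic multi-head attention stands in for W-MSA/SW-MSA, whose window partitioning and half-window shift are omitted for brevity, so this is a structural sketch rather than the actual Swin implementation:

```python
# A minimal sketch of the SWTB residual structure (tokens of shape (B, N, C));
# nn.MultiheadAttention is only a stand-in for the real W-MSA / SW-MSA.
import torch
import torch.nn as nn

class SwinPairSketch(nn.Module):
    def __init__(self, dim: int, heads: int, mlp_ratio: float = 4.0):
        super().__init__()
        hidden = int(dim * mlp_ratio)
        self.norm1 = nn.LayerNorm(dim)
        self.wmsa = nn.MultiheadAttention(dim, heads, batch_first=True)   # stands in for W-MSA
        self.norm2 = nn.LayerNorm(dim)
        self.mlp1 = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.norm3 = nn.LayerNorm(dim)
        self.swmsa = nn.MultiheadAttention(dim, heads, batch_first=True)  # stands in for SW-MSA
        self.norm4 = nn.LayerNorm(dim)
        self.mlp2 = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, t):                                   # t = t^{l-1}
        h = self.norm1(t)
        t = self.wmsa(h, h, h, need_weights=False)[0] + t   # t_hat^{l}
        t = self.mlp1(self.norm2(t)) + t                    # t^{l}
        h = self.norm3(t)
        t = self.swmsa(h, h, h, need_weights=False)[0] + t  # t_hat^{l+1}
        t = self.mlp2(self.norm4(t)) + t                    # t^{l+1}
        return t
```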
In the auxiliary branch, the original remote sensing image $R \in \mathbb{R}^{h \times w \times 3}$ is first processed by the EIODA segmentation algorithm to obtain a segmented image $Z \in \mathbb{R}^{h \times w \times 3}$. The auxiliary encoder shares the weights of its first layer with the main encoder, and the result is then processed through three dilated convolutions. The output of the $n$-th stage can be represented as $S_n \in \mathbb{R}^{h/2^n \times w/2^n \times 2^{n-3}C_3}$, where $n \geq 2$ and $C_3 = 128$.
After the encoding stage, the low-level features—$T_n$ processed by the FEM and $S_n$ from the auxiliary branch network—are fused by summation. The fused features are then up-sampled to increase resolution and passed through a $3 \times 3$ convolutional layer to decrease the number of channels, with each convolutional layer followed by batch normalization and ReLU activation. This procedure is repeated four times, gradually expanding the feature map to $F \in \mathbb{R}^{h/2 \times w/2 \times 64}$. Finally, bilinear interpolation up-sampling and two additional $3 \times 3$ convolutional layers are applied to $F$ to produce the final predicted mask.
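As a rough illustration of this decoding step, the PyTorch sketch below performs the repeated fuse–upsample–Conv-BN-ReLU sequence; channel alignment between the incoming feature maps is assumed to have been done beforehand, and the module name is ours:

```python
# A minimal sketch of one decoder stage: aligned feature maps (e.g., the
# FEM-refined CNN skip features and the auxiliary-branch features) are fused
# by summation, up-sampled, and passed through a 3x3 Conv-BN-ReLU block.
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, *features):
        fused = sum(features)                         # fusion by summation
        fused = F.interpolate(fused, scale_factor=2,
                              mode="bilinear", align_corners=False)
        return self.block(fused)
```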

3.3. Cross-Channel Selective Module

While ResNet networks excel at extracting deep-level features from images, they do not adequately address multi-scale and contextual information, which is essential for the semantic segmentation of remote sensing imagery with its complex features and intra-class variability. Figure 5 depicts the structure of the CSM, and its computational process can be expressed as follows:
$$Y = X + X \odot \mathrm{Softmax}\Big(©\{\mathrm{CA}_m(\mathrm{GConv}_m(X_m))\}_{m \in [1,4]}\Big)$$
where $X$ denotes the input features and $Y$ represents the output features. $\mathrm{CA}_m$ refers to the channel attention block, while $\mathrm{GConv}_m$ and $X_m$ denote the $m$-th group convolution block and its feature input, respectively. The symbol © denotes concatenation along the channel dimension, and $\odot$ denotes the element-wise product. First, the feature map $X$ obtained from the ResBlock is divided along the channel dimension according to a specified ratio. In this work, an equal division is employed ($r_1{:}r_2{:}r_3{:}r_4 = 1{:}1{:}1{:}1$), resulting in feature maps $X_1$, $X_2$, $X_3$, and $X_4$. Next, each feature map undergoes a group convolution with a different number of groups ($g_1, g_2, g_3, g_4 = 1, 4, 8, 16$) and kernel size ($3 \times 3$, $5 \times 5$, $7 \times 7$, $9 \times 9$), producing outputs $X_1'$, $X_2'$, $X_3'$, and $X_4'$, respectively. Subsequently, each $X_m'$ is processed by a channel attention block that uses an adaptively sized 1D convolutional kernel to improve cross-channel communication within each group. The adaptive kernel size $K_s$ is calculated using the following formula:

$$K_s = \left| \frac{\log_2 C + b}{\gamma} \right|_{odd}$$

where $C$ represents the number of input channels and $|t|_{odd}$ denotes the nearest odd number to $t$; $b$ is the bias coefficient and $\gamma$ is the scaling factor, with default values of 1 and 2, respectively. After concatenating the resulting $\tilde{X}_m$ along the channel dimension, the softmax function is applied to strengthen inter-group correlation. Finally, the channel weights are multiplied with the original feature matrix $X$ to generate the attention map, which is then added to $X$ to produce the output feature $Y$.
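A minimal PyTorch sketch of the CSM follows (the module names, padding choices, and the assumption that the input channel count is divisible by 64 are ours): the input is split into four channel groups, each passes through a grouped convolution and an ECA-style 1D channel attention whose kernel size follows the adaptive formula above, and the softmax-normalized weights are applied to the input with a residual connection.

```python
import math
import torch
import torch.nn as nn

class ChannelAttention1D(nn.Module):
    """ECA-style channel attention with adaptively sized 1D kernel."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        k = int(abs((math.log2(channels) + b) / gamma))
        k = k if k % 2 else k + 1                       # nearest odd size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                               # x: (B, C, H, W)
        w = x.mean(dim=(2, 3)).unsqueeze(1)             # global average pooling -> (B, 1, C)
        w = self.conv(w).transpose(1, 2).unsqueeze(-1)  # (B, C, 1, 1)
        return w

class CSMSketch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        c = channels // 4
        kernels, groups = (3, 5, 7, 9), (1, 4, 8, 16)
        self.branches = nn.ModuleList(
            [nn.Conv2d(c, c, k, padding=k // 2, groups=g, bias=False)
             for k, g in zip(kernels, groups)])
        self.attn = nn.ModuleList([ChannelAttention1D(c) for _ in range(4)])

    def forward(self, x):
        splits = torch.chunk(x, 4, dim=1)               # equal channel split
        weights = [att(conv(s)) for s, conv, att in zip(splits, self.branches, self.attn)]
        w = torch.softmax(torch.cat(weights, dim=1), dim=1)  # inter-group correlation
        return x + x * w                                # residual attention
```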

3.4. Feature Enhancement Module

In terms of the overall network architecture, a dual-branch structure with both CNN decoders and Transformer decoders is not utilized, as this would significantly increase the model complexity. Instead, the SWTB is integrated into the intermediate and deep layers of the network to ensure the preservation of global information. It is worth noting that the features of skip connections are not enhanced by CSM, as this may interfere with the capability of the SWTB to capture global information. To effectively apply global contextual features to the channel attention mechanism of CNN feature maps, we designed the FEM. It emphasizes the channels crucial for capturing global semantic information while suppressing irrelevant features. This design improves the model’s generalization ability and robustness across different data distributions. The structure is illustrated in Figure 6. As previously mentioned, Tn and Wn denote the outputs of the n-th layer Resblock and SWTB, respectively. FEM adopts a dual-path structure to enhance feature representations by capturing both spatial and channel dependencies. The spatial path applies a 3 × 3 dilated convolution followed by directional pooling to extract horizontal and vertical contexts, which are fused into a spatial attention map. The channel path reshapes the input and employs Swin Transformer blocks to extract global context, followed by mean and std pooling across spatial dimensions. The pooled statistics are concatenated and passed through a group-wise 1D convolution and sigmoid to generate channel attention. The two attention maps are fused and applied to the input via element-wise multiplication to refine features. This entire process can be encapsulated by the following equation:
$$A_s = \mathrm{Conv}_{1\times1}\big(P_h(T_n) \oplus P_v(T_n)\big)$$

$$A_c = \delta\Big(\mathrm{Conv1D}_{group=c}\big(©\{\mu(W_n), \sigma(W_n)\}\big)\Big)$$

$$T_n' = T_n \odot (A_s \oplus A_c)$$
where $P_h(\cdot)$ and $P_v(\cdot)$ denote directional pooling along the horizontal and vertical axes, $\mu(\cdot)$ and $\sigma(\cdot)$ denote mean and standard deviation pooling, and $\oplus$ denotes element-wise addition. $T_n'$ denotes the output features after passing through the FEM, $\mathrm{Conv1D}_{group=c}$ is the 1D grouped convolution with a kernel size of 2, and $\delta$ represents the sigmoid function.
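The PyTorch sketch below outlines the FEM dual-path attention; the dilation rate (2) and the interpretation of the fusion steps as element-wise addition are assumptions, and the SWTB output $W_n$ is taken as already reshaped to (B, C, H, W):

```python
import torch
import torch.nn as nn

class FEMSketch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.dilated = nn.Conv2d(channels, channels, 3, padding=2, dilation=2, bias=False)
        self.fuse_spatial = nn.Conv2d(channels, channels, kernel_size=1)
        # group-wise 1D convolution over the concatenated (mean, std) statistics
        self.channel_conv = nn.Conv1d(channels, channels, kernel_size=2, groups=channels)

    def forward(self, t_n, w_n):                        # t_n: CNN features, w_n: SWTB features
        # Spatial path: directional (strip) pooling, broadcast back to full size.
        d = self.dilated(t_n)
        p_h = d.mean(dim=3, keepdim=True).expand_as(d)  # horizontal context
        p_v = d.mean(dim=2, keepdim=True).expand_as(d)  # vertical context
        a_s = self.fuse_spatial(p_h + p_v)
        # Channel path: mean and std pooling of the global (SWTB) features.
        mu = w_n.mean(dim=(2, 3))                       # (B, C)
        sigma = w_n.std(dim=(2, 3))                     # (B, C)
        stats = torch.stack([mu, sigma], dim=2)         # (B, C, 2)
        a_c = torch.sigmoid(self.channel_conv(stats)).unsqueeze(-1)  # (B, C, 1, 1)
        # Fuse the two attention maps and refine the CNN features.
        return t_n * (a_s + a_c)
```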

3.5. Shallow–Deep Feature Fusion Module

Object-based image features aggregate similar pixels, which simplifies remote sensing images by describing meaningful segments and reflecting the overall properties of the image. However, this approach has the drawback of losing detailed information. Since the overall information is obtained directly from the raw image, we consider it to be a form of shallow image features. To effectively merge pixel-based features (local features) with object-based features (global features), the design of the SDFM is illustrated in Figure 7. As depicted in Figure 4, the channel count in the main encoder at each layer (excluding the first layer) is four times that of the auxiliary encoder. Therefore, to enable the addition of deep features (DF) and shallow features (SF), the DF need to undergo channel reduction via a 1 × 1 convolution. The fused features obtained by adding DF and SF are referred to as additive features (AF). The pixel attention module (PAM) comprises two components. First, the fused features undergo maximum and average pooling along the channel dimension to capture spatial statistics. Second, adaptive average pooling is applied to generate channel-wise attention. These channel and spatial attentions are then combined to produce the attention map (AM). This process is mathematically expressed by the following formula:
$$AM = \mathrm{MLP}\big(\mathrm{AvgPool}(AF)\big) \oplus \mathrm{Conv}_{7}\big(©\{\mathrm{GAP}(AF), \mathrm{GMP}(AF)\}\big)$$
where MLP denotes the fully connected layers, while GAP and GMP compute the mean and maximum values over the channel dimension at each pixel location, respectively. The AF and AM are cross-arranged (interleaved) along the channel dimension, and the pixel-wise attention map (PM) is then generated by applying a 7 × 7 grouped convolution followed by a sigmoid function. The PM is multiplied with the SF and DF using the weighting factor α, and the results are summed to produce the fused features (MF), as described in Equation (6). The channels are subsequently restored to the size of DF using a 1 × 1 convolution. The default value of α is 0.5.
$$MF = AF + \alpha \times PM \odot SF + (1 - \alpha) \times PM \odot DF$$
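A minimal PyTorch sketch of the SDFM follows; the channel-interleaving scheme, the MLP reduction ratio, and the group count of the 7 × 7 convolution reflect one plausible reading of the text above and should be treated as assumptions rather than the authors' exact design:

```python
import torch
import torch.nn as nn

class SDFMSketch(nn.Module):
    def __init__(self, deep_ch: int, shallow_ch: int, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha
        self.reduce = nn.Conv2d(deep_ch, shallow_ch, kernel_size=1)      # DF -> SF channels
        self.expand = nn.Conv2d(shallow_ch, deep_ch, kernel_size=1)      # restore DF channels
        self.mlp = nn.Sequential(nn.Linear(shallow_ch, shallow_ch // 4),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(shallow_ch // 4, shallow_ch))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)         # Conv_7 on (GAP, GMP)
        self.pm_conv = nn.Conv2d(2 * shallow_ch, shallow_ch, kernel_size=7,
                                 padding=3, groups=shallow_ch)

    def forward(self, df, sf):
        df = self.reduce(df)
        af = df + sf                                                     # additive features AF
        # Attention map AM: channel attention + spatial attention (broadcast sum).
        ca = self.mlp(af.mean(dim=(2, 3))).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        sa = self.spatial(torch.cat([af.mean(1, keepdim=True),
                                     af.amax(1, keepdim=True)], dim=1))  # (B, 1, H, W)
        am = ca + sa
        # Interleave AF and AM along the channel dimension, then grouped 7x7 conv.
        b, c, h, w = af.shape
        mixed = torch.stack([af, am], dim=2).reshape(b, 2 * c, h, w)
        pm = torch.sigmoid(self.pm_conv(mixed))                          # pixel-wise attention
        mf = af + self.alpha * pm * sf + (1 - self.alpha) * pm * df
        return self.expand(mf)
```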

4. Results

4.1. Datasets

4.1.1. GID

The Gaofen Image Dataset (GID) is a large-scale land cover classification dataset [68], which includes 150 high-quality GF-2 images collected from over 60 different cities in China. It contains a fine land cover classification (FLC) dataset, which comprises 15 annotated classes known as GID15, with each image measuring 6800 × 7200 pixels. Currently, only 110 annotated images from the training and validation sets are publicly available, and the official test server is inaccessible. As a result, the dataset is partitioned into training, validation, and test sets with a ratio of 9:1:1.

4.1.2. LoveDA Dataset

The LoveDA dataset consists of 5987 high-resolution images, categorized into 2522 training images, 1669 validation images, and 1796 test images [69]. These images are collected from both urban and rural environments, with each image sized at 1024 × 1024 pixels. The dataset encompasses seven land cover types: roads, buildings, lakes, bare land, forests, farmland, and background, with the highest number of annotations for buildings. The challenges of the dataset arise from factors such as the heterogeneous spatial distribution of urban and rural landscapes, complex backgrounds, and multi-scale objects, making LoveDA one of the most challenging datasets for land cover classification in the realm of remote sensing semantic segmentation.

4.2. Implementation Details

4.2.1. Experimental Settings

The models are trained using the PyTorch 2.0 framework, with all experiments performed on an NVIDIA RTX 3090 GPU with 24 GB of memory. To expedite convergence, the AdamW optimizer is employed with an initial learning rate of 0.0001, adjusted according to the yolox_warm_cos_lr schedule [70]. To facilitate efficient model training, images from the GID and LoveDA datasets are cropped into non-overlapping 512 × 512 pixel tiles, discarding any excess portions. Various data augmentation techniques are applied during training, including random scaling within the range [0.8, 1.2], random vertical and horizontal flips, random rotations, and color enhancements, with a batch size of 8. The maximum number of training epochs is set to 300 for the GID and 100 for the LoveDA dataset.
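The non-overlapping cropping step can be sketched as follows (Python/NumPy, H × W × C layout assumed); partial tiles at the right and bottom edges are simply discarded:

```python
# A small sketch of non-overlapping 512 x 512 tiling; leftover border pixels
# that do not fill a complete tile are dropped.
import numpy as np

def tile_image(image: np.ndarray, tile: int = 512):
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            tiles.append(image[y:y + tile, x:x + tile])
    return tiles
```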
For the EIODA algorithm, the MRS algorithm is used as the initial segmentation method, utilizing fixed segmentation parameters: a scale of 18, with compactness and smoothness set to the default value of 0.5. The EIODA segmentation method is applied to all image tiles in the dataset to obtain the complete segmentation results. It is implemented in a C++ environment on a computer with an Intel i7-12700F CPU (2.1 GHz) and 32 GB of RAM.

4.2.2. Loss Function

As depicted in Figure 8, the class imbalance present in the GID and the LoveDA dataset leads to a model that predominantly emphasizes the majority classes during training, frequently overlooking the minority classes. To tackle this issue, we use a combination of dice loss and cross-entropy loss as the final loss, which is formulated as
$$L = L_{CE} + L_{Dice}$$
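A minimal sketch of this combined loss in PyTorch is given below (cross-entropy plus a soft multi-class Dice term; the smoothing constant is an assumption):

```python
import torch
import torch.nn.functional as F

def ce_dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1.0):
    """logits: (B, K, H, W); target: (B, H, W) with integer class indices."""
    ce = F.cross_entropy(logits, target)
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    dice = 1.0 - ((2 * inter + eps) / (union + eps)).mean()
    return ce + dice
```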

4.2.3. Evaluation Metrics

Overall Accuracy (OA), mean Pixel Accuracy (mPA), and mean Intersection over Union (mIoU) are employed to quantitatively assess the segmentation performance of the model. OA denotes the ratio of accurately predicted pixels to the total number of pixels, ranging from 0 to 1, and reflects the overall performance of the model across all classes. The evaluation index mPA is calculated as the average of pixel accuracies across all land cover categories. Intersection over Union (IoU) is a commonly adopted measure in segmentation tasks that evaluates the overlap between predicted and ground truth regions for each class. mIoU signifies the average IoU across all classes and offers a comprehensive evaluation of the model’s performance, particularly in addressing class imbalance. It provides more detailed insights than OA. The formulas for these metrics are as follows:
$$OA = \frac{TP + TN}{TP + TN + FP + FN}$$

$$mPA = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FN_i}$$

$$IoU = \frac{TP}{TP + FN + FP}$$

$$mIoU = \frac{1}{N} \sum_{i=1}^{N} IoU_i = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FN_i + FP_i}$$
Notably, in the above equations, TP, TN, FP, and FN refer to the counts of true positives, true negatives, false positives, and false negatives, respectively, with N indicating the number of classes.
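All three metrics can be computed from a single confusion matrix, as in the short sketch below (Python/NumPy; the function name and array shapes are illustrative):

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, num_classes: int):
    """pred, gt: integer label arrays of identical shape."""
    # Confusion matrix: rows = ground-truth class, columns = predicted class.
    cm = np.bincount(num_classes * gt.ravel() + pred.ravel(),
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp          # predicted as class i, actually another class
    fn = cm.sum(axis=1) - tp          # class i pixels predicted as another class
    oa = tp.sum() / cm.sum()
    mpa = np.nanmean(tp / (tp + fn))  # per-class pixel accuracy, averaged
    miou = np.nanmean(tp / (tp + fp + fn))
    return oa, mpa, miou
```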

4.3. Ablation Study

Ablation studies are performed on the GID and the LoveDA dataset to assess the proposed network architecture and the efficacy of the three key modules—CSM, FEM, and SDFM—using a baseline network for comparison. This approach enables a thorough analysis of the significance and effectiveness of each component, with all experiments conducted under the same settings. The baseline network employs a complete ResNet50 architecture with SWTB introduced at the third, fourth, and fifth layers. The configuration aligns with the “Swin-Tiny” version, with a window size of 7 and block repetitions of {2, 6, 2}. To ensure compatibility with the output channels {512, 1024, 2048} of the last three layers of the encoder, the number of heads per layer is adjusted to {4, 8, 16}. In the ablation studies, the presence of the SDFM signifies the inclusion of the auxiliary encoder branch.
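For reference when reproducing the ablation, the baseline settings listed above can be collected into a plain configuration dictionary (key names are ours, not from the released code):

```python
# A compact, hypothetical summary of the baseline configuration described above.
baseline_cfg = {
    "backbone": "ResNet50",
    "swtb_stages": [3, 4, 5],          # layers receiving Swin Transformer blocks
    "window_size": 7,
    "swtb_depths": (2, 6, 2),          # block repetitions per stage
    "num_heads": (4, 8, 16),           # adjusted to the encoder output channels
    "stage_channels": (512, 1024, 2048),
}
```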

4.3.1. Effect of Integrating SWTB (Baseline)

To address the issue of global context loss as the depth of convolutional layers increases, our baseline network incorporates the SWTB module into the middle and deeper layers (the third, fourth, and fifth layers) of the ResNet50 backbone. A comparison of Table 1, Table 2 and Table 3 reveals that incorporating the SWTB module enhances segmentation performance on the GID. Specifically, mIoU increases from 59.73% to 62.73%, and OA rises from 80.98% to 82.60%, compared to the original UNet network. Similarly, for the LoveDA dataset, mIoU increases from 47.8% to 48.8%. This demonstrates that the SWTB module proficiently captures long-range dependencies within the data, aggregating more information beneficial for semantic prediction.

4.3.2. Effect of Cross-Channel Selective Module

Table 1 reports that the inclusion of CSM results in improvements of 0.74% and 1.01% in mIoU for the two datasets, respectively. In the LoveDA dataset, the Barren category exhibits the most significant enhancement in segmentation accuracy, achieving an increase of 2.93%. This is closely followed by the Agriculture category, which shows an improvement of 2.79%. Figure 9 presents the segmentation outcomes of the baseline before and after the incorporation of CSM. The figure demonstrates that it enables the model to adaptively adjust channel weights, thereby enhancing significant features while suppressing less critical ones. This adjustment reduces certain patches in the building and forest categories, which improves the model’s capacity for feature representation.

4.3.3. Effect of Feature Enhancement Module

Table 1 demonstrates that integrating FEM into the baseline framework yields improvements of 1.34% and 1.31% in mIoU for the GID and the LoveDA dataset, respectively, highlighting its effectiveness in the network. Notably, segmentation accuracy for various land cover types has been improved, particularly in the building and agriculture classes, with IoU increases of 3.97% and 2.17%, respectively. Figure 10 visually compares segmentation results before and after applying FEM to the baseline model. The results illustrate that it not only maintains robust segmentation performance but also tends to merge discontinuous and closely spaced ground objects into cohesive units. This suggests that incorporating global context information into CNNs can significantly improve segmentation accuracy by better aggregating similar regions.

4.3.4. Effect of Shallow–Deep Feature Fusion Module

Table 1 demonstrates that the introduction of an auxiliary encoding branch, using optimally segmented input images, results in improvements in mIoU, OA, and mPA on the GID by 0.56%, 0.53%, and 0.54%, respectively. On the LoveDA dataset, mIoU increases by 0.83%. Figure 11a provides a clear visual comparison of the segmentation results before and after integrating the SDFM into the baseline model. In the first and second rows, it successfully preserves the boundaries and structural integrity of various land cover types. The third and fourth rows demonstrate that, unlike the FEM, SDFM maintains the separability of distinct land cover classes—such as buildings and water bodies—preventing their erroneous merging into a single entity.
To further assess its contribution, we compare model variants with and without this module under various configurations on the GID and the LoveDA dataset (Figure 11b). The results consistently indicate that its inclusion improves overall segmentation performance. Additionally, as shown in Table 1, it enhances the model’s capacity to capture global features. This improvement is particularly evident in the segmentation performance of large-scale land cover categories such as water bodies, barren land, and agricultural areas in the LoveDA dataset.

4.3.5. Joint Effects of Different Modules

Furthermore, the joint effects of different modules on segmentation results within the baseline framework are explored on both datasets, as shown in Table 1. For the GID and the LoveDA dataset, introducing CSM and FEM together improves segmentation results by 2.55% and 1.98% mIoU, respectively. Notably, even without incorporating the SDFM, the Baseline + CSM + FEM configuration achieves consistently strong results across both datasets. This demonstrates that the CNN–Swin Transformer hybrid design (CSM + FEM) alone can effectively learn discriminative and robust representations, making it potentially applicable to other vision tasks. When both CSM and SDFM are used, segmentation results increase by 1.84% and 2.36% mIoU, respectively. Combining FEM and SDFM results in mIoU increases of 1.66% and 2.66%, respectively. Building on the baseline framework, the proposed GOFENet, which integrates the three key modules (CSM, FEM, and SDFM), achieves mIoU improvements of 3.29% and 3.08% on the respective datasets.
Table 1. Ablation experiment of the proposed modules on the GID and the LoveDA dataset. The mIoU, mPA, and OA columns report results on the GID; the per-category IoU columns and the final mIoU column report results on the LoveDA dataset. ✓ indicates that the module is included.

| Model Name | CSM | FEM | SDFM | mIoU (%) | mPA (%) | OA (%) | Background | Building | Road | Water | Barren | Forest | Agriculture | mIoU (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | | | | 62.73 | 74.89 | 82.60 | 43.59 | 51.67 | 55.06 | 75.80 | 15.62 | 43.84 | 56.27 | 48.84 |
| Baseline + CSM | ✓ | | | 63.47 | 74.01 | 82.99 | 44.14 | 52.87 | 54.06 | 74.67 | 18.55 | 45.57 | 59.06 | 49.85 |
| Baseline + FEM | | ✓ | | 64.07 | 74.75 | 83.44 | 43.30 | 55.64 | 55.08 | 76.81 | 16.78 | 45.01 | 58.44 | 50.15 |
| Baseline + SDFM | | | ✓ | 63.29 | 75.43 | 83.13 | 42.79 | 53.52 | 54.04 | 76.91 | 19.79 | 43.76 | 56.88 | 49.67 |
| Baseline + CSM + FEM | ✓ | ✓ | | 65.28 | 76.49 | 83.99 | 44.36 | 56.55 | 55.64 | 75.53 | 16.06 | 45.02 | 62.55 | 50.82 |
| Baseline + CSM + SDFM | ✓ | | ✓ | 64.57 | 75.69 | 83.63 | 43.26 | 55.46 | 54.23 | 76.28 | 20.53 | 45.46 | 63.19 | 51.20 |
| Baseline + FEM + SDFM | | ✓ | ✓ | 64.39 | 75.37 | 83.52 | 44.34 | 55.04 | 55.73 | 77.61 | 19.99 | 44.30 | 63.51 | 51.50 |
| Baseline + CSM + FEM + SDFM | ✓ | ✓ | ✓ | 66.02 | 77.78 | 84.07 | 45.20 | 55.63 | 56.75 | 78.13 | 19.28 | 44.88 | 63.60 | 51.92 |
Figure 9. Comparison of segmentation results with and without the integration of CSM in the baseline architecture.
Figure 10. Comparison of segmentation results with and without the integration of FEM in the baseline architecture.
Figure 11. (a) Comparison of segmentation results with and without the integration of SDFM in the baseline architecture. (b) Ablation study on the effect of integrating SDFM into different model variants. The plots show mIoU (%) performance on the GID (left) and the LoveDA dataset (right) before and after incorporating the SDFM.

4.4. Performance Evaluation and Comparisons with Other Models

We evaluate our proposed model against several state-of-the-art semantic segmentation methods on the GID and the LoveDA dataset to highlight its superiority. These methods are categorized into three main types:
(1) CNN-based methods: UNet [19], PSPNet [24], DeepLabV3+ [23], DANet [25], MANet [71], SegNet [72], FactSeg [73], LANet [74], HRNet [75], C-PNet [76].
(2) Transformer-based methods: SegFormer [36], Segmenter [35], Mask2Former [77].
(3) CNN–Transformer hybrid networks: BANet [40], TransUNet [78], DC-Swin [39], SwinUperNet [30], UNetFormer [44], LMA-Swin [41], ESDINet [79].
Table 2. Comparison of segmentation results on the GID. The bold text highlights the highest values in each column, while the second-highest values are marked with underlining.
| Method | Backbone | BG | INL | UR | RU | TL | PF | IL | DC | GP | AW | SL | NG | AG | RI | LA | PD | mIoU (%) | mPA (%) | OA (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| UNet [19] | ResNet50 | 64.07 | 57.72 | 68.68 | 59.62 | 51.60 | 72.48 | 77.86 | 67.43 | 30.53 | 72.55 | 68.73 | 38.51 | 32.53 | 67.56 | 81.66 | 44.22 | 59.73 | 75.23 | 80.98 |
| PSPNet [24] | ResNet50 | 64.73 | 58.35 | 68.26 | 59.43 | 43.60 | 72.03 | 76.42 | 69.74 | 32.43 | 73.34 | 68.19 | 43.96 | 35.65 | 71.49 | 85.67 | 47.06 | 60.65 | 71.95 | 81.25 |
| SegNet [72] | ResNet50 | 62.39 | 55.41 | 67.00 | 55.80 | 43.01 | 66.36 | 75.05 | 65.22 | 31.85 | 71.07 | 67.16 | 45.85 | 29.24 | 65.73 | 75.81 | 45.98 | 57.68 | 71.53 | 79.32 |
| DeeplabV3+ [23] | ResNet50 | 65.15 | 57.69 | 67.12 | 58.73 | 48.06 | 71.42 | 75.88 | 70.07 | 34.43 | 72.45 | 69.41 | 45.80 | 32.79 | 68.12 | 80.52 | 48.20 | 60.37 | 74.08 | 80.91 |
| DANet [25] | ResNet50 | 65.82 | 58.25 | 67.93 | 59.51 | 48.09 | 71.99 | 75.49 | 71.10 | 29.30 | 72.45 | 67.79 | 42.71 | 37.86 | 73.22 | 85.44 | 48.93 | 60.99 | 74.29 | 81.32 |
| SegFormer [36] | MIT-B1 | 66.91 | 58.35 | 68.54 | 59.72 | 47.48 | 72.30 | 77.53 | 71.85 | 30.62 | 74.00 | 69.40 | 50.05 | 32.74 | 72.71 | 85.43 | 50.23 | 61.74 | 73.70 | 82.23 |
| DC-Swin [39] | Swin-Tiny | 65.91 | 57.34 | 67.38 | 59.41 | 46.35 | 69.29 | 76.36 | 70.29 | 32.51 | 73.10 | 68.08 | 48.01 | 33.06 | 72.38 | 86.93 | 50.94 | 61.08 | 72.98 | 81.51 |
| MANet [71] | ResNet50 | 68.77 | 60.81 | 70.23 | 62.00 | 54.28 | 73.13 | 78.84 | 73.75 | 37.82 | 75.60 | 72.21 | 51.00 | 35.79 | 77.56 | 88.28 | 54.41 | 64.69 | 76.23 | 83.59 |
| UNetFormer [44] | ResNet18 | 67.82 | 61.20 | 70.22 | 61.32 | 53.85 | 72.56 | 77.56 | 71.73 | 36.47 | 73.59 | 70.06 | 52.86 | 36.46 | 75.19 | 87.28 | 52.54 | 63.79 | 75.74 | 82.80 |
| LMA-Swin [41] | Swin-Small | 68.50 | 59.10 | 70.61 | 62.63 | 55.46 | 70.83 | 77.68 | 73.32 | 37.19 | 74.76 | 70.93 | 45.99 | 37.62 | 76.84 | 88.37 | 50.61 | 63.78 | 74.79 | 83.12 |
| GOFENet (ours) | ResNet50 | 69.24 | 61.77 | 71.33 | 63.68 | 55.90 | 74.79 | 79.66 | 74.82 | 40.38 | 75.24 | 72.89 | 55.25 | 40.43 | 77.06 | 87.55 | 56.35 | 66.02 | 77.78 | 84.07 |
Note: BG (Background), INL (Industrial Land), UR (Urban Residential), RU (Rural Residential), TL (Traffic Land), PF (Paddy Field), IL (Irrigated Land), DC (Dry Cropland), GP (Garden Plot), AW (Arbor Woodland), SL (Shrub Land), NG (Natural Grassland), AG (Artificial Grassland), RI (River), LA (Lake), PD (Pond).

4.4.1. Results on GID

Table 2 provides a quantitative comparison of various semantic segmentation methods on the GID. Our proposed model, GOFENet, outperforms other networks in overall metrics, achieving mIoU, mPA, and OA values of 66.02%, 77.78%, and 84.07%, respectively. Among the 16 land cover categories, including the background, GOFENet secures the highest or second-highest IoU in nearly every category, demonstrating impressive performance. Notably, compared to the second-best values, the IoU improvements for the AG, GP, and NG categories exceed two percentage points, while the IoU increases for the RU, PF, DC, and PD categories are over one percentage point. This underscores GOFENet’s superior capability in capturing multi-scale information. In contrast to traditional CNN-based models like UNet and SegNet, which lack mechanisms for global context, DeeplabV3+ and PSPNet improve performance by expanding the receptive field through dilated convolutions and pyramid pooling. Attention-based methods such as DANet and MANet leverage dual- or multi-attention mechanisms for enhanced global dependency modeling, with MANet achieving the second-best results in many categories. Moreover, Transformer-based architectures like SegFormer, DC-Swin, LMA-Swin, and UNetFormer exhibit strong performance due to their long-range modeling capabilities, with UNetFormer balancing segmentation quality and model efficiency.
The performance variations in GOFENet across different land cover categories can be attributed to the alignment between its architectural components and the intrinsic characteristics of each land type. For large-area, regular, and spatially continuous classes such as irrigated land, dry cropland, arbor woodland, lake, and artificial water, the combination of global modeling capability and the SDFM helps maintain region-level consistency and significantly enhances classification accuracy. For structurally complex categories such as urban residential and industrial land, SWTB embedded within the FEM effectively captures long-range dependencies and spatial structures, resulting in more accurate boundary delineation. In contrast, for small-area categories like Greenhouse and paddy field, the model’s multi-scale fusion strategy and local detail preservation significantly improve the segmentation of small targets. However, for spectrally heterogeneous and boundary-ambiguous categories such as artificial grassland, garden plot, and natural grassland, subtle inter-class differences and feature overlaps may lead to misclassification and relatively lower accuracy.
In addition to quantitative improvements, GOFENet also demonstrates clear qualitative advantages, as illustrated in Figure 12. First, GOFENet demonstrates superior multi-scale object recognition capabilities, effectively identifying small-scale targets such as arbor woodland in the first image and ponds in the third image, which are missed by other models. Second, the boundaries delineated by GOFENet closely align with the reference outlines, as observed with buildings in the first image, irrigated land in the second image, and ponds in the fifth and seventh image. Finally, its segmentation results exhibit remarkable continuity and smoothness, evident in the connectivity of roads and the minimal presence of holes and patch noise in the segmentation output. These advantages confirm the model’s ability to balance local detail preservation with global structural consistency. Figure 13 presents a visual comparison between the object contours and ground truth boxes identified by the current state-of-the-art methods and GOFENet. It is evident that GOFENet not only achieves high segmentation accuracy but also demonstrates strong consistency with the ground truth contours.

4.4.2. Results on the LoveDA Dataset

We submitted the results to the LoveDA semantic segmentation challenge website to acquire benchmark results for the test data. Table 3 presents the comparison results of GOFENet against a range of state-of-the-art semantic segmentation models on the LoveDA dataset. GOFENet achieves the highest overall segmentation performance, with a mean IoU of 51.9%, outperforming both CNN-based methods (e.g., UNet, HRNet, C-PNet) and Transformer-based approaches (e.g., Mask2Former, SwinUperNet, ESDINet). These results demonstrate the strong generalization capability of GOFENet in handling diverse land cover types across complex urban and rural scenarios. Notably, GOFENet achieves the highest IoU in the Agriculture category (63.6%) and demonstrates strong performance in Water (78.1%) and Barren (19.3%). This improvement can be attributed to the SDFM and multi-scale feature aggregation techniques, which enhance the model’s ability to distinguish regions with high spectral homogeneity. However, the model shows relatively lower performance on small-scale targets (e.g., Building) and categories with complex textures and ambiguous boundaries (e.g., Forest). These challenges are primarily due to the inherent structural complexity and edge uncertainty of such classes, which hinder the accurate extraction of local spatial details and fine-grained boundary representations. Comparatively, HRNet slightly surpasses GOFENet in the Road category (57.4% vs. 56.7%), benefiting from its parallel multi-resolution representation and continuous feature fusion, which are particularly effective for preserving fine spatial details and segmenting elongated structures. In the Building category, Mask2Former achieves higher accuracy (56.8%) through its capacity to model both local and global feature dependencies, thereby improving the segmentation of complex architectural shapes and boundaries. For the Forest class, DC-Swin delivers the best performance (47.2%) by leveraging densely connected feature aggregation modules, which enhance the model’s ability to capture rich texture and contextual information. Overall, the performance variations across categories closely reflect each model’s architectural strengths in extracting specific spatial and spectral features.
Figure 14 qualitatively presents the segmentation results of different models. Observations from the first, second, and fifth rows show that the proposed method effectively preserves the structural integrity and separability of individual buildings and forest regions. In contrast, other models tend to merge adjacent buildings into a single connected component, failing to capture clear boundaries between them. Moreover, in the third row, GOFENet provides a more precise segmentation of the Forest category, maintaining its spatial continuity. In the fourth and sixth rows, the delineation of agricultural areas also more closely aligns with the ground truth. These qualitative findings are further supported by Figure 15, where the features identified by GOFENet exhibit pronounced spatial coherence. The model effectively mitigates fragmented segmentation and significantly reduces the presence of small, scattered patches, highlighting its robustness in capturing continuous land cover patterns.
Table 3. Comparison of segmentation results on the LoveDA dataset. The bold text highlights the highest values in each column, while the second-highest values are marked with underlining.
| Method | Backbone | Background | Building | Road | Water | Barren | Forest | Agriculture | mIoU (%) |
|---|---|---|---|---|---|---|---|---|---|
| UNet [19] | ResNet50 | 43.1 | 52.7 | 52.8 | 73.1 | 10.3 | 43.1 | 59.9 | 47.8 |
| PSPNet [24] | ResNet50 | 44.4 | 52.1 | 53.5 | 76.5 | 9.7 | 44.1 | 57.9 | 48.3 |
| SegNet [72] | ResNet50 | 43.4 | 53.0 | 52.9 | 76.0 | 12.7 | 44.0 | 54.8 | 48.1 |
| DeeplabV3+ [23] | ResNet50 | 43.0 | 50.9 | 52.0 | 74.4 | 10.4 | 44.2 | 58.5 | 47.6 |
| Segmenter [35] | Vit-Tiny | 38.0 | 50.7 | 48.7 | 77.4 | 13.3 | 43.5 | 58.2 | 47.1 |
| BANet [40] | ResT-Lite | 43.7 | 51.5 | 51.1 | 76.9 | 16.6 | 44.9 | 62.5 | 49.6 |
| DANet [25] | ResNet50 | 44.8 | 55.5 | 53.0 | 75.5 | 17.6 | 45.1 | 60.1 | 50.2 |
| LANet [74] | ResNet50 | 40.0 | 50.6 | 51.1 | 78.0 | 13.0 | 43.2 | 56.9 | 47.6 |
| SegFormer [36] | MIT-B1 | 43.0 | 52.2 | 53.2 | 68.6 | 10.3 | 45.4 | 53.1 | 46.5 |
| FactSeg [73] | ResNet50 | 42.6 | 53.6 | 52.8 | 76.9 | 16.2 | 42.9 | 57.5 | 48.9 |
| MANet [71] | ResNet50 | 38.7 | 51.7 | 42.6 | 72.0 | 15.3 | 42.1 | 57.7 | 45.7 |
| HRNet [75] | W32 | 44.6 | 55.3 | 57.4 | 74.0 | 11.1 | 45.3 | 60.9 | 49.8 |
| DC-Swin [39] | Swin-Tiny | 41.3 | 54.5 | 56.2 | 78.1 | 14.5 | 47.2 | 62.4 | 50.6 |
| TransUNet [78] | ViT-R50 | 43.0 | 56.1 | 53.7 | 78.0 | 9.3 | 44.9 | 56.9 | 48.9 |
| SwinUperNet [30] | Swin-Tiny | 43.3 | 54.3 | 54.3 | 78.7 | 14.9 | 45.3 | 59.6 | 50.0 |
| C-PNet [76] | - | 44.0 | 55.2 | 55.3 | 78.8 | 16.0 | 46.4 | 58.0 | 51.8 |
| Mask2Former [77] | Swin-Small | 44.8 | 56.8 | 55.5 | 78.6 | 17.8 | 46.3 | 60.0 | 51.5 |
| ESDINet [79] | ResNet18 | 41.6 | 53.8 | 54.8 | 78.7 | 19.5 | 44.2 | 58.0 | 50.1 |
| GOFENet (ours) | ResNet50 | 45.2 | 55.6 | 56.7 | 78.1 | 19.3 | 44.9 | 63.6 | 51.9 |
Figure 14. Examples of semantic segmentation results on the LoveDA dataset.
Figure 15. Visualization comparison of object contours and ground truth boxes identified by the current state-of-the-art methods and GOFENet on the LoveDA dataset.

5. Discussion

5.1. Design of Auxiliary Branch

The segmentation results of the EIODA algorithm, which integrates spectral and edge information, reflect high homogeneity within segments and high heterogeneity between segments. This approach elevates image processing from the pixel level to the level of meaningful geo-objects, enhancing the overall perception of remote sensing images. This study explores the use of segmentation results as input to a neural network, leveraging object-level features to enhance the global information of the model, and designs an auxiliary branch network. The auxiliary encoder shares parameters with the primary encoder in the first layer, motivated by two main considerations: (1) reducing computational load, and (2) strengthening the pixel-to-object-level correspondence in image features. When the object scale is larger than the convolution kernel size, the features extracted by both the primary and auxiliary encoders correspond to pixel-level and object-level features within the window, respectively. Conversely, when the object scale is smaller than the convolution kernel size, meaning multiple objects are present within the window, the pixels in the feature maps obtained from the auxiliary encoder represent cross-object features, which may introduce additional disruptive information. Specifically, if segmented images are directly input into the CNN as if they were original images, the stacking of convolutional layers may progressively amplify the boundary information between objects due to the pixel discontinuities at object boundaries, leading to information distortion and interference with the fused data. In this scenario, the model is unable to learn any valuable information, potentially even degrading its segmentation performance.
Additionally, we explored the use of deformable convolutions to capture features from segmented images. However, the geometric and spatial differences between objects across different images made learning the offset parameters difficult. Following the recommendations for the hybrid dilated convolution (HDC) [80], a series of consecutive dilated convolutions with appropriate dilation rates is constructed to gradually expand the receptive field of the network. It is noteworthy that, to avoid the interference of non-semantic features on the final results, the SDFM is introduced exclusively in the last three layers of the network decoder.
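A cascaded dilated-convolution stack following the HDC guideline can be sketched as below (PyTorch); the dilation rates (1, 2, 5) and shared channel width are assumptions, since the exact values used in the auxiliary encoder are not listed here:

```python
# A minimal sketch of a hybrid-dilated-convolution (HDC) style stack:
# successive 3x3 dilated convolutions with non-repeating, co-prime rates
# progressively enlarge the receptive field without gridding artifacts.
import torch.nn as nn

def hdc_stack(channels: int, rates=(1, 2, 5)) -> nn.Sequential:
    layers = []
    for r in rates:
        layers += [
            nn.Conv2d(channels, channels, kernel_size=3, padding=r, dilation=r, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        ]
    return nn.Sequential(*layers)
```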

5.2. Model Efficiency Analysis

To evaluate model performance across the two datasets, we use computational complexity (GFLOPs) and inference speed (FPS) as metrics to analyze model efficiency under consistent operating conditions. Table 4 and Table 5 present the evaluation results of the models on the two datasets; note that the input image size in Table 4 is 512 × 512, whereas in Table 5 it is 1024 × 1024. A comprehensive analysis reveals that hybrid CNN–Transformer models exhibit higher computational complexity than pure CNN architectures, as well as typically slower inference speeds. Our proposed GOFENet achieves an inference speed of approximately 17 FPS for 512 × 512 pixel images. The high model complexity is attributed to the uncompressed output channels of the ResNet skip connections at each decoder layer. Consequently, we considered utilizing ResNet18/ResNet34 as backbones and implementing structural simplifications, termed GOFENet-t and GOFENet-s. Table 4 and Table 5 report a substantial reduction in complexity and a significant enhancement in forward inference speed for the simplified models.
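Inference speed can be measured with a simple timing loop such as the sketch below (PyTorch, single GPU assumed; the warm-up and iteration counts are arbitrary choices, not the authors' protocol):

```python
import time
import torch

@torch.no_grad()
def measure_fps(model: torch.nn.Module, size=(1, 3, 512, 512), iters: int = 100) -> float:
    model.eval().cuda()
    x = torch.randn(size, device="cuda")
    for _ in range(10):                 # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()            # wait for all GPU work before timing
    return iters / (time.time() - start)
```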
Figure 16 visually illustrates the performance trade-offs across the three metrics. Specifically, on the GID, GOFENet-s reduces GFLOPs by approximately 48.5% while only sacrificing 1.3% mIoU, and achieves a 2.5× increase in inference speed. GOFENet-t further reduces GFLOPs by 77.9%, with a 2.7% mIoU drop, and improves speed by approximately 4.5×. On the LoveDA dataset, GOFENet-s achieves comparable mIoU, while reducing GFLOPs by 46.4% and doubling the inference speed. GOFENet-t maintains similar accuracy (−0.7% mIoU), but with a 77.2% reduction in complexity and 3.8× faster inference. These comparisons demonstrate that the lightweight variants provide practical alternatives depending on deployment constraints: GOFENet-s is suitable for tasks that require a balance of accuracy and moderate acceleration, while GOFENet-t is ideal for real-time applications prioritizing speed and efficiency.
Notably, GOFENet-s shows a slight improvement on the LoveDA dataset, and the mIoU values of GOFENet and its simplified variants all remain between 51% and 52%. This is likely because the dataset's smaller number of classes allows shallower models to learn the principal land cover characteristics effectively. On the GID, by contrast, the simplified models trade some segmentation accuracy for lower complexity, likely because a lightweight backbone struggles to comprehensively capture the features of a larger number of categories. These results further validate the efficacy of the network architecture and the three key modules, confirming that integrating the optimized segmentation results enhances global information acquisition (Table 6).
Before training the network, the EIODA algorithm is applied to perform multi-scale segmentation on each 512 × 512 × 3 image in the datasets. Table 7 presents statistics on object sizes (pixel counts) for the training, validation, and test sets of both datasets, including minimum, maximum, and mean values, along with the average execution time of the EIODA algorithm. Under the same segmentation parameters, GID images contain significantly more objects than LoveDA images, suggesting that the GID covers more complex and discriminative land cover types. Conversely, this reflects the spectral similarity among land cover classes in the LoveDA dataset, which makes the different land cover types harder to distinguish. Notably, even without GPU acceleration or parallel processing, the EIODA algorithm remains relatively fast.
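Gathering statistics of this kind only requires a pass over the EIODA label maps. The sketch below pools the per-object pixel counts of a split and reports their minimum, maximum, and mean; pooling across images (rather than averaging per image first) is our assumption, not necessarily the aggregation used for Table 7.

```python
import numpy as np

def object_size_stats(label_maps):
    """Minimum, maximum, and mean object size (in pixels) over a set of
    EIODA label maps. `label_maps` is an iterable of (H, W) integer arrays,
    one per 512 x 512 crop.
    """
    sizes = []
    for labels in label_maps:
        _, counts = np.unique(labels, return_counts=True)   # pixels per segment
        sizes.append(counts)
    sizes = np.concatenate(sizes)
    return int(sizes.min()), int(sizes.max()), float(sizes.mean())
```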

5.3. Grad-CAM Visualization

To showcase the benefits of the proposed network and assess the effectiveness of the global–local context fusion mechanism, Grad-CAM [81] is used to visualize the attention maps at the final classification layers of the networks on both datasets. Two target land cover categories are randomly selected for evaluation. The first row of images shows the attention maps of GOFENet for these categories at different layers (GOFENet-Ln denotes the feature map after up-sampling at layer n of the network). As observed in Figure 17, as the decoder progresses, the attention maps for roads and buildings become increasingly refined and enhanced. Compared with other models, GOFENet achieves a better balance between global and local features, yielding clearer land cover boundaries and stronger, more focused attention.
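Maps of this kind can be reproduced with standard forward/backward hooks. The sketch below adapts Grad-CAM to a segmentation head by summing the target-class logit over all pixels before back-propagation; the choice of `target_layer` (any decoder module) and the min–max normalization are our assumptions, not the paper's exact visualization code.

```python
import torch
import torch.nn.functional as F

def grad_cam_segmentation(model, image, target_layer, class_idx):
    """Grad-CAM [81] for a semantic segmentation model.

    image : (1, 3, H, W) input tensor; the class score is summed over pixels.
    """
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    model.eval()
    logits = model(image)                       # (1, num_classes, H, W)
    score = logits[:, class_idx].sum()          # scalar score for the class
    model.zero_grad()
    score.backward()
    h1.remove()
    h2.remove()
    fmap, grad = feats[0], grads[0]             # (1, C, h, w) each
    weights = grad.mean(dim=(2, 3), keepdim=True)          # channel weights
    cam = F.relu((weights * fmap).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```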

5.4. Advantages and Limitations of Embedded Object Features

According to the experimental results, the deep learning semantic segmentation framework integrating GEOBIA can improve segmentation accuracy and refine boundary information to some extent. We examine the effect of integrating segmentation maps via the SDFM on the classification of large-scale land cover categories. As illustrated in Figure 8, our analysis focuses on the dominant classes with the largest pixel coverage in the GID and the LoveDA dataset. Specifically, these include Irrigated Land, Dry Cropland, Arbor Woodland, Lake, and River in GID (Figure 18a), and Agriculture, Forest, Water, and Barren in LoveDA (Figure 18b). Overall, the integration of the SDFM enhances segmentation accuracy for large-scale land cover types. In the GID, although a few categories showed slight performance declines due to external factors, these decreases were limited and did not offset the overall improvement. Notably, the Forest class in the LoveDA dataset exhibited a consistent decline across all four model configurations, likely due to its fragmented and discontinuous spatial distribution, which hindered effective spatial fusion. These results indicate that integrating GEOBIA-derived segmentation priors significantly enhances the model's capacity for global feature extraction, thereby improving its ability to accurately identify and delineate extensive and spatially coherent land cover categories.
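The class-wise comparison in Figure 18 (and the per-class values in Table 6) rests on per-class IoU. A minimal sketch of the metric computation is given below; the ignore-label handling is our assumption and should be matched to each dataset's conventions.

```python
import numpy as np

def confusion_matrix(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
    """Confusion matrix with ground truth along rows and predictions along columns."""
    mask = (gt >= 0) & (gt < num_classes)        # drop ignore / out-of-range labels
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def per_class_iou(conf: np.ndarray) -> np.ndarray:
    """Per-class IoU; mIoU is the mean over the valid classes."""
    inter = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=1) + conf.sum(axis=0) - inter
    return inter / np.maximum(union, 1)
```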
However, while object-based image features capture the overall characteristics of an image, they tend to lose detailed information. Furthermore, issues such as object occlusion, discontinuous linear features, and image noise can significantly affect multi-scale segmentation results, thereby impacting the network’s ability to focus on fine-scale objects.

6. Conclusions

To address the challenges of limited global context modeling and the underutilization of GEOBIA-derived object priors in semantic segmentation, we propose GOFENet, a hybrid dual-encoder network that combines CNN and Swin Transformer to jointly capture local spatial details and long-range dependencies. To further enhance the encoding process, the CSM selectively emphasizes informative channels, while the FEM integrates global semantics from the Transformer into the CNN pathway, improving the discriminative power of skip-connected features. In addition, considering that GEOBIA-based segmented objects, although lacking explicit semantics, provide valuable structural context and global understanding of the scene, we introduce an auxiliary encoding branch. This branch employs cascaded atrous convolutions to extract global structural features from the segmentation priors. These features are then integrated with those from the main encoder through the SDFM, facilitating the joint representation of object-level and pixel-level information. This design improves the model’s ability to capture global contextual dependencies and semantic consistency, ultimately leading to more accurate and robust segmentation outcomes. Extensive experiments on two large-scale datasets—GID and LoveDA—demonstrate the effectiveness and generalization ability of GOFENet, achieving mIoU values of 66.02% and 51.92%, respectively. The resulting segmentation maps display clear object boundaries, reduced background noise, and improved preservation of semantic integrity. Notably, GOFENet exhibits a strong advantage in accurately segmenting large, continuous land cover regions, further highlighting its capability in handling high-resolution and complex scene interpretation tasks.
In future work, we aim to further explore strategies to mitigate the uncertainty introduced when integrating segmented images into deep neural networks. Furthermore, to assess the generalizability of our method, we will improve the model and extend its application to other remote sensing tasks.

Author Contributions

Conceptualization, T.H. and J.C.; methodology, T.H.; software, J.C.; validation, T.H. and J.C.; formal analysis, T.H.; writing—original draft preparation, T.H.; writing—review and editing, T.H., J.C., and D.P.; visualization, T.H.; supervision, J.C. and D.P.; funding acquisition, J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the PI Project of Southern Marine Science and Engineering Guangdong Laboratory (Guangzhou) (GML2021GD0809), the National Key Research and Development Program of China (2022YFC3103101), the NSFC-Zhejiang Joint Fund for the Integration of Industrialization and Informatization (U1609202), and the National Natural Science Foundation of China (41376184 and 40976109).

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lei, P.; Yi, J.; Li, S.; Li, Y.; Lin, H. Agricultural Surface Water Extraction in Environmental Remote Sensing: A Novel Semantic Segmentation Model Emphasizing Contextual Information Enhancement and Foreground Detail Attention. Neurocomputing 2025, 617, 129110. [Google Scholar] [CrossRef]
  2. Liu, Y.; Zhong, Y.; Shi, S.; Zhang, L. Scale-Aware Deep Reinforcement Learning for High Resolution Remote Sensing Imagery Classification. ISPRS J. Photogramm. Remote Sens. 2024, 209, 296–311. [Google Scholar] [CrossRef]
  3. Wang, Y.; Sun, Y.; Cao, X.; Wang, Y.; Zhang, W.; Cheng, X. A Review of Regional and Global Scale Land Use/Land Cover (LULC) Mapping Products Generated from Satellite Remote Sensing. ISPRS J. Photogramm. Remote Sens. 2023, 206, 311–334. [Google Scholar] [CrossRef]
  4. Zhu, Q.; Guo, X.; Deng, W.; Shi, S.; Guan, Q.; Zhong, Y.; Zhang, L.; Li, D. Land-Use/Land-Cover Change Detection Based on a Siamese Global Learning Framework for High Spatial Resolution Remote Sensing Imagery. ISPRS J. Photogramm. Remote Sens. 2022, 184, 63–78. [Google Scholar] [CrossRef]
  5. Zhang, T.; Huang, X. Monitoring of Urban Impervious Surfaces Using Time Series of High-Resolution Remote Sensing Images in Rapidly Urbanized Areas: A Case Study of Shenzhen. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 2692–2708. [Google Scholar] [CrossRef]
  6. Chen, B.; Qiu, F.; Wu, B.; Du, H. Image Segmentation Based on Constrained Spectral Variance Difference and Edge Penalty. Remote Sens. 2015, 7, 5980–6004. [Google Scholar] [CrossRef]
  7. Zhou, Y.; Feng, L.; Chen, Y.; Li, J. Object-Based Land Cover Mapping Using Adaptive Scale Segmentation From ZY-3 Satellite Images. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; IEEE: New York, NY, USA; pp. 63–66. [Google Scholar]
  8. Thenkabail, P.S. Remotely Sensed Data Characterization, Classification, and Accuracies; CRC Press: Boca Raton, FL, USA, 2015. [Google Scholar]
  9. Derivaux, S.; Lefevre, S.; Wemmert, C.; Korczak, J. Watershed Segmentation of Remotely Sensed Images Based on a Supervised Fuzzy Pixel Classification. In Proceedings of the 2006 IEEE International Symposium on Geoscience and Remote Sensing, Denver, CO, USA, 31 July–4 August 2006; IEEE: New York, NY, USA, 2006; pp. 3712–3715. [Google Scholar]
  10. Baatz, M.; Schäpe, A. Multiresolution Segmentation: An Optimization Approach for High Quality Multi-Scale Image Segmentation. In Angewandte Geographische Informationsverarbeitung XII; Wichmann Verlag: Karlsruhe, Germany, 2000; pp. 12–23. [Google Scholar]
  11. Tzotsos, A.; Argialas, D. MSEG: A Generic Region-Based Multi-Scale Image Segmentation Algorithm for Remote Sensing Imagery. In Proceedings of the ASPRS 2006 Annual Conference, Reno, NV, USA, 1–5 May 2006. [Google Scholar]
  12. Yang, J.; He, Y.; Caspersen, J. Region Merging Using Local Spectral Angle Thresholds: A More Accurate Method for Hybrid Segmentation of Remote Sensing Images. Remote Sens. Environ. 2017, 190, 137–148. [Google Scholar] [CrossRef]
  13. Blaschke, T.; Hay, G.J.; Kelly, M.; Lang, S.; Hofmann, P.; Addink, E.; Feitosa, R.Q.; Van der Meer, F.; Van der Werff, H.; Van Coillie, F.; et al. Geographic Object-Based Image Analysis–Towards a New Paradigm. ISPRS J. Photogramm. Remote Sens. 2014, 87, 180–191. [Google Scholar] [CrossRef]
  14. Hu, Y.; Chen, J.; Pan, D.; Hao, Z. Edge-Guided Image Object Detection in Multiscale Segmentation for High-Resolution Remotely Sensed Imagery. IEEE Trans. Geosci. Remote Sens. 2016, 54, 4702–4711. [Google Scholar] [CrossRef]
  15. Shen, Y.; Chen, J.; Xiao, L.; Pan, D. Optimizing Multiscale Segmentation with Local Spectral Heterogeneity Measure for High Resolution Remote Sensing Images. ISPRS J. Photogramm. Remote Sens. 2019, 157, 13–25. [Google Scholar] [CrossRef]
  16. He, T.; Chen, J.; Kang, L.; Zhu, Q. Evaluation of Global-Scale and Local-Scale Optimized Segmentation Algorithms in GEOBIA With SAM on Land Use and Land Cover. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 6721–6738. [Google Scholar] [CrossRef]
  17. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  18. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  19. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; Part III; pp. 234–241. [Google Scholar]
  20. Xiang, X.; Gong, W.; Li, S.; Chen, J.; Ren, T. TCNet: Multiscale Fusion of Transformer and CNN for Semantic Segmentation of Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 3123–3136. [Google Scholar] [CrossRef]
  21. Liu, B.; Li, B.; Sreeram, V.; Li, S. MBT-UNet: Multi-Branch Transform Combined with UNet for Semantic Segmentation of Remote Sensing Images. Remote Sens. 2024, 16, 2776. [Google Scholar] [CrossRef]
  22. Ren, X.; Deng, Z.; Ye, J.; He, J.; Yang, D. FCN+: Global Receptive Convolution Makes Fcn Great Again. Neurocomputing 2025, 631, 129655. [Google Scholar] [CrossRef]
  23. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  24. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  25. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
  26. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  27. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in Vision: A Survey. ACM Comput. Surv. 2022, 54, 1–41. [Google Scholar] [CrossRef]
  28. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  29. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar]
  30. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  31. Bai, Q.; Luo, X.; Wang, Y.; Wei, T. DHRNet: A Dual-Branch Hybrid Reinforcement Network for Semantic Segmentation of Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 4176–4193. [Google Scholar] [CrossRef]
  32. Gao, Y.; Luo, X.; Gao, X.; Yan, W.; Pan, X.; Fu, X. Semantic Segmentation of Remote Sensing Images Based on Multiscale Features and Global Information Modeling. Expert Syst. Appl. 2024, 249, 123616. [Google Scholar] [CrossRef]
  33. Yang, X.; Li, S.; Chen, Z.; Chanussot, J.; Jia, X.; Zhang, B.; Li, B.; Chen, P. An Attention-Fused Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery. ISPRS J. Photogramm. Remote Sens. 2021, 177, 238–262. [Google Scholar] [CrossRef]
  34. Zhou, N.; Hong, J.; Cui, W.; Wu, S.; Zhang, Z. A Multiscale Attention Segment Network-Based Semantic Segmentation Model for Landslide Remote Sensing Images. Remote Sens. 2024, 16, 1712. [Google Scholar] [CrossRef]
  35. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 7262–7272. [Google Scholar]
  36. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  37. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-Like pure Transformer for Medical Image Segmentation. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 205–218. [Google Scholar]
  38. Zhang, Y.; Liu, H.; Hu, Q. TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, 27 September–1 October 2021; Springer: Berlin/Heidelberg, Germany, 2021; Part I; pp. 14–24. [Google Scholar]
  39. Wang, L.; Li, R.; Duan, C.; Zhang, C.; Meng, X.; Fang, S. A Novel Transformer Based Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  40. Wang, L.; Li, R.; Wang, D.; Duan, C.; Wang, T.; Meng, X. Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images. Remote Sens. 2021, 13, 3065. [Google Scholar] [CrossRef]
  41. Ren, D.; Li, F.; Sun, H.; Liu, L.; Ren, S.; Yu, M. Local-Enhanced Multi-Scale Aggregation Swin Transformer for Semantic Segmentation of High-Resolution Remote Sensing Images. Int. J. Remote Sens. 2024, 45, 101–120. [Google Scholar] [CrossRef]
  42. Yao, M.; Zhang, Y.; Liu, G.; Pang, D. SSNet: A Novel Transformer and CNN Hybrid Network for Remote Sensing Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 3023–3037. [Google Scholar] [CrossRef]
  43. He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin Transformer Embedding UNet for Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  44. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  45. Johnson, B.A.; Ma, L. Image Segmentation and Object-Based Image Analysis for Environmental Monitoring: Recent Areas of Interest, Researchers’ Views on the Future Priorities. Remote Sens. 2020, 12, 1772. [Google Scholar] [CrossRef]
  46. Wei, S.; Luo, M.; Zhu, L.; Yang, Z. Using Object-Oriented Coupled Deep Learning Approach for Typical Object Inspection of Transmission Channel. Int. J. Appl. Earth Obs. Geoinformation 2023, 116, 103137. [Google Scholar] [CrossRef]
  47. Timilsina, S.; Aryal, J.; Kirkpatrick, J.B. Mapping Urban Tree Cover Changes Using Object-Based Convolution Neural Network (OB-CNN). Remote Sens. 2020, 12, 3017. [Google Scholar] [CrossRef]
  48. Guirado, E.; Blanco-Sacristán, J.; Rodríguez-Caballero, E.; Tabik, S.; Alcaraz-Segura, D.; Martínez-Valderrama, J.; Cabello, J. Mask R-CNN and OBIA Fusion Improves the Segmentation of Scattered Vegetation in Very High-Resolution Optical Sensors. Sensors 2021, 21, 320. [Google Scholar] [CrossRef]
  49. Luo, C.; Li, H.; Zhang, J.; Wang, Y. OBViT: A High-Resolution Remote Sensing Crop Classification Model Combining Obia and Vision Transformer. In Proceedings of the 2023 11th International Conference on Agro-Geoinformatics (Agro-Geoinformatics), Wuhan, China, 25–28 July 2023; IEEE: New York, NY, USA, 2023; pp. 1–6. [Google Scholar]
  50. Liu, T.; Abd-Elrahman, A. An Object-Based Image Analysis Method for Enhancing Classification of Land Covers Using Fully Convolutional Networks and Multi-View Images of Small Unmanned Aerial System. Remote Sens. 2018, 10, 457. [Google Scholar] [CrossRef]
  51. Liu, T.; Abd-Elrahman, A.; Morton, J.; Wilhelm, V.L. Comparing Fully Convolutional Networks, Random Forest, Support Vector Machine, and Patch-Based Deep Convolutional Neural Networks for Object-Based Wetland Mapping Using Images from Small Unmanned Aircraft System. GIScience Remote Sens. 2018, 55, 243–264. [Google Scholar] [CrossRef]
  52. Fu, T.; Ma, L.; Li, M.; Johnson, B.A. Using Convolutional Neural Network to Identify Irregular Segmentation Objects from Very High-Resolution Remote Sensing Imagery. J. Appl. Remote Sens. 2018, 12, 025010. [Google Scholar] [CrossRef]
  53. Abdollahi, A.; Pradhan, B.; Shukla, N. Road Extraction from High-Resolution Orthophoto Images Using Convolutional Neural Network. J. Indian Soc. Remote Sens. 2021, 49, 569–583. [Google Scholar] [CrossRef]
  54. Lam, O.H.Y.; Dogotari, M.; Prüm, M.; Vithlani, H.N.; Roers, C.; Melville, B.; Zimmer, F.; Becker, R. An Open Source Workflow for Weed Mapping in Native Grassland Using Unmanned Aerial Vehicle: Using Rumex Obtusifolius As A Case Study. Eur. J. Remote Sens. 2021, 54, 71–88. [Google Scholar] [CrossRef]
  55. Tang, Z.; Li, M.; Wang, X. Mapping Tea Plantations from VHR Images Using OBIA and Convolutional Neural Networks. Remote Sens. 2020, 12, 2935. [Google Scholar] [CrossRef]
  56. Hossain, M.D.; Chen, D. Target-Based Building Extraction from High-Resolution RGB Imagery Using GEOBIA Framework and Tabular Deep Learning Model. Geomatica 2024, 76, 100007. [Google Scholar] [CrossRef]
  57. Fu, Y.; Liu, K.; Shen, Z.; Deng, J.; Gan, M.; Liu, X.; Lu, D.; Wang, K. Mapping Impervious Surfaces in Town–Rural Transition Belts Using China’s GF-2 Imagery and Object-Based Deep CNNs. Remote Sens. 2019, 11, 280. [Google Scholar] [CrossRef]
  58. Li, X.; Xu, F.; Li, L.; Xu, N.; Liu, F.; Yuan, C.; Chen, Z.; Lyu, X. AAFormer: Attention-attended Transformer for Semantic Segmentation of Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  59. Wang, C.; He, S.; Wu, M.; Lam, S.-K.; Tiwari, P.; Gao, X. Looking Clearer with Text: A Hierarchical Context Blending Network for Occluded Person Re-Identification. IEEE Trans. Inf. Forensics Secur. 2025, 20, 4296–4307. [Google Scholar] [CrossRef]
  60. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  61. Wang, C.; Cao, R.; Wang, R. Learning Discriminative Topological Structure Information Representation for 2D Shape and Social Network Classification Via Persistent Homology. Knowl. Based Syst. 2025, 311, 113125. [Google Scholar] [CrossRef]
  62. Zeng, Q.; Zhou, J.; Tao, J.; Chen, L.; Niu, X.; Zhang, Y. Multiscale Global Context Network for Semantic Segmentation of High-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar] [CrossRef]
  63. Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The Efficient Transformer. arXiv 2020, arXiv:2001.04451. [Google Scholar] [CrossRef]
  64. Huang, L.; Yuan, Y.; Guo, J.; Zhang, C.; Chen, X.; Wang, J. Interlaced Sparse Self-Attention for Semantic Segmentation. arXiv 2019, arXiv:1907.12273. [Google Scholar] [CrossRef]
  65. Yin, P.; Zhang, D.; Han, W.; Li, J.; Cheng, J. High-Resolution Remote Sensing Image Semantic Segmentation via Multiscale Context and Linear Self-Attention. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 9174–9185. [Google Scholar] [CrossRef]
  66. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-Cross Attention for Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
  67. Wang, S.; Li, B.Z.; Khabsa, M.; Fang, H.; Ma, H. Linformer: Self-Attention with Linear Complexity. arXiv 2020, arXiv:2006.04768. [Google Scholar] [CrossRef]
  68. Tong, X.-Y.; Xia, G.-S.; Lu, Q.; Shen, H.; Li, S.; You, S.; Zhang, L. Land-Cover Classification with High-Resolution Remote Sensing Images Using Transferable Deep Models. Remote Sens. Environ. 2020, 237, 111322. [Google Scholar] [CrossRef]
  69. Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation. arXiv 2021, arXiv:2110.08733. [Google Scholar]
  70. Ge, Z. Yolox: Exceeding Yolo Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
  71. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Su, J.; Wang, L.; Atkinson, P.M. Multiattention Network for Semantic Segmentation of Fine-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–13. [Google Scholar] [CrossRef]
  72. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  73. Ma, A.; Wang, J.; Zhong, Y.; Zheng, Z. FactSeg: Foreground Activation-Driven Small Object Semantic Segmentation in Large-Scale Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–16. [Google Scholar] [CrossRef]
  74. Ding, L.; Tang, H.; Bruzzone, L. LANet: Local Attention Embedding to Improve the Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2021, 59, 426–435. [Google Scholar] [CrossRef]
  75. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364. [Google Scholar] [CrossRef]
  76. Sun, L.; Li, L.; Shao, Y.; Jiao, L.; Liu, X.; Chen, P.; Liu, F.; Yang, S.; Hou, B. Which Target to Focus on: Class-Perception for Semantic Segmentation of Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13. [Google Scholar] [CrossRef]
  77. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-Attention Mask Transformer for Universal Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
  78. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
  79. Zhang, X.; Weng, Z.; Zhu, P.; Han, X.; Zhu, J.; Jiao, L. ESDINet: Efficient Shallow-Deep Interaction Network for semantic Segmentation of High-Resolution Aerial Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  80. Wang, P.; Chen, P.; Yuan, Y.; Liu, D.; Huang, Z.; Hou, X.; Cottrell, G. Understanding Convolution for Semantic Segmentation. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; IEEE: New York, NY, USA, 2018; pp. 1451–1460. [Google Scholar]
  81. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual Explanations from Deep Networks Via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
Figure 1. Schematic diagram of the integrated framework of GEOBIA and deep learning for image classification. (a) Integration of GEOBIA with CNN-based segmentation results. (b) Object-based patch classification. (c) Feature extraction using deep learning models. Path (1) shows the use of DL models to extract deep abstract features from segment attributes for classification. Path (2) shows the combination of segment attributes with deep image features for classification. (d) Heterogeneous patch filtering.
Figure 2. Processing flow of the EIODA algorithm. (a) Original remote sensing image; (b) overlay of the EIODA segmentation result on the original image; (c) display of the segmented image, where the pixel values in the R, G, and B channels represent the spectral mean values of the segment.
Figure 3. The structure of SWTB. SWTB applies window-based and shifted window-based multi-head self-attention to capture local and global dependencies. Each attention layer is followed by a feed-forward network, with layer normalization and residual connections to enhance stability and feature representation.
Figure 4. The overall architecture of GOFENet.
Figure 5. Structure of the CSM. The input feature map is split into multiple channel groups and processed by parallel group convolutions with different kernel sizes. Channel attention is applied within each group, followed by concatenation and softmax fusion to integrate both intra-group and cross-group information.
Figure 6. Structure of the FEM. It enhances feature representations by integrating both local and global contextual information. Local spatial cues are captured via horizontal and vertical pooling, while global dependencies are modeled through a SWTB-guided channel attention mechanism. The final output is obtained through adaptive fusion of the two branches.
Figure 7. Structure of the SDFM. It adaptively fuses shallow features (SF) and deep features (DF) via a pixel attention module (PAM). DF is first compressed and added to SF to obtain the initial fused feature AF. In PAM, AF is refined through parallel channel and spatial attention branches to generate the attention map (AM). AF and AM are then fused to produce the pixel-level attention map (PM).
Figure 8. Percentage of each semantic label in the GID and the LoveDA dataset.
Figure 12. Examples of semantic segmentation results on the GID.
Figure 13. Visualization comparison of object contours and ground truth boxes identified by the current state-of-the-art methods and GOFENet on the GID.
Figure 16. Quantitative comparison of GOFENet variants on the GID (left) and the LoveDA dataset (right), showing mIoU, GFLOPs, and FPS. The figure illustrates how each model balances accuracy and efficiency, enabling flexible deployment under varying computational constraints.
Figure 17. Grad-CAM visualization results from the final classification layer of several models, showing the attention maps for the road class in the GID and the urban residential class in the LoveDA dataset. Regions with higher attention values, depicted by warmer colors, correspond to areas where the network demonstrates greater confidence in its classification.
Figure 18. Comparison of large-scale land cover classification accuracy after introducing SDFM across different models (a) on the GID, and (b) on the LoveDA dataset. Results from both datasets demonstrate that SDFM effectively improves the classification accuracy of large-scale and spatially continuous land cover types.
Table 4. Comparison of model complexity, speed, and performance on the GID, with complexity and speed evaluated using a 512 × 512 input.
| Method | Backbone | Complexity (GFLOPs) | mIoU (%) | Speed (FPS) |
|---|---|---|---|---|
| UNet [19] | ResNet50 | 184.57 | 59.73 | 65.56 |
| PSPNet [24] | ResNet50 | 118.44 | 60.65 | 89.24 |
| DeeplabV3+ [23] | ResNet50 | 153.28 | 60.37 | 77.31 |
| DANet [25] | ResNet50 | 69.70 | 60.99 | 91.45 |
| SegFormer [36] | MIT-B1 | 26.60 | 61.74 | 73.07 |
| DC-Swin [39] | Swin-Tiny | 100.93 | 61.08 | 46.17 |
| MANet [71] | ResNet50 | 157.02 | 64.69 | 65.05 |
| UNetFormer [44] | ResNet18 | 23.51 | 63.79 | 72.75 |
| GOFENet-t | ResNet18 | 112.98 | 63.29 | 74.75 |
| GOFENet-s | ResNet34 | 262.72 | 64.72 | 42.49 |
| GOFENet | ResNet50 | 510.68 | 66.02 | 16.72 |
Table 5. Comparison of model complexity, speed, and performance on the LoveDA dataset, with complexity and speed evaluated using a 1024 × 1024 input.
| Method | Backbone | Complexity (GFLOPs) | mIoU (%) | Speed (FPS) |
|---|---|---|---|---|
| UNet [19] | ResNet50 | 373.1 | 47.8 | 22.9 |
| PSPNet [24] * | ResNet50 | 105.7 | 48.3 | 52.2 |
| DeeplabV3+ [23] * | ResNet50 | 95.8 | 47.6 | 53.7 |
| Segmenter [35] * | ViT-Tiny | 26.8 | 47.1 | 14.7 |
| DANet [25] | ResNet50 | 278.8 | 50.2 | 39.7 |
| SegFormer [36] | MIT-B1 | 106.1 | 46.5 | 26.2 |
| MANet [71] | ResNet50 | 322.6 | 45.7 | 21.4 |
| BANet [40] * | ResT-Lite | 52.6 | 49.6 | 11.5 |
| DC-Swin [39] * | Swin-Tiny | 183.8 | 50.6 | 23.6 |
| TransUNet [78] * | ViT-R50 | 803.4 | 48.9 | 13.4 |
| SwinUperNet [30] * | Swin-Tiny | 349.1 | 50.0 | 19.5 |
| GOFENet-t | ResNet18 | 223.9 | 51.2 | 33.9 |
| GOFENet-s | ResNet34 | 525.4 | 52.0 | 18.3 |
| GOFENet | ResNet50 | 980.8 | 51.9 | 8.9 |
Note: the parameters of the models marked with an asterisk (*) are derived from reference [51].
Table 6. A comparative analysis of the segmentation performance between GOFENet and simplified models.
| Model | Backbone | LoveDA: Background | Building | Road | Water | Barren | Forest | Agriculture | mIoU (%) | GID: mPA (%) | OA (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GOFENet-t | ResNet18 | 45.20 | 52.62 | 56.15 | 78.28 | 20.86 | 43.62 | 61.91 | 51.24 | 75.38 | 82.35 | 63.29 |
| GOFENet-s | ResNet34 | 44.31 | 55.05 | 56.43 | 79.04 | 21.36 | 45.13 | 62.82 | 52.02 | 76.40 | 83.38 | 64.72 |
| GOFENet | ResNet50 | 45.20 | 55.63 | 56.75 | 78.13 | 19.28 | 44.88 | 63.60 | 51.92 | 77.78 | 84.07 | 66.02 |
Table 7. Object size statistics and algorithm execution time on two datasets.
| Dataset | Split | Obj. Min | Obj. Max | Obj. Mean | EIODA Average Execution Time |
|---|---|---|---|---|---|
| GID | Train | 2 | 2713 | 1060 | 4.35 s/image |
| GID | Val | 5 | 2655 | 1100 | |
| GID | Test | 3 | 2633 | 1080 | |
| LoveDA | Train | 1 | 1738 | 623 | 1.81 s/image |
| LoveDA | Val | 1 | 1461 | 606 | |
| LoveDA | Test | 1 | 1467 | 584 | |