Article

MarsTerrNet: A U-Shaped Dual-Backbone Framework with Feature-Guided Loss for Martian Terrain Segmentation

1
Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences, Beijing 100094, China
2
University of Chinese Academy of Sciences, Beijing 100084, China
3
State Key Laboratory of Lithospheric and Environmental Coevolution, Institute of Geology and Geophysics, Chinese Academy of Sciences, Beijing 100029, China
4
Institute of Geological Survey, China University of Geosciences, Wuhan 430074, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(1), 35; https://doi.org/10.3390/rs18010035
Submission received: 30 October 2025 / Revised: 5 December 2025 / Accepted: 10 December 2025 / Published: 23 December 2025

Highlights

What are the main findings?
  • A dual-backbone deep learning framework (MarsTerrNet) combining PRB and Swin Transformer achieves superior robustness and accuracy in Martian terrain segmentation.
  • A feature-guided loss is introduced to encode geological relations and reduce confusion among visually similar terrain types.
What are the implications of the main findings?
  • The proposed framework enhances geological interpretation of Martian surfaces and improves the consistency of terrain segmentation results.
  • The method supports future planetary surface analysis and autonomous rover navigation on Mars.

Abstract

Accurate terrain perception is essential for safe rover operations and reliable geotechnical interpretation of Martian surfaces. The heterogeneous scales, colors, and textures of Martian terrain present significant challenges for semantic segmentation. We present MarsTerrNet, a dual-backbone segmentation framework that combines Progressive Residual Blocks (PRB) with a Swin Transformer to jointly capture fine-grained local details and global contextual dependencies. To further enhance discrimination among geologically correlated classes, we design a feature-guided loss that aligns representative features across terrain categories and reduces confusion between visually similar but physically distinct types. For comprehensive evaluation, we establish MarsTerr2024, an extended dataset derived from the Curiosity rover, providing diverse geological scenes for terrain understanding. Experimental results show that MarsTerrNet achieves state-of-the-art performance and produces geologically consistent segmentation results, supporting automated mapping and geotechnical assessment for future Mars exploration missions.

1. Introduction

1.1. Background on Martian Terrain Perception

The exploration of Mars is critical for understanding planetary evolution and provides valuable insights into Earth’s geological history, sedimentary processes, and near-surface stability under extreme environments. To date, ten successful soft-landing missions have enabled direct, in situ analysis of the surface, yielding detailed data on composition and geomorphology that are unattainable from orbit. However, the Martian surface is dominated by rocks of diverse sizes, shapes, and colors, resulting from prolonged meteorite impacts and eolian processes [1]. Furthermore, frequent dust storms [2] deposit fine dust that masks the terrain’s inherent colors, textures, and structural features [3,4]. These combined factors obscure terrain boundaries and complicate classification, posing significant challenges for rover-based terrain analysis and environmental perception.
Consequently, achieving high-precision terrain understanding under these conditions is essential for ensuring rover safety, supporting long-term exploration, and securing valuable data. Accurate terrain perception is therefore fundamental for robust rover mobility and comprehensive geotechnical assessment. As illustrated in Figure 1, accurate segmentation of a region depends not only on its consistency with homogeneous areas of the same class (e.g., the blue boxes distributed across the image) but also on its relationships with adjacent heterogeneous regions (other color boxes). Therefore, a model that can simultaneously capture fine-grained local details and long-range contextual relationships is required to achieve precise terrain segmentation under such Martian surface conditions.

1.2. Mars Semantic Segmentation Methods

During the early Mars rover missions, NASA equipped the Curiosity rover with an intelligent sensing system and introduced Rockster [5], an autonomous perception algorithm that segments rocks and localizes subsequent targets by detecting and regrouping edges. This system was later deployed on the Opportunity rover. While such traditional methods represented an important step toward automated rover perception, they soon revealed limitations in handling ambiguous and low-contrast terrain boundaries. The rapid advancement of deep learning has greatly expanded the use of convolutional neural networks (CNNs) across computer vision tasks. Among them, semantic segmentation, the pixel-wise classification of images, has garnered significant attention for enabling fine-grained scene understanding. Semantic segmentation also serves as a foundational component within Mars rover perception systems, enabling long-term operational safety, improving navigational robustness, and maximizing scientific return through precise terrain interpretation. For instance, ref. [6] utilized transfer learning with a VGG-16 architecture to classify four types of Martian rocks. Ref. [7] evaluated six CNN models for automatically detecting and delineating Martian boulders with tracks in HiRISE imagery, demonstrating the robust cross-domain generalization capability of CNNs. Further advancing this trend, ref. [8] employed the Faster R-CNN architecture, enabling effective end-to-end training that adapts to diverse scenarios without increasing model complexity. Despite the constrained understanding of the Martian environment, CNN-based methods have demonstrated promising performance on existing datasets. Semantic segmentation requires both high-level semantic consistency and low-level spatial accuracy, necessitating effective multi-scale feature fusion [9,10]. Architectures like U-Net [11] address this by propagating shallow features from the encoder to the decoder via skip connections, concatenating them with deeper features to achieve cross-layer fusion. Owing to its effective balance between capturing fine details and modeling contextual relationships, U-Net has been widely adopted in Martian terrain studies [12,13,14].
However, the inherent locality of convolution operations limits CNNs in modeling long-range dependencies. To overcome this, the Transformer architecture [15], renowned in natural language processing, was introduced to computer vision [16,17]. In contrast to CNNs, Transformers can be less effective at encoding low-level local features, which may impair the model’s ability to discern ambiguous terrain boundaries [18,19]. This is particularly critical in the unstructured environment of Mars.

1.3. Hybrid CNN–Transformer Architectures for Mars Segmentation

Hybrid architectures that jointly integrate the strengths of CNNs and Transformers provide a promising solution for precise Martian terrain segmentation. Exemplifying this approach, Liu et al. [20] proposed the HASS network, which augments the HRNet-W48 backbone with two complementary branches: a global intra-class attention branch to capture consistency among all homogeneous pixels and a local inter-class attention branch to model relationships among neighboring heterogeneous pixels. Ref. [21] proposed MarsNet, a CNN–Transformer framework featuring a VGG16-based encoder followed by a transformer module. A hybrid dilated convolution layer is applied before the decoder to further expand the receptive field, thereby improving the detection and delineation of large rocks. MarsNet also contributed the TWMARS dataset, derived from images taken by the ZhuRong rover’s Navigation and Terrain Cameras (NaTeCam).
To prevent CNN-derived local features from being overwhelmed by subsequent Transformer modules, recent studies have introduced refined fusion mechanisms. RockFormer [22] incorporates a feature refining module between the encoder and decoder. This module suppresses negative features that degrade segmentation while capturing global inter-scale dependencies to produce more robust multi-scale rock features. RockFormer also released two segmentation datasets: MarsData-V2 and SynMars, the latter being synthesized from the virtual terrain of the TianWen-1 landing zone. Building on this, MarsFormer [23] proposes a Feature Enhancement Module that decomposes encoder features into spatial and channel components, suppressing redundancies while accentuating salient local details before they are processed by a window transformer block for global context modeling via window self-attention and cross-window interaction. Ref. [24] proposed RockSeg, which employs a multi-scale low-level feature fusion module that integrates information from the ResNet-T backbone into the output feature map to improve segmentation quality. Similarly, EDR-TransUnet [25] employs an enhanced dual relation-aware attention mechanism to capture channel-wise and positional dependencies, mitigating the degradation of local details. EDR modules are strategically inserted between CNN and Transformer blocks in the encoder and within skip connections, where they effectively fuse multi-scale local and global information before transmitting it to the decoder, thereby enhancing overall feature expressiveness and improving the segmentation of boundaries and multi-scale structures.
Concurrently, several studies have focused on improving model efficiency without compromising the quality of multi-scale feature extraction and fusion. Ref. [26] adopted MobileViT as a lightweight backbone encoder for multi-scale information extraction, combined with an effective layer aggregation decoder. Likewise, ref. [27] proposed Light4Mars, a lightweight framework that replaces standard transformer blocks with squeeze window transformers in the encoder and employs an aggregate local attention module within the decoder to maintain accuracy while reducing complexity. The efficacy of Light4Mars was validated on the MarsScapes [28] and SynMars-TW datasets. Taken together, these works outline a clear evolution from performance-driven hybrid CNN–Transformer designs to efficiency-oriented variants. The former typically rely on heavy backbones and sophisticated fusion modules, whereas the latter deliberately reduce complexity while preserving multi-scale feature quality. This trend also reflects the practical goal of making such models suitable for future on-board deployment on Mars rovers, which we further discuss as part of our future work in Section 4. A concise comparison of these methods and their design focuses is summarized in Table 1.

1.4. Summary of Limitations and Our Contributions

Despite recent progress, hybrid architectures still face fundamental limitations. Most existing methods adopt one of two integration schemes: (1) a sequential design, where local features from CNNs are fed into subsequent Transformer modules, or (2) a replacement design, where Transformer blocks substitute convolutional blocks within U-Net–style encoders. However, sequential coupling often leads to Transformer dominance in representation, which can suppress the fine-grained spatial details initially captured by CNNs. In contrast, replacement designs tend to weaken the model’s capacity to preserve low-level structural cues that are critical for delineating subtle features, such as terrain boundaries. Consequently, neither strategy fully leverages the complementary strengths of CNNs and Transformers for joint local–global modeling.
Beyond architectural design, current approaches often treat terrain categories as independent labels, disregarding the inherent structural dependencies among Martian surface materials. In reality, the nine representative Martian terrain types exhibit distinct spatial distributions yet strong inter-class correlations. For instance, Martian Soil, Sands, and Gravel are surface deposits forming a granulometric continuum, from fine-grained Martian Soil with limited mobility, through wind-rippled Sands, to coarse Gravel, resulting in ambiguous boundaries among them. Similarly, Bedrock and Rocks are both lithic units but differ in morphology and spatial configuration: Bedrock forms flat, continuous substrates, whereas Rocks occur as protruding blocks, often leading to confusion at multiple scales. Shadows, generated by illumination effects, are spatially linked with Rocks and therefore difficult to distinguish. In contrast, Tracks, Background, and Unknown are relatively isolated, displaying clearer spatial separations and thus easier to recognize within the terrain mosaic.
These intricate relationships pose a fundamental challenge for Martian terrain segmentation. Failure to explicitly model such dependencies increases the risk of misclassification among visually similar categories. Moreover, these ambiguities are not merely visual artifacts but reflect intrinsic geological continua across sedimentary and lithic units, highlighting the need for segmentation methods that can capture both geological structure and surface appearance with higher consistency.
From a data perspective, most existing annotated datasets are confined to limited rover traverse regions. This limitation results in poor representation of key lithologies, such as newly observed sulfate-rich units. A systematic review of the datasets used in related work is provided in Section 3.1. On the modeling front, while recent research has explored efficient hybrid architectures, a critical gap remains: no current approach successfully integrates geological consistency, representation stability, and computational efficiency under the stringent constraints of on-board deployment. We expand on these challenges in Section 4.
To address these challenges, we develop MarsTerrNet, a dual-backbone U-shaped framework that integrates convolutional and Transformer-based representations to jointly capture local spatial detail and global contextual dependencies. A feature-guided loss is introduced to encode inter-class relationships and reduce ambiguity between geologically correlated terrains, such as Martian Soil–Sands–Gravel and Bedrock–Rocks transitions. In addition, we construct MarsTerr2024, an extended dataset including newly observed sulfate-rich lithologies from the Curiosity rover, providing a comprehensive benchmark for planetary geotechnical analysis. Experimental results show that MarsTerrNet achieves geologically consistent and reliable terrain segmentation, facilitating accurate surface mapping and supporting geotechnical interpretation in future Mars exploration missions.
Our main contributions are as follows:
  • We design a dual-backbone architecture, named MarsTerrNet, which integrates CNN-based backbone and Swin Transformer to jointly capture fine-grained color, texture, and structural cues, while simultaneously modeling global contextual relationships.
  • We propose a feature-guided loss function that explicitly encourages the model to capture structured dependencies among terrains, thereby enhancing the separability of highly correlated terrain classes and improving prediction robustness and efficiency.
  • We introduce MarsTerr2024, an extended dataset to support model training. Experimental results demonstrate that our model achieves best performance on this dataset, confirming its effectiveness and reliability for Martian terrain segmentation.

2. Method

In this section, we first illustrate the general framework of MarsTerrNet, and then introduce the PRB, the Swin Transformer block, and the feature-guided loss.

2.1. Overall Framework

We propose MarsTerrNet, a dual-backbone segmentation framework that integrates CNN-based backbone with Swin Transformer, as illustrated in Figure 2. The framework is designed to jointly capture fine-grained spatial details and long-range contextual dependencies, ensuring effective fusion of both information levels for terrain segmentation and geological interpretation. Specifically, the CNN-based backbone extracts detailed spatial patterns such as rock edges and stratified textures, while the Swin Transformer branch employs window-based self-attention to efficiently model global contextual relationships and resolve ambiguous terrain boundaries across varying geological units.
Figure 2. Framework of MarsTerrNet. An input terrain image is processed in parallel by CNN-based backbone (red) and Swin Transformer backbone (green). Multi-scale features are fused through skip connections and progressively decoded (purple) to generate the final terrain segmentation map. Encoder features $S_l^{I}$ and decoder features $S_l^{O}$ are the multi-scale features that are aggregated by the feature-guided loss in Figure 3.
Figure 3. Illustration of the process of prototype construction, feature aggregation, and alignment. On the left, class prototypes $p_i$ are shaped under the guidance of relation weights. The top-right shows the aggregation of multi-level encoder–decoder features into a unified representation $f$. At the bottom-right, representative vectors $v_i$ derived from $f$ are combined with prototypes $p_i$ to form a feature bank, which is aligned through $L_{hal}$.
Features from both backbones are integrated and propagated through a symmetric decoder, where multi-scale information is progressively recovered and refined. At each decoding stage, the feature map from the preceding layer first undergoes a recalibration step via a 1 × 1 convolution, Batch Normalization, and ReLU activation, which adjusts the channel-wise distribution and stabilizes the feature representation. The output is then upsampled and concatenated with the corresponding skip connections from both the CNN and Swin Transformer encoders. A subsequent 3 × 3 convolution, followed by Batch Normalization and ReLU, refines the aggregated features to enhance spatial coherence and recover fine-grained structural details. Through this coarse-to-fine workflow, each decoder stage incrementally reconstructs spatial resolution, enriches textural cues, and preserves boundary integrity, producing segmentation results with clear terrain delineation and geologically consistent surface structures suitable for analysis on Mars.
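To make this decoding step concrete, the following is a minimal PyTorch sketch of one decoder stage under the description above; the channel widths and the bilinear upsampling mode are illustrative assumptions rather than details taken from the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    """One MarsTerrNet-style decoder stage: recalibrate, upsample, fuse skips, refine."""

    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # Recalibration: 1x1 conv + BN + ReLU adjusts the channel-wise distribution.
        self.recalibrate = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        # Refinement: 3x3 conv + BN + ReLU after concatenating both encoder skips.
        self.refine = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip_cnn, skip_swin):
        # skip_ch must equal the channel count of skip_cnn plus that of skip_swin.
        x = self.recalibrate(x)
        x = F.interpolate(x, size=skip_cnn.shape[-2:], mode="bilinear", align_corners=False)
        x = torch.cat([x, skip_cnn, skip_swin], dim=1)
        return self.refine(x)
```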
To further enhance feature discriminability, we introduce the feature-guided loss function, which jointly employs a Relation-Aware Prototype Regularization ($L_{rpl}$), a Hierarchical Feature Aggregation loss ($L_{hal}$), and a Boundary-Weighted Dice loss ($L_{bdice}$). These components collectively guide the network optimization from three complementary perspectives: enlarging inter-class margins by leveraging terrain relationships, enforcing consistent alignment between multi-scale features and their class prototypes, and prioritizing the segmentation accuracy of ambiguous boundary regions. The synergy between this tailored loss design and the dual-branch architecture empowers MarsTerrNet to achieve accurate and geologically plausible segmentation of Martian terrain.

2.2. CNN-Based Backbone

To capture the fine details of Martian terrain, we construct a convolutional encoder using a series of PRBs. Each PRB integrates local detail extraction with progressive receptive field expansion within a residual framework. As illustrated in Figure 2, the input feature map is processed through two parallel paths. The first branch begins with a 3 × 3 convolution, where the dilation factor increases gradually across the encoder depth and is set to [1, 2, 3, 4, 5] for the five blocks. This progressive dilation scheme enables the network to shift its focus from local edges and rock boundaries in shallow layers to broader geological structures in deeper layers. The second branch applies a 1 × 1 convolution as a lightweight channel transformation, keeping the feature representation compact while enhancing its expressiveness. Formally, the two branches can be written as:
$$u = \mathrm{ReLU}\bigl(\mathrm{BN}\bigl(\mathrm{Conv}_{3\times 3,\,d}(x)\bigr)\bigr)$$
$$v = \mathrm{GELU}\bigl(\mathrm{BN}\bigl(\mathrm{Conv}_{1\times 1}(x)\bigr)\bigr)$$
The outputs of these two branches are then combined with the residual path and further processed by a depthwise convolution, whose kernel size also grows with depth [5,5,7,7,9]. The output of the block is given by:
$$y = \mathrm{SiLU}\bigl(\mathrm{BN}\bigl(\mathrm{DWConv}_{k\times k}(u + v + x)\bigr)\bigr)$$
This design enhances spatial coherence and amplifies the representation of stratified structures. The residual addition of the input stabilizes gradient flow and facilitates multi-scale feature integration across the network depth. Placing ReLU on the dilated 3 × 3 path yields robust edge responses with stable gradients, GELU on the 1 × 1 path gives smooth channel gating for feature mixing, and SiLU after depthwise filtering behaves like soft spatial gating that strengthens coherent structures. By stacking five PRB modules with progressively increasing dilation rates and kernel sizes, the encoder builds a hierarchical representation of terrain features, enabling a smooth transition from fine textures to large-scale geological formations. This progressive design ensures that the CNN-based backbone captures both edge-level precision and structural details, thereby improving its ability to discriminate between closely related terrain types.
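A minimal PyTorch sketch of a single PRB under the three equations above follows; per-stage channel changes and any downsampling between blocks are omitted, as these details are not given in the text.

```python
import torch.nn as nn

class ProgressiveResidualBlock(nn.Module):
    """PRB sketch: dilated 3x3 branch (u), 1x1 branch (v), residual fusion via DWConv."""

    def __init__(self, channels, dilation, dw_kernel):
        super().__init__()
        # u = ReLU(BN(Conv_{3x3,d}(x))): edge-sensitive branch with growing dilation.
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # v = GELU(BN(Conv_{1x1}(x))): lightweight channel gating.
        self.branch1 = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.GELU(),
        )
        # y = SiLU(BN(DWConv_{kxk}(u + v + x))): depthwise fusion, kernel grows with depth.
        self.fuse = nn.Sequential(
            nn.Conv2d(channels, channels, dw_kernel, padding=dw_kernel // 2,
                      groups=channels, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(inplace=True),
        )

    def forward(self, x):
        return self.fuse(self.branch3(x) + self.branch1(x) + x)

# Five stacked PRBs would use dilations [1, 2, 3, 4, 5] and kernels [5, 5, 7, 7, 9].
```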

2.3. Swin Transformer Backbone

The second branch of our dual-backbone architecture is built upon the Swin Transformer [29], a design particularly suited for terrain analysis where the integration of subtle local cues and broad spatial dependencies is crucial. While terrain categories may appear similar at a global scale, their discrimination often hinges on fine-grained indicators such as textural variations and boundary definitions. The Swin Transformer, with its hierarchical windowing mechanism, provides an effective solution for capturing these details without sacrificing global contextual understanding.
This branch employs the standard Swin Transformer block design, which alternates between two key operations: Window-based Multi-head Self-Attention (W-MSA) and Shifted Window-based Multi-head Self-Attention (SW-MSA). Compared with the conventional global self-attention used in Vision Transformer, W-MSA restricts attention computation to non-overlapping local windows. This local constraint not only improves sensitivity to fine-grained features but also reduces the quadratic computational cost associated with global attention. To complement this locality, the shifted window design shifts partition boundaries between consecutive layers, thereby enabling cross-window information flow. Through this mechanism, the model effectively integrates long-range dependencies while maintaining the efficiency of localized attention.
In our implementation, each local window contains M × M patches, with M set to 8. This configuration ensures that the input feature map dimensions are divisible by the window size, avoiding the need for cumbersome padding. The formulations of the W-Trans and SW-Trans blocks are as follows:
$$\hat{s}^{\,l} = \text{W-MSA}\bigl(\mathrm{LN}(s^{\,l-1})\bigr) + s^{\,l-1}$$
$$s^{\,l} = \mathrm{MLP}\bigl(\mathrm{LN}(\hat{s}^{\,l})\bigr) + \hat{s}^{\,l}$$
$$\hat{s}^{\,l+1} = \text{SW-MSA}\bigl(\mathrm{LN}(s^{\,l})\bigr) + s^{\,l}$$
$$s^{\,l+1} = \mathrm{MLP}\bigl(\mathrm{LN}(\hat{s}^{\,l+1})\bigr) + \hat{s}^{\,l+1}$$
where $s^{\,l}$ represents the output feature of the W-Trans block and $s^{\,l+1}$ represents the output feature of the SW-Trans block.
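The compact PyTorch sketch below mirrors this block pair; to stay short it uses nn.MultiheadAttention inside fixed windows and omits the attention mask for shifted windows and the relative position bias of the full Swin design, so it is an illustrative approximation rather than the exact implementation.

```python
import torch
import torch.nn as nn

def window_partition(x, M):
    # (B, H, W, C) -> (num_windows*B, M*M, C); H and W must be divisible by M.
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, M * M, C)

def window_reverse(w, M, B, H, W):
    # Inverse of window_partition.
    x = w.view(B, H // M, W // M, M, M, -1).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(B, H, W, -1)

class SwinBlockPair(nn.Module):
    """One W-MSA block followed by one SW-MSA block, matching the four equations."""

    def __init__(self, dim, heads, M=8):
        super().__init__()
        self.M = M
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(2)])
        self.mlps = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(2)])

    def _msa(self, x, idx, shift):
        B, H, W, _ = x.shape
        s = x if shift == 0 else torch.roll(x, (-shift, -shift), dims=(1, 2))
        w = window_partition(self.norms[2 * idx](s), self.M)
        w, _ = self.attn[idx](w, w, w)          # self-attention within each window
        s = window_reverse(w, self.M, B, H, W)
        s = s if shift == 0 else torch.roll(s, (shift, shift), dims=(1, 2))
        return x + s                            # residual connection

    def forward(self, x):                       # x: (B, H, W, C)
        x = self._msa(x, 0, shift=0)            # W-MSA on regular windows
        x = x + self.mlps[0](self.norms[1](x))  # MLP + residual
        x = self._msa(x, 1, shift=self.M // 2)  # SW-MSA on shifted windows
        return x + self.mlps[1](self.norms[3](x))
```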

2.4. Feature-Guided Loss Function

The segmentation of Martian terrain is complicated by the inherent relational structure among its semantic categories. As previously outlined, Martian Soil, Sands, and Gravel exist along a granulometric continuum with diffuse transitions; Bedrock and Rocks, despite their shared lithic origin, are distinguished by their morphology and spatial configuration; and Shadows exhibit a strong spatial correlation with Rocks due to illumination geometry. In contrast, categories such as Tracks, Background, and Unknown are more semantically distinct. To explicitly model these structured interdependencies, we propose a feature-guided loss function composed of three complementary terms: $L_{rpl}$, $L_{hal}$, and $L_{bdice}$. Rather than enforcing a uniform separation across all classes, this loss supervises the network progressively at each layer, guiding it to exploit the constructed inter-class relationships for more targeted discrimination. This approach enhances the model’s sensitivity to subtle distinctions between closely related terrain types, without enforcing an artificial and rigid separation across all nine categories.
We first construct a set of prototype vectors $\{p_i\}_{i=1}^{C}$ representing the semantic centers of the $C = 9$ terrain classes. Instead of enforcing uniform separation on a hypersphere as in conventional contrastive learning, we introduce $L_{rpl}$, which embeds geological priors into the optimization. Pairs of classes that form natural continua (e.g., Martian Soil–Sands–Gravel) are softly penalized, while pairs prone to confusion (e.g., Rocks–Shadows) receive stronger separation constraints. $L_{rpl}$ is defined as:
$$L_{rpl} = \frac{1}{C^{2}} \sum_{i=1}^{C} \sum_{j=1}^{C} w_{ij} \cdot \frac{p_i^{\top} p_j}{\|p_i\|_2 \, \|p_j\|_2},$$
where $w_{ij}$ denotes the relation weight between class $i$ and class $j$. The relation matrix $W = [w_{ij}]$ is implemented as a set of learnable parameters and is updated jointly with the network via back-propagation. For initialization, we adopt a simple three-level scheme: Martian Soil–Sands–Gravel pairs are set to 1.0, Bedrock–Rocks and Shadow–Rocks pairs to 0.7, Background, Tracks, and Unknown categories to 0.2 against all other classes, and the remaining pairs to 0.4. During training, all entries in $W$ are optimized, enabling adaptive adjustment of inter-class margins while preserving the relative ordering imposed by the initialization. Minimizing $L_{rpl}$ enlarges inter-class margins in a relation-aware manner rather than enforcing uniform dispersion.
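A PyTorch sketch of $L_{rpl}$ and the stated initialization follows; the class index order and prototype dimensionality are hypothetical choices for illustration, and the mechanism that preserves the relative ordering of the learnable weights during training is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

CLASSES = ["MartianSoil", "Sands", "Gravel", "Bedrock", "Rocks",
           "Shadows", "Tracks", "Background", "Unknown"]

def init_relation_matrix():
    C = len(CLASSES)
    W = torch.full((C, C), 0.4)                 # default for remaining pairs
    for i, j in [(0, 1), (0, 2), (1, 2)]:       # Soil-Sands-Gravel continuum
        W[i, j] = W[j, i] = 1.0
    for i, j in [(3, 4), (5, 4)]:               # Bedrock-Rocks, Shadows-Rocks
        W[i, j] = W[j, i] = 0.7
    for k in (6, 7, 8):                         # Tracks, Background, Unknown
        W[k, :] = 0.2
        W[:, k] = 0.2
    return W

class RelationPrototypeLoss(nn.Module):
    def __init__(self, dim=256):                # prototype dimension is assumed
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(len(CLASSES), dim))
        self.W = nn.Parameter(init_relation_matrix())  # learnable relation weights

    def forward(self):
        p = F.normalize(self.prototypes, dim=1)
        cos = p @ p.t()                         # pairwise cosine similarities
        return (self.W * cos).mean()            # (1/C^2) * sum_ij w_ij * cos_ij
```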
To strengthen feature–prototype alignment, we further introduce $L_{hal}$. Features from three shallow stages of the encoder and decoder, denoted as $S_l^{I}$ and $S_l^{O}$, are aggregated with learnable weights $\alpha_l$:
$$f = \sum_{l=1}^{3} \alpha_l \, \bigl(S_l^{I} + S_l^{O}\bigr), \qquad \sum_{l=1}^{3} \alpha_l = 1.$$
The aggregated feature $f$ serves as the source for constructing a feature bank. We partition $f$ into regions according to prediction confidence and average each bin to obtain representative vectors $v_i$. We set $J = 27$, corresponding to three confidence bins per class (high, medium, and low confidence) for the nine terrain categories. This three-level partition reflects the typical spatial organization of Martian terrain. Interior regions give high-confidence predictions. Transition zones yield medium-confidence outputs. Boundary regions, where terrain types mix or illumination changes, produce low confidence. Using high/medium/low bins allows the feature bank to represent these distinct areas without over-fragmenting the feature space. In this way, every $v_i$ is directly derived from $f$, ensuring that the aggregated information contributes to prototype alignment. These vectors are then aligned with their corresponding class prototypes $p_i$, and $L_{hal}$ is defined as:
$$L_{hal} = \sum_{i=1}^{J} \left[ \|v_i - p_i\|_2^{2} + \left(1 - \frac{v_i^{\top} p_i}{\|v_i\|_2 \, \|p_i\|_2}\right) \right].$$
This design allows different encoder–decoder layers to contribute adaptively, enhancing the robustness of feature separation across scales.
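A sketch of $L_{hal}$ is shown below, assuming the three stage features have been projected to a common shape and that confidence cut-offs of 0.5 and 0.8 define the low/medium/high bins (the paper does not specify the thresholds); the prototypes are shared with the $L_{rpl}$ sketch above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAlignmentLoss(nn.Module):
    def __init__(self, num_stages=3, num_classes=9, bins=(0.5, 0.8)):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_stages))   # parameterizes alpha_l
        self.register_buffer("bins", torch.tensor(bins))      # assumed confidence cut-offs
        self.num_classes = num_classes

    def forward(self, enc_feats, dec_feats, probs, prototypes):
        # enc_feats, dec_feats: lists of three (B, C, H, W) maps at a common shape.
        # probs: (B, num_classes, H, W) softmax output; prototypes: (num_classes, C).
        alphas = torch.softmax(self.logits, dim=0)            # sum alpha_l = 1
        f = sum(a * (si + so) for a, si, so in zip(alphas, enc_feats, dec_feats))
        conf, pred = probs.max(dim=1)                         # per-pixel confidence/class
        bin_idx = torch.bucketize(conf, self.bins)            # 0 = low, 1 = medium, 2 = high
        feats = f.permute(0, 2, 3, 1)                         # (B, H, W, C)
        loss = f.new_zeros(())
        for c in range(self.num_classes):                     # J = 9 classes x 3 bins = 27
            for b in range(3):
                mask = (pred == c) & (bin_idx == b)
                if mask.any():
                    v = feats[mask].mean(dim=0)               # representative vector v_i
                    p = prototypes[c]
                    loss = loss + (v - p).pow(2).sum() \
                                + (1 - F.cosine_similarity(v, p, dim=0))
        return loss                                           # empty bins are skipped
```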
Finally, to address boundary regions that are inherently ambiguous in Martian scenes, we adopt $L_{bdice}$. Rather than assigning uniform weights to all pixels, this loss highlights uncertain boundary areas through a spatial weight map $w(x)$:
$$L_{bdice} = \sum_{c=1}^{C} \left( 1 - \frac{2 \sum_{x} w(x) \, P_c(x) \, G_c(x)}{\sum_{x} w(x) \, \bigl(P_c(x) + G_c(x)\bigr)} \right),$$
where $P_c$ and $G_c$ represent the predicted and ground truth masks for class $c$. The spatial weights $w(x)$ are derived from ground truth boundaries via distance transforms, defined as $w(x) = 1 + \beta \exp(-D(x)/\sigma)$, where $D(x)$ is the distance to the nearest boundary, $\beta$ regulates the strength of emphasis, and $\sigma$ controls the decay range. Given that the network already integrates the PRB backbone and a Swin Transformer backbone with local window mechanisms, both of which enhance sensitivity to fine-grained structures including boundaries, we set $\beta = 2$ as a moderate value. This configuration amplifies boundary-pixel contributions without overwhelming interior regions, yielding stable and balanced supervision during training.
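The sketch below computes $w(x)$ with a Euclidean distance transform and evaluates the weighted Dice term; $\sigma = 5$ pixels is an assumed value, since the paper fixes only $\beta = 2$.

```python
import numpy as np
import torch
from scipy.ndimage import distance_transform_edt

def boundary_weights(gt, beta=2.0, sigma=5.0):
    # gt: (H, W) integer label map. A pixel is a boundary pixel if its label
    # differs from a 4-neighbour; D(x) is the distance to the nearest such pixel.
    boundary = np.zeros(gt.shape, dtype=bool)
    vdiff = gt[:-1, :] != gt[1:, :]
    hdiff = gt[:, :-1] != gt[:, 1:]
    boundary[:-1, :] |= vdiff
    boundary[1:, :] |= vdiff
    boundary[:, :-1] |= hdiff
    boundary[:, 1:] |= hdiff
    dist = distance_transform_edt(~boundary)         # D(x), zero on boundaries
    w = 1.0 + beta * np.exp(-dist / sigma)           # w(x) = 1 + beta * exp(-D/sigma)
    return torch.from_numpy(w).float()

def boundary_weighted_dice(probs, onehot, w, eps=1e-6):
    # probs, onehot: (C, H, W) predictions and ground truth; w: (H, W) weight map.
    inter = (w * probs * onehot).sum(dim=(1, 2))
    denom = (w * (probs + onehot)).sum(dim=(1, 2))
    return (1 - 2 * inter / (denom + eps)).sum()     # summed over classes
```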
The overall objective is a weighted sum of the three losses:
$$L_{all} = \lambda_{rpl} L_{rpl} + \lambda_{hal} L_{hal} + \lambda_{bdice} L_{bdice},$$
with $\lambda_{rpl} = 1.0$, $\lambda_{hal} = 0.5$, and $\lambda_{bdice} = 0.5$. The coefficients $\lambda_{rpl}$, $\lambda_{hal}$, and $\lambda_{bdice}$ are treated as hyperparameters. The weights were tuned on the validation set within the range [0, 1]. The configuration (1.0, 0.5, 0.5) was selected for providing the most balanced improvement in both overall segmentation accuracy and boundary quality. A detailed sensitivity analysis of different weight configurations is provided in Section 3.3.
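Tying the three sketches together, the combined objective reduces to a weighted sum with the tuned coefficients:

```python
def total_loss(l_rpl, l_hal, l_bdice, lams=(1.0, 0.5, 0.5)):
    # L_all = lambda_rpl * L_rpl + lambda_hal * L_hal + lambda_bdice * L_bdice
    return lams[0] * l_rpl + lams[1] * l_hal + lams[2] * l_bdice
```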

3. Results

3.1. Datasets and Experimental Setting

Several Mars terrain datasets, comprising both real and synthetic imagery, have been made publicly available. To ensure our model is trained on authentic Martian conditions, we utilize only real data for training and validation. Table 2 presents the statistics of some of the available real datasets. Mars32K is a dataset released in 2018 consisting of 32,000 images collected by the Curiosity rover from August 2012 to November 2018 without semantic segmentation labels, where each image has a resolution of 560 × 500. However, Mars32K is currently inaccessible. AI4Mars [30], developed by NASA’s Jet Propulsion Laboratory (JPL), is the first large-scale dataset designed for training and validating Martian terrain classification models. It aggregates images from the Mars Science Laboratory (MSL) and Mars Exploration Rovers (MER) missions. The MSL subset of AI4Mars includes 17,030 images from the Curiosity rover, alongside data from the Spirit and Opportunity rovers, all at a resolution of 1024 × 1024 and annotated with four categories: Soil, Bedrock, Sand, and Big Rock. The MSL-Seg dataset [28] is derived from AI4Mars-MSL by downsampling images via bilinear interpolation to a resolution of 560 × 500, resulting in 4155 images. It expands the label set to nine categories, introducing Gravel, Tracks, Shadow, Background, and Unknown to address complex and high-risk scenarios encountered by the rover. The MarsScapes dataset [20] comprises 195 stitched panoramic images with widths ranging from 1230 to 12,062 pixels and heights from 472 to 1649 pixels. It extends the AI4Mars labels by adding Gravel, Steep Slope, Sky, and Others. Finally, the MarsData-V2 dataset [22], built upon the original Mars32K, contains 8390 RGB images of 512 × 512 resolution after data augmentation through flipping, rotation, translation, scaling, and shearing.
In this article, we adopt the nine fine-grained semantic categories from MSL-Seg as the baseline and extend them with newly released Curiosity rover observations from 2024. Since its landing in Gale Crater in 2012, Curiosity has traversed from clay-bearing strata into the sulfate-rich regions of Mount Sharp, conducting multiple drillings and documenting a continuous record of geological transitions [31]. This recent phase has captured lithologies absent in earlier datasets, including distinct hydrated sulfates [32,33]. To maintain taxonomic consistency, these sulfate-rich samples are incorporated under the Rocks category. The inclusion of these observations broadens the representativeness of terrain classes and introduces geologically validated contexts that reflect realistic exploration challenges. By integrating both established and newly identified features, our expanded dataset, named MarsTerr2024, captures environmental conditions absent in previous benchmarks, thereby offering greater diversity and novelty for semantic segmentation tasks. A summary of dataset statistics is presented in Table 2.
In addition to the real datasets summarized above, several synthetic datasets have been proposed, including SynMars [23], SynMars-TW [27], and SimMars6K [34]. These datasets are primarily designed for rock segmentation tasks. For instance, SimMars6K offers 6325 pairs of rendered stereo images accompanied by depth maps and rock annotations, supporting stereo depth estimation and rock detection. However, these synthetic datasets adopt task definitions and label spaces that differ from our nine-class terrain taxonomy. Since the objective of this work is to investigate general terrain perception on complex Martian surfaces using real rover imagery, we limit our experiments to real multi-class terrain datasets aligned with the MSL-Seg annotation scheme. We regard it as a promising future direction to construct a MarsTerr-style synthetic dataset with finer-grained terrain labels, drawing on the data generation methodologies of these synthetic datasets. Such a dataset would significantly expand the coverage and complexity of Martian terrain scenarios available for research.

3.2. Training Details

MarsTerrNet is a five-stage dual-backbone network. The CNN encoder employs depthwise separable convolutions with progressively increasing kernel sizes ([5, 5, 7, 7, 9]) and dilation rates ([1, 2, 3, 4, 5]) across its five stages to achieve hierarchical receptive field expansion. The CNN branch uses channel dimensions [32, 64, 160, 256, 320], providing sufficient representational capacity at the middle and deep layers. In parallel, the Swin Transformer branch follows the same five-level hierarchy, where overlapping patch embeddings generate downsampled feature maps of sizes $(H/4, W/4)$, $(H/8, W/8)$, $(H/16, W/16)$, $(H/32, W/32)$, and $(H/64, W/64)$, with hidden dimensions matched to the CNN channels. Each Transformer stage repeats two blocks consisting of W-MSA and SW-MSA layers, with window size fixed at $7 \times 7$, enabling balanced local and global modeling. The decoder mirrors this five-stage structure, where upsampling and 3 × 3 convolutions progressively restore the resolution to the input size 560 × 500. A final 1 × 1 convolution projects features to the class space, yielding the terrain segmentation map.
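For reference, the stated five-stage configuration can be collected into a single config; this layout merely restates the numbers above and is not taken from the released code.

```python
# Stage-wise configuration of MarsTerrNet as described in the text.
MARSTERRNET_CONFIG = {
    "cnn_channels": [32, 64, 160, 256, 320],   # CNN branch widths per stage
    "dw_kernels": [5, 5, 7, 7, 9],             # depthwise kernel sizes per stage
    "dilations": [1, 2, 3, 4, 5],              # dilation rates per stage
    "swin_downsampling": [4, 8, 16, 32, 64],   # feature-map stride vs. the input
    "swin_window": 7,                          # W-MSA / SW-MSA window size
    "blocks_per_stage": 2,                     # one W-MSA + one SW-MSA block
    "input_size": (560, 500),                  # resolution restored by the decoder
    "num_classes": 9,
}
```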
All experiments were conducted on an NVIDIA RTX 4090 GPU with a batch size of 4 for 160k iterations. The AdamW optimizer was used with a learning rate of $1 \times 10^{-4}$ and a weight decay of 0.01.
To evaluate model performance, we adopt several standard metrics: Accuracy (Acc), Precision (Pre), Recall (Rec), F1 Score (F1), and mean IoU (mIoU). The mIoU is reported as the primary benchmark for segmentation quality, calculated as the average IoU across all semantic classes.
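As a reference for how the primary benchmark is computed, the sketch below derives per-class IoU and mIoU from a confusion matrix; this is the standard definition, not code from the paper.

```python
import numpy as np

def mean_iou(pred, gt, num_classes=9):
    # pred, gt: integer label arrays of identical shape, values in [0, num_classes).
    k = num_classes
    cm = np.bincount(k * gt.ravel() + pred.ravel(), minlength=k * k).reshape(k, k)
    inter = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    valid = union > 0                          # ignore classes absent from both maps
    iou = inter[valid] / union[valid]
    return iou.mean()
```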

3.3. Ablation Studies

As detailed in Table 3, the ablation results reveal significant performance variations across different backbone architectures. The ResNet-34 and standard Transformer configuration consistently underperforms, achieving the lowest scores on all evaluation metrics. This suggests that the limited receptive field of a shallow CNN, combined with the computational inefficiency and lack of inductive bias in standard self-attention, fails to capture the intricate details and long-range context of Martian terrain. In contrast, integrating Swin Transformer with the same ResNet-34 backbone delivers improvements in Acc, F1, and mIoU. These gains confirm the superiority of the hierarchical and shifted-window attention scheme, which excels at capturing multi-scale spatial relationships crucial for terrain segmentation.
Replacing ResNet-34 with our proposed PRB backbone yields further performance gains, evidenced by notable increases in Rec and mIoU. This improvement confirms that the PRB design provides superior boundary sensitivity and richer local texture representation compared to standard convolutional blocks. The optimal performance is achieved by integrating PRB with Swin Transformer, a combination that attains the highest scores in Acc, F1, and mIoU. This result validates that the local feature refinement by PRB effectively complements the global contextual modeling of Swin Transformer. The synergy between these two components establishes an optimal balance between local precision and global semantic reasoning, culminating in more reliable and robust segmentation.
Table 4 reports the effect of different components in the proposed feature-guided loss. The baseline model, trained without relation-guided ($L_{rpl}$ + $L_{hal}$) or boundary-aware ($L_{bdice}$) supervision, achieves 63.76% mIoU and 77.22% F1, indicating a limited capacity to differentiate terrain categories with subtle inter-class relationships. Introducing $L_{rpl}$ and $L_{hal}$ brings moderate yet consistent gains across metrics, with notable improvements in Rec and mIoU. This suggests that embedded geological priors effectively aid in distinguishing closely related categories. When $L_{bdice}$ is applied independently, Rec rises to 80.11% and F1 to 78.84%, highlighting its strength in refining indistinct boundaries, though improvements are less evenly distributed across all metrics. The most substantial gains appear when relation-guided and boundary-aware supervisions are combined, with Acc, F1, and mIoU reaching 85.63%, 83.61%, and 67.83%, respectively. These results demonstrate the complementary nature of the two guidance strategies: relation-guided constraints enhance discrimination among correlated terrain types, while boundary weighting enforces sharper edge delineation, jointly yielding more accurate and reliable segmentation.
To determine the optimal balance among the loss components, we conducted a sensitivity analysis on the weighting coefficients $\lambda_{rpl}$, $\lambda_{hal}$, and $\lambda_{bdice}$. The results in Table 5 demonstrate that the choice of these coefficients is crucial for achieving stable and consistent performance gains. We employed a binary search strategy to systematically evaluate representative weighting configurations and identify the most effective combination. The results reveal that assigning excessively large weights to either $\lambda_{hal}$ or $\lambda_{bdice}$ degrades performance, indicating that overemphasis on a single component disrupts the training dynamics and compromises overall segmentation quality. In contrast, setting both $\lambda_{hal}$ and $\lambda_{bdice}$ to 0.5 produces the most balanced outcome. The configuration (1.0, 0.5, 0.5) achieves the highest mIoU of 75.82%, together with consistent gains in accuracy, precision, and recall. Since the performance gap between coefficients of 0.5 and 1 is marginal and no additional benefit emerges from intermediate values such as 0.75 or 0.25, we adopt (1.0, 0.5, 0.5) as the default configuration in subsequent experiments.
Figure 4 illustrates the per-class performance of the complete model under the best loss configuration. For Martian Soil, Sands, and Gravel, the model achieves Acc of 85.53%, 87.75%, and 90.47%, respectively. This reflects both their high prevalence in the dataset and the model’s proficiency in classifying these broad terrain categories. However, a marked variation is observed in their mIoU: Martian Soil remains the most challenging to segment (41.20%), Sands attains a moderate level (53.39%), and Gravel is delineated with the highest mIoU (78.28%). These differences arise not only from their granulometric continuity but also from class-specific morphological characteristics. Sands benefits from distinctive aeolian ridges that form recognizable patterns, while Gravel is characterized by larger and rougher patches that can be more easily distinguished from surrounding surfaces. Martian Soil, in contrast, often appears as finer and more homogeneous regions with diffuse transitions, making it more prone to misclassification with adjacent categories.
The lithic units, Bedrock and Rocks, present another informative contrast. Bedrock and Rocks achieve Acc of 83.25% and 88.96%, with corresponding mIoU of 44.97% and 69.14%. This performance gap stems from frequent inter-class confusion. Bedrock generally forms extensive and relatively smooth surfaces, yet when layered structures or prominent outcrops appear, these areas may be mistaken for large Rocks. Rocks, by contrast, range from rounded boulders to irregular protrusions with more distinctive morphological cues, which enhances their separability. These characteristics explain why Rocks are recognized with higher accuracy than Bedrock. Shadows remain particularly difficult, with Acc of 80.01% and an mIoU of 49.52%. Their strong dependence on Rocks means that illumination often casts dark regions directly over rocky surfaces, causing parts of Rocks to be mislabeled as shadow. Shadows may also appear in narrow and confined spaces beneath Rocks, further complicating their detection. This intrinsic entanglement between lighting effects and lithic structures makes Shadows–Rocks boundaries especially challenging to resolve.
By contrast, Tracks, Background, and Unknown regions are relatively independent of other terrains and therefore exhibit more balanced performance. Tracks achieve 93.69% Acc and 88.49% mIoU, while Background and Unknown reach over 90% Acc, with mIoUs of 82.56% and 67.01%, respectively. These results suggest that their clear visual distinction and minimal overlap with other categories make them easier to identify with consistency.
To further investigate the impact of the proposed loss function on easily confusable categories, we perform an in-depth analysis focusing on six representative terrain classes: Martian Soil, Sands, Gravel, Bedrock, Rocks, and Shadows. These classes constitute three typical confusion groups in Martian terrain segmentation, as previously outlined. Table 6 summarizes the per-class mIoU for MarsTerrNet trained without $L_{rpl}$ and $L_{hal}$ and for the full model using both terms.
The relation-guided supervision brings consistent mIoU gains across all six categories. Martian Soil, Sands, and Gravel show mIoU improvements of 4.51, 2.66, and 2.37 percentage points, respectively. These results indicate that the loss function helps the model better distinguish terrains with gradual changes in grain size and texture. Martian Soil, which sometimes appears in transitional zones between Sands and Gravel, benefits the most in this group. By comparison, Sands and Gravel already exhibit clearer morphological features such as dune ridges or small blocky structures. Thus their baseline mIoU is higher and the gains from the relation-guided loss are smaller. For Bedrock and Rocks, mIoU rises by 5.79 and 1.61 percentage points. This shows that the model improves at separating flat, covered Bedrock from Rocks. Bedrock poses a greater classification challenge. Its smoother texture and frequent partial coverage reduce its visual salience compared to the more prominent Rocks, so its mIoU drops more without $L_{rpl}$ and $L_{hal}$.
These targeted improvements directly explain the global performance gains observed earlier. The resulting mIoU values align well with the geological relationships modeled by our loss function. This correspondence between performance and geological knowledge shows that the proposed loss function improves more than just the overall metrics. It specifically reduces errors caused by intrinsic terrain continuity and illumination effects, rather than by annotation noise or limited model capacity. These findings underscore the value of explicitly incorporating geological dependencies into the segmentation model.
Overall, the distribution of per-class Acc and mIoU aligns closely with the geological priors embedded in our loss design. Categories characterized by inherent continuity or strong spatial correlation remain challenging to segment precisely. In contrast, visually and spatially distinct categories are identified with significantly higher reliability. This correspondence between performance patterns and geological knowledge demonstrates that the proposed relation-guided loss provides supervision that is not only effective but also consistent with domain characteristics. Furthermore, the results indicate that the primary segmentation challenges arise from intrinsic terrain continuity and illumination effects, rather than annotation artifacts or model capacity limitations, underscoring the importance of explicitly modeling geological dependencies in Martian terrain analysis.
Figure 5 visualizes the feature maps from five network stages for four representative inputs. With relation-guided supervision, the model takes advantage of geological cues to improve perception of terrain types that are otherwise prone to confusion. As the network depth increases, the feature responses become increasingly refined, showing enhanced activation in semantically coherent regions and sharper suppression of ambiguous boundaries. This evolving representation culminates in final predictions that are both accurate and geologically consistent.
Figure 5a illustrates a complex setting where Rocks are tightly surrounded by Bedrock. These conditions typically cause boundary leakage, yet the deeper stages exhibit strong responses precisely at the contact zones, indicating that the model learns to focus on ambiguous interfaces where misclassification is most likely. In Figure 5b, the rock in the lower right corner has a relatively flat surface, making it visually similar to Bedrock. Without relation guidance and contextual information, such areas are often merged into a single bedrock region. With supervision, however, the network correctly delineates the outer extent as Rocks while assigning the inner shaded portion to Shadows, thereby avoiding misclassification. Another noteworthy aspect in this case is the Martian Soil distributed along the boundary between Sands and adjacent terrains. Its irregular orientation and diffuse transitions typically make it prone to confusion. The feature maps show that the network consistently attends to these regions across stages, gradually reinforcing the separation between Martian Soil and neighboring categories. This behavior reflects the effect of relation-guided supervision in enlarging the margins between geologically continuous categories, while boundary weighting sharpens attention to uncertain interfaces. Together, these mechanisms allow the model to maintain accurate recognition even under challenging conditions where morphological cues are ambiguous. In Figure 5c, a Bedrock block protrudes from the lower right, with its cast shadow providing clear evidence of elevation. Correct identification of such raised structures is critical for rover safety, as they pose potential hazards to navigation. The network successfully separates the elevated Bedrock from the surrounding Rocks and delineates the associated shadow boundary, thereby avoiding confusion between illumination effects and lithic features. In the lower left, a small patch of Martian Soil is squeezed between two Bedrock regions. Despite its limited size and the dominance of neighboring units, this area is preserved as an independent category. Above it lies an extensive Gravel deposit whose coarse texture might otherwise obscure fine distinctions, yet the network maintains clear separation. Figure 5d involves the rover body, labeled as Unknown. This category is relatively independent and less confusable with geological classes, allowing the model to maintain stable recognition of Unknown while directing most of its attention to class boundaries between terrains.
To examine how the proposed relation-guided loss influences the organization of the feature space, we visualize the final-layer decoder embeddings using t-SNE. Figure 6 compares the distributions obtained from the model trained without $L_{rpl}$ and $L_{hal}$ and from the full model. Without relation-guided supervision, the embeddings of Martian Soil, Sands, and Gravel show substantial overlap, reflecting their inherent granulometric continuity and the difficulty of separating these categories through appearance alone. Bedrock and Rocks also form entangled clusters, consistent with their similar surface morphology. Shadows lie close to and partially overlap with Rocks in the embedding space, which agrees with their strong spatial dependence on rocky surfaces and the tendency to be confused with Rocks under complex illumination.
With relation-guided supervision, the class-wise clusters become significantly more compact and better separated. Martian Soil, Sands, and Gravel move from a blended distribution to three clearly identifiable groups. Bedrock and Rocks become more distinct, even though these categories remain geologically related. Shadows form a tight and coherent cluster, showing that the model learns a more stable representation of illumination effects. The overall structure demonstrates that the proposed loss improves not only numerical metrics but also the organization of the embedding space by increasing intra-class compactness and enlarging margins between closely related terrains. The improvements observed after applying feature-guided loss therefore provide direct evidence that it imposes a more discriminative feature structure, consistent with the geological dependencies.
To further evaluate the generalization capability of MarsTerrNet, we perform a cross-dataset validation using the publicly available MSL-Seg. This dataset adopts the same nine-class taxonomy and image dimensions as MarsTerr2024, rendering it suitable for cross-domain evaluation. In this setup, all models are trained exclusively on the MSL-Seg training split and are subsequently evaluated on the MarsTerr2024 test set without any fine-tuning.
To mitigate potential data leakage, only images sourced from MSL-Seg are used during training, whereas the MarsTerr2024 test split comprises more recent observations from the Curiosity rover that do not overlap with the MSL-Seg training set. These newer samples encompass sulfate-rich samples and exhibit more complex structural patterns, introducing a considerable domain shift relative to the earlier terrains captured in MSL-Seg.
Table 7 summarizes the cross-dataset evaluation results. As anticipated, all methods experience a performance drop compared to their in-dataset scores, which can be attributed to the greater variability and the presence of novel lithologies in MarsTerr2024. SegFormer suffers the most significant degradation, with mIoU and F1 decreasing by 9.20% and 10.58%, respectively, revealing its limited robustness when generalizing to data from later mission phases. MarsNet demonstrates comparatively better stability, exhibiting only moderate reductions across all metrics.
MarsTerrNet achieves the best overall performance in this challenging setting, attaining 78.23% Acc, 76.52% F1, and 62.87% mIoU. Crucially, it also exhibits the smallest performance degradation, with only a 3.48% decrease in mIoU. This notably smaller drop suggests that the relation-guided learning strategy effectively enhances the model’s adaptability to unfamiliar geological contexts, increased intra-class variance within the Rocks category, and more complex boundary conditions. These findings confirm that the improvements offered by MarsTerrNet are not dataset-specific and remain effective under realistic domain shifts encountered in cross-dataset transfer.

3.4. Comparison with State-of-the-Art Methods

To provide a comprehensive assessment of MarsTerrNet, we conducted comparative experiments on the MarsTerr2024 dataset with four groups of representative models: CNN-based methods [9,35,36,37], U-Net-based methods [11,38,39,40], Transformer-based architectures [17,41], and U-shaped Transformer models [21,29,42,43,44]. Quantitative results are summarized in Table 8, where five standard evaluation metrics are reported.
The quantitative results in Table 8 highlight the distinct strengths of different model categories. U-Net-based methods consistently outperform CNN-based approaches, with Acc improving from 69.73% on DANet to 73.54% on MultiResUnet and mIoU increasing from 46.72% to 50.28%. This gain arises mainly from effective multi-scale feature fusion, crucial for recognizing geological structures across diverse spatial scales. Transformer-based models further advance performance, pushing mIoU from 50.28% on NI-U-Net++ to 56.36% on SegFormer, demonstrating the importance of long-range dependency modeling for separating terrains that are spatially adjacent and visually similar, such as Bedrock and Rocks. U-shaped Transformer models integrate local feature extraction with global context modeling and achieve balanced performance across metrics. For instance, SwinUpperNet attains 80.12% Acc and 61.86% mIoU, while MarsNet reaches 81.83% Acc and 63.08% mIoU. Building upon this foundation, MarsTerrNet introduces relation-guided learning to strengthen interactions among terrains. It achieves the highest scores across all metrics, with Acc of 84.05%, mIoU of 66.32%, Pre of 81.86%, Rec of 82.95%, and F1 of 82.40%. The concurrent lead in both Pre and Rec signifies its enhanced capability to capture fine-grained details while maintaining semantic consistency, enabling robust segmentation within the complex and unstructured landscapes of Mars.
Qualitative comparisons in Figure 7 provide further insights into model performance under challenging scenarios. In Figure 7a, the Tracks are embedded within extensive Sands regions, making their separation inherently difficult. CNN- and U-Net-based methods often merge the Tracks into the surrounding Sands, while Transformer-based models yield more consistent boundaries but still fail to capture the narrow structures. MarsTerrNet delineates the Tracks boundaries more clearly, reducing misclassification. In Figure 7b, the smooth surfaces of Rocks and Bedrock exhibit high visual similarity, leading to frequent confusion. The elongated strip of Martian Soil between Sands and other categories presents an additional challenge: CNN- and U-Net-based models largely miss this region, while Transformer-based models preserve part of the Martian Soil distribution, but with indistinct boundaries. MarsTerrNet produces more continuous and well-defined segmentation, closer to the ground truth. Figure 7c contains numerous Rocks of varying sizes interspersed with small Bedrock patches. CNN- and U-Net-based models tend to over-segment, producing fragmented predictions, whereas Transformer-based methods often underrepresent smaller details. MarsTerrNet maintains a balance between large-scale structure and small-scale details, reducing both fragmentation and omission. In Figure 7d, the small patch of Martian Soil and the visually similar Bedrock and Rocks in the lower-right corner are difficult to distinguish. DeepLabv3+, U-Net3+ and SegFormer show noticeable confusion in these regions, while MarsTerrNet provides a clearer separation with sharper boundaries. In Figure 7e, distant Gravel regions exhibit a gray tone similar to that of Rocks, leading to frequent misclassifications. CNN- and U-Net-based models often fragment this region into scattered predictions, while Transformer-based models tend to over-smooth it, reducing category distinctiveness. MarsTerrNet produces comparatively more coherent results and mitigates part of the confusion, though some misclassification with Gravel remains.
In summary, the comparative experiments illustrate how different model families address the complexities of Martian terrain. CNN- and U-Net-based models capture fine details but struggle when surface categories exhibit similar textural or color characteristics. Transformer-based approaches enforce stronger global semantic consistency, though often at the expense of local information. U-shaped Transformer architectures mitigate this trade-off by integrating both local and global cues. Building on these insights, MarsTerrNet introduces relation-guided learning to enhance discrimination among closely related categories. Quantitative and qualitative results confirm that our framework achieves higher Acc and a better balance between local precision and global consistency, which is essential for robust segmentation of Mars' diverse and unstructured geological surfaces.

4. Discussion

Previous studies have demonstrated the advantages of integrating CNN or U-Net architectures with Transformer blocks, enabling joint use of local texture cues and global context for Martian terrain analysis. However, most existing hybrid approaches still combine these two types of information in a relatively loose manner and treat terrain categories as independent labels. Consequently, the local stream is underutilized in resolving fine boundary details, the Transformer stream fails to explicitly encode geological structures, and the models remain limited in capturing the fine-grained terrain relationships inherent in planetary surface evolution. MarsTerrNet addresses these issues through a dual-backbone architecture combined with explicit modeling of inter-class terrain relations. Structurally, the progressive-dilation PRB backbone and the Swin Transformer branch are better suited to Martian scenes than previous hybrids: the PRB enhances edge sensitivity and stratified texture representation through depth-wise filtering and progressively expanding receptive fields, while the window-based and shifted-window attention in Swin Transformer captures long-range spatial organization without sacrificing local contrast. Ablation studies confirm that this combination achieves the highest Acc and mIoU among all tested variants, indicating that local and global cues are not only incorporated but also effectively fused for segmenting complex rover imagery.
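To make the dual-backbone design above concrete, the following PyTorch-style sketch illustrates a residual block built from depth-wise convolutions with progressively expanding dilation rates, plus a simple fusion of the local stream with a same-resolution global feature map. The class names, the (1, 2, 4) dilation schedule, and the concatenation-plus-projection fusion are illustrative assumptions; the published PRB and fusion modules may differ in their exact configuration.

```python
import torch
import torch.nn as nn

class ProgressiveResidualBlock(nn.Module):
    """Illustrative PRB-style block: each stage applies a depth-wise 3x3
    convolution with a larger dilation, so the receptive field grows
    progressively while depth-wise filtering preserves local contrast."""
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        layers = []
        for d in dilations:
            layers += [
                nn.Conv2d(channels, channels, kernel_size=3, padding=d,
                          dilation=d, groups=channels),   # depth-wise, dilated
                nn.Conv2d(channels, channels, kernel_size=1),  # point-wise mix
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            ]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return x + self.body(x)  # residual connection keeps fine edge detail

class DualBackboneFusion(nn.Module):
    """Hypothetical fusion of the local CNN stream with a global stream
    (e.g., a Swin feature map at the same resolution): concatenate, project."""
    def __init__(self, channels):
        super().__init__()
        self.local = ProgressiveResidualBlock(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x, global_feat):
        return self.fuse(torch.cat([self.local(x), global_feat], dim=1))
```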
The proposed feature-guided loss incorporates geological knowledge instead of treating classes as independent: $L_{rpl}$ embeds terrain coupling relationships into the feature space, $L_{hal}$ aligns multi-scale features, and $L_{bdice}$ strengthens uncertain interfaces. The chosen weight configuration is supported by an ablation study on the loss weights (Table 5), which shows stable gains without overemphasizing any single component. Cross-dataset evaluation from MSL-Seg to MarsTerr2024 shows that MarsTerrNet experiences smaller performance degradation than SegFormer and MarsNet, suggesting that relation-guided learning enhances adaptability to new lithologies and more complex terrain structures encountered in later mission phases. Furthermore, per-class analysis on six easily confused terrain categories, along with t-SNE visualizations of feature distributions, confirms that the loss not only improves overall metrics but also specifically reduces errors stemming from intrinsic terrain interrelationships. Together with the best Acc and mIoU achieved on MarsTerr2024, these results demonstrate that the combined dual-backbone and feature-guided loss design leads to a more reliable understanding of surface environments under realistic and heterogeneous Martian conditions.
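As an illustration of how such a composite objective can be assembled, the sketch below implements a boundary-weighted Dice term in the spirit of $L_{bdice}$ and combines it with the other components under λ weights taken from Table 5. The exact forms of $L_{rpl}$ and $L_{hal}$ are model-specific, so they enter here as precomputed scalars; the boundary weight map, the cross-entropy base term, and the helper names are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def boundary_weighted_dice(logits, target, boundary_w, eps=1e-6):
    """Dice loss in which pixels near class interfaces receive larger
    weights. `boundary_w` (B, H, W) could come from a distance transform
    of the label edges; this exact weighting form is an assumption."""
    probs = torch.softmax(logits, dim=1)                        # (B, K, H, W)
    onehot = F.one_hot(target, probs.size(1)).permute(0, 3, 1, 2).float()
    w = boundary_w.unsqueeze(1)                                 # (B, 1, H, W)
    inter = (w * probs * onehot).sum(dim=(2, 3))                # weighted overlap
    union = (w * (probs + onehot)).sum(dim=(2, 3))
    dice = (2 * inter + eps) / (union + eps)
    return 1 - dice.mean()

def total_loss(logits, target, boundary_w, l_rpl, l_hal,
               lam_rpl=1.0, lam_hal=0.5, lam_bdice=0.5):
    """Weighted combination mirroring the best setting in Table 5. The
    prototype term (l_rpl) and hierarchical-alignment term (l_hal) are
    passed in as precomputed scalar tensors."""
    ce = F.cross_entropy(logits, target)  # assumed base segmentation term
    return (ce + lam_rpl * l_rpl + lam_hal * l_hal
            + lam_bdice * boundary_weighted_dice(logits, target, boundary_w))
```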
A limitation of the dual-backbone architecture is its higher parameter count and slower inference compared with lightweight baselines, which limits high-frequency or real-time use. Although it improves fine-scale detail and long-range terrain relations, it increases memory usage and latency, making it less suitable for online or onboard deployment under strict resource constraints. Future work will therefore explore model compression, simplified architectures, and more efficient inference to improve practicality in resource-limited settings.
Currently, our dataset is primarily focused on a limited set of landing sites, which restricts the diversity of lithologies and surface processes that the model can reliably handle. In future work, we plan to incorporate multi-mission data from rovers such as Curiosity, Perseverance, and Zhurong to broaden the range of geological settings and imaging conditions. From a modeling perspective, we aim to progress beyond the current closed-set supervised learning framework toward an open-world geological perception model for long-term rover operations. Such a model should not only recognize known geological units but also be capable of detecting and flagging novel or rare terrain structures encountered on board. This capability would enable timely reporting of previously unseen features to human scientists and reduce the risk of missing scientifically valuable targets.

5. Conclusions

In this work, we introduce MarsTerrNet, a U-shaped dual-backbone framework for Martian terrain segmentation that integrates Progressive Residual Blocks with a Swin Transformer, enabling the joint capture of fine-grained spatial details and global contextual dependencies. To address the inherent ambiguity among terrain categories, we design a feature-guided loss that combines relation-aware prototype regularization ($L_{rpl}$), hierarchical feature aggregation ($L_{hal}$), and a boundary-weighted Dice loss ($L_{bdice}$). This composite loss models inter- and intra-class relations explicitly and emphasizes uncertain boundaries, guiding the network toward geologically consistent discrimination. Furthermore, we extend the MarsTerr2024 dataset to nine terrain categories and conduct extensive experiments against state-of-the-art baselines. The results confirm that MarsTerrNet achieves consistently superior performance across the evaluated metrics. Future work will explore modeling terrain evolution to enhance the understanding of dynamic surface processes and to support the identification of scientifically valuable sites for Martian exploration.

Author Contributions

Conceptualization, R.W. and K.Z.; methodology, R.W.; investigation and data curation, G.Q., C.L. and H.Q.; writing—original draft preparation, R.W.; writing—review and editing, J.S., K.Z. and J.W.; supervision, J.S., K.Z. and J.B.; project administration, Q.Z. and W.W.; funding acquisition, K.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant No. XDA0430103.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to institutional data-sharing policies, ongoing research and further analysis based on the same annotated dataset.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Golombek, M.P.; Trussell, A.; Williams, N.; Charalambous, C.; Abarca, H.; Warner, N.H.; Deahn, M.; Trautman, M.; Crocco, R.; Grant, J.A.; et al. Rock Size-Frequency Distributions at the InSight Landing Site, Mars. Earth Space Sci. 2021, 8, e2021EA001959. [Google Scholar] [CrossRef]
  2. Viúdez-Moreiras, D.; Newman, C.; De la Torre, M.; Martínez, G.; Guzewich, S.; Lemmon, M.; Pla-García, J.; Smith, M.; Harri, A.M.; Genzer, M.; et al. Effects of the MY34/2018 global dust storm as measured by MSL REMS in Gale crater. J. Geophys. Res. Planets 2019, 124, 1899–1912. [Google Scholar] [CrossRef]
  3. Ehlmann, B.L.; Edwards, C.S. Mineralogy of the Martian surface. Annu. Rev. Earth Planet. Sci. 2014, 42, 291–315. [Google Scholar] [CrossRef]
  4. Siljeström, S.; Czaja, A.D.; Corpolongo, A.; Berger, E.L.; Li, A.Y.; Cardarelli, E.; Abbey, W.; Asher, S.A.; Beegle, L.W.; Benison, K.C.; et al. Evidence of Sulfate-Rich Fluid Alteration in Jezero Crater Floor, Mars. J. Geophys. Res. Planets 2024, 129, e2023JE007989. [Google Scholar] [CrossRef]
  5. Burl, M.C.; Thompson, D.R.; de Granville, C.; Bornstein, B.J. Rockster: Onboard rock segmentation through edge regrouping. J. Aerosp. Inf. Syst. 2016, 13, 329–342. [Google Scholar] [CrossRef]
  6. Li, J.; Zhang, L.; Wu, Z.; Ling, Z.; Cao, X.; Guo, K.; Yan, F. Autonomous Martian rock image classification based on transfer deep learning methods. Earth Sci. Inform. 2020, 13, 951–963. [Google Scholar] [CrossRef]
  7. Bickel, V.T.; Conway, S.J.; Tesson, P.A.; Manconi, A.; Loew, S.; Mall, U. Deep learning-driven detection and mapping of rockfalls on Mars. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 2831–2841. [Google Scholar] [CrossRef]
  8. Bampis, L.; Gasteratos, A.; Boukas, E. CNN-based novelty detection for terrestrial and extra-terrestrial autonomous exploration. IET Cyber-Syst. Robot. 2021, 3, 116–127. [Google Scholar] [CrossRef]
  9. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  10. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  11. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  12. Furlán, F.; Rubio, E.; Sossa, H.; Ponce, V. Rock detection in a Mars-like environment using a CNN. In Proceedings of the Pattern Recognition: 11th Mexican Conference, MCPR 2019, Querétaro, Mexico, 26–29 June 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 149–158. [Google Scholar]
  13. Lee, C. Automated crater detection on Mars using deep learning. Planet. Space Sci. 2019, 170, 16–28. [Google Scholar] [CrossRef]
  14. Ogohara, K.; Gichu, R. Automated segmentation of textured dust storms on mars remote sensing images using an encoder-decoder type convolutional neural network. Comput. Geosci. 2022, 160, 105043. [Google Scholar] [CrossRef]
  15. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  16. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  17. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Red Hook, NY, USA, 6–14 December 2021; Volume 34, pp. 12077–12090. [Google Scholar]
  18. Li, Y.; Zhang, K.; Cao, J.; Timofte, R.; Van Gool, L. Localvit: Bringing locality to vision transformers. arXiv 2021, arXiv:2104.05707. [Google Scholar]
  19. Chu, X.; Tian, Z.; Zhang, B.; Wang, X.; Shen, C. Conditional positional encodings for vision transformers. arXiv 2021, arXiv:2102.10882. [Google Scholar]
  20. Liu, H.; Yao, M.; Xiao, X.; Cui, H. A hybrid attention semantic segmentation network for unstructured terrain on Mars. Acta Astronaut. 2023, 204, 492–499. [Google Scholar] [CrossRef]
  21. Lv, W.; Wei, L.; Zheng, D.; Liu, Y.; Wang, Y. MarsNet: Automated rock segmentation with transformers for Tianwen-1 mission. IEEE Geosci. Remote Sens. Lett. 2022, 20, 1–5. [Google Scholar] [CrossRef]
  22. Liu, H.; Yao, M.; Xiao, X.; Xiong, Y. Rockformer: A u-shaped transformer network for martian rock segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–16. [Google Scholar] [CrossRef]
  23. Xiong, Y.; Xiao, X.; Yao, M.; Liu, H.; Yang, H.; Fu, Y. Marsformer: Martian rock semantic segmentation with transformer. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4600612. [Google Scholar] [CrossRef]
  24. Fan, L.; Yuan, J.; Niu, X.; Zha, K.; Ma, W. RockSeg: A Novel Semantic Segmentation Network Based on a Hybrid Framework Combining a Convolutional Neural Network and Transformer for Deep Space Rock Images. Remote Sens. 2023, 15, 3935. [Google Scholar] [CrossRef]
  25. Jia, Y.; Wan, G.; Li, W.; Li, C.; Liu, J.; Cong, D.; Liu, L. EDR-TransUnet: Integrating Enhanced Dual Relation-Attention with Transformer U-Net For Multi-scale Rock Segmentation on Mars. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4601416. [Google Scholar] [CrossRef]
  26. Dai, Y.; Zheng, T.; Xue, C.; Zhou, L. SegMarsViT: Lightweight mars terrain segmentation network for autonomous driving in planetary exploration. Remote Sens. 2022, 14, 6297. [Google Scholar] [CrossRef]
  27. Xiong, Y.; Xiao, X.; Yao, M.; Cui, H.; Fu, Y. Light4Mars: A lightweight transformer model for semantic segmentation on unstructured environment like Mars. ISPRS J. Photogramm. Remote Sens. 2024, 214, 167–178. [Google Scholar] [CrossRef]
  28. Li, J.; Zi, S.; Song, R.; Li, Y.; Hu, Y.; Du, Q. A stepwise domain adaptive segmentation network with covariate shift alleviation for remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  29. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  30. Swan, R.M.; Atha, D.; Leopold, H.A.; Gildner, M.; Oij, S.; Chiu, C.; Ono, M. Ai4mars: A dataset for terrain-aware autonomous driving on mars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1982–1991. [Google Scholar]
  31. Fraeman, A.A.; Edgar, L.A.; Rampe, E.B.; Thompson, L.M.; Frydenvang, J.; Fedo, C.M.; Catalano, J.G.; Dietrich, W.E.; Gabriel, T.S.; Vasavada, A.; et al. Evidence for a diagenetic origin of Vera Rubin ridge, Gale crater, Mars: Summary and synthesis of Curiosity’s exploration campaign. J. Geophys. Res. Planets 2020, 125, e2020JE006527. [Google Scholar] [CrossRef]
  32. Clark, J.; Sutter, B.; McAdam, A.; Lewis, J.; Franz, H.; Archer, P.; Chou, L.; Eigenbrode, J.; Knudson, C.; Stern, J.; et al. Environmental changes recorded in sedimentary rocks in the clay-sulfate transition region in Gale Crater, Mars: Results from the Sample Analysis at Mars-Evolved Gas Analysis instrument onboard the Mars Science Laboratory Curiosity Rover. J. Geophys. Res. Planets 2024, 129, e2024JE008587. [Google Scholar] [CrossRef]
  33. Bennett, K.A.; Fox, V.K.; Bryk, A.; Dietrich, W.; Fedo, C.; Edgar, L.; Thorpe, M.T.; Williams, A.J.; Wong, G.M.; Dehouck, E.; et al. The Curiosity rover’s exploration of Glen Torridon, Gale crater, Mars: An overview of the campaign and scientific results. J. Geophys. Res. Planets 2023, 128, e2022JE007185. [Google Scholar] [CrossRef]
  34. Ma, C.; Li, Y.; Lv, J.; Xiao, Z.; Zhang, W.; Mo, L. Automated rock detection from Mars rover image via Y-shaped dual-task network with depth-aware spatial attention mechanism. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–18. [Google Scholar] [CrossRef]
  35. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  36. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  37. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
  38. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.W.; Wu, J. Unet 3+: A full-scale connected unet for medical image segmentation. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1055–1059. [Google Scholar]
  39. Kuang, B.; Wisniewski, M.; Rana, Z.A.; Zhao, Y. Rock segmentation in the navigation vision of the planetary rovers. Mathematics 2021, 9, 3048. [Google Scholar] [CrossRef]
  40. Ibtehaz, N.; Rahman, M.S. MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Netw. 2020, 121, 74–87. [Google Scholar] [CrossRef]
  41. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar]
  42. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
  43. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 205–218. [Google Scholar]
  44. Wang, H.; Cao, P.; Wang, J.; Zaiane, O.R. Uctransnet: Rethinking the skip connections in u-net from a channel-wise perspective with transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 2441–2449. [Google Scholar]
Figure 1. (a) shows a sample from MarsTerr2024, and (b) presents its corresponding label. In this unstructured scene, the segmentation of bedrock (lower left, blue box) is influenced not only by its consistency with other bedrock regions but also by its relationships with the heterogeneous areas marked by boxes of other colors. Notably, the surface features of bedrock are more similar to those of rocks, while differing clearly from other terrains. (c–e) Martian Soil, Sands, and Gravel, respectively, showing increasing particle size from fine to coarse. Their surfaces are similar, making them difficult to distinguish, and their boundaries are blurred.
Figure 4. Comparison of per-class Acc (left column) and mIoU (right column) for Martian terrain categories. The vertical axis denotes percentage values, and the horizontal axis lists terrain classes.
Figure 5. Visualization of feature maps in MarsTerrNet. (a–d) show four examples. The first column shows the input images, the next five columns present multiscale feature maps from the five network stages, and the last column displays the ground truth.
Figure 6. t-SNE visualization of final-layer decoder embeddings. (Left): model trained without $L_{rpl}$ + $L_{hal}$. (Right): model trained with $L_{rpl}$ + $L_{hal}$. Clusters become more compact and better separated with the proposed feature-guided loss function.
Figure 7. Visualization of results in MarsTerrNet. (a–e) show five examples. The first column shows the input images, the second column displays the ground truth, and the next five columns present the results from the five models.
Table 1. Summary of representative hybrid architectures. Backbone Type denotes the encoder backbone employed in each work, while Architecture Type specifies how CNN and Transformer components are combined within the encoder. Sequential Hybrid refers to serial coupling of the two modules, whereas Replacement Hybrid indicates that Transformer blocks replace the CNN blocks within U-Net-like encoder structures.

Method             | Backbone Type       | Architecture Type  | Performance-Driven/Efficiency-Oriented
HASS [20]          | CNN + Attention     | Sequential Hybrid  | Per.-driven
MarsNet [21]       | CNN + Transformer   | Sequential Hybrid  | Per.-driven
RockFormer [22]    | U-Net + Transformer | Replacement Hybrid | Per.-driven
MarsFormer [23]    | U-Net + Transformer | Replacement Hybrid | Per.-driven
RockSeg [24]       | U-Net-based         | -                  | Per.-driven
EDR-TransUnet [25] | U-Net + Transformer | Sequential Hybrid  | Per.-driven
SegMarsViT [26]    | U-Net + Transformer | Replacement Hybrid | Eff.-oriented
Light4Mars [27]    | U-Net + Transformer | Replacement Hybrid | Eff.-oriented
Table 2. Dataset comparison. The last column in the table indicates whether the dataset contains newly released sulfate data.

Datasets     | Year | Classes | Annotated Images | Image Size             | Sulfate
Mars32K      | 2018 | -       | 32,000           | 560 × 500              | ×
AI4Mars-MSL  | 2021 | 4       | 17,030           | 1024 × 1024            | ×
MSL-Seg      | 2022 | 9       | 4155             | 560 × 500              | ×
MarsScapes   | 2022 | 8       | 195              | 1230~12,062 × 472~1649 | ×
MarsData-V2  | 2023 | 2       | 8390             | 512 × 512              | ×
MarsTerr2024 | -    | 9       | 6000             | 560 × 500              | ✓
Table 3. Ablation study of dual-backbone configurations.

Modules                      | Acc   | Pre   | Rec   | F1    | mIoU
ResNet-34 + Transformer      | 73.85 | 72.47 | 71.43 | 71.95 | 58.43
ResNet-34 + Swin Transformer | 75.17 | 75.24 | 73.82 | 74.52 | 60.54
PRB + Transformer            | 76.45 | 75.01 | 75.44 | 75.22 | 61.21
PRB + Swin Transformer       | 78.23 | 76.62 | 77.83 | 77.22 | 63.76
Table 4. Ablation study on different loss configurations.

$L_{rpl}$ + $L_{hal}$ | $L_{bdice}$ | Acc   | Pre   | Rec   | F1    | mIoU
-                     | -           | 78.23 | 76.62 | 77.83 | 77.22 | 63.76
✓                     | -           | 82.36 | 80.51 | 81.27 | 80.90 | 65.37
-                     | ✓           | 80.87 | 77.64 | 80.11 | 78.84 | 64.06
✓                     | ✓           | 85.63 | 82.62 | 84.63 | 83.61 | 67.83
Table 5. Ablation study on different loss weightings; the best result in each column is achieved by the λ = (1, 0.5, 0.5) configuration.

$\lambda_{rpl}$ | $\lambda_{hal}$ | $\lambda_{bdice}$ | Acc   | Pre   | Rec   | F1    | mIoU
1               | 1               | 0.5               | 84.05 | 83.54 | 83.06 | 83.23 | 68.05
1               | 0.5             | 1                 | 86.46 | 84.59 | 81.15 | 82.81 | 68.41
1               | 0.5             | 0.5               | 88.32 | 86.91 | 85.38 | 86.14 | 71.82
0.5             | 0.5             | 0.5               | 85.93 | 84.54 | 84.35 | 84.44 | 70.68
0.5             | 1               | 0.5               | 83.79 | 82.44 | 83.73 | 83.08 | 68.57
0.5             | 1               | 1                 | 82.56 | 80.32 | 81.29 | 80.80 | 65.49
0.5             | 0.5             | 1                 | 83.12 | 81.86 | 82.95 | 82.40 | 66.32
Table 6. Per-class mIoU on six easily confused terrain categories with and without the relation-guided loss terms $L_{rpl}$ and $L_{hal}$. The up arrow indicates the improvement over the without-$L_{rpl}$ + $L_{hal}$ condition.

Setting                   | Martian Soil  | Sands         | Gravel        | Bedrock       | Rocks         | Shadows
w/o $L_{rpl}$ + $L_{hal}$ | 37.69         | 50.73         | 75.91         | 39.18         | 67.93         | 45.26
w/ $L_{rpl}$ + $L_{hal}$  | 41.20 (↑3.51) | 53.39 (↑2.66) | 78.28 (↑2.37) | 44.97 (↑5.79) | 69.14 (↑1.21) | 49.52 (↑4.26)
Table 7. Cross-dataset evaluation: models trained on MSL-Seg and tested on the non-overlapping test split of MarsTerr2024. The down arrow indicates the decrease relative to the results obtained from training on MarsTerr2024.

Model          | Acc           | F1 Score       | mIoU
SegFormer [17] | 67.45 (↓8.97) | 60.31 (↓10.58) | 47.16 (↓9.20)
MarsNet [21]   | 75.64 (↓6.19) | 72.93 (↓7.19)  | 57.41 (↓5.67)
MarsTerrNet    | 78.23 (↓4.33) | 76.52 (↓5.88)  | 62.87 (↓3.48)
Table 8. Comparison of different segmentation models on MarsTerr2024.

Category                   | Model             | Acc   | mIoU  | Pre   | Rec   | F1 Score
CNN-based                  | FCN [9]           | 65.47 | 41.58 | 63.27 | 47.42 | 54.23
                           | PSP-Net [35]      | 66.89 | 42.06 | 64.09 | 50.52 | 56.49
                           | DeepLabv3+ [36]   | 68.44 | 44.25 | 66.95 | 53.68 | 58.34
                           | DANet [37]        | 69.73 | 46.72 | 67.56 | 53.59 | 59.76
U-Net-based                | U-Net [11]        | 70.09 | 43.65 | 72.17 | 48.36 | 57.92
                           | U-Net3+ [38]      | 72.39 | 45.26 | 71.85 | 50.34 | 59.18
                           | NI-U-Net++ [39]   | 71.67 | 50.28 | 70.82 | 55.29 | 62.08
                           | MultiResUnet [40] | 73.54 | 48.39 | 76.77 | 53.79 | 63.26
Transformer-based          | SETR [41]         | 74.53 | 52.25 | 73.21 | 63.29 | 67.87
                           | SegFormer [17]    | 76.42 | 56.36 | 75.69 | 66.59 | 70.89
U-shaped Transformer-based | TransUnet [42]    | 78.44 | 56.19 | 76.25 | 70.53 | 73.28
                           | Swin-Unet [43]    | 79.05 | 58.74 | 77.89 | 74.53 | 76.18
                           | SwinUpperNet [29] | 80.12 | 61.86 | 79.38 | 78.11 | 78.75
                           | UCT-TransNet [44] | 60.68 | 62.29 | 80.53 | 81.95 | 81.20
                           | MarsNet [21]      | 81.83 | 63.08 | 83.17 | 77.36 | 80.12
                           | MarsTerrNet       | 84.05 | 66.32 | 81.86 | 82.95 | 82.40