MAFT: A Lightweight Network for Martian Rock Segmentation Based on an Adaptive Frequency Transformer

Li, Chu; Jia, Yutong; Wan, Gang; Ma, Qifang; Liu, Jia; Wang, Yang; Wang, Biao; Liu, Jia; Wei, Zhanji

doi:10.3390/rs18111794


by Chu Li ^1,2, Yutong Jia ^1,2,*, Gang Wan ^1,2, Qifang Ma ^1,2, Jia Liu ^1, Yang Wang ^1,2, Biao Wang ^3, Jia Liu ^4 and Zhanji Wei ^1

^1 Space Information Academic, Space Engineering University, Beijing 101407, China
^2 Key Laboratory of Intelligent Processing and Application Technology of Satellite Information, Beijing 100192, China
^3 State Key Laboratory of Remote Sensing Science, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100101, China
^4 Key Laboratory of Planetary Science and Frontier Technology, Institute of Geology and Geophysics, Chinese Academy of Sciences, Beijing 100029, China
^* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(11), 1794; https://doi.org/10.3390/rs18111794

Submission received: 9 May 2026 / Revised: 22 May 2026 / Accepted: 25 May 2026 / Published: 1 June 2026

(This article belongs to the Special Issue Advances in Exploring the Moon, Mars, and Asteroids Based on In Situ and Remote Sensing Measurements (Second Edition))


Highlights

What are the main findings?

We proposed the Mars Adaptive Frequency Transformer (MAFT), a lightweight network building upon AFFormer with AKConv and EMCA, which achieves 88.90% Intersection over Union (IoU) with only 2.97 M parameters and 15.49 G floating-point operations (FLOPs), surpassing all compared lightweight and Mars-specific segmentation models.
We constructed the TWMARS-V2 dataset with fine-grained annotations, addressing the high omission rate of small rocks in existing datasets and establishing a robust evaluation benchmark.

What are the implications of the main findings?

With a high inference speed of 35.25 frames per second (FPS) and low computational cost, MAFT is highly suitable for deployment on resource-constrained onboard hardware, enabling real-time obstacle avoidance for future Mars rovers.
The network’s robustness under dust coverage and complex textures supports automated rock size and morphology statistics through a practical measurement workflow.

Abstract

The segmentation of rocks on the Martian surface is crucial for navigation and obstacle avoidance by Mars rovers. However, frequent dust storms degrade rock surface textures, and the wide range of rock scales—from sub-meter to ten-meter—further complicates segmentation, especially under the strict computational constraints of rover hardware. This paper proposes a lightweight network named MAFT, specifically designed for Martian rock segmentation. The network builds upon the Adaptive Frequency Transformer (AFFormer) and constructs an improved backbone termed the Improved Adaptive Frequency Transformer (IAFFormer). By replacing the traditional self-attention mechanism with a frequency-domain approach, it captures global feature dependencies while reducing the computational complexity from quadratic to linear. The spatially isolated 1 × 1 convolutions in the pixel descriptor module are further replaced with Adaptive Kernel Convolution (AKConv), enabling the backbone to dynamically adjust its sampling positions to conform to the irregular and diverse morphologies of Martian rocks. An Enhanced Multidimensional Convolutional Attention (EMCA) module is introduced as the decoding structure. By integrating max-pooling in the squeeze stage and adaptive dilated convolutions in the excitation stage, EMCA strengthens the boundary perception and long-range dependency modeling of dust-covered rocks without increasing the parameter count. Additionally, we constructed a dataset of Martian rocks for the Zhurong rover (TWMARS-V2) and conducted experiments using a synthetic dataset (SynMars) and a real dataset (MarsData-V2). Experimental results demonstrate that MAFT achieves the highest segmentation accuracy among all compared methods, with only 2.97 M parameters and 15.49 G FLOPs. On the TWMARS-V2 dataset, Pixel Accuracy (PA) reaches 98.17%, and IoU reaches 88.90%.

Keywords:

Martian rock segmentation; lightweight semantic segmentation; adaptive frequency transformer; adaptive kernel convolution (AKConv); enhanced multidimensional convolutional attention (EMCA); TWMARS-V2

1. Introduction

Mars is a key target for extraterrestrial exploration. Segmenting rocks on its surface is directly related to safe obstacle avoidance by Mars rovers and to the efficiency of geological research [1,2,3,4]. Accurate rock segmentation provides information about the distribution of obstacles for path planning [5]. It can also support astrobiological analysis: for instance, the ExoMars mission uses semantic segmentation techniques to identify the distribution characteristics of organic carbon-containing mineral indicators and mineral assemblages, and to guide drilling and sampling for tracing the ancient Martian environment [6]. Furthermore, in shallow subsurface drilling missions, accurate rock segmentation plays a critical role in analyzing the exposure of rocks at crater wall outcrops to support the selection of subsurface sampling targets [7]. However, Martian rocks often exhibit multi-scale variation in size from sub-meter to ten-meter levels [8], surface textures blurred by perennial dust storms, and irregular morphologies. Additionally, the onboard computational resources of rovers are limited, e.g., the peak power of the Perseverance rover’s CPU is less than 5 W [9], which severely constrains the applicable segmentation algorithms.

In recent years, autonomous technologies for deep space exploration have made advancements in many fields, including intelligent interpretation of remote sensing targets, with related technologies demonstrating strong robustness in engineering practices. Traditional methods for Martian rock segmentation primarily rely on shallow visual features. For example, the Mars Boulder Automatic Recognition System (MBARS) recognizes boulders in HiRISE images through shadow segmentation and ellipse fitting [10]. However, these approaches exhibit significant limitations in the extreme Martian environment. Under dust coverage or low-light conditions, the degradation of local texture features severely compromises their environmental adaptability.

Similar segmentation challenges arise on other planetary bodies. On the Moon, Bickel et al. [11] applied a CNN-based detector to LROC NAC imagery for automated rockfall identification, demonstrating that deep learning can scale to global lunar surveys despite extreme illumination contrast and shadow occlusion. Related work on Chang’E imagery [12] applied deep and transfer learning for lunar crater identification, further demonstrating the effectiveness of deep learning architectures for geological feature recognition under challenging planetary surface conditions. These parallel research lines indicate that multi-scale representation, boundary refinement under low contrast, and efficient global context modeling are common technical challenges in planetary and geological segmentation. The components proposed in this work are not Mars-specific in design, and could in principle be transferred to such adjacent tasks through retraining, leaving a quantitative cross-domain evaluation as future work.

Convolutional neural networks (CNNs) have substantially improved planetary rock segmentation. Encoder–decoder architectures, such as U-Net, along with multi-scale parsing networks, such as DeepLabV3 [13], have been widely applied to planetary rock detection. However, these approaches rely on convolutions with fixed geometric structures, which struggle to simultaneously capture the fine details of small rocks and the overall contours of large rocks on the Martian surface. Moreover, their dependence on local texture features makes them prone to misclassifying dust-covered rocks as background. Although recent lightweight CNN designs have partially addressed the computational constraints, they still lack global context modeling capabilities. Inspired by the successful applications in the field of natural language processing, Vision Transformer (ViT) has also achieved significant breakthroughs in the field of computer vision by capturing global context through self-attention [14]. The integration of CNNs and Transformers has proven effective in compensating for the Transformer’s limitations in extracting fine-grained local features, and such hybrid architectures have also been explored for Martian rock segmentation.

However, the quadratic computational complexity O(N²d) of self-attention remains a critical bottleneck. The standard ViT-Base requires 86 M parameters [15], and Swin Transformer requires 236 G FLOPs [16], far exceeding the capacity of low-power embedded GPUs on Mars rovers. Existing strategies to reduce computational cost, including window partitioning [17], sequence reduction [18] and token merging [19], provide only limited relief and may compromise global or local semantic information. Recent efforts such as Light4Mars [20] have pushed the boundaries of extreme lightweight design by reducing the parameter count to as low as 2.57 M. However, such aggressive model compression severely restricts the representational capacity of the network. This limitation leads to a noticeable degradation in segmentation accuracy, particularly in the recognition of small rocks and dust-obscured boundaries.

Despite these advances, current approaches for Martian rock segmentation still face the following challenges:

Martian rocks present extreme scale variations, as shown in Figure 1b,f, alongside highly irregular morphologies. Current CNNs relying on fixed receptive fields struggle to maintain fine-to-global spatial awareness across these scales.
Perennial dust storms severely degrade surface textures as shown in Figure 1c,g. This creates a fundamental difficulty: CNNs tend to confuse texture-degraded rocks with the surrounding sand, while standard Transformers capture global context but lack the local inductive bias needed to delineate blurred boundaries.
Acquiring labeled data of Martian surface rocks is difficult, and publicly available annotated datasets remain scarce. The existing annotated datasets related to the rocks of the Zhurong rover have low completeness, with a high omission rate in the annotation of small rocks.
The embedded systems of Mars rovers are subject to stringent constraints on power consumption and computing capabilities. However, existing models struggle to achieve an optimal balance between computational efficiency and segmentation precision. High-accuracy Transformer-based models designed for rock segmentation, such as MarsFormer, entail large parameter counts and per-inference computational loads that surpass the capacity of onboard computing units. Conversely, extreme lightweight architectures, such as Light4Mars, sacrifice essential representational capacity and fail to reliably detect critical but subtle obstacles. Therefore, a principled framework that effectively balances deployment efficiency with high-accuracy segmentation remains lacking.

In this article, we propose a lightweight Martian rock segmentation framework, namely MAFT, which abandons the redundant connections of traditional U-shaped encoder–decoder architectures and uses a streamlined hybrid architecture, with an improved adaptive frequency Transformer (IAFFormer) serving as the backbone network.

We identify that the spatially isolated descriptors produced by pointwise convolutions in AFFormer’s pixel descriptor module are insufficient for distinguishing rocks from spectrally similar sandy backgrounds under dust-degraded conditions. To address this, we integrate AKConv [21] into the pixel descriptor module, enabling shape-aware local feature aggregation that conforms to the irregular contours of Martian rocks. In MAFT, the shape and distribution of convolution kernels are adjusted dynamically by learning the prior distribution of rock morphologies, thereby improving the accuracy of target detection. In addition, we propose the Enhanced Multidimensional Convolutional Attention (EMCA) module to address the specific challenges of Martian rock segmentation. While EMCA retains the efficient three-branch topology of the original MCA [22], we introduce two critical internal modifications to the transformation logic rather than simply adopting the baseline.

First, in the squeeze stage, we integrate max-pooling alongside average and standard deviation pooling. Standard average pooling computes the mean activation over a spatial region, which tends to smooth out sharp transitions; as a result, high-contrast rock boundaries, already weakened by Martian dust, are further suppressed. Max-pooling preserves the strongest activation within each region, thereby retaining the salient boundary cues essential for separating dust-obscured rocks from terrain. Second, and most importantly, we redesign the excitation stage. Instead of standard convolutions, we implement adaptive dilated convolutions. This mechanism expands the receptive field along the channel and spatial dimensions without increasing parameters, enabling the network to capture the long-range dependencies required for recognizing rocks with extreme scale variations. These enhancements allow EMCA to effectively distinguish low-texture rocks from the sandy background. To verify the effectiveness and robustness of MAFT, experiments are conducted on three datasets: the self-annotated Zhurong Utopia Planitia rock dataset TWMARS-V2, SynMars [23], which simulates Tianwen-1 scenarios, and MarsData-V2 [24], which contains real Curiosity rover scenarios.

Our main contributions are summarized below:

We propose MAFT, a lightweight framework for Martian rock segmentation that combines adaptive convolution and enhanced attention with a frequency-domain Transformer backbone. With only 2.97 M parameters and 15.49 G FLOPs, MAFT achieves the highest segmentation accuracy among all compared methods.
We construct an improved backbone termed IAFFormer by building upon the AFFormer architecture and replacing the fixed-grid 1 × 1 convolutions in the pixel descriptor module with AKConv. Standard pointwise convolutions produce spatially isolated descriptors that lack local context, making them unreliable for distinguishing rocks from spectrally similar sandy backgrounds under dust-degraded conditions. AKConv enables shape-aware local feature aggregation through dynamically adjusted sampling positions, yielding geometrically adaptive descriptors that better conform to irregular rock contours.
We design the EMCA module with a triple-branch structure for simultaneous channel, height, and width attention, integrating hybrid pooling and adaptive dilated convolutions to improve boundary discrimination under dust occlusion.
We release TWMARS-V2, an improved version of the TWMARS dataset with exhaustive re-annotation covering all visible rock instances, providing a more complete benchmark for Martian rock segmentation research.

Figure 1. Representative Martian surface images illustrating key segmentation challenges. The base map is derived from MOLA data [25]. The pentagram and circle mark the Zhurong and Curiosity landing sites, respectively. Red circles highlight segmentation-challenging regions. (a) Regional context of the Zhurong landing site. (b–d) Images from the TWMARS-V2 dataset: (b) multi-scale rocks; (c) dust-degraded rock textures; (d) overlapping rocks with ambiguous boundaries. (e) Regional context of the Curiosity landing site. (f–h) Images from the MarsData-V2 dataset [24]: (f) overlapping rocks of varying sizes; (g) dust-covered rocks with low background contrast; (h) small-scale rocks on sandy terrain.

2. Related Works

2.1. Semantic Segmentation of Martian Rocks

The semantic segmentation of Martian rocks is a pixel-level classification task fundamental to autonomous planetary exploration [26]. Early approaches relied on handcrafted features and classical machine learning. The Rockster algorithm, based on the Canny edge detector, was deployed in NASA’s AEGIS system for onboard rock detection [27], while SVMs combined with super-pixel grouping and contextual features were used for rock boundary extraction. These methods perform adequately under favorable imaging conditions but degrade substantially when rock textures are obscured by dust or shadow.

The introduction of deep learning brought marked improvements. Encoder–decoder frameworks, including U-Net [28] and NI-U-Net++ [29], were subsequently adapted for Martian rock segmentation, while DeepLabV3+ was applied to Martian skyline parsing [30]. To incorporate global context, MarsNet [31] appended Transformer encoders after CNN layers, partially alleviating the limited receptive fields inherent to purely convolutional designs.

More recent research has focused on designing lightweight and task-specific architectures for Martian rock and terrain segmentation. Along the CNN-centric trajectory, MarsSeg [32] employed Mini-ASPP for efficient multi-scale feature parsing, LBNet [33] adopted a bilateral lightweight structure to balance spatial detail and semantic representation, and Rocknet [34] further explored real-time Martian rock segmentation through cross-dimensional channel attention, dilated convolution, and feature fusion. Along the Transformer-centric and hybrid trajectory, MarsFormer [23] introduced Feature Enhancement Modules coupled with window Transformer blocks, RockFormer [24] tailored a ViT-based architecture to mitigate dust-induced texture degradation, and EDR-TransUnet [35] fused edge-enhanced representations with a Transformer-based decoder for improved boundary delineation. Pursuing extreme model compression, Light4Mars [20] reduced the parameter count to approximately 2.57 M while maintaining acceptable accuracy, though highly compressed models may still face challenges in preserving fine-grained rock boundaries and small-object details.

In addition to RGB-based architecture, recent studies have explored multimodal and onboard-oriented Martian segmentation. DepthFormer [36] incorporated stereo-derived depth information as an additional modality for semantic segmentation of Martian surface images, showing the value of geometric cues in weak-textured scenes. LisseMars [37] extended lightweight Martian semantic segmentation to Mars helicopter imagery by integrating window movable attention, convolutional feedforward modeling, dynamic polygon convolution, and multi-scale fusion for onboard-oriented perception. These works indicate that recent Martian segmentation research is moving toward lightweight deployment, multimodal perception, and improved local–global feature modeling.

Unlike the above studies, the present work focuses on RGB-based Martian rock segmentation under a fully supervised setting. It aims to jointly improve frequency-domain global dependency modeling, morphology-adaptive local feature aggregation, and boundary-sensitive multidimensional attention within a single lightweight framework, a combination that remains worth further investigation.

2.2. Adaptive Frequency Transformer

ViT leverages the self-attention mechanism of the Transformer, which excels at capturing global relationships, and exhibits immense potential in the image domain. However, the strong performance of both architectures depends critically on large parameter counts and substantial computational resources, which limits their deployment on resource-constrained devices. AFFormer [38] is a lightweight Transformer architecture that incorporates the Frequency Similarity Kernel (FSK) to reduce the computational complexity from quadratic to linear. The specific computation is as follows:

$$F_{i,j} = \frac{e^{k_{i}} v_{i}^{T}}{\sum_{j=1}^{n} e^{k_{j}}} \tag{1}$$

where $k_{i}$ represents the frequency component of the key matrix K, $v_{i}$ represents the frequency component of the value matrix V, and $F_{i,j}$ is the frequency similarity kernel. Matrices K and V are obtained through linear normalization of the input matrix.
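To see why a kernel of this form can scale linearly with the number of tokens, the following NumPy sketch may help. It assumes one plausible reading of Equation (1), in which the exponentiated keys are normalized over the n tokens and each token contributes an outer product with its value vector. The function name `frequency_similarity_kernel` and the toy shapes are illustrative choices and are not taken from AFFormer's code.

```python
import numpy as np

def frequency_similarity_kernel(K, V):
    """Assumed reading of Eq. (1): normalize exp(K) over tokens, then K^T V.

    K, V: arrays of shape (n, C) holding n tokens with C channels.
    Returns a (C, C) matrix whose size does not depend on n.
    """
    # Subtract the per-channel maximum for numerical stability.
    weights = np.exp(K - K.max(axis=0, keepdims=True))
    # Normalize over the token axis (the sum over j in the denominator).
    weights = weights / weights.sum(axis=0, keepdims=True)
    # Sum of per-token outer products: cost O(n * C^2), linear in n.
    return weights.T @ V

rng = np.random.default_rng(0)
n, C = 1024, 8
K = rng.standard_normal((n, C))
V = rng.standard_normal((n, C))
F = frequency_similarity_kernel(K, V)
print(F.shape)  # (8, 8)
```

Because the result is a C × C matrix, doubling the number of tokens only doubles the work, whereas an explicit n × n attention map would quadruple it.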

To further reduce the computational burden, AFFormer introduces a local clustering strategy before Transformer-based processing. Given an input feature map $F \in \mathbb{R}^{H \times W \times C}$, a low-resolution grid $M \in \mathbb{R}^{h \times w \times C}$ is initialized as a set of local cluster centers, where $h \times w \ll H \times W$. For each cluster center $M(s)$, the corresponding $\alpha \times \alpha$ spatial neighborhood is denoted as $\omega_{s}$, and the number of pixels in this neighborhood is $n = \alpha \times \alpha$. The cluster center is obtained by aggregating the pixel-level features within $\omega_{s}$ through weighted summation:

$$M(s) = \sum_{i=1}^{n} w_{s,i} x_{s,i} \tag{2}$$

where $x_{s,i}$ denotes the feature of the i-th pixel in the neighborhood $\omega_{s}$, and $w_{s,i}$ denotes the corresponding aggregation weight. In this work, $\alpha$ is set to three, and therefore each cluster center aggregates information from a $3 \times 3$ local neighborhood. This clustering operation compresses the spatial resolution from $H \times W$ to $h \times w$, allowing the subsequent Transformer operations to be performed on a compact representation and thereby reducing the computational burden.
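The weighted aggregation of Equation (2) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the neighborhoods are taken to be non-overlapping, the map has a single channel, and the weights default to uniform values because the text does not say how they are set. The name `cluster_centers` is hypothetical.

```python
def cluster_centers(feature, alpha=3, weights=None):
    """Aggregate each alpha x alpha block of a 2D map into one center (Eq. 2).

    feature: list of rows (H x W), with H and W divisible by alpha.
    weights: alpha*alpha aggregation weights w_{s,i}; uniform if omitted
             (an assumption, the source does not state how they are set).
    """
    n = alpha * alpha
    if weights is None:
        weights = [1.0 / n] * n
    H, W = len(feature), len(feature[0])
    centers = []
    for top in range(0, H, alpha):
        row = []
        for left in range(0, W, alpha):
            # Pixels x_{s,i} of the neighborhood, in row-major order.
            pixels = [feature[top + dy][left + dx]
                      for dy in range(alpha) for dx in range(alpha)]
            row.append(sum(w * x for w, x in zip(weights, pixels)))
        centers.append(row)
    return centers

# A 6 x 6 toy map is compressed to a 2 x 2 grid of cluster centers.
feature = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
M = cluster_centers(feature)
print(len(M), len(M[0]))  # 2 2
```

With α = 3 the grid holds nine times fewer positions than the input, which is the source of the savings in the subsequent Transformer operations.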

To strengthen the ability of the network to discern the boundaries between image categories, AFFormer comprises two pivotal modules: Transformer-based Prototype Learning (PL) and Convolution-based Pixel Descriptor (PD). In the PL module, the Frequency Similarity Kernel (FSK) generates an Adaptive Frequency Filter (AFF). The AFF module integrates a Dynamic Low-pass Filter (DLF) and a Dynamic High-pass Filter (DHF), designed to distill low-frequency and high-frequency information from different frequency bands. The DLF module processes spatial information via average pooling and channel grouping, which facilitates the extraction of multi-scale features. Bilinear interpolation is then applied so that the dimensions and fine details of the restored features are consistent with the original, as shown in the formula below:

$$D_{m}^{lf}(v^{m}) = R_{s \times s}(v^{m}) \tag{3}$$

where R signifies the operation of adaptive average pooling, $s \times s$ represents the dimensions of the pooling windows employed throughout the pooling process, which are of varying sizes, and $v^{m}$ is the collection of values associated with the m-th pooling window within FSK.
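As a rough illustration of the pooling operator R in Equation (3), the sketch below averages a single-channel 2D map into an s × s grid. The bin boundaries follow a floor/ceil convention that is common in deep learning frameworks; that convention, the single-channel input, and the name `adaptive_avg_pool2d` are assumptions made for the example, and the subsequent bilinear interpolation step is not shown.

```python
def adaptive_avg_pool2d(feature, s):
    """Average-pool a 2D map (list of rows) down to an s x s grid."""
    H, W = len(feature), len(feature[0])
    pooled = []
    for i in range(s):
        # Row range of bin i: floor(i*H/s) up to ceil((i+1)*H/s).
        r0, r1 = (i * H) // s, -((-(i + 1) * H) // s)
        row = []
        for j in range(s):
            c0, c1 = (j * W) // s, -((-(j + 1) * W) // s)
            vals = [feature[r][c]
                    for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(vals) / len(vals))
        pooled.append(row)
    return pooled

grid = [[float(4 * r + c) for c in range(4)] for r in range(4)]
print(adaptive_avg_pool2d(grid, 1))  # [[7.5]]
print(adaptive_avg_pool2d(grid, 2))  # [[2.5, 4.5], [10.5, 12.5]]
```

Smaller values of s average over larger windows and therefore keep only coarser, lower-frequency structure, which is why using several window sizes yields multi-scale low-pass responses.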

The DHF leverages convolutional kernels of diverse dimensions to capture high-frequency information, as elaborated in Equation (4), where $\eta$ denotes DO-Conv.

$$D_{n}^{hf}(v_{n}) = \eta_{k \times k}(v_{n}) \tag{4}$$

The resulting frequency information is expressed by Equation (5), where $D_{h}^{fc}$ represents the frequency similarity kernel with H groups to achieve an enhanced frequency component correlation, $D_{m}^{lf}$ represents the dynamic low-pass filter with M groups, $D_{n}^{hf}$ represents the dynamic high-pass filter with N groups, and $\| \cdot \|$ denotes concatenation.

$$\mathrm{AFF}(M) = \| D_{h}^{fc}(M) \|_{H} + \| D_{m}^{lf}(M) \|_{M} + \| D_{n}^{hf}(M) \|_{N} \tag{5}$$

In contrast to conventional Transformer models, AFFormer operates with a modest parameter count of 3 million, making it a suitable candidate for low-power deployment scenarios such as planetary rovers.

2.3. Martian Rock Data

Only a few publicly available datasets exist for Martian rock segmentation. The existing Martian surface datasets can be divided into three distinct categories.

The first is synthetic datasets, which are generated using simulation software or algorithms to mimic real rocks on the Martian surface. Ma et al. created a simulated dataset, SimMars6k [39]. Thompson [40] generated synthetic images of Martian terrain using the ROAMS rover simulator suite. The second is simulation datasets, which are collected in Mars-like environments on Earth. Niekum [41] used a dataset from the Atacama Desert to simulate rocks on the Martian surface for rock segmentation. The third is real datasets containing actual surface features of Mars. NASA released the Mars32k dataset [42], which covers various geographical features of Mars. Xiao created the first rock segmentation dataset, called Marsdata [43]. Liu et al. released an improved version, MarsData-V2 [24]. Lv et al. established the first small-scale Martian rock dataset, called TWMARS [31], based on images captured by the Zhurong rover in Utopia Planitia.

3. Methodology

In this section, we first present an overview of the MAFT architecture, followed by a detailed exposition of the Adaptive Kernel Convolution (AKConv) module that we adopt, as well as our Enhanced Multidimensional Convolutional Attention (EMCA) mechanism.

3.1. Overall Framework

The architecture of our proposed MAFT network is presented in Figure 2a. Given a rock image with the shape of $H \times W \times 3$, this network can distinguish between rocks and the background through pixel-level classification. Instead of a traditional U-shaped architecture, we used a simplified end-to-end architecture. This structure was primarily divided into four key stages.

In the first stage, we extracted features from the given Martian rock image of the shape $X \in \mathbb{R}^{H \times W \times C}$. The image first passes through a standard $3 \times 3$ convolutional layer to extract local features, followed by AKConv, which more finely extracts and segments the semantic information of different scales according to the size of the rock. We introduce this type of convolution in the next section.

In the second stage, we employed an IAFFormer as the core network serving as a lightweight Transformer to reduce computational load and achieve functional integration. We first defined a grid M serving as the initial low-resolution representation of the feature map, where each node of M acted as a local clustering core and processed only the data in its neighboring area during the initial stage. The initialization of each cluster center follows the weighted aggregation strategy described in Equation (2), which compresses the high-resolution feature map into a compact low-resolution representation for efficient subsequent processing. The feature information is denoted $M \in \mathbb{R}^{h \times w \times C}$ and fed into the parallel heterogeneous architecture (PHA), where prototype learning (PL) uses an adaptive feature fusion mechanism for prototype learning by iteratively optimizing the clustering centers. The PL branch is retained as the global modeling component of MAFT because its frequency-domain operations are particularly well-suited to dust-degraded Martian imagery: the Dynamic Low-pass Filter preserves the low-frequency semantic structure of large rocks whose textures have been homogenized by dust, while the Dynamic High-pass Filter retains high-frequency boundary cues needed to delineate small rocks from spectrally similar sandy backgrounds. Moreover, PL operates on spatially compressed cluster centers rather than the full-resolution map, enabling global dependency modeling at linear computational cost, a critical property for rover onboard deployment. The updated features are denoted as $M' \in \mathbb{R}^{h \times w \times C}$. The pixel descriptor (PD) module receives the updated semantic information from the PL module. It integrates the abstract semantic information from $M'$ with the unclustered pixel-level semantic information X in the original feature set, resulting in the final merged value $X' \in \mathbb{R}^{H \times W \times C}$. The complementary design of PL and PD ensures that global frequency-domain semantics and local geometry-aware details are jointly preserved. In this stage, the regular convolutions in PD were replaced with AKConv. The effectiveness of this design is validated in Section 4.4.3, where replacing IAFFormer with ResNet-50 leads to fragmented segmentation of large rocks, confirming the necessity of frequency-domain prototype learning for global semantic consistency.

In the third stage, we incorporated the Enhanced Multi-dimensional Convolutional Attention (EMCA) module. Unlike conventional attention mechanisms that often focus on a single dimension, EMCA employs a parallel three-branch architecture to simultaneously infer attention weights across channel, height, and width dimensions. To better adapt to Martian imagery, we improve the squeeze transformation by integrating max-pooling along with average and standard deviation pooling. This design enables the network to capture high-frequency boundary information of rocks more effectively, thereby enhancing feature discriminability under scenarios with dust occlusion and complex morphologies.

3.2. Adaptive Kernel Convolution (AKConv)

In response to the diversity of rock shapes, we adjust the shape of the convolutional kernel to modify the receptive field and enhance segmentation accuracy. Based on previous research, we introduce AKConv with adjustable kernel shapes; its architecture is shown in Figure 2b. This approach addresses the limitations of traditional fixed-shape convolutions while reducing the number of parameters and the computational burden. Its core is the initial sampling coordinate algorithm. The algorithm first generates a set of initial coordinates that define the sampling positions of the convolutional kernel, taking the size of the target into account. It then dynamically adjusts these coordinates based on the image features, thereby altering the convolutional sampling strategy and allowing the number of kernel parameters to be set to any value. When this number is set to 5, the various shapes of AKConv are shown in Figure 3.

To illustrate why AKConv is better suited than standard convolution for Martian rock segmentation, Figure 4 compares the two sampling strategies on three representative scenes. In panels (b), (e), and (h), fixed-grid sampling distributes points uniformly regardless of rock geometry, inevitably sampling across both rock and sandy background near irregular boundaries. Panels (c), (f), and (i) show the conceptual adaptive sampling of AKConv: sampling points concentrate around irregular boundaries in the isolated-rock scene, distribute around scattered targets in the small-rock scene, and align with adjacent rock boundaries in the clustered-rock scene. This comparison illustrates how adaptive sampling generates geometry-aware descriptors better suited to Martian rock morphologies.

The specific process of AKConv is as follows: The input is

x

. For a given convolution kernel size of k, the base integer B is calculated. Regular and irregular sampling coordinates are generated and concatenated to form

P_{n}

. The meshgrid operation is performed for regenerating the grid. The Offset is obtained after performing the offset convolution operation

P_{C onv}

.

P_{0}

represents the initial sampling coordinates. According to the adjusted sampling coordinates, the feature map is resampled, and new features are obtained by performing the convolution operation.

B = \sqrt{k}

(6)

P_{n} = \mathrm{meshgrid}\left( \left| \frac{k}{B} \right|, k \bmod B \right)

(7)

\mathrm{Offset} = P_{Conv}(x)

(8)

P = P_{0} + P_{n} + \mathrm{Offset}

(9)

\mathrm{AKConv}(x) = \mathrm{Conv}(\mathrm{sample}(x, P))

(10)
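The coordinate generation in Equations (6), (7), and (9) can be illustrated with a minimal sketch. It rests on two assumptions that the text does not state explicitly: B is rounded to the nearest integer, and the first meshgrid argument is the integer quotient of k and B. The function names `initial_sampling_coords` and `sampling_positions` are hypothetical, and the learned Offset is replaced by zeros here because it would come from a trained offset convolution.

```python
import math

def initial_sampling_coords(k):
    """Return k (row, col) offsets P_n for a kernel with k sampling points."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    base = round(math.sqrt(k))  # B, rounded to an integer (assumption)
    rows = k // base            # complete rows of the regular grid (assumption)
    rest = k % base             # points left over for one partial row
    # Regular part: a rows x base grid of coordinates.
    coords = [(r, c) for r in range(rows) for c in range(base)]
    # Irregular part: the remaining points form a shorter final row.
    coords += [(rows, c) for c in range(rest)]
    return coords

def sampling_positions(p0, pn, offsets):
    """Eq. (9) for one output location: P = P_0 + P_n + Offset."""
    return [(p0[0] + r + dr, p0[1] + c + dc)
            for (r, c), (dr, dc) in zip(pn, offsets)]

pn = initial_sampling_coords(5)
zero_offsets = [(0.0, 0.0)] * len(pn)  # stand-in for the learned Offset
positions = sampling_positions((10, 20), pn, zero_offsets)
print(pn)
print(positions)
```

Under these assumptions, k = 5 gives a 2 × 2 block plus one extra point, and any positive k yields exactly k sampling points, which is what allows the kernel size to take arbitrary values.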

In the image preprocessing stage, the image x, with a shape of H × W × 3, first passes through an ordinary 3 × 3 convolution to extract initial features. AKConv then dynamically adjusts the shape of the convolution kernel to extract the adjusted features M. The formulas are as follows:

f(x) = \mathrm{Conv}_{3 \times 3}(x)

(11)

M = \mathrm{AKConv}(f(x))

(12)

After the preprocessed features M enter IAFFormer, the 1 × 1 convolution block in PD is replaced with AKConv; the resulting framework is shown in Figure 2c. IAFFormer thereby makes better use of feature information at different scales, which enhances the recognition of rocks of multiple scales. Here, Bn denotes batch normalization. The formula is as follows:

F = M + \mathrm{Bn}(\mathrm{AKConv}(\mathrm{AFF}(M)))

(13)

3.3. Enhanced Multi-Dimensional Convolutional Attention (EMCA)

The motivation for introducing adaptive dilated convolutions stems from the inherent limitations of standard convolutions in the Martian rock segmentation task. Standard convolutions with fixed kernel sizes have limited receptive fields, making them insufficient to capture the full context of large rocks while simultaneously preserving fine details of small rocks. Simply stacking more layers to enlarge the receptive field would significantly increase the parameter count, contradicting our lightweight design goal. Adaptive dilated convolutions address this dilemma by dynamically adjusting the dilation rate, without substantially increasing the parameter count, thus enabling multi-scale feature extraction within a single layer.

Dust coverage and low contrast often weaken rock–background boundaries in Martian images. As a result, feature responses associated with small or partially buried rocks may be suppressed during global pooling or local convolution. EMCA is introduced to enhance boundary-sensitive and multi-scale feature responses while maintaining a lightweight structure. An attention mechanism can strengthen informative feature representations while attenuating irrelevant background features. We therefore add an Enhanced Multi-Dimensional Convolutional Attention (EMCA) mechanism to the convolutional network to improve the model’s recognition rate for rocks covered by sand and dust. Specifically, inspired by MCA [22], we use three branches (channel, spatial height, and spatial width) to inform the CNN model of what to attend to. To further adapt the aggregation of cross-dimensional feature responses to the characteristics of Martian images, we add max pooling to the squeeze transformation. Since CBAM [44] has verified that max pooling effectively gathers distinctive object features, we integrate it with average and standard deviation pooling through adaptive learnable weights. We use the pooled features directly to generate channel weights, without modeling the local interaction between channels, which greatly reduces both the amount of computation and the complexity of the model.

As shown in Figure 5, EMCA performs specific transformations to output the corresponding tensors. The input is the feature map F ∈ R^{C × H × W} output from the IAFFormer, where C is the number of channels, H is the height, and W is the width. In the first branch, F is rotated 90 degrees counterclockwise along H to obtain F₁. In the second branch, F is rotated 90 degrees counterclockwise along W to obtain F₂. In the third branch, F passes unchanged through the identity mapping function to generate F₃. Each of the three features undergoes the squeeze transformation (T_sq) to obtain its output, which passes through the activation function σ and is then applied by element-wise multiplication. Finally, the three branch outputs are averaged after inverse transformation to produce the final output. The formulas are as follows:

\hat{F}_{i} = T_{sq}(F_{i}), \quad i \in \{1, 2, 3\}

(14)

A_{i} = \sigma(\hat{F}_{i}), \quad F_{i}^{\prime} = A_{i} \otimes \hat{F}_{i}, \quad F_{i}^{\prime\prime} = PM_{H}^{-1}(F_{i}^{\prime})

(15)

As shown in Figure 6, the squeeze transformation uses three pooling methods: average pooling, standard deviation pooling, and max pooling. From the input feature map, they produce three channel-wise feature statistics:

\hat{F}_{i}^{avg} = [\hat{f}_{1}^{avg}, \hat{f}_{2}^{avg}, \dots, \hat{f}_{m}^{avg}],

\hat{F}_{i}^{std} = [\hat{f}_{1}^{std}, \hat{f}_{2}^{std}, \dots, \hat{f}_{m}^{std}],

\hat{F}_{i}^{\max} = [\hat{f}_{1}^{\max}, \hat{f}_{2}^{\max}, \dots, \hat{f}_{m}^{\max}].

The mathematical expressions of the three pooling methods are as follows:

\hat{f}_{m}^{avg} = \frac{1}{H \times W} \sum_{x = 1}^{H} \sum_{y = 1}^{W} \hat{f}_{m}(x, y)

(16)

\hat{f}_{m}^{std} = \sqrt{\frac{1}{H \times W} \sum_{x = 1}^{H} \sum_{y = 1}^{W} \left( \hat{f}_{m}(x, y) - \hat{f}_{m}^{avg} \right)^{2}}

(17)

\hat{f}_{m}^{\max} = \max_{1 \le x \le H,\; 1 \le y \le W} \hat{f}_{m}(x, y)

(18)

The results of the three pooling operations are fed into an adaptive combination mechanism to generate new values, where α, β, and γ are three trainable floating-point parameters in the range (0, 1).

\hat{F}_{i} = T_{sq}(F_{i}) = \frac{1}{3} \otimes (\hat{F}_{i}^{avg} \oplus \hat{F}_{i}^{std} \oplus \hat{F}_{i}^{\max}) \oplus \alpha \otimes \hat{F}_{i}^{avg} \oplus \beta \otimes \hat{F}_{i}^{std} \oplus \gamma \otimes \hat{F}_{i}^{\max}

(19)
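As a worked illustration of Equations (16) to (19), the sketch below computes the three statistics for a single H × W channel and combines them. The function names and the example values of α, β, and γ are assumptions made for illustration; in the network these three weights are learned.

```python
import math

def squeeze_stats(channel):
    """Average, standard deviation, and max of one H x W channel (Eqs. 16-18)."""
    values = [v for row in channel for v in row]
    n = len(values)
    avg = sum(values) / n
    std = math.sqrt(sum((v - avg) ** 2 for v in values) / n)
    return avg, std, max(values)

def squeeze(channel, alpha, beta, gamma):
    """Eq. (19): mean of the three statistics plus their weighted sum."""
    avg, std, mx = squeeze_stats(channel)
    return (avg + std + mx) / 3.0 + alpha * avg + beta * std + gamma * mx

channel = [[1.0, 2.0],
           [3.0, 4.0]]
avg, std, mx = squeeze_stats(channel)      # 2.5, sqrt(1.25), 4.0
value = squeeze(channel, 0.5, 0.5, 0.5)    # illustrative weights, not learned ones
print(avg, std, mx, value)
```

The fixed 1/3 term keeps all three statistics in play even when a learned weight becomes small, while α, β, and γ let training shift the emphasis, for example toward the max response for high-activation boundary cues.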

Through the above operations, we obtain the outputs of the three branches and average them to get the final enhanced output feature map. PM_H denotes a 90-degree counterclockwise rotation along the H-axis and PM_H^{-1} its inverse transformation; PM_W denotes a 90-degree counterclockwise rotation along the W-axis and PM_W^{-1} its inverse transformation. IM represents the identity mapping function.

F^{\prime\prime} = \frac{1}{3} \otimes (F_{W}^{\prime\prime} \oplus F_{H}^{\prime\prime} \oplus F_{C}^{\prime\prime})

(20)

F_{W}^{\prime\prime} = PM_{H}^{-1}(\sigma(T_{sq}(PM_{H}(F))) \otimes PM_{H}(F))

(21)

F_{H}^{\prime\prime} = PM_{W}^{-1}(\sigma(T_{sq}(PM_{W}(F))) \otimes PM_{W}(F))

(22)

F_{C}^{\prime\prime} = IM(\sigma(T_{sq}(IM(F))) \otimes IM(F))

(23)
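The three-branch fusion in Equations (20) to (23) can be sketched in plain Python on a nested list of shape C × H × W. The sketch assumes that each rotation simply moves the W or H axis into the channel position, so that rotating, gating, and rotating back is equivalent to multiplying every element by one weight per index along that axis. It also assumes that σ is the sigmoid function and that the three branches share one set of α, β, and γ; neither is stated in the text. The function names are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function (assumed form of the activation)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def squeeze(values, alpha, beta, gamma):
    """Eq. (19) applied to a flat list of activations."""
    n = len(values)
    avg = sum(values) / n
    std = math.sqrt(sum((v - avg) ** 2 for v in values) / n)
    mx = max(values)
    return (avg + std + mx) / 3.0 + alpha * avg + beta * std + gamma * mx

def emca(F, alpha=0.5, beta=0.5, gamma=0.5):
    """Average of the channel, height, and width branches (Eq. 20)."""
    C, H, W = len(F), len(F[0]), len(F[0][0])
    # One gate per index along each axis, computed from all other positions.
    gate_c = [sigmoid(squeeze([F[c][h][w] for h in range(H) for w in range(W)],
                              alpha, beta, gamma)) for c in range(C)]
    gate_h = [sigmoid(squeeze([F[c][h][w] for c in range(C) for w in range(W)],
                              alpha, beta, gamma)) for h in range(H)]
    gate_w = [sigmoid(squeeze([F[c][h][w] for c in range(C) for h in range(H)],
                              alpha, beta, gamma)) for w in range(W)]
    # Each branch rescales F by its gate; the three results are averaged.
    return [[[F[c][h][w] * (gate_c[c] + gate_h[h] + gate_w[w]) / 3.0
              for w in range(W)] for h in range(H)] for c in range(C)]

ones = [[[1.0] * 3 for _ in range(2)] for _ in range(4)]  # C=4, H=2, W=3
out = emca(ones)
print(out[0][0][0])
```

Because the gates are computed directly from pooled statistics, no convolution appears between the squeeze and the activation, which mirrors the choice to skip local cross-channel interaction.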

Finally, the feature F″ is passed through a depthwise separable convolution for classification, obtaining the final label map.

4. Experimental Results and Analysis

In this section, we introduce the datasets, evaluation metrics, and implementation details, and then assess the proposed MAFT through ablation studies and comparative experiments. For all experiments, the maximum number of training epochs was set to 200, and all input Martian rock images were resized to 512 × 512. We compared MAFT with common semantic segmentation networks and existing Martian rock segmentation networks in terms of parameters and performance.

4.1. Datasets

We used three datasets for the Martian rock segmentation experiments:

TWMARS-V2: As shown in Figure 7, TWMARS is the first dataset built from images of Martian rocks captured by China’s Zhurong rover. The initial dataset, however, accounted only for rocks exceeding 30 × 30 pixels. Most of the images captured by the navigation camera of the Zhurong rover show sand dunes and small rocks, and these smaller rocks significantly affect rover traversal on Mars. Consequently, the original 336 images were comprehensively re-annotated to include rocks of all visible sizes. Each image was independently annotated by two annotators using the LabelMe tool for high-precision contour labeling. When disagreements occurred, a third annotator reviewed the annotations and made the final decision. Rocks were systematically distinguished from sandy backgrounds through a multi-criteria assessment that integrated texture patterns, shadow gradients, and morphological continuity. Specifically, any geological formation with textural features readily perceivable by human observers was annotated, ensuring exhaustive coverage of all targets relevant to rover navigation. The 336 annotated images were subsequently expanded to 1344 samples using image augmentation functions from OpenCV [45], including geometric transformations and photometric adjustments that simulate the variable environmental conditions on Mars. Each image has a resolution of 512 × 512 pixels.

MarsData-V2: As shown in Figure 8a, these images were captured by the Mastcam of the Curiosity rover from August 2012 to November 2018 and carry fine-grained boundary annotations of Martian rocks. We selected 1800 images whose rocks resemble those found in the Utopia Planitia region as a supplementary dataset for our experiments, split into 1350 training samples, 300 validation samples, and 150 test samples. The images were also resized to 512 × 512 pixels.

SynMars: As shown in Figure 8b, this is a synthetic dataset rendered from a simulated environment created in Blender, with intrinsic and extrinsic camera parameters based on the Tianwen-1 probe. The undulating terrain and other scene features effectively simulate Martian rocks. The original images are 1024 × 1024 pixels, with 22,500 samples for training, 5000 for validation, and 2500 for testing. We resized the images to 512 × 512 for experimental verification.
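The expansion of TWMARS-V2 from 336 to 1344 samples corresponds to four samples per annotated image. The sketch below shows one way such a fourfold expansion could look. The specific transforms (horizontal flip and brightness shifts of ±30) are assumptions chosen for illustration, plain nested lists stand in for OpenCV arrays, and the function names are hypothetical; the text only states that geometric and photometric operations were used.

```python
def hflip(img):
    """Mirror a 2D image or mask left to right (geometric transform)."""
    return [row[::-1] for row in img]

def adjust_brightness(img, delta):
    """Shift pixel intensities and clip to the 8-bit range (photometric transform)."""
    return [[min(255, max(0, p + delta)) for p in row] for row in img]

def augment(image, mask):
    """Return four (image, mask) pairs: the original plus three variants.

    Geometric transforms must be applied to the mask as well so that labels
    stay aligned; photometric transforms leave the mask unchanged.
    """
    flipped_img, flipped_mask = hflip(image), hflip(mask)
    return [
        (image, mask),
        (flipped_img, flipped_mask),
        (adjust_brightness(image, 30), mask),
        (adjust_brightness(flipped_img, -30), flipped_mask),
    ]

image = [[10, 20], [30, 250]]  # toy grayscale image
mask = [[0, 1], [1, 0]]        # toy rock mask (1 = rock)
samples = augment(image, mask)
total = 336 * len(samples)     # 336 annotated images, four samples each
print(len(samples), total)
```

The point of the sketch is the pairing rule: a flipped image with an unflipped mask would silently corrupt the boundary labels that the fine-grained re-annotation was meant to provide.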

4.2. Evaluation Metrics

To evaluate the segmentation performance of MAFT, we employed four metrics: Intersection over Union (IoU), Precision, Pixel Accuracy (PA), and F₁.

IoU = \frac{TP}{TP + FP + FN}

(24)

\mathrm{Precision} = \frac{TP}{FP + TP}

(25)

PA = \frac{TP + TN}{TP + FP + TN + FN}

(26)

F_{1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

(27)

In the context of binary classification, True Positives (TP) denote the number of positive instances correctly identified as positive, False Positives (FP) the number of negative instances mistakenly classified as positive, True Negatives (TN) the number of instances correctly recognized as negative, and False Negatives (FN) the number of positive instances erroneously classified as negative. Recall in Equation (27) is the fraction of positive instances that are correctly identified, TP / (TP + FN).

In addition, we computed the number of floating-point operations (FLOPs) and the number of parameters (Params) for each model, as well as the inference speed in frames per second (FPS).
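The accuracy metrics in Equations (24) to (27) can be computed from a pair of binary masks as in the following sketch. The function names and the toy masks are illustrative assumptions and are not taken from the experiments; the sketch also assumes that at least one rock pixel is present and predicted, since the ratios are undefined otherwise.

```python
def confusion_counts(pred, target):
    """Count TP, FP, TN, FN over two flat lists of 0/1 labels (1 = rock)."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    tn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    return tp, fp, tn, fn

def segmentation_metrics(tp, fp, tn, fn):
    """IoU, Precision, Recall, PA, and F1 as defined in Eqs. (24)-(27)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "iou": tp / (tp + fp + fn),
        "precision": precision,
        "recall": recall,
        "pa": (tp + tn) / (tp + fp + tn + fn),
        "f1": 2 * precision * recall / (precision + recall),
    }

pred = [1, 1, 1, 0, 0, 0, 0, 1]    # toy prediction
target = [1, 1, 0, 0, 0, 0, 1, 1]  # toy ground truth
counts = confusion_counts(pred, target)
metrics = segmentation_metrics(*counts)
print(counts, metrics)
```

On heavily imbalanced scenes, where sandy background dominates, PA stays high even when many small rocks are missed, which is why IoU is the more informative headline number for this task.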

4.3. Implementation Details

We use the Adam optimizer, which adapts the learning rate per parameter and has been shown to perform well in segmentation tasks involving high-dimensional feature spaces. In remote sensing image segmentation, it outperforms optimizers such as SGD [46], Adagrad [47], and RMSProp [48], which is consistent with the experimental setups reported by [39,49]. Regarding specific parameter settings, the momentum is set to the default value of 0.9. Referring to [31], the learning rate is configured as 0.0003 to mitigate the risk of gradient explosion. The weight decay coefficient is determined to be 0.01 through validation on the validation set. For the loss function, Martian rock segmentation is essentially a classification task, and the cross-entropy loss function has significant advantages over loss functions like mean squared error (MSE) in classification tasks [50]. Thus, the cross-entropy loss function is finally adopted.

The model structure and training configurations are as follows. The network integrates four improved AFFormer modules, with stage embedding dimensions of [64, 128, 256, 512] in sequence. It takes RGB images as input and outputs single-channel grayscale images; the maximum number of training epochs is set to 200. The experiments were run on a machine with the Ubuntu 20.04 operating system and two GeForce RTX 3090 GPUs with 32 GB of memory, using an implementation based on the PyTorch 11.8 framework with CUDA 11.3 and Python 3.9.

4.4. Ablation Experiments

In this section, to validate the contribution of each proposed component, systematic ablation experiments are conducted on the TWMARS-V2 dataset. All experiments follow the single-variable principle: only one component is modified at a time while keeping all other settings identical, and each model is retrained from scratch to ensure fair comparison. The original AFFormer architecture without any modification serves as the unified baseline throughout all experiments.

4.4.1. Component Ablation Analysis

To evaluate the individual and combined effects of the AKConv and EMCA modules, five comparative experiments are designed in this section, as presented in Table 1. The original AFFormer is defined as baseline (a), which obtains an IoU of 84.32% with 3.02 M parameters and 14.68 G FLOPs. Variant (b) integrated with AKConv increases the IoU from 84.32% to 86.58%, and meanwhile reduces the parameter count from 3.02 M to 2.97 M. This performance gain stems from the capability of AKConv to dynamically adjust sampling positions according to rock morphological traits. Compared with the fixed-grid convolution in the original AFFormer, it enables more efficient multi-scale feature extraction.

Variant (c), equipped with only EMCA, achieves an IoU of 86.13%, validating that the multi-dimensional attention mechanism effectively enhances the distinction between low-texture rocks and sandy backgrounds through fused average, standard deviation, and max pooling. Variant (d) adopts AKConv with the original MCA module instead of the proposed EMCA, yielding an IoU of 87.52%, lower than the complete MAFT. This comparison supports the benefit of the enhanced EMCA design, where hybrid pooling preserves salient boundary responses and adaptive dilated convolution enlarges the receptive field without substantially increasing parameters.

The complete MAFT, variant (e), delivers the highest IoU of 88.90% and F1 of 94.12%, surpassing all partial variants. The total IoU improvement over the baseline reaches 4.58 percentage points, exceeding the 4.07-point sum of the individual gains from AKConv (2.26 points) and EMCA (1.81 points), which suggests that the two modules provide complementary improvements.

4.4.2. Computational Cost of Each Module

Table 2 reports the progressive computational cost introduced by each key component. The AFFormer baseline runs at 14.68 G FLOPs. Adding AKConv brings this to 15.22 G, an increase of 0.54 G. EMCA on top of that adds another 0.27 G, putting the full MAFT at 15.49 G FLOPs. In other words, a 4.58-point IoU improvement comes at a cost of just 0.81 G extra FLOPs, roughly a 5.5% overhead. For perspective, the standard ViT demands 385.46 G FLOPs and 144.06 M parameters yet only manages 80.75% IoU. MAFT cuts the FLOPs by a factor of 25 and the parameter count by a factor of 48, while still delivering an 8.15-point higher IoU. These results indicate that the proposed modules introduce only limited computational overhead while providing clear performance gains.
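The overhead figures quoted above follow directly from the reported FLOPs, parameter counts, and IoU values, as the short check below reproduces. All inputs are copied from the text; only the function name `cost_summary` and the dictionary keys are introduced here for illustration.

```python
def cost_summary():
    """Recompute the overhead and ratio figures from the reported numbers."""
    flops = {"AFFormer": 14.68, "AFFormer + AKConv": 15.22, "MAFT": 15.49}  # G
    iou = {"AFFormer": 84.32, "MAFT": 88.90, "ViT": 80.75}                  # %
    vit_flops, vit_params, maft_params = 385.46, 144.06, 2.97               # G, M, M
    extra = flops["MAFT"] - flops["AFFormer"]
    return {
        "akconv_extra_gflops": round(flops["AFFormer + AKConv"] - flops["AFFormer"], 2),
        "emca_extra_gflops": round(flops["MAFT"] - flops["AFFormer + AKConv"], 2),
        "total_extra_gflops": round(extra, 2),
        "overhead_percent": round(100 * extra / flops["AFFormer"], 1),
        "iou_gain_vs_baseline": round(iou["MAFT"] - iou["AFFormer"], 2),
        "flops_ratio_vs_vit": round(vit_flops / flops["MAFT"], 1),
        "params_ratio_vs_vit": round(vit_params / maft_params, 1),
        "iou_gain_vs_vit": round(iou["MAFT"] - iou["ViT"], 2),
    }

summary = cost_summary()
for name, value in summary.items():
    print(f"{name}: {value}")
```

The ratios come out at about 24.9 for FLOPs and 48.5 for parameters, consistent with the rounded factors of 25 and 48 stated above.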

4.4.3. Qualitative Ablation Results

To further verify each component’s contribution, we evaluate three degraded variants alongside the full MAFT. Each variant removes one key component, and the configurations and quantitative results are reported in Table 3. Visual comparisons are shown in Figure 9.

MAFT-1 replaces IAFFormer with ResNet-50, causing the most severe degradation. The IoU drops from 88.90% to 82.15%, and the Recall falls to 91.18%. As shown in Figure 9, large rocks are fragmented into disconnected pieces due to the loss of frequency-domain global modeling.

MAFT-2 replaces AKConv with standard 3 × 3 convolutions. The IoU decreases to 86.35%, and the Precision drops to 91.60%. Figure 9 reveals jagged boundaries around irregularly shaped rocks, as the rigid convolution grid cannot adapt to diverse rock contours.

MAFT-3 removes EMCA from the decoder. The IoU decreases to 86.58%, and the Recall drops to 93.80%. Figure 9 shows that dust-covered rocks with weak boundaries are missed or merged with the background, as the network loses multi-dimensional attention refinement.

Removing IAFFormer reduces the IoU by 6.75 percentage points, nearly three times the reduction caused by removing AKConv (2.55 points) or EMCA (2.32 points) individually, indicating that frequency-domain modeling is the most critical component. AKConv and EMCA provide complementary refinements in boundary precision and target recall, respectively.

4.5. Comparison with State-of-the-Art Methods

To verify the effectiveness of the proposed method, we compared it with advanced semantic segmentation models and rock segmentation models. We conducted quantitative and visual comparisons on the three datasets.

4.5.1. Quantitative Performance Comparison

We conducted a comprehensive comparative evaluation of the proposed model against representative Martian rock segmentation models and mainstream semantic segmentation architectures, with results shown in Table 4 and Table 5. The compared models include task-specific Martian rock segmentation models (MarsNet and NI-U-Net++); convolution-based models (EMO [51], FastFCN [52], DeepLabv3+ [53], PIDNet-S [54], MobileViT-S [55], MobileNetV3 [56], U-Net [57], and PSPNet [58]); and Transformer-based models (SegFormer [18], ViT, and Swin Transformer [16]).

As discussed in Section 2.1, additional task-specific methods, including MarsFormer, RockFormer, MarsSeg, Light4Mars, LBNet, and Rocknet, have recently been proposed for Martian rock segmentation. Because source code or pre-trained weights are unavailable for most of these methods, fully reproducible comparisons under identical settings could not be conducted. For reference, an indirect comparison based on published metrics indicates that MAFT achieves a favorable accuracy–efficiency trade-off: MAFT attains 92.80% IoU on SynMars with 2.97 M parameters, whereas Light4Mars reports approximately 72% mIoU with 2.57 M parameters on a different dataset, and MarsFormer exceeds 20 M parameters. Controlled comparisons will be conducted as open-source implementations become available.

On TWMARS-V2, MAFT achieves 93.43% precision and 88.90% IoU, outperforming all compared methods. Among task-specific Martian rock segmentation models, MarsNet reaches 92.58% precision and 84.56% IoU, while NI-U-Net++ lags behind at 82.63% precision and 70.14% IoU. Traditional CNN-based models perform worse: FastFCN achieves only 85.36% precision and 78.54% IoU, and U-Net reaches 80.54% precision and 72.26% IoU. Among lightweight real-time models, PIDNet-S and MobileViT-S offer competitive inference speed but fall short in accuracy, with IoU values of 82.49% and 84.16%, respectively, trailing MAFT by 6.41 and 4.74 percentage points. The Swin Transformer achieves the highest PA among the compared methods at 97.87%, but its IoU of 86.20% still falls 2.70 points below MAFT.

On MarsData-V2, MAFT achieves 98.18% precision and 96.62% IoU, surpassing MarsNet by 4.89 and 6.36 percentage points, and NI-U-Net++ by 5.51 and 7.57 percentage points, respectively. Among CNN-based models, DeepLabv3+ performs best at 94.30% IoU, followed by PSPNet at 92.11% and EMO at 90.24%. The lightweight PIDNet-S and MobileViT-S show limited cross-domain generalization, achieving only 75.83% and 77.62% IoU. Among Transformer-based models, Swin Transformer leads at 93.38% IoU, with SegFormer and ViT at 91.80% and 91.30%, all notably below MAFT.

4.5.2. Computational Complexity Comparison

Table 5 summarizes the parameters, FLOPs, and inference speed of different methods under the same input resolution of 512 × 512. Params and FLOPs are reported once for each model because they are mainly determined by the network architecture and input size, while GPU FPS is reported on the three test sets. CPU FPS is additionally reported on TWMARS-V2 to provide a controlled reference for CPU-based inference efficiency.

Overall, MAFT achieves a favorable balance between segmentation accuracy and computational efficiency. On MarsData-V2, MAFT achieves 96.62% IoU with 15.49 G FLOPs and 35.94 FPS, outperforming SegFormer by 4.82 percentage points in IoU while requiring approximately 23.8% fewer FLOPs. On TWMARS-V2, MAFT obtains 88.90% IoU with only 2.97 M parameters and 15.49 G FLOPs. Although its FLOPs are slightly higher than some ultra-lightweight models such as MobileNetV3, MAFT provides a substantially higher IoU, indicating a better accuracy–efficiency trade-off for Martian rock segmentation.

To further assess CPU-side efficiency, we report inference speed on an Intel Core i7-12800HX under the PyTorch framework. Since latency is affected by model architecture, input resolution, hardware platform, and implementation framework, these results are intended for relative comparison rather than direct representation of space-qualified onboard processors. MAFT achieves 8.46 FPS on CPU, outperforming MarsNet, MobileViT-S, ViT, and SegFormer. In particular, MarsNet drops to 2.2 FPS due to its large parameter count and high FLOPs, while MobileViT-S decreases to 4.3 FPS because self-attention operations are less efficient on the CPU. These results suggest that the lightweight design of MAFT provides practical inference advantages on CPU-based resource-constrained platforms, while its actual onboard performance still requires validation on space-qualified hardware.

To illustrate the accuracy–efficiency trade-off, Figure 10 presents a scatter plot of IoU versus FLOPs for all evaluated models on the TWMARS-V2 dataset. Each model is represented by a distinct marker according to its category, with MAFT highlighted as a red star. MAFT lies in the upper-left region of the plot, the ideal zone characterized by high IoU and low computational cost. Specifically, MAFT achieves the highest IoU of 88.90% while requiring only 15.49 G FLOPs and 2.97 M parameters. These results indicate that MAFT achieves a favorable accuracy–efficiency balance for resource-constrained Martian rock segmentation.

4.5.3. Cross-Dataset Generalization Analysis

A notable observation from Table 4 is that MAFT maintains consistently superior performance across three datasets with fundamentally different imaging characteristics. TWMARS-V2 contains real images captured by the Zhurong rover’s navigation camera under natural Martian dust and shadow conditions. MarsData-V2 comprises real images from the Curiosity rover’s Mastcam, which were acquired over a six-year span from 2012 to 2018 and therefore include diverse seasonal and illumination variations. SynMars is a physics-based synthetic dataset rendered in Blender with controlled variations in terrain geometry and illumination angles. The fact that MAFT achieves the highest IoU on all three datasets—88.90% on TWMARS-V2, 96.62% on MarsData-V2, and 92.80% on SynMars—without any dataset-specific tuning indicates its robustness to heterogeneous Martian imaging conditions.

The data augmentation pipeline includes brightness, contrast, and blur variations that partially simulate photometric degradation under different solar angles and dust levels. A systematic evaluation under temporally continuous dust and illumination changes remains an open direction for future work.

4.5.4. Visualization Comparison

To qualitatively evaluate the segmentation behavior of different methods, we compare representative prediction results on the TWMARS-V2, MarsData-V2, and SynMars test sets. In the binary evaluation maps, white, black, red, and green denote true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), respectively.

As shown in Figure 11, MAFT produces notably fewer error regions than the methods compared on the TWMARS-V2 test set. In Figure 11a,b, the scenes contain numerous small rocks against a sandy background of similar color and texture. SegFormer and ConvNeXt exhibit extensive green FN regions, indicating that most small rock instances are missed. MarsNet shows fewer missed detections but introduces red FP artifacts in the background. In contrast, MAFT preserves small rock structures while suppressing background misclassification.

This behavior is consistent with the role of AKConv, which enables adaptive sampling to capture fine-grained boundaries at small scales. In Figure 11c,d, rocks are partially obscured by dust, producing ambiguous rock-background boundaries. MarsNet and ConvNeXt incorrectly classify portions of the sandy background as rocks, visible as red FP regions, likely due to the lack of effective boundary-aware feature refinement. MAFT avoids such errors, which we attribute to the EMCA module that enhances boundary discrimination under low-texture conditions through multi-dimensional attention. In Figure 11e, the scene contains complex terrain composed of rocks with diverse shapes, sizes, and textures. MAFT demonstrates the most complete rock coverage with minimal false positives, indicating that the combination of IAFFormer global context modeling and AKConv local adaptive sampling provides robust segmentation across diverse rock morphologies.

Figure 12 presents the visualization results on the MarsData-V2 test set. In Figure 12a, the scene contains dust-covered rocks where surface texture is heavily degraded. MarsNet produces large missed-detection regions visible as green FN areas, while MAFT maintains more complete rock masks. This improvement is consistent with the benefit of the EMCA hybrid pooling strategy, which retains high-activation boundary cues that are otherwise smoothed out by average pooling alone. In Figure 12b–e, which feature larger rocks with clearer boundaries, all methods perform reasonably well, but MAFT consistently generates more continuous boundary predictions with fewer fragmented FP or FN artifacts.

Figure 13 shows the visualization results on the SynMars test set. Compared with the real rover datasets, SynMars contains a larger number of small, scattered rock instances, and the contrast between rocks and the sandy background is often low. Under such conditions, SegFormer and Swin Transformer miss many small instances, producing extensive green FN regions. ConvNeXt generates scattered red FP artifacts in background areas. MAFT preserves more small rock structures while maintaining relatively clean background predictions. Some false positives still occur when rock-like background textures closely resemble true rock regions, indicating that complex small-rock distributions remain challenging for all methods. Nevertheless, MAFT achieves the most balanced behavior between recall and precision among all compared methods.

4.5.5. Violin Plot Comparison

As shown in Figure 14, we use violin plots to analyze the stability of the IoU scores across five experimental runs on the TWMARS-V2 dataset. Compared with a single average value, the violin plot provides additional information about the distribution and variability of the results. The compact distribution of MAFT indicates that its performance remains stable across repeated runs, while its average IoU is higher than that of the compared methods.

4.6. Analysis of Morphological Parameters of Martian Rocks

The quantity and abundance of Martian rocks are key geological indicators that reveal impact events on the Martian surface and the cumulative effects of related ejecta [60]. To achieve accurate quantitative analysis of Martian rocks, we follow the photogrammetric approach of Wang et al. [61] and implement a rock measurement workflow adapted to the NaTeCam parameters of the Zhurong rover, applied to the segmentation outputs of MAFT. The workflow proceeds as follows: first, typical site samples are selected from the dataset to generate Digital Orthoimage Maps (DOMs) of the Martian surface (Figure 15); the DOMs are imported into the MAFT model for rock semantic segmentation; finally, an interactive measurement tool developed based on the internal and external parameters of NaTeCam is used to quantify the morphological parameters of rocks in the segmentation results.

The morphological parameters of rocks are defined as follows: the diameter is the maximum distance between the left and right endpoints in the direction perpendicular to the image pointing direction, and the height is the maximum elevation difference between the top of the rock and the surrounding terrain. Based on the above workflow, this study statistically analyzed the distribution characteristics of the number of rocks in the dataset with their diameters and heights, and the results are presented in the form of histograms as shown in Figure 16.

Most rocks in the study area have diameters concentrated around 10 cm, with heights mainly in the range of 2–5 cm. This distribution pattern is closely related to the geological evolution history of Utopia Planitia and may correspond to frequent small to medium-sized impact events during the late Amazonian period on Mars. These statistics are consistent with the expected ejecta size distribution for the Utopia Planitia region and may serve as a reference for rover path planning.

5. Conclusions

In this paper, we propose MAFT, a lightweight network specifically designed for Martian rock segmentation. MAFT integrates IAFFormer, AKConv, and EMCA to balance global context modeling, morphology-adaptive local feature extraction, and boundary-sensitive feature refinement. The IAFFormer backbone captures global dependencies through frequency-domain feature interaction with only 2.97 M parameters. AKConv is introduced to replace fixed convolutional sampling with adaptive sampling positions, improving the representation of irregular and multi-scale Martian rocks. EMCA further enhances feature discrimination for dust-covered and low-texture rocks through multidimensional attention and hybrid pooling.

We also construct TWMARS-V2 by re-annotating Zhurong rover images with denser rock masks, especially for small rocks that were omitted in the original annotations. Extensive experiments are conducted on TWMARS-V2, MarsData-V2, and SynMars. The results show that MAFT outperforms the compared baselines in IoU across the three datasets while maintaining a low parameter count and competitive computational cost. These results indicate that MAFT provides a favorable accuracy–efficiency trade-off for RGB-based Martian rock segmentation.

Certain challenging cases remain unresolved. Rocks that are nearly fully buried under regolith are difficult to detect because only limited texture and shape cues are visible. Large-area shadow occlusion also remains challenging, as near-uniform pixel intensities weaken boundary gradients and may cause missed detections or merged adjacent rocks. These limitations are common to RGB-only segmentation methods. Regarding deployment feasibility, although MAFT achieves the highest CPU inference speed among all methods compared with only 2.97 M parameters, a considerable gap remains between current desktop-level test conditions and the stringent constraints of space-qualified rover processors. Future work will address both limitations: on the sensing side, we will explore multimodal data fusion incorporating depth or infrared cues; on the deployment side, we will investigate model quantization, knowledge distillation, and hardware-aware compilation to enable real-time onboard inference under strict power budgets.

Author Contributions

Conceptualization, C.L., Y.J., G.W., Q.M. and Z.W.; Methodology, C.L.; Software, C.L.; Validation, C.L. and Y.J.; Formal analysis, C.L., Y.W. and J.L. (Jia Liu 1); Investigation, C.L., Y.J., G.W., Q.M. and J.L. (Jia Liu 2); Resources, Y.J. and B.W.; Data curation, C.L., Q.M. and Y.W.; Writing—original draft, C.L., Y.J., G.W., Q.M., J.L. (Jia Liu 1), Y.W., B.W., J.L. (Jia Liu 2) and Z.W.; Writing—review & editing, Y.J., G.W., Z.W. and B.W.; Visualization, C.L. and Y.J.; Supervision, Y.J. and Z.W.; Project administration, G.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author. The data are available at https://github.com/milkct/TWMARS-V2 (accessed on 13 May 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Liu, H.; Yao, M.; Xiao, X.; Cui, H. A hybrid attention semantic segmentation network for unstructured terrain on Mars. Acta Astronaut. 2023, 204, 492–499.
2. Feng, W.; Ding, L.; Zhou, R.; Xu, C.; Yang, H.; Gao, H.; Liu, G.; Deng, Z. Learning-Based End-to-End Navigation for Planetary Rovers Considering Non-Geometric Hazards. IEEE Robot. Autom. Lett. 2023, 8, 4084–4091.
3. Rogers, A.D.; Aharonson, O.; Bandfield, J.L. Geologic context of in situ rocky exposures in Mare Serpentis, Mars: Implications for crust and regolith evolution in the cratered highlands. Icarus 2009, 200, 446–462.
4. Garvin, J.; Edgett, K.; Dotson, R.; Fey, D.; Herkenhoff, K.; Hallet, B.; Kennedy, M. Quantitative Relief Models of Rock Surfaces on Mars at Sub-millimeter Scales from Mars Curiosity Rover Mars Hand Lens Imager (MAHLI) Observations: Geologic Implications. Microsc. Microanal. 2017, 23, 2146–2147.
5. Huang, G.; Yang, L.; Cai, Y.; Zhang, D. Terrain classification-based rover traverse planner with kinematic constraints for Mars exploration. Planet. Space Sci. 2021, 209, 105371.
6. Changela, H.G.; Chatzitheodoridis, E.; Antunes, A.; Beaty, D.; Bouw, K.; Bridges, J.C.; Capova, K.A.; Cockell, C.S.; Conley, C.A.; Dadachova, E.; et al. Mars: New insights and unresolved questions. Int. J. Astrobiol. 2021, 20, 394–426.
7. Fassett, C.I. Analysis of impact crater populations and the geochronology of planetary surfaces in the inner solar system. J. Geophys. Res. Planets 2016, 121, 1900–1926.
8. Golombek, M.; Rapp, D. Size-frequency distributions of rocks on Mars and Earth analog sites: Implications for future landed missions. J. Geophys. Res. Planets 1997, 102, 4117–4129.
9. Gerdes, L.; Azkarate, M.; Sánchez-Ibáñez, J.R.; Joudrier, L.; Perez-del-Pulgar, C.J. Efficient autonomous navigation for planetary rovers with limited resources. J. Field Robot. 2020, 37, 1153–1170.
10. Hood, D.R.; Sholes, S.F.; Karunatillake, S.; Fassett, C.I.; Ewing, R.C.; Levy, J. The Martian Boulder Automatic Recognition System, MBARS. Earth Space Sci. 2022, 9, e2022EA002410.
11. Bickel, V.T.; Aaron, J.; Manconi, A.; Loew, S.; Mall, U. Impacts drive lunar rockfalls over billions of years. Nat. Commun. 2020, 11, 2862.
12. Yang, C.; Zhao, H.; Bruzzone, L.; Benediktsson, J.A.; Liang, Y.; Liu, B.; Zeng, X.; Guan, R.; Li, C.; Ouyang, Z. Lunar impact crater identification and age estimation with Chang’E data by deep and transfer learning. Nat. Commun. 2020, 11, 6358.
13. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017.
14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010.
15. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
16. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
17. Sun, H.; Wang, Y.; Wang, X.; Zhang, B.; Xin, Y.; Zhang, B.; Cao, X.; Ding, E.; Han, S. MAFormer: A transformer network with multi-scale attention fusion for visual recognition. Neurocomputing 2024, 595, 127828.
18. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Proceedings of the 35th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 6–14 December 2021; pp. 12077–12090.
19. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. PVT v2: Improved baselines with Pyramid Vision Transformer. Comput. Vis. Media 2022, 8, 415–424.
20. Xiong, Y.; Xiao, X.; Yao, M.; Cui, H.; Fu, Y. Light4Mars: A lightweight transformer model for semantic segmentation on unstructured environment like Mars. ISPRS J. Photogramm. Remote Sens. 2024, 214, 12.
21. Zhang, X.; Song, Y.; Song, T.; Yang, D.; Ye, Y.; Zhou, J.; Zhang, L. AKConv: Convolutional Kernel with Arbitrary Sampled Shapes and Arbitrary Number of Parameters. arXiv 2023.
22. Yu, Y.; Zhang, Y.; Cheng, Z.; Song, Z.; Tang, C. MCA: Multidimensional collaborative attention in deep convolutional neural networks for image recognition. Eng. Appl. Artif. Intell. 2023, 126, 107079.
23. Xiong, Y.; Xiao, X.; Yao, M.; Liu, H.; Yang, H.; Fu, Y. MarsFormer: Martian Rock Semantic Segmentation with Transformer. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4600612.
24. Liu, H.; Yao, M.; Xiao, X.; Xiong, Y. RockFormer: A U-Shaped Transformer Network for Martian Rock Segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4600116.
25. Smith, D.E.; Zuber, M.T.; Frey, H.V.; Garvin, J.B.; Head, J.W.; Muhleman, D.O.; Pettengill, G.H.; Phillips, R.J.; Solomon, S.C.; Zwally, H.J. Mars Orbiter Laser Altimeter: Experiment summary after the first year of global mapping of Mars. J. Geophys. Res. Planets 2001, 106, 23689–23722.
26. Qiao, W.; Zhao, Y.; Xu, Y.; Lei, Y.; Wang, Y.; Yu, S.; Li, H. Deep learning-based pixel-level rock fragment recognition during tunnel excavation using instance segmentation model. Tunn. Undergr. Space Technol. 2021, 115, 104072.
27. Canny, J. A Computational Approach to Edge Detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, PAMI-8, 679–698.
28. Furlán, F.; Rubio, E.; Sossa, H.; Ponce, V. Rock Detection in a Mars-Like Environment Using a CNN. In Pattern Recognition, 11th Mexican Conference, MCPR 2019, Querétaro, Mexico, 26–29 June 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 149–158.
29. Kuang, B.; Wisniewski, M.; Rana, Z.A.; Zhao, Y. Rock Segmentation in the Navigation Vision of the Planetary Rovers. Mathematics 2021, 9, 3048.
30. Ebadi, K.; Coble, K.; Kogan, D.; Atha, D.; Schwartz, R.; Padgett, C.; Hook, J.V. Semantic Mapping in Unstructured Environments: Toward Autonomous Localization of Planetary Robotic Explorers. In Proceedings of the IEEE Aerospace Conference, Big Sky, MT, USA, 5–12 March 2022; pp. 1–10.
31. Lv, W.; Wei, L.; Zheng, D.; Liu, Y.; Wang, Y. MarsNet: Automated Rock Segmentation with Transformers for Tianwen-1 Mission. IEEE Geosci. Remote Sens. Lett. 2023, 20, 3506605.
32. Li, J.; Chen, K.; Tian, G.; Li, L.; Shi, Z. MarsSeg: Mars Surface Semantic Segmentation with Multi-level Extractor and Connector. IEEE Trans. Geosci. Remote Sens. 2024, 63, 4501012.
33. Wei, P.; Sun, Z.; Tian, H. LBNet: A Lightweight Bilateral Network for Semantic Segmentation of Martian Rock. IEEE Access 2024, 12, 182137–182144.
34. Wei, P.; Sun, Z.; Tian, H. Rocknet: Lightweight network for real-time segmentation of Martian rocks. J. Real-Time Image Process. 2025, 22, 41.
35. Jia, Y.; Wan, G.; Li, W.; Li, C.; Liu, J.; Cong, D.; Liu, L. EDR-TransUnet: Integrating Enhanced Dual Relation-Attention with Transformer U-Net for Multiscale Rock Segmentation on Mars. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4601416.
36. Ma, Y.; Li, Z.; Wu, B.; Duan, R. DepthFormer: Depth-enhanced transformer network for semantic segmentation of the Martian surface from rover images. Earth Space Sci. 2025, 12, e2024EA003812.
37. Lin, B.; Wang, F.; Li, Q.; Zheng, B.; Yao, M.; Xiao, X.; Qi, Y.; Cui, H.; Huang, X. LisseMars: A Lightweight Semantic Segmentation Model for Mars Helicopter. Aerospace 2025, 12, 1049.
38. Dong, B.; Wang, P.; Wang, F. Head-Free Lightweight Semantic Segmentation with Linear Transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023.
39. Ma, C.; Li, Y.; Lv, J.; Xiao, Z.; Zhang, W.; Mo, L. Automated Rock Detection from Mars Rover Image via Y-Shaped Dual-Task Network with Depth-Aware Spatial Attention Mechanism. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4600418.
40. Thompson, D.R.; Castano, R. Performance Comparison of Rock Detection Algorithms for Autonomous Planetary Geology. In Proceedings of the IEEE Aerospace Conference, Big Sky, MT, USA, 3–10 March 2007; pp. 1–9.
41. Niekum, S. Reliable Rock Detection and Classification for Autonomous Science. Master’s Thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 2008.
42. Wang, C.; Zhang, Z.; Zhang, Y.; Tian, R.; Ding, M. GMSRI: A Texture-Based Martian Surface Rock Image Dataset. Sensors 2021, 21, 5410.
43. Xiao, X.; Yao, M.; Liu, H.; Wang, J.; Zhang, L.; Fu, Y. A Kernel-Based Multi-Featured Rock Modeling and Detection Framework for a Mars Rover. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 3335–3344.
44. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
45. Zelinsky, A. Learning OpenCV: Computer Vision with the OpenCV Library. IEEE Robot. Autom. Mag. 2009, 16, 100.
46. Keskar, N.S.; Mudigere, D.; Nocedal, J.; Smelyanskiy, M.; Tang, P.T.P. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv 2016, arXiv:1609.04836.
47. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159.
48. De, S.; Mukherjee, A.; Ullah, E. Convergence guarantees for RMSProp and ADAM in non-convex optimization and an empirical comparison to Nesterov acceleration. arXiv 2018, arXiv:1807.06766.
49. Golombek, M.P.; Trussell, A.R.; Williams, N.R.; Charalambous, C.; Abarca, H.; Warner, N.H.; Deahn, M.; Trautman, M.R.; Crocco, B.; Grant, J.A.; et al. Rock Size-Frequency Distributions at the InSight Landing Site, Mars. Earth Space Sci. 2021, 8, e2021EA001959.
50. Han, X.; Papyan, V.; Donoho, D.L. Neural collapse under mse loss: Proximity to and dynamics on the central path. arXiv 2021, arXiv:2106.02073.
51. Zhang, J.; Li, X.; Li, J.; Liu, L.; Xue, Z.; Zhang, B.; Jiang, Z.; Huang, T.; Wang, Y.; Wang, C. Rethinking Mobile Block for Efficient Attention-based Models. arXiv 2023.
52. Wu, H.; Zhang, J.; Huang, K.; Liang, K.; Yu, Y. FastFCN: Rethinking Dilated Convolution in the Backbone for Semantic Segmentation. arXiv 2019.
53. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
54. Xu, J.; Xiong, Z.; Bhattacharyya, S.P. PIDNet: A real-time semantic segmentation network inspired by PID controllers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19529–19539.
55. Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178.
56. Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.C.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R. Searching for MobileNetV3. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324.
57. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, 18th International Conference, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
58. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239.
59. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. arXiv 2022.
60. Di, K.; Xu, B.; Peng, M.; Yue, Z.; Liu, Z.; Wan, W.; Li, L.; Zhou, J. Rock size-frequency distribution analysis at the Chang’E-3 landing site. Planet. Space Sci. 2016, 120, 103–112.
61. Wang, B.; Gou, S.; Di, K.; Wan, W.; Peng, M.; Zhao, C.; Zhang, Y.; Xie, B. Rock size-frequency distribution analysis at the Zhurong landing site based on Navigation and Terrain Camera images along the entire traverse. Icarus 2024, 413, 116001.

Figure 2. Architecture of the proposed MAFT network. (a) Overall framework of MAFT. (b) Detailed structure of the AKConv module. (c) Internal architecture of the IAFFormer backbone, comprising the Prototype Learning (PL) and Pixel Descriptor (PD) branches.

Figure 3. Six examples of AKConv sampling patterns when the kernel size is set to 5, illustrating the ability to generate diverse non-rectangular receptive fields. (a) Vertical “F”-shaped pattern; (b) Diagonal pattern; (c) “C”-shaped pattern; (d) “T”-shaped pattern; (e) Multi-branched pattern; (f) Cross-shaped pattern.

Figure 4. Schematic comparison of standard fixed-grid convolution and AKConv adaptive sampling on Martian rock scenes. Panels (a,d,g) show original images; panels (b,e,h) show fixed-grid sampling; and panels (c,f,i) show conceptual AKConv sampling. Green points denote adaptive sampling positions.
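The contrast between fixed-grid and adaptive sampling in this caption can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes a single-channel feature map stored as a list of lists, a hypothetical cross-shaped five-point pattern, and hand-picked offsets standing in for the offsets a network would predict from the input. The helper names `bilinear` and `sample_response` are illustrative only.

```python
import math

def bilinear(fmap, y, x):
    """Bilinearly interpolate fmap at fractional (y, x); out-of-range neighbors count as 0."""
    h, w = len(fmap), len(fmap[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * fmap[yy][xx]
    return value

def sample_response(fmap, center, pattern, offsets, weights):
    """Weighted sum of samples taken at center + pattern point + per-point offset."""
    cy, cx = center
    total = 0.0
    for (py, px), (oy, ox), wgt in zip(pattern, offsets, weights):
        total += wgt * bilinear(fmap, cy + py + oy, cx + px + ox)
    return total

# Toy 5 x 5 feature map with value 10 * row + column.
fmap = [[10 * r + c for c in range(5)] for r in range(5)]

# Hypothetical cross-shaped pattern with five sampling points and equal weights.
cross = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
weights = [0.2] * 5

# Fixed sampling: every offset is zero, so samples fall exactly on grid points.
fixed = sample_response(fmap, (2, 2), cross, [(0.0, 0.0)] * 5, weights)

# Adaptive sampling: assumed offsets move each point to a fractional position.
adaptive = sample_response(fmap, (2, 2), cross, [(0.5, 0.25)] * 5, weights)
```

With zero offsets the function reduces to an ordinary weighted sum over grid points. Non-zero offsets move the samples off the grid, which is why interpolation is needed.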

Figure 5. Overall framework of the EMCA module. The three parallel branches correspond to the height branch (H-branch), width branch (W-branch), and channel branch (C-branch). Different colors distinguish branches and feature tensors for visualization only and do not indicate semantic classes.

Figure 6. Flow chart of the squeeze transformation within each EMCA branch. The input tensor is processed by three parallel pooling operations: average pooling, standard deviation pooling, and max pooling. Their outputs are adaptively fused using learnable weights α, β, and γ. Different colors distinguish pooling branches and intermediate tensors for visualization only and do not indicate semantic classes.
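The fusion described in this caption can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' code: the names `hybrid_squeeze` and `squeeze_channels` are hypothetical, the three weights are plain floats rather than learned parameters, the population standard deviation is assumed, and any normalization of α, β, and γ that EMCA may apply is omitted. The example pools over the spatial positions of each channel; the other branches would pool over different axes.

```python
import statistics

def hybrid_squeeze(values, alpha, beta, gamma):
    """Fuse average, standard-deviation, and max pooling of one group of activations."""
    avg = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    peak = max(values)
    return alpha * avg + beta * std + gamma * peak

def squeeze_channels(feature, alpha=0.5, beta=0.25, gamma=0.25):
    """Reduce each channel (a list of rows) to a single descriptor."""
    descriptors = []
    for channel in feature:
        flat = [v for row in channel for v in row]
        descriptors.append(hybrid_squeeze(flat, alpha, beta, gamma))
    return descriptors

# Two toy channels: one textured, one perfectly flat.
feature = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
descriptors = squeeze_channels(feature)
```

The flat channel contributes nothing through the standard-deviation term, whereas the textured channel does. This shows how the extra statistic can separate regions that share a similar mean intensity.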

Figure 7. Comparison between TWMARS and TWMARS-V2 annotations. (a) Original Martian rock images captured by the Zhurong rover. (b) Original TWMARS annotation masks, where white indicates annotated rock regions and black indicates unlabeled background or omitted small rocks. (c) Improved TWMARS-V2 annotation masks, where white denotes rocks already labeled in TWMARS and yellow denotes newly annotated small rocks that were omitted in the original TWMARS.

Figure 8. Sample images and corresponding Ground Truth (GT) annotations. (a) MarsData-V2: real images from the Curiosity rover’s Mastcam camera. (b) SynMars: synthetic images rendered in Blender with Tianwen-1 camera parameters.

Figure 9. Visual ablation comparison on three datasets. Columns show the input image, ground truth, complete MAFT, and three ablated variants. MAFT-1 replaces IAFFormer with ResNet-50, MAFT-2 replaces AKConv with standard 3 × 3 convolution, and MAFT-3 removes EMCA. Red circles highlight representative regions where the complete MAFT better preserves small rocks, irregular boundaries, or dust-obscured rock structures.

Figure 10. Scatter plot of IoU versus FLOPs on the TWMARS-V2 dataset. Different markers represent CNN-based, UNet-based, Transformer-based, Martian rock segmentation, and the proposed models. The upper-left region indicates higher accuracy with lower computational cost.

Figure 11. Visualization comparison on the TWMARS-V2 test set for samples (a–e). Columns show the original image, ground truth, and color-coded evaluation maps for SegFormer, ConvNeXt, Swin Transformer, MarsNet, and MAFT. White = TP, black = TN, red = FP, and green = FN.

Figure 12. Visualization comparison on the MarsData-V2 test set for samples (a–e). Columns show the original image, ground truth, and color-coded evaluation maps for SegFormer, ConvNeXt, Swin Transformer, MarsNet, and MAFT. White = TP, black = TN, red = FP, and green = FN.

Figure 13. Visualization comparison on the SynMars test set for samples (a–e). Columns show the original image, ground truth, and color-coded evaluation maps for SegFormer, ConvNeXt, Swin Transformer, MarsNet, and MAFT. White = TP, black = TN, red = FP, and green = FN.
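The color coding used in Figures 11–13 can be reproduced from a predicted mask and a ground-truth mask with a few lines of code. The sketch below is illustrative only: masks are assumed to be nested lists of 0/1 values, the RGB triples for red and green are assumed to be pure primaries, and the names `classify_pixel` and `error_map` are hypothetical.

```python
# Assumed RGB values for the four outcomes named in the captions.
COLORS = {
    "TP": (255, 255, 255),  # white: rock predicted, rock in ground truth
    "TN": (0, 0, 0),        # black: background predicted, background in ground truth
    "FP": (255, 0, 0),      # red: rock predicted, background in ground truth
    "FN": (0, 255, 0),      # green: background predicted, rock in ground truth
}

def classify_pixel(pred, truth):
    """Return the confusion-matrix label for one pixel."""
    if pred and truth:
        return "TP"
    if pred:
        return "FP"
    if truth:
        return "FN"
    return "TN"

def error_map(pred_mask, gt_mask):
    """Build an RGB image (nested lists of tuples) that colors every pixel by outcome."""
    return [
        [COLORS[classify_pixel(p, g)] for p, g in zip(pred_row, gt_row)]
        for pred_row, gt_row in zip(pred_mask, gt_mask)
    ]

pred = [[1, 1], [0, 0]]
truth = [[1, 0], [1, 0]]
colored = error_map(pred, truth)
```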

Figure 14. IoU values of all models across five experimental runs on the TWMARS-V2 dataset (the five-pointed star represents the mean IoU).

Figure 15. Digital Orthoimage Map (DOM) generated from stereo image pairs captured by the Zhurong rover’s Navigation and Terrain Camera (NaTeCam).

Figure 16. Relationship between rock count and (a) rock diameter and (b) rock height in TWMARS-V2.

Table 1. Component ablation results on the TWMARS-V2 dataset. Variant (a) is the original AFFormer without any modification, serving as the baseline. Variant (b) adds AKConv to both the preprocessing stage and the pixel descriptor module. Variant (c) adds only EMCA as the decoder. Variant (d) uses AKConv together with the original MCA attention module instead of EMCA. Variant (e) is the complete MAFT model integrating both AKConv and EMCA. × denotes that the corresponding module is not included in the model, while √ denotes that the module is included.

Variant	AKConv	EMCA	Params (M)	FLOPs (G)	IoU (%)	PA (%)	Pre (%)	F1 (%)
(a)	×	×	3.02	14.68	84.32	96.91	90.15	91.50
(b)	√	×	2.97	15.22	86.58	97.49	91.83	92.81
(c)	×	√	3.00	14.96	86.13	97.35	91.52	92.55
(d)	√	MCA	2.97	15.37	87.52	97.83	92.41	93.34
(e) MAFT	√	√	2.97	15.49	88.90	98.17	93.43	94.12
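For readers who want to recompute the reported metrics from pixel counts, the following sketch uses the standard binary definitions for the rock class. It assumes these match the definitions in the paper's evaluation-metrics section, which is not reproduced here; the function name `binary_metrics` and the example counts are hypothetical.

```python
def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (for example, an empty mask)."""
    return numerator / denominator if denominator else 0.0

def binary_metrics(tp, fp, fn, tn):
    """IoU, pixel accuracy, precision, recall, and F1 from confusion counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "IoU": _ratio(tp, tp + fp + fn),
        "PA": _ratio(tp + tn, tp + fp + fn + tn),
        "Pre": precision,
        "Recall": recall,
        "F1": _ratio(2 * precision * recall, precision + recall),
    }

# Hypothetical counts for one image: 80 rock pixels found, 10 false alarms, 10 misses.
metrics = binary_metrics(tp=80, fp=10, fn=10, tn=900)
```

Under these single-class definitions, F1 and IoU are linked by F1 = 2·IoU / (1 + IoU). An IoU of 88.90% therefore corresponds to an F1 of about 94.12%, which agrees with the MAFT row above.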

Table 2. Progressive computational cost on the TWMARS-V2 dataset.

Configuration	Params (M)	FLOPs (G)	Δ FLOPs (G)	IoU (%)
AFFormer baseline	3.02	14.68	-	84.32
AFFormer + AKConv	2.97	15.22	+0.54	86.58
ViT	144.06	385.46	-	80.75
AFFormer + AKConv + EMCA (MAFT)	2.97	15.49	+0.27	88.90
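The Δ FLOPs column is the difference between consecutive configurations, and the same bookkeeping can be applied to IoU. The short sketch below recomputes both from the three progressive rows of the table. The ViT row is left out because it has no Δ entry, and the function name `progressive_deltas` is hypothetical.

```python
# (configuration, FLOPs in G, IoU in %) copied from the progressive rows of Table 2.
steps = [
    ("AFFormer baseline", 14.68, 84.32),
    ("AFFormer + AKConv", 15.22, 86.58),
    ("AFFormer + AKConv + EMCA (MAFT)", 15.49, 88.90),
]

def progressive_deltas(rows):
    """Return (name, FLOPs increase, IoU increase) relative to the previous row."""
    deltas = []
    for (_, prev_flops, prev_iou), (name, flops, iou) in zip(rows, rows[1:]):
        deltas.append((name, round(flops - prev_flops, 2), round(iou - prev_iou, 2)))
    return deltas

deltas = progressive_deltas(steps)
total_flops = round(steps[-1][1] - steps[0][1], 2)
total_iou = round(steps[-1][2] - steps[0][2], 2)
```

By this arithmetic, the two modules together add 0.81 G FLOPs over the baseline for a gain of 4.58 IoU points.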

Table 3. Quantitative ablation results of degraded variants on the TWMARS-V2 dataset; the corresponding qualitative comparison is shown in Figure 9. √ indicates the original module is retained; × indicates removal or replacement. In MAFT-1, IAFFormer is replaced by ResNet-50. In MAFT-2, AKConv is replaced by standard 3 × 3 convolutions. In MAFT-3, EMCA is removed.

Method	AKConv	EMCA	IAFFormer	IoU (%)	F1 (%)	Recall (%)
MAFT-1	√	√	×	82.15	90.23	91.18
MAFT-2	×	√	√	86.35	92.68	93.76
MAFT-3	√	×	√	86.58	92.81	93.80
MAFT	√	√	√	88.90	94.12	94.82

Table 4. Quantitative evaluation of different methods on three datasets (all metrics in %).

Category	Method	TWMARS-V2				MarsData-V2				SynMars			
		Pre	IoU	PA	F1	Pre	IoU	PA	F1	Pre	IoU	PA	F1
Martian rock methods	NI-U-Net++	82.63	70.14	78.49	82.45	92.67	89.05	91.20	94.21	87.59	83.15	94.52	90.80
Martian rock methods	MarsNet	92.58	84.56	88.17	91.63	93.29	90.26	91.53	94.88	93.46	91.59	92.10	95.61
CNN-based	EMO-5M	89.67	80.24	83.49	89.04	93.75	90.24	92.34	94.87	90.64	84.66	85.59	91.69
CNN-based	FastFCN	85.36	78.54	82.17	87.98	90.28	87.56	89.52	93.37	86.54	82.29	84.58	90.28
CNN-based	DeepLabV3+	91.70	78.71	83.29	88.09	97.56	94.30	95.18	97.07	93.40	89.33	91.28	94.36
CNN-based	PIDNet-S	89.58	82.49	86.30	90.40	86.14	75.83	84.23	86.25	88.72	80.14	85.11	88.98
CNN-based	MobileViT-S	91.24	84.16	87.20	91.40	87.80	77.62	84.89	87.40	90.34	81.93	85.93	90.07
CNN-based	MobileNetV3	90.12	81.59	84.25	89.86	97.09	92.47	93.46	96.09	93.88	92.47	93.59	96.09
UNet-based	UNet	80.54	72.26	76.58	83.90	91.59	81.52	88.56	89.82	88.17	76.75	82.58	86.85
UNet-based	PSPNet	92.48	81.59	88.54	89.86	96.39	92.11	94.23	95.89	92.00	84.27	89.56	91.46
Transformer-based	SegFormer	92.16	83.91	97.53	91.25	94.83	91.80	92.68	96.27	92.33	83.49	88.17	91.00
Transformer-based	ViT	89.26	80.75	97.95	89.35	93.91	91.30	92.13	95.45	90.61	81.42	83.56	89.76
Transformer-based	Swin Transformer	91.30	86.20	97.87	92.59	96.51	93.38	97.53	96.58	91.30	83.40	85.25	90.95
Transformer-based	MAFT	93.43	88.90	98.17	94.12	98.18	96.62	98.64	98.28	94.37	92.80	92.93	96.27

Table 5. Computational complexity and inference speed of different methods. GPU FPS is measured on an RTX 3090. CPU FPS is measured on an Intel Core i7-12800HX using the TWMARS-V2 dataset with 512 × 512 input resolution.

Category	Method	Params (M)	FLOPs (G)	GPU FPS			CPU FPS
				TWMARS-V2	MarsData-V2	SynMars	TWMARS-V2
Martian rock methods	NI-U-Net++	13.45	44.60	19.30	19.00	19.20	3.65
Martian rock methods	MarsNet	33.21	240.38	31.21	30.80	30.50	2.20
CNN-based	EMO-5M	10.28	16.05	12.40	12.10	12.21	2.32
CNN-based	FastFCN	68.71	60.52	10.61	10.40	10.54	1.90
CNN-based	PIDNet-S	3.67	15.65	31.62	32.23	31.92	6.50
CNN-based	MobileViT-S	5.60	15.70	31.90	32.80	32.54	4.30
CNN-based	MobileNetV3	3.28	11.60	30.53	30.26	30.28	7.60
CNN-based	ConvNeXt [59]	122.10	100.58	4.24	4.18	4.26	0.13
UNet-based	UNet	7.75	18.08	5.90	5.75	5.80	1.12
UNet-based	PSPNet	58.95	234.90	12.46	12.89	12.35	2.22
Transformer-based	SegFormer	3.72	20.34	9.64	9.90	9.51	0.60
Transformer-based	ViT	144.06	385.46	1.59	1.62	1.60	0.06
Transformer-based	Swin Transformer	58.95	241.66	25.53	26.11	25.34	1.40
Transformer-based	MAFT	2.97	15.49	35.25	35.94	35.96	8.46
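Frames-per-second figures such as those above are typically obtained by timing repeated forward passes after a warm-up phase. The sketch below shows one common way to do this on a CPU. It is an assumption about the general procedure, not a description of the authors' protocol: `measure_fps` and `dummy_infer` are hypothetical names, and the dummy workload merely stands in for a model's forward pass.

```python
import time

def measure_fps(infer, sample, warmup=5, runs=20):
    """Average frames per second of infer(sample) over `runs` timed calls."""
    for _ in range(warmup):
        infer(sample)  # warm-up calls are excluded from the timing
    start = time.perf_counter()
    for _ in range(runs):
        infer(sample)
    elapsed = time.perf_counter() - start
    return runs / elapsed if elapsed > 0 else float("inf")

def dummy_infer(image):
    """Stand-in workload: sums all pixels instead of running a network."""
    return sum(sum(row) for row in image)

image = [[1] * 64 for _ in range(64)]
fps = measure_fps(dummy_infer, image)
```

On a GPU, the device would also need to be synchronized before each clock reading, because kernels are launched asynchronously.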

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, C.; Jia, Y.; Wan, G.; Ma, Q.; Liu, J.; Wang, Y.; Wang, B.; Liu, J.; Wei, Z. MAFT: A Lightweight Network for Martian Rock Segmentation Based on an Adaptive Frequency Transformer. Remote Sens. 2026, 18, 1794. https://doi.org/10.3390/rs18111794

