LisseMars: A Lightweight Semantic Segmentation Model for Mars Helicopter

Lin, Boyu; Wang, Fei; Li, Qingeng; Zheng, Bo; Yao, Meibao; Xiao, Xueming; Qi, Yifan; Cui, Hutao; Huang, Xiangyu

doi:10.3390/aerospace12121049

Open AccessArticle

LisseMars: A Lightweight Semantic Segmentation Model for Mars Helicopter

by

Boyu Lin

^1,†

,

Fei Wang

^1,†,

Qingeng Li

¹,

Bo Zheng

²,

Meibao Yao

³

,

Xueming Xiao

^1,*,

Yifan Qi

¹,

Hutao Cui

⁴ and

Xiangyu Huang

⁵

¹

CVIR Lab, Changchun University of Science and Technology, Changchun 130022, China

²

Shanghai Aerospace Control Technology Institute, Shanghai 201109, China

³

School of Artificial Intelligence, Jilin University, Changchun 130012, China

⁴

School of Astronautics, Harbin Institute of Technology, Harbin 150001, China

⁵

Beijing Institute of Control Engineering, Beijing 100190, China

^*

Author to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Aerospace 2025, 12(12), 1049; https://doi.org/10.3390/aerospace12121049

Submission received: 9 October 2025 / Revised: 14 November 2025 / Accepted: 21 November 2025 / Published: 25 November 2025

(This article belongs to the Section Astronautics & Space Science)

Download

Browse Figures

Versions Notes

Abstract

With the continuous deepening of Mars exploration missions, the Mars helicopter has become a key platform for acquiring high-resolution near-ground imagery. However, accurate semantic segmentation of the Martian surface remains challenging due to complex terrain morphology, sandstorm interference, and the limited onboard computational resources that restrict real-time processing. Existing models either introduce high computational overhead unsuitable for deployment on Mars aerial platforms or fail to jointly capture fine-grained local texture and global contextual structure information. To address these limitations, we propose LisseMars, a lightweight semantic segmentation network designed for efficient onboard perception. The model integrates a Window Movable Attention (WMA) module for enhanced global context extraction and a multi-convolutional feedforward module (CFFN) to strengthen local detail representation. A Dynamic Polygon Convolution (DPC) module is further introduced to improve segmentation performance on geometrically heterogeneous objects, while a Group Fusion Module (GFM) enables effective multi-scale semantic integration. Extensive experiments are conducted on both real Tianwen-1 Mars helicopter imagery and synthetic datasets. The results show that our method achieved a mean IoU of 78.56% with only 0.12 MB of model parameters, validating the effectiveness of the proposed framework. The real-time performance of proposed method on edge device deployment further demonstrate potential application for real Mars airborne missions.

Keywords:

Mars exploration; semantic segmentation; transformer model

1. Introduction

In Mars exploration, researchers are seeking new technologies to enable smooth navigation and the discovery of scientific objects across the complex Martian terrain [1,2,3]. The success of the Mars helicopter “Ingenuity” [4,5] has reached a milestone as the first powered flight on another planet. It provides unprecedented perspectives and data for the Mars rover. The exploration of the Martian surface is crucial and indispensable to the Mars exploration mission. By processing image data, different scientific objects on the surface of Mars can be quickly identified [6,7] and therefore will effectively optimise the trajectory planning and exploration of the Mars rover [8,9,10].

With the continuous development of deep learning, neural networks have achieved remarkable success in the field of computer vision, particularly in semantic segmentation [11], object detection [12,13,14,15], 3D reconstruction [16,17,18] and object classification [19,20,21] tasks. For extraterrestrial exploration missions, semantic segmentation technology is crucial as it enables accurate identification and distinction of various surface objects, such as rocks, sand, and gravel. This capability not only aids in enhancing the navigation accuracy of Mars rovers but also has significant implications for effectively discovering scientific objects. However, Mars has diverse and complex unstructured terrain with harsh sandstorm environments, making it challenging to clearly distinguish the objects from the surrounding ground. As a result, it becomes essential for semantic segmentation networks to capture global contextual information so as to accurately explore the scene. Some studies [22,23] introduce pyramid structures embedded in CNNs to capture global contextual information at different scales. However, such methods typically extract local features from CNNs and then aggregate to global information, rather than directly encoding the global context. Consequently, in case of complex Martian scenes, CNN-based models often struggle to capture clear global information.

A type of solution is to design the transformer model [24], which can sufficiently extract global information. However, transformer-based methods still have some limitations, especially in the Martian environment. The original transformer architecture relies primarily on global contextual information and pays relatively limited attention to local details. The surface of Mars is scattered with various objects, which requires more precise processing of object locations and associated complex details, making the limited local focus of transformers even more problematic. To address this challenge, some researchers [25,26] have introduced convolutional position encoding and relative position encoding to improve the model’s ability to capture spatial and sequential dependencies. However, these methods may not fully express the spatial structure of the objects, which could adversely affect the model’s performance in complex, dynamic environments.

The deployment of sophisticated models on Mars rovers and helicopters is significantly constrained by their limited computational resources, necessitating a careful trade-off between model complexity and accuracy. Existing CNN-based or heavy Transformer-based models often fail to jointly preserve fine local boundaries and global contextual terrain cues, and many are unsuitable for deployment on Mars aerial platforms due to limited onboard computational resources. To this end, in this paper we proposed LisseMars, a lightweight Transformer-based segmentation network tailored for the unstructured, low-texture, and scale-varying Martian environments. LisseMars is built upon a U-shaped encoder–decoder architecture with skip connections. For the encoder, we proposed a Window Movable Attention (WMA) mechanism, which combines the local window partitioning with cross-window interaction to adaptively balance global context modeling and local detail preservation. This design will significantly enhance the robustness of semantic segmentation against the low-contrast and complex geological textures of Mars environment. Furthermore, to better handle rocks with geometrically heterogeneous structures, we design a Dynamic Polygon Convolution (DPC) that adjusts its sampling geometry according to local shape variations, allowing the network to more precisely characterize uneven, non-uniform object boundaries. For the decoder, we employ an enhanced Group Fusion Module (GFM), derived from UperNet [27], to hierarchically integrate multi-scale features and maintain semantic consistency across regions of varying size. As illustrated in Figure 1, the proposed LisseMars achieves a favorable balance between segmentation accuracy and model efficiency, demonstrating clear advantages over existing state-of-the-art methods and strong potential for real deployment in Mars exploration missions. The main contributions are as follows:

(1): We proposed a lightweight semantic segmentation model (LisseMars) for the Mars helicopter, which can effectively segment objects on Mars.
(2): We designed a novel Window Movable Attention Module (WMA), which enhances feature extraction and effectively promote the fusion of global and local information.
(3): We developed a Dynamic Polygon Convolution (DPC) that effectively focuses on irregular and uneven local features, enabling accurate segmentation of polygonal structures.

Figure 1.

m I o U

values of current SOTA models on SynMars-Air. Our proposed LisseMars model outperforms all the baselines.

Figure 1.

m I o U

values of current SOTA models on SynMars-Air. Our proposed LisseMars model outperforms all the baselines.

2. Related Works

2.1. Deep Learning for Mars

Perceiving unstructured environments on Mars has always been a key factor in autonomous navigation for current Mars missions [28,29]. For the Mars helicopter, accurately assessing the environment and identifying scientific objects are prerequisites for future successfully completing large-scale Mars exploration missions. This can provide crucial supports for the rovers’ navigation and path planning, thereby improving overall exploration efficiency [30,31]. Edge-based detection [32] and region-based detection [33] have been widely applied in the past. However, due to the complex and dynamic environment of Mars, relying solely on traditional methods is no longer sufficient to meet the requirements of future Mars exploration missions. With the rapid advancement of artificial intelligence, deep learning is gaining increasing attention [17].

Compared with classical semantic segmentation methods, CNN-based methods have significant advantages in feature extraction. Ref. [13] developed a lightweight deep learning framework named mini-SDD to detect various types of landforms on Mars. Ref. [34] using multi-scale dilated convolution in the extended path, integrating with the U-Net network for accurate segmentation of objects on Mars. Ref. [35] proposed Mobile-DeepRFB, a lightweight segmentation framework for Martian terrain classification.

Although existing studies have made notable progress in Martian terrain segmentation, several limitations remain insufficiently addressed. Traditional CNN-based approaches struggled to capture multi-scale contextual relationships and long-range dependencies, which are crucial for delineating geometrically heterogeneous rock boundaries and low-texture surface. To overcome these issues, recent works have attempted to integrate Transformers with convolutional architectures. SegMarsViT [36] combined ViT module with CNN model to jointly extract local and global features; MarsFormer [37] enhanced remote contextual modeling while preserving local details. RockFormer [38] designed a layered encoder–decoder structure for precise rock segmentation. HASS [39] incorporated multi-scale attention modules to improve the segmentation consistency. Light4Mars [40] further designed a lightweight window attention mechanism for Mars rover to characterize complex, unstructured surface. However, despite these advancements, current methods still face significant challenges, including insufficient fine-grained geometric modeling, limited robustness in extreme multi-scale scenarios, high computational and memory overhead due to heavy global-attention modules, and difficulty achieving lightweight onboard deployment on Mars aerial or rover platforms with strict power and hardware constraints.

2.2. Mars Terrain Datasets

A high-quality dataset is the foundation of deep learning methods. Currently, there exist various types of unstructured terrain datasets. For Mars, there are three types of datasets: the real data captured by Mars rovers or helicopter, the physical simulation data, and synthetic simulation data. Ref. [41] established the first large-scale Mars landscape dataset, AI4Mars, using real images from Curiosity mission. Ref. [42] created the first real dataset for Martian rock segmentation, MarsData, and later refined it to MarsDataV2 [38]. Ref. [43] developed the first Mars panoramic dataset, called MarsScapes.

Physical simulation and synthetic simulation datasets also play a crucial role in supplementing real Martian data. These datasets provide controlled conditions and allow the generation of diverse scenarios, which are difficult or impossible to capture on Mars. Ref. [44] verified the feasibility of Mars helicopter navigation by simulating Martian terrain in the desert. Ref. [37] proposed a publicly available synthetic Martian surface dataset, SynMars, for rock segmentation and extended to SynMars-TW dataset [40] based on images of Zhurong Rover.

The aforementioned datasets have propelled advances in the progress of Mars exploration by Mars rovers. However, the viewpoint between the Rover and the helicopter is quite different, which leads to entirely distinct observations of the surrounding environment and objects. In the case of occlusion and partial visibility, the segmentation networks designed for Mars rovers may not be suitable for Mars helicopter. Differences in sensor types and data processing methods can lead to an incomplete understanding of the object’s position and size, affecting object segmentation. Currently, there are still few real Mars images captured from a helicopter. To this end, Ref. [45] proposed the first aerial perspective dataset of Mars, SynMars-Air. Sample images from these datasets are shown Figure 2.

3. Method

In this section, we first introduce the overall framework of our lightweight semantic segmentation model, LisseMars. Then, we provide a detailed description of the proposed Window Moving Attention Module (WMA), the Improved Feedforward Network (CFFN), and the Dynamic Polygon Convolution (DPC), as well as the Group Fusion Module (GFM).

3.1. Overrall Framework

LisseMars model balanced modularity and flexibility, aiming to provide a scalable system architecture. As shown in Figure 3, it followed the encoder-decoder structure of U-Net. The input image

X \in H \times W \times 3

is first fed into the encoder, which includes four improved self-attention modules and a dynamic polygon convolution module (DPCM). Each improved attention module consists of a WMA and CFFN structure, which is repeated multiple times to obtain four hierarchical feature maps:

X_{i} \in R^{H_{i} \times W_{i} \times C_{i}}

.

Subsequently, the obtained feature

X_{i}

is processed through the Group Fusion Module (GFM), which plays a crucial role in fusing and enhancing multi-scale feature maps. The high-dimensional feature layer

X_{4}

undergoes pooling operations at multiple scales in the High Dimensional Pooling Fusion Module and then merges with features from other layers in the GFM to obtain different output features:

F_{i} \in R^{H_{i} \times W_{i} \times C_{i}}

. Finally, these features

F_{i}

are sampled and fused, and the predicted semantic segmentation map is output.

3.2. Window Movable Attention (WMA)

The working principle of WMA is illustrated in Figure 4. In order to reduce computational costs while maintaining high accuracy, we adopted the window partitioning strategy to extract feature

X_{i}

from feature X and project it onto the Q tensor, the K tensor, and the V tensor. During the process again,

W_{h}

and

W_{w}

represent the window pooling sizes in the vertical and horizontal directions, while C is the number of channels. The specific steps are as follows:

\begin{matrix} Q & = L N (\frac{X_{i} W^{Q}}{W_{h} W_{w}} \sum_{i = 0}^{W_{h} - 1} \sum_{j = 0}^{W_{w} - 1} Q^{C} (i, j)) \\ K & = L N (\frac{X_{i} W^{K}}{W_{h} W_{w}} \sum_{i = 0}^{W_{h} - 1} \sum_{j = 0}^{W_{w} - 1} K^{C} (i, j)) \\ V & = L N (X W^{V}) \end{matrix}

(1)

In this context,

L N

denotes Layer Normalization, while i and j represent the indices in the horizontal and vertical directions of the feature map, respectively. As shown in Figure 5c, we applied deep convolutions for feature aggregation to the Q, K, and V tensors to aggregate more global context. This enables the model to focus on the relationships between adjacent elements in the input sequence, thereby enhancing its ability to capture spatial features. The main process is as follows:

\begin{matrix} Q & = D P C N (r_{i}, Q) + D P C N 1 (r_{i}, Q) + Q \\ K & = D P C N (r_{i}, K) + D P C N 1 (r_{i}, K) + K \\ V & = D P C N (r_{i}, V) + D P C N 1 (r_{i}, V) + V \end{matrix}

(2)

DPCN represents the application of deep convolution to input X, utilizing contextual information from

r_{i}

. This operation captures complex spatial relationships and enhances the local features of each component effectively. This dual-branch mechanism preserves and amplifies significant information within Q, K, and V while dynamically capturing the contextual details of surrounding elements. The complete workflow of WMA is depicted in Figure 6, and self-attention is defined as follows:

A t t e n t i o n (Q, K, V) = S o f t m a x (\frac{Q K^{T}}{\sqrt{d_{h}}}) V

(3)

Figure 5. Different implementation forms of local attention. (a) Swin extracts local features by dividing the image into windows and achieves feature fusion by shifting windows. (b) Slide Transformer [46] extracts local information through window partitioning and convolution. (c) Our model extracts local information through window partitioning and fuses multi-scale information extracted through convolution.

Figure 6. Structure of Window Movable Attention.

3.3. Convolutional Feedforward Neural Network

The extraction and expression of local features provide detailed contextual information, which is crucial for processing subsequent global features. In the WMA, effectively fusing global and local information from feature maps can significantly improve segmentation performance. Ref. [47] has demonstrated that convolutions of various sizes can capture multi-scale features. However, the Martian surface is complex, with numerous small objects, making it essential to preserve important local and multi-scale information for effective segmentation. Inspired by this challenge, we designed the Convolutional Feedforward Neural Network (CFFN) module, which better guides feature fusion while preserving the locality of the fused features.

As shown in the Figure 7, we use a 1 × 1 convolution instead of a linear layer, which effectively performs a linear transformation of features while preserving spatial integrity. Using a 1 × 1 convolution, we can increase the network’s non-linear capacity without increasing computational complexity, thus improving feature fusion. Deep convolutions with varying kernel sizes are used to extract multi-scale features. It effectively captures detailed information at multiple scales. Combining varied kernel sizes and receptive fields enhances feature learning, thus improving the identification and segmentation of small objects on the surface of Mars. The specific steps are as follows:

\begin{matrix} x_{1} & = L C o n v_{1 \times 1} (x_{i n}) \\ x_{2} & = G E L U (x_{1}) \\ x_{3} & = G P C N (x_{2}, r_{i}, k_{i}) + x_{2} \\ x_{4} & = G E L U (x_{3}) \\ x_{o u t} & = L C o n v_{1 \times 1} (x_{4}) \end{matrix}

(4)

where

x_{i n}

is the output of the WMA module,

L C o n v

represents point-wise convolution using 1 × 1 convolution, GELU represents a nonlinear activation function, and GPCN stands for grouped convolution, where

r_{i}, k_{i}

represents group convolution with different convolution kernels and stride sizes.

3.4. Dynamic Polygon Convolution Module

To enhance the adaptability of convolutional kernels to complex and geometrically heterogeneous shapes, we designed the DPCM module with a Dynamic Polygon Convolution (DPC) cooperation, as shown in Figure 8.

Inspired by deformable convolutions [48] and serpentine convolutions [49], in our DPC, only the central coordinate of a standard convolution kernel is fixed; the remaining kernels are configured into irregular polygonal shapes by the Convex Hull algorithm [50], as shown in Figure 9. It is utilized to define the polygonal boundaries and to verify the interior membership of spatial points. If a point is outside the irregular polygon, it will be moved back to the base coordinate. Such an irregular receptive field can significantly improve the feature extraction capability for diverse complex objects on Mars.

Under such, the process of DPCM can be represented as follows:

\begin{matrix} x_{1} & = C o n v e x H U l l (K, X) \\ x_{2} & = D P C (x_{1}) \\ x_{3} & = G P C N (X) \\ x_{4} & = C o n v_{1 \times 1} (X) \\ x_{o u t} & = C o n v_{1 \times 1} (C a t (x_{2} + x_{3} + x_{4} + X)) \end{matrix}

(5)

where K denotes the center coordinates of convolution

K_{i} = (x_{i}, y_{i})

, and X is the input feature. By fusing features under different channels through 1 × 1 convolution, multiple scales through group convolution, and irregular receptive fields through DPC, DPCM will accurately identify complex and irregular geometric targets in the Martian terrain.

3.5. Group Fusion Module

Feature fusion combined low-level detailed information with high-level semantic features, which can effectively improve the model’s performance on complex tasks. For accurate segmentation of Martian rocks, it preserves fine-grained structures while simultaneously improving the overall contextual understanding. However, most existing feature fusion methods focus on separately extracting internal information from each individual scale, which may introduce noisy or irrelevant features. In fact, it has been demonstrated in [51] that not all feature maps obtained through feature fusion contribute positively to the final prediction of the decoder. To mitigate the instability caused by feature fusion and achieve a lightweight design while preserving accuracy, we proposed the Group Fusion Module (GFM), as shown in Figure 10. By leveraging multi-scale information obtained through high-dimensional pooling and group fusion, the GFM module can efficiently increase the capability of feature representation, which is particularly important for object segmentation on complex Martian scenes.

The denoted feature

X_{i} (i \in 1, 2, 3, 4)

is obtained through the encoder. We first performed pooling fusion on the high-dimensional feature

X_{4}

:

\begin{matrix} P_{S_{i}} (x_{4} i n) & = A d a p t i v e A v g P o o l 2 d (x, S_{i}) (i = 1, 2, 3) \\ Y & = ⨁_{i = 1}^{n} {\hat{P}}_{S_{i}} \\ x_{4} & = G P C N (Y, r_{1}, k_{1}) \end{matrix}

(6)

where

S_{i}

represents different adaptive average pooling scales, and the obtained feature

x_{4}

is grouped and fused with other features

X_{i}, (i = 1, 2, 3)

. The

F_{i}

(i = 1, 2, 3, 4), here, represents the output from different channels.

3.6. Architecture Details and Variants

To balance performance and accuracy, we designed three lightweight models: LisseMars-T, LisseMars-S, and LisseMars-B. The corresponding network structure parameters are listed in Table 1. Given an input image with a size of

H \times W \times 3

, the output feature map size obtained through Patch Embedding is

(H / 2^{i}) \times (W / 2^{i})

(

i \in 1, 2, 3, 4

). Each model comprises five stages, including four stages of WMA and an intermediate DPCM layer. The output feature maps at each stage have sizes [32, 64, 128, 256], denoted as

F_{1}, F_{2}, F_{3}, F_{4}

.

4. Experiments

In this section, we will describe the experimental setup and conduct an extensive analysis to validate the performance of the LisseMars on three datasets: SynMars-Air, MarsScapes, and the TianWen dataset.

4.1. Dataset

The SynMars-Air dataset is the first synthetic aerial viewpoint dataset of Mars helicopter released by [45]. It consists of 8 categories: Gravel, SmallRock, Big rock, Bedrock, Ridge, Sand, Soil, and Sky, as shown in Figure 11. Totally, 11,700 images with semantic annotation are divided into training with 9400 images, testing, and validation; each has 1150 images.

MarsScapes [43] is a region-level object annotation panoramic dataset created from images collected by the Curiosity rover, as shown in Figure 12. The dataset contains 9 categories: Big Rock, Soil, Gravel, Bedrock, Ridge, Sand, Sky, Rover, and Unknown. Totally 13,002 images for training, 4192 for testing, and 3608 for validation.

The TianWen dataset is composed of Martian images collected by NaTeCam of the Zhurong rover in the Tianwen-1 mission (Courtesy of NAOC/GRAS: https://www.nssdc.ac.cn/nssdc_zh/html/task/tianwen1.html (accessed on 10 January 2025)), as shown in Figure 13. We labeled all the rocks in 1194 images with a resolution of 512 × 512, and split them into a training set with 896 images and a testing set with 298 images.

4.2. Implement Details

Our model is implemented based on the mmsegmentation framework for semantic segmentation. We trained the model using an NVIDIA GeForce RTX 3090 GPU and deployed it on the JETSON XAVIER NX platform. The batch sizes for training LisseMars-T, LisseMars-S, and LisseMars-B were set to 15, 10, and 5, respectively. The AdamW optimizer is used with a learning rate of

2 \times 10^{- 4}

.

Metrics like Params (Parameters) and mIoU (Mean Intersection over Union) are used to evaluate the performance of the proposed method. Params evaluates the network’s parameter count, while mIoU measures segmentation accuracy. The mIOU calculation formula is as follows:

\begin{matrix} I o U & = \frac{T P}{T P + F P + F N} \\ m I o U & = \frac{1}{N} \sum_{i = 1}^{N} {IoU}_{i} \end{matrix}

(7)

where TP, FN, and FP represent the number of samples correctly classified as positive, the number of positive samples misclassified as negative, and the number of negative samples misclassified as positive, respectively.

4.3. Comparative Experiment

4.3.1. Semantic Segmentation on SynMars-Air

To better validate the superiority of our framework, we compared LisseMars with the most advanced semantic segmentation algorithms currently available. For fairness, other algorithms adopt the mmsegmentation framework and are evaluated based on the Upernet as the decoder. In Table 2, we separated the results of SOTA backbones into three parts sorted by parameters. Our LisseMars-B model achieved the highest mIoU of 87.37%, with relatively fewer parameters compared to other algorithms. Furthermore, it demonstrated exceptional performance in the segmentation of Gravel (0 to 10 pixels) and SmallRock (10 to 100 pixels), reaching 36.48% and 77.65%, respectively, which are 19.87% and 10.27% higher than the second-best Light4Mars model. Gravel and Smallrock are the main objects distributed on the surface of Mars. Accurate identification of them would help the Mars rover proactively plan obstacle-avoidance paths, effectively preventing them from damaging the rover’s wheels while also providing valuable support for scientific exploration.

We visualized the segmentation results of these methods on the SynMars-Air test set, as shown in Figure 14. Other SOATs failed to identify gravel (red pixels) accurately, and our method has better recognition results for both large and small objects. The red boxes in Figure 14 highlight the significant differences between methods. Specifically, our model accurately segments the objects, as shown in the middle two rows of Figure 14, while other methods either fail or predict incorrect segmentation results. when handling low-contrast objects, as seen in the first and last rows of Figure 14, our model successfully segments the objects and captures their boundaries with high precision, whereas the other methods fail to do so.

4.3.2. Semantic Segmentation on MarsScapes

To verify the robustness of our proposed LisseMars, we conducted comparative experiments on the real Mars dataset, MarsScapes. As listed in Table 3, although test accuracy decreased, our Base model (LisseMars-B) still achieved the most competitive mIoU of 66.81%, outperforming other state-of-the-art methods by 0.26% and 0.5% over Swin and Segformer, respectively. Taking the prediction results in Figure 15 as an example, LisseMars demonstrates superior segmentation performance compared to other models. As shown in the first and third rows of Figure 15, LisseMars accurately segments the boundary contours of multiple objects. The second and fourth rows demonstrate that the object categories predicted by LisseMars are closer to the true categories. The segmentation results indicate that our model demonstrates outstanding performance in real Martian scenes. On very large rocks (in BigRock category), LisseMars-T does not rank the top performance. This is because, for the segmentation of large-scale rocks, the core strategy of learning-based methods focuses on employing deeper network architectures to achieve a larger receptive field, thus enabling the effective extraction of large-scale features. The proposed modules in LisseMars, however, are primarily optimized for small-scale object segmentation and for edge device deployment. Driven by such constraints, the segmentation model necessitated a lightweight design, at the expense of segmentation performance on large rocks.

4.3.3. Semantic Segmentation on TianWen

We further used the real Mars images of the TianWen-1 mission to validate the genrealizataion of our model. As shown in Table 4, we compared the Tiny model (LisseMars-T) with other Tiny SOATs; our model outperformed MobileNetV3 and Segformer by 0.26% and 0.72% in mIoU. In addition, LisseMars attained the most competitive mIoU 57.93% in Rock segmentation, achieved 0.6% and 1.64% mIoU higher than those of MobileNetV3 and Convnext, respectively. Compared with SOTA, the visualization of the segmentation results of our method is shown in Figure 16. The red box in Figure 16 shows a significant difference in the segmentation results. Our model accurately segmented the object, as shown in the first and last rows of Figure 16, while other methods exhibit a false negative. For the excessive clustering of multiple targets shown in the second and third rows of Figure 16, our model accurately captures the precise object boundaries, while other models suffer from over-segmentation and mistakenly identify multiple objects as a single object. The results demonstrate that our model exhibits excellent generalization ability in real Martian scenes.

4.4. Ablation Studies

4.4.1. The Impact of Different Proposed Modules

We conducted further research and improvements to validate the efficiency of the proposed modules. As shown in Table 5, we progressively added the WMA, CFFN, GFM, and DPCM modules to the BaseFormer model and analyzed their impact on various categories. The results in Table 5 demonstrate that the model’s performance improves with the incremental addition of each module. Specifically, after adding the WMA module, the model exhibited notable improvements in the Gravel and Smallrock, with increases of 4% and 2.37%, respectively. The next module, CFFN, brings further improvements, particularly in BedRock (+0.44%) and Sand (+2.36%). Adding the GFM module provides a more noticeable enhancement, especially in categories such as Gravel (+14.24%) and SmallRock (+7.8%). Finally, the addition of the DPCM module further enhanced the model’s performance, leading to significant improvements in object segmentation, with the final mIoU reaching 87.37%. These results demonstrate the effectiveness of our proposed modules. As shown in the visualization results in Figure 17, as the number of stages in the model increases, the heatmap shows that the model’s attention area to the objects becomes clearer and distinct, ultimately successfully segmenting the objects.

4.4.2. The Impact of WMA

To validate the efficiency of WMA, we changed the transformer module of the encoder while keeping the GFM decoder unchanged. As shown in Table 6, our model ranked the top with mIoU of 87.37%, which is 2.67% higher than the second Light4Mars. In addition, WMA performs better than other models in terms of Gravel, SmallRock, Ridge, etc. This indicates that WMA can effectively improve the overall performance of image segmentation by fully integrating global and local information to optimize the encoder module without changing the GFM decoder.

4.4.3. The Impact of CFFN

To verify the role of CFFN, we replaced the multilayer while keeping the other structures of the network unchanged. As shown in Table 7, CFFN achieves significantly better segmentation performance for various objects compared to CMLP and UFFN. While CMLP and UFFN effectively capture local information through their layer-by-layer fully connected structures, our CFFN incorporates multi-head convolution to enhance global information extraction, enabling more effective fusion of both global and local features.

4.4.4. The Impact of GFM

As shown in Table 8, we explore the performance of different decoders in our model. Among the evaluated decoders, GFM consistently outperforms other methods in predicting various objects. In particular, GFM achieved the highest segmentation accuracy of 36.48% and 77.65%, respectively, in detecting Gravel and small rocks, which proves the effectiveness of GFM in small object segmentation.

4.4.5. The Impact of DPC

To prove the advantages of our proposed DPC, we replaced it with the state-of-the-art convolution, as shown in Table 9. Our DPC achieved an mIoU that is 0.15% higher than that of DCN and 0.18% higher than Gconv. In addition, our DPC achieved the highest prediction accuracy of 92.71% and 99.61% respectively in Segmenting BedRock and Soil, and ranked the top in the accuracy of predicting Grave and SmallRock. These results prove the effectiveness of DPC. Moreover, to further validate the impact of DPCM’s position in different stages on the performance of the LisseMars model, we conducted additional experiments by adjusting the placement of DPCM. As shown in Table 10, when DPCM is positioned in stage 3, it achieves optimal segmentation accuracy by better integrating low-level features and low-level features.

4.5. Runtime Performance

To demonstrate the performance of the proposed model on an edge device, we deployed a resource-efficient and high-precision LisseMars-T model on Nvidia Jeston NX platform. It is a credit card-sized and low-power (10 W/15 W) system-on-a-chip, making it capable of being deployed on a lightweight helicopter or as a key component for deploying algorithms in deep space exploration applications.

The model is exported in TensorRT format (https://developer.nvidia.cn/tensorrt (accessed on 10 April 2025)) with

512 \times 512

resolution of the input image. As shown in Table 11, the average inference time per image of LisseMars on NX is 0.048 s on the SynMras-Air, 0.04 s on MarsScapes, respectively. On real TianWen datasets, our model achieved a processing time of 0.044 s per image, showcasing its potential for deployment in future Mars helicopter missions.

5. Conclusions

In this paper, we propose a novel lightweight model semantic segmentation LisseMars, specifically designed for the Mars helicopter. The model integrates a new window movable attention (WMA) module to enhance the fusion of global and local information. Additionally, we developed a multi-head convolutional feedforward network (CFFN) to further improve feature extraction and fusion, along with a dynamic polygon convolutional (DPC) to achieve precise segmentation of polygonal structures. The proposed method was evaluated on the challenging SynMars-Air, MarsScapes, and TianWen datasets. Experimental results demonstrated that LisseMars effectively segmented object contours across various complex Martian scenes. It has achieved or approached state-of-the-art results across most terrain categories, and only fell short on large rocks due to the constraints of its lightweight architecture. Furthermore, the model was successfully deployed on the edge device, validating its real-time performance and potential integration for future Mars helicopter mission. Future research will be dedicated to refining model lightweighting and multi-scale feature extraction, ultimately enabling accurate semantic segmentation for the diverse range of object on the Martian surface.

Author Contributions

Conceptualization, B.L., M.Y. and X.X.; methodology, B.L., F.W. and Y.Q.; formal analysis, B.Z. and Q.L.; investigation, B.L., Y.Q. and F.W.; resources, B.Z., H.C. and X.H.; data curation, Y.Q., F.W. and Q.L.; writing—original draft preparation, Y.Q., F.W. and B.L.; writing—review and editing, X.X. and M.Y.; visualization, B.L. and Y.Q.; supervision, H.C., M.Y. and X.X.; project administration, B.Z. and X.H.; funding acquisition, M.Y. and X.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work is sponsored by the National Natural Science Foundation of China (NSFC) through grants Nos. 52472448, 62394354, 62394355, and the Natural Science Foundation of Jilin Province through grants No. YDZJ202301ZYTS396. The authors also thank the China National Space Administration for providing the TianWen-1 data processed and produced by the Ground Research and Application System (GRAS) of China’s Lunar and Planetary Exploration Program to make this study possible.

Data Availability Statement

SynMars-Air dataset: https://github.com/CVIR-Lab/SynMars/tree/SynMars-Air (accessed on 10 January 2025); MarsScapes dataset: https://github.com/InRobots/MarsScapes (accessed on 10 January 2025); Tianwen-1 dataset: https://www.nssdc.ac.cn/nssdc_zh/html/task/tianwen1.html (accessed on 10 January 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Xiao, X.; Yao, M.; Cui, H.; Fu, Y. Safe Mars landing strategy: Towards lidar-based high altitude hazard detection. Adv. Space Res. 2019, 63, 2535–2550. [Google Scholar] [CrossRef]
Leake, C.; Grip, H.; Steyert, V.; Hasseler, T.D.; Cacan, M.; Jain, A. HeliCAT-DARTS: A High Fidelity, Closed-Loop Rotorcraft Simulator for Planetary Exploration. Aerospace 2024, 11, 727. [Google Scholar] [CrossRef]
Giacomini, E.; Westerberg, L.G. Rotorcraft Airfoil Performance in Martian Environment. Aerospace 2024, 11, 628. [Google Scholar] [CrossRef]
von Ehrenfried, M.D. Ingenuity. In Perseverance and the Mars 2020 Mission: Follow the Science to Jezero Crater; Springer: Cham, Switzerland, 2022; pp. 111–125. [Google Scholar]
Zhao, P.; Gao, X.; Zhao, B.; Liu, H.; Wu, J.; Deng, Z. Machine learning assisted prediction of airfoil lift-to-drag characteristics for mars helicopter. Aerospace 2023, 10, 614. [Google Scholar] [CrossRef]
Xiao, X.; Cui, H.; Yao, M.; Fu, Y.; Qi, W. Auto rock detection via sparse-based background modeling for mars rover. In Proceedings of the 2018 IEEE Congress on Evolutionary Computation (CEC), Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–6. [Google Scholar]
Li, G.; Geng, Y.; Xiao, X. Multi-scale rock detection on Mars. Sci. China Inf. Sci. 2018, 61, 1. [Google Scholar] [CrossRef]
Wang, A.; Wang, L.; Zhang, Y.; Hua, B.; Li, T.; Liu, Y.; Lin, D. Landing site positioning and descent trajectory reconstruction of Tianwen-1 on Mars. Astrodynamics 2022, 6, 69–79. [Google Scholar] [CrossRef]
Xu, C.; Huang, X.; Guo, M.; Li, M.; Hu, J.; Wang, X. End-to-end Mars entry, descent, and landing modeling and simulations for Tianwen-1 guidance, navigation, and control system. Astrodynamics 2022, 6, 53–67. [Google Scholar] [CrossRef]
Liu, S.; Kong, J.; Cao, J.; Huang, H.; Man, H.; Yan, J.; Li, X. Precise orbit determination for Tianwen-1 during mapping phase. Astrodynamics 2024, 8, 471–481. [Google Scholar] [CrossRef]
Li, Z.; Wu, B.; Chen, Z.; Ma, Y. Transformer-Based Method for Semantic Segmentation and Reconstruction of the Martian Surface. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2023, 48, 1643–1649. [Google Scholar] [CrossRef]
Wang, Y.; Wu, B. Active machine learning approach for crater detection from planetary imagery and digital elevation models. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5777–5789. [Google Scholar] [CrossRef]
Jiang, S.; Lian, Z.; Yung, K.L.; Ip, W.H.; Gao, M. Automated detection of multitype landforms on mars using a light-weight deep learning-based detector. IEEE Trans. Aerosp. Electron. Syst. 2022, 58, 5015–5029. [Google Scholar] [CrossRef]
Yang, S.; Cai, Z. High-resolution feature pyramid network for automatic Crater detection on Mars. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4601012. [Google Scholar] [CrossRef]
Cheng, S.; Yao, M.; Xiao, X. Dc-mot: Motion deblurring and compensation for multi-object tracking in uav videos. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 789–795. [Google Scholar]
Chen, Z.; Wu, B.; Liu, W.C. Mars3DNet: CNN-based high-resolution 3D reconstruction of the Martian surface from single images. Remote Sens. 2021, 13, 839. [Google Scholar] [CrossRef]
Tian, P.; Yao, M.; Xiao, X.; Zheng, B.; Cao, T.; Xi, Y.; Liu, H.; Cui, H. 3D Semantic Terrain Reconstruction of Monocular Close-up Images of Martian Terrains. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16. [Google Scholar]
Li, Z.; Wu, B.; Liu, W.C.; Chen, Z. Integrated photogrammetric and photoclinometric processing of multiple HRSC images for pixelwise 3-D mapping on Mars. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4601113. [Google Scholar] [CrossRef]
Meoni, G.; Märtens, M.; Derksen, D.; See, K.; Lightheart, T.; Sécher, A.; Martin, A.; Rijlaarsdam, D.; Fanizza, V.; Izzo, D. The OPS-SAT case: A data-centric competition for onboard satellite image classification. Astrodynamics 2024, 8, 507–528. [Google Scholar] [CrossRef]
Hu, R.; Huang, X.; Xu, C. Integrated visual navigation based on angles-only measurements for asteroid final landing phase. Astrodynamics 2023, 7, 69–82. [Google Scholar] [CrossRef]
Gui, Y.; Qi, Y.; Xiao, X.; Lin, B.; Cui, H.; Huang, X. SPTN: Transformer-based spacecraft pose estimation network for space objects tracking. Astrodynamics 2025, 9, 713–725. [Google Scholar] [CrossRef]
Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
Ma, X.; Guo, J.; Sansom, A.; McGuire, M.; Kalaani, A.; Chen, Q.; Tang, S.; Yang, Q.; Fu, S. Spatial pyramid attention for deep convolutional neural networks. IEEE Trans. Multimed. 2021, 23, 3048–3058. [Google Scholar] [CrossRef]
Tolstikhin, I.O.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J.; et al. Mlp-mixer: An all-mlp architecture for vision. Adv. Neural Inf. Process. Syst. 2021, 34, 24261–24272. [Google Scholar]
Wu, K.; Peng, H.; Chen, M.; Fu, J.; Chao, H. Rethinking and improving relative position encoding for vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10033–10041. [Google Scholar]
Su, J.; Ahmed, M.; Lu, Y.; Pan, S.; Bo, W.; Liu, Y. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 2024, 568, 127063. [Google Scholar] [CrossRef]
Deng, X.; Wang, P.; Lian, X.; Newsam, S. NightLab: A dual-level architecture with hardness detection for segmentation at night. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16938–16948. [Google Scholar]
Guastella, D.C.; Muscato, G. Learning-based methods of perception and navigation for ground vehicles in unstructured environments: A review. Sensors 2020, 21, 73. [Google Scholar] [CrossRef] [PubMed]
Liu, S.; Zhao, H.; Yuan, Z.; Xiao, L.; Shen, C.; Wan, X.; Tang, X.; Zhang, L. A Machine Learning Approach for the Autonomous Identification of Hardness in Extraterrestrial Rocks from Digital Images. Aerospace 2024, 12, 26. [Google Scholar] [CrossRef]
Zhang, Z.; Cheng, Y.; Bu, L.; Ye, J. Rapid SLAM Method for Star Surface Rover in Unstructured Space Environments. Aerospace 2024, 11, 768. [Google Scholar] [CrossRef]
Giacomini, E.; Westerberg, L.G. Numerical Study on Particle Accumulation and Its Impact on Rotorcraft Airfoil Performance on Mars. Aerospace 2025, 12, 368. [Google Scholar] [CrossRef]
Solarna, D.; Gotelli, A.; Le Moigne, J.; Moser, G.; Serpico, S.B. Crater detection and registration of planetary images through marked point processes, multiscale decomposition, and region-based analysis. IEEE Trans. Geosci. Remote Sens. 2020, 58, 6039–6058. [Google Scholar] [CrossRef]
Xiao, X.; Cui, H.; Yao, M.; Tian, Y. Autonomous rock detection on mars through region contrast. Adv. Space Res. 2017, 60, 626–635. [Google Scholar] [CrossRef]
Li, H.; Qiu, L.; Li, Z.; Meng, B.; Huang, J.; Zhang, Z. Automatic rocks segmentation based on deep learning for planetary rover images. J. Aerosp. Inf. Syst. 2021, 18, 755–761. [Google Scholar] [CrossRef]
Feng, L.; Wang, S.; Wang, D.; Xiong, P.; Xie, J.; Hu, Y.; Zhang, M.; Wu, E.Q.; Song, A. Mobile-DeepRFB: A Lightweight Terrain Classifier for Automatic Mars Rover Navigation. IEEE Trans. Autom. Sci. Eng. 2023, 22, 17442–17451. [Google Scholar] [CrossRef]
Dai, Y.; Zheng, T.; Xue, C.; Zhou, L. SegMarsViT: Lightweight mars terrain segmentation network for autonomous driving in planetary exploration. Remote Sens. 2022, 14, 6297. [Google Scholar] [CrossRef]
Xiong, Y.; Xiao, X.; Yao, M.; Liu, H.; Yang, H.; Fu, Y. Marsformer: Martian rock semantic segmentation with transformer. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4600612. [Google Scholar] [CrossRef]
Liu, H.; Yao, M.; Xiao, X.; Xiong, Y. RockFormer: A U-shaped transformer network for Martian rock segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4600116. [Google Scholar] [CrossRef]
Liu, H.; Yao, M.; Xiao, X.; Cui, H. A hybrid attention semantic segmentation network for unstructured terrain on Mars. Acta Astronaut. 2023, 204, 492–499. [Google Scholar] [CrossRef]
Xiong, Y.; Xiao, X.; Yao, M.; Cui, H.; Fu, Y. Light4Mars: A lightweight transformer model for semantic segmentation on unstructured environment like Mars. ISPRS J. Photogramm. Remote Sens. 2024, 214, 167–178. [Google Scholar] [CrossRef]
Swan, R.M.; Atha, D.; Leopold, H.A.; Gildner, M.; Oij, S.; Chiu, C.; Ono, M. Ai4mars: A dataset for terrain-aware autonomous driving on mars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1982–1991. [Google Scholar]
Xiao, X.; Yao, M.; Liu, H.; Wang, J.; Zhang, L.; Fu, Y. A kernel-based multi-featured rock modeling and detection framework for a mars rover. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 3335–3344. [Google Scholar] [CrossRef]
Liu, H.; Yao, M.; Xiao, X.; Zheng, B.; Cui, H. MarsScapes and UDAFormer: A Panorama Dataset and a Transformer-Based Unsupervised Domain Adaptation Framework for Martian Terrain Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4600117. [Google Scholar] [CrossRef]
Allak, E.; Brommer, C.; Dallenbach, D.; Weiss, S. AMADEE-18: Vision-Based Unmanned Aerial Vehicle Navigation for Analog Mars Mission (AVI-NAV). Astrobiology 2020, 20, 1321–1337. [Google Scholar] [CrossRef]
Qi, Y.; Xiao, X.; Yao, M.; Xiong, Y.; Zhang, L.; Cui, H. AirFormer: Learning-Based Object Detection For Mars Helicopter. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 100–111. [Google Scholar] [CrossRef]
Pan, X.; Ye, T.; Xia, Z.; Song, S.; Huang, G. Slide-transformer: Hierarchical vision transformer with local self-attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2082–2091. [Google Scholar]
Chen, W.; Shi, K. Multi-scale attention convolutional neural network for time series classification. Neural Netw. 2021, 136, 126–140. [Google Scholar] [CrossRef]
Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable {DETR}: Deformable Transformers for End-to-End Object Detection. In Proceedings of the International Conference on Learning Representations, Online, 3–7 May 2021. [Google Scholar]
Qi, Y.; He, Y.; Qi, X.; Zhang, Y.; Yang, G. Dynamic snake convolution based on topological geometric constraints for tubular structure segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 6070–6079. [Google Scholar]
Barber, C.B.; Dobkin, D.P.; Huhdanpaa, H. The quickhull algorithm for convex hulls. ACM Trans. Math. Softw. (TOMS) 1996, 22, 469–483. [Google Scholar] [CrossRef]
Ge, Y.; Yang, Z.; Huang, Z.; Ye, F. A multi-level feature fusion method based on pooling and similarity for HRRS image retrieval. Remote Sens. Lett. 2021, 12, 1090–1099. [Google Scholar] [CrossRef]
Yu, C.; Gao, C.; Wang, J.; Yu, G.; Shen, C.; Sang, N. Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation. Int. J. Comput. Vis. 2021, 129, 3051–3068. [Google Scholar] [CrossRef]
Koonce, B.; Koonce, B. MobileNetV3. Convolutional Neural Networks with Swift for Tensorflow: Image Recognition and Dataset Categorization; Apress: Berkeley, CA, USA, 2021; pp. 125–144. [Google Scholar]
Wang, W.; Chen, W.; Qiu, Q.; Chen, L.; Wu, B.; Lin, B.; He, X.; Liu, W. Crossformer++: A versatile vision transformer hinging on cross-scale attention. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 3123–3136. [Google Scholar] [CrossRef] [PubMed]
Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12124–12134. [Google Scholar]
Zhang, J.; Li, X.; Li, J.; Liu, L.; Xue, Z.; Zhang, B.; Jiang, Z.; Huang, T.; Wang, Y.; Wang, C. Rethinking mobile block for efficient attention-based models. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 1389–1400. [Google Scholar]
Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pvt v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media 2022, 8, 415–424. [Google Scholar] [CrossRef]
Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10819–10829. [Google Scholar]
Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
Guo, M.H.; Lu, C.Z.; Hou, Q.; Liu, Z.; Cheng, M.M.; Hu, S.M. Segnext: Rethinking convolutional attention design for semantic segmentation. Adv. Neural Inf. Process. Syst. 2022, 35, 1140–1156. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
Lin, W.; Wu, Z.; Chen, J.; Huang, J.; Jin, L. Scale-aware modulation meet transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 6015–6026. [Google Scholar]
Li, K.; Wang, Y.; Zhang, J.; Gao, P.; Song, G.; Liu, Y.; Li, H.; Qiao, Y. Uniformer: Unifying convolution and self-attention for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12581–12600. [Google Scholar] [CrossRef]
Xu, J.; Xiong, Z.; Bhattacharyya, S.P. PIDNet: A real-time semantic segmentation network inspired by PID controllers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19529–19539. [Google Scholar]
Lv, W.; Wei, L.; Zheng, D.; Liu, Y.; Wang, Y. Marsnet: Automated rock segmentation with transformers for tianwen-1 mission. IEEE Geosci. Remote Sens. Lett. 2023, 20, 3506605. [Google Scholar] [CrossRef]
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 June 2018; pp. 3–19. [Google Scholar]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
Vaswani, A. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 11, 6000–6010. [Google Scholar]
Kirillov, A.; Girshick, R.; He, K.; Dollár, P. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6399–6408. [Google Scholar]
Cheng, B.; Collins, M.D.; Zhu, Y.; Liu, T.; Huang, T.S.; Adam, H.; Chen, L.C. Panoptic-deeplab: A simple, strong, and fast baseline for bottom-up panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12475–12485. [Google Scholar]
Howard, A.G. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]

Figure 2. From left to right, the examples show SynMars-Air, MarsScapes, and images captured by the Ingenuity helicopter. (a–d) display a comparison of BigRock, Sand, Ridge, and BedRock.

Figure 3. The overall architecture of our LisseMars.

X_{i}

represent the feature output through each encoder modules, and

F_{i}

represent the feature output through each decoder modules.

Figure 3. The overall architecture of our LisseMars.

X_{i}

represent the feature output through each encoder modules, and

F_{i}

represent the feature output through each decoder modules.

Figure 4. The working principle of WMA. Compared to models like PVT, Deformable DETR, and Swin, our WMA demonstrates a significant advantage in terms of receptive field.

Figure 7. The structure of CFFN. (a) FFN in ViT. (b) Mix-FFN in PVTv2. (c) Ours.

Figure 8. (a) Basic shape of Martian rocks. (b) Structure of DPCM. (c) DPC. The input X is processed through various paths, including DPC, and then merged to produce the final output features.

Figure 9. (Left): Initial coordinate display of DPC and a demonstration of the convolution kernel’s movement. (Right): The receptive field of DPC. When the formed polygon has concave edges, the convolution kernel returns to the initial coordinates.

Figure 10. The structure of GFM.

Figure 11. Samples of different categories in the SynMars-Air dataset (Source: [45]).

Figure 12. A sample of MarsScapes panoramic image (Source: [43]).

Figure 13. Samples from the TianWen dataset. (Source: NAOC/GRAS).

Figure 14. Visual comparison on the SynMars-Air test set. The red boxes mask some challenging objects that are difficult to cut.

Figure 15. Visual comparison on the MarsScapes test set.

Figure 16. Visual comparison on the TianWen test set. The red boxes mask some challenging objects that are difficult to cut.

Figure 17. The visualization result of feature heatmaps with different stages.

Table 1. Model variants of LisseMars. N represents the number of blocks,

h e a d

represents the number of heads, and p and K represent the size of patches and the number of convolution kernels, respectively.

Table 1. Model variants of LisseMars. N represents the number of blocks,

h e a d

represents the number of heads, and p and K represent the size of patches and the number of convolution kernels, respectively.

	Module	LisseMars-T	LisseMars-S	LisseMars-B
Stage 1	WMA	p = 4	p = 4	p = 4
Stage 1	WMA	$C_{1}$ = 8, $h e a d$ = 1, $N_{1}$ = 1	$C_{1}$ = 32, $h e a d$ = 2, $N_{1}$ = 1	$C_{1}$ = 32, $h e a d$ = 2, $N_{1}$ = 2
Stage 2	WMA	p = 2	p = 2	p = 2
Stage 2	WMA	$C_{2}$ = 16, $h e a d$ = 2, $N_{2}$ = 1	$C_{2}$ = 64, $h e a d$ = 2, $N_{2}$ = 1	$C_{2}$ = 64, $h e a d$ = 2, $N_{2}$ = 2
Stage 3	DPCM	K = 9	K = 9	K = 9
		$N_{3}$ = 1	$N_{3}$ = 1	$N_{3}$ = 1
		$C_{3}$ = 32,	$C_{3}$ = 64	$C_{3}$ = 128
Stage 4	WMA	p = 2	p = 2	p = 2
Stage 4	WMA	$C_{4}$ = 32, $h e a d$ = 4, $N_{4}$ = 1	$C_{4}$ = 128, $h e a d$ = 4, $N_{4}$ = 1	$C_{4}$ = 128, $h e a d$ = 4, $N_{4}$ = 2
Stage 5	WMA	p = 2	p = 2	p = 2
Stage 5	WMA	$C_{5}$ = 64, $h e a d$ = 8, $N_{5}$ = 1	$C_{5}$ = 256, $h e a d$ = 8, $N_{5}$ = 1	$C_{5}$ = 256, $h e a d$ = 8, $N_{5}$ = 2

Table 2. Semantic Segmentation with 160 K on SynMars-Air (the first and second best results are highlighted in red and underlined, respectively).

Backbone	Params	FLOPs	160 K
Backbone	(M)	(G)	Gravel	SmallRock	BigRock	BedRock	Sand	Soil	Ridge	Sky	mIoU (%)
Bisenetv2 [52]	3.35	12.31	0	29.14	81.48	88.43	95.36	98.47	98.11	99.78	73.85
MobileNetV3 [53]	3.28	8.60	0.77	36.22	80.9	88.97	94.65	98.71	98.33	99.76	74.79
Crossformer++-T [54]	23.3	4.9	0.01	44.05	80.5	88.34	96.0	98.87	98.17	99.56	75.79
Convnext-T [55]	28	4.5	2.81	54.22	85.94	91.48	95.87	98.83	91.97	99.05	77.52
Cswin-T [56]	38.9	61.5	0.0	0	31.64	62.34	25.21	94.76	75.7	92.97	47.83
EMO-1M [57]	5.6	2.4	0.0	0.04	65.17	78.14	91.28	97.33	94.66	99.2	65.72
Pvtv2-B0 [58]	29.1	45.8	2.62	47.62	77.14	84.34	95.61	98.8	97.53	99.59	75.41
Poolformer-s12 [59]	12.47	54.31	0.03	27.11	55.96	79.27	83.26	97.51	88.55	98.03	66.21
DeepLabV3+-r18 [60]	15.65	30.75	1.15	56.22	89.88	93.13	82.82	98.12	90.55	99.47	76.44
Segformer-b0 [61]	3.7	6.41	2.9	38.96	84.87	88.44	86.98	98.04	95.98	99.56	74.46
Segnext-T [62]	4.3	6.6	0	24.83	79.63	88.41	96.04	98.44	98.52	99.75	73.2
Light4Mars-T [40]	0.1	0.53	4.73	41.38	76.8	85.6	93.42	98.61	94.23	98.44	74.15
LisseMars-T	0.12	0.90	7.25	48.46	84.53	89.91	96.26	98.93	96.33	98.97	77.58
Crossformer++-S [54]	52.0	9.5	1.36	51.69	80.18	87.57	96.1	98.9	94.27	98.18	76.03
Convnext-S [55]	50	8.7	3.91	60.96	86.62	91.77	95.79	99.16	99.38	99.93	79.69
Swin-T [63]	60.0	236	3.03	55.09	76.33	83.41	94.84	98.94	84.41	95.14	73.90
Swin-S [63]	81.0	329	2.29	57.21	82.67	88.17	95.79	99.1	93.61	97.99	77.10
Cswin-S [56]	51.3	73.7	0	17.33	49.1	70.39	77.07	97.14	97.34	98.28	62.64
EMO-2M [57]	6.9	3.5	0.0	0.02	67.15	79.31	92.21	97.33	94.23	98.99	66.16
SMT-B [64]	51.7	76.2	1.9	56.9	91.05	94.35	97.94	99.25	99.0	99.82	82.19
Uniformer-B [65]	70.0	90.34	7.49	69.13	90.1	92.71	97.49	99.3	98.4	99.93	82.51
Pvtv2-B3 [58]	49.0	62.4	5.53	66.38	88.64	91.63	97.88	99.35	99.5	99.93	81.11
PIDNet-s [66]	7.6	47.6	0	31,52	87.48	91.23	97.2	98.86	98.53	99.78	75.58
Poolformer-s36 [59]	34.6	42.0	0.01	27.1	57.19	78.83	83.46	97.52	88.65	97.96	66.34
DeepLabV3+-r50 [60]	47.0	62.7	0.9	56.37	90.18	93.44	97.41	99.13	98.34	99.84	79.45
Segformer-b2 [61]	25.4	15.1	0.99	53.42	89.92	94.04	97.69	99.19	98.94	99.79	79.25
Segnext-B [62]	27.8	35.7	0.09	47.5	86.65	90.01	96.15	99.01	99.14	99.89	77.31
SMT-B [64]	61.8	328	8.27	70.03	89.87	92.11	98.4	99.41	99.37	99.88	82.17
Uniformer-B [65]	80	471	7.49	69.13	90.1	92.71	97.49	99.38	98.4	99.5	81.78
Light4Mars-S [40]	1.50	5.45	15.09	63.97	90.98	93.91	97.6	99.34	99.77	99.96	82.51
LisseMars-S	1.355	6.49	17.63	67.77	91.53	94.04	97.63	99.41	99.3	99.85	83.39
Crossformer++-B [54]	92.0	16.6	1.94	49.97	80.31	87.61	95.36	98.87	96.79	99.83	76.33
Convnext-B [55]	89	45.0	5.33	64.64	86.91	92.07	96.48	99.2	99.42	99.94	80.50
Swin-B [63]	121.0	479	1.9	56.9	91.05	94.35	97.94	99.25	99.0	99.82	80.03
Cswin-B [56]	96.7	115.53	0	25.02	56.92	77.16	79.0	97.37	78.59	94.8	63.61
EMO-5M [57]	10.3	5.8	0.0	0.01	71.14	80.88	92.77	97.53	93.83	98.82	66.87
SMT-L [64]	97.1	132.2	9.05	70.17	90.32	92.49	98.35	99.42	99.56	99.95	82.41
Light4Mars-B [40]	2.57	9.43	16.61	67.38	91.53	94.45	97.87	99.4	99.28	99.9	83.30
Uniformer-L [65]	119.0	130.6	10.36	71.51	91.64	93.65	98.31	99.44	99.56	99.95	83.05
Pvtv2-B5 [58]	85.7	91.1	8.76	69.35	89.59	92.45	97.87	99.4	99.58	99.95	82.12
PIDNet-L [66]	36.9	275.8	0	33.62	89.35	92.63	97.74	98.94	98.75	99.78	76.35
Poolformer-m48 [59]	77.1	47.2	0.94	50.32	86.76	91.88	97.33	99.1	99.09	99.83	78.16
DeepLabV3+-r101 [60]	66.7	83.4	2.19	60.11	89.94	93.16	96.87	99.18	98.58	99.82	79.98
Segformer-b5 [61]	82.0	22.5	3.91	63.4	90.68	94.07	97.76	99.29	99.32	99.88	81.04
Segnext-L [62]	48.9	70.0	0.15	47.36	87.61	90.72	96.36	99.02	99.12	99.9	77.53
SMT-L [64]	102	546	9.05	70.17	90.32	92.49	98.35	99.42	99.56	99.95	82.41
Uniformer-L [65]	100	490	10.36	71.51	91.64	93.65	98.31	99.44	99.56	99.95	83.05
LisseMars-B	9.16	21.7	36.48	77.65	92.71	95.0	98.04	99.61	99.56	99.95	87.37

Table 3. Semantic Segmentation with 160 K on MarsScapes. (the first and second best results are highlighted in red and underlined, respectively).

Backbone	Params	FLOPs	160 K
Backbone	(M)	(G)	Soil	BedRock	Gravel	Sand	BigRock	Ridge	Sky	Rover	Unknow	mIoU (%)
Bisenetv2 [52]	3.35	6.16	66.78	64.79	37.80	59.46	44.02	45.94	70.63	81.98	46.01	57.49
MobileNetV3 [53]	3.28	4.37	69.42	73.48	49.64	65.43	47.47	48.84	77.38	62.30	45.91	59.99
Crossformer++-T [54]	23.3	4.9	61.9	60.17	33.11	48.44	31.55	42.34	64.87	64.14	43.72	50.03
Convnext-T [55]	28	4.5	63.81	49.58	38.64	49.9	38.64	54.88	79.13	60.64	43.09	53.15
EMO-2M [57]	6.9	3.5	57.29	50.55	23.85	43.01	33.83	36.39	71.98	36.67	25.56	42.12
Light4Mars-T [40]	0.1	0.53	69.35	70.25	50.82	60.29	45.61	53.87	81.88	55.42	45.89	59.26
Pvtv2-B0 [58]	7.25	11.79	64.91	59.17	37.63	51.77	36.66	49.54	75.18	86.08	48.54	56.61
Segnext-T [62]	40.0	52.3	62.51	57.74	43.72	49.55	40.41	28.32	63.92	12.79	0	39.89
Uniformer-S [65]	52.0	247	65.85	57.25	38.45	52.67	41.75	46.61	77.66	63.41	39.68	53.7
PIDNet-s [66]	7.6	47.6	60.79	55.84	32.32	53.98	39.9	34.39	49.61	87.45	46.67	51.22
LisseMars-T	0.12	0.90	68.95	67.51	45.81	58.94	40.74	62.24	83.84	89.9	53.43	63.48
MarsNet [67]	33.21	120.29	76.02	56.39	44.77	56.44	35.45	35.83	79.72	71.59	45.69	55.77
Crossformer++-S [54]	52.0	9.5	70.12	61.67	49.33	60.43	39.4	38.63	75.05	42.26	51.71	54.30
Convnext-S [55]	50	8.7	64.42	51.13	40.0	51.22	39.63	48.4	73.85	92.59	45.32	56.28
EMO-3M [57]	6.9	3.5	58.88	55.5	33.77	50.47	34.28	46.39	75.44	61.68	31.03	49.71
Light4Mars-S [40]	1.50	5.45	69.41	71.57	46.15	60.72	47.67	60.84	82.13	95.00	46.26	64.30
Segnext-B [62]	27.8	35.7	53.83	39.01	22.53	46.06	35.81	34.58	57.87	69.29	44.83	44.87
Uniformer-B [65]	80.0	471	65.24	55.51	38.75	52.24	42.37	51.32	80.24	69.91	40.66	55.14
Poolformer [59]	15.65	15.38	67.21	66.50	43.61	60.78	45.96	51.58	78.00	94.79	48.20	61.85
DeepLabV3+ [60]	12.47	27.16	69.53	73.99	43.48	61.86	48.58	58.40	82.54	97.80	46.81	64.78
LisseMars-S	1.355	6.49	70.21	71.92	50.08	58.74	46.98	62.72	80.76	92.32	53.31	65.23
Crossformer++-B [54]	92.0	16.6	69.54	66.53	44.89	59.55	43.62	44.0	64.61	83.18	49.82	58.42
Convnext-B [55]	89	45.0	62.79	51.02	41.61	50.22	38.2	56.45	78.04	95.19	49.2	58.06
EMO-5M [57]	10.28	3.05	68.01	71.06	40.03	58.67	42.76	48.97	80.32	95.79	45.33	61.22
Light4Mars-B [40]	2.57	9.43	70.46	73.44	51.62	62.14	48.65	60.40	84.10	92.08	52.39	66.14
PIDNet-L [66]	36.9	275.8	66.70	65.12	36.45	57.04	39.31	51.87	76.67	89.65	46.01	58.76
Segnext-L [62]	48.0	70.0	63.48	60.42	36.83	53.28	40.21	41.84	72.11	60.97	0	47.68
Uniformer-L [65]	100	490	67.11	62.53	41.49	51.01	42.91	46.49	77.69	75.05	40.3	56.06
Cswin [56]	52.01	115.41	64.18	63.03	33.42	52.07	35.34	49.38	75.30	86.91	46.42	56.23
Swin [63]	58.95	121.86	69.28	75.50	46.47	61.13	47.11	66.65	83.53	96.17	53.13	66.55
SMT [64]	52.24	117.0	68.94	75.96	44.76	64.19	51.02	58.04	80.72	96.66	47.32	65.29
Segformer [61]	3.72	3.70	69.48	75.26	46.62	62.85	48.96	61.98	80.52	98.00	53.13	66.31
LisseMars-B	9.16	21.7	70.48	70.81	50.31	61.28	48.14	67.73	81.51	96.41	54.7	66.81

Table 4. Semantic Segmentation with 160 K on TianWen. (the first and second best results are highlighted in red and underlined, respectively).

Backbone	Params	FLOPs	160 K		mIoU (%)
Backbone	(M)	(G)	Background	Rock	mIoU (%)
Bisenetv2 [52]	3.35	12.31	99.19	53.58	76.39
MobileNetV3 [53]	3.28	8.60	99.27	57.34	78.30
Convnext-T [55]	28.0	4.5	99.22	56.29	77.76
Light4Mars-T [40]	0.1	0.53	99.18	56.15	77.84
Segformer-b0 [61]	3.7	6.41	99.17	53.41	76.74
LisseMars-T	0.12	0.904	99.19	57.93	78.56

Table 5. Effect of different proposed modules (The improvement is highlighted in red).

Backbone	Fusion	Gravel	SmallRock	BigRock	BedRock	Sand	Soil	Ridge	Sky	mIoU (%)
BaseFormer		16.61	67.38	91.53	94.45	97.87	99.4	99.28	99.9	83.30
	+WMA	20.64	69.75	92.03	94.26	95.69	99.45	98.37	99.9	83.76 (↑0.46)
	+CFFN	19.4	69.25	92.18	94.7	98.05	99.44	99.36	99.92	84.04 (↑0.28)
	+GFM	33.64	77.02	91.73	94.13	98.28	99.6	99.7	99.96	86.76 (↑2.72)
	+DPCM	36.48	77.65	92.71	95.0	98.04	99.61	99.56	99.91	87.37 (↑0.61)

Table 6. Comparison of different decoders based on LisseMars-B on SynMars-Air (the first and second best results are highlighted in red and underlined, respectively).

AttentionEncoder	Decoder	Gravel	SmallRock	BigRock	BedRock	Sand	Soil	Ridge	Sky	mIoU (%)
CBAM [68]		0	26.25	73.14	79.56	89.85	88.32	90.63	94.75	67.81
SE attention [69]		0	22.24	61.55	70.47	84.21	83.37	86.78	92.42	62.63
Shift Window Attention [61]		2.11	57.06	92.14	94.84	97.84	99.6	99.33	99.75	80.3
Effcient Attention [61]		5.9	65.96	91.8	94.47	98.0	99.30	99.5	99.99	81.8
Linear Spatial Reduction Attention [58]		10.26	70.17	90.45	93.13	97.90	99.49	98.80	99.99	82.52
Mix Attention [24]		14.0	65.2	93.81	94.68	95.13	98.21	99.52	99.76	82.53
Scale-Aware Aggregation [27]	GFM	9.3	70.04	90.38	92.58	98.04	99.41	99.54	99.95	82.5
Multi-head Relation Attention [65]		12.31	72.2	92.02	95.4	98.27	99.58	99.56	99.9	83.65
Squeeze Window Attention [40]		23.04	70.94	92.37	95.0	97.9	99.47	99.44	99.88	84.7
Window Movable Attention (Ours)		36.48	77.65	92.71	95.0	98.04	99.61	99.56	99.91	87.37

Table 7. Comparison of different multi-layer perceptrons based on LisseMars-B on SynMars-Air (the first and second best results are highlighted in red and underlined, respectively).

Model	Multi-Layer	Gravel	SmallRock	BigRock	BedRock	Sand	Soil	Ridge	Sky	mIoU (%)
LisseMars	CMLP [65]	22.01	70.4	90.87	93.44	92.99	99.46	97.12	99.85	83.27
	MLP [70]	21.88	70.46	90.44	93.02	91.58	99.46	96.28	99.82	82.87
	UFFN [40]	21.75	70.28	90.55	93.17	92.52	99.46	96.84	99.83	83.05
	CFFN (Ours)	36.48	77.65	92.71	95.0	98.04	99.61	99.56	99.91	87.37

Table 8. Comparison of different decoders based on LisseMars-B on SynMars-Air (the first and second best results are highlighted in red and underlined, respectively).

Encoder	Decoder	Gravel	SmallRock	BigRock	BedRock	Sand	Soil	Ridge	Sky	mIoU (%)
WMA	EncHead [61]	0	18.11	86.74	89.1	97.09	98.6	98.33	99.75	73.47
	SegformerHead [61]	23.07	71.17	90.45	93.33	95.32	99.49	98.45	99.99	83.89
	FPN [71]	26.0	72.77	91.95	94.68	98.03	99.51	99.52	99.91	85.35
	UPerHead [27]	31.53	76.2	92.02	94.3	98.27	99.58	99.46	99.9	86.41
	FCNHead [72]	23.04	70.94	92.37	95.0	97.9	99.47	99.44	99.88	84.77
	ALA [40]	20.64	69.75	92.03	94.26	95.69	99.45	98.37	99.9	83.76
	GFM (Ours)	36.48	77.65	92.71	95.0	98.04	99.61	99.56	99.91	87.37

Table 9. The effects of different convolutions in DPC (the first and second best results are highlighted in red and underlined, respectively).

BaseMoudle	Convolution	Gravel	SmallRock	BigRock	BedRock	Sand	Soil	Ridge	Sky	mIoU (%)
DPCM	Conv	35.62	77.21	92.4	94.77	98.09	99.61	99.66	99.95	87.16
	DWConv [73]	35.39	77.2	92.69	95.0	98.27	99.61	99.4	99.85	87.18
	GConv [74]	35.86	77.21	92.33	94.78	98.19	99.61	99.63	99.93	87.19
	DCN [75]	35.92	77.22	92.6	95.0	98.1	99.61	99.44	99.86	87.22
	DPC (Ours)	36.48	77.65	92.71	95.0	98.04	99.61	99.56	99.91	87.37

Table 10. The position of DPCM (the first result is highlighted in red).

Position	Stage 1	Stage 2	Stage3	Stage 4	Stage 5	mIoU (%)
DPCM	✓					86.07
		✓				86.83
				✓		86.92
					✓	86.5
			✓			87.37

Table 11. Run-time performance of LisseMars on Nvidia Jetson Xavier NX (NX).

Metrics	SynMars-Air	MarsScapes	TianWen
Times (s)	0.048	0.04	0.044
mIoU (%)	77.58	61.48	78.56

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Lin, B.; Wang, F.; Li, Q.; Zheng, B.; Yao, M.; Xiao, X.; Qi, Y.; Cui, H.; Huang, X. LisseMars: A Lightweight Semantic Segmentation Model for Mars Helicopter. Aerospace 2025, 12, 1049. https://doi.org/10.3390/aerospace12121049

AMA Style

Lin B, Wang F, Li Q, Zheng B, Yao M, Xiao X, Qi Y, Cui H, Huang X. LisseMars: A Lightweight Semantic Segmentation Model for Mars Helicopter. Aerospace. 2025; 12(12):1049. https://doi.org/10.3390/aerospace12121049

Chicago/Turabian Style

Lin, Boyu, Fei Wang, Qingeng Li, Bo Zheng, Meibao Yao, Xueming Xiao, Yifan Qi, Hutao Cui, and Xiangyu Huang. 2025. "LisseMars: A Lightweight Semantic Segmentation Model for Mars Helicopter" Aerospace 12, no. 12: 1049. https://doi.org/10.3390/aerospace12121049

APA Style

Lin, B., Wang, F., Li, Q., Zheng, B., Yao, M., Xiao, X., Qi, Y., Cui, H., & Huang, X. (2025). LisseMars: A Lightweight Semantic Segmentation Model for Mars Helicopter. Aerospace, 12(12), 1049. https://doi.org/10.3390/aerospace12121049

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

LisseMars: A Lightweight Semantic Segmentation Model for Mars Helicopter

Abstract

1. Introduction

2. Related Works

2.1. Deep Learning for Mars

2.2. Mars Terrain Datasets

3. Method

3.1. Overrall Framework

3.2. Window Movable Attention (WMA)

3.3. Convolutional Feedforward Neural Network

3.4. Dynamic Polygon Convolution Module

3.5. Group Fusion Module

3.6. Architecture Details and Variants

4. Experiments

4.1. Dataset

4.2. Implement Details

4.3. Comparative Experiment

4.3.1. Semantic Segmentation on SynMars-Air

4.3.2. Semantic Segmentation on MarsScapes

4.3.3. Semantic Segmentation on TianWen

4.4. Ablation Studies

4.4.1. The Impact of Different Proposed Modules

4.4.2. The Impact of WMA

4.4.3. The Impact of CFFN

4.4.4. The Impact of GFM

4.4.5. The Impact of DPC

4.5. Runtime Performance

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI