Article

TARTS: Training-Free Adaptive Reference-Guided Traversability Segmentation with Automated Footprint Supervision and Experimental Verification

1 School of Automation, Beijing Information Science and Technology University, Beijing 100192, China
2 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(6), 1194; https://doi.org/10.3390/electronics15061194
Submission received: 10 February 2026 / Revised: 4 March 2026 / Accepted: 8 March 2026 / Published: 13 March 2026

Abstract

Autonomous mobile robots require robust traversability perception to navigate safely in diverse outdoor environments. However, traditional deep learning approaches are data-hungry, requiring large-scale manual annotations, and struggle to adapt quickly to unseen environments. This paper introduces TARTS (Training-free Adaptive Reference-guided Traversability Segmentation), a novel framework combining one-shot prototype initialization with trajectory-guided online adaptation for terrain segmentation. Using a single reference image of desired traversable terrain, TARTS establishes an initial prototype from pre-trained DINO Vision Transformer (ViT) features. The system performs segmentation through superpixel-based feature aggregation and valley-emphasis Otsu thresholding while continuously refining the prototype via Exponential Moving Average (EMA) updates driven by automated footprint supervision from the robot’s traversed trajectory. Extensive experiments on our introduced Reference-guided Traversability Segmentation Dataset (RTSD) and the challenging Off-Road Freespace Detection (ORFD) benchmark demonstrate strong performance, achieving 94.5% IoU on RTSD and 94.1% IoU on ORFD, outperforming state-of-the-art supervised methods that require multi-modal inputs and dedicated training. The framework maintains efficient performance (17–24 FPS) on embedded platforms, enabling practical deployment with only a reference image as initialization.

1. Introduction

Safe and efficient navigation for autonomous mobile robots in diverse environments hinges on perceiving the traversability of surrounding surfaces [1]. Traditional environment perception methods have predominantly relied on geometric information, identifying ground surfaces and obstacles by analyzing depth sensor data [2,3]. However, these approaches often exhibit limited robustness and poor generalization when confronted with ambiguous terrain or areas that share similar geometric features but possess different semantic meanings [4,5]. For instance, a flat patch of grass and a flat marsh may appear geometrically similar, yet their traversability for a robot is vastly different, highlighting the inadequacy of relying solely on geometric cues. To overcome the limitations of geometry-based methods, researchers have begun to integrate semantic segmentation, which assigns pixel-level class labels, to help robots comprehend high-level semantic information. The advent of deep learning-based semantic segmentation methods [6,7,8] has significantly enhanced the robot’s ability to understand complex scenes. Nevertheless, these supervised learning approaches are heavily dependent on large-scale, high-quality, manually annotated datasets. Manual labeling of every new dynamic environment is impractical [9]. Furthermore, the generalization capability of pre-trained models tends to degrade sharply when encountering unseen environments (i.e., domain shift), and these models are often computationally intensive, posing significant challenges for real-time deployment on resource-constrained platforms.
In recent years, self-supervised learning has emerged as a new paradigm that offers a promising solution by learning from massive unlabeled data [10]. By leveraging the robot’s own interactions with the environment (e.g., consecutive frames, odometry), these methods [5,11] can automatically generate supervision signals from the robot’s traversed footprints—regions that the robot has successfully navigated through, which inherently indicate traversable terrain. These footprint annotations are then projected onto camera viewpoints to create pixel-level supervision masks for training segmentation models, thereby reducing the dependence on manual labeling and enhancing the robustness in unknown environments. However, current self-supervised frameworks often still require an initial training phase or substantial computational resources for feature extraction, limiting applicability for training-free, “plug-and-play” deployment.
To address the limitations of existing traversability segmentation approaches, this paper proposes TARTS (Training-free Adaptive Reference-guided Traversability Segmentation), a training-free framework combining reference-guided learning with automated self-supervision for terrain segmentation. First, the system initializes using only a single reference image of the desired traversable terrain. By extracting and globally averaging pre-trained DINO Vision Transformer (ViT) [12] features from this image, the framework establishes an initial traversability prototype that captures semantic characteristics of navigable surfaces, bypassing offline training. Second, the system performs segmentation inference through superpixel-based feature aggregation and adaptive thresholding. Dense semantic features extracted from each incoming frame are compared against the current prototype using cosine similarity, while a valley-emphasis Otsu method [13] automatically sets adaptive thresholds for robust segmentation. Third, we develop an online adaptation loop that continuously refines the traversability prototype through trajectory-guided self-supervision. The framework projects the robot’s recently traversed footprint onto historical camera viewpoints to generate high-fidelity supervision masks, enabling prototype updates via Exponential Moving Average (EMA) without manual annotation. The overall flowchart of the TARTS method is shown in Figure 1.
This integrated approach enables immediate deployment capabilities while maintaining lifelong learning through embodied interaction, making it particularly suitable for autonomous navigation in dynamic and previously unseen environments. The main contributions of this paper are outlined below:
1. We propose a comprehensive training-free traversability segmentation framework that combines reference-guided one-shot prototype initialization with trajectory-based online adaptation, enabling both immediate deployment capabilities and continuous performance improvement through embodied interaction with the environment.
2. We demonstrate that decoupling semantic recognition from fine-grained spatial localization—by leveraging SLIC for perceptual grouping and patch-level DINO features for semantic discrimination—effectively alleviates the spatial inconsistency inherent in vision foundation models.

2. Related Work

2.1. Semantic Traversability Analysis

Semantic traversability analysis, which involves segmenting traversable regions for mobile robots using semantic understanding, has been a significant area of research. Early and contemporary approaches have leveraged deep learning, particularly semantic segmentation, to classify terrain and identify safe navigation paths.
Kim et al. [1] introduced a scalable methodology for training a semantic traversability estimator by using egocentric videos from pedestrians, combined with an automated annotation strategy that leverages a foundation model to generate training data, thereby reducing the reliance on expensive manual labeling. To enhance segmentation efficiency in unstructured outdoor environments, GA-Nav [8] utilizes a novel group-wise attention mechanism and a corresponding loss function to effectively distinguish between different navigability levels of terrains from RGB images. Focusing on road scenes, RoadFormer [6] is a duplex Transformer-based network that fuses heterogeneous features from both RGB images and surface normal information to parse freespace and hazardous road defects. Addressing the complementary nature of segmentation and boundary detection, Mobile-Seed [7] is a lightweight, dual-task framework featuring a two-stream encoder and an active fusion decoder to simultaneously perform both tasks for mobile robotics. Ewen et al. [14] proposed a Bayesian inference framework that moves beyond simple classification to estimate a probability distribution for terrain properties, such as friction, in real time using a single RGB-D camera to create semantic maps for risk-aware navigation. These innovations highlight a trend toward integrating semantic segmentation with predictive modeling to address limitations in traditional geometric methods. However, these supervised approaches remain heavily reliant on large-scale annotated datasets and often struggle with domain shift when deployed in unseen environments.

2.2. Traversability from Self-Supervision

Self-supervised learning has emerged as a prominent approach for traversability estimation, circumventing the need for extensive, manually annotated datasets. For instance, WayFASTER [11] utilizes sequences of RGB-D images fused with pose estimations, leveraging experience data from a receding horizon estimator to self-supervise a network for predicting traversability, even for areas not immediately visible. Other research leverages the capabilities of large-scale Vision Foundation Models (VFMs). V-STRONG [15] introduces an image-based self-supervised method employing contrastive representation learning, which utilizes human driving data and instance segmentation masks to achieve strong out-of-distribution performance and zero-shot generalization in off-road scenarios. Similarly, Wild Visual Navigation (WVN) [5] employs high-dimensional features from pre-trained models but focuses on online self-supervised learning, enabling rapid in-field adaptation to complex outdoor terrains using an online supervision generation scheme. Addressing the specific needs of legged platforms, STEPP [4] learns from human walking demonstrations, utilizing DINOv2 [16] features within a reconstruction-based framework to identify unfamiliar or hazardous terrain as anomalies exhibiting high reconstruction error. Furthermore, cross-modal self-supervision has been explored, where unsupervised clustering of vehicle–terrain interaction sounds provides sparse labels to subsequently train a visual classifier for pixel-wise semantic segmentation [17].
In summary, while self-supervised methods significantly reduce annotation requirements, most still necessitate an offline training phase or substantial computational resources for online learning, leaving a gap for training-free, immediately deployable solutions.
Recent advances in vision foundation models have further driven zero-shot and one-shot segmentation paradigms. The Segment Anything Model (SAM) [18] enables class-agnostic mask generation through promptable segmentation, while ZISVFM [9] leverages vision foundation models for zero-shot instance segmentation in robotic environments. Our work shares the spirit of these approaches by eliminating domain-specific training, but differs in its focus on binary traversability segmentation with continuous online adaptation through embodied interaction.

3. Methodology

3.1. System Overview

The proposed TARTS system is an efficient, training-free perception pipeline that combines pre-trained DINO features (for semantics) with SLIC superpixels (for spatial coherence). It initializes a traversability prototype from a single reference image for rapid, training-free recognition, and optional robot footprint supervision incrementally refines this prototype online.
The workflow consists of three sequential stages: (1) One-shot Prototype Seeding (Figure 2), where a global traversability prototype is created from a single reference image for immediate segmentation; (2) Segmentation Inference (Figure 2), where DINO features for each frame are aggregated by superpixel, compared to the prototype via cosine similarity, and adaptively classified using a valley-emphasis Otsu threshold; and (3) Online Adaptation Loop (Figure 3), where the robot’s trajectory memory is used to project its footprint, generating automated supervision labels that update the prototype via an EMA to adapt to new terrain. This architecture enables immediate deployment and lifelong adaptation.
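To make this three-stage workflow concrete, the following is a minimal structural sketch in Python. The function name and the callables it delegates to are illustrative assumptions, not the released implementation; the three stage functions correspond to the detailed sketches in Section 3.2, Section 3.4 and Section 3.5.

```python
import numpy as np

def tarts_loop(frames, distances, extract_features, segment, footprint_feature,
               alpha=0.9, d_extract=0.5):
    """Structural sketch of the three-stage TARTS workflow (illustrative names).

    frames:            iterable of images (the single reference image first, then incoming frames).
    distances:         per-frame travelled distance in metres for the incoming frames.
    extract_features:  callable implementing Section 3.2 (image -> (h, w, D) feature map).
    segment:           callable implementing Section 3.4 (features, prototype -> binary mask).
    footprint_feature: callable implementing Section 3.5 (-> f_new vector, or None if unavailable).
    """
    frames = iter(frames)
    # Stage 1: one-shot prototype seeding from the single reference image.
    ref_feats = extract_features(next(frames))
    prototype = ref_feats.reshape(-1, ref_feats.shape[-1]).mean(axis=0)

    masks, since_update = [], 0.0
    for frame, dist in zip(frames, distances):
        # Stage 2: segmentation inference against the current prototype.
        feats = extract_features(frame)
        masks.append(segment(feats, prototype))
        # Stage 3: trajectory-guided adaptation, triggered every d_extract metres of travel.
        since_update += dist
        if since_update >= d_extract:
            f_new = footprint_feature()
            if f_new is not None:
                prototype = alpha * prototype + (1.0 - alpha) * f_new
            since_update = 0.0
    return masks
```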

3.2. Semantic Feature Extraction via Vision Foundation Model

We employ a ViT [19] pre-trained with the DINO [12,20] self-supervised learning framework to serve as our feature extraction backbone. This approach is motivated by several key properties of such models.
First, DINO models exhibit strong “out-of-the-box” performance across various downstream tasks without requiring task-specific fine-tuning. This property is paramount for our framework, as online fine-tuning of a large network is computationally prohibitive on a mobile platform. Furthermore, the model yields dense, patch-level features that preserve rich spatial and semantic information, which is crucial for the fine-grained analysis required.
We specifically select DINO over alternative foundation models such as SAM [18] and CLIP [21]. SAM is designed for class-agnostic, promptable segmentation and excels at boundary delineation, but it lacks inherent semantic understanding of terrain properties and requires explicit spatial prompts (points, boxes, or masks) for each inference. CLIP features are optimized for global image–text alignment and do not provide the dense, patch-level spatial features necessary for fine-grained pixel-level segmentation. In contrast, DINO’s self-supervised pre-training yields dense, semantically rich patch-level features that are directly amenable to prototype-based similarity matching without any task-specific adaptation.
Formally, each input image $I_i$ is processed by the DINO-trained ViT encoder, which we denote as the function $\Phi_{\mathrm{DINO}}(\cdot)$, to produce a dense feature map $F_i \in \mathbb{R}^{h \times w \times D}$:
$$F_i = \Phi_{\mathrm{DINO}}(I_i)$$
Here, $h$ and $w$ are the spatial dimensions of the feature map, and $D$ is the dimensionality of each feature vector. Each feature vector $f_{i,j} \in \mathbb{R}^{D}$ in this map corresponds to a local patch in the input image, thereby retaining critical spatial information.
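As a concrete illustration, the snippet below extracts such a dense patch-level feature map. This is a hedged sketch: the paper uses a DINOv3 backbone, whereas the torch.hub entry point and get_intermediate_layers call shown here belong to the original DINO release, used as a stand-in; the preprocessing values are standard ImageNet statistics.

```python
import torch
import torchvision.transforms as T
from PIL import Image

# Load a DINO ViT-S/16 from torch.hub as a stand-in backbone (the paper uses DINOv3).
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits16').eval()

preprocess = T.Compose([
    T.Resize((480, 640)),                       # (H, W); multiples of the 16-pixel patch size
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

@torch.no_grad()
def extract_features(image: Image.Image) -> torch.Tensor:
    """Return the dense patch-feature map F_i of shape (h, w, D) for one RGB image."""
    x = preprocess(image).unsqueeze(0)                     # (1, 3, 480, 640)
    tokens = model.get_intermediate_layers(x, n=1)[0]      # (1, 1 + h*w, D), CLS token first
    patch_tokens = tokens[:, 1:, :]                        # drop the CLS token
    h, w = x.shape[2] // 16, x.shape[3] // 16              # 30 x 40 patch grid here
    return patch_tokens.reshape(h, w, -1)                  # F_i in R^{h x w x D}
```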

3.3. Reference-Guided One-Shot Traversability Prototype Seeding

The system is initialized via a one-shot learning mechanism using a single reference image, $I_{\mathrm{ref}}$, that depicts the desired terrain type, establishing the reference-guided nature of our approach.
The reference image $I_{\mathrm{ref}}$ is processed by the same DINO-trained ViT encoder, $\Phi_{\mathrm{DINO}}(\cdot)$, to produce a dense feature map $F_{\mathrm{ref}} \in \mathbb{R}^{h \times w \times D}$ as follows:
$$F_{\mathrm{ref}} = \Phi_{\mathrm{DINO}}(I_{\mathrm{ref}})$$
Since the entire reference image is assumed to represent the target terrain, every feature vector $f_{i,j}^{\mathrm{ref}} \in \mathbb{R}^{D}$ within this map is used for initialization. The initial traversability prototype, $p_0 \in \mathbb{R}^{D}$ (where the subscript 0 indicates the initial update step before any online adaptation), is then computed as the centroid of all feature vectors in $F_{\mathrm{ref}}$, given by:
$$p_0 = \frac{1}{h \cdot w} \sum_{i=1}^{h} \sum_{j=1}^{w} f_{i,j}^{\mathrm{ref}}$$
This aggregation provides a robust, global representation of the traversable class in the feature space. Simple centroid calculation is adopted because the entire reference image is assumed to depict the target traversable terrain, meaning all feature vectors are equally representative of the desired class.
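The seeding step reduces to a single global average over the reference feature map. A minimal sketch follows; the function name is illustrative and the input is assumed to be a NumPy array of the dense feature map from the previous subsection:

```python
import numpy as np

def seed_prototype(ref_features: np.ndarray) -> np.ndarray:
    """One-shot prototype p_0: centroid of all patch features of the reference image.

    ref_features: (h, w, D) dense feature map F_ref of the reference terrain image.
    Returns a (D,) prototype vector.
    """
    h, w, D = ref_features.shape
    return ref_features.reshape(h * w, D).mean(axis=0)
```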

3.4. Segmentation Inference

For each incoming frame $I_i$, the system performs efficient segmentation by comparing aggregated features against the current traversability prototype $p_{t-1}$, where the subscript $t-1$ denotes that this prototype was obtained from the most recent (i.e., $(t-1)$-th) update step. This process involves two main stages: superpixel-based feature aggregation and adaptive classification.

3.4.1. Superpixel-Based Feature Aggregation

To balance computational efficiency with spatial coherence, the input image $I_i$ is first partitioned into a set of $K$ superpixels, $\{s_k\}_{k=1}^{K}$, using the SLIC algorithm. Concurrently, the dense feature map $F_i \in \mathbb{R}^{h \times w \times D}$ is produced by the encoder $\Phi_{\mathrm{DINO}}(\cdot)$, where the spatial dimensions $h \times w$ are significantly smaller than the original image resolution $H \times W$.
We deviate from existing traversability frameworks [5] by not upsampling this low-resolution feature map to the original image resolution with continuous interpolation methods such as bilinear interpolation. DINO patch-level features encode highly compressed, non-linear semantic representations and inherently lack fine-grained spatial consistency at the sub-patch level [12,22]; bilinear interpolation mathematically blends these distinct latent vectors, generating synthetic, out-of-distribution feature representations that destroy semantic integrity.
Instead, we propose a discrete, patch-level alignment strategy governed by a coverage-based priority rule. For each superpixel $s_k$, we map its spatial extent on the original $H \times W$ image onto the $h \times w$ patch grid and compute a representative feature $v_k \in \mathbb{R}^{D}$. Let $P_{x,y}$ denote the spatial footprint of a single DINO patch projected onto the high-resolution image grid, corresponding to the latent feature vector $f_{x,y} \in F_i$. For each superpixel $s_k$, we compute the geometric intersection area with every overlapping patch and apply the priority rule in a strict two-tier hierarchy:
Encompassment-based aggregation: a patch is defined as “fully encompassed” if the ratio of its intersection area with $s_k$ to its total patch area exceeds a strict threshold $\eta_{\mathrm{cov}}$ (empirically set to 0.85). Let $\Omega_k$ denote the set of encompassed patches; if $\Omega_k$ is non-empty, the representative superpixel feature is calculated as the unweighted mean of the features in $\Omega_k$.
Maximum-overlap fallback: if the superpixel is small or highly elongated, spanning multiple patches without encompassing any, the algorithm falls back to selecting the feature vector of the single discrete patch that exhibits the highest intersection area with the superpixel. In instances of equivalent maximum overlap across multiple patches, the tie is deterministically broken by selecting the patch whose spatial centroid minimizes the Euclidean distance to the geometric centroid of $s_k$.
We represent this alignment and aggregation process as:
$$v_k = \mathrm{AlignAndAggregate}(s_k, F_i)$$
The superiority of this proposed patch-level alignment strategy over interpolation is further substantiated in our ablation studies presented in Section 4.3.3.
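A minimal sketch of this two-tier rule is given below, using scikit-image's SLIC in place of the Fast SLIC implementation used in the paper. The function and parameter names are illustrative, the inputs are assumed to be NumPy arrays with an image size that is a multiple of the patch size, and the centroid-distance tie-break is omitted for brevity.

```python
import numpy as np
from skimage.segmentation import slic

def align_and_aggregate(image, features, n_segments=400, compactness=30,
                        patch=16, eta_cov=0.85):
    """Coverage-based superpixel/patch alignment (sketch of AlignAndAggregate).

    image:    (H, W, 3) uint8 RGB image; H and W are assumed to be multiples of `patch`.
    features: (h, w, D) DINO patch features with h = H // patch and w = W // patch.
    Returns the superpixel label map (H, W) and one feature per superpixel (K, D).
    """
    H, W, _ = image.shape
    h, w, D = features.shape
    labels = slic(image, n_segments=n_segments, compactness=compactness, start_label=0)

    # Linear patch index of every pixel on the original image grid.
    patch_idx = (np.arange(H)[:, None] // patch) * w + (np.arange(W)[None, :] // patch)
    flat_feats = features.reshape(-1, D)
    patch_area = patch * patch

    sp_feats = np.zeros((labels.max() + 1, D), dtype=flat_feats.dtype)
    for k in range(labels.max() + 1):
        # Intersection area (in pixels) between superpixel k and every patch.
        inter = np.bincount(patch_idx[labels == k].ravel(), minlength=h * w)
        encompassed = np.where(inter / patch_area >= eta_cov)[0]
        if len(encompassed) > 0:
            # Tier 1: unweighted mean over fully encompassed patches.
            sp_feats[k] = flat_feats[encompassed].mean(axis=0)
        else:
            # Tier 2: fall back to the single patch with maximum overlap.
            sp_feats[k] = flat_feats[inter.argmax()]
    return labels, sp_feats
```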

3.4.2. Similarity Matching and Adaptive Thresholding

The semantic similarity between each superpixel feature $v_k$ and the current prototype $p_{t-1}$ is measured using cosine similarity, where $p_{t-1}$ represents the prototype from the previous update step. The similarity score $c_k$ is given by:
$$c_k = \mathrm{sim}(v_k, p_{t-1}) = \frac{v_k \cdot p_{t-1}}{\lVert v_k \rVert \, \lVert p_{t-1} \rVert}$$
To binarize the similarity map into a final segmentation mask, a global threshold $\tau$ must be determined. Using a fixed, predefined threshold is brittle and fails to adapt to the changing score distributions of different scenes. We therefore employ an adaptive thresholding scheme based on Otsu’s method [23], which automatically finds an optimal threshold by maximizing the between-class variance of the similarity score histogram. To prepare the scores for thresholding, the similarity distribution $\{c_k\}$ is first normalized using min–max normalization to scale values between 0 and 1, ensuring consistency across varying ranges in different scenes. Otsu’s method is then applied to this normalized distribution. However, the standard Otsu method assumes a well-separated bimodal distribution and can perform poorly when the histogram exhibits an imbalanced bimodal pattern or when the valley between the foreground and background is shallow or shifted, scenarios that are common in scenes with dominant background regions or subtle object boundaries. To enhance robustness, we utilize a valley-emphasis variant of Otsu’s method. This approach modifies the Otsu criterion to assign greater weight to threshold values that lie in low-density regions (valleys) of the histogram, thereby favoring a more natural separation point and yielding a more stable and reliable threshold, $\tau_{\mathrm{otsu}}$.
Finally, the complete binary segmentation mask for the query image, $M_q$, is generated by classifying each superpixel based on the adaptive threshold:
$$M_q(s_k) = \begin{cases} 1 & \text{if } c_k \geq \tau_{\mathrm{otsu}} \\ 0 & \text{otherwise} \end{cases}$$
where a value of 1 indicates a traversable region.
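A short sketch of this matching-and-thresholding step is shown below. The valley-emphasis criterion follows Ng's formulation of weighting the standard Otsu objective by $(1 - p_i)$, where $p_i$ is the histogram probability at the candidate threshold; the function names and the 256-bin histogram are illustrative choices.

```python
import numpy as np

def valley_emphasis_otsu(scores, bins=256):
    """Valley-emphasis Otsu threshold over min-max normalised similarity scores."""
    s = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)
    hist, edges = np.histogram(s, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])

    best_t, best_val = 0.5, -np.inf
    for i in range(1, bins):
        w0, w1 = p[:i].sum(), p[i:].sum()
        if w0 == 0.0 or w1 == 0.0:
            continue
        mu0 = (p[:i] * centers[:i]).sum() / w0
        mu1 = (p[i:] * centers[i:]).sum() / w1
        # Standard Otsu objective weighted by (1 - p_i): low-density valleys are favoured.
        val = (1.0 - p[i]) * (w0 * mu0 ** 2 + w1 * mu1 ** 2)
        if val > best_val:
            best_val, best_t = val, centers[i]
    return best_t, s

def classify_superpixels(sp_feats, prototype):
    """Cosine similarity to the prototype, then adaptive binarisation per superpixel."""
    sim = sp_feats @ prototype / (
        np.linalg.norm(sp_feats, axis=1) * np.linalg.norm(prototype) + 1e-12)
    tau, sim_norm = valley_emphasis_otsu(sim)
    return (sim_norm >= tau).astype(np.uint8)       # 1 = traversable superpixel
```

The per-superpixel decisions can be broadcast back to pixel space with the SLIC label map from the previous sketch, e.g. `pixel_mask = sp_mask[labels]`.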

3.5. Online Prototype Adaptation via Trajectory-Guided Self-Supervision

3.5.1. Self-Supervision via Retrospective Footprint Projection

The core of our online adaptation loop is the automated generation of dense, pixel-accurate supervision masks from the robot’s own physical experience. This embodied ground truth provides an unambiguous and highly reliable traversability label. A key challenge, however, is the inherent sensor blind spot, which prevents the onboard camera from observing the ground currently being traversed. To overcome this, we propose a retrospective footprint projection mechanism that leverages the trajectory memory to project a recently traversed path onto an optimal past viewpoint, thereby generating a high-quality training mask. Figure 3 provides an overview of this online adaptation mechanism, showing how the robot’s physical interaction with the environment drives continuous prototype refinement.
The system maintains a trajectory memory, implemented as a first-in, first-out (FIFO) queue of recent data nodes with maximum length $L$. Each node $N_i$ in the trajectory memory contains:
  • Timestep $i$
  • Patch-level feature map $F_i$ extracted from the corresponding image
  • Odometry data encoding the robot’s motion
  • Camera intrinsic parameters $K$
  • Camera extrinsic parameters, specifically the translation $T_i$
The distance between consecutive nodes in the trajectory memory is maintained at $d$, ensuring uniform spatial sampling of the robot’s trajectory. This trajectory memory structure is visualized in the middle panel of Figure 3.
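A minimal sketch of this memory structure follows; the field and function names are illustrative assumptions rather than the released implementation.

```python
from collections import deque
from dataclasses import dataclass
import numpy as np

@dataclass
class TrajectoryNode:
    """One cached observation in the trajectory memory."""
    timestep: int
    features: np.ndarray   # patch-level feature map F_i, shape (h, w, D)
    odometry: np.ndarray   # robot pose/motion associated with this node
    K: np.ndarray          # 3x3 camera intrinsic matrix
    T: np.ndarray          # camera extrinsics (translation) for this node

# FIFO buffer of at most L nodes; nodes are cached every d metres of travel.
L, d = 30, 0.1
trajectory_memory: deque = deque(maxlen=L)

def maybe_cache(node: TrajectoryNode, dist_since_last_node: float) -> bool:
    """Append a new node only after the robot has travelled d metres since the last one."""
    if dist_since_last_node >= d:
        trajectory_memory.append(node)   # the oldest node is evicted automatically at maxlen
        return True
    return False
```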
The prototype update process is triggered when the accumulated distance between the current position and the last update node reaches a predefined threshold $d_{\mathrm{extract}}$. Nodes that satisfy the $d_{\mathrm{extract}}$ criterion are designated as update nodes, and their corresponding timesteps are denoted using the subscript $t$. Note that the symbol $t$ is reserved exclusively for indexing these update nodes, distinguishing them from regular data nodes indexed by $i$.
When an update is triggered at update node $t$, the system performs the following steps. First, a 3D volumetric “footprint” of the recently traversed path is reconstructed from the robot’s odometry data, representing the space swept by the robot’s chassis according to its known physical dimensions. To ensure a high-fidelity projection free from distortion, an appropriate historical viewpoint is selected from the trajectory memory by evaluating the proportion of valid projection points. Specifically, a historical viewpoint is deemed acceptable if the ratio of footprint points that successfully project into the image frame exceeds a predefined threshold $\rho_{\mathrm{proj}}$, ensuring sufficient visibility and geometric validity. Finally, the 3D footprint volume is projected onto the selected historical image using its corresponding camera parameters ($K$, $T_i$). This renders a dense, binary supervision mask, $M_{\mathrm{footprint}}$, where pixels covered by the footprint are labeled as traversable (1), providing the direct ground-truth signal for prototype adaptation.
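The projection itself is standard pinhole geometry. Below is a hedged sketch in which the footprint is represented as a dense set of sampled 3D points and the viewpoint pose is assumed to be available as a full 4 × 4 world-to-camera transform; the paper stores intrinsics and extrinsics per node, so the exact parameterization and the rasterization of the swept volume into a dense mask are simplified here.

```python
import numpy as np

def project_footprint(points_world, K, T_wc, image_shape, rho_proj=0.95):
    """Project sampled 3D footprint points into a candidate historical view (sketch).

    points_world: (N, 3) points densely sampled from the robot's swept chassis volume.
    K:            (3, 3) camera intrinsics of the candidate viewpoint.
    T_wc:         (4, 4) world-to-camera transform of the candidate viewpoint (assumed form).
    Returns (mask, accepted): an (H, W) binary supervision mask M_footprint and a flag
    indicating whether the viewpoint passes the rho_proj validity check.
    """
    H, W = image_shape
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])  # homogeneous coords
    pts_cam = (T_wc @ pts_h.T).T[:, :3]

    in_front = pts_cam[:, 2] > 0                         # keep points in front of the camera
    uv = (K @ pts_cam[in_front].T).T
    uv = uv[:, :2] / uv[:, 2:3]                          # perspective division

    in_image = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    uv_valid = uv[in_image]
    valid_ratio = len(uv_valid) / max(len(points_world), 1)
    accepted = valid_ratio >= rho_proj

    mask = np.zeros((H, W), dtype=np.uint8)
    if accepted:
        # Mark projected points; a full implementation would rasterise the swept region densely.
        mask[uv_valid[:, 1].astype(int), uv_valid[:, 0].astype(int)] = 1
    return mask, accepted
```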

3.5.2. Prototype Update with Exponential Moving Average

The newly generated supervision pair, consisting of the historical patch-level feature map $F_t \in \mathbb{R}^{h \times w \times D}$ and the pixel-level footprint mask $M_{\mathrm{footprint}} \in \{0,1\}^{H \times W}$, is used to update the global traversability prototype from $p_{t-1}$ to $p_t$ at each update step indexed by $t$. To align the pixel-level mask with the patch-level features, we downsample $M_{\mathrm{footprint}}$ to obtain a patch-level mask $M_{\mathrm{patch}} \in \{0,1\}^{h \times w}$, where each patch is labeled positive if the majority of its pixels are traversable.
Specifically, a feature vector representing the newly observed traversable terrain, $f_{\mathrm{new}} \in \mathbb{R}^{D}$, is extracted from $F_t$ by performing masked average pooling over the features corresponding to the positive regions of the patch-level mask:
$$f_{\mathrm{new}} = \frac{1}{|M_{\mathrm{patch}}^{+}|} \sum_{(i,j) \in M_{\mathrm{patch}}^{+}} F_t(i,j)$$
where $M_{\mathrm{patch}}^{+}$ denotes the set of patch coordinates where $M_{\mathrm{patch}} = 1$, and $|M_{\mathrm{patch}}^{+}|$ is the cardinality of this set.
The naive approach would directly replace the old prototype $p_{t-1}$ with the new evidence vector $f_{\mathrm{new}}$. This would, however, cause abrupt changes, leading to unstable and oscillatory adaptation. To ensure a smooth and stable learning process, the prototype is updated using the EMA as follows:
$$p_t = \alpha \, p_{t-1} + (1 - \alpha) \, f_{\mathrm{new}}$$
where $\alpha \in [0, 1]$ is a momentum coefficient that controls the adaptation rate. The complete online adaptation pipeline, from automatic annotation to EMA-based prototype update, is depicted in the bottom panel of Figure 3.
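Combining the majority-vote downsampling, masked average pooling, and EMA update yields a very small routine. The sketch below assumes the image size is an exact multiple of the 16-pixel patch size and uses illustrative names.

```python
import numpy as np

def update_prototype(prototype, features, footprint_mask, alpha=0.9, patch=16):
    """EMA prototype update from a projected footprint mask (illustrative sketch).

    prototype:      (D,) current prototype p_{t-1}.
    features:       (h, w, D) historical patch-level feature map F_t.
    footprint_mask: (H, W) binary mask M_footprint with H = h * patch and W = w * patch.
    """
    h, w, D = features.shape
    # Majority-vote downsampling of the pixel mask to the patch grid (M_patch).
    patch_mask = footprint_mask.reshape(h, patch, w, patch).mean(axis=(1, 3)) > 0.5
    if not patch_mask.any():
        return prototype                        # no positive patches: keep p_{t-1}
    f_new = features[patch_mask].mean(axis=0)   # masked average pooling -> f_new
    return alpha * prototype + (1.0 - alpha) * f_new
```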

4. Experiments

4.1. Dataset

4.1.1. Reference-Guided Traversability Segmentation Dataset (RTSD)

To evaluate our reference-guided traversability segmentation framework, we introduce the RTSD. This dataset is designed for reference-guided analysis, with annotations designating traversability relative to a single reference terrain class per RGB image, incorporating automated footprint supervision for one-shot adaptation. RTSD contains 1225 RGB images (640 × 480) from 10 diverse outdoor scenes, encompassing five terrain types: asphalt, brick, concrete, dirt, and grass.
Data was collected using a ROS2-based mobile robot equipped with an Intel RealSense D435 camera (Intel Corporation, Santa Clara, CA, USA) (Figure 4a). We employ an automated footprint annotation generation algorithm (Section 3.5.1) to create pixel-level masks from the robot’s historical trajectory data, avoiding manual labeling. This process reconstructs the robot’s geometric footprint from trajectory data and projects it onto the camera’s viewpoint to generate a binary traversability signal, which is refined into the final masks. Each image is accompanied by camera intrinsics, reference terrain images, and pixel-level masks categorizing terrain as traversable, traversable unknown, or unreachable (Figure 4b).
Unlike existing datasets like RUGD [24] or ORFD [25], RTSD is specifically tailored to our framework. RUGD lacks the reference samples necessary for our one-shot learning method, while ORFD lacks the trajectory-based ground truth essential for training-free adaptation. RTSD’s trajectory-based ground truth generation provides a scalable and physically-grounded approach, making it an ideal resource for benchmarking reference-guided navigation systems.

4.1.2. Off-Road Freespace Detection (ORFD)

To validate our proposed method in challenging off-road scenarios, we conduct experiments on the ORFD dataset. The ORFD [25] is a public benchmark specifically designed for off-road freespace detection. It comprises 12,198 synchronized LiDAR point cloud and RGB image pairs gathered from diverse and unstructured environments, including woodland, farmland, grassland, and countryside.
The data collection spans various weather conditions (sunny, rainy, foggy, snowy) and lighting conditions (daylight, twilight, darkness) to ensure environmental diversity. For the freespace detection task, each image is accompanied by pixel-wise annotations that categorize the scene into traversable, non-traversable, and unreachable areas. We follow the standard data split provided by the authors for training, validation, and testing.

4.2. Implementation Details

To ensure the reproducibility of our results, we provide comprehensive implementation details of the TARTS framework. Our system is implemented in Python 3.10 using the PyTorch library. We employ a ViT-S/16 pre-trained with the DINOv3 framework [12] as our semantic feature extraction backbone, which operates as a fixed feature extractor without any fine-tuning. For each input image with a resolution of 640 × 480, the ViT encoder produces a dense feature map with a feature dimensionality of 384, corresponding to the ViT-S architecture.
In the real-time segmentation inference stage, we partition each input image into $K = 400$ superpixels using the Fast SLIC algorithm (https://github.com/Algy/fast-slic, accessed on 9 July 2025) with a compactness parameter of 30, which provides an effective balance between spatial coherence and computational efficiency for feature aggregation. For the online prototype adaptation loop, the trajectory memory is implemented as a FIFO queue with a maximum length of $L = 30$ nodes, ensuring a memory buffer of approximately 3 m of travel history, as nodes are cached at 0.1-m spatial intervals. The prototype update is triggered when the robot travels an accumulated distance of $d_{\mathrm{extract}} = 0.5$ m. During footprint projection, a historical viewpoint is selected if the ratio of valid projection points exceeds $\rho_{\mathrm{proj}} = 0.95$, ensuring high-fidelity supervision mask generation with minimal geometric distortion. The momentum coefficient for the EMA update was empirically set to $\alpha = 0.9$, which provides a balance between stable learning and rapid adaptation to new terrain appearances.
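For readability, these hyperparameters can be grouped into a single configuration object; the dataclass below simply collects the values reported above (the class and field names are illustrative, not from the released code).

```python
from dataclasses import dataclass

@dataclass
class TartsConfig:
    """Hyperparameters reported in this section (field names are illustrative)."""
    n_superpixels: int = 400     # K, Fast SLIC segments per frame
    slic_compactness: float = 30.0
    memory_length: int = 30      # L, maximum nodes in the trajectory FIFO queue
    node_spacing_m: float = 0.1  # d, metres between consecutive cached nodes
    d_extract_m: float = 0.5     # travel distance that triggers a prototype update
    rho_proj: float = 0.95       # minimum valid-projection ratio for a viewpoint
    ema_alpha: float = 0.9       # EMA momentum coefficient
    eta_cov: float = 0.85        # coverage threshold for "fully encompassed" patches
```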
For experiments on the ORFD dataset, which lacks footprint trajectory annotations, we adopt an alternative supervision strategy. Specifically, we randomly sample 10 to 30 prototype features from the ground-truth traversable regions in the segmentation mask labels. The prototype update is performed every 3 frames to predict subsequent frames, enabling continuous adaptation without trajectory-based supervision. Dataset evaluation experiments were conducted on a workstation equipped with an NVIDIA RTX 4070 Ti SUPER GPU. Real-time computational performance benchmarking was performed on an NVIDIA Jetson Orin NX 16GB embedded platform to assess deployment feasibility on resource-constrained mobile robotic systems.

4.3. Quantitative Results

4.3.1. Comparison to Baseline and SOTA

The quantitative evaluation results are presented in Table 1 and Table 2.
Table 1 compares our full TARTS (with online adaptation) method with its baseline variant, TARTS- (without online adaptation). The results show that the full TARTS model consistently surpasses the baseline across both scenarios. On RTSD, TARTS achieves improvements of 1.5%, 0.7%, and 1.2% in Precision, F-score, and IoU. This enhanced precision is crucial for navigation, signifying more reliable terrain predictions and ensuring robot safety. The strong performance on the entire ORFD-All dataset, which encompasses a wide variety of challenging off-road conditions, further underscores the method’s robustness.
Table 2 presents a comparative analysis against state-of-the-art (SOTA) methods on the ORFD-Test benchmark. It is noteworthy that while all SOTA methods require a conventional training phase on a dedicated training set, TARTS operates in a training-free manner, utilizing a single reference terrain image from the test set. Despite this, our method outperforms all listed competitors, including the recent RoadFormer [6], achieving a 1.3% increase in Precision, a 0.9% increase in F-score, and a 1.6% increase in IoU over the strongest baseline. This superior performance is achieved using only RGB data, whereas many competing methods depend on multi-modal inputs like sparse depth or surface normals. This result highlights the efficiency and effectiveness of our approach, demonstrating its broad applicability in scenarios limited to monocular camera data.

4.3.2. Threshold Selection Strategy Ablation

Table 3 demonstrates the critical impact of the threshold strategy. We evaluated four approaches: median-based, mean-based, standard Otsu, and valley-emphasis Otsu. The median-based strategy, while achieving high recall, suffers from severely compromised precision (∼72%) by systematically over-classifying regions. The mean-based strategy offers substantial improvement (91.0% IoU). Notably, the standard Otsu method yields inconsistent performance: it underperforms the mean-based approach on TARTS- (90.2% IoU) but is highly effective on TARTS (94.2% IoU). In contrast, the valley-emphasis Otsu method consistently delivers the most robust and superior performance across both configurations (93.3% IoU for TARTS- and 94.5% IoU for TARTS). Its principle of emphasizing natural distribution boundaries proves more stable than standard Otsu, validating it as the optimal choice.

4.3.3. Feature-Superpixel Alignment Strategy Ablation

To validate our patch-level feature alignment strategy, we conduct an ablation study comparing it against the common bilinear interpolation approach [5]. We evaluate both alignment strategies on the RTSD and ORFD-All datasets under two configurations: with online prototype adaptation (TARTS) and without (TARTS-), to isolate the impact of the alignment choice.
The results in Table 4 demonstrate that our patch-level alignment strategy consistently outperforms bilinear interpolation across both datasets and adaptation configurations. This performance advantage is attributed to the characteristics of DINO features, which lack fine-grained spatial consistency at the sub-patch level [12,22]. Bilinear interpolation synthetically blends these features, introducing sub-patch spatial artifacts and diluting discriminative semantic information. In contrast, our proposed strategy preserves the integrity of the original features by operating at the native patch-level granularity, thereby maintaining the semantic fidelity of DINO features and avoiding interpolation-induced degradation. This design choice results in consistently higher precision and more balanced precision–recall characteristics across diverse scenarios, validating the effectiveness of our alignment strategy.

4.4. Qualitative Results

Figure 5 presents a qualitative comparison between TARTS (with online adaptation) and TARTS- (without) on the ORFD and RTSD datasets. The color-coded visualizations (explained in the caption) highlight the performance differences in challenging scenarios.
TARTS consistently demonstrates enhanced boundary precision and more accurate, conservative traversable region estimation. Compared to the baseline, TARTS exhibits substantially reduced false positives (red regions), particularly in vegetated areas and along transition zones like road edges. This improvement is primarily attributed to the online-adapted prototype in TARTS, which more effectively discriminates between traversable surfaces and surrounding non-traversable obstacles. Across all scenarios, TARTS achieves superior precision, demonstrating that trajectory-guided online adaptation effectively refines the traversability prototype. These qualitative results provide compelling visual evidence corroborating the quantitative improvements presented in Table 1.

4.5. Computational Performance

We evaluate the computational efficiency of TARTS on an NVIDIA Jetson Orin NX platform. The DINOv3 ViT-S/16 backbone is accelerated using TensorRT with INT8/FP16 mixed-precision. The inference pipeline is decomposed into three stages: (1) parallel feature extraction and superpixel segmentation, (2) superpixel-level feature matching, and (3) mask refinement. Note that the online prototype adaptation module is implemented as a separate ROS node and operates asynchronously with negligible computational overhead, thus not affecting the segmentation inference latency reported below.
As shown in Table 5, TARTS achieves end-to-end latencies between 41.52 ms and 57.53 ms (17.4 to 24.1 FPS) across resolutions from 288 × 288 to 480 × 480 . These results demonstrate real-time performance, exceeding the typical 10 FPS requirement for mobile robot navigation. Stage 1 (feature extraction/segmentation) constitutes the main computational bottleneck, while Stages 2 and 3 remain efficient across resolutions. Overall, TARTS demonstrates efficient performance suitable for resource-constrained mobile platforms. Notably, unlike all comparison methods in Table 2, which require a dedicated training phase on large-scale annotated datasets and often depend on additional sensor modalities (e.g., surface normals or sparse depth), TARTS eliminates training overhead entirely and operates solely on RGB input, significantly reducing the total computational and data acquisition costs for deployment.
To address the impact of the trajectory memory length $L$ on system overhead, we analyze the memory and computational scaling of the footprint projection module. The memory overhead of the FIFO queue scales strictly linearly as $O(L \cdot h \cdot w \cdot D)$. Because TARTS caches the low-resolution semantic feature maps rather than the high-resolution RGB frames, the memory footprint remains small. For the ViT-S/16 backbone operating on a 640 × 480 input, a single cached feature map requires approximately 1.84 MB of memory. A standard trajectory length of $L = 30$ consumes roughly 55 MB of RAM, which leaves a negligible footprint on resource-constrained embedded platforms.
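As a quick sanity check of these figures, assuming 32-bit floats and the $40 \times 30$ patch grid that ViT-S/16 produces for a $640 \times 480$ input ($D = 384$):
$$40 \times 30 \times 384 \times 4\,\mathrm{B} \approx 1.84\,\mathrm{MB\ per\ node}, \qquad 30 \times 1.84\,\mathrm{MB} \approx 55\,\mathrm{MB}.$$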
Computationally, the retrospective projection involves evaluating the projection validity ratio $\rho_{\mathrm{proj}}$ across historical nodes to locate an optimal, low-distortion viewpoint. While increasing $L$ linearly expands this search space, the computational complexity of the actual projection geometry, $O(|V_{\mathrm{footprint}}|)$, is bounded by the constant spatial density of the robot’s 3D volumetric footprint and is independent of $L$. Empirical validation on the Jetson Orin NX confirms that extending $L$ from 30 to 150 (thereby expanding the memory horizon to 15 m) increases the asynchronous online adaptation search latency by less than 4.5 ms, ensuring that the main real-time inference thread remains unaffected.

4.6. Analysis of Typical Failure Cases and Environmental Boundaries

While TARTS demonstrates robust, state-of-the-art performance across the diverse environments of the RTSD and ORFD datasets, an analysis of erroneous predictions reveals specific environmental boundary conditions where the framework consistently underperforms. Identifying these failure modes provides critical insight into the inherent limitations of relying solely on monocular semantic foundation features without orthogonal geometric validation. We group these failure cases into three categories:

4.6.1. Reflective and Dynamic Surfaces (e.g., Water, Sheet Ice)

The framework exhibits a systemic reduction in precision when encountering large bodies of water or severe environmental pooling. Because the DINO encoder relies heavily on visual context, specular reflections of the sky, overhanging trees, or infrastructure dominate the patch-level feature extraction. Consequently, the semantic representation shifts away from the target “ground” manifold and toward the reflected objects, causing the cosine similarity score $c_k$ to drop precipitously and generating broad false negatives. Furthermore, the dynamic textures introduced by ripples disrupt the gradient-based spatial clustering of the SLIC algorithm, leading to fragmented superpixels that poorly align with the true physical boundaries of the water.

4.6.2. Semantic–Geometric Ambiguity

TARTS relies fundamentally on semantic differentiation. If a physical obstacle has the exact same material composition, color, and texture as the traversable ground—such as a steep, unpainted concrete curb merging seamlessly with a concrete sidewalk, or an asphalt road meeting an identical asphalt retaining wall—the DINO features may cluster them into an identical semantic manifold. Because the current iteration of TARTS omits multi-modal depth or surface normal inputs, these purely geometric obstacles lack the semantic variance required for segmentation, resulting in highly localized false positives where the robot attempts to traverse vertical planes.

4.6.3. Extreme Photometric Degradation

In scenarios of severe sensor overexposure (e.g., direct lens glare) or extreme under-exposure (e.g., unlit, deep urban shadows at night), the high-frequency visual information required by the ViT encoder is fundamentally destroyed. Under these conditions, the DINO features collapse into uniform, uninformative latent vectors. The adaptive valley-emphasis Otsu thresholding algorithm subsequently fails to identify a meaningful bimodal separation point in the similarity histogram, occasionally leading to broad classification artifacts where shadowed walls are indistinguishable from shadowed roads.

5. Discussion

Building upon the experimental findings in Section 4, which demonstrate that TARTS achieves competitive or superior performance compared to fully supervised methods while maintaining real-time efficiency, we discuss the broader design insights, limitations, and potential extensions of our framework.
The architectural design of TARTS is fundamentally motivated by decoupling semantic recognition from fine-grained spatial localization—a design choice that addresses inherent limitations in pre-trained vision foundation models while maintaining computational efficiency. As detailed in Section 3.1, our complementary two-stage processing leverages SLIC for perceptual grouping based on low-level visual cues while preserving semantic integrity through patch-level feature alignment.
This design circumvents the spatial inconsistencies inherent in DINO features (Section 3.2) and avoids interpolation-induced degradation. A key finding from our ablation studies (Section 4.3.3, Table 4) is that patch-level alignment consistently outperforms bilinear interpolation across diverse datasets and adaptation configurations, validating the effectiveness of preserving native patch-level semantic integrity. Meanwhile, SLIC’s parallel execution ensures real-time computational efficiency.
A potential limitation of our framework is the reliance on a single reference image for prototype initialization, which may not fully capture the visual diversity of traversable terrain under varying illumination, weather, or seasonal conditions. However, the online adaptation mechanism (Section 3.5) is specifically designed to mitigate this limitation: the EMA-based prototype update continuously refines the traversability representation using the robot’s own traversal experience, progressively incorporating terrain appearances that differ from the initial reference. Consequently, even if the reference image is not fully representative of the entire traversable region, the prototype converges toward a more comprehensive representation as the robot navigates. In scenarios with drastic appearance changes (e.g., transitioning from sunlit to shadowed areas), the system may exhibit transient performance degradation until sufficient footprint supervision is accumulated. While TARTS is developed and validated on ground mobile robots, the reference-guided paradigm and online adaptation mechanism are generalizable to other autonomous platforms. For unmanned aerial vehicles (UAVs), the framework could be adapted for tasks such as landing zone detection or terrain-aware flight planning, where a reference image of safe landing surfaces initializes the prototype and onboard odometry provides self-supervision signals [30]. More broadly, the principle of combining one-shot initialization with embodied self-supervision is applicable to autonomous systems operating in diverse environments [31], provided that the platform can generate reliable trajectory data for footprint projection. Adapting the footprint projection module to account for different platform kinematics and sensor configurations represents a promising direction for future work.
Beyond terrain traversability, this reference-guided paradigm naturally extends to broader one-shot segmentation tasks involving scene-level or material-level targets, such as material localization in industrial or agricultural settings (e.g., segmenting all corn from a scene given a single reference image, as illustrated in Figure 6). For navigation-centric applications, our adaptive valley-emphasis Otsu thresholding proves robust across diverse terrain conditions. While “target-absent” scenarios may theoretically pose challenges for dynamic thresholding in general-purpose object localization tasks, such situations are inherently rare in mobile robot navigation contexts, where traversable terrain is continuously present in the robot’s field of view. Future extensions to non-navigation domains could explore complementary thresholding strategies or confidence-based filtering to address this edge case.
Furthermore, a critical theoretical assumption underlying the automated footprint supervision module is the static nature of the environment between the moment of visual observation and the moment of physical traversal. In highly dynamic environments, such as off-road scenes with rapidly moving vegetation (e.g., tall grass swaying in heavy wind) or shifting dynamic obstacles, this static-world assumption is severely violated. If a dynamic object occupies the projected footprint space during the historical camera capture but moves before the robot physically traverses that exact coordinate, the retrospective projection will erroneously map positive traversability labels onto the dynamic obstacle. This introduces corrupted semantic features $f_{\mathrm{new}}$ into the EMA update loop, potentially degrading the global prototype $p_t$ over time. Mitigating this limitation in future iterations will require integrating motion priors or real-time optical flow estimation to explicitly mask out highly dynamic pixels prior to footprint projection, ensuring only static, definitively traversed terrain is utilized for self-supervision.

6. Conclusions

In this paper, we presented TARTS, a novel training-free framework for robotic traversability segmentation. Our core contribution is a “plug-and-play” pipeline that initializes from a single reference image and adapts online using automated, trajectory-guided self-supervision. By retrospectively projecting the robot’s physical footprint, the system continuously generates its own supervision signals, allowing a DINO-feature-based prototype to adapt stably via an EMA. This embodied-interaction approach removes the need for manual annotation and offline training. Our experiments, conducted on our new RTSD dataset and the public ORFD benchmark, confirm our method’s efficacy. Notably, our training-free approach outperforms state-of-the-art, fully-supervised methods and achieves real-time performance on an embedded Jetson platform. Moreover, as embedded Jetson platforms and space-grade computing hardware continue to advance, and with its ground-based performance already validated, TARTS shows considerable potential for deep space exploration and for scenarios involving interference with low-Earth-orbit satellites, owing to its complex semantic recognition and its decoupling of semantic recognition from high-precision spatial localization. Future work could explore multi-class prototype management and the integration of negative supervision signals to further enhance robustness.

Author Contributions

Conceptualization, S.S. and L.Z.; methodology, S.S. and L.Z.; software, S.S.; validation, S.S.; formal analysis, S.S.; investigation, S.S.; resources, L.Z.; data curation, S.S.; writing—original draft preparation, S.S.; writing—review and editing, L.Z.; visualization, S.S.; supervision, L.Z.; project administration, L.Z.; funding acquisition, L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Research on Intelligent Situation Assessment Method for Multi-Source Navigation and Positioning Based on Deep Learning, grant number E5E2140201.

Data Availability Statement

Data are contained within the article.

Acknowledgments

We are grateful to the referees for their constructive suggestions to improve the manuscript. This work was supported by the State Key Laboratory of Satellite Navigation System and Equipment Technology with Grant No. CEPNT2023B10.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
TARTS: Training-free Adaptive Reference-guided Traversability Segmentation
ViT: Vision Transformer
EMA: Exponential Moving Average
ORFD: Off-Road Freespace Detection
RTSD: Reference-guided Traversability Segmentation Dataset

References

  1. Kim, Y.; Lee, J.H.; Lee, C.; Mun, J.; Youm, D.; Park, J.; Hwangbo, J. Learning Semantic Traversability with Egocentric Video and Automated Annotation Strategy. IEEE Robot. Autom. Lett. 2024, 9, 10423–10430. [Google Scholar] [CrossRef]
  2. Fankhauser, P.; Hutter, M. A universal grid map library: Implementation and use case for rough terrain navigation. In Robot Operating System (ROS): The Complete Reference (Volume 1); Springer: Berlin/Heidelberg, Germany, 2016; pp. 99–120. [Google Scholar]
  3. Papadakis, P. Terrain traversability analysis methods for unmanned ground vehicles: A survey. Eng. Appl. Artif. Intell. 2013, 26, 1373–1385. [Google Scholar] [CrossRef]
  4. Ægidius, S.; Hadjivelichkov, D.; Jiao, J.; Embley-Riches, J.; Kanoulas, D. Watch your stepp: Semantic traversability estimation using pose projected features. In 2025 IEEE International Conference on Robotics and Automation (ICRA); IEEE: New York, NY, USA, 2025; pp. 2376–2382. [Google Scholar]
  5. Mattamala, M.; Frey, J.; Libera, P.; Chebrolu, N.; Martius, G.; Cadena, C.; Hutter, M.; Fallon, M. Wild visual navigation: Fast traversability learning via pre-trained models and online self-supervision. Auton. Robot. 2025, 49, 19. [Google Scholar] [CrossRef]
  6. Li, J.; Zhang, Y.; Yun, P.; Zhou, G.; Chen, Q.; Fan, R. RoadFormer: Duplex transformer for RGB-normal semantic road scene parsing. IEEE Trans. Intell. Veh. 2024, 9, 5163–5172. [Google Scholar] [CrossRef]
  7. Liao, Y.; Kang, S.; Li, J.; Liu, Y.; Liu, Y.; Dong, Z.; Yang, B.; Chen, X. Mobile-seed: Joint semantic segmentation and boundary detection for mobile robots. IEEE Robot. Autom. Lett. 2024, 9, 3902–3909. [Google Scholar] [CrossRef]
  8. Guan, T.; Kothandaraman, D.; Chandra, R.; Sathyamoorthy, A.J.; Weerakoon, K.; Manocha, D. Ga-nav: Efficient terrain segmentation for robot navigation in unstructured outdoor environments. IEEE Robot. Autom. Lett. 2022, 7, 8138–8145. [Google Scholar] [CrossRef]
  9. Zhang, Y.; Yin, M.; Bi, W.; Yan, H.; Bian, S.; Zhang, C.H.; Hua, C. ZISVFM: Zero-Shot Object Instance Segmentation in Indoor Robotic Environments with Vision Foundation Models. IEEE Trans. Robot. 2025, 41, 1568–1580. [Google Scholar] [CrossRef]
  10. Wellhausen, L.; Dosovitskiy, A.; Ranftl, R.; Walas, K.; Cadena, C.; Hutter, M. Where should i walk? predicting terrain properties from images via self-supervised learning. IEEE Robot. Autom. Lett. 2019, 4, 1509–1516. [Google Scholar] [CrossRef]
  11. Gasparino, M.V.; Sivakumar, A.N.; Chowdhary, G. Wayfaster: A self-supervised traversability prediction for increased navigation awareness. In 2024 IEEE International Conference on Robotics and Automation (ICRA); IEEE: New York, NY, USA, 2024; pp. 8486–8492. [Google Scholar]
  12. Siméoni, O.; Vo, H.V.; Seitzer, M.; Baldassarre, F.; Oquab, M.; Jose, C.; Khalidov, V.; Szafraniec, M.; Yi, S.; Ramamonjisoa, M.; et al. Dinov3. arXiv 2025, arXiv:2508.10104. [Google Scholar] [PubMed]
  13. Ng, H.F. Automatic thresholding for defect detection. Pattern Recognit. Lett. 2006, 27, 1644–1649. [Google Scholar] [CrossRef]
  14. Ewen, P.; Li, A.; Chen, Y.; Hong, S.; Vasudevan, R. These maps are made for walking: Real-time terrain property estimation for mobile robots. IEEE Robot. Autom. Lett. 2022, 7, 7083–7090. [Google Scholar] [CrossRef]
15. Jung, S.; Lee, J.; Meng, X.; Boots, B.; Lambert, A. V-STRONG: Visual self-supervised traversability learning for off-road navigation. In 2024 IEEE International Conference on Robotics and Automation (ICRA); IEEE: New York, NY, USA, 2024; pp. 1766–1773.
16. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning robust visual features without supervision. arXiv 2024, arXiv:2304.07193.
17. Zürn, J.; Burgard, W.; Valada, A. Self-supervised visual terrain classification from unsupervised acoustic feature learning. IEEE Trans. Robot. 2020, 37, 466–481.
18. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2023; pp. 4015–4026.
19. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations, Virtual, 3–7 May 2021.
20. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2021; pp. 9650–9660.
21. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763.
22. Pariza, V.; Salehi, M.; Burghouts, G.; Locatello, F.; Asano, Y.M. NeCo: Improving DINOv2's spatial representations in 19 GPU hours with Patch Neighbor Consistency. arXiv 2024, arXiv:2408.11054v1.
23. Otsu, N. A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 1979, 9, 62–66.
24. Wigness, M.; Eum, S.; Rogers, J.G.; Han, D.; Kwon, H. A RUGD dataset for autonomous navigation and visual perception in unstructured outdoor environments. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE: New York, NY, USA, 2019; pp. 5000–5007.
25. Min, C.; Jiang, W.; Zhao, D.; Xu, J.; Xiao, L.; Nie, Y.; Dai, B. ORFD: A dataset and benchmark for off-road freespace detection. In 2022 International Conference on Robotics and Automation (ICRA); IEEE: New York, NY, USA, 2022; pp. 2532–2538.
26. Hazirbas, C.; Ma, L.; Domokos, C.; Cremers, D. FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture. In Proceedings of the Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 213–228.
27. Fan, R.; Wang, H.; Cai, P.; Liu, M. SNE-RoadSeg: Incorporating surface normal information into semantic segmentation for accurate freespace detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 340–356.
28. Sun, Y.; Zuo, W.; Liu, M. RTFNet: RGB-thermal fusion network for semantic segmentation of urban scenes. IEEE Robot. Autom. Lett. 2019, 4, 2576–2583.
29. Ha, Q.; Watanabe, K.; Karasawa, T.; Ushiku, Y.; Harada, T. MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE: New York, NY, USA, 2017; pp. 5108–5115.
30. Jin, Z.; Li, H.; Qin, Z.; Wang, Z. Gradient-free cooperative source-seeking of quadrotor under disturbances and communication constraints. IEEE Trans. Ind. Electron. 2024, 72, 1969–1979.
31. Jin, Z. Global asymptotic stability analysis for autonomous optimization. IEEE Trans. Autom. Control 2025, 70, 6953–6960.
Figure 1. Flowchart of the TARTS method.
Figure 2. The proposed traversability segmentation inference pipeline. All notations (F_i, p_{t-1}, v_k, c_k, etc.) are formally defined in Section 3.2, Section 3.3, Section 3.4 and Section 3.5.
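As a reading aid for the pipeline in Figure 2, the following minimal sketch shows how per-superpixel features could be scored against the current prototype and broadcast back to a pixel mask. It assumes cosine similarity is the matching score and that features are plain NumPy arrays; the function and variable names (traversability_scores, scores_to_mask) are illustrative and do not reproduce the authors' implementation, whose exact definitions are given in Section 3.

```python
import numpy as np

def traversability_scores(superpixel_feats: np.ndarray, prototype: np.ndarray) -> np.ndarray:
    """Cosine similarity between each superpixel feature v_k and the prototype p_{t-1}.

    superpixel_feats: (K, D) array of aggregated ViT features, one row per superpixel.
    prototype: (D,) reference prototype vector.
    Returns a (K,) array of similarity scores in [-1, 1].
    """
    v = superpixel_feats / (np.linalg.norm(superpixel_feats, axis=1, keepdims=True) + 1e-8)
    p = prototype / (np.linalg.norm(prototype) + 1e-8)
    return v @ p

def scores_to_mask(scores: np.ndarray, labels: np.ndarray, threshold: float) -> np.ndarray:
    """Broadcast per-superpixel decisions back to pixels via the superpixel label map."""
    traversable = scores >= threshold   # (K,) boolean decision per superpixel
    return traversable[labels]          # (H, W) boolean traversability mask
```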
Figure 3. Online prototype adaptation through trajectory-guided self-supervision.
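To illustrate the adaptation step depicted in Figure 3, a minimal sketch of an EMA prototype update driven by footprint features is given below. It assumes the footprint features are mean-pooled before the update and uses an illustrative momentum value; the exact update rule and coefficient used by TARTS are those defined in the method section.

```python
import numpy as np

def ema_prototype_update(prev_prototype: np.ndarray,
                         footprint_feats: np.ndarray,
                         momentum: float = 0.95) -> np.ndarray:
    """EMA refinement of the traversability prototype from footprint supervision.

    prev_prototype: (D,) prototype p_{t-1}.
    footprint_feats: (N, D) features sampled from regions the robot actually traversed.
    momentum: EMA coefficient (illustrative value, not taken from the paper).
    """
    footprint_mean = footprint_feats.mean(axis=0)
    p_t = momentum * prev_prototype + (1.0 - momentum) * footprint_mean
    return p_t / (np.linalg.norm(p_t) + 1e-8)  # keep the prototype unit-norm
```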
Figure 4. RTSD dataset collection and annotation. (a) The data collection platform. (b) Representative dataset samples across several terrain classes.
Figure 5. Qualitative comparison between TARTS and TARTS- on ORFD and RTSD datasets. Each row shows three columns: TARTS (left), TARTS- (middle), and RGB input (right). Green indicates correct predictions, red shows false positives, and yellow represents false negatives. Rows 1–2: ORFD dataset. Rows 3–4: RTSD dataset.
Figure 6. Example of TARTS applied to material localization in agricultural settings, demonstrating the framework’s extensibility beyond terrain traversability to scene-level segmentation tasks.
Table 1. Performance comparison of TARTS and its baseline variant (TARTS-) on RTSD and ORFD-All [25] datasets.

Method | RTSD (P / R / F / IoU) | ORFD-All (P / R / F / IoU)
TARTS- | 93.6 / 99.7 / 96.5 / 93.3 | 93.5 / 96.2 / 94.8 / 90.2
TARTS | 95.1 / 99.3 / 97.2 / 94.5 | 94.4 / 97.0 / 95.7 / 91.7
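The P, R, F, and IoU columns in the result tables are assumed to follow the standard pixel-wise definitions for binary traversability masks. The short sketch below computes them from a predicted mask and a ground-truth mask; the helper name and dictionary keys are illustrative.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Pixel-wise Precision, Recall, F-score and IoU for binary traversability masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    f_score = 2 * precision * recall / (precision + recall + 1e-8)
    iou = tp / (tp + fp + fn + 1e-8)
    return {"P": precision, "R": recall, "F": f_score, "IoU": iou}
```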
Table 2. Comparison with state-of-the-art methods on ORFD-Test [25] dataset. All comparison methods require training on the training set, while our TARTS method operates training-free with only a reference terrain image.

Method | Modality | P | R | F | IoU
FuseNet [26] | RGB + Sparse Depth | 74.5 | 85.2 | 79.5 | 66.0
SNE-RoadSeg [27] | RGB + Surface Normal | 86.7 | 92.7 | 89.6 | 81.2
OFF-Net [25] | RGB + Surface Normal | 86.6 | 94.3 | 90.3 | 82.3
RTFNet [28] | RGB + Surface Normal | 93.8 | 96.5 | 95.1 | 90.7
MFNet [29] | RGB + Surface Normal | 89.6 | 90.3 | 89.9 | 81.7
RoadFormer [6] | RGB + Surface Normal | 95.1 | 97.2 | 96.1 | 92.5
TARTS | RGB | 96.4 | 97.5 | 97.0 | 94.1
Table 3. Ablation study on adaptive thresholding strategies using RTSD dataset.

Method | Threshold Strategy | P | R | F | IoU
TARTS- | Median | 72.2 | 99.9 | 82.9 | 72.2
TARTS | Median | 72.1 | 99.9 | 83.5 | 72.0
TARTS- | Mean | 91.0 | 99.9 | 95.3 | 91.0
TARTS | Mean | 91.3 | 99.7 | 95.1 | 91.0
TARTS- | Otsu-standard | 91.2 | 98.8 | 94.9 | 90.2
TARTS | Otsu-standard | 95.0 | 99.1 | 97.0 | 94.2
TARTS- | Otsu-valley-emphasis | 93.6 | 99.7 | 96.5 | 93.3
TARTS | Otsu-valley-emphasis | 95.1 | 99.3 | 97.2 | 94.5
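Since Table 3 contrasts standard Otsu [23] with its valley-emphasis variant, a minimal sketch of the valley-emphasis objective is given below. It follows the commonly used formulation that weights each candidate threshold by one minus the histogram mass at that threshold; the exact variant adopted by TARTS is the one defined in the method section, so this should be read as an illustration rather than the paper's implementation.

```python
import numpy as np

def valley_emphasis_otsu(scores: np.ndarray, bins: int = 256) -> float:
    """Valley-emphasis thresholding over a 1-D array of similarity scores.

    Standard Otsu picks the threshold maximizing between-class separation; the
    valley-emphasis variant multiplies the objective by (1 - h_t), where h_t is
    the normalized histogram value at the candidate threshold, so thresholds
    lying in histogram valleys are preferred.
    """
    hist, edges = np.histogram(scores, bins=bins)
    hist = hist.astype(np.float64) / max(hist.sum(), 1)
    centers = 0.5 * (edges[:-1] + edges[1:])

    best_thr, best_obj = centers[0], -np.inf
    for t in range(1, bins):
        w0, w1 = hist[:t].sum(), hist[t:].sum()
        if w0 < 1e-12 or w1 < 1e-12:
            continue
        mu0 = (hist[:t] * centers[:t]).sum() / w0
        mu1 = (hist[t:] * centers[t:]).sum() / w1
        obj = (1.0 - hist[t]) * (w0 * mu0 ** 2 + w1 * mu1 ** 2)  # valley-emphasis objective
        if obj > best_obj:
            best_obj, best_thr = obj, centers[t]
    return best_thr
```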
Table 4. Ablation study comparing feature–superpixel alignment strategies.

RTSD
Method | Alignment Strategy | P | R | F | IoU
TARTS- | Bilinear Interpolation | 91.0 | 98.9 | 94.8 | 90.1
TARTS- | Patch-level Alignment | 93.6 | 99.7 | 96.5 | 93.3
TARTS | Bilinear Interpolation | 94.7 | 99.2 | 97.0 | 94.1
TARTS | Patch-level Alignment | 95.1 | 99.3 | 97.2 | 94.5

ORFD-All
Method | Alignment Strategy | P | R | F | IoU
TARTS- | Bilinear Interpolation | 92.3 | 96.9 | 94.5 | 89.6
TARTS- | Patch-level Alignment | 93.5 | 96.2 | 94.9 | 90.2
TARTS | Bilinear Interpolation | 93.0 | 97.7 | 95.3 | 91.0
TARTS | Patch-level Alignment | 94.4 | 97.0 | 95.7 | 91.7
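One plausible reading of the two strategies compared in Table 4 is that the ViT patch-feature grid is either bilinearly upsampled to image resolution and averaged inside each superpixel, or each patch token is assigned to the superpixel covering it and tokens are averaged per superpixel. The PyTorch sketch below implements this reading; it is an assumption about the ablation, not the authors' code, and both helper names are hypothetical.

```python
import numpy as np
import torch
import torch.nn.functional as F

def pool_bilinear(patch_feats: torch.Tensor, sp_labels: np.ndarray, num_sp: int) -> torch.Tensor:
    """Bilinear interpolation: upsample the (D, h, w) patch-feature grid to image size,
    then average the per-pixel features inside each superpixel."""
    H, W = sp_labels.shape
    up = F.interpolate(patch_feats[None], size=(H, W), mode="bilinear", align_corners=False)[0]
    feats = up.flatten(1).T                                    # (H*W, D)
    labels = torch.from_numpy(sp_labels.reshape(-1)).long()
    pooled = torch.zeros(num_sp, feats.shape[1])
    pooled.index_add_(0, labels, feats)
    counts = torch.bincount(labels, minlength=num_sp).clamp(min=1).unsqueeze(1)
    return pooled / counts

def pool_patch_level(patch_feats: torch.Tensor, sp_labels: np.ndarray, num_sp: int) -> torch.Tensor:
    """Patch-level alignment: downsample the superpixel label map to the patch grid
    (nearest neighbour), then average the patch tokens assigned to each superpixel."""
    D, h, w = patch_feats.shape
    small = F.interpolate(torch.from_numpy(sp_labels)[None, None].float(),
                          size=(h, w), mode="nearest")[0, 0].long().reshape(-1)
    feats = patch_feats.flatten(1).T                           # (h*w, D)
    pooled = torch.zeros(num_sp, D)
    pooled.index_add_(0, small, feats)
    counts = torch.bincount(small, minlength=num_sp).clamp(min=1).unsqueeze(1)
    return pooled / counts
```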
Table 5. Per-stage latency breakdown (ms) of TARTS pipeline on Jetson Orin NX across varying input resolutions. Values represent mean latencies over 100 runs.

Resolution | Stage 1 | Stage 2 | Stage 3 | Total (FPS)
288 × 288 | 25.16 | 6.15 | 8.22 | 41.52 (24.1)
320 × 320 | 30.23 | 6.73 | 7.62 | 44.57 (22.4)
352 × 352 | 28.60 | 7.73 | 8.50 | 44.83 (22.3)
384 × 384 | 33.30 | 7.58 | 9.10 | 49.97 (20.0)
416 × 416 | 36.42 | 7.85 | 8.87 | 53.14 (18.8)
448 × 448 | 39.88 | 8.10 | 9.01 | 57.00 (17.5)
480 × 480 | 40.17 | 8.40 | 8.95 | 57.53 (17.4)
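The FPS figures in Table 5 follow directly from the mean total latency as FPS = 1000 / total latency in ms; for example, the 288 × 288 row gives 1000 / 41.52 ≈ 24.1 FPS. A one-line helper (illustrative) makes the conversion explicit.

```python
def fps_from_latency_ms(total_ms: float) -> float:
    """Frames per second implied by a mean per-frame latency given in milliseconds."""
    return 1000.0 / total_ms

print(round(fps_from_latency_ms(41.52), 1))  # 24.1, matching the 288 x 288 row of Table 5
```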