1. Introduction
Rescue robots must navigate unfamiliar environments with low visibility, degraded sensing conditions, and moving targets. In these situations, reliable object segmentation is crucial for tasks such as simultaneous localization and mapping (SLAM) [
1], modeling disaster sites [
2], detecting people in need of assistance [
3] or missing people [
4], and maintaining situational awareness during catastrophic events [
5]. To perform these tasks effectively, however, segmentation algorithms rely on high-quality annotated datasets. Furthermore, to improve robustness under these adverse perception conditions, the combination of RGB and thermal infrared (TIR) imagery is becoming increasingly relevant [
6,
7,
8]. Nevertheless, cross-modal RGB and thermal (RGB–T) settings pose important challenges for establishing reliable correspondences due to significant appearance differences between modalities, including spectral response differences [
9,
10], contrast inversion, and missing texture patterns [
11].
Manually creating semantic segmentation annotations is time-consuming and labor-intensive, and the resulting labels are often inconsistent due to human error. Moreover, the procedures for generating these datasets are often insufficiently described, limiting reproducibility and systematic assessment of annotation quality; while a few SAR datasets provide RGB–T imagery, they generally offer only limited annotations, highlighting the scarcity of multimodal semantic labels in this domain [
6,
7,
8,
12,
13].
In particular, RGB–TIR datasets pose a fundamental technical challenge: achieving reliable cross-modal alignment. While standard planar calibration patterns are effective for RGB cameras, calibration in the thermal domain is often problematic due to limitations of heated patterns, such as uneven temperature distribution, low geometric definition, and thermal diffusion [
14], which, in turn, complicates the establishment of accurate point correspondences between modalities.
Recently, there has been increasing attention within the community towards developing tools and workflows that reduce the time and cost of annotation, positioning human intervention primarily in a supervisory role [
15,
16,
17,
18,
19,
20]. In particular, promptable models such as SAM2 (Segment Anything Model 2) [
21] provide a flexible mechanism for generating initial segmentation masks across a wide range of image scenes, enabling semi-automatic workflows in which human annotators review and refine automatically generated masks. Unlike video object segmentation (VOS) models [
22], SAM2 employs a unified architecture for both image- and video-based segmentation tasks. Its streaming memory mechanism supports zero-shot segmentation, reducing annotation time and requiring less user interaction than previous methods, such as Open-Vocabulary models [
23]. Related approaches to our proposal, such as the Multispectral Automated Transfer Technique (MATT) [
24], use SAM-generated masks on RGB images and transfer them to other modalities, including multispectral or thermal imagery. Although these methods reduce annotation time, they often depend on well-aligned image pairs and do not incorporate systematic cross-modal alignment, annotation quality metrics, or human-in-the-loop reliability assessment, leaving a gap for more general, reproducible pipelines.
Inter-annotator agreement (IAA) is a critical metric that quantifies the degree of consistency with which multiple human annotators label the same data, statistically accounting for agreement by chance. Originally established in fields such as psychology and medicine, IAA metrics have been adopted in computer vision domains, including medical [
25], agricultural [
26], and transportation [
27], to evaluate the quality of semantic segmentation ground-truth. In the context of RGB-T semantic segmentation, widely used benchmark datasets such as MFNet and PST900 have enabled the rapid development of comparable models. However, insufficient quality control can have major implications for both the reproducibility of the findings and the interpretation of the results [
28]. Inconsistent annotations can artificially inflate or suppress quantitative metrics, thereby hiding real model improvements. Consequently, as in medical imaging applications [
29], semantic segmentation annotations in RGB-T datasets should be made as accurate and reproducible as possible.
Thus, to address these limitations, we propose a semi-automatic annotation methodology that combines cross-modal alignment, explicit quality checks (QC), human efficiency and reliability evaluation, and downstream validation. The main contributions of this article are:
We propose an RGB-T annotation pipeline for semantic segmentation datasets that integrates a cross-modal geometric validation method for the RGB and TIR modalities.
We benchmark the proposed RGB-T pipeline against two annotation strategies: a fully manual polygon-based baseline and an RGB-SAM2 baseline built around SAM2 as the ML backend. All strategies are evaluated by measuring inter-annotator agreement (IAA), label quality checks (QC), and human annotation time.
We apply the proposed pipeline to generate 306 annotated image pairs from a SAR-oriented RGB-T dataset [
30]. The labeled data were then used to train and evaluate two state-of-the-art RGB-T semantic segmentation models. Both the data and trained models are publicly available at
https://github.com/amsalase/CPGFANet/tree/main/rgb_t_pipeline_sar (accessed on 27 February 2026).
The remainder of this paper is organized as follows:
Section 2 reviews related literature.
Section 3 provides a comprehensive overview of the proposed pipeline, detailing its procedural steps.
Section 4 presents the SAR dataset, implementation details, and evaluation metrics for analyzing quality, cost, and IAA.
Section 5 presents the results on annotation cost, quality, IAA, and dataset validation.
Section 6 summarizes findings and outlines some discussion points. Finally,
Section 7 offers conclusions.
2. Related Work
Despite their relevance, multimodal RGB-T datasets remain scarce. In the autonomous driving domain, the FLIR-ADAS dataset [
31] includes both RGB and TIR images, along with bounding box annotations for object detection. Similarly, the MFNET dataset [
32] contains RGB and TIR images captured during daytime and nighttime, accompanied by semantic segmentation labels. Additionally, the PST900 dataset [
33] focuses on subterranean environments and also contains RGB-T annotated images. In the search-and-rescue (SAR) domain, several datasets include both RGB and TIR modalities; however, most provide only bounding-box annotations [
6,
7,
8]. In contrast, some work offers multi-class semantic segmentation annotations for SAR scenarios [
12,
13], but they are limited to RGB imagery, highlighting both the scarcity of multimodal semantic labels and the challenges involved in generating them.
Recent work in semantic segmentation has proposed task-specific annotation pipelines tailored to constrained scenarios, such as degraded imagery, remote sensing, or medical imaging. In [
16], the authors propose a two-step training pipeline for semi-supervised semantic segmentation in degraded images captured under adverse conditions such as nighttime or fog. This approach decouples the training of labeled and unlabeled images to mitigate overfitting in extremely limited datasets, but its assumptions limit applicability to unstructured scenarios. In [
17], SAM2 is combined with an Enhancement and Labeling Network (ELNet) to accelerate annotation in remote sensing segmentation tasks, with experiments limited to two-class, single-modality images. A systematic framework for semantic segmentation in satellite images is presented in [
18]. This work provides a guide for efficiently and systematically exploiting artificial intelligence (AI) tools, including segmentation models, data preparation, model selection, and validation. However, this approach is tailored to a specific structure in satellite images. Finally, in [
19], a fully automated end-to-end pipeline is presented to generate semantically enriched BIM-compatible 3D reconstructions from a single RGB image, jointly performing depth estimation and semantic segmentation. This framework is designed for structured environments, limiting its generalization to unstructured scenes.
Beyond task-specific pipelines, other works explore multi-step or interactive workflows for annotation. In [
34], an iterative scribble-based annotation method is proposed for segmenting urban scenes. The process starts from an initial segmentation, which is refined by at least two experienced annotators through user-provided corrections and back-propagation. In [
20], a method for automatic segmentation and labeling is proposed based on a sequence of ground extraction (via RANSAC) followed by a coarse-to-fine segmentation strategy in urban scenarios. Moreover, the MATT workflow [
24] uses SAM to segment aerial RGB images and transfer these annotations to co-aligned multispectral images, thereby defining an automated segmentation labeling process; while this work is the only one to propose label transfer between modalities, it assumes that the images are co-aligned and synchronized, with the cameras sharing common features such as field of view (FOV) and image resolution. On the other hand, in [
35], a novel network architecture enhances building segmentation by addressing multi-scale feature fusion and foreground perception, without incorporating annotation into the pipeline.
These approaches aim to reduce manual annotation effort through semi-supervised methods or iterative refinement, demonstrating a growing interest in developing efficient and reliable annotation pipelines that are crucial for scaling semantic segmentation across diverse and complex scenarios. However, none address cross-modal alignment, inter-annotator agreement, or human-in-the-loop reliability assessment, and few studies integrate these elements into a standardized and generalizable pipeline.
3. SAM2-Driven RGB-T Annotation Pipeline
Figure 1 summarizes the proposed methodology for semantic annotation of multimodal RGB-T imagery.
Figure 2 illustrates an example of the proposed methodology. The pipeline is designed to reduce manual annotation effort while preserving spatial and semantic accuracy. To this end, the methodology integrates affine and homography-based cross-modal alignment as complementary local geometric approximations—without requiring explicit thermal camera calibration—with semi-automatic semantic annotation using SAM2, followed by manual refinement supported by thermal imagery. This work contributes a reproducible workflow that combines existing technologies in a structured manner, with explicit alignment and quality control. Detailed implementation aspects are provided in
Section 4.
The proposed pipeline consists of four stages:
- 1.
Cross-Modal Geometric Validation. To address the challenge of significant appearance differences between RGB and TIR modalities, we propose four cross-modal geometric validation steps to establish reliable correspondences between image pairs:
Intrinsic Correction. First, to prevent systematic geometric distortions from propagating into the cross-modal correspondence estimation stage, both RGB and TIR images are undistorted using their intrinsic calibration parameters and radial distortion coefficients, following standard camera calibration procedures [
36]. This step compensates for lens-induced distortions and ensures geometrically consistent inputs for subsequent steps.
Cross-modal correspondence. RGB and TIR images share underlying structural correspondences that can be exploited for spatial alignment. Matched keypoints between image pairs are generated by a deep feature-based matcher (SuperGlue [
37]). The use of this model, which incorporates contextual and geometric reasoning, enables reliable structural correspondences.
Geometric model estimation. Based on the matched keypoints, two independent geometric hypotheses are estimated using RANSAC: (i) an affine transformation (A), more stable under moderate parallax, and (ii) a projective homography (H), which is considered a local geometric approximation rather than an exact alignment model. The validity of these hypotheses is assessed in the following step.
Consistency Evaluation and Model Selection. The qualitative criteria considered in this step are summarized in
Table 1. Let $N$ denote the number of cross-modal keypoint matches obtained with SuperGlue. A minimum correspondence condition is first enforced:

$N \geq N_{\min}$,

where $N_{\min}$ is the minimum number of required matches (see
Table 1). If this condition is not satisfied, the RGB-T pair is rejected. Let $\rho_A$ and $\rho_H$ denote the inlier ratios of A and H, respectively, and let $\rho_{\min}$ be the minimum acceptable inlier ratio. The affine model A is selected if

$\rho_A \geq \rho_{\min}$ and $\rho_A \geq \rho_H$,

indicating that A, which is simpler, is chosen without further evaluation of H. If $\rho_A < \rho_{\min}$ or $\rho_H > \rho_A$, the homography model H undergoes an additional validation step. Let $e_H$ denote the median reprojection error (measured in the thermal image domain), and let $e_{\max}$ be the maximum admissible reprojection error. The homography model is selected if

$\rho_H \geq \rho_{\min}$ and $e_H \leq e_{\max}$.

If the homography does not satisfy this condition, the affine model is selected provided that $\rho_A \geq \rho_{\min}$. Otherwise, the RGB-T pair is rejected from the multimodal annotation pipeline.
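The selection logic described above can be sketched as a small decision function. The default threshold values below are placeholders standing in for the implemented values in Table 1:

```python
def select_model(n_matches, rho_a, rho_h, err_h,
                 n_min=15, rho_min=0.5, e_max=3.0):
    """QC model selection (sketch; threshold defaults are placeholders,
    see Table 1 for the implemented values).
    Returns 'affine', 'homography', or 'reject'."""
    if n_matches < n_min:
        return "reject"                      # too few correspondences
    if rho_a >= rho_min and rho_a >= rho_h:
        return "affine"                      # simpler model suffices
    if rho_h >= rho_min and err_h <= e_max:
        return "homography"                  # H passes reprojection check
    if rho_a >= rho_min:
        return "affine"                      # fall back to affine
    return "reject"                          # no geometrically valid model
```

Applying the checks in this order guarantees that the simpler affine model is preferred whenever it explains the correspondences at least as well as the homography.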
All in all, the proposed Cross-Modal Geometric Validation stage provides an annotator-independent method for image alignment, eliminating subjective alignment decisions. It should be noted that, in the presence of significant parallax and pronounced 3D scene structure, purely 2D alignment models (including affine and homography transformations) may not achieve globally consistent alignment and would consequently be rejected by the QC criteria. Addressing such cases would require explicit 3D geometric modeling, which is beyond the scope of this work.
Figure 2.
Illustration of the proposed annotation pipeline, with an example SAR image pair. The illustration shows: (1) the cross-modal correspondence points estimated by SuperGlue [
37] and the warped RGB image in the Cross-modal Geometric Validation stage, (2) the initial mask generated by Label Studio 1.21.0 in the Semi-automatic Annotation stage, and (3) the final refined RGB-T semantic mask from the Annotation Refinement. Notice that three classes were not identified in the RGB image.
- 2.
Annotation Taxonomy. A fixed, domain-specific annotation taxonomy is defined for SAR scenarios to constrain the labeling process and ensure semantic consistency across annotators.
- 3.
Semi-automatic annotation. The proposed semi-automatic annotation methodology for RGB–T images leverages the complementary strengths of automatic and manual annotation using tools such as the open-source Label Studio [
39]. Initial segmentation masks are generated from RGB images, as these images provide higher spatial resolution, richer texture, and clearer object boundaries than TIR images. During the SAM2 semi-automatic step, annotators interact with the Label Studio SAM2 ML backend using positive clicks or bounding boxes to prompt segmentation. The automatic masks are produced using the SAM2 model, which generates class-specific segmentation proposals based on user-provided prompts. If the automatically generated masks are not satisfactory (e.g., missing or inaccurate classes), annotators refine them manually, creating a semi-automatic annotation process that combines speed with control over quality. This manual refinement on RGB-generated masks is performed using polygon annotation tools such as PolygonLabels in Label Studio.
- 4.
Annotation refinement. The RGB masks are then transferred to the corresponding TIR images via the estimated homography H or the affine transformation A, as detailed in the Cross-Modal Geometric Validation step. To capture classes that are poorly visible or invisible in the RGB modality, the masks are further refined directly on the TIR images using the same polygon-based annotation strategy. The annotators manually correct residual misalignment, remove spurious islands, and fill holes that SAM2 does not properly capture, ensuring high-quality segmentation masks.
4. Experimental Methodology
In this work, we quantitatively evaluate annotation efficiency and annotation quality using annotation time and inter-annotator agreement metrics, respectively, enabling an objective comparison of annotation strategies within the proposed framework. This section presents the experimental setup for evaluating our semi-automatic annotation methodology for RGB–T images, including descriptions of the SAR dataset and the segmentation models used, the cross-modal camera alignment procedure to ensure geometric consistency between RGB and TIR images, and the annotation cost and quality metrics.
4.1. SAR Dataset and Model Details
We use the proposed pipeline to annotate 306 RGB and TIR image pairs from the UMA-SAR dataset [
30]. The image pairs were captured with monocular TIR and RGB analog cameras. Both images have the same resolution (704 × 576) and largely overlapping views, although the horizontal FOVs of the TIR and RGB cameras differ. Three representative examples are shown in
Figure 3a,b.
For annotation, we adopt the eleven representative classes in SAR imaging defined in [
40]: First-Responder, Civilian, Vegetation, Road, Dirt-Road, Building, Sky, Civilian-Car, Responder-Vehicle, Debris and Command-Post.
With the resulting dataset, we trained two publicly available RGB-T models: FEANet [
41] and SAFERNet (with ResNet–152 as backbone) [
42]. All instance masks, exported in Label Studio JSON format (including polygon and RLE encodings), were converted into class index maps (0–11) for training.
The models were implemented in the PyTorch v2.9.1 toolbox and trained on an NVIDIA RTX A6000 with CUDA version 12.6. For data augmentation, all training images were flipped and cropped. We adopted stochastic gradient descent (SGD) with a learning rate of 0.003, a momentum of 0.9, and a weight decay of 0.0005. The learning rate was multiplied by a decay rate of 0.95.
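The reported optimizer settings map directly onto PyTorch. A minimal sketch, assuming the 0.95 decay is applied once per epoch via ExponentialLR (the text does not state the schedule's granularity):

```python
import torch

def make_optimizer(model):
    """SGD optimizer and exponential decay matching the reported
    settings: lr 0.003, momentum 0.9, weight decay 0.0005, decay 0.95
    (assumed per epoch)."""
    opt = torch.optim.SGD(model.parameters(), lr=0.003,
                          momentum=0.9, weight_decay=0.0005)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.95)
    return opt, sched
```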
To assess annotation quality and consistency, we leverage the MFNet benchmark dataset [
32], a multimodal RGB–T dataset for object segmentation in urban scenes. MFNet contains paired RGB and thermal images captured under daytime and nighttime conditions, and its annotations are widely considered ground-truth in the literature. For our experiments, we selected a balanced subset of 30 images (15 daytime, 15 nighttime). MFNet defines eight semantic classes: Car, Person, Bike, Curve, Car Stop, Guardrail, Color Cone, and Bump.
4.2. Cross-Modal Alignment Configuration
Keypoints are detected and matched using the SuperGlue Outdoor model with a confidence threshold of 0.2, which is the default value. The TIR image was used as the reference modality, since its FOV is more restrictive than that of the RGB image. The matched keypoints are then mapped back to their original image coordinates, and a geometric transformation (either affine or homography) is estimated using the OpenCV library.
Regarding consistency evaluation and model selection,
Table 1 presents representative threshold ranges reported in the literature and the implemented values for the three geometric criteria. Finally, the selected model is applied to warp the RGB images onto the TIR view using nearest-neighbor interpolation to preserve label integrity.
4.3. Annotation Cost and Annotation Quality
In this work, we quantitatively evaluate annotation quality and efficiency by comparing three annotation strategies: (1) manual polygon-based annotation using PolygonLabels in Label Studio, (2) a semi-automatic SAM2-based annotation on RGB images only, and (3) the proposed full SAM2-based RGB–T pipeline. Strategy (2) corresponds to an ablated version of the proposed method that excludes the final thermal refinement stage and is included to isolate and quantify the contribution of cross-modal refinement. The evaluation considers segmentation accuracy, inter-annotator agreement (IAA), and annotation time per image (in seconds), which was automatically recorded in Label Studio.
In the experiments, we used three annotators with varying levels of experience. Annotator E (expert) had over five years of experience in semantic segmentation annotation. The other two annotators (novices) were volunteer students who underwent a structured 4 h training session on the annotation platform and guidelines prior to the annotation process. No explicit instructions were provided regarding prioritizing speed versus annotation quality. A short completion deadline (several days) was imposed to reflect realistic operational constraints.
To evaluate annotation quality, consistency, and efficiency, different annotator configurations were employed. For the MFNet dataset, which provides established ground-truth labels, annotator E and one of the novice annotators independently labeled 30 images to assess the consistency of the annotations. For the SAR dataset, which lacks ground-truth annotations, the carefully produced polygonal annotations by E were used as a reference. In this case, all three annotators independently labeled 20 SAR images to evaluate annotation time efficiency and inter-annotator agreement. Annotations were performed in matched batches following the same class order across annotators, ensuring consistent conditions for all strategies.
Inter-annotator agreement among the three annotators was evaluated using class-wise Intersection over Union (IoU) [
41] and Cohen’s κ [
43] to assess annotation quality. Cohen’s κ is a chance-corrected measure of agreement for categorical data and, in the context of segmentation maps, quantifies pixel-wise agreement while accounting for agreement expected by chance. Together, these metrics quantify not only the efficiency provided by SAM2 and the thermal modality, but also the consistency and reliability of the resulting segmentation labels.
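Both agreement metrics can be computed directly from a pair of label maps. A minimal sketch:

```python
import numpy as np

def pixel_kappa(a, b, num_classes):
    """Pixel-wise Cohen's kappa between two label maps: observed
    agreement corrected by the chance agreement implied by each
    annotator's marginal class frequencies."""
    a, b = a.ravel(), b.ravel()
    cm = np.zeros((num_classes, num_classes), dtype=np.float64)
    np.add.at(cm, (a, b), 1)                   # pixel-wise confusion matrix
    n = cm.sum()
    po = np.trace(cm) / n                      # observed agreement
    pe = (cm.sum(0) @ cm.sum(1)) / n ** 2      # chance agreement
    return (po - pe) / (1 - pe)

def classwise_iou(a, b, num_classes):
    """Per-class Intersection over Union between two label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(a == c, b == c).sum()
        union = np.logical_or(a == c, b == c).sum()
        ious.append(inter / union if union else float("nan"))
    return ious
```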
5. Results
In this section, we present both qualitative and quantitative analyses of the proposed framework. Specifically, we evaluate the cross-modal geometric validation, annotation quality using the MFNet ground-truth, and annotation cost across three strategies: manual polygonal annotation, RGB-only SAM2, and the proposed RGB-T pipeline. We also assess inter-annotator agreement between one expert and two novice annotators. Finally, we demonstrate the applicability of the RGB-T pipeline by using the generated annotations to train two state-of-the-art RGB-T models.
5.1. Cross-Modal Geometric Validation
Figure 3 illustrates representative examples of the proposed RGB-T cross-modal geometric validation. Each row shows (a) the original RGB and (b) TIR images, (c) the spatial density heatmap of validated correspondences, (d) the RGB images warped using the model selected by the QC (homography or affine) and (e) the resulting RGB-T overlay. Rows 1–2 show examples where QC selected a homography model; rows 3–4 show examples where QC selected an affine model; and the last two rows correspond to QC-rejected images due to insufficient geometric consistency.
Column (c) reveals that the heatmaps are not uniformly distributed; instead, they are concentrated in semantically strong regions (e.g., responder-vehicle boundaries, groups of people, salient objects), indicating reliable cross-modal matching. In the two rejected rows, the heatmaps are more scattered or grouped in less informative zones, suggesting less stable inlier correspondences.
Column (d) presents the warped RGB images generated using the transformation selected by the QC. Rows 1–2 show the black corners/borders typical of a homography. Rows 3–4 show smaller black borders, typical of the affine transformation, which produces a more conservative geometric mapping with fewer extrapolated regions.
Column (e) shows the RGB-T overlay images, which combine the warped RGB data with the TIR data. This column visually demonstrates the effectiveness of Cross-Modal Geometric Validation. In the accepted rows (1-4), elements such as responder-vehicle, people, and dirt-road boundaries exhibit consistent spatial correspondence, with only minor residual misalignment due to the planar assumption in homography estimation (rows 1 and 2). Since the geometric transformation is applied globally, i.e., to the entire image, the overall scene structure is largely preserved. As a result, background objects such as poles, streetlights, or slopes appear geometrically corrected, with small residual parallax inconsistencies visually smoothed out. Importantly, such cases are automatically handled by the quality control stage, which rejects image pairs that fail to meet the predefined geometric consistency criteria, as in rows 5-6, which exhibit noticeable parallax misalignments in background objects such as poles.
All in all, these results demonstrate robust image correspondences in the UMA-SAR dataset across different camera FOVs and misalignments. For the rest of the annotation pipeline analysis, all RGB-T image pairs were aligned using this QC estimation to ensure geometric correspondence.
5.2. Annotation Quality
Due to the lack of publicly available semantic segmentation datasets for RGB–T SAR scenes, we evaluated the absolute segmentation quality of annotator
E against ground-truth on the selected subset of 30 MFNet images (15 daytime, 15 nighttime). Three comparison strategies were considered: (1) manual RGB polygons, (2) SAM2-assisted RGB, and (3) the proposed RGB–T pipeline (see
Table 2). This setup allows us to assess whether SAM2-assisted approaches improve segmentation quality relative to the MFNet ground-truth annotations.
As shown in the table, SAM2-assisted annotations consistently improve segmentation performance. Manual annotations achieved 65.9% mIoU relative to ground-truth, while SAM2-assisted RGB increased this to 69.5%, and the full RGB–T pipeline further improved it to 74.9%. These results indicate that (1) SAM2 assistance can enhance annotation quality, and (2) incorporating cross-modal refinement can further improve annotations.
Figure 4 provides qualitative comparisons that corroborate the insights of
Table 2. Visually, the SAM2-assisted methods (e,f) produce segmentation masks whose boundaries closely follow the ground-truth, even in poorly lit night images. In contrast, the manual polygons (d) show coarser, approximated boundaries, explaining the lower mIoU of manual annotation.
5.3. Annotation Cost
Table 3 summarizes the annotation cost across the three strategies. The Friedman non-parametric test revealed a significant global effect. Post hoc pairwise comparisons (Wilcoxon tests with Bonferroni correction) indicate that incorporating SAM2-based methods into the annotation pipeline significantly reduced the mean annotation time relative to manual polygon annotation in both comparisons. Specifically, our RGB-T pipeline achieved a median annotation time improvement of 36%, corresponding to a total time saving of approximately 123 min across 20 images. Overall, the results confirm that SAM2 assistance, even when requiring refinement, substantially reduces the time cost of generating segmentation ground-truth without compromising annotation quality (see
Section 5.2).
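The statistical protocol described above (Friedman test followed by Bonferroni-corrected Wilcoxon post hoc tests) can be sketched with SciPy; the function and pair names below are illustrative, and the data used in any example would not be the paper's measurements:

```python
import numpy as np
from scipy import stats

def compare_strategies(manual, rgb_sam2, rgbt):
    """Friedman test for a global effect across three paired annotation-time
    samples, followed by Bonferroni-corrected Wilcoxon post hoc tests
    (sketch of the reported statistical protocol)."""
    _, p_global = stats.friedmanchisquare(manual, rgb_sam2, rgbt)
    pairs = {"manual_vs_rgb": (manual, rgb_sam2),
             "manual_vs_rgbt": (manual, rgbt),
             "rgb_vs_rgbt": (rgb_sam2, rgbt)}
    # Bonferroni correction: multiply each p-value by the number of tests.
    post_hoc = {name: min(1.0, stats.wilcoxon(x, y).pvalue * len(pairs))
                for name, (x, y) in pairs.items()}
    return p_global, post_hoc
```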
5.4. Inter-Annotator Agreement
Table 4 shows a possible dependence between annotator expertise and SAM2-assisted annotation performance. The expert annotator (E) maintained almost perfect agreement with SAM2, demonstrating the ability to critically refine automatic proposals. In contrast, the less experienced annotators showed a decrease of around 0.14 κ points compared to their manual annotations, suggesting a stronger reliance on the model outputs with less corrective intervention. When incorporating the RGB-T pipeline, agreement with the RGB reference decreases across annotators. However, this reduction is attributable to the inclusion of additional thermal information not present in the RGB reference. Unlike the RGB-only annotations, the RGB-T pipeline integrates additional structures that are not fully represented in the warped RGB reference due to geometric alignment constraints (
Section 3 and
Section 4.2). Consequently, lower κ values can be interpreted as cross-modality semantic enrichment rather than reduced annotation reliability. Importantly, agreement levels remain within the moderate-to-substantial range for all annotators, indicating stable performance even under increased annotation complexity.
To further assess consistency within the thermal domain, inter-annotator agreement was computed using Annotator E’s RGB-T pipeline annotations as reference. As shown in
Table 5, the two novice annotators achieve substantial agreement. These results confirm that the thermal refinement stage produces consistent and reproducible annotations across different users, supporting the robustness of the complete RGB-T pipeline.
5.5. RGB-T Pipeline Applicability
This section demonstrates the practical utility of the proposed RGB–T pipeline. The pipeline was used to annotate 306 RGB-T image pairs from the UMA-SAR dataset, which were used to train and evaluate two publicly available state-of-the-art RGB-T semantic segmentation models. Both quantitative and qualitative results are presented to assess the effectiveness of the proposed methodology.
Table 6 reports quantitative segmentation results: both models segment almost all eleven classes (FEANet fails to segment the Civilian-Car class due to its similarity to Responder-Vehicle).
The qualitative comparison presented in
Figure 5 visually corroborates the quantitative trends observed in
Table 6. Furthermore, both FEANet and SAFERNet demonstrate the ability to detect low-contrast objects that are barely distinguishable in RGB but salient in the TIR domain, highlighting the benefits of multimodal fusion.
6. Discussion
This work introduces and validates a comprehensive semi-automatic methodology for generating reliably annotated RGB-T semantic segmentation datasets, specifically motivated by search-and-rescue (SAR) applications. The proposed pipeline integrates cross-modal geometric validation, human-in-the-loop reliability assessment, and quality checks in an RGB-T semantic segmentation framework. The applicability of the resulting annotated dataset was validated by successfully training two state-of-the-art RGB-T multimodal networks, thereby demonstrating the pipeline’s practical value.
The cross-modal alignment technique integrates a geometric validation stage for consistent spatial correspondence. The resulting spatial coherence, based on matching and a geometric transformation (quantitatively validated), enhances alignment across modalities and facilitates better integration of data from multiple sources, such as different cameras. By integrating AI (SuperGlue) to estimate projective homographies and affine transformations, we provide an annotator-independent method and a practical solution for registering RGB and TIR imagery in real-world unstructured SAR scenes. The proposed method successfully compensated for mechanical misalignment, yielding spatially coherent RGB-T pairs suitable for fusion and annotation. Homography relies on the assumption of local planarity, which limits its effectiveness for close objects in the scene, potentially leading to minor residual misalignment. In contrast, an affine transformation provides a more constrained geometric model that preserves parallelism and is less sensitive to projective distortions, making it more stable for locally rigid scenes with limited perspective effects.
The experimental results demonstrate that the proposed RGB–T annotation pipeline provides a robust and efficient framework for generating reliable semantic segmentation labels in SAR environments. The integration of automatic cross-modal geometric validation ensures that only spatially consistent image pairs are propagated to the annotation stage, significantly reducing the risk of geometric inconsistencies caused by parallax or non-planar scene structures. The combination of SAM2-based automatic proposals with guided human refinement enables a substantial reduction in annotation time (36%) while maintaining strong annotation quality, as reflected by the high mean IoU (74.9%) and inter-annotator agreement (pixel accuracy = 74.3%, Cohen’s κ = 0.65). The results also reveal a relationship between annotation quality and annotator expertise: the expert annotator achieved almost perfect agreement with SAM2, indicating effective human-AI interaction. Furthermore, the adaptive selection between affine and homography transformations improves alignment robustness across diverse scene geometries. The effectiveness of the generated annotations is further validated by their use in training state-of-the-art RGB–T segmentation models, confirming the reliability, scalability, and practical applicability of the proposed pipeline for constructing high-quality multimodal semantic segmentation datasets. Overall, the results indicate that our proposed RGB-T pipeline can accelerate dataset creation while preserving annotation quality.
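The geometric gating that propagates only spatially consistent pairs can be expressed as a simple acceptance test over the criteria in Table 1. The default values below are the bounds of the reported threshold ranges; the actual implemented thresholds were tuned empirically, so these defaults are illustrative only:

```python
def passes_geometric_qc(num_matches: int,
                        inlier_ratio: float,
                        median_reproj_err_px: float,
                        min_matches: int = 40,
                        min_inlier_ratio: float = 0.4,
                        max_median_err_px: float = 5.0) -> bool:
    """Accept an RGB-T pair only if all three geometric criteria hold:
    enough keypoint matches, a sufficient RANSAC inlier ratio, and a
    low median re-projection error."""
    return (num_matches >= min_matches
            and inlier_ratio >= min_inlier_ratio
            and median_reproj_err_px <= max_median_err_px)
```

Pairs failing any criterion are rejected before annotation, which is where the "rejection" rows of Figure 3 originate.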
A limitation of the current approach is the reliance on global geometric transformations, which may not fully compensate for strong parallax in highly non-planar scenes. Future work will explore more flexible alignment strategies and mechanisms to further support novice annotators.
Very recently, SAM3 was presented [44]. SAM3 is an evolution from single-object visual segmentation to promptable concept segmentation (PCS), enabling the detection and tracking of all instances of a concept in images using text phrases, image samples, or a combination of both. As future work, integrating SAM3 into the proposed pipeline could further enhance the labeling process and user interaction by leveraging its agentic workflow features, potentially combined with multimodal large language models (MLLMs) for complex reasoning-based labeling.
7. Conclusions
This paper presented a reproducible framework for generating reliable RGB-T semantic segmentation labels through a semi-automatic pipeline. The proposed approach integrates SAM2 as a machine learning backend model to reduce manual annotation effort, while a cross-modality geometric validation method enables label transfer and refinement across RGB and TIR modalities. This method, based on affine and homography estimation, enables image warping under camera misalignment, a registration problem that is otherwise complex and error-prone.
We evaluated the proposed semi-automatic annotation pipeline against a fully manual polygon-based annotation. Annotation cost and quality were assessed using inter-annotator agreement (IAA) for label quality checks and human annotation time. The results show an improvement in annotation quality and a median reduction in annotation cost when using the proposed pipeline. Although the pipeline has been motivated by and validated in search and rescue (SAR) applications, its methodological framework is applicable to other domains involving the joint use of thermal and visible imagery, such as infrastructure inspection, environmental monitoring, and autonomous systems operating under low-visibility conditions.
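As an illustration of the IAA quality check, pixel accuracy and Cohen's kappa between two annotators' masks can be computed as in the following NumPy sketch. It assumes both masks are integer-label arrays of the same shape; this is a minimal reimplementation for clarity, not the paper's exact tooling:

```python
import numpy as np


def pixel_accuracy(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of pixels on which the two annotations agree."""
    return float((a == b).mean())


def cohens_kappa(a: np.ndarray, b: np.ndarray, num_classes: int) -> float:
    """Cohen's kappa: chance-corrected agreement between two label masks."""
    a, b = a.ravel(), b.ravel()
    # Confusion matrix between the two annotators.
    cm = np.bincount(a * num_classes + b, minlength=num_classes ** 2)
    cm = cm.reshape(num_classes, num_classes).astype(float)
    n = cm.sum()
    po = np.trace(cm) / n                       # observed agreement
    pe = (cm.sum(0) * cm.sum(1)).sum() / n ** 2  # chance agreement
    return float((po - pe) / (1.0 - pe)) if pe < 1.0 else 1.0
```

The kappa values are then mapped to the agreement categories (Slight through Almost perfect) reported in Tables 4 and 5.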
Furthermore, we have used the proposed pipeline to generate 306 annotated RGB-T image pairs from the UMA-SAR dataset. These labeled data were used to train and test two state-of-the-art RGB-T semantic segmentation models, demonstrating the applicability and fidelity of the generated dataset. Both the data and the trained models have been made publicly available.
Ongoing work involves integrating segmentation models trained under this methodology into real robotic platforms to evaluate their performance in operational SAR scenarios. In addition, future work will focus on extending the proposed framework to additional image modalities, such as depth and multispectral imagery. The recent release of SAM3 represents a significant opportunity for integration into this pipeline.
Author Contributions
Conceptualization, A.S.-E., R.V.-M. and A.M.; methodology, A.S.-E. and R.V.-M.; software, A.S.-E.; validation, A.S.-E. and R.V.-M.; formal analysis, A.S.-E., R.V.-M. and A.M.; investigation, A.S.-E. and R.V.-M.; resources, A.S.-E.; data curation, A.S.-E.; writing—original draft preparation, A.S.-E. and R.V.-M.; writing—review and editing, A.S.-E., R.V.-M. and A.M.; visualization, A.S.-E.; supervision, R.V.-M. and A.M.; project administration, A.M.; funding acquisition, A.M. All authors have read and agreed to the published version of the manuscript.
Funding
This work has been partially funded by the Ministerio de Ciencia, Innovación y Universidades, Gobierno de España, project PID2021-122944OB-I00. This work was technically supported by IMECH.UMA through PPRO-IUI-2023-02 (Universidad de Málaga).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Acknowledgments
The authors thank the Chair for Safety, Emergencies and Disasters of the Universidad de Málaga, led by Jesús Miranda, for organizing the exercises. We would also like to thank all the members of the Robotics and Mechatronics Lab. The first author received a grant from the Asociación Universitaria Iberoamericana de Postgrado (AUIP), Universidad de Málaga, and Universidad Técnica de Manabí. The authors would also like to thank the student annotators Maryerlin García-Bazurto, José Sánchez-Ortiz, and Lucas Jurado-Martínez for their valuable contribution to the annotation process.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Feng, M.; Su, W. Semantic Visual SLAM Algorithm Based on Geometric Constraints. In Proceedings of the 2nd International Conference on the Frontiers of Robotics and Software Engineering (FRSE 2024); Lecture Notes in Networks and Systems; Springer: Singapore, 2025; Volume 1290, pp. 42–51. [Google Scholar] [CrossRef]
- Yajima, Y.; Kim, S.; Chen, J.; Cho, Y.K. Fast Online Incremental Segmentation of 3D Point Clouds from Disaster Sites. In Proceedings of the 2021 International Symposium on Automation and Robotics in Construction; IAARC: Oulu, Finland, 2021; pp. 341–348. [Google Scholar]
- Speth, S.; Gonçalves, A.; Rigault, B.; Suzuki, S.; Bouazizi, M.; Matsuo, Y.; Prendinger, H. Deep learning with RGB and thermal images onboard a drone for monitoring operations. J. Field Robot. 2022, 39, 840–868. [Google Scholar] [CrossRef]
- Wang, Z.; Benhabib, B. Concurrent Multi-Robot Search of Multiple Missing Persons in Urban Environments. Robotics 2025, 14, 157. [Google Scholar] [CrossRef]
- González-Navarro, R.; Lin-Yang, D.; Vázquez-Martín, R.; Garcia-Cerezo, A. Disaster area recognition from aerial images with complex-shape class detection. In Proceedings of the 2023 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR); IEEE: Piscataway, NJ, USA, 2024; pp. 126–131. [Google Scholar]
- Byukusenge, P.; Zhang, Y. Life Detection Based on UAVs—Thermal Images in Search and Rescue Operation. In Proceedings of the 2022 IEEE 22nd International Conference on Communication Technology (ICCT); IEEE: Piscataway, NJ, USA, 2022; pp. 1728–1731. [Google Scholar] [CrossRef]
- Ahmed, M.; Khan, N.; Ovi, P.R.; Roy, N.; Purushotham, S.; Gangopadhyay, A.; You, S. GADAN: Generative Adversarial Domain Adaptation Network For Debris Detection Using Drone. In Proceedings of the 2022 International Conference on Distributed Computing in Sensor Systems; IEEE: Piscataway, NJ, USA, 2022; pp. 277–282. [Google Scholar] [CrossRef]
- Broyles, D.; Hayner, C.R.; Leung, K. WiSARD: A Labeled Visual and Thermal Image Dataset for Wilderness Search and Rescue. In Proceedings of the 2022 International Conference on Intelligent Robots and Systems; IEEE: Piscataway, NJ, USA, 2022; pp. 9467–9474. [Google Scholar] [CrossRef]
- Arora, P.; Mehta, R.; Ahuja, R. Enhancing Image Registration Leveraging SURF with Alpha Trimmed Spatial Relation Correspondence. In Proceedings of the Computational Science and Its Applications—ICCSA 2024; Springer: Cham, Switzerland, 2024; pp. 180–191. [Google Scholar] [CrossRef]
- Shao, R.; Wu, G.; Zhou, Y.; Fu, Y.; Fang, L.; Liu, Y. LocalTrans: A Multiscale Local Transformer Network for Cross-Resolution Homography Estimation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2022; pp. 14870–14879. [Google Scholar] [CrossRef]
- Dlesk, A.; Vach, K.; Pavelka, K. Photogrammetric Co-Processing of Thermal Infrared Images and RGB Images. Sensors 2022, 22, 1655. [Google Scholar] [CrossRef]
- Rahnemoonfar, M.; Chowdhury, T.; Murphy, R. RescueNet: A High Resolution UAV Semantic Segmentation Dataset for Natural Disaster Damage Assessment. Sci. Data 2023, 10, 913. [Google Scholar] [CrossRef]
- Sirma, A.; Plastropoulos, A.; Tang, G.; Zolotas, A. DRespNeT: A UAV Dataset and YOLOv8-DRN Model for Aerial Instance Segmentation of Building Access Points for Post-Earthquake Search-and-Rescue Missions. arXiv 2025, arXiv:2508.16016. [Google Scholar]
- Sutherland, N.; Marsh, S.; Remondino, F.; Perda, G.; Bryan, P.; Mills, J. Geometric Calibration of Thermal Infrared Cameras: A Comparative Analysis for Photogrammetric Data Fusion. Metrology 2025, 5, 43. [Google Scholar] [CrossRef]
- Anderson, C.; Schenck, E.; Reinhardt, C.; Blue, R.; Clipp, B. AA-Pipe: Automatic annotation pipeline for visible and thermal infrared video. Opt. Eng. 2025, 64, 092205. [Google Scholar] [CrossRef]
- An, G.; Guo, J.; Guo, C.; Wang, Y.; Li, C. Semantic segmentation in adverse scenes with fewer labeled images. Neural Netw. 2025, 191, 107788. [Google Scholar] [CrossRef] [PubMed]
- Yang, J.; Yu, W.; Lv, Y.; Sun, J.; Sun, B.; Liu, M. SAM2-ELNet: Label Enhancement and Automatic Annotation for Remote Sensing Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 22499–22512. [Google Scholar] [CrossRef]
- Wang, R.; Chowdhury, T.; Ortiz, A.C. Semantic segmentation framework for atoll satellite imagery: An in-depth exploration using UNet variants and Segmentation Gym. Appl. Comput. Geosci. 2025, 25, 100217. [Google Scholar] [CrossRef]
- Erişen, S.; Mehranfar, M.; Borrmann, A. Single Image to Semantic BIM: Domain-Adapted 3D Reconstruction and Annotations via Multi-Task Deep Learning. Remote Sens. 2025, 17, 2910. [Google Scholar] [CrossRef]
- Ye, H.; Mai, S.; Wang, M.; Gao, M.; Fei, Y. Coarse-to-fine Automatic Segmentation and Labeling for Urban MLS Point Clouds. In Proceedings of the International Conference on Robotics and Artificial Intelligence; Association for Computing Machinery: New York, NY, USA, 2025; pp. 31–36. [Google Scholar] [CrossRef]
- Ravi, N.; Gabeur, V.; Hu, Y.T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. SAM 2: Segment Anything in Images and Videos. arXiv 2024, arXiv:2408.00714. [Google Scholar] [CrossRef] [PubMed]
- Zulfikar, I.E.; Mahadevan, S.; Voigtlaender, P.; Leibe, B. Point-VOS: Pointing Up Video Object Segmentation. arXiv 2024, arXiv:2402.05917. [Google Scholar] [CrossRef]
- Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Li, C.; Yang, J.; Su, H.; Zhu, J.; et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv 2023, arXiv:2303.05499. [Google Scholar]
- Gallagher, J.E.; Gogia, A.; Oughton, E.J. A Multispectral Automated Transfer Technique (MATT) for Machine-Driven Image Labeling Utilizing the Segment Anything Model (SAM). IEEE Access 2025, 13, 4499–4516. [Google Scholar] [CrossRef]
- Abhishek, K.; Kawahara, J.; Hamarneh, G. What Can We Learn from Inter-Annotator Variability in Skin Lesion Segmentation? In Proceedings of the Skin Image Analysis, and Computer-Aided Pelvic Imaging for Female Health; Springer: Cham, Switzerland, 2026; pp. 23–33. [Google Scholar] [CrossRef]
- Zenkl, R.; McDonald, B.A.; Walter, A.; Anderegg, J. Towards high throughput in-field detection and quantification of wheat foliar diseases using deep learning. Comput. Electron. Agric. 2025, 232, 109854. [Google Scholar] [CrossRef]
- Ewecker, L.; Wagner, N.; Brühl, T.; Schwager, R.; Sohn, T.S.; Engelsberger, A.; Ravichandran, J.; Stage, H.; Langner, J.; Saralajew, S. Detecting Oncoming Vehicles at Night in Urban Scenarios—An Annotation Proof-of-Concept. In Proceedings of the 2024 IEEE Intelligent Vehicles Symposium (IV); IEEE: Piscataway, NJ, USA, 2024; pp. 2117–2124. [Google Scholar] [CrossRef]
- Maier-Hein, L.; Eisenmann, M.; Reinke, A.; Onogur, S.; Stankovic, M.; Scholz, P.; Arbel, T.; Bogunovic, H.; Bradley, A.P.; Carass, A.; et al. Why rankings of biomedical image analysis competitions should be interpreted with care. Nat. Commun. 2018, 9, 5217. [Google Scholar] [CrossRef]
- Kohli, M.D.; Summers, R.M.; Geis, J.R. Medical Image Data and Datasets in the Era of Machine Learning. J. Digit. Imaging 2017, 30, 392–399. [Google Scholar] [CrossRef]
- Morales, J.; Vázquez-Martín, R.; Mandow, A.; Morilla-Cabello, D.; García-Cerezo, A. The UMA-SAR Dataset: Multimodal data collection from a ground vehicle during outdoor disaster response training exercises. Int. J. Robot. Res. 2021, 40, 835–847. [Google Scholar] [CrossRef]
- Teledyne FLIR OEM. FREE Teledyne FLIR Thermal Dataset for Algorithm Training. 2025. Available online: https://oem.flir.com/solutions/automotive/adas-dataset-form/ (accessed on 5 December 2025).
- Ha, Q.; Watanabe, K.; Karasawa, T.; Ushiku, Y.; Harada, T. MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In Proceedings of the 2017 IEEE International Conference on Intelligent Robots and Systems; IEEE: Piscataway, NJ, USA, 2017; pp. 5108–5115. [Google Scholar] [CrossRef]
- Shivakumar, S.S.; Rodrigues, N.; Zhou, A.; Miller, I.D.; Kumar, V.; Taylor, C.J. PST900: RGB-Thermal Calibration, Dataset and Segmentation Network. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation; IEEE: Piscataway, NJ, USA, 2020; pp. 9441–9447. [Google Scholar] [CrossRef]
- Sambaturu, B.; Gupta, A.; Jawahar, C.; Arora, C. ScribbleNet: Efficient interactive annotation of urban city scenes for semantic segmentation. Pattern Recognit. 2023, 133, 109011. [Google Scholar] [CrossRef]
- Xu, H.; Huang, Q.; Liao, H.; Nong, G.; Wei, W. MFFP-Net: Building Segmentation in Remote Sensing Images via Multi-Scale Feature Fusion and Foreground Perception Enhancement. Remote Sens. 2025, 17, 1875. [Google Scholar] [CrossRef]
- Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision, 2nd ed.; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
- Sarlin, P.E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperGlue: Learning Feature Matching with Graph Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2020; pp. 4937–4946. [Google Scholar] [CrossRef]
- Lindenberger, P.; Sarlin, P.E.; Pollefeys, M. LightGlue: Local Feature Matching at Light Speed. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2024; pp. 17581–17592. [Google Scholar] [CrossRef]
- LabelStudio. Open Source Data Labeling Platform. 2025. Available online: https://labelstud.io/ (accessed on 23 December 2025).
- Salas-Espinales, A.; Vázquez-Martín, R.; García-Cerezo, A.; Mandow, A. SAR Nets: An Evaluation of Semantic Segmentation Networks with Attention Mechanisms for Search and Rescue Scenes. In Proceedings of the 2023 IEEE International Symposium on Safety, Security, and Rescue Robotics; IEEE: Piscataway, NJ, USA, 2024; pp. 139–144. [Google Scholar] [CrossRef]
- Deng, F.; Feng, H.; Liang, M.; Wang, H.; Yang, Y.; Gao, Y.; Chen, J.; Hu, J.; Guo, X.; Lam, T.L. FEANet: Feature-Enhanced Attention Network for RGB-Thermal Real-time Semantic Segmentation. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE: Piscataway, NJ, USA, 2021; pp. 4467–4473. [Google Scholar] [CrossRef]
- Salas-Espinales, A.; Vazquez-Martin, R.; Mandow, A. SAFERNet: Channel, Positional, and Global Attention Fusion for Efficient RGB-T Segmentation in Disaster Robotics. Authorea 2025, 1–34. [Google Scholar] [CrossRef]
- McHugh, M.L. Interrater reliability: The kappa statistic. Biochem. Med. 2012, 22, 276–282. [Google Scholar] [CrossRef]
- Carion, N.; Gustafson, L.; Hu, Y.T.; Debnath, S.; Hu, R.; Suris, D.; Ryali, C.; Alwala, K.V.; Khedr, H.; Huang, A.; et al. SAM 3: Segment Anything with Concepts. arXiv 2025, arXiv:2511.16719. [Google Scholar] [CrossRef]
Figure 1.
Flowchart of the proposed annotation pipeline. The flowchart consists of four stages: (1) Cross-modal Geometric Validation, with five steps that provide reliable correspondences between image pairs; (2) Annotation Taxonomy, where the classes of the dataset are defined; (3) Semi-automatic Annotation, where Label Studio with SAM2 as the ML backend generates the initial mask; and (4) Annotation Refinement, where the annotator, manually or using SAM2, adds or corrects annotations, resulting in the final refined RGB-T semantic mask.
Figure 3.
Cross-Modal Geometric Validation Results: (a) RGB input, (b) Thermal input, (c) Spatial density heatmap of inlier correspondences obtained from SuperGlue matches after RANSAC-based geometric verification; warmer colors (e.g., red) indicate regions with a higher concentration of matching points, whereas cooler colors (e.g., blue) correspond to areas with fewer or no correspondences, (d) Warped RGB image, and (e) RGB-T overlay. Rows illustrate different QC and model selection results: homography selection (rows 1–2), affine selection (rows 3–4), and rejection (rows 5–6).
Figure 4.
Qualitative comparisons of annotation strategies on two representative day (top) and night (bottom) MFNet validation images: (a) RGB, (b) Thermal, (c) MFNet ground-truth, (d) Annotator E, Manual polygon method, (e) Annotator E, SAM2 (RGB) method, (f) RGB-T pipeline. The RGB-T pipeline (f) demonstrates superior adherence to the ground-truth (c), corresponding to the significant improvement in Table 2. (e,f) were annotated using SAM2 positive points only.
Figure 5.
Predictions on the UMA-SAR dataset: (a) RGB images, (b) TIR images, (c) Ground-truth images, (d) FEANet predicted images, (e) SAFERNet predicted images.
Table 1.
Objective geometric consistency evaluation criteria. Validation threshold ranges were determined based on established practice [36,37,38]. The implemented values have been empirically fine-tuned.
| Geometric Criteria | Threshold Ranges | Implemented Threshold |
|---|---|---|
| Keypoint matches | ≥40 [36] | |
| Inlier ratio | ≥0.4 [37] | |
| Median re-projection error | ≤[3, 5] px [38] | px |
Table 2.
Segmentation quality comparison against MFNet ground-truth. mIoU percentages are computed on 30 MFNet validation images (15 day, 15 night scenes). The relative difference is measured against the Manual Polygon (RGB).
| Annotation Method | mIoU (%) | Relative Difference (%) |
|---|---|---|
| Manual Polygon (RGB) | 65.9 | – |
| SAM2 (RGB) | 69.5 | +3.6% |
| RGB-T Pipeline | 74.9 | +9% |
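For reference, the per-class IoU underlying these mIoU figures can be computed as in the following sketch. It assumes integer-label masks; classes absent from both the prediction and the ground truth are skipped, which is a common convention and our assumption here, not necessarily the exact evaluation code used:

```python
import numpy as np


def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean intersection-over-union across classes present in either mask."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:            # class absent in both masks: skip it
            continue
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))
```

Averaging this per-image score over the 30 MFNet validation images yields the values reported above.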
Table 3.
Annotation cost comparison across strategies. Total times, mean times, and standard deviations are in seconds. Improvements are relative to Manual Polygon annotation. The relatively large standard deviations reflect the heterogeneous complexity of SAR scenes.
| Annotator | Method | Total Time (s) | Mean ± SD (s) | Total Time Improvement (%) |
|---|---|---|---|---|
| E | Manual Polygon (RGB) | 12,503.05 | | – |
| | SAM2 (RGB) | | | |
| | RGB-T Pipeline | | | |
| | Manual Polygon (RGB) | | | – |
| | SAM2 (RGB) | 1985.575 | | |
| | RGB-T Pipeline | | | |
| | Manual Polygon (RGB) | 18,181.15 | | – |
| | SAM2 (RGB) | | | |
| | RGB-T Pipeline | | | |
Table 4.
Agreement metrics: Pixel Accuracy (%) and inter-annotator Cohen’s Kappa (κ). Measurements are between annotations produced by each method and a common reference standard (Annotator E’s Manual Polygon on the same image). Categories: Slight (0–0.2), Fair (0.2–0.4), Moderate (0.4–0.6), Substantial (0.6–0.8), Almost perfect (0.8–1.0).
| Annotator | Method | Pixel Accuracy (Mean ± SD) (%) | Cohen’s κ (Mean ± SD) | Category (Based on κ) |
|---|---|---|---|---|
| E | SAM2 (RGB) | 88.2 ± 8 | 0.82 ± 0.08 | Almost perfect |
| | RGB-T Pipeline | 79.8 ± 8 | 0.70 ± 0.11 | Substantial |
| | Manual Polygon (RGB) | 87.5 ± 7 | 0.81 ± 0.10 | Almost perfect |
| | SAM2 (RGB) | 76.0 ± 15 | 0.66 ± 0.19 | Substantial |
| | RGB-T Pipeline | 69.5 ± 12 | 0.57 ± 0.14 | Moderate |
| | Manual Polygon (RGB) | 84.2 ± 10 | 0.76 ± 0.13 | Substantial |
| | SAM2 (RGB) | 75.0 ± 12 | 0.63 ± 0.17 | Substantial |
| | RGB-T Pipeline | 68.4 ± 11 | 0.55 ± 0.14 | Moderate |
Table 5.
Inter-annotator agreement in the thermal domain using Annotator E’s RGB-T Pipeline refinement as reference. Agreement metrics: Pixel Accuracy (%) and inter-annotator Cohen’s Kappa (κ). Categories: Slight (0–0.2), Fair (0.2–0.4), Moderate (0.4–0.6), Substantial (0.6–0.8), Almost perfect (0.8–1.0).
| Annotator Id | Annotation Method | Pixel Accuracy (Mean ± SD) (%) | Cohen’s κ (Mean ± SD) | Category (Based on κ) |
|---|---|---|---|---|
| | RGB-T Pipeline | 77.4 ± 13 | 0.67 ± 0.17 | Substantial |
| | RGB-T Pipeline | 74.3 ± 18 | 0.65 ± 0.12 | Substantial |
Table 6.
Segmentation fidelity results (%) on the UMA-SAR test set.
| Class | FEANet [41] Acc (%) | FEANet [41] IoU (%) | SAFERNet [42] Acc (%) | SAFERNet [42] IoU (%) |
|---|---|---|---|---|
| First-Responder | 86.00 | 73.18 | 86.16 | 74.84 |
| Civilian | 62.70 | 44.56 | 59.77 | 48.51 |
| Vegetation | 62.96 | 50.18 | 55.15 | 44.64 |
| Road | 39.90 | 27.41 | 79.78 | 59.68 |
| Dirt-Road | 66.87 | 61.33 | 88.60 | 79.89 |
| Building | 74.57 | 57.85 | 70.50 | 56.71 |
| Sky | 94.79 | 90.68 | 94.61 | 90.80 |
| Civilian-Car | 0.00 | 0.00 | 42.65 | 20.80 |
| Responder-Vehicle | 83.96 | 69.73 | 83.00 | 73.09 |
| Debris | 79.44 | 66.83 | 76.67 | 64.36 |
| Command-Post | 64.85 | 41.43 | 86.54 | 55.72 |
| Mean | 65.81 (mAcc) | 52.44 (mIoU) | 74.58 (mAcc) | 60.13 (mIoU) |