Article

Instance Segmentation in Autonomous Log Grasping Using EfficientViT-SAM MP-Former

by Sayan Mandal 1,2, Stefan Ainetter 3,† and Friedrich Fraundorfer 3,*
1 Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich, 52428 Jülich, Germany
2 Faculty of Electrical and Computer Engineering, University of Iceland, 102 Reykjavík, Iceland
3 Institute of Visual Computing, Graz University of Technology, 8010 Graz, Austria
* Author to whom correspondence should be addressed.
† Current Address: Huawei Technologies, 8010 Graz, Austria.
Robotics 2026, 15(2), 44; https://doi.org/10.3390/robotics15020044
Submission received: 5 January 2026 / Revised: 29 January 2026 / Accepted: 11 February 2026 / Published: 15 February 2026
(This article belongs to the Special Issue Perception and AI for Field Robotics)

Abstract

Segmenting individual timber logs in robotic grasping scenarios poses significant challenges due to cluttered arrangements, overlapping geometries, and visually uniform textures, requiring instance segmentation models that balance accuracy and computational efficiency. In this work, we study the integration of the EfficientViT-SAM backbone into the MP-Former framework to analyze its impact on segmentation accuracy, inference speed, and cross-dataset generalization in autonomous forestry applications. Our contributions are threefold: (1) we benchmark Mask2Former and MP-Former with different variants of Swin Transformer as backbones on the TimberSeg 1.0 dataset, (2) we study the use of the EfficientViT-SAM-XL architecture as an alternative encoder backbone to analyze its impact on inference speed and segmentation accuracy, and (3) we use an In-house dataset as a hold-out test set, comprising 113 images and 923 annotations in the annotated subset and 50 images in the unannotated subset, for evaluating model generalization under real-world deployment scenarios. On the TimberSeg 1.0 dataset, our top-performing model, EfficientViT-SAM-XL1 MP-Former, achieves an mAP of 61.05, outperforming the Swin-B Mask2Former of the TimberSeg 1.0 paper by +3.52 mAP, while running at 12 FPS (+3.53 FPS gain). When tested on our In-house dataset, the model attains an mAP of 67.06. Notably, it matches the memory efficiency of TimberSeg’s strongest baseline, despite having nearly double the number of parameters, demonstrating its practical viability for robotic applications in forestry environments.

1. Introduction

As forestry operations push toward autonomous machinery, there is a need for accurate and efficient log-handling workflows. Initial works [1,2] explored grasp prediction in cluttered scenes using deep learning. Ref. [3] presented a fully automated log-grasping system using a visual grasp detection network. Recent efforts [4,5,6] have worked on the instance segmentation of wood logs for autonomous loading. Ref. [7] presented a fully autonomous log-loading system using a Mask2Former-based perception stack trained on the TimberSeg 1.0 dataset [4], reporting that segmentation errors were the dominant cause of failed grasps in outdoor deployments, underscoring the need for more accurate and efficient instance segmentation backbones for forestry robotics. In this work, we focus on instance segmentation, as it enables precise delineation of individual logs, which is essential for downstream grasp planning in complex, overlapping log arrangements commonly encountered in autonomous forestry operations.
The release of the Segment Anything Model (SAM) [8] marked a major shift in segmentation research, offering a foundation model trained at scale for universal segmentation tasks. SAM’s encoder has since been repurposed via fine-tuning for diverse downstream tasks across domains such as medical imaging [9], remote sensing [10], and robotic grasping [11]. To make this capability accessible in real-world settings, efficient variants like EfficientSAM [12] and EfficientViT-SAM [13] have emerged, distilling SAM into faster, lightweight models. These models enable us to leverage SAM’s large-scale pretraining while reducing inference costs, offering a stronger alternative to conventional encoders pretrained on MS-COCO [14].
Several SAM variants, including the original SAM and EfficientViT-SAM, have been evaluated on the TimberSeg 1.0 [4] dataset in a zero-shot setting, using prompts such as points or bounding boxes. While these results are remarkable, zero-shot SAM applications require external inputs or detectors, introducing manual effort and added complexity. In contrast, we treat log segmentation as a conventional instance segmentation task, enabling us to directly fine-tune SAM-based encoders within a task-specific model. This eliminates the need for prompt engineering or auxiliary modules and allows the model to learn end-to-end from training data, making it more practical for real-world forestry automation systems. For this purpose, we extend the work of [4] by making incremental changes to their Swin-B Mask2Former [15] model. We began by surveying recent works building on Mask2Former [15], which led us to the MP-Former [16] architecture. We then looked for efficient versions of SAM to use as an encoder in place of the Swin Transformer [17], and chose EfficientViT-SAM. Our instance segmentation target is only the top-lying logs that can be picked by a crane. As [5] is not publicly available and [6] annotates all logs, we consider only TimberSeg 1.0 [4] for this work, along with our own In-house dataset as a hold-out test set.
The main contributions of this work are as follows:
  • We benchmark Mask2Former and MP-Former with multiple Swin Transformer backbones on the TimberSeg 1.0 dataset for log instance segmentation, analyzing accuracy, recall, and inference speed in the context of autonomous log grasping.
  • We integrate the EfficientViT-SAM-XL as an encoder within the MP-Former architecture for end-to-end fine-tuning. To enable compatibility with the MP-Former pixel decoder, we introduce lightweight upsampling layers to align EfficientViT-SAM feature resolutions and study transposed convolution and transposed CoordConv as alternative designs.
  • We evaluate model generalization on a real-world In-house test set collected under operational deployment conditions.

2. Materials and Methods

2.1. Datasets Used

2.1.1. TimberSeg 1.0 Dataset

We use the TimberSeg 1.0 [4] dataset for training and validating all models. Ref. [4] employed a 5-fold cross-validation strategy, which we follow as well. Because the official fold assignments were not provided, we generated our own five splits using the same KFold function as in the official TimberSeg repository, implemented with scikit-learn and a fixed random seed of 42. Unlike the original code, where the folds are generated dynamically at each run, we precomputed the splits once and kept them fixed across all experiments to ensure full reproducibility. TimberSeg 1.0 contains 220 images and 2474 instances (26 instances were automatically filtered out by Detectron2 [18] for being too small). Table 1 gives the distribution of images and annotations across our five (train:val = 80:20) splits.
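For reproducibility, a minimal sketch of how such fixed splits can be precomputed is shown below; the annotation file name, output file name, and COCO-style loading are illustrative assumptions, while the KFold settings (5 folds, shuffling enabled so that the seed takes effect, seed 42) follow the official TimberSeg repository.

import json

from sklearn.model_selection import KFold

# Load the TimberSeg 1.0 COCO-style annotation file (path is illustrative).
with open("timberseg_coco.json") as f:
    coco = json.load(f)
image_ids = sorted(img["id"] for img in coco["images"])

# Same KFold configuration as the official TimberSeg repository.
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Precompute the splits once and store them to disk, so every experiment
# reuses exactly the same train/val assignment instead of regenerating it.
splits = []
for fold, (train_idx, val_idx) in enumerate(kf.split(image_ids)):
    splits.append({
        "fold": fold,
        "train": [image_ids[i] for i in train_idx],
        "val": [image_ids[i] for i in val_idx],
    })

with open("timberseg_folds.json", "w") as f:
    json.dump(splits, f, indent=2)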
The dataset was collected across four operational environments: forest, lumberyard, roadside and trailer. All images were recorded from the perspective of a log loader operator. Given in Figure 1 are some sample image-annotation pairs from the dataset.

2.1.2. In-House Dataset

To evaluate model performance under real deployment conditions, our project partners provided 113 images, which we manually annotated using the labelme [19] annotation tool, resulting in a total of 923 instance annotations. We refer to this as the ’In-house Annotated dataset’ and use it exclusively for testing to assess model generalization. Due to contractual agreements and data protection regulations, example images from this dataset cannot be publicly shown. In addition, we collected 50 unannotated images from [1,2,3] and from our project partners. This ’In-house Unannotated dataset’ is used for qualitative evaluation of model behavior under diverse viewpoints and lighting conditions. Representative samples that have been explicitly approved by the project partners for disclosure are shown in Figure 2. Although these images can be visualized in this paper, the dataset itself cannot be publicly released due to contractual agreements and data protection regulations.

2.2. Model Architecture Overview

Figure 3 shows our EfficientViT-SAM MP-Former architecture, which integrates an EfficientViT-SAM-XL [13] backbone into the MP-Former [16] segmentation framework. MP-Former improves upon Mask2Former [15] by resolving prediction inconsistencies across Transformer decoder layers via mask-piloted training. It introduces denoising queries based on noisy ground-truth masks to stabilize predictions, while standard matching queries use predicted masks and bipartite matching. This dual-query setup boosts segmentation accuracy and training speed without increasing inference-time computational cost.

2.2.1. Backbone

We use EfficientViT-SAM-XL [13] (XL0 and XL1) as the backbone, which builds on SAM [8] by replacing its ViT [20] encoder with EfficientViT [21]. It uses ReLU-based linear attention [22] to reduce complexity from quadratic to linear and incorporates MBConv [23] and Fused-MBConv [24] blocks. We exclude EfficientViT-SAM-L and experiment only with the XL variants. For readability, we refer to EfficientViT-SAM-XL0 as EVS-XL0 and to EfficientViT-SAM-XL1 as EVS-XL1. EVS-XL consists of Stages 0–5 (Figure 4), where Stage 0 is Conv + ResBlock [25], Stages 1–3 have Fused-MBConv [24], and Stages 4–5 have MBConv [23] + the EfficientViT module [21]. ReLU-based linear attention is defined in Equation (1). Given an input $x \in \mathbb{R}^{N \times f}$, we compute the query, key, and value projections as $Q = xW_Q$, $K = xW_K$, and $V = xW_V$, where $W_Q, W_K, W_V \in \mathbb{R}^{f \times d}$ are learnable linear projection matrices. The $i$-th output feature $O_i$ can be written as:

$$O_i = \frac{\operatorname{ReLU}(Q_i)\left(\sum_{j=1}^{N}\operatorname{ReLU}(K_j)^{T} V_j\right)}{\operatorname{ReLU}(Q_i)\left(\sum_{j=1}^{N}\operatorname{ReLU}(K_j)^{T}\right)}. \qquad (1)$$

As $\sum_{j=1}^{N}\operatorname{ReLU}(K_j)^{T} V_j \in \mathbb{R}^{d \times d}$ and $\sum_{j=1}^{N}\operatorname{ReLU}(K_j)^{T} \in \mathbb{R}^{d \times 1}$ are computed once and reused for each query, the overall attention operation requires only $O(N)$ computational cost and $O(N)$ memory.
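A minimal PyTorch sketch of Equation (1) is given below; it is a simplified single-head version for illustration, whereas the actual EfficientViT implementation additionally uses multi-scale token aggregation and multiple heads.

import torch
import torch.nn.functional as F

def relu_linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (B, N, d) query/key/value projections, as in Equation (1).
    q = F.relu(q)
    k = F.relu(k)
    # Key-value summary sum_j ReLU(K_j)^T V_j, computed once: (B, d, d).
    kv = torch.einsum("bnd,bne->bde", k, v)
    # Key summary sum_j ReLU(K_j)^T: (B, d).
    k_sum = k.sum(dim=1)
    numerator = torch.einsum("bnd,bde->bne", q, kv)           # (B, N, d)
    denominator = torch.einsum("bnd,bd->bn", q, k_sum) + eps  # (B, N)
    # Both summaries are shared by all N queries, hence O(N) cost and memory.
    return numerator / denominator.unsqueeze(-1)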
Figure 3 illustrates the integration of the EVS-XL encoder (Figure 4) into the MP-Former segmentation framework. We extract features from Stages 2–5 (out of Stages 0–5, shown as color-coded blocks in the backbone) to feed into the Pixel Decoder. This stage selection mirrors the setup used for Swin Transformer backbones. However, Swin-B/S/T generates multi-scale feature maps at (1/4, 1/8, 1/16, 1/32) of the input resolution, while EVS-XL yields outputs at (1/8, 1/16, 1/32, 1/64). To ensure compatibility with MP-Former’s Pixel Decoder, we insert upsampling layers to match the expected resolutions of (1/4, 1/8, 1/16, 1/32) (as shown in the “Hierarchical encoder features” block in Figure 3). This ensures the Pixel Decoder configuration remains identical across Swin and EVS-XL variants. To assess upsampling strategies, we perform an ablation comparing transposed convolution [26] (EVS-XL-TC) and transposed CoordConv [27] (EVS-XL-TCC). CoordConv explicitly encodes spatial coordinates and has shown benefits for localization even in single-layer form [28]. This design allows us to evaluate whether improved spatial encoding during upsampling enhances performance in the Pixel Decoder and Transformer Decoder, particularly for instance segmentation involving densely packed timber logs.
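Below is a minimal sketch of one such upsampling layer for the EVS-XL-TCC variant, which prepends normalized coordinate channels before a 2× transposed convolution; channel sizes, kernel size, and stride are illustrative assumptions.

import torch
import torch.nn as nn

class TransposedCoordConv(nn.Module):
    # Illustrative 2x upsampler: concatenate (x, y) coordinate channels,
    # then apply a transposed convolution. One such layer is used per
    # backbone feature level in the EVS-XL-TCC configuration.
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_channels + 2, out_channels,
                                         kernel_size=2, stride=2)

    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device)
        xs = torch.linspace(-1, 1, w, device=x.device)
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([xx, yy]).unsqueeze(0).expand(b, -1, -1, -1)
        return self.deconv(torch.cat([x, coords], dim=1))

The EVS-XL-TC variant is obtained by dropping the two coordinate channels and applying the transposed convolution directly to the backbone features.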

2.2.2. Pixel Decoder

The Pixel Decoder receives the hierarchical feature representations produced by the backbone and progressively upsamples these low-resolution features to generate high-resolution, per-pixel embeddings. In Mask2Former, the authors conducted an ablation study comparing several decoder designs, including Feature Pyramid Network (FPN) [29], Semantic Feature Pyramid Network (Semantic FPN) [30], Feature-aligned Pyramid Network (FaPN) [31], Bi-directional Feature Pyramid Network (BiFPN) [32] and the Multi-Scale Deformable Attention Transformer (MSDeformAttn) [33]. Among these, MSDeformAttn achieved the best performance. Consequently, both our work and [4] adopt MSDeformAttn as the Pixel Decoder. Concretely, as shown in Figure 3, the decoder applies multiple MSDeformAttn layers to feature maps at resolutions of 1/8, 1/16, and 1/32 of the input image, after which the final 1/8 feature map is upsampled to produce 1/4-resolution per-pixel embeddings.

2.2.3. Transformer Decoder

The Transformer Decoder [34] takes as input the per-pixel features generated by the Pixel Decoder and processes a set of learnable object queries to predict segmentation masks. Unlike conventional decoders that rely on standard cross-attention over the full image, Mask2Former employs masked cross-attention, which restricts each query’s attention to the predicted foreground regions rather than the entire feature map. This design improves both training efficiency and segmentation accuracy, especially for small objects and fine-grained structures. The decoder further integrates several architectural optimizations, including reordering the self-attention and masked cross-attention operations, using learnable queries that act as region-level proposals, and eliminating dropout to reduce computational overhead. Together, these modifications enhance both the efficiency and the predictive performance of the segmentation model. Final binary mask predictions are obtained by combining the learned object queries with the per-pixel embeddings.
Building upon this decoder design, MP-Former further refines Mask2Former by addressing the issue of inconsistent mask predictions across the successive Transformer Decoder layers, which can lead to inefficient use of object queries and degraded segmentation quality. To mitigate this, MP-Former introduces a mask-piloted (MP) training strategy that supplements the standard learnable queries with additional ground-truth (GT) guided denoising queries. During training, these MP queries employ noised GT masks within the masked cross-attention mechanism instead of relying on the predicted masks from the preceding layer. Through this process, the denoising queries learn to reconstruct clean GT masks and, in turn, implicitly guide the remaining queries toward more consistent predictions across layers. The original learnable queries are treated as matching queries, which use predicted masks for masked attention and are assigned to GT masks via bipartite matching for loss computation. In contrast, the MP queries are directly paired with their corresponding GT masks for supervision. This training paradigm significantly improves segmentation accuracy while also accelerating convergence, without introducing any additional computational cost during inference, thereby preserving the efficiency of the base Mask2Former framework.
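To illustrate the idea of mask-piloted queries, the sketch below shows one simple way to corrupt GT masks before they are fed to the masked cross-attention of the MP queries; the pixel-flipping noise used here is a simplified stand-in, as the actual MP-Former noising scheme (point and scale noise) differs in detail.

import torch

def noise_gt_masks(gt_masks, flip_ratio=0.2):
    # gt_masks: (M, H, W) binary ground-truth masks.
    # Randomly flip a fraction of pixels; the MP (denoising) queries then
    # learn to reconstruct the clean GT masks from these corrupted inputs.
    gt = gt_masks.float()
    flip = torch.rand_like(gt) < flip_ratio
    return torch.where(flip, 1.0 - gt, gt)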

2.3. Loss Function

Both Mask2Former and MP-Former rely on the same underlying loss formulation. The distinction is that MP-Former computes two loss branches: one tied to the mask-pilot module, where each prediction is directly paired with its corresponding ground-truth (GT) instance, and another from the standard matching stage, where predictions are associated with ground-truth masks via bipartite matching.
Mask2Former follows the point-sampling strategy introduced in PointRend [35] and Implicit PointRend [36]. For the matching stage, it randomly draws a shared set of K points from all predicted and GT masks to build the cost matrix used in bipartite matching. For the final loss computation, it applies importance sampling [35], selecting separate point sets for each predicted-GT pair. Consistent with the official implementation, we use K = 12,544, i.e., 112 × 112 points. Restricting loss computation to these sampled points yields a roughly 3× reduction in GPU memory compared to using full-resolution masks.
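A minimal sketch of the shared uniform point sampling used for the matching cost is shown below; the importance-sampling step used for the final loss is omitted, and tensor shapes are illustrative.

import torch
import torch.nn.functional as F

def sample_mask_points(masks, num_points=112 * 112):
    # masks: (M, H, W) mask logits or binary masks.
    # Returns (M, num_points) values sampled at one shared random point set.
    m = masks.unsqueeze(1).float()                        # (M, 1, H, W)
    # Random point coordinates in [-1, 1], shared across all masks.
    coords = torch.rand(1, num_points, 1, 2, device=masks.device) * 2 - 1
    coords = coords.expand(m.shape[0], -1, -1, -1)        # (M, P, 1, 2)
    sampled = F.grid_sample(m, coords, align_corners=False)
    return sampled.squeeze(-1).squeeze(1)                 # (M, num_points)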
Auxiliary losses are applied to both the Transformer Decoder layers and the learnable query embeddings. The overall objective, $L_{\text{final}}$, combines a mask loss $L_{\text{mask}}$ and a classification loss $L_{\text{class}}$, with predefined weights. The mask loss itself is a weighted sum of the binary cross-entropy term $L_{\text{bce}}$ and the Dice loss $L_{\text{dice}}$ [37]. The full formulation of the final loss is shown below:

$$L_{\text{final}} = L_{\text{mask}} + \lambda_{\text{class}} L_{\text{class}},$$
$$L_{\text{final}} = \lambda_{\text{bce}} L_{\text{bce}} + \lambda_{\text{dice}} L_{\text{dice}} + \lambda_{\text{class}} L_{\text{class}}.$$

We adopt the same weighting coefficients as the original implementations [4,15], setting $\lambda_{\text{bce}} = 5$, $\lambda_{\text{dice}} = 5$, and $\lambda_{\text{class}} = 2$.
The binary cross-entropy component $L_{\text{bce}}$ is defined as follows:

$$L_{\text{bce}} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \cdot \log(\hat{y}_i) + (1 - y_i)\cdot \log(1 - \hat{y}_i) \,\right]$$

where $N$ denotes the number of pixels, $y_i$ is the GT label for pixel $i$ (0 or 1), and $\hat{y}_i$ is the predicted probability that pixel $i$ belongs to class 1.
The Dice loss $L_{\text{dice}}$ is expressed as:

$$L_{\text{dice}} = 1 - \frac{2 \times |Y \cap \hat{Y}|}{|Y| + |\hat{Y}|}$$

where $Y$ denotes the GT binary mask and $\hat{Y}$ is the corresponding predicted mask.
The classification component, implemented as a cross-entropy loss, is defined as:

$$L_{\text{class}} = -\sum_{c=1}^{C} y_{o,c}\,\log(p_{o,c})$$

where $C$ is the total number of classes, $y_{o,c}$ is the binary indicator specifying whether class $c$ is the true label for observation $o$, and $p_{o,c}$ is the model’s predicted probability that observation $o$ belongs to class $c$.
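The sketch below combines the terms above into a single per-pair loss with the weights used in this work ($\lambda_{\text{bce}} = \lambda_{\text{dice}} = 5$, $\lambda_{\text{class}} = 2$); it is a simplified illustration on already-sampled points, whereas the actual implementation applies the loss per decoder layer and only to matched prediction-GT pairs.

import torch
import torch.nn.functional as F

def segmentation_loss(mask_logits, gt_masks, class_logits, gt_classes,
                      w_bce=5.0, w_dice=5.0, w_class=2.0):
    # mask_logits, gt_masks: (M, P) sampled point logits and float 0/1 targets.
    # class_logits: (M, C) classification logits; gt_classes: (M,) int labels.
    probs = mask_logits.sigmoid()
    # Binary cross-entropy over the sampled points.
    bce = F.binary_cross_entropy_with_logits(mask_logits, gt_masks)
    # Dice loss on the same points (soft overlap between prediction and GT).
    inter = (probs * gt_masks).sum(dim=-1)
    dice = 1.0 - (2.0 * inter /
                  (probs.sum(dim=-1) + gt_masks.sum(dim=-1) + 1e-6)).mean()
    # Cross-entropy classification loss.
    cls = F.cross_entropy(class_logits, gt_classes)
    return w_bce * bce + w_dice * dice + w_class * cls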

2.4. Fine-Tuning Strategy

We initially fine-tune MP-Former with EVS-XL encoders pre-trained on SA-1B and randomly initialized decoder weights, but this fails to generalize due to TimberSeg 1.0’s small size. We therefore initialize the decoder weights from a Swin-B Mask2Former pre-trained on MS-COCO [14] and fine-tune the full model, as freezing the encoder degraded performance.

2.5. Implementation Details

We build on the Detectron2 library [18] in PyTorch [38], using the official implementations of Mask2Former [15], MP-Former [16], and TimberSeg 1.0 [4]. All models are trained for 8000 iterations with a learning rate of $1 \times 10^{-4}$ using AdamW [39], as in [4]; for the batch size, we use 16 instead of 8. Unless stated otherwise, hyperparameters match those in TimberSeg 1.0. Training was performed on the Vienna Scientific Cluster VSC-5 with 2 nodes × 2 NVIDIA A40 GPUs (46 GB; NVIDIA Corporation, Santa Clara, CA, USA) and AMD EPYC 7252 CPUs (Advanced Micro Devices, Santa Clara, CA, USA). Inference and FPS measurements were performed on an NVIDIA RTX 3080 Ti (12 GB; NVIDIA Corporation, Santa Clara, CA, USA) and Intel i7-12700K (Intel Corporation, Santa Clara, CA, USA) system, with FPS averaged over the validation set using batch size 1 and single-scale inference. The models with Swin-variant backbones use their corresponding official MS-COCO pre-trained weights.
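The sketch below shows the corresponding Detectron2 config overrides; the helper imports and config keys follow the official Mask2Former repository and may differ between code versions, so treat the exact names as assumptions.

from detectron2.config import get_cfg
from detectron2.projects.deeplab import add_deeplab_config
from mask2former import add_maskformer2_config  # from the official Mask2Former repo

cfg = get_cfg()
add_deeplab_config(cfg)
add_maskformer2_config(cfg)
cfg.SOLVER.OPTIMIZER = "ADAMW"   # AdamW, as in TimberSeg 1.0
cfg.SOLVER.BASE_LR = 1e-4        # learning rate of 1e-4
cfg.SOLVER.MAX_ITER = 8000       # 8000 training iterations
cfg.SOLVER.IMS_PER_BATCH = 16    # batch size 16 instead of 8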
To prevent overfitting in experiments with EVS-XL, where the upsampling layers are trained from scratch, we apply heavy augmentations using the Albumentations [40] library, as given in Listing 1 and shown in Figure 5. For benchmarking on our In-house Annotated and Unannotated datasets, we average the weights of the five folds of each model. By averaging the weights, the resulting model is less likely to be influenced by the idiosyncrasies of any single fold, which improves its ability to generalize to new, unseen data. To evaluate performance, we use mean average precision (AP), mean average precision at IoU 0.5 (AP50), mean average recall (AR), and inference frames per second (FPS), as used in [4].
Listing 1. Albumentations augmentation pipeline.
import albumentations as A

transforms = A.Compose([
    A.Sequential([
        # Apply two randomly chosen groups out of the four below.
        A.SomeOf([
            # Blur variants.
            A.OneOf([
                A.MotionBlur(),
                A.MedianBlur(),
                A.Blur(),
                A.AdvancedBlur(),
                A.Defocus(),
                A.GaussianBlur(),
                A.GlassBlur(),
            ], p=1.0),
            # Color, contrast, and tone variants.
            A.OneOf([
                A.CLAHE(clip_limit=2),
                A.RandomBrightnessContrast(),
                A.HueSaturationValue(),
                A.RandomGamma(),
                A.Emboss(),
                A.Equalize(),
                A.RGBShift(),
                A.Sharpen(),
                A.RandomToneCurve(),
            ], p=1.0),
            # Weather and occlusion effects.
            A.OneOf([
                A.RandomFog(),
                A.RandomRain(),
                A.RandomShadow(),
                A.RandomSunFlare(),
                A.Spatter(),
            ], p=1.0),
            # Noise and compression artifacts.
            A.OneOf([
                A.GaussNoise(),
                A.ISONoise(),
                A.ImageCompression(),
                A.ChromaticAberration(),
                A.ColorJitter(),
                A.Downscale(),
                A.MultiplicativeNoise(),
            ], p=1.0),
        ], n=2, p=1.0),
        A.PixelDropout(dropout_prob=0.2, p=0.5),
    ], p=0.5)
])
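The fold-averaged weights used for the In-house benchmarks are obtained as sketched below; checkpoint paths and the Detectron2-style "model" key are illustrative assumptions, and all entries are assumed to be floating-point tensors of identical shape across folds.

import torch

def average_fold_weights(checkpoint_paths):
    # Load the five fold checkpoints and average their parameters elementwise.
    state_dicts = [torch.load(p, map_location="cpu")["model"]
                   for p in checkpoint_paths]
    averaged = {}
    for key in state_dicts[0]:
        averaged[key] = torch.stack(
            [sd[key].float() for sd in state_dicts]).mean(dim=0)
    return averaged

avg = average_fold_weights([f"model_fold{i}.pth" for i in range(5)])
torch.save({"model": avg}, "model_avg.pth")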

3. Results and Discussion

In this section, we present ablation experiments of all models on the TimberSeg 1.0 dataset (Table 2) and evaluate the average weights of 5 runs on our In-house Annotated dataset (Table 3), measuring instance segmentation performance on AP, AP50, and AR along with FPS and model complexity (backbone and total parameter counts). Our ablation studies compare two EVS-XL variants with two types of upsampling layers: Transposed Conv [26] (TC) and Transposed CoordConv [27] (TCC), and three Swin-Transformer scales (B, S, T), using either Mask2Former (†) or MP-Former (⋄) architecture. In our experiments on TimberSeg 1.0, our fold 1 split almost always had the lowest AP, so for this dataset, we only provide visualizations of fold 1 with the fold 1 model weights. For our In-house Annotated and Unannotated datasets, all results are with averaged weights as mentioned in Section 2.5.

3.1. Benchmark on TimberSeg Dataset

Table 2 shows the ablation study of our models on five folds of the TimberSeg 1.0 dataset. The EVS-XL1-TCC model attains the second-highest AP of 61.06 ± 1.42, an AP50 of 84.27 ± 2.26, and an AR of 69.53 ± 0.83, while running at 11.18 FPS with 189.65M backbone (209.63M total) parameters. For EVS-XL0, TC scores slightly higher AP than TCC but lower AP50 and AR. Among the Swin variants, the MP-Former architecture scores higher AP and AR than its Mask2Former counterparts. Compared to the Swin variants, the EVS-XL variants achieve a better precision-throughput balance, even though they have a higher number of parameters. Interestingly, our Swin-S baselines outperform the Swin-B Mask2Former of [4], which we attribute to our larger batch size.

3.2. Benchmark on Our In-House Annotated Dataset

On our In-house Annotated dataset (Table 3), EVS-XL1-TC achieves the highest AP of 67.06, with 86.09 AP50 and 74.16 AR at 11.99 FPS, confirming robust generalization to real-world scenes. We observe that TC performs better than TCC for EVS-XL1 and conversely for EVS-XL0. Notably, the Swin MP-Former variants, despite their top performance on TimberSeg 1.0, experience a relative drop on the In-house data, suggesting that the mask-piloted training strategy may induce a degree of dataset-specific overfitting. Because MP-Former introduces denoising queries directly guided by ground-truth masks during training, it can encourage the decoder to specialize more strongly to the spatial statistics and annotation style of the training dataset, which may reduce robustness when evaluated on scenes with different clutter patterns, viewpoints, or sensor characteristics. In contrast, the EVS-XL MP-Former variants maintain strong performance on the In-house dataset. We hypothesize that this improved robustness is related to the much broader pretraining of EfficientViT-SAM on the SA-1B corpus and to its convolutional and linear-attention inductive biases, which provide a more generic representation and may reduce the tendency of the decoder to over-specialize during mask-piloted training. A direct comparison between EVS backbones with and without mask-piloted training is left for future work. In addition, future work will explore stronger regularization strategies, such as increased data augmentation, query dropout, or reduced reliance on denoising queries, to further mitigate dataset-specific overfitting in mask-piloted training.

3.3. Effect of Key Design Factors

Based on the ablation results in Table 2 and Table 3, three design factors primarily influence model performance: (i) the segmentation framework (Mask2Former vs. MP-Former), (ii) the backbone architecture and pretraining (Swin vs. EfficientViT-SAM, and backbone scale), and (iii) the upsampling strategy used to adapt EfficientViT-SAM features to the pixel decoder. Since these factors were not varied in a full factorial design, the following analysis relies on controlled model pairs and aims to identify consistent empirical patterns rather than establish strict causal effects.

3.3.1. Effect of Segmentation Framework

For all Swin backbones, replacing Mask2Former with MP-Former yields a consistent improvement on TimberSeg 1.0. For Swin-B, AP increases from 60.37 to 61.10 (+0.73); for Swin-S, from 58.10 to 58.73 (+0.63); and for Swin-T, from 56.88 to 57.47 (+0.59). Similar gains are observed in AR. On the In-house dataset, however, MP-Former does not consistently outperform Mask2Former for Swin backbones: for Swin-B, AP decreases from 65.21 to 64.43 , and for Swin-S, from 63.25 to 61.00 . This indicates that mask-piloted training primarily improves in-domain accuracy on TimberSeg but may reduce robustness under dataset shift for Swin-based models due to potential overfitting.

3.3.2. Effect of Upsampling Strategy

The choice between transposed convolution (TC) and transposed CoordConv (TCC) has a negligible and inconsistent effect on TimberSeg performance. For EVS-XL1, AP changes from 61.05 to 61.06 (+0.01), and for EVS-XL0, from 59.31 to 59.28 (−0.03). On the In-house dataset, the relative ordering reverses between EVS-XL1 (TC better than TCC) and EVS-XL0 (TCC better than TC). All differences remain within approximately ±1.3 AP, indicating that the upsampling design has a second-order influence compared to backbone and framework choices.

3.3.3. Effect of Backbone Architecture and Pretraining

The impact of backbone choice differs markedly between in-domain performance and cross-dataset generalization. On TimberSeg 1.0, the best Swin-B MP-Former model ( 61.10 AP) and the best EVS-XL1 MP-Former model ( 61.06 AP) achieve nearly identical AP, indicating that backbone choice has a limited effect on peak in-domain accuracy when the framework is fixed. In contrast, on the In-house dataset, EVS-XL1-TC improves AP from 64.43 (Swin-B MP-Former) to 67.06 (+2.63), and also outperforms all Swin variants across both frameworks. This suggests that large-scale SA-1B pretraining and the convolutional and linear-attention inductive biases of EfficientViT-SAM primarily improve robustness under dataset shift rather than in-domain accuracy.

3.3.4. Summary of Effects

Across both datasets, the segmentation framework mainly governs in-domain accuracy on TimberSeg (+0.6–0.7 AP for Swin backbones), the upsampling strategy has a negligible and inconsistent effect, and the backbone architecture and pretraining dominate cross-dataset generalization (+2.6 AP on the In-house dataset). Given the coupled nature of the ablation design, these conclusions should be understood as robust trends supported by multiple controlled comparisons rather than as isolated causal attributions.

3.4. Visual Comparison

Figure 6 shows instance segmentation results on our fold-1 validation set of TimberSeg 1.0. Figure 7 shows results on our In-house Unannotated dataset. As can be seen in the third sample of Figure 6, EVS-XL1-TC and EVS-XL0-TCC are able to detect logs that are partially occluded by foliage and not annotated in the ground truth. In Figure 7, samples 4 to 6 show that all models also perform well at night under very low-light conditions. Overall, the models do not exhibit distinct visual differences in segmentation in terms of false positives or false negatives, apart from the EVS-XL models having more refined edge predictions and less bleeding into neighboring overlapping logs. This is expected, as all models are trained with identical supervision on the same dataset and optimized for the same objective. In several examples, false detections and missed instances remain present across all backbones, indicating that the failure modes are shared by the architecture family rather than specific to the proposed encoder. Consequently, the main benefits of the EVS-XL backbones are reflected more clearly in the quantitative metrics (AP and AR) and in improved robustness across the two datasets, rather than in striking visual differences in isolated examples. Further improving the discrimination of hard false positives and resolving extreme instance ambiguities in dense clutter remains an open problem and will likely require additional geometric cues, multi-view information, or explicit reasoning about log topology beyond monocular RGB input.

3.5. Inference Speed

From Table 2 and Table 3, it is evident that the integrated EfficientViT-SAM MP-Former models achieve a substantially better accuracy-throughput trade-off than the Swin-based baselines. The EVS-XL1-TC model, which delivers the highest AP on the In-house Annotated dataset, runs at 11.99 FPS, a 54.51% speedup over our Swin-B baseline implementations, while achieving comparable AP and AR on the TimberSeg 1.0 dataset. The lighter EVS-XL0-TCC variant further increases throughput to 12.61 FPS, making it almost as fast as the Swin-T baselines in spite of having 2.93 times more parameters, while still maintaining higher accuracy.
Despite the EVS-XL backbones having significantly more parameters than Swin-B (up to 189.55M backbone parameters vs. 86.88M for Swin-B), their inference speed remains comparable because the dominant runtime cost is not parameter count but the type of operations used in the backbone. Swin Transformer relies on window-based self-attention, whose computational cost grows quadratically with the number of spatial tokens within each window and becomes increasingly expensive at high resolution. In contrast, EfficientViT-SAM relies primarily on convolutional blocks and ReLU-based linear attention, whose complexity scales linearly with the number of tokens. As a result, despite having nearly twice the number of parameters, EVS-XL achieves similar or higher FPS because its operations are more hardware-efficient and memory-friendly at the input resolutions used in our experiments. In practice, all EVS-XL models comfortably fit within the 12 GB memory budget of the RTX 3080 Ti used for benchmarking, indicating that the increased parameter count does not impose a memory bottleneck in the targeted robotic deployment setting. This confirms that parameter count alone is not a reliable proxy for inference speed or practical deployability. Moreover, the MP-Former architecture introduces no additional inference-time overhead compared to Mask2Former, since the mask-piloted queries are used exclusively during training.
Overall, these results demonstrate that the integrated EVS-XL MP-Former models are well suited for near real-time robotic grasping pipelines on commodity GPU hardware, providing a strong balance between accuracy and throughput under our experimental setting.

3.6. Limitations and Future Directions

Although the integrated EfficientViT-SAM MP-Former improves quantitative performance and generalization, the qualitative visual differences between models remain limited, particularly in terms of hard false positives and false negatives in dense log piles. As shown in Figure 6 and Figure 7, all methods exhibit similar failure cases in extreme clutter and heavy occlusion, suggesting that current monocular RGB-based instance segmentation is approaching a performance ceiling for this setting. Addressing these limitations will likely require incorporating additional geometric information such as depth or multi-view cues, explicit modeling of log topology, or temporal consistency across frames. Exploring such directions is an important avenue for future work toward safer and more reliable robotic grasping in highly cluttered forestry environments.
In addition, although the integrated EVS-XL models fit within the memory budget of our target hardware, exploring model compression techniques such as structured pruning or post-training quantization is an important direction for future work to further reduce memory footprint and enable deployment on more resource-constrained robotic platforms. We leave a systematic study of compression and accuracy–efficiency trade-offs to future work, as the focus of this paper is on architectural benchmarking and backbone selection rather than deployment-specific optimization.
Finally, this work restricts its empirical study to the Mask2Former/MP-Former family in order to enable controlled architectural comparisons and reproducible ablations. A broader comparison against recent end-to-end instance segmentation pipelines and foundation-model-based systems, such as YOLO26 [41], RF-DETR [42], DINOv3 [43] or newer generations of SAM-based models (e.g., SAM2 [44]), remains an important direction for future work. Such a study would require a dedicated benchmarking effort that carefully controls for differences in training protocols, data scale, and real-time deployment constraints, and is beyond the scope of the present work, whose focus is on understanding the impact of backbone choice and pretraining within a fixed segmentation framework.

4. Conclusions

In this paper, we proposed the integration of EfficientViT-SAM-XL encoders into MP-Former and obtained improvements in both AP and FPS on TimberSeg 1.0. Furthermore, the integrated models achieved strong results on our In-house Annotated and Unannotated logs datasets. Additionally, we presented an ablation study on MP-Former and Mask2Former, highlighting MP-Former’s superior performance at the risk of potential overfitting. While our In-house dataset covers a log-piles-to-truck-loading scenario, further evaluation on data with more varied viewpoints and weather conditions is needed in future work.

Author Contributions

Conceptualization, S.M., S.A. and F.F.; methodology, S.M.; software, S.M.; validation, S.M.; formal analysis, S.M.; investigation, S.M.; resources, S.A. and F.F.; data curation, S.M.; writing—original draft preparation, S.M.; writing—review and editing, S.M., S.A. and F.F.; visualization, S.M.; supervision, S.A. and F.F.; project administration, F.F.; funding acquisition, F.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been supported by the FFG-project “FutureWoodTrans”, contract number: 898697. Computation resources were provided by Vienna Scientific Cluster VSC-5. The article charges are supported by the TU Graz Open Access Publishing Fund.

Data Availability Statement

The TimberSeg 1.0 dataset used for training and validation is publicly available as described in Fortin et al. [4]. The code and trained model weights developed in this work will be made publicly available on GitHub: https://github.com/smandal94/logs_segment. The In-house dataset cannot be shared due to binding contractual and GDPR restrictions and is therefore unavailable for third-party access.

Acknowledgments

The authors want to thank all the project partners of the FutureWood project for their contribution, starting from structuring the project and model requirements to real-world model evaluation and capturing test data. Sayan Mandal’s work was done as part of his master’s thesis [45] at TU Graz. Stefan Ainetter was a PhD student at TU Graz during the time of contribution to this work.

Conflicts of Interest

Stefan Ainetter is currently employed by Huawei Technologies Austria, and Sayan Mandal is currently employed by Forschungszentrum Jülich GmbH. This research was conducted while the authors were at Graz University of Technology and is unrelated to their current employment. The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
SAM            Segment Anything Model
ViT            Vision Transformer
FPS            Frames Per Second
mAP            Mean Average Precision
MP             Mask Piloted
GT             Ground Truth
EVS-XL0        EfficientViT-SAM-XL0
EVS-XL1        EfficientViT-SAM-XL1
FPN            Feature Pyramid Network
FaPN           Feature-aligned Pyramid Network
BiFPN          Bi-directional Feature Pyramid Network
MSDeformAttn   Multi-Scale Deformable Attention Transformer
IoU            Intersection over Union
AP             Average Precision
AP50           Mean Average Precision at IoU 0.5
AR             Average Recall
TC             Transposed Conv
TCC            Transposed CoordConv

References

  1. Ainetter, S.; Fraundorfer, F. End-to-end trainable deep neural network for robotic grasp detection and semantic segmentation from rgb. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021. [Google Scholar] [CrossRef]
  2. Ainetter, S.; Böhm, C.; Dhakate, R.; Weiss, S.; Fraundorfer, F. Depth-aware Object Segmentation and Grasp Detection for Robotic Picking Tasks. In Proceedings of the 32nd British Machine Vision Conference 2021, BMVC 2021, Online, 22–25 November 2021. [Google Scholar] [CrossRef]
  3. Gietler, H.; Böhm, C.; Ainetter, S.; Schöffmann, C.; Fraundorfer, F.; Weiss, S.; Zangl, H. Forestry crane automation using learning-based visual grasping point prediction. In Proceedings of the 2022 IEEE Sensors Applications Symposium (SAS), Sundsvall, Sweden, 1–3 August 2022; pp. 1–6. [Google Scholar] [CrossRef]
  4. Fortin, J.M.; Gamache, O.; Grondin, V.; Pomerleau, F.; Giguère, P. Instance Segmentation for Autonomous Log Grasping in Forestry Operations. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 6064–6071. [Google Scholar] [CrossRef]
  5. Usui, K. Estimation of log-gripping position using instance segmentation for autonomous log loading. Int. J. For. Eng. 2024, 35, 251–269. [Google Scholar] [CrossRef]
  6. Steininger, D.; Simon, J.; Trondl, A.; Murschitz, M. TimberVision: A Multi-Task Dataset and Framework for Log-Component Segmentation and Tracking in Autonomous Forestry Operations. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 28 February–4 March 2025. [Google Scholar] [CrossRef]
  7. Ayoub, E.; Fernando, H.; Larrivée-Hardy, W.; Lemieux, N.; Giguère, P.; Sharf, I. Log Loading Automation for Timber-Harvesting Industry. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 17920–17926. [Google Scholar] [CrossRef]
  8. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 4015–4026. [Google Scholar] [CrossRef]
  9. Gao, Y.; Xia, W.; Hu, D.; Wang, W.; Gao, X. DeSAM: Decoupled Segment Anything Model for Generalizable Medical Image Segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2024; Linguraru, M.G., Dou, Q., Feragen, A., Giannarou, S., Glocker, B., Lekadir, K., Schnabel, J.A., Eds.; Springer: Cham, Switzerland, 2024; pp. 509–519. [Google Scholar] [CrossRef]
  10. Ding, L.; Zhu, K.; Peng, D.; Tang, H.; Yang, K.; Bruzzone, L. Adapting Segment Anything Model for Change Detection in VHR Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–11. [Google Scholar] [CrossRef]
  11. Noh, S.; Kim, J.; Nam, D.; Back, S.; Kang, R.; Lee, K. GraspSAM: When Segment Anything Model Meets Grasp Detection. In Proceedings of the 2025 IEEE International Conference on Robotics and Automation (ICRA), Atlanta, GA, USA, 19 May–23 May 2025; pp. 14023–14029. [Google Scholar] [CrossRef]
  12. Xiong, Y.; Varadarajan, B.; Wu, L.; Xiang, X.; Xiao, F.; Zhu, C.; Dai, X.; Wang, D.; Sun, F.; Iandola, F.; et al. Efficientsam: Leveraged masked image pretraining for efficient segment anything. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 16111–16121. [Google Scholar] [CrossRef]
  13. Zhang, Z.; Cai, H.; Han, S. Efficientvit-sam: Accelerated segment anything model without performance loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–21 June 2024; pp. 7859–7863. [Google Scholar] [CrossRef]
  14. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Lecture Notes in Computer Science. Springer: Cham, Switzerland, 2014; Volume 8693, pp. 740–755. [Google Scholar] [CrossRef]
  15. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1280–1289. Available online: https://github.com/facebookresearch/Mask2Former (accessed on 5 March 2024).
  16. Zhang, H.; Li, F.; Xu, H.; Huang, S.; Liu, S.; Ni, L.M.; Zhang, L. MP-Former: Mask-Piloted Transformer for Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 18074–18083. Available online: https://github.com/IDEA-Research/MP-Former (accessed on 5 March 2024).
  17. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  18. Wu, Y.; Kirillov, A.; Massa, F.; Lo, W.Y.; Girshick, R. Detectron2. 2019. Available online: https://github.com/facebookresearch/detectron2 (accessed on 5 March 2024).
  19. Wada, K. labelme: Image Polygonal Annotation with Python. 2018. Available online: https://github.com/wkentaro/labelme (accessed on 5 March 2024).
  20. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual, 3–7 May 2021. [Google Scholar]
  21. Cai, H.; Li, J.; Hu, M.; Gan, C.; Han, S. EfficientViT: Lightweight Multi-Scale Attention for High-Resolution Dense Prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 17256–17267. Available online: https://github.com/mit-han-lab/efficientvit (accessed on 5 March 2024).
  22. Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning (ICML), PMLR, Virtual, 13–18 July 2020; pp. 5156–5165. [Google Scholar]
  23. Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Sandler, M.; Howard, A.; Le, Q.V. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 2815–2823. [Google Scholar] [CrossRef]
  24. Tan, M.; Le, Q. Efficientnetv2: Smaller models and faster training. In Proceedings of the International Conference on Machine Learning (ICML), PMLR, Virtual, 18–24 July 2021; pp. 10096–10106. [Google Scholar]
  25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  26. Dumoulin, V.; Visin, F. A guide to convolution arithmetic for deep learning. arXiv 2016, arXiv:1603.07285. [Google Scholar]
  27. Liu, R.; Lehman, J.; Molino, P.; Petroski Such, F.; Frank, E.; Sergeev, A.; Yosinski, J. An intriguing failing of convolutional neural networks and the coordconv solution. In Advances in Neural Information Processing Systems (NeurIPS); The MIT Press: Cambridge, MA, USA, 2018; Volume 31, pp. 9628–9639. [Google Scholar]
  28. Wang, X.; Kong, T.; Shen, C.; Jiang, Y.; Li, L. Solo: Segmenting objects by locations. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVIII 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 649–665. [Google Scholar] [CrossRef]
  29. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar] [CrossRef]
  30. Kirillov, A.; Girshick, R.; He, K.; Dollár, P. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 6392–6401. [Google Scholar] [CrossRef]
  31. Huang, S.; Lu, Z.; Cheng, R.; He, C. Fapn: Feature-aligned pyramid network for dense image prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 844–853. [Google Scholar] [CrossRef]
  32. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar] [CrossRef]
  33. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, 3–7 May 2021. [Google Scholar]
  34. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS); The MIT Press: Cambridge, MA, USA, 2017; Volume 30, pp. 6000–6010. [Google Scholar]
  35. Kirillov, A.; Wu, Y.; He, K.; Girshick, R. Pointrend: Image segmentation as rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9796–9805. [Google Scholar] [CrossRef]
  36. Cheng, B.; Parkhi, O.; Kirillov, A. Pointly-supervised instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 2607–2616. [Google Scholar] [CrossRef]
  37. Milletari, F.; Navab, N.; Ahmadi, S.A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar] [CrossRef]
  38. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS); The MIT Press: Cambridge, MA, USA, 2019; Volume 32, pp. 8026–8037. [Google Scholar]
  39. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  40. Buslaev, A.; Iglovikov, V.I.; Khvedchenya, E.; Parinov, A.; Druzhinin, M.; Kalinin, A.A. Albumentations: Fast and Flexible Image Augmentations. Information 2020, 11, 125. [Google Scholar] [CrossRef]
  41. Sapkota, R.; Cheppally, R.H.; Sharda, A.; Karkee, M. YOLO26: Key architectural enhancements and performance benchmarking for real-time object detection. arXiv 2025, arXiv:2509.25164. [Google Scholar] [CrossRef]
  42. Robinson, I.; Robicheaux, P.; Popov, M.; Ramanan, D.; Peri, N. RF-DETR: Neural architecture search for real-time detection transformers. arXiv 2025, arXiv:2511.09554. [Google Scholar]
  43. Siméoni, O.; Vo, H.V.; Seitzer, M.; Baldassarre, F.; Oquab, M.; Jose, C.; Khalidov, V.; Szafraniec, M.; Yi, S.; Ramamonjisoa, M.; et al. DINOv3. arXiv 2025, arXiv:2508.10104. [Google Scholar] [CrossRef]
  44. Ravi, N.; Gabeur, V.; Hu, Y.T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. Sam 2: Segment anything in images and videos. arXiv 2024, arXiv:2408.00714. [Google Scholar] [CrossRef]
  45. Mandal, S. Leveraging Foundation Models in Instance Segmentation of Wood Logs for Robotic Grasping. Master’s Thesis, Graz University of Technology, Graz, Austria, 2024. [Google Scholar] [CrossRef]
Figure 1. Sample image-annotation pairs from the TimberSeg 1.0 dataset [4], showing the variability of log positioning and environmental conditions present in it.
Figure 2. Sample images from our In-house Unannotated dataset showing the varying angles and lighting conditions present in it. Images in the right-most column are samples from the indoor logs dataset used in [1,2,3].
Figure 3. Our EfficientViT-SAM-XL MP-Former [13,16] model with the upsampling layers for each backbone feature level (the same type of upsampler is used in all levels). The entire model is fine-tuned.
Figure 4. Our proposed integration of the EfficientViT-SAM-XL [13] encoder with the upsampling layers to match Swin-B’s [17] output resolution (the same type of upsampler is used in all levels). Stage 0 is Conv + ResBlock [25], Stages 1–3 have Fused-MBConv [24] and Stages 4–5 have MBConv [23] + EfficientViT module [21].
Figure 5. A few samples of augmented images from TimberSeg 1.0 used during training.
Figure 6. Results on fold-1 validation from the TimberSeg 1.0 [4] dataset. From left to right: (a) Groundtruth (GT), (b) EfficientViT-SAM-XL1-TransposedConv (shortened to EVS-xl1-Conv), (c) EfficientViT-SAM-XL0-TransposedCoordConv (shortened to EVS-xl0-CoordConv), (d) Swin-B. All models use the MP-Former architecture with weights trained on the fold-1 split.
Figure 7. Results from our In-house Unannotated dataset. From left to right: GT, EfficientViT-SAM-XL1-TransposedConv (shortened to EVS-xl1-Conv), EfficientViT-SAM-XL0-TransposedCoordConv (shortened to EVS-xl0-CoordConv), Swin-B. All use the MP-Former architecture and all model weights used are averages of five-folds.
Table 1. Training and validation data distribution of TimberSeg 1.0 [4] across the folds used in our experiments.
Fold  | Train Images | Train Instances | Val Images | Val Instances
Fold0 | 176          | 1919            | 44         | 555
Fold1 | 176          | 1970            | 44         | 504
Fold2 | 176          | 2005            | 44         | 469
Fold3 | 176          | 2026            | 44         | 448
Fold4 | 176          | 1976            | 44         | 498
Table 2. Performance comparison of all models on TimberSeg 1.0 [4]. † denotes Mask2Former [15] architecture, ⋄ denotes MP-Former [16] architecture, and △ denotes results from [4]. EVS is short for EfficientViT-SAM [13], TC for Transposed Conv [26], and TCC for Transposed CoordConv [27]. Best performances have been marked in bold. AP, AP50, AR report the mean ± std of 5 folds.
Backbone               | AP           | AP50         | AR           | FPS    | Backbone Params | Total Params
EVS-XL1-TC (SA-1B)     | 61.05 ± 1.88 | 84.09 ± 2.17 | 69.59 ± 1.63 | 11.99  | 189.55M         | 209.53M
EVS-XL1-TCC (SA-1B)    | 61.06 ± 1.42 | 84.27 ± 2.26 | 69.53 ± 0.83 | 11.183 | 189.65M         | 209.63M
EVS-XL0-TC (SA-1B)     | 59.31 ± 2.19 | 82.68 ± 2.16 | 67.86 ± 2.20 | 13.6   | 118.91M         | 138.89M
EVS-XL0-TCC (SA-1B)    | 59.28 ± 2.02 | 83.42 ± 2.25 | 67.89 ± 1.43 | 12.61  | 119.0M          | 138.98M
Swin-B (IN22k) †,△ [4] | 57.53 ± 3.37 | 84.28 ± 2.44 | 65.16 ± 3.40 | 8.47   | 86.88M          | 106.86M
Swin-B (IN22k) †       | 60.37 ± 1.58 | 85.42 ± 2.34 | 67.85 ± 1.08 | 7.76   | 86.88M          | 106.86M
Swin-B (IN22k) ⋄       | 61.1 ± 1.47  | 84.89 ± 2.03 | 69.42 ± 0.69 | 7.76   | 86.88M          | 106.86M
Swin-S (IN1k) †        | 58.10 ± 1.88 | 83.33 ± 2.42 | 65.97 ± 1.28 | 10.17  | 48.84M          | 68.69M
Swin-S (IN1k) ⋄        | 58.73 ± 2.07 | 83.29 ± 2.34 | 67.5 ± 1.64  | 10.17  | 48.84M          | 68.69M
Swin-T (IN1k) †        | 56.88 ± 1.89 | 82.75 ± 1.88 | 65.52 ± 1.48 | 12.65  | 27.52M          | 47.38M
Swin-T (IN1k) ⋄        | 57.47 ± 1.91 | 82.41 ± 2.03 | 67.12 ± 1.52 | 12.65  | 27.52M          | 47.38M
Table 3. Performance comparison of models on our In-house Annotated dataset. † means Mask2Former architecture, ⋄ means MP-Former architecture. EVS is short for EfficientViT-SAM, TCC for Transposed CoordConv, TC for Transposed Conv. Best performances have been marked in bold.
Backbone            | AP    | AP50  | AR    | FPS
EVS-XL1-TC (SA-1B)  | 67.06 | 86.09 | 74.16 | 11.99
EVS-XL1-TCC (SA-1B) | 65.81 | 85.37 | 74.19 | 11.183
EVS-XL0-TC (SA-1B)  | 59.22 | 77.17 | 74.78 | 13.6
EVS-XL0-TCC (SA-1B) | 66.16 | 85.90 | 74.22 | 12.61
Swin-B (IN22k) †    | 65.21 | 86.26 | 70.43 | 7.76
Swin-B (IN22k) ⋄    | 64.43 | 85.40 | 72.00 | 7.76
Swin-S (IN1k) †     | 63.25 | 84.65 | 70.42 | 10.17
Swin-S (IN1k) ⋄     | 61.00 | 81.36 | 71.09 | 10.17
Swin-T (IN1k) †     | 61.36 | 83.77 | 70.69 | 12.65
Swin-T (IN1k) ⋄     | 60.65 | 83.24 | 69.82 | 12.65
