Instance Segmentation in Autonomous Log Grasping Using EfficientViT-SAM MP-Former
Abstract
1. Introduction
- We benchmark Mask2Former and MP-Former with multiple Swin Transformer backbones on the TimberSeg 1.0 dataset for log instance segmentation, analyzing accuracy, recall, and inference speed in the context of autonomous log grasping.
- We integrate EfficientViT-SAM-XL as an encoder within the MP-Former architecture for end-to-end fine-tuning. To make its outputs compatible with the MP-Former pixel decoder, we introduce lightweight upsampling layers that align EfficientViT-SAM feature resolutions, and we study transposed convolution and transposed CoordConv as alternative designs.
- We evaluate model generalization on a real-world, in-house test set collected under operational deployment conditions.
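As a rough illustration of the CoordConv-style upsampling variant mentioned above, the sketch below appends normalized coordinate channels before a transposed convolution (PyTorch; the layer name, channel counts, and the 2x stride are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class TransposedCoordConv(nn.Module):
    """Transposed convolution that first appends normalized (x, y)
    coordinate channels to its input, CoordConv-style (hypothetical sketch)."""
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        # +2 input channels for the two coordinate maps
        self.up = nn.ConvTranspose2d(in_ch + 2, out_ch,
                                     kernel_size=stride, stride=stride)

    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device)
        xs = torch.linspace(-1, 1, w, device=x.device)
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([xx, yy]).expand(b, -1, -1, -1)
        return self.up(torch.cat([x, coords], dim=1))

# 2x upsampling of a SAM-style 256-channel feature map
feat = torch.randn(1, 256, 32, 32)
up = TransposedCoordConv(256, 256)
out = up(feat)
print(out.shape)  # torch.Size([1, 256, 64, 64])
```

A plain transposed-convolution variant would be the same module without the coordinate concatenation; the coordinate channels give the upsampler explicit spatial position information.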
2. Materials and Methods
2.1. Datasets Used
2.1.1. TimberSeg 1.0 Dataset
2.1.2. In-House Dataset
2.2. Model Architecture Overview
2.2.1. Backbone
2.2.2. Pixel Decoder
2.2.3. Transformer Decoder
2.3. Loss Function
2.4. Fine-Tuning Strategy
2.5. Implementation Details
Listing 1. Albumentations augmentation pipeline.

```python
import albumentations as A

transforms = A.Compose([
    A.Sequential([
        A.SomeOf([
            A.OneOf([
                A.MotionBlur(),
                A.MedianBlur(),
                A.Blur(),
                A.AdvancedBlur(),
                A.Defocus(),
                A.GaussianBlur(),
                A.GlassBlur(),
            ], p=1.0),
            A.OneOf([
                A.CLAHE(clip_limit=2),
                A.RandomBrightnessContrast(),
                A.HueSaturationValue(),
                A.RandomGamma(),
                A.Emboss(),
                A.Equalize(),
                A.RGBShift(),
                A.Sharpen(),
                A.RandomToneCurve(),
            ], p=1.0),
            A.OneOf([
                A.RandomFog(),
                A.RandomRain(),
                A.RandomShadow(),
                A.RandomSunFlare(),
                A.Spatter(),
            ], p=1.0),
            A.OneOf([
                A.GaussNoise(),
                A.ISONoise(),
                A.ImageCompression(),
                A.ChromaticAberration(),
                A.ColorJitter(),
                A.Downscale(),
                A.MultiplicativeNoise(),
            ], p=1.0),
        ], n=2, p=1.0),
        A.PixelDropout(dropout_prob=0.2, p=0.5),
    ], p=0.5),
])
```
3. Results and Discussion
3.1. Benchmark on TimberSeg Dataset
3.2. Benchmark on Our In-House Annotated Dataset
3.3. Effect of Key Design Factors
3.3.1. Effect of Segmentation Framework
3.3.2. Effect of Upsampling Strategy
3.3.3. Effect of Backbone Architecture and Pretraining
3.3.4. Summary of Effects
3.4. Visual Comparison
3.5. Inference Speed
3.6. Limitations and Future Directions
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| SAM | Segment Anything Model |
| ViT | Vision Transformer |
| FPS | Frames Per Second |
| mAP | Mean Average Precision |
| MP | Mask Piloted |
| GT | Ground Truth |
| EVS-XL0 | EfficientViT-SAM-XL0 |
| EVS-XL1 | EfficientViT-SAM-XL1 |
| FPN | Feature Pyramid Network |
| FaPN | Feature-aligned Pyramid Network |
| BiFPN | Bidirectional Feature Pyramid Network |
| MSDeformAttn | Multi-Scale Deformable Attention |
| IoU | Intersection over Union |
| AP | Average Precision |
| AP50 | Average Precision at IoU threshold 0.5 |
| AR | Average Recall |
| TC | Transposed Conv |
| TCC | Transposed CoordConv |
References
- Ainetter, S.; Fraundorfer, F. End-to-end trainable deep neural network for robotic grasp detection and semantic segmentation from RGB. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021. [Google Scholar] [CrossRef]
- Ainetter, S.; Böhm, C.; Dhakate, R.; Weiss, S.; Fraundorfer, F. Depth-aware Object Segmentation and Grasp Detection for Robotic Picking Tasks. In Proceedings of the 32nd British Machine Vision Conference 2021, BMVC 2021, Online, 22–25 November 2021. [Google Scholar] [CrossRef]
- Gietler, H.; Böhm, C.; Ainetter, S.; Schöffmann, C.; Fraundorfer, F.; Weiss, S.; Zangl, H. Forestry crane automation using learning-based visual grasping point prediction. In Proceedings of the 2022 IEEE Sensors Applications Symposium (SAS), Sundsvall, Sweden, 1–3 August 2022; pp. 1–6. [Google Scholar] [CrossRef]
- Fortin, J.M.; Gamache, O.; Grondin, V.; Pomerleau, F.; Giguère, P. Instance Segmentation for Autonomous Log Grasping in Forestry Operations. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 6064–6071. [Google Scholar] [CrossRef]
- Usui, K. Estimation of log-gripping position using instance segmentation for autonomous log loading. Int. J. For. Eng. 2024, 35, 251–269. [Google Scholar] [CrossRef]
- Steininger, D.; Simon, J.; Trondl, A.; Murschitz, M. TimberVision: A Multi-Task Dataset and Framework for Log-Component Segmentation and Tracking in Autonomous Forestry Operations. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 28 February–4 March 2025. [Google Scholar] [CrossRef]
- Ayoub, E.; Fernando, H.; Larrivée-Hardy, W.; Lemieux, N.; Giguère, P.; Sharf, I. Log Loading Automation for Timber-Harvesting Industry. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 17920–17926. [Google Scholar] [CrossRef]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 4015–4026. [Google Scholar] [CrossRef]
- Gao, Y.; Xia, W.; Hu, D.; Wang, W.; Gao, X. DeSAM: Decoupled Segment Anything Model for Generalizable Medical Image Segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2024; Linguraru, M.G., Dou, Q., Feragen, A., Giannarou, S., Glocker, B., Lekadir, K., Schnabel, J.A., Eds.; Springer: Cham, Switzerland, 2024; pp. 509–519. [Google Scholar] [CrossRef]
- Ding, L.; Zhu, K.; Peng, D.; Tang, H.; Yang, K.; Bruzzone, L. Adapting Segment Anything Model for Change Detection in VHR Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–11. [Google Scholar] [CrossRef]
- Noh, S.; Kim, J.; Nam, D.; Back, S.; Kang, R.; Lee, K. GraspSAM: When Segment Anything Model Meets Grasp Detection. In Proceedings of the 2025 IEEE International Conference on Robotics and Automation (ICRA), Atlanta, GA, USA, 19–23 May 2025; pp. 14023–14029. [Google Scholar] [CrossRef]
- Xiong, Y.; Varadarajan, B.; Wu, L.; Xiang, X.; Xiao, F.; Zhu, C.; Dai, X.; Wang, D.; Sun, F.; Iandola, F.; et al. Efficientsam: Leveraged masked image pretraining for efficient segment anything. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 16111–16121. [Google Scholar] [CrossRef]
- Zhang, Z.; Cai, H.; Han, S. Efficientvit-sam: Accelerated segment anything model without performance loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–21 June 2024; pp. 7859–7863. [Google Scholar] [CrossRef]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Lecture Notes in Computer Science. Springer: Cham, Switzerland, 2014; Volume 8693, pp. 740–755. [Google Scholar] [CrossRef]
- Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1280–1289. Available online: https://github.com/facebookresearch/Mask2Former (accessed on 5 March 2024).
- Zhang, H.; Li, F.; Xu, H.; Huang, S.; Liu, S.; Ni, L.M.; Zhang, L. MP-Former: Mask-Piloted Transformer for Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 18074–18083. Available online: https://github.com/IDEA-Research/MP-Former (accessed on 5 March 2024).
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
- Wu, Y.; Kirillov, A.; Massa, F.; Lo, W.Y.; Girshick, R. Detectron2. 2019. Available online: https://github.com/facebookresearch/detectron2 (accessed on 5 March 2024).
- Wada, K. labelme: Image Polygonal Annotation with Python. 2018. Available online: https://github.com/wkentaro/labelme (accessed on 5 March 2024).
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual, 3–7 May 2021. [Google Scholar]
- Cai, H.; Li, J.; Hu, M.; Gan, C.; Han, S. EfficientViT: Lightweight Multi-Scale Attention for High-Resolution Dense Prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 17256–17267. Available online: https://github.com/mit-han-lab/efficientvit (accessed on 5 March 2024).
- Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning (ICML), PMLR, Virtual, 13–18 July 2020; pp. 5156–5165. [Google Scholar]
- Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Sandler, M.; Howard, A.; Le, Q.V. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 2815–2823. [Google Scholar] [CrossRef]
- Tan, M.; Le, Q. Efficientnetv2: Smaller models and faster training. In Proceedings of the International Conference on Machine Learning (ICML), PMLR, Virtual, 18–24 July 2021; pp. 10096–10106. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
- Dumoulin, V.; Visin, F. A guide to convolution arithmetic for deep learning. arXiv 2016, arXiv:1603.07285. [Google Scholar]
- Liu, R.; Lehman, J.; Molino, P.; Petroski Such, F.; Frank, E.; Sergeev, A.; Yosinski, J. An intriguing failing of convolutional neural networks and the coordconv solution. In Advances in Neural Information Processing Systems (NeurIPS); The MIT Press: Cambridge, MA, USA, 2018; Volume 31, pp. 9628–9639. [Google Scholar]
- Wang, X.; Kong, T.; Shen, C.; Jiang, Y.; Li, L. Solo: Segmenting objects by locations. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVIII 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 649–665. [Google Scholar] [CrossRef]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar] [CrossRef]
- Kirillov, A.; Girshick, R.; He, K.; Dollár, P. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 6392–6401. [Google Scholar] [CrossRef]
- Huang, S.; Lu, Z.; Cheng, R.; He, C. Fapn: Feature-aligned pyramid network for dense image prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 844–853. [Google Scholar] [CrossRef]
- Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar] [CrossRef]
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, 3–7 May 2021. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS); The MIT Press: Cambridge, MA, USA, 2017; Volume 30, pp. 6000–6010. [Google Scholar]
- Kirillov, A.; Wu, Y.; He, K.; Girshick, R. Pointrend: Image segmentation as rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9796–9805. [Google Scholar] [CrossRef]
- Cheng, B.; Parkhi, O.; Kirillov, A. Pointly-supervised instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 2607–2616. [Google Scholar] [CrossRef]
- Milletari, F.; Navab, N.; Ahmadi, S.A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar] [CrossRef]
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems (NeurIPS); The MIT Press: Cambridge, MA, USA, 2019; Volume 32, pp. 8026–8037. [Google Scholar]
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Buslaev, A.; Iglovikov, V.I.; Khvedchenya, E.; Parinov, A.; Druzhinin, M.; Kalinin, A.A. Albumentations: Fast and Flexible Image Augmentations. Information 2020, 11, 125. [Google Scholar] [CrossRef]
- Sapkota, R.; Cheppally, R.H.; Sharda, A.; Karkee, M. YOLO26: Key architectural enhancements and performance benchmarking for real-time object detection. arXiv 2025, arXiv:2509.25164. [Google Scholar] [CrossRef]
- Robinson, I.; Robicheaux, P.; Popov, M.; Ramanan, D.; Peri, N. RF-DETR: Neural architecture search for real-time detection transformers. arXiv 2025, arXiv:2511.09554. [Google Scholar]
- Siméoni, O.; Vo, H.V.; Seitzer, M.; Baldassarre, F.; Oquab, M.; Jose, C.; Khalidov, V.; Szafraniec, M.; Yi, S.; Ramamonjisoa, M.; et al. DINOv3. arXiv 2025, arXiv:2508.10104. [Google Scholar] [CrossRef]
- Ravi, N.; Gabeur, V.; Hu, Y.T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. Sam 2: Segment anything in images and videos. arXiv 2024, arXiv:2408.00714. [Google Scholar] [CrossRef]
- Mandal, S. Leveraging Foundation Models in Instance Segmentation of Wood Logs for Robotic Grasping. Master’s Thesis, Graz University of Technology, Graz, Austria, 2024. [Google Scholar] [CrossRef]







| Fold | Train Images | Train Instances | Val Images | Val Instances |
|---|---|---|---|---|
| Fold0 | 176 | 1919 | 44 | 555 |
| Fold1 | 176 | 1970 | 44 | 504 |
| Fold2 | 176 | 2005 | 44 | 469 |
| Fold3 | 176 | 2026 | 44 | 448 |
| Fold4 | 176 | 1976 | 44 | 498 |
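The fold sizes above (176 training and 44 validation images per fold, 220 images in total) are consistent with a shuffled 5-fold split; a minimal sketch of such a split, assuming a single shuffle and equal-sized validation folds (the actual fold assignment in the paper may differ):

```python
import random

def kfold_splits(n_items=220, k=5, seed=0):
    """Shuffle indices once, then slice them into k equal validation folds;
    the remaining indices form each fold's training set."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    fold = n_items // k
    splits = []
    for i in range(k):
        val = idx[i * fold:(i + 1) * fold]
        val_set = set(val)
        train = [j for j in idx if j not in val_set]
        splits.append((train, val))
    return splits

splits = kfold_splits()
train, val = splits[0]
print(len(train), len(val))  # 176 44
```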
| Backbone | AP | AP50 | AR | FPS | Backbone Params | Total Params |
|---|---|---|---|---|---|---|
| EVS-XL1-TC (SA-1B) ⋄ | 61.05 | 84.09 | 69.59 | 11.99 | 189.55M | 209.53M |
| EVS-XL1-TCC (SA-1B) ⋄ | 61.06 | 84.27 | 69.53 | 11.18 | 189.65M | 209.63M |
| EVS-XL0-TC (SA-1B) ⋄ | 59.31 | 82.68 | 67.86 | 13.60 | 118.91M | 138.89M |
| EVS-XL0-TCC (SA-1B) ⋄ | 59.28 | 83.42 | 67.89 | 12.61 | 119.00M | 138.98M |
| Swin-B (IN22k) †,△ [4] | 57.53 | 84.28 | 65.16 | 8.47 | 86.88M | 106.86M |
| Swin-B (IN22k) † | 60.37 | 85.42 | 67.85 | 7.76 | 86.88M | 106.86M |
| Swin-B (IN22k) ⋄ | 61.10 | 84.89 | 69.42 | 7.76 | 86.88M | 106.86M |
| Swin-S (IN1k) † | 58.10 | 83.33 | 65.97 | 10.17 | 48.84M | 68.69M |
| Swin-S (IN1k) ⋄ | 58.73 | 83.29 | 67.50 | 10.17 | 48.84M | 68.69M |
| Swin-T (IN1k) † | 56.88 | 82.75 | 65.52 | 12.65 | 27.52M | 47.38M |
| Swin-T (IN1k) ⋄ | 57.47 | 82.41 | 67.12 | 12.65 | 27.52M | 47.38M |
| Backbone | AP | AP50 | AR | FPS |
|---|---|---|---|---|
| EVS-XL1-TC (SA-1B) ⋄ | 67.06 | 86.09 | 74.16 | 11.99 |
| EVS-XL1-TCC (SA-1B) ⋄ | 65.81 | 85.37 | 74.19 | 11.18 |
| EVS-XL0-TC (SA-1B) ⋄ | 59.22 | 77.17 | 74.78 | 13.60 |
| EVS-XL0-TCC (SA-1B) ⋄ | 66.16 | 85.90 | 74.22 | 12.61 |
| Swin-B (IN22k) † | 65.21 | 86.26 | 70.43 | 7.76 |
| Swin-B (IN22k) ⋄ | 64.43 | 85.40 | 72.00 | 7.76 |
| Swin-S (IN1k) † | 63.25 | 84.65 | 70.42 | 10.17 |
| Swin-S (IN1k) ⋄ | 61.00 | 81.36 | 71.09 | 10.17 |
| Swin-T (IN1k) † | 61.36 | 83.77 | 70.69 | 12.65 |
| Swin-T (IN1k) ⋄ | 60.65 | 83.24 | 69.82 | 12.65 |
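FPS figures like those reported in the tables are typically obtained by timing repeated forward passes after a warm-up phase; a generic measurement sketch (the model, input size, and iteration counts below are placeholders, not the paper's benchmarking protocol):

```python
import time
import torch
import torch.nn as nn

def measure_fps(model, input_shape=(1, 3, 1024, 1024), warmup=10, iters=50):
    """Average frames per second over repeated no-grad forward passes."""
    model.eval()
    x = torch.randn(*input_shape)
    with torch.no_grad():
        for _ in range(warmup):       # warm-up passes, excluded from timing
            model(x)
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        elapsed = time.perf_counter() - start
    return iters / elapsed

# Toy stand-in for a segmentation model, small input for a quick run
toy = nn.Conv2d(3, 8, 3, padding=1)
fps = measure_fps(toy, input_shape=(1, 3, 128, 128), warmup=2, iters=5)
print(fps > 0)  # True
```

On GPU, a `torch.cuda.synchronize()` call before reading the clock would be needed so that asynchronous kernels are included in the measured time.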
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Mandal, S.; Ainetter, S.; Fraundorfer, F. Instance Segmentation in Autonomous Log Grasping Using EfficientViT-SAM MP-Former. Robotics 2026, 15, 44. https://doi.org/10.3390/robotics15020044

