Article

Benchmarking Robust AI for Microrobot Detection with Ultrasound Imaging

1 School of Information and Physical Sciences, University of Newcastle, Newcastle 2308, Australia
2 School of Engineering, University of Newcastle, Newcastle 2308, Australia
3 Department of Computer Science, College of Science & Art at Mahayil, King Khalid University, Abha 62529, Saudi Arabia
4 Faculty of Computing and Information Technology (FoCIT), Sohar University, Sohar 311, Oman
5 PCIGITI Center, SickKids Hospital, Toronto, ON M5G 1E8, Canada
* Authors to whom correspondence should be addressed.
Actuators 2026, 15(1), 16; https://doi.org/10.3390/act15010016
Submission received: 6 October 2025 / Revised: 5 December 2025 / Accepted: 11 December 2025 / Published: 29 December 2025

Abstract

Microrobots are emerging as transformative tools in minimally invasive medicine, with applications in non-invasive therapy, real-time diagnosis, and targeted drug delivery. Effective use of these systems critically depends on accurate detection and tracking of microrobots within the body. Among commonly used imaging modalities, including MRI, CT, and optical imaging, ultrasound (US) offers an advantageous balance of portability, low cost, non-ionizing safety, and high temporal resolution, making it particularly suitable for real-time microrobot monitoring. This study reviews current detection strategies and presents a comparative evaluation of six advanced AI-based multi-object detectors (ConvNeXt, Res2NeXt-101, ResNeSt-269, U-Net, and the latest YOLO variants v11 and v12) applied to microrobot detection in US imaging. Performance is assessed using standard metrics (AP50–95, precision, recall, F1-score) and robustness to four visual perturbations: blur, brightness variation, occlusion, and speckle noise. Additionally, feature-level sensitivity analyses are conducted to identify the contributions of different visual cues. Computational efficiency is also measured to assess suitability for real-time deployment. Results show that ResNeSt-269 achieved the highest detection accuracy, followed by Res2NeXt-101 and ConvNeXt, while YOLO-based detectors provided superior computational efficiency. These findings offer actionable insights for developing robust and efficient microrobot tracking systems with strong potential in diagnostic and therapeutic healthcare applications.

1. Introduction

Microrobots have emerged as a transformative innovation at the intersection of biomedical engineering and intelligent control, enabling new possibilities in minimally invasive medicine [1]. Their development has attracted increasing attention from both research and clinical communities, particularly for applications in non-invasive treatment, diagnosis, and targeted therapy. This progress marks an important step toward expanding the role of microrobotics across diverse areas of biomedicine [2]. These tiny robots can navigate through complex biological environments inside the human body. Size-wise, microrobots are broadly classified into three categories: nanoscale (100 nm–1 µm), microscale (1–100 µm), and millimeter scale (1–5 mm). Owing to these capabilities, they can revolutionize the diagnosis and treatment landscape through targeted drug delivery, localized diagnostics, and microsurgical procedures [3]. Schmidt et al. [4] analyzed the effectiveness of a cell-based microrobot for tumor therapy. These robots can be guided to tumor sites and deliver drugs directly, thereby reducing adverse effects on surrounding healthy tissues, which is a limitation commonly encountered in conventional treatments such as radiation therapy and chemotherapy, as noted by Xie et al. [5]. Furthermore, as described by Lan et al. [6], these microrobots enable therapeutic interventions in challenging-to-reach regions of the body, including the gastrointestinal tract, cerebral vasculature, and deep lung airways, where traditional methods are ineffective or fail, providing a compelling argument for developing microrobot-based technologies for medical applications.
Yan et al. [7] studied microrobot tracking within the human body. The technical functional properties of microrobots help them overcome physical barriers and efficiently access confined anatomical areas. Research by Lee et al. [8] investigated the use of microrobots in diverse biological environments, including within the human body. They showed that the functional properties of microrobots, such as their small size and magnetic actuation, allow these devices to overcome physical barriers and access confined anatomical regions, thereby enabling more efficient targeted interventions. This perspective is echoed by Nauber et al. [9], who critically discuss the flexibility that microrobots bring to therapeutic interventions in reproductive medicine. It is evident that microrobots are rapidly advancing across multiple medical applications, including reproductive medicine [9], ocular drug delivery [10], targeted therapeutic applications [11], immune evasion [12], toxin detection [13], and brain vasculature [14]. Ongoing research and development in microrobotic technology are crucial for advancing less invasive treatments and diagnostics, which can lead to better outcomes by delivering the right treatments at the right place and promoting healing.
The above literature highlights how microrobots can benefit medical systems, with the potential to impact patient care and quality of life directly. A critical factor in advancing microrobot applications is the development of state-of-the-art methods for detecting, tracking, and supporting their navigation. Microrobots can be detected and tracked using various imaging modalities, including Magnetic Resonance Imaging (MRI) [14], Computed Tomography (CT) [15], optical imaging, and ultrasound (US) [16]. However, in clinical settings, these modalities face significant challenges in detection and tracking, as noted by Pane et al. [17,18], including limited real-time tracking capabilities, radiation exposure, cost, and non-portability. Furthermore, as highlighted by Wrede et al. and Sawhney et al. [14,19], the microrobot’s small size, dynamic operating environments, and the presence of surrounding tissue artifacts further complicate detection and tracking. Figure 1 highlights the therapeutic, diagnostic, monitoring, and clinical benefits of microrobots.

1.1. Aims and Contributions

This work aims to compare standard state-of-the-art object detection models and establish a comprehensive and reproducible performance benchmark for ultrasound-guided microrobot detection. The novelty of this work lies in presenting an extensive benchmarking study of six object detectors: ConvNeXt, Res2NeXt-101, ResNeSt-269, U-Net, and the two latest nano variants of YOLO (v11 and v12) for microrobot detection using US imaging. The focus of this work is not only on analyzing baseline performance metrics, including AP (50–95), precision, recall, and F1-score, but also on robustness under challenging operating conditions, efficiency, and feature-level contributions. This also makes the system more explainable and open to interpretation. The contributions of this work are as follows:
  • Systematic performance benchmarking of six state-of-the-art multi-object detectors for microrobot detection using US imaging.
  • Evaluation of the robustness of object detectors under realistic US imaging challenges, including blur, brightness variation, occlusion, and speckle noise.
  • Feature-level sensitivity analysis that quantifies the contributions of visual cues (edges, sharpness, textures, and regions) to detector performance.
  • Assessment of detector efficiency and deployment feasibility by examining computational metrics such as latency, frames per second (FPS), parameter count, and giga floating-point operations (GFLOPs).

1.2. Structure of the Paper

This paper consists of six sections that provide a clear and focused exploration of deep learning-based detectors for microrobot detection using US imaging. Section 2 reviews the literature on the role of US imaging in microrobotics and on US-based object detection. Section 3 explains the methodology, including the setup of experiments and datasets, as well as the benchmarking pipeline for microrobot detection. Section 4 discusses the results. In Section 5, limitations of the study are discussed. Finally, the paper concludes in Section 6.

2. Literature Review

Real-time detection of microrobots remains challenging in various medical procedures, including targeted drug delivery, minimally invasive surgery, and localized diagnostics [20,21,22]. Additionally, microrobot tracking depends on the performance of microrobot detectors. US imaging is used for this purpose to provide visual feedback that guides medical procedures [23]. The major drawback of US imaging is its limited spatial resolution and signal-to-noise ratio, which are significant challenges in microrobotic scenarios [20,24]. On the other hand, AI is today transforming virtually every domain, including healthcare [25], medicine [26], physics [27], chemistry [28], education [29,30], robotics [31,32], astronomy [33], and beyond, from different application perspectives. Similarly, AI can be leveraged to address the challenges posed by US imaging in detecting microrobots.

2.1. Why Ultrasound Imaging?

Various medical imaging modalities, such as Magnetic Resonance Imaging (MRI), Computed Tomography (CT) scans, optical imaging, and US, can be beneficial for designing machine-learning- and deep-learning-powered microrobot detection and tracking methods [34,35,36]. MRI can be used to visualize soft tissues and vascular structures. Tong et al. [37] discuss the importance of MRI in breast tumor identification. MRI uses a paramagnetic contrast agent that enhances tumor visualization, providing valuable characteristics that inform surgical decisions and treatment plans. Further, CT scans offer rapid imaging and high spatial resolution, making them a strong choice in trauma settings and for evaluating complex anatomical structures. The utility of CT in assessing scoliosis is excellent, as it provides a detailed view of vertebral rotation, which is not possible with other 2D imaging methods [38]. However, MRI and CT scans have limitations that make them unsuitable for detecting and tracking microrobots, as shown in Figure 2. From a technical perspective, MRI systems are costly to deploy and maintain, which limits their accessibility [39,40], while the ionizing radiation used in CT scans can have adverse effects on human cells, as demonstrated in several research studies [41,42]. On the contrary, US imaging is a low-cost technology with minimal side effects, making it an optimal candidate for detecting and tracking microrobots.
Numerous studies [23,43,44,45] have demonstrated that US images are well suited for real-time monitoring of microrobots and are ideal for point-of-care use compared with MRI and CT. This is because US imaging is non-ionizing, cost-effective, portable, and compact, while providing high temporal resolution. Figure 2 presents a detailed comparative analysis of MRI, CT, and US imaging. Further, building on this knowledge, Cheung et al. [38] stated that contrast-enhanced US imaging considerably increases diagnostic efficacy. Additionally, Figure 2 illustrates the reasons why US is the optimal choice for microrobot detection and tracking. However, US imaging contains noise, has low contrast, and has limited resolution, which makes the detection and tracking of microrobots challenging. With advancements in AI and the emergence of deep learning as a powerful tool in computer vision tasks, some improvements have been achieved in the detection and tracking of microrobots using US.

2.2. Microrobot Detection

Microrobots in medical science are tiny robotic devices used inside the human body to aid in diagnosis, treatment, minimally invasive surgery, and tissue repair. Detection of these small robots is essential for real-time monitoring and precise movement control, enabling the microrobot to perform as planned on the intended target within the body. This also guarantees the safety, accuracy, and effectiveness of medical procedures [46]. As depicted in Figure 2, US images are preferred over MRI and CT scan images because they are more efficient in terms of size, cost, scan speed, and portability [47]. In the last decade, machine learning and, particularly, deep learning methods have made significant inroads in the development of microrobot detection methods based on US imaging [48].
Nowadays, Convolutional Neural Network (CNN)-based methods are widely used for microrobot detection in ultrasound imaging for medical applications [21,49]. Botros et al. [21] propose a deep learning-based method for detecting and tracking chain-like magnetic microrobots in real-time using ultrasound imaging, achieving up to 95% detection and 93% tracking accuracy in dynamic in vitro environments. Further, Sadak [49] introduces an explainable CNN model, the Attention-Fused Bottleneck Module (AFBM), to improve microrobot classification and localization using US imaging. It outperforms YOLO-based models, achieving a maximum mAP of 0.909 at an IoU of 0.95. Similarly, Li et al. [22] proposed a new variant of YOLOv5, which produces over 91% prediction accuracy for microrobots of sizes 1–3 mm. The proposed method is effective for both 2D and 3D US imaging in detection and tracking for in vivo applications.
Reinforcement learning (RL) [50], along with vision transformers (ViTs) [51,52], is also emerging as a strong contender for microrobot detection methods. However, most researchers currently focus on diagnosis-based detection using US imaging rather than microrobot detection. Medany et al. [53] introduce a model-based RL approach for controlling and detecting microrobots using US imaging. The paper has a particular focus on utilizing image feedback to support real-time operation in complex environments. The advantage of the method is that it can operate with limited training data and achieves up to 90% accuracy across various scenarios. Liu et al. [51] presented a method that combines CNNs with a transformer for detecting robotic capsules in real time using 3D US. The advantage of working with transformers is that they can capture long-range context, thereby improving detection accuracy. The system achieved a detection accuracy of over 90% and a localization error of 1.5 mm. Microrobots face challenges in manual navigation due to complex dynamics and multiple motion parameters. Schrage et al. [54] use reinforcement learning to enable autonomous navigation by training microrobots on over 100,000 images for accurate detection and control. The method combines primary and secondary acoustic radiation forces to guide microswarms along desired paths with robustness and adaptability. The gastrointestinal tract is complex due to its intricate internal anatomy, making detection challenging, and US imaging plays a critical role here. Conventional tracking of capsule location and operational state poses a substantial challenge for clinical applications. To address this issue, Liu et al. [55] introduced more intelligent robotic behaviors by proposing a deep learning-based approach that combines hierarchical attention with ultrasound imaging for both state classification and 2D in-plane pose estimation of capsule robots. It achieved 97% accuracy in mechanism state detection and a mean centroid/orientation error of 0.24 mm.

3. Methodology

The core of this work lies in establishing a robust and clinically relevant benchmarking framework for detecting microrobots of different shapes using US imaging. The focus is not just on conventional accuracy reporting, but on moving beyond it and assessing detectors (models) under realistic biomedical conditions. As shown in Figure 3, we selected six of the most recent state-of-the-art object detectors, namely ConvNeXt [56], Res2NeXt-101 [57], ResNeSt-269 [58], U-Net [59], and the two latest variants of YOLO (v11 [60], v12 [61]), which have demonstrated superior performance across various domains and are chosen for their accuracy, speed, and efficiency, making them well suited for real-time medical applications. Further, we selected the USMicroMagSet [20] dataset for its dedicated and diverse ultrasound microrobot images. To ensure model reliability, reproducibility, and statistically sound performance, all detectors are trained using the same hyperparameters and with 5-fold cross-validation. The analysis methodology covers accuracy, efficiency, and feature-level analysis. This multi-tiered design guarantees not only scientific robustness but also practicability for real-world biomedical applications.

3.1. Experimental Setup

All experiments are conducted using RunPod.io, a commercial GPU cloud service platform. We employed the NVIDIA B200 GPU, featuring the state-of-the-art Blackwell architecture, for the more computationally intensive multi-microrobot detection training and testing at an image size of 1920 × 1080, which requires a robust processing infrastructure. The B200 GPU node has 28 vCPUs from an AMD EPYC 9555 (64-core) processor and one B200 GPU with 180 GB vRAM. To store the dataset, 200 GB of network storage has been used. The B200 GPU delivers ~2250 FP16 TFLOPs (Dense) and up to ~4500 FP16 TFLOPs (Sparse), facilitating extreme throughput for next-generation AI workloads. The implementation uses the Ultralytics and Torchvision libraries in a Python environment built on PyTorch. We have standardized the core hyperparameters across all six models as stated in Table 1. Furthermore, Figure 4 provides a detailed explanation of the B200 GPU computational infrastructure used in this work.
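For concreteness, the snippet below sketches how the uniform settings in Table 1 could be passed to one of the YOLO detectors through the Ultralytics API; the dataset configuration file name is a placeholder, and the non-YOLO detectors would use an analogous Torchvision training loop with the same values. This is a minimal sketch rather than the authors' exact script.

```python
from ultralytics import YOLO

# Minimal sketch: training one YOLO detector with the uniform settings of Table 1.
# "usmicromagset.yaml" is a hypothetical dataset configuration (see Section 3.2).
model = YOLO("yolo11n.pt")       # nano variant, as benchmarked
model.train(
    data="usmicromagset.yaml",   # hypothetical dataset config
    epochs=30,                   # equal training duration
    batch=16,                    # identical gradient update frequency
    workers=16,                  # equal data throughput
    imgsz=1920,                  # high-resolution ultrasound frames
    optimizer="AdamW",
    lr0=1e-4,                    # uniform learning rate
    amp=True,                    # mixed precision (FP16) on Tensor Cores
    seed=42,                     # fixed seed for reproducibility
)
```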

3.2. Dataset

In this work, we utilized USMicroMagSet [20], a comprehensive, publicly available B-mode US dataset for microrobots available on IEEE DataPort. It consists of over 40,000 annotated US frames of size 1920 × 1080 for eight different shapes of magnetic microrobots and two locomotion modes. However, the total number of usable images is 32,511. Table 2 contains the class distributions of all microrobots used during training and testing. It should be noted that we have not divided the dataset into training and testing sets ourselves; instead, this work uses the same division provided in the USMicroMagSet dataset folder structure, where training and testing data are located in the (train + val) and test folders, respectively. This helps ensure the work is universally applicable and supports future benchmarking efforts that can be performed using the same training and testing data.
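For illustration, a dataset configuration matching the folder split described above could be written as follows; the root path, folder names, and class index order are assumptions rather than details taken from the USMicroMagSet documentation, while the class names follow Table 2.

```python
import yaml

# Hypothetical data config mirroring the split described above.
# Root path, folder names, and class index order are assumptions for this sketch.
cfg = {
    "path": "/data/USMicroMagSet",
    "train": "train",   # (train + val) frames as provided by the dataset
    "val": "train",     # 5-fold CV splits are drawn from this set at run time
    "test": "test",     # held-out test frames
    "names": {0: "cube", 1: "cylinder", 2: "flagella", 3: "helical",
              4: "rollingcube", 5: "sheetrobot", 6: "sphere1", 7: "sphere3"},
}
with open("usmicromagset.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```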
Furthermore, Figure 5 visualizes samples of unlabeled and labeled US images from USMicroMagSet. This dataset is commonly used for benchmarking detection and tracking algorithms [49], particularly in machine learning and deep learning. Other works have curated datasets from closed-loop control experiments, which are generally not openly available. Nauber et al. [62] used a custom US dataset to fine-tune deep learning models for microrobot detection and tracking. The lack of large, diverse, high-quality, annotated US datasets has been a considerable barrier. Other than USMicroMagSet, few datasets address the gap between the need for and the availability of US datasets to fuel research in this field. Specifically, we observe a lack of deep learning-based methods for detection and tracking; without suitable datasets, training these powerful AI algorithms is not possible. Furthermore, ongoing works [49,61,62] focus on enhancing spatial resolution, minimizing artifacts, and improving US quality in complex, dynamic environments, such as the human body.

3.3. Benchmarking Pipeline

A standardized benchmarking pipeline, as depicted in Figure 6, has been developed for experimentation with six state-of-the-art detectors that have demonstrated effectiveness across multiple domains. These detectors can play a notable role in addressing the challenge of microrobot detection using ultrasound imaging. The most recent and widely used object detectors are selected, including ConvNeXt [56], Res2NeXt-101 [57], ResNeSt-269 [58], U-Net [59], and the two YOLO variants (v11 [60] and v12 [61]). To obtain a robust, reliable, and reproducible final model, five-fold cross-validation (5-Fold CV) is used. Table 3 presents comparative details of the detectors used and their relative strengths from specific perspectives.
The training pipeline, as depicted in Figure 6, uses 5-Fold CV; for each fold, the training dataset is randomly partitioned into training (80%) and validation (20%) splits. For all models, we used 30 epochs. As the dataset is large and the images are high resolution (1920 × 1080), we need models that not only achieve high accuracy but are also computationally efficient. Therefore, we used mixed precision (FP16) to fully leverage Tensor Core acceleration. The number of dataloader workers is 16, and a batch size of 16 is used to accommodate the high-resolution input; a larger batch size would require more memory, potentially overwhelming even the most advanced GPUs. For reproducibility, a fixed random seed (42) is used. During 5-Fold CV training, early stopping with patience = 5 is applied; during final model refitting, validation and early stopping are disabled (patience = 0), consolidating the five fold-level models into a single final model trained on the entire dataset (training and validation). All trained weights and performance data are saved. The testing pipeline is designed to analyze the detectors from multiple perspectives: (1) accuracy, (2) visual perturbation, (3) feature contribution, and (4) computational efficiency.
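The sketch below illustrates this 5-Fold CV stage for one YOLO detector under stated assumptions: it reuses the hypothetical usmicromagset.yaml configuration from Section 3.2, draws the 80/20 fold splits with scikit-learn's KFold, and writes per-fold image lists, which Ultralytics accepts inside a data YAML. It is illustrative only and not the authors' exact pipeline.

```python
import glob
import yaml
from pathlib import Path
from sklearn.model_selection import KFold
from ultralytics import YOLO

# Illustrative 5-fold CV for one detector; paths are placeholders.
base_cfg = yaml.safe_load(Path("usmicromagset.yaml").read_text())
image_paths = sorted(glob.glob("/data/USMicroMagSet/train/images/*.png"))
kf = KFold(n_splits=5, shuffle=True, random_state=42)   # fixed seed = 42

for fold, (tr_idx, va_idx) in enumerate(kf.split(image_paths)):
    # Write per-fold image lists; Ultralytics accepts .txt lists inside a data YAML.
    Path(f"fold{fold}_train.txt").write_text("\n".join(image_paths[i] for i in tr_idx))
    Path(f"fold{fold}_val.txt").write_text("\n".join(image_paths[i] for i in va_idx))
    fold_cfg = {**base_cfg, "path": str(Path.cwd()),
                "train": f"fold{fold}_train.txt", "val": f"fold{fold}_val.txt"}
    Path(f"fold{fold}.yaml").write_text(yaml.safe_dump(fold_cfg))

    YOLO("yolo11n.pt").train(data=f"fold{fold}.yaml", epochs=30, batch=16,
                             workers=16, imgsz=1920, optimizer="AdamW",
                             lr0=1e-4, amp=True, patience=5, seed=42)

# Final refit on the full training data with validation/early stopping disabled.
YOLO("yolo11n.pt").train(data="usmicromagset.yaml", epochs=30, batch=16,
                         workers=16, imgsz=1920, optimizer="AdamW",
                         lr0=1e-4, amp=True, patience=0, val=False, seed=42)
```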

4. Results and Discussion

Benchmarking has been conducted for multi-microrobot detection. The six state-of-the-art detectors are evaluated based on four criteria we set for this work.
First, each detector is evaluated using classical detection benchmark metrics, including Average Precision (AP 50–95), precision, recall, and F1-score. Secondly, each detector is evaluated under challenging conditions to test its robustness, including blurriness, brightness, occlusion, and noise. Thirdly, feature-level contributions are computed for each detector. Lastly, computational resource requirements are measured, including latency (ms/image), frames per second (FPS), parameters in millions, and model file size in MB, revealing the classic speed/accuracy/size trade-off in deep learning.

4.1. Classical Detection Metrics

Overall and per-class performance analysis, as presented in Table 4 and Table 5, reveals an apparent dichotomy in the difficulties encountered during detection on this dataset. As stated in Table 4, ResNeSt-269 achieved the highest mAP across IoU thresholds (0.50–0.95) of 0.878, whereas ConvNeXt and Res2NeXt-101 performed similarly, with mAPs of 0.8667 and 0.8664, respectively. U-Net achieved a moderate mAP (50–95) of 0.731. However, YOLO11 and YOLO12 trailed the others and achieved the lowest mAPs of 0.6178 and 0.6131, respectively. Furthermore, U-Net achieved the lowest precision of 0.731, whereas YOLO11 and YOLO12 achieved the lowest recalls of 0.7690 and 0.8403, respectively. Using the rigorous COCO standard, Mean Average Precision (mAP 50–95), and the F1-score as primary metrics, the extensive per-class evaluation in Table 5 reveals that ResNeSt-269, Res2NeXt-101, and ConvNeXt show a high degree of proficiency in detecting the eight different shapes of microrobots using US imaging. As shown in Table 5, both YOLO11 and YOLO12 failed to detect rolling-cube-shaped microrobots, which affects their overall performance in Table 4. For US microrobot detection, where accuracy is critical due to the nature of its applications, ResNeSt-269 is the strongest detector, followed by Res2NeXt-101 and ConvNeXt. The YOLO11 and YOLO12 accuracies are the lowest among the six detectors.
Table 5 also quantifies the robustness and stability of each detector. In Table 5, we report 95% confidence intervals (CIs) computed across the 5-fold CV results. The confidence intervals show the variability of detection accuracy under different train and validation splits and thus highlight the uncertainty associated with the CV process. However, they do not capture the uncertainty of the final test evaluation, as the CV process splits the training data into 80% for training and 20% for validation. As a result, it is expected that the testing results may fall slightly outside the lower and upper confidence bounds.
Generally, for stronger architectures such as ResNeSt-269, Res2NeXt-101, and ConvNeXt, mAP may fall slightly outside the CV confidence interval, which is expected behavior, as the final detector is trained on 100% of the training data (as stated in Table 5), and this tends to improve performance in stronger architectures. However, detectors that are architecturally weaker for the microrobot detection problem, such as YOLO11 and YOLO12, do not improve their performance even when trained on the complete training data. Section 4.4 can be referred to for understanding which detector architectures are weaker or stronger for the microrobot detection problem. Thus, the confidence-interval results in Table 5 implicitly reflect the robustness of the detectors and complement the mAP (50–95) values given in the same table.

4.2. Perturbations

The goal of introducing perturbations is to conduct meaningful stress tests of the ConvNeXt, Res2NeXt-101, ResNeSt-269, U-Net, YOLO11, and YOLO12 detectors. Four challenging conditions are introduced into the actual US images: blur, brightness, occlusion, and speckle noise.
A Gaussian blur with a (21, 21) kernel (k = 10) is introduced; with the standard deviation left unspecified, OpenCV automatically computes σ ≈ 3.5 from the kernel size, and a single Gaussian convolution is applied. This is a noticeable and significant blur that can simulate real-world microrobot motion, physiological motion, and a poor acoustic window. The blurred image can be defined as
$$I_b(x, y, c) = (I * G_\sigma)(x, y, c)$$
where $*$ denotes 2D convolution, $G_\sigma$ is the Gaussian kernel, $x$ and $y$ are pixel coordinates, and the color channels are denoted by $c \in \{R, G, B\}$.
Brightness and contrast are adjusted using an alpha (α) value of 1.8 (contrast control) and a beta (β) value of −40 (brightness control), which simulate the effects of operator settings, time-gain compensation (TGC), and signal attenuation. An affine transformation is applied to each image, given as
$$I_{bc}(x, y, c) = \mathrm{clip}\big(\alpha \cdot I(x, y, c) + \beta\big)$$
For occlusion, five black boxes are added to each image. The dimensions of these boxes are relative to the image size, with each box’s height and width set to 15% of the original dimensions. This blocks parts of the ultrasound view and simulates anatomical blockages, including anatomical structures and imaging artifacts. The pixel intensity after occlusion (over the occluded region $O$) is given as
$$I_{oc}(x, y, c) = \begin{cases} 0, & (x, y) \in O \\ I(x, y, c), & \text{otherwise} \end{cases}$$
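A minimal sketch of these three perturbations with OpenCV and NumPy is given below; the random placement of the occlusion boxes is an assumption of this sketch, since the exact placement strategy is not specified above, and 8-bit images are assumed.

```python
import cv2
import numpy as np

def blur(img: np.ndarray) -> np.ndarray:
    # (21, 21) kernel with sigma = 0: OpenCV derives sigma (~3.5) from the kernel size.
    return cv2.GaussianBlur(img, (21, 21), 0)

def brightness_contrast(img: np.ndarray, alpha: float = 1.8, beta: float = -40.0) -> np.ndarray:
    # Affine intensity transform I_bc = clip(alpha * I + beta) for 8-bit images.
    out = alpha * img.astype(np.float32) + beta
    return np.clip(out, 0, 255).astype(np.uint8)

def occlude(img: np.ndarray, n_boxes: int = 5, frac: float = 0.15, seed: int = 42) -> np.ndarray:
    # Paint n_boxes black rectangles, each 15% of the image height and width.
    # Random placement is an assumption made for this sketch.
    rng = np.random.default_rng(seed)
    out = img.copy()
    h, w = img.shape[:2]
    bh, bw = int(frac * h), int(frac * w)
    for _ in range(n_boxes):
        y, x = rng.integers(0, h - bh), rng.integers(0, w - bw)
        out[y:y + bh, x:x + bw] = 0
    return out
```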
Under blur perturbations, as depicted in Figure 7, ConvNeXt and Res2NeXt-101 exhibit strong resilience, with AP drops of only 12–14%, whereas ResNeSt-269, YOLO12, and YOLO11 performed moderately with AP drops of 18–30%. However, U-Net completely collapsed, with a nearly 79% drop in AP. Figure 7 shows a similar trend for brightness perturbations, which appeared to be the most challenging and severely degraded AP. ConvNeXt and Res2NeXt-101 remain relatively stable with AP drops of 15–28%, whereas ResNeSt-269, YOLO12, and YOLO11 suffer severe AP drops of 32–71%, and U-Net totally collapsed with an 84.85% AP drop. Figure 7 also illustrates the overall moderate impact of occlusion, excluding U-Net. In summary, across blur, brightness, and occlusion, ConvNeXt, Res2NeXt-101, and ResNeSt-269 are the most robust, the YOLO detectors performed moderately, and U-Net shows significant vulnerability.
Speckle noise perturbation is arguably the most crucial robustness test for microrobot applications using US imaging. Speckle noise generally results from the interference of backscattered echoes and is inherent to US acquisition rather than artificially produced [63]. Speckle noise with variances of 0.02 to 0.1 is added to simulate the US texture signature. This type of noise is not simple sensor noise; it represents a highly intricate interference pattern of real-world sound waves scattered by microscopic structures within tissue. US environments with speckle noise therefore add a substantial and realistic challenge for microrobot detectors, and additional degradation may arise from faulty equipment, an unskilled operator, or suboptimal acquisition settings. Therefore, this work systematically examines the robustness of the ConvNeXt, Res2NeXt-101, ResNeSt-269, U-Net, YOLO11, and YOLO12 object detectors against speckle noise, a phenomenon commonly observed in US images.
We evaluated each detector on five levels of speckle noise perturbations with noise variances $\sigma_s^2 \in \{0.02, 0.04, 0.06, 0.08, 0.1\}$. Each pixel is perturbed by a zero-mean Gaussian noise term $n(x, y, c) \sim \mathcal{N}(0, \sigma_s^2)$ applied multiplicatively. Thus, the resulting intensity is given as
$$I_{sp}(x, y, c) = \mathrm{clip}\big(I(x, y, c) \cdot [1 + n(x, y, c)]\big)$$
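A corresponding sketch of the speckle perturbation, under the same assumptions as the previous snippet (8-bit images, seeded NumPy generator), is:

```python
import numpy as np

def speckle(img: np.ndarray, var: float, seed: int = 42) -> np.ndarray:
    # Multiplicative speckle: I_sp = clip(I * (1 + n)), n ~ N(0, var).
    rng = np.random.default_rng(seed)
    n = rng.normal(0.0, np.sqrt(var), size=img.shape)
    out = img.astype(np.float32) * (1.0 + n)
    return np.clip(out, 0, 255).astype(np.uint8)

# The five noise levels evaluated in this work.
variances = [0.02, 0.04, 0.06, 0.08, 0.10]
```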
Speckle noise analysis shows striking differences in model resilience as the noise variance increases, as depicted in Figure 8. YOLO12 shows minor AP drops in the range of 1.1 to 1.7%, whereas the YOLO11 AP drop range is 1.1 to 4.0%, slightly higher than its successor. After the YOLO-based detectors, ConvNeXt is also resilient to speckle noise, with an AP drop range of 5.3 to 11.2%. In contrast, ResNeSt-269 and U-Net failed under speckle noise, with APs dropping by more than 50% once the variance reached 0.08. In summary, the YOLO-based detectors maintain reliable performance under challenging ultrasound noise conditions, highlighting their strong noise tolerance. These findings provide substantial proof that noise brings challenging and sometimes catastrophic instability during microrobot detection.

4.3. Feature Contributions

To quantify feature-level contributions, we used a computationally light methodology. The idea is to identify which visual factors each detector relies on during detection. We perform a robustness-based feature contribution analysis based on the mAP drops observed under the different perturbations, mapping each perturbation to an abstract feature: blur to edges, brightness to illumination, occlusion to spatial structure, and speckle noise to texture. The feature-level contribution is given as
$$W_f = 100 \cdot \frac{\max(0, d_f)}{\sum_{f'} \max(0, d_{f'})}$$
where $d_f$ denotes the mAP drop associated with feature $f \in \{\text{edges}, \text{texture}, \text{illumination}, \text{spatial}\}$. The most significant features reported for each detector are those with the largest $W_f$.
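A minimal sketch of this attribution rule, using made-up mAP drops purely for illustration, is:

```python
def feature_contributions(drops: dict[str, float]) -> dict[str, float]:
    # drops maps feature -> mAP drop d_f under its paired perturbation
    # (blur -> edges, brightness -> illumination, occlusion -> spatial, speckle -> texture).
    positive = {f: max(0.0, d) for f, d in drops.items()}
    total = sum(positive.values()) or 1.0   # guard against all-zero drops
    return {f: 100.0 * d / total for f, d in positive.items()}

# Illustrative numbers only, not results from this study.
print(feature_contributions({"edges": 12.0, "illumination": 28.0,
                             "spatial": 9.0, "texture": 6.0}))
```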
We do not rely on pixel-level Grad-CAM, as it provides only local saliency and is less reliable. Further, such maps do not capture model-level feature dependence, and from an operational perspective, they are computationally too heavy for detector architectures. Instead, we used the robustness perturbations to infer model-level feature dependence, which often yields a more stable and more informative picture of the model.
Feature-level analysis reveals that different detector architectures focus on distinct features, by virtue of their architectures, as depicted in Figure 9. ConvNeXt exhibits a balanced profile across all four analyzed features, with spatial features (31%) slightly predominating. Res2NeXt-101 shows a different trend in feature contributions: it places a strong emphasis on illumination (35%) and texture (25%), making it more sensitive to light and effective under high-contrast conditions; however, this also means that it is potentially vulnerable to changes in illumination. ResNeSt-269 relies most heavily on the texture feature (43%), suggesting a specialization in fine-grained surface patterns, which may be relevant to noise-rich US imaging. Furthermore, U-Net focuses on illumination (35%) and edges (32%), which is understandable given its segmentation-oriented architecture, which benefits from boundary detection and intensity gradients. Lastly, as depicted in Figure 9, YOLO11 and YOLO12 exhibit an overwhelming reliance on illumination, accounting for around 56%, and neglect texture, indicating that, despite architectural advances, they still rely heavily on illumination.

4.4. Explainability

Explainability fosters trust and transparency. It supports accountability, debugging, and fairness, and, particularly in high-stakes domains such as healthcare and security, it helps in understanding the safety profile of a system. One major issue with AI today is that it is challenging to explain the decision-making process by which it arrives at its conclusions, as deep learning algorithms are inherently black boxes. Considerable research and development effort is now going into developing models whose decisions can be explained. In this work, we contribute to the effort to develop explainability for microrobot detection using ultrasound imaging. We applied our understanding of algorithmic design principles and the knowledge gained in Section 4.3, “Feature Contributions”. Based on an integrated analysis of AP (50–95) across all six detectors, the three worst-performing object classes are rollingcube, sheetrobot, and sphere1, with AP (50–95) values of 0.598, 0.676, and 0.768, respectively. U-Net performed worst on sphere1, YOLO12 performed worst on sheetrobot, and YOLO11 performed worst on rollingcube. An integrated analysis of model feature sensitivity and model architecture largely explains why sphere1, sheetrobot, and rollingcube emerged as the three worst-performing classes for specific detectors.
As stated above, U-Net achieved the lowest performance on sphere1, due to weak boundaries, low contrast, and ultrasonic streaking artifacts. U-Net’s failure on the sphere1 class stems from its architectural design: it is an encoder–decoder segmentation model with skip connections, which depends significantly on edges (32%) and illumination (35%), while its dependence on texture is only 13%. The sphere1 microrobot class provides almost none of the boundary cues that U-Net’s architectural design can exploit, resulting in poor detection performance. For YOLO11 and YOLO12, the worst results are observed for the rollingcube and sheetrobot microrobot classes, including a substantial performance deterioration for YOLO11 on rollingcube. The feature-level analysis provides strong evidence that the YOLO models overwhelmingly prioritize illumination, with an approximate 56% focus, while almost entirely ignoring texture (<3%). Considering the properties of rollingcube and sheetrobot, they appear as dark, low-intensity, texture-driven microrobots in US imaging and possess significantly less brightness contrast. Therefore, YOLO becomes nearly feature-insensitive for the rollingcube and sheetrobot classes. This also explains why ResNeSt-269 achieved top performance: it uses split-attention blocks that yield the highest texture reliance of approximately 43%, which helps it capture fine-grained intensity fluctuations and thus identify challenging microrobot shapes, making it a texture-concentrated detector. ConvNeXt, with its transformer-inspired CNN architecture and hierarchical feature extraction, attends to all features in a generalized manner. Res2NeXt-101 employs a split–transform–merge strategy, enabling it to focus on illumination and texture.
In summary, based on Table 3, Table 5, and Figure 9, the worst-performing classes are not random outliers but rather reflect a systematic mismatch between each model’s architectural design and the features of specific US microrobot classes.

4.5. Computational Efficiency

Accuracy is considered the most essential performance metric for evaluating detectors. However, computational efficiency plays an equally critical role in determining real-world implementation, particularly for microrobot detection applications, where detection is performed in real time. Latency, FPS, the number of model parameters, and model size play a critical role in any detector’s success in real-time deployment beyond laboratory settings. As depicted in Figure 10a, YOLO11 is the fastest, with a latency of 6.1 ms and 163 FPS. Res2NeXt-101 is the second fastest, with a latency of 6.7 ms and 148 FPS, but it is a very large model, as shown in Figure 10b.
In contrast, YOLO12 ranked third in terms of latency and FPS, as shown in Figure 10a. The ConvNeXt detector is heavy, yet moderately efficient. Furthermore, U-Net and ResNeSt-269 are too heavy and slow, rendering them inefficient for deployment. In summary, YOLO11 is the best candidate for real-world deployment on limited hardware, offering the best compact efficiency, whereas U-Net and Res2NeXt-101 require a high compute budget.
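As an illustration of how such latency and FPS figures can be obtained, the sketch below times repeated inference on a dummy frame at the benchmark resolution; the warm-up count, iteration count, and blank input frame are assumptions of this sketch, and absolute numbers will differ by GPU and batch settings.

```python
import time
import numpy as np
import torch
from ultralytics import YOLO

model = YOLO("yolo11n.pt")
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)   # placeholder ultrasound frame

for _ in range(10):                                  # warm-up iterations
    model.predict(frame, imgsz=1920, verbose=False)

n = 100
if torch.cuda.is_available():
    torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(n):
    model.predict(frame, imgsz=1920, verbose=False)
if torch.cuda.is_available():
    torch.cuda.synchronize()
latency_ms = 1000 * (time.perf_counter() - t0) / n
print(f"latency: {latency_ms:.1f} ms/image, FPS: {1000 / latency_ms:.0f}")
```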

5. Limitations

Dataset scope and diversity are limitations of this study. The dataset is extensive, but its diversity may be questioned. Additionally, the dataset has class imbalance, as shown in Table 2 (for example, sphere1 = 5900 images vs. rollingcube = 1255 images). The class-wise results in Table 5 show that detection accuracy does not scale proportionally with the number of training instances, whether small or large. An important point to note is that even minority classes, such as rollingcube, still contain sufficient training images to train advanced object detectors to produce accurate detections. Table 5 provides evidence that heavier models, such as ConvNeXt and ResNeSt-269, achieve excellent accuracy on these smaller classes; however, for the majority classes, such as sphere1, they do not achieve higher accuracy. These performance traits highlight that model architecture plays a more decisive role than raw sample count, particularly for microrobot detection in US imaging. The class imbalance in the dataset does not appear to affect model behavior; despite this, it remains a limitation, because lighter YOLO-based models may be affected.
Patient anatomy, probe settings, and operator preference vary from case to case; however, we assume the findings are generalizable. Similarly, the perturbation conditions on which the detectors are tested are limited; in the real world, perturbation values of each type may vary across cases and operating environments. Furthermore, we used a powerful B200 GPU node, and the availability of such computational resources may not be feasible in clinical practice, which could affect results on resource-constrained or embedded devices. Despite these limitations, this study can be used to develop technologies that improve diagnosis and treatment effectiveness.

6. Conclusions

Microrobot technology and its applications are in their infancy. However, they can transform the healthcare sector, leading to improved diagnosis, treatments, and overall quality of life. This study systematically benchmarked six state-of-the-art object detectors, including ConvNeXt, Res2NeXt-101, ResNeSt-269, U-Net, YOLO11, and YOLO12, for microrobot detection using US imaging. The detectors are evaluated not only using accuracy metrics but also through four different visual perturbations, feature-level contributions, and computational efficiency. ResNeSt-269 achieves the highest accuracy, whereas YOLO-based detectors offer superior computational efficiency, illustrating the trade-off between precision and real-time feasibility. The findings emphasize the challenges in selecting a detector that balances detection quality with real-time and real-world implementation constraints, particularly in clinical applications.
Future research will concentrate on the development of devices that can seamlessly embed advanced detectors.

Author Contributions

For Conceptualization, A.A., C.H., F.A. and S.L.; methodology, A.A.; software, A.A.; validation, C.H., S.L. and F.A.; formal analysis, A.A.; investigation, A.A.; resources, A.A.; writing—original draft preparation, A.A.; writing—review and editing, C.H., F.A., L.C., M.R. and S.L.; visualization, A.A., M.R. and L.C.; supervision, C.H., S.L. and F.A.; project administration, C.H. and S.L.; funding acquisition, S.L., C.H. and L.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this study is publicly available on IEEE DataPort and is properly cited in the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, Z.; Li, C.; Dong, L.; Zhao, J. A Review of Microrobot’s System: Towards System Integration for Autonomous Actuation In Vivo. Micromachines 2021, 12, 1249. [Google Scholar] [CrossRef] [PubMed]
  2. Sun, T.; Chen, J.; Zhang, J.; Zhao, Z.; Zhao, Y.; Sun, J.; Chang, H. Application of Micro/Nanorobot in Medicine. Front. Bioeng. Biotechnol. 2024, 12, 1347312. [Google Scholar] [CrossRef]
  3. Iacovacci, V.; Diller, E.; Ahmed, D.; Menciassi, A. Medical Microrobots. Annu. Rev. Biomed. Eng. 2024, 26, 561–591. [Google Scholar] [CrossRef] [PubMed]
  4. Schmidt, C.K.; Medina-Sánchez, M.; Edmondson, R.J.; Schmidt, O.G. Engineering Microrobots for Targeted Cancer Therapies from a Medical Perspective. Nat. Commun. 2020, 11, 5618. [Google Scholar] [CrossRef]
  5. Xie, L.; Liu, J.; Yang, Z.; Chen, H.; Wang, Y.; Du, X.; Fu, Y.; Song, P.; Yu, J. Microrobotic Swarms for Cancer Therapy. Research 2025, 8, 0686. [Google Scholar] [CrossRef]
  6. Lan, X.; Du, Y.; Liu, F.; Li, G. Development of Microrobot with Optical Magnetic Dual Control for Regulation of Gut Microbiota. Micromachines 2023, 14, 2252. [Google Scholar] [CrossRef]
  7. Yan, Y.; Jing, W.; Mehrmohammadi, M. Photoacoustic Imaging to Track Magnetic-Manipulated Micro-Robots in Deep Tissue. Sensors 2020, 20, 2816. [Google Scholar] [CrossRef]
  8. Lee, J.G.; Raj, R.R.; Day, N.B.; Shields, C.W. Microrobots for Biomedicine: Unsolved Challenges and Opportunities for Translation. ACS Nano 2023, 17, 14196–14204. [Google Scholar] [CrossRef]
  9. Nauber, R.; Goudu, S.R.; Goeckenjan, M.; Bornhäuser, M.; Ribeiro, C.; Medina-Sánchez, M. Medical Microrobots in Reproductive Medicine from the Bench to the Clinic. Nat. Commun. 2023, 14, 728. [Google Scholar] [CrossRef] [PubMed]
  10. Kakde, I.; Dhondge, P.; Bansode, D. Microrobotics a New Approach in Ocular Drug Delivery System: A Review. Prepr. Res. Sq. 2022. [Google Scholar] [CrossRef]
  11. Hu, N.; Ding, L.; Liu, Y.; Wang, K.; Zhang, B.; Yin, R.; Zhou, W.; Bi, Z.; Zhang, W. Development of 3D-Printed Magnetic Micro-Nanorobots for Targeted Therapeutics: The State of Art. Adv. NanoBiomed Res. 2023, 3, 2300018. [Google Scholar] [CrossRef]
  12. Cabanach, P.; Pena-Francesch, A.; Sheehan, D.; Bozuyuk, U.; Yasa, O.; Borros, S.; Sitti, M. Zwitterionic 3D-Printed Non-Immunogenic Stealth Microrobots. Adv. Mater. 2020, 32, 2003013. [Google Scholar] [CrossRef]
  13. Yang, L.; Zhang, Y.; Wang, Q.; Zhang, L. An Automated Microrobotic Platform for Rapid Detection of C. Diff Toxins. IEEE Trans. Biomed. Eng. 2020, 67, 1517–1527. [Google Scholar] [CrossRef] [PubMed]
  14. Wrede, P.; Degtyaruk, O.; Kalva, S.K.; Deán-Ben, X.L.; Bozuyuk, U.; Aghakhani, A.; Akolpoglu, B.; Sitti, M.; Razansky, D. Real-Time 3D Optoacoustic Tracking of Cell-Sized Magnetic Microrobots Circulating in the Mouse Brain Vasculature. Sci. Adv. 2022, 8, eabm9132. [Google Scholar] [CrossRef]
  15. Xing, J.; Yin, T.; Li, S.; Xu, T.; Ma, A.; Chen, Z.; Luo, Y.; Lai, Z.; Lv, Y.; Pan, H.; et al. Sequential Magneto-Actuated and Optics-Triggered Biomicrorobots for Targeted Cancer Therapy. Adv. Funct. Mater. 2021, 31, 2008262. [Google Scholar] [CrossRef]
  16. Tiryaki, M.E.; Sitti, M. Magnetic Resonance Imaging-Based Tracking and Navigation of Submillimeter-Scale Wireless Magnetic Robots. Adv. Intell. Syst. 2022, 4, 2100178. [Google Scholar] [CrossRef]
  17. Pane, S.; Iacovacci, V.; Sinibaldi, E.; Menciassi, A. Real-Time Imaging and Tracking of Microrobots in Tissues Using Ultrasound Phase Analysis. Appl. Phys. Lett. 2021, 118, 014102. [Google Scholar] [CrossRef]
  18. Pane, S.; Iacovacci, V.; Ansari, M.H.D.; Menciassi, A. Dynamic Tracking of a Magnetic Micro-Roller Using Ultrasound Phase Analysis. Sci. Rep. 2021, 11, 23239. [Google Scholar] [CrossRef]
  19. Sawhney, M.; Karmarkar, B.; Leaman, E.J.; Daw, A.; Karpatne, A.; Behkam, B. Motion Enhanced Multi-Level Tracker (MEMTrack): A Deep Learning-Based Approach to Microrobot Tracking in Dense and Low-Contrast Environments. Adv. Intell. Syst. 2024, 6, 2300590. [Google Scholar] [CrossRef]
  20. Botros, K.; Alkhatib, M.; Folio, D.; Ferreira, A. USMicroMagSet: Using Deep Learning Analysis to Benchmark the Performance of Microrobots in Ultrasound Images. IEEE Robot. Autom. Lett. 2023, 8, 3254–3261. [Google Scholar] [CrossRef]
  21. Botros, K.; Alkhatib, M.; Folio, D.; Ferreira, A. Fully Automatic and Real-Time Microrobot Detection and Tracking Based on Ultrasound Imaging Using Deep Learning. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 9763–9768. [Google Scholar] [CrossRef]
  22. Li, H.; Yi, X.; Zhang, Z.; Chen, Y. Magnetic-Controlled Microrobot: Real-Time Detection and Tracking through Deep Learning Approaches. Micromachines 2024, 15, 756. [Google Scholar] [CrossRef]
  23. Dillinger, C.; Rasaiah, A.; Vogel, A.; Ahmed, D. Real-Time Color Flow Mapping of Ultrasound Microrobots. bioRxiv 2025. [Google Scholar] [CrossRef]
  24. Kim, Y.H. Artificial Intelligence in Medical Ultrasonography: Driving on an Unpaved Road. Ultrasonography 2021, 40, 313. [Google Scholar] [CrossRef]
  25. Alam, F.; Plawiak, P.; Almaghthawi, A.; Qazani, M.R.C.; Mohanty, S.; Roohallah Alizadehsani, A. NeuroHAR: A Neuroevolutionary Method for Human Activity Recognition (HAR) for Health Monitoring. IEEE Access 2024, 12, 112232–112248. [Google Scholar] [CrossRef]
  26. Alam, F.; Mohammed Alnazzawi, T.S.; Mehmood, R.; Al-maghthawi, A. A Review of the Applications, Benefits, and Challenges of Generative AI for Sustainable Toxicology. Curr. Res. Toxicol. 2025, 8, 100232. [Google Scholar] [CrossRef]
  27. Jiao, L.; Song, X.; You, C.; Liu, X.; Li, L.; Chen, P.; Tang, X.; Feng, Z.; Liu, F.; Guo, Y.; et al. AI Meets Physics: A Comprehensive Survey. Artif. Intell. Rev. 2024, 57, 256. [Google Scholar] [CrossRef]
  28. Olawade, D.B.; Fapohunda, O.; Usman, S.O.; Akintayo, A.; Ige, A.O.; Adekunle, Y.A.; Adeola, A.O. Artificial Intelligence in Computational and Materials Chemistry: Prospects and Limitations. Chem. Afr. 2025, 8, 2707–2721. [Google Scholar] [CrossRef]
  29. Zhang, F.; Wang, X.; Zhang, X. Applications of Deep Learning Method of Artificial Intelligence in Education. Educ. Inf. Technol. 2025, 30, 1563–1587. [Google Scholar] [CrossRef]
  30. Mehmood, R.; Alam, F.; Albogami, N.N.; Katib, I.; Albeshri, A.; Altowaijri, S.M. UTiLearn: A Personalised Ubiquitous Teaching and Learning System for Smart Societies. IEEE Access 2017, 5, 2615–2635. [Google Scholar] [CrossRef]
  31. Soori, M.; Arezoo, B.; Dastres, R. Artificial Intelligence, Machine Learning and Deep Learning in Advanced Robotics, a Review. Cogn. Robot. 2023, 3, 54–70. [Google Scholar] [CrossRef]
  32. Roshanfar, M.; Salimi, M.; Kaboodrangi, A.H.; Jang, S.J.; Sinusas, A.J.; Wong, S.C.; Mosadegh, B. Advanced Robotics for the Next-Generation of Cardiac Interventions. Micromachines 2025, 16, 363. [Google Scholar] [CrossRef]
  33. Zhang, H.; Wang, J.; Zhang, Y.; Du, X.; Wu, H.; Zhang, T.; Zhang, C.; Wang, H.L.; Zhang, J. Review of Artificial Intelligence Applications in Astronomical Data Processing. Astron. Tech. Instrum. 2024, 1, 1–15. [Google Scholar] [CrossRef]
  34. Hussain, S.M.; Brunetti, A.; Lucarelli, G.; Memeo, R.; Bevilacqua, V.; Buongiorno, D. Deep Learning Based Image Processing for Robot Assisted Surgery: A Systematic Literature Survey. IEEE Access 2022, 10, 122627–122657. [Google Scholar] [CrossRef]
  35. Bi, Y.; Jiang, Z.; Duelmer, F.; Huang, D.; Navab, N. Machine Learning in Robotic Ultrasound Imaging: Challenges and Perspectives. Annu. Rev. Control Robot. Auton. Syst. 2024, 7, 335–357. [Google Scholar] [CrossRef]
  36. Roshanfar, M.; Salimi, M.; Jang, S.J.; Sinusas, A.J.; Kim, J.; Mosadegh, B. Emerging Image-Guided Navigation Techniques for Cardiovascular Interventions: A Scoping Review. Bioengineering 2025, 12, 488. [Google Scholar] [CrossRef] [PubMed]
  37. Tong, W.; Zhang, X.; Luo, J.; Pan, F.; Liang, J.; Huang, H.; Li, M.; Cheng, M.; Pan, J.; Zheng, Y.; et al. Value of Multimodality Imaging in the Diagnosis of Breast Lesions with Calcification: A Retrospective Study. Clin. Hemorheol. Microcirc. 2020, 76, 85–98. [Google Scholar] [CrossRef] [PubMed]
  38. Cheung, C.W.J.; Zhou, G.Q.; Law, S.Y.; Mak, T.M.; Lai, K.L.; Zheng, Y.P. Ultrasound Volume Projection Imaging for Assessment of Scoliosis. IEEE Trans. Med. Imaging 2015, 34, 1760–1768. [Google Scholar] [CrossRef]
  39. Chinene, B.; Mudadi, L.-s.; Mutasa, F.E.; Nyawani, P. A Survey of Magnetic Resonance Imaging (MRI) Availability and Cost in Zimbabwe: Implications and Strategies for Improvement. J. Med. Imaging Radiat. Sci. 2025, 56, 101819. [Google Scholar] [CrossRef]
  40. Murali, S.; Ding, H.; Adedeji, F.; Qin, C.; Obungoloch, J.; Asllani, I.; Anazodo, U.; Ntusi, N.A.B.; Mammen, R.; Niendorf, T.; et al. Bringing MRI to Low- and Middle-Income Countries: Directions, Challenges and Potential Solutions. NMR Biomed. 2024, 37, e4992. [Google Scholar] [CrossRef]
  41. Bora, A.; Açıkgöz, G.; Yavuz, A.; Bulut, D. Computed Tomography: Are We Aware of Radiation Risks in Computed Tomography? East. J. Med. 2014, 19, 164–168. [Google Scholar]
  42. Power, S.P.; Moloney, F.; Twomey, M.; James, K.; O’Connor, O.J.; Maher, M.M. Computed Tomography and Patient Risk: Facts, Perceptions and Uncertainties. World J. Radiol. 2016, 8, 1567–1580. [Google Scholar] [CrossRef]
  43. Stojkovic, M.; Rosenberger, K.; Kauczor, H.U.; Junghanss, T.; Hosch, W. Diagnosing and Staging of Cystic Echinococcosis: How Do CT and MRI Perform in Comparison to Ultrasound? PLoS Negl. Trop. Dis. 2012, 6, e1880. [Google Scholar] [CrossRef] [PubMed]
  44. Marcello Scotti, F.; Stuepp, R.T.; Dutra-Horstmann, K.L.; Modolo, F.; Gusmão Paraiso Cavalcanti, M. Accuracy of MRI, CT, and Ultrasound Imaging on Thickness and Depth of Oral Primary Carcinomas Invasion: A Systematic Review. Dentomaxillofac. Radiol. 2022, 51, 20210291. [Google Scholar] [CrossRef] [PubMed]
  45. Kraus, B.B.; Ros, P.R.; Abbitt, P.L.; Kerns, S.R.; Sabatelli, F.W. Comparison of Ultrasound, CT, and MR Imaging in the Evaluation of Candidates for TIPS. J. Magn. Reson. Imaging 1995, 5, 571–578. [Google Scholar] [CrossRef] [PubMed]
  46. Liu, X.; Jing, Y.; Xu, C.; Wang, X.; Xie, X.; Zhu, Y.; Dai, L.; Wang, H.; Wang, L.; Yu, S. Medical Imaging Technology for Micro/Nanorobots. Nanomaterials 2023, 13, 2872. [Google Scholar] [CrossRef]
  47. Wang, Q.; Zhang, L. Ultrasound Imaging and Tracking of Micro/Nanorobots: From Individual to Collectives. IEEE Open J. Nanotechnol. 2020, 1, 6–17. [Google Scholar] [CrossRef]
  48. Xiao, X.; Zhang, J.; Shao, Y.; Liu, J.; Shi, K.; He, C.; Kong, D. Deep Learning-Based Medical Ultrasound Image and Video Segmentation Methods: Overview, Frontiers, and Challenges. Sensors 2025, 25, 2361. [Google Scholar] [CrossRef]
  49. Sadak, F. An Explainable Deep Learning Model for Automated Classification and Localization of Microrobots by Functionality Using Ultrasound Images. Robot. Auton. Syst. 2025, 183, 104841. [Google Scholar] [CrossRef]
  50. Hu, M.; Zhang, J.; Matkovic, L.; Liu, T.; Yang, X. Reinforcement Learning in Medical Image Analysis: Concepts, Applications, Challenges, and Future Directions. J. Appl. Clin. Med. Phys. 2023, 24, e13898. [Google Scholar] [CrossRef]
  51. Liu, X.; He, C.; Wu, M.; Ping, A.; Zavodni, A.; Matsuura, N.; Diller, E. Transformer-Based Robotic Ultrasound 3D Tracking for Capsule Robot in GI Tract. Int. J. Comput. Assist. Radiol. Surg. 2025, 20, 2011–2018. [Google Scholar] [CrossRef]
  52. Azad, R.; Kazerouni, A.; Heidari, M.; Aghdam, E.K.; Molaei, A.; Jia, Y.; Jose, A.; Roy, R.; Merhof, D. Advances in Medical Image Analysis with Vision Transformers: A Comprehensive Review. Med. Image Anal. 2024, 91, 103000. [Google Scholar] [CrossRef]
  53. Medany, M.; Piglia, L.; Achenbach, L.; Mukkavilli, S.K.; Ahmed, D. Model-Based Reinforcement Learning for Ultrasound-Driven Autonomous Microrobots. Nat. Mach. Intell. 2025, 7, 1076–1090. [Google Scholar] [CrossRef] [PubMed]
  54. Schrage, M.; Medany, M.; Ahmed, D. Ultrasound Microrobots with Reinforcement Learning. Adv. Mater. Technol. 2023, 8, 2201702. [Google Scholar] [CrossRef]
  55. Liu, X.; Esser, D.; Wagstaff, B.; Zavodni, A.; Matsuura, N.; Kelly, J.; Diller, E. Capsule Robot Pose and Mechanism State Detection in Ultrasound Using Attention-Based Hierarchical Deep Learning. Sci. Rep. 2022, 12, 21130. [Google Scholar] [CrossRef] [PubMed]
  56. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11966–11976. [Google Scholar] [CrossRef]
  57. Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2Net: A New Multi-Scale Backbone Architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 652–662. [Google Scholar] [CrossRef]
  58. Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Lin, H.; Zhang, Z.; Sun, Y.; He, T.; Mueller, J.; Manmatha, R.; et al. ResNeSt: Split-Attention Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  59. Weng, W.; Zhu, X. U-Net: Convolutional Networks for Biomedical Image Segmentation. IEEE Access 2015, 9, 16591–16603. [Google Scholar] [CrossRef]
  60. Jocher, G.; Qiu, J. Ultralytics YOLO11. Available online: https://docs.ultralytics.com/models/yolo11/ (accessed on 15 August 2025).
  61. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar] [CrossRef]
  62. Nauber, R.; Hoppe, J.; Robles, D.C.; Medina-Sánchez, M. Photoacoustics-Guided Real-Time Closed-Loop Control of Magnetic Microrobots through Deep Learning. In Proceedings of the MARSS 2024—7th International Conference on Manipulation, Automation, and Robotics at Small Scales, Delft, The Netherlands, 1–5 July 2024. [Google Scholar] [CrossRef]
  63. Khalifa, M.; Hamza, H.M.; Hosny, K.M. De-Speckling of Medical Ultrasound Image Using Metric-Optimized Knowledge Distillation. Sci. Rep. 2025, 15, 23703. [Google Scholar] [CrossRef]
Figure 1. Applications of microrobots in the human body, highlighting therapeutic, diagnostic, monitoring, and clinical benefits.
Figure 1. Applications of microrobots in the human body, highlighting therapeutic, diagnostic, monitoring, and clinical benefits.
Actuators 15 00016 g001
Figure 2. Ultrasound (US) is considered the preferred imaging modality for detecting and tracking microrobots.
Figure 2. Ultrasound (US) is considered the preferred imaging modality for detecting and tracking microrobots.
Actuators 15 00016 g002
Figure 3. Justification for selecting six state-of-the-art detectors based on their architecture, efficiency, clinical relevance, and complementary strengths.
Figure 3. Justification for selecting six state-of-the-art detectors based on their architecture, efficiency, clinical relevance, and complementary strengths.
Actuators 15 00016 g003
Figure 4. High-level architecture of the Blackwell B200 GPU (1× GPU, 28 vCPUs, 283 GB memory, 180 GB VRAM) showing HBM3e, cache, and Graphics Processing Clusters with SMs and Tensor Cores.
Figure 5. Ultrasound samples of the eight microrobot types, arranged in a four-by-four grid. Each class is shown as an unlabeled image followed by its labeled counterpart; green bounding boxes mark the labeled images, and class names correspond to the first word of each image filename.
Figure 6. Benchmarking pipeline visualization.
Figure 7. Bar plot of AP drop (%) under perturbations for the six detectors, covering blur, brightness variation, and occlusion.
Figure 8. Bar plot of AP drop (%) under speckle noise for the six detectors, at noise variances of 0.02, 0.04, 0.06, 0.08, and 0.10.
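For readers reproducing the robustness tests summarized in Figures 7 and 8, the sketch below shows one way the four perturbations (blur, brightness variation, occlusion, and speckle noise at the listed variances) could be applied to test frames using OpenCV and NumPy. The kernel size, brightness factor, occlusion fraction, and file path are illustrative assumptions, not the exact settings used in the study.

```python
# Minimal sketch of the four perturbations used in the robustness tests
# (Figures 7 and 8). Kernel size, brightness factor, and occlusion geometry
# are illustrative assumptions, not the study's exact settings.
import cv2
import numpy as np

def apply_blur(img: np.ndarray, ksize: int = 7) -> np.ndarray:
    """Gaussian blur with an odd kernel size."""
    return cv2.GaussianBlur(img, (ksize, ksize), 0)

def apply_brightness(img: np.ndarray, factor: float = 1.3) -> np.ndarray:
    """Scale pixel intensities and clip to the valid 8-bit range."""
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def apply_occlusion(img: np.ndarray, frac: float = 0.2) -> np.ndarray:
    """Black out a central square patch covering roughly `frac` of the image."""
    out = img.copy()
    h, w = out.shape[:2]
    side = int(np.sqrt(frac * h * w))
    y0, x0 = (h - side) // 2, (w - side) // 2
    out[y0:y0 + side, x0:x0 + side] = 0
    return out

def apply_speckle(img: np.ndarray, variance: float = 0.02) -> np.ndarray:
    """Multiplicative speckle noise: I' = I * (1 + n), n ~ N(0, variance)."""
    noise = np.random.normal(0.0, np.sqrt(variance), img.shape[:2])
    if img.ndim == 3:
        noise = noise[..., None]
    noisy = img.astype(np.float32) * (1.0 + noise)
    return np.clip(noisy, 0, 255).astype(np.uint8)

# Example: generate the speckle severities reported in Figure 8
# ("us_frame.png" is a hypothetical ultrasound frame).
frame = cv2.imread("us_frame.png")
perturbed = {v: apply_speckle(frame, v) for v in (0.02, 0.04, 0.06, 0.08, 0.10)}
```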
Figure 9. Feature contribution (%) for six detectors.
Figure 10. Computational efficiency of ConvNeXt, Res2NeXt-101, ResNeSt-269, U-Net, YOLO11, and YOLO12, measured in terms of latency, FPS, parameter count, and model size.
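As a companion to Figure 10, the following is a minimal sketch of how latency, FPS, and parameter count can be measured for a PyTorch detector. The warm-up length, run count, input shape, and the stand-in model are assumptions; the study's actual benchmarking harness may differ.

```python
# Minimal sketch of the efficiency measurements in Figure 10 (latency, FPS,
# parameter count) for a generic PyTorch detector; the model and input shape
# are placeholders, not the study's exact benchmarking setup.
import time
import torch

def benchmark(model: torch.nn.Module, input_shape=(1, 3, 1080, 1920),
              warmup: int = 10, runs: int = 100, device: str = "cuda"):
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    with torch.no_grad():
        for _ in range(warmup):              # warm-up to stabilise GPU clocks
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):                # timed inference passes
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    latency_ms = (time.perf_counter() - start) / runs * 1000
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    return {"latency_ms": latency_ms, "fps": 1000 / latency_ms, "params_M": params_m}

# Example with a torchvision backbone as a stand-in detector:
# from torchvision.models import convnext_tiny
# print(benchmark(convnext_tiny()))
```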
Table 1. Uniform training settings.

Parameter | Value/Setting | Purpose
Epochs | 30 | Ensures equal training duration across architectures
Batch Size | 16 | Maintains identical gradient update frequency
Dataloader Workers | 16 | Ensures equal data throughput efficiency
Input Resolution | 1080 × 1920 (rectangular) / YOLO imgsz = 1920 | Same image scale for consistent feature learning
Optimizer | AdamW | Same optimization behavior across networks
Learning Rate | 0.0001 | Uniform learning dynamics
Mixed Precision (AMP) | Enabled | Same precision and GPU acceleration method
Model Compilation | torch.compile / compile = True | Equal computational optimization
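As an illustration of how the uniform settings in Table 1 map onto a concrete training call, the sketch below uses the Ultralytics API for one of the YOLO detectors. The checkpoint name and dataset YAML path are hypothetical placeholders; the CNN baselines (ConvNeXt, Res2NeXt-101, ResNeSt-269, U-Net) would apply the same hyperparameters through their own training loops.

```python
# Minimal sketch of the Table 1 settings applied to a YOLO detector via the
# Ultralytics API; checkpoint and dataset paths are hypothetical placeholders.
from ultralytics import YOLO

model = YOLO("yolo11m.pt")           # pretrained checkpoint (size variant assumed)
model.train(
    data="microrobot_us.yaml",       # hypothetical dataset definition
    epochs=30,                       # equal training duration across architectures
    batch=16,                        # identical gradient update frequency
    workers=16,                      # equal data throughput
    imgsz=1920,                      # 1080 x 1920 rectangular input scale
    rect=True,                       # preserve rectangular aspect ratio
    optimizer="AdamW",
    lr0=1e-4,
    amp=True,                        # mixed-precision training
    # Table 1 also reports compile = True; pass a compile flag here only if the
    # installed Ultralytics version exposes it, otherwise wrap the CNN baselines
    # with torch.compile() directly.
)
```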
Table 2. Distribution of microrobot classes in training and testing sets.

Class Name | Training Count | Testing Count
cube | 2267 | 1222
cylinder | 1596 | 861
flagella | 1956 | 1054
helical | 3155 | 1699
rollingcube | 1255 | 677
sheetrobot | 3010 | 1622
sphere1 | 5900 | 3178
sphere3 | 1988 | 1071
Total | 21,127 | 11,384
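A minimal sketch of how the per-class counts in Table 2 could be tallied, assuming YOLO-format label files (one text file per image, with the class index as the first token of each line). The directory layout and class-index ordering are assumptions about the dataset rather than documented facts.

```python
# Minimal sketch for tallying per-class instance counts as in Table 2,
# assuming YOLO-format labels; paths and class ordering are assumptions.
from collections import Counter
from pathlib import Path

CLASS_NAMES = ["cube", "cylinder", "flagella", "helical",
               "rollingcube", "sheetrobot", "sphere1", "sphere3"]

def count_instances(label_dir: str) -> Counter:
    counts: Counter = Counter()
    for label_file in Path(label_dir).glob("*.txt"):
        for line in label_file.read_text().splitlines():
            if line.strip():
                class_id = int(line.split()[0])   # first token is the class index
                counts[CLASS_NAMES[class_id]] += 1
    return counts

train_counts = count_instances("labels/train")    # hypothetical directory layout
test_counts = count_instances("labels/test")
print(train_counts, test_counts, sep="\n")
```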
Table 3. Comparison of YOLO11, YOLO12, ConvNeXt, Res2NeXt-101, ResNeSt-269, and U-Net.

Detector (Launch Year) | Key Innovation | Strength for Ultrasound Microrobot Detection | Clinical Relevance
YOLO12 (2025) | Attention-centric modules (Area Attention, R-ELAN). | Improves contextual awareness and fine-grained discrimination even against noisy backgrounds. | Supports trustworthy clinical decision-making by maximizing precision in challenging imaging conditions.
YOLO11 (2024) | Improved backbone–neck design with multi-task capability. | Greater adaptability for handling complex microrobot appearances and dynamics in ultrasound. | Supports advanced applications for monitoring and controlling microrobot behavior.
ConvNeXt (2022) | Transformer-inspired CNN architecture with large kernels, depthwise convolutions, and normalization improvements. | Robust hierarchical feature extraction and handling of speckle noise and low contrast in ultrasound. | Provides reliable baseline detection and supports fusion with more advanced detectors.
ResNeSt-269 (2020) | Split-attention blocks for channel-wise feature recalibration. | Enhances robustness to background clutter in ultrasound by selectively emphasizing informative channels. | Supports precise localization of microrobots amid complex tissue structures.
Res2NeXt-101 (2019) | Combines the ResNeXt split-transform-merge strategy with Res2Net multi-scale feature representation. | Captures fine-to-coarse structural details, improving detection of microrobots of varying sizes and shapes. | Enables accurate identification of diverse microrobot morphologies, which is crucial for heterogeneous robotic swarms.
U-Net (2015) | Symmetric encoder–decoder with skip connections for biomedical image segmentation. | Provides strong boundary delineation of microrobots in noisy ultrasound scans. | Widely adopted in medical imaging; facilitates accurate segmentation for monitoring microrobot–tissue interactions.
Table 4. Overall performance metrics (heatmap) of the six detectors.

Model | mAP (50–95) | mAP50 | mAP75 | Precision (Mean) | Recall (Mean) | 95% CI Lower | 95% CI Upper
YOLO12 | 0.617824433 | 0.962943476 | 0.670267625 | 0.93580446 | 0.840348152 | 0.5591 | 0.6126
YOLO11 | 0.613175223 | 0.949897181 | 0.690278055 | 0.962045926 | 0.769022882 | 0.3186 | 0.7536
ConvNeXt | 0.866752148 | 0.98678273 | 0.916006863 | 0.985324123 | 0.995195712 | 0.7955 | 0.8484
ResNeSt-269 | 0.878249645 | 0.988578439 | 0.930365086 | 0.986016259 | 0.995728513 | 0.8219 | 0.8515
Res2NeXt-101 | 0.866416454 | 0.988445342 | 0.919016898 | 0.985829465 | 0.995924184 | 0.8211 | 0.8563
U-Net | 0.7309196 | 0.875163198 | 0.781938016 | 0.7309196 | 0.880745351 | 0.732 | 0.8141
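To make the columns of Table 4 concrete, the sketch below computes an F1-score from mean precision and recall and a bootstrap 95% confidence interval from per-image AP values. The bootstrap procedure is an illustrative assumption; the paper's exact CI estimation method is not restated here, and the AP samples in the example are synthetic.

```python
# Minimal sketch of the Table 4 summary statistics: F1 from mean precision and
# recall, plus a bootstrap 95% CI over per-image AP values. The resampling
# scheme is an illustrative assumption, and the AP samples below are synthetic.
import numpy as np

def f1_score(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

def bootstrap_ci(per_image_ap: np.ndarray, n_boot: int = 1000,
                 alpha: float = 0.05, seed: int = 0):
    rng = np.random.default_rng(seed)
    means = np.array([
        rng.choice(per_image_ap, size=per_image_ap.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    return np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)

# Example with ResNeSt-269's mean precision and recall from Table 4:
print(round(f1_score(0.986016259, 0.995728513), 4))        # ~0.9909
ap_samples = np.random.default_rng(1).uniform(0.8, 0.95, size=500)  # synthetic
print(bootstrap_ci(ap_samples))
```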
Table 5. Class-wise performance metrics heatmap for six microrobot detectors.

Detector | Class Name | mAP (50–95) | Precision | Recall | F1-Score
YOLO12 | cube | 0.806887618 | 0.965999001 | 0.976479533 | 0.971210994
YOLO12 | cylinder | 0.575627952 | 0.913242006 | 0.977932636 | 0.944480898
YOLO12 | flagella | 0.687789202 | 0.923809947 | 0.9971537 | 0.959081665
YOLO12 | helical | 0.7488874 | 0.921603437 | 0.987639788 | 0.953479589
YOLO12 | rollingcube | 0.473179148 | 1 | 0 | 0
YOLO12 | sheetrobot | 0.389423904 | 0.844249491 | 0.825443706 | 0.834740694
YOLO12 | sphere1 | 0.58550647 | 0.969081873 | 0.966539217 | 0.967808875
YOLO12 | sphere3 | 0.675293773 | 0.948449929 | 0.991596639 | 0.969543491
YOLO11 | cube | 0.754743564 | 0.991617812 | 0.968089958 | 0.97971265
YOLO11 | cylinder | 0.624071655 | 0.937022061 | 0.975609756 | 0.955926652
YOLO11 | flagella | 0.706563798 | 0.988828166 | 0.996204934 | 0.992502843
YOLO11 | helical | 0.77122878 | 0.98956242 | 0.994114185 | 0.99183308
YOLO11 | rollingcube | 0.436169457 | 1 | 0.002394288 | 0.004777139
YOLO11 | sheetrobot | 0.419482015 | 0.859680018 | 0.827373613 | 0.843217488
YOLO11 | sphere1 | 0.577313013 | 0.980651387 | 0.971365639 | 0.975986427
YOLO11 | sphere3 | 0.615829503 | 0.949005542 | 0.417030685 | 0.579434752
ConvNeXt | cube | 0.969044566 | 0.971360382 | 0.999181669 | 0.985074627
ConvNeXt | cylinder | 0.873152435 | 0.988399072 | 0.989547038 | 0.988972722
ConvNeXt | flagella | 0.906403005 | 0.994334278 | 0.999051233 | 0.996687175
ConvNeXt | helical | 0.947784066 | 0.996472663 | 0.997645674 | 0.997058824
ConvNeXt | rollingcube | 0.939757824 | 0.995575221 | 0.99704579 | 0.996309963
ConvNeXt | sheetrobot | 0.858920813 | 0.954896142 | 0.991985203 | 0.97308739
ConvNeXt | sphere1 | 0.588896036 | 0.988976378 | 0.988042794 | 0.988509366
ConvNeXt | sphere3 | 0.850058556 | 0.99257885 | 0.999066293 | 0.995812006
ResNeSt-269 | cube | 0.966920018 | 0.957647059 | 0.999181669 | 0.977973568
ResNeSt-269 | cylinder | 0.877728045 | 0.993039443 | 0.994192799 | 0.993615786
ResNeSt-269 | flagella | 0.919314921 | 0.998104265 | 0.999051233 | 0.998577525
ResNeSt-269 | helical | 0.953000605 | 0.997648442 | 0.998822837 | 0.998235294
ResNeSt-269 | rollingcube | 0.93028307 | 0.99704579 | 0.99704579 | 0.99704579
ResNeSt-269 | sheetrobot | 0.878918886 | 0.95317131 | 0.991368681 | 0.971894832
ResNeSt-269 | sphere1 | 0.650590539 | 0.992407466 | 0.987098804 | 0.989746017
ResNeSt-269 | sphere3 | 0.849241078 | 0.999066293 | 0.999066293 | 0.999066293
Res2NeXt-101 | cube | 0.964771032 | 0.971360382 | 0.999181669 | 0.985074627
Res2NeXt-101 | cylinder | 0.868212938 | 0.987312572 | 0.994192799 | 0.990740741
Res2NeXt-101 | flagella | 0.890216649 | 0.997159091 | 0.999051233 | 0.998104265
Res2NeXt-101 | helical | 0.948049784 | 0.997647059 | 0.998234255 | 0.997940571
Res2NeXt-101 | rollingcube | 0.941604495 | 0.99704579 | 0.99704579 | 0.99704579
Res2NeXt-101 | sheetrobot | 0.861603558 | 0.952690716 | 0.993218249 | 0.972532448
Res2NeXt-101 | sphere1 | 0.609460592 | 0.990834387 | 0.986469478 | 0.988647114
Res2NeXt-101 | sphere3 | 0.847412467 | 0.992585728 | 1 | 0.99627907
U-Net | cube | 0.580823183 | 0.580823183 | 0.580823183 | 0.580823183
U-Net | cylinder | 0.820490539 | 0.820490539 | 0.820490539 | 0.820490539
U-Net | flagella | 0.787256718 | 0.787256718 | 0.787256718 | 0.787256718
U-Net | helical | 0.762020171 | 0.762020171 | 0.762020171 | 0.762020171
U-Net | rollingcube | 0.887949109 | 0.887949109 | 0.887949109 | 0.887949109
U-Net | sheetrobot | 0.647820532 | 0.647820532 | 0.647820532 | 0.647820532
U-Net | sphere1 | 0.573726833 | 0.573726833 | 0.573726833 | 0.573726833
U-Net | sphere3 | 0.787269592 | 0.787269592 | 0.787269592 | 0.787269592