Deep Learning Computer Vision-Based Automated Localization and Positioning of the ATHENA Parallel Surgical Robot

Covaciu, Florin; Gherman, Bogdan; Al Hajjar, Nadim; Zima, Ionut; Popa, Calin; Pusca, Alexandru; Ciocan, Andra; Vaida, Calin; Iordan, Anca-Elena; Tucan, Paul; Chablat, Damien; Pisla, Doina

doi:10.3390/electronics15020474


by Florin Covaciu^1,2, Bogdan Gherman^1,2,*, Nadim Al Hajjar^3, Ionut Zima^1,2, Calin Popa^3, Alexandru Pusca^1,2, Andra Ciocan^3, Calin Vaida^1,2, Anca-Elena Iordan^2,4, Paul Tucan^1,2, Damien Chablat^1,5 and Doina Pisla^1,2,6

^1 Research Center for Industrial Robots Simulation and Testing (CESTER), Technical University of Cluj-Napoca, 400114 Cluj-Napoca, Romania
^2 European University of Technology, European Union
^3 Department of Surgery, “Iuliu Hatieganu” University of Medicine and Pharmacy, 400347 Cluj-Napoca, Romania
^4 Department of Computer Science, Technical University of Cluj-Napoca, 400027 Cluj-Napoca, Romania
^5 Laboratory of Numerical Sciences in Nantes, Joint Research Unit 6004, Centre National de la Recherche Scientifique (CNRS), École Centrale Nantes, Nantes Université, F-44000 Nantes, France
^6 Technical Sciences Academy of Romania, B-dul Dacia, 26, 030167 Bucharest, Romania
^* Author to whom correspondence should be addressed.

Electronics 2026, 15(2), 474; https://doi.org/10.3390/electronics15020474

Submission received: 3 December 2025 / Revised: 13 January 2026 / Accepted: 20 January 2026 / Published: 22 January 2026

(This article belongs to the Special Issue Pushing Boundaries: Innovations in Robotics, Artificial Intelligence, and Extended Reality)


Abstract

Manual alignment between the trocar, surgical instrument, and robot during minimally invasive surgery (MIS) can be time-consuming and error-prone, and many existing systems do not provide autonomous localization and pose estimation. This paper presents an artificial intelligence (AI)-assisted, vision-guided framework for automated localization and positioning of the ATHENA parallel surgical robot. The proposed approach combines an Intel RealSense RGB–depth (RGB-D) camera with a You Only Look Once version 11 (YOLO11) object detection model to estimate the 3D spatial coordinates of key surgical components in real time. The estimated coordinates are streamed over Transmission Control Protocol/Internet Protocol (TCP/IP) to a programmable logic controller (PLC) using Modbus/TCP, enabling closed-loop robot positioning for automated docking. Experimental validation in a controlled setup designed to replicate key intraoperative constraints demonstrated submillimeter positioning accuracy (≤0.8 mm), an average end-to-end latency of 67 ms, and a 42% reduction in setup time compared with manual alignment, while remaining robust under variable lighting. These results indicate that the proposed perception-to-control pipeline is a practical step toward reliable autonomous robotic docking in MIS workflows.

Keywords:

parallel surgical robot; vision-guided control; autonomous trocar docking; markerless 3D localization; RGB-D perception

1. Introduction

Robotic-assisted surgery has become a cornerstone of modern minimally invasive medicine, offering superior dexterity, reduced trauma, and faster postoperative recovery compared with conventional laparoscopic procedures [1,2,3]. Systems such as the da Vinci platform have demonstrated remarkable capabilities; however, many essential phases of intraoperative workflow—particularly those involving the manual alignment between the trocar, laparoscopic instrument, and robotic manipulator—remain strongly operator-dependent. This reliance on manual setup introduces variability, increases surgical preparation time, and may reduce reproducibility in high-precision clinical tasks [4,5,6].

Within the operating room (OR), surgical teams must coordinate the positioning of imaging devices, robotic arms, and surgical instruments in a constrained workspace. Misalignment between the trocar and robotic end effector can lead to additional adjustments, increased cognitive load, and prolonged instrument docking time. These factors have direct ergonomic and safety implications: repeated manual repositioning increases surgeon fatigue, while inaccuracies in trocar–instrument alignment may compromise the delicate tissue manipulation required in complex minimally invasive procedures such as pancreatic or hepatobiliary surgery [7]. Consequently, automated methods that streamline OR workflow and reduce manual intervention have the potential to substantially improve both procedural efficiency and patient safety.

Traditional alignment methods in minimally invasive surgery often rely on visual estimation, direct manual manipulation of robotic arms, or mechanical positioning guides. Such methods have several limitations:

Variability—manual estimation is prone to inter-operator and intra-operator inconsistencies [8];
Lack of real-time feedback—mechanical guides cannot adapt to changes in anatomical configuration or trocar motion;
Sensitivity to user experience—novice surgeons exhibit higher alignment times and greater deviations from the optimal remote center of motion (RCM) trajectory [9].

Moreover, conventional registration techniques—such as surface-based alignment or calibration-frame matching—are difficult to apply intraoperatively due to sterility requirements, limited access, and the dynamic nature of soft tissues.

Several alternative localization modalities have been proposed to address these challenges:

Fiducial-based registration using optical or radiopaque markers provides high accuracy, but it can complicate sterility management and workflow in the operating room because fiducials must be physically introduced and maintained within the sterile field—either attached to instruments, mounted on sterile adapters, or placed near the patient. This introduces additional handling steps (placement, sterile fixation, verification of visibility, and calibration/registration) and may require single-use sterile mounts or repeated re-sterilization of marker holders. During surgery, fiducials can become occluded or contaminated by drapes, hands, instrument motion, or blood/fluid, which can force repositioning or re-registration and interrupt the workflow. For these reasons, markerless approaches are attractive when the goal is to minimize added hardware and procedural steps, while still enabling reliable localization [10].
Electromagnetic tracking systems eliminate line-of-sight constraints but suffer from field distortions caused by surrounding metallic OR equipment, reducing reliability in laparoscopic environments [11].
Optical tracking systems (stereo cameras or infrared reflectors) offer high precision, yet their integration requires additional hardware, careful calibration, and unobstructed fields-of-view—conditions not always achievable in crowded OR setups [12].
Robot encoding-based registration, relying solely on internal kinematics, lacks external environmental awareness and cannot autonomously compensate for trocar displacement or manual instrument repositioning.
Markerless vision-based registration using geometric cues (e.g., stereo/RGB-D point-cloud alignment or iterative closest point (ICP)-type fitting) avoids physical markers and can be accurate when surfaces are well observed; however, performance can degrade with sparse/low-texture geometry, specular/reflective surfaces, depth dropouts, and partial occlusions, and typically requires careful calibration and sufficient overlap between views [13].
Learning-based markerless localization/pose estimation (e.g., convolutional networks for keypoint detection or 6-DoF pose regression) reduces reliance on handcrafted features and can generalize across viewpoints, but usually demands larger and more diverse datasets to avoid overfitting and may be sensitive to domain shift and OR artifacts (blood/fluid, smoke, motion blur, harsh shadows), requiring targeted augmentation and external validation [14].

These limitations highlight the need for a non-invasive, real-time, camera-based intraoperative localization strategy that avoids additional markers, simplifies workflow, and supports seamless integration into existing OR environments.

Unlike open surgery, minimally invasive procedures rely entirely on trocar ports that constrain instrument motion to pivot around a fixed RCM. Any deviation between the robot trajectory and trocar axis leads to increased friction, tissue stress, or compromised precision during dissection and suturing. A camera-based system capable of autonomously identifying the trocar, the surgical instrument, and the robotic mechanism offers several advantages:

Non-contact measurement—avoids sterile field contamination and eliminates the need for fiducial attachments;
Real-time adaptation—updates positioning dynamically as instruments move;
Improved safety—minimizes excessive force at the trocar site;
Optimized ergonomics—reduces surgeon effort by automating repetitive alignment actions;
Enhanced reproducibility—reduces operator-dependent variability in instrument docking.

Deep learning detectors, particularly the YOLO family, have demonstrated robust performance in real-time recognition of surgical instruments, anatomical features, and operative states [15,16], making them suitable candidates for addressing the alignment challenge in minimally invasive robotic docking.

The present study addresses the challenge of real-time, intraoperative, non-invasive localization and positioning of the ATHENA surgical parallel robot relative to the trocar and laparoscopic instrument. The goal is to enable automatic docking without requiring fiducials, calibration frames, or manual fine adjustments. The system must operate near the patient, under variable lighting conditions, and in a cluttered OR environment, while ensuring good accuracy compatible with delicate laparoscopic tasks.

This work proposes an AI-assisted, camera-based framework that integrates Intel RealSense 3D sensing with a YOLO11 deep learning model, enabling autonomous localization of key surgical components. The 3D spatial coordinates extracted in real time are transferred via Modbus/TCP to a PLC, which controls the ATHENA robot for precise closed-loop alignment.

The key contributions of this paper are as follows:

A novel markerless, vision-based method for surgical robot localization using YOLO11 and RealSense 3D sensing.
A complete AI-to-PLC workflow, enabling real-time coordinate extraction, communication, and autonomous closed-loop motion.
A validated automatic docking framework achieving a submillimeter positioning accuracy (≤0.8 mm) and a significantly reduced alignment time (−42%).
An integrated OR-oriented system architecture, designed for multi-patient operation, real-time responsiveness, and enhanced surgical safety.

The paper is structured as follows: following the introduction, Section 2 reviews the related literature. Section 3 describes the ATHENA system, detailing its architecture and core functionalities, and then outlines the implementation process and the integration of the system within the application context. Section 4 presents experimental results that demonstrate the system’s performance. Finally, Section 5 offers concluding remarks and identifies potential directions for future research.

2. Related Works

The literature presents numerous robotic systems used in surgical practice, developed to improve surgical precision and to reduce surgical risks. These technologies have made significant progress and have been successfully implemented in various specialties such as urology, gynecology, and general surgery. Previous studies have demonstrated that robotic surgical techniques can improve on traditional laparoscopy by increasing the precision and maneuverability of instruments, providing three-dimensional control, and reducing hand tremor, which facilitates smoother and safer interventions. At the same time, robotic technologies help to overcome the limitations of conventional laparoscopy, including the lack of tactile feedback and visualization difficulties, and are thus recognized as essential elements to increase the efficiency and safety of minimally invasive procedures [17,18].

A relevant contribution in this direction is presented by Tucan et al., who designed and developed a robotic system dedicated to image-guided brachytherapy in non-resectable liver tumors. Their work highlights the importance of robotic precision in percutaneous interventions and introduces a compact and accurate robotic structure capable of needle insertion with submillimeter accuracy, supported by real-time imaging and experimental validation [19]. Similarly, spherical parallel robotic architectures have been developed to achieve high precision and safety in controlled human-interaction tasks, leveraging their compact structure and advanced motion capabilities. Comparative evaluations have shown that such parallel systems can deliver superior performance in tasks requiring fine control and stability, outperforming traditional mechanisms in terms of accuracy and adaptability [20,21].

The use of artificial intelligence in the operating room was long limited because traditional machine learning methods struggled to interpret medical images accurately. Conventional techniques could not provide sufficiently accurate visual recognition to support real-time surgical decisions. However, with the emergence and development of deep learning, computer vision has evolved significantly, enabling accurate identification of objects in images and recognition of patterns in visual data. This evolution has led to a considerable increase in the use of computer vision in the analysis of intraoperative videos, contributing to the automatic recognition of the surgical steps, anatomy, instruments used, and movements performed during procedures. Furthermore, these advances provide valuable support for surgical training, operator performance evaluation, and the development of autonomous or semi-autonomous systems in surgery [22].

Ward et al. analyze the current uses of AI-based computer vision in laparoscopic surgery, highlighting applications such as automatic instrument detection, the recognition of surgical steps, and the evaluation of surgeons’ performance. These technologies contribute to increasing the accuracy and safety of minimally invasive procedures; the authors also discuss the challenges of integrating them into clinical practice [23]. Gumbs et al. provide an analysis of recent advances in computer vision that enable accurate object recognition, anatomical segmentation, and understanding of the surgical context. These technologies facilitate increasingly autonomous surgical actions and contribute to the development of robotic systems capable of performing tasks with minimal human intervention [24].

Luongo et al. propose a deep learning model that analyzes video sequences to automatically recognize suturing gestures in robot-assisted surgery. The results show high accuracy in gesture classification, highlighting the system’s potential to support performance evaluation and training in robotic surgery [25]. Zang et al. present the development and evaluation of AI models for recognizing surgical phases in inguinal hernia repair surgeries, including the creation of a confirmatory benchmark to validate the accuracy of identifying the stages of surgery. The study explores competitive models for optimizing automatic recognition performance, highlighting the potential of AI in supporting surgical interventions and improving operative workflow management [26]. Another study [27] proposes an enhanced surgical instrument recognition method using an optimized YOLOv5 model with architectural and preprocessing improvements. Tests on surgical datasets demonstrate superior performance, underscoring its potential to improve safety and automation in computer-assisted interventions. Jearanai et al. present the development of a deep learning model to guide the safe insertion of optical trocars in minimally invasive surgery. The model analyzes endoscopic images in real time to identify anatomical structures and prevent injuries caused by incorrect insertions. The system has been validated on clinical data and has shown promising accuracy in detecting critical moments of the procedure. This approach has the potential to improve patient safety and support intraoperative decisions [28]. Recent research by Rus et al. introduces an artificial intelligence-based hazard detection system for robotic-assisted single-incision oncologic surgery. The system is designed to identify risky surgical situations in real time, improving intraoperative safety. This work exemplifies the synergy between AI and robotics in surgical environments, underlining the role of predictive modeling and hazard classification in avoiding complications and enhancing surgical outcomes [29].

In addition to the robotic platforms and detection techniques discussed previously, an essential research area for enabling automatic positioning is visual servoing and camera-based robot positioning in medical robotics. Visual servoing—particularly pose-based and image-based approaches—has been widely used in medical applications to achieve precise guidance, compensate for intraoperative errors, and adapt to dynamic surgical environments [30]. Vision-based robotic control enables closed-loop motion using image features extracted from intraoperative video streams, providing high positioning accuracy in tasks such as biopsy, autonomous suturing, and safe manipulation of minimally invasive surgical instruments [31].

Another closely related research direction involves surgical scene understanding, which focuses on interpreting the intraoperative environment through instrument recognition, tissue segmentation, and surgical workflow modeling. Deep learning-based architectures, including convolutional neural networks and transformer models, have enabled robust detection of instruments, patient motion tracking, and 3D scene reconstruction, supporting the development of autonomous or semi-autonomous systems in surgery [32]. Within this context, markerless localization of surgical tools and anatomical structures has become an important research domain, offering a non-invasive alternative to marker-based or fiducial-based methods that are often difficult to integrate into clinical workflows and may interfere with sterile field management [33].

Modern techniques in pose estimation and 3D reconstruction have further accelerated progress in this field. These methods are used to estimate the position and orientation of surgical tools, reconstruct anatomical surfaces, and calibrate spatial relationships between the patient, imaging devices, and robotic systems. Advanced registration frameworks—including perspective-n-point (PnP), iterative closest point (ICP), model-based matching, and probabilistic registration—enable robust 2D–3D alignment under variable and dynamic surgical conditions [34].

A fundamental aspect highlighted in the recent literature is the concept of the sensing and perception pipeline, which defines the processing modules responsible for acquiring, preprocessing, and interpreting visual information. Typical systems include stages such as camera calibration (intrinsic, extrinsic, or hand–eye when required), lens distortion correction, noise filtering, region of interest (ROI) selection, and normalization of image characteristics [35]. Depending on the intended application, full calibration may be mandatory (e.g., precision 3D reconstruction) or optional (e.g., detection with deep neural networks), but consistently contributes to reducing projection errors and improving overall accuracy.

Scene preprocessing also involves steps such as undistortion, filtering, background suppression, and illumination correction, which are crucial in minimally invasive surgery, where visual conditions are affected by specular reflections, smoke, fluids, and limited working space. Target detection and segmentation may use artificial markers, anatomical landmarks, surgical tools, or deep learning-based models such as U-Net, Mask-RCNN, or 3D segmentation networks [36].

The 2D–3D pose estimation and registration stage varies depending on the input data:

For RGB images, PnP, triangulation, and geometric modeling are commonly applied;
For point clouds, ICP or probabilistic filters are typically used;
When multimodal data are available, fusion strategies become necessary.

These operations rely heavily on sensor quality and camera configuration.

Depth information is often obtained from stereo cameras or RGB-D sensors such as RealSense, Kinect, or ZED, which provide direct 3D point measurements. However, RGB-D fusion still requires depth noise filtering, removal of invalid points, and accurate alignment of color and depth channels [37]. Another critical aspect addressed in recent works is uncertainty modeling: stochastic factors such as illumination changes, reflections, instrument motion, camera vibrations, variations in robot positioning, or a limited field of view significantly affect the accuracy of pose estimates. Covariance-based uncertainty propagation has been proposed to quantify these errors and define confidence intervals for robot positioning decisions [38]. Finally, the applicability of vision-based systems in real surgical environments strongly depends on real-time performance requirements. Algorithms must operate within strict latency constraints (typically between 10 and 70 ms) to support dynamic control of surgical instruments without compromising patient safety. Modern GPU-optimized neural networks and architectures such as YOLO, U-Net, and transformer-based detectors have enabled real-time performance in practical surgical scenarios [39].
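The depth-handling steps mentioned above (noise filtering and removal of invalid points) can be illustrated with a minimal Python sketch. It takes the median of the valid depth samples inside a detection box and back-projects the box center with an undistorted pinhole model. The intrinsics (fx, fy, cx, cy), the valid depth range, and all function names are illustrative assumptions and are not taken from the cited works or from any camera SDK.

```python
from statistics import median


def robust_depth(depth_m, box, min_valid=0.1, max_valid=3.0):
    """Median of the valid depth samples (meters) inside box = (x0, y0, x1, y1).

    depth_m is a row-major 2D list; zeros model depth dropouts.
    The valid range is a hypothetical choice for this sketch.
    """
    x0, y0, x1, y1 = box
    samples = [
        depth_m[v][u]
        for v in range(y0, y1)
        for u in range(x0, x1)
        if min_valid <= depth_m[v][u] <= max_valid  # drop dropouts and far outliers
    ]
    return median(samples) if samples else None


def deproject(u, v, z, fx, fy, cx, cy):
    """Back-project pixel (u, v) at depth z into camera-frame (X, Y, Z)."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)


def box_to_point(depth_m, box, fx, fy, cx, cy):
    """3D point for the center of a detection box, or None if no valid depth."""
    z = robust_depth(depth_m, box)
    if z is None:
        return None
    x0, y0, x1, y1 = box
    return deproject((x0 + x1) / 2.0, (y0 + y1) / 2.0, z, fx, fy, cx, cy)
```

Using the median rather than the single center pixel is one simple way to tolerate isolated dropouts on reflective instrument surfaces; other filters (temporal or spatial) would be equally valid choices.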

Despite substantial progress, significant gaps remain in the literature. Most studies focus on surgical tool detection or workflow recognition, while few provide a fully integrated vision-based perception pipeline connected to a parallel robotic mechanism for automatic positioning. Moreover, many solutions still rely on artificial markers or complex hardware setups, limiting their usability in real clinical environments. These gaps motivate the need for a markerless, real-time, and robust vision-based system capable of delivering precise 3D coordinates for autonomous robot alignment—challenges that are explicitly addressed by the system proposed in this work.

3. Materials and Methods

3.1. ATHENA System Presentation

3.1.1. General Architecture of the Proposed System

Figure 1 illustrates the architecture of the proposed robotic system, designed to guide and control a surgical tool with AI support. The system begins with a stereoscopic 3D camera [40] mounted near the work area, which captures video of three key elements: the parallel module (PM), the laparoscopic instrument (instr.), and the trocar. The video stream is transmitted via USB to a computer, where real-time analysis is performed using the Ultralytics YOLO11 object detection algorithm.

The software, developed in .NET 8 with a graphical user interface, displays the video stream and highlights the detected elements using bounding boxes, providing the user with a clear view of the surgical instrument’s position. When the corresponding button is pressed, the system saves the 3D coordinates of these elements to an Excel file. These coordinates are subsequently used to automatically position the surgical robot for instrument insertion through the trocar.

The graphical user interface also provides manual robot-control options via the Modbus protocol. The control unit incorporates a PLC that drives the stepper motors and uses proximity sensors at stroke limits to initialize the zero position of each axis. Additionally, the system offers real-time monitoring of the instrument’s position and includes safety features such as an emergency stop and error reset, ensuring safe and reliable operation during surgical procedures.

The detection model identifies three classes simultaneously—trocar, instrument, and PM—and supports multiple instances per frame. The system can be easily extended to additional instruments by incorporating them into the training dataset.

In the context of this work, the following terms refer to specific system components:

PM—the parallel module of the ATHENA robot;
Instr.—the tip of the laparoscopic instrument detected by the camera;
Trocar—the minimally invasive surgical access port, geometrically defined by the axis of its cannula.

3.1.2. The Structure of the ATHENA Surgical Parallel Robot

The ATHENA robot (Figure 2) has been developed considering some of the key standards used in the development of medical devices, such as ISO 14971 (risk management) and ISO 13485 (quality management), together with IEC 60601-1 (basic safety/essential performance), IEC 60601-1-2 (EMC) for medical electrical equipment, IEC 80601-2-77 (robotically assisted surgical equipment), IEC 62304 (software lifecycle), and IEC 62366-1 (usability engineering/human factors) [40,41,42,43,44,45,46,47].

ATHENA has a modular architecture (Figure 2a) consisting of two parallel modules, each with specific roles and characteristics [48,49,50,51]. The first module integrates a passive spherical mechanism (SM), equipped with a remote center of motion (RCM). It constrains the surgical instrument so that it always passes through a fixed point, the RCM, during the procedure. This property is essential for maintaining precision and safety during minimally invasive surgical procedures, as it reduces the risk of injury caused by uncontrolled movements of the instrument.

The passive module provides a total of four degrees of freedom (DOF): three for the orientation of the instrument relative to an OXYZ coordinate system, which coincides with the RCM point, and a fourth degree of freedom for translation along the instrument axis, enabling its insertion and withdrawal. Translation is achieved by replacing one of the spherical mechanism’s revolute joints with a cylindrical joint, denoted C1s, which allows smooth and controlled motion along the longitudinal axis. Such hybrid passive mechanisms with reconfigurable joints are particularly advantageous in surgical applications due to their compact structure and their ability to constrain motion around a fixed anatomical reference point, thereby enhancing safety and dexterity during minimally invasive interventions.

The second module is an active parallel mechanism (PM) with four degrees of freedom, built from three distinct kinematic chains. The first chain includes two active prismatic joints, q₁ and q₂, connected to links l₁ by universal joints U_1p and U_2p, the links being joined together by a passive revolute joint, R_9P. The second parallel kinematic chain uses the same active joints q₁ and q₂, connected to links l₃ through revolute joints R_6p and R_7p, and links l₃ are connected through a passive revolute joint, R_4P. This chain is connected to the third chain by the R_5P revolute joint. The third kinematic chain has a serial configuration and comprises an active revolute joint, q₃, as well as two passive revolute joints, R_3P and R_8P, connected by the l₂ links. The surgical instrument is fixed to this parallel module by means of a universal joint, consisting of two passive revolute joints, R_1P and R_2P, which intersect at point P, the center of the revolute joint R_1P.

Figure 2b presents the experimental model of the ATHENA parallel robot. The actuated joints q₁ and q₂ are driven by two B&R 80MPF5.500D114-01 stepper motors (IP65) with incremental encoders and holding brakes. The third actuator (q₃) is a Nanotec SCA5618 (NEMA 23) stepper motor coupled to a 15:1 speed reducer. For homing/end-stop detection we used rectangular inductive proximity sensors LANBAO LE08SN25DPO (PNP normally open), with a 0–2.5 mm sensing range, 10–30 VDC supply, 100 mA output current, and IP67 protection.

The parallel mechanism controls the position and orientation of the surgical instrument, providing four degrees of freedom. These can be expressed either as the orientation angles ψ, θ, and φ together with the insertion length l_ins, or as the position in space of point E (X_E, Y_E, Z_E) together with the angle φ. The two representations are linked by the following constraint equations, which keep the commanded motion consistent with the mechanical limits of the system and underpin the precise coordination of all robot elements:

\begin{cases} \psi = \operatorname{atan2}(Y_{E}, X_{E}) \\ \theta = \arccos\left(Z_{E} / \sqrt{X_{E}^{2} + Y_{E}^{2} + Z_{E}^{2}}\right) \\ l_{ins} = \sqrt{X_{E}^{2} + Y_{E}^{2} + Z_{E}^{2}} \end{cases}

(1)
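Equation (1) is a Cartesian-to-spherical conversion of the tip position E expressed in the RCM frame. A minimal Python sketch of this conversion and of its inverse is given below; the function names are hypothetical, and the inverse uses the standard spherical convention implied by Equation (1), which is an assumption of this sketch rather than a formula stated in the text.

```python
import math


def pose_from_tip(xe, ye, ze):
    """Equation (1): tip position E in the RCM frame -> (psi, theta, l_ins)."""
    l_ins = math.sqrt(xe * xe + ye * ye + ze * ze)
    if l_ins == 0.0:
        raise ValueError("E coincides with the RCM; the angles are undefined")
    psi = math.atan2(ye, xe)
    # Clamp the ratio to guard against rounding slightly outside [-1, 1].
    theta = math.acos(max(-1.0, min(1.0, ze / l_ins)))
    return psi, theta, l_ins


def tip_from_pose(psi, theta, l_ins):
    """Inverse mapping under the standard spherical convention (assumed)."""
    return (
        l_ins * math.cos(psi) * math.sin(theta),
        l_ins * math.sin(psi) * math.sin(theta),
        l_ins * math.cos(theta),
    )
```

A round trip such as tip_from_pose(*pose_from_tip(30.0, 40.0, 120.0)) should return the original point up to floating-point error, which is a quick consistency check when integrating the two representations.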

In addition to the geometric dimensions of the robot (l1–l4) and the length of the surgical device (l), the coordinates (l₀₁, l₀₂, l₀₃) of the mobile reference system OXYZ relative to the fixed reference system OXYZ must also be known. The position of point P is defined by the following coordinates:

\begin{cases} X_{P} = \cos(\psi) \sin(\theta) (l - l_{ins}) \\ Y_{P} = \sin(\psi) \sin(\theta) (l - l_{ins}) \\ Z_{P} = \cos(\theta) (l - l_{ins}) \end{cases}

(2)

The relationships between the input and output variables of the ATHENA robot are expressed by Equation (3), which forms the basis for calculating the inverse kinematics for the active joints q₁, q₂, q₃, and q₄.

\begin{cases} f_{1} : Y_{P} - (q_{1} + q_{2}) / 2 = 0 \\ f_{2} : \left(l_{4} + \frac{\sqrt{4 l_{1}^{2} - (q_{2} - q_{1})^{2}}}{2}\right)^{2} - Z_{P}^{2} - X_{P}^{2} = 0 \\ f_{3} : (q_{3} + l_{2\min})^{2} - (X_{P} - \delta_{1} + \delta_{2} + l_{5})^{2} - (Z_{P} - \delta_{3})^{2} = 0 \end{cases}

(3)

where

\delta_{1} = \frac{l_{4} X_{P}}{\sqrt{X_{P}^{2} + Z_{P}^{2}}}, \quad \delta_{2} = \frac{\sqrt{4 l_{3}^{2} - (q_{2} - q_{1})^{2}}}{2}, \quad \delta_{3} = \frac{l_{4} Z_{P}}{\sqrt{X_{P}^{2} + Z_{P}^{2}}}

(4)
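Equations (3) and (4) can be solved in closed form for q₁, q₂, and q₃, which the following Python sketch illustrates. It rests on several assumptions that are not stated in the text: f₁ is read as Y_P − (q₁ + q₂)/2 = 0 (Y_P is the midpoint of the two prismatic joints), the branch q₂ ≥ q₁ is selected, the positive root is taken for q₃, and the link lengths used in any call are hypothetical placeholders rather than the real ATHENA dimensions.

```python
import math


def athena_inverse_kinematics(xp, yp, zp, l1, l2_min, l3, l4, l5):
    """Closed-form solution of f1..f3 for (q1, q2, q3) on the q2 >= q1 branch."""
    rho = math.hypot(xp, zp)  # distance of P from the Y axis
    half_gap_sq = l1 ** 2 - (rho - l4) ** 2  # from f2, with (q2 - q1)/2 as unknown
    if rho < l4 or half_gap_sq < 0.0:
        raise ValueError("P is outside the workspace defined by f2")
    half_gap = math.sqrt(half_gap_sq)
    q1 = yp - half_gap  # from f1: (q1 + q2)/2 = Y_P
    q2 = yp + half_gap
    inner = 4.0 * l3 ** 2 - (q2 - q1) ** 2
    if inner < 0.0:
        raise ValueError("delta_2 is undefined for this joint separation")
    d1 = l4 * xp / rho
    d2 = math.sqrt(inner) / 2.0
    d3 = l4 * zp / rho
    q3 = math.hypot(xp - d1 + d2 + l5, zp - d3) - l2_min  # from f3, positive root
    return q1, q2, q3


def residuals(q1, q2, q3, xp, yp, zp, l1, l2_min, l3, l4, l5):
    """Evaluate f1, f2, f3; all three should vanish for a valid solution."""
    rho = math.hypot(xp, zp)
    d1 = l4 * xp / rho
    d2 = math.sqrt(4.0 * l3 ** 2 - (q2 - q1) ** 2) / 2.0
    d3 = l4 * zp / rho
    f1 = yp - (q1 + q2) / 2.0
    f2 = (l4 + math.sqrt(4.0 * l1 ** 2 - (q2 - q1) ** 2) / 2.0) ** 2 - zp ** 2 - xp ** 2
    f3 = (q3 + l2_min) ** 2 - (xp - d1 + d2 + l5) ** 2 - (zp - d3) ** 2
    return f1, f2, f3
```

Substituting the returned joint values back into residuals is a convenient way to check a candidate solution before it is sent to the controller; a numerical solver would be needed if a different branch or additional constraints had to be handled.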

3.1.3. Design of the Proposed Distributed Software System

Based on the architecture shown in Figure 1, a distributed software application for controlling the surgical robot was designed according to the UML deployment diagram [52] illustrated in Figure 3. The diagram reflects the structure of the distributed system, which consists of two main nodes: a personal computer (running Windows 11) and a PLC (running an RTOS), connected via Modbus. These nodes collaborate to train and use an artificial intelligence model for robot control.

The first hardware node is represented by a computer running on the Windows 11 operating system. Inside there are three nodes containing the software components and files necessary for the development and inference process:

The node representing the Python scripts includes two scripts developed in Python 3.11.13 using the YOLO11 framework: a script dedicated to the training process, which generates a file with the trained model, and a script responsible for performing classifications.
The node corresponding to the C# application runs on the .NET 8 platform and is organized according to a Model–View–Presenter architecture [53]. The application communicates with the Python scripts via the TCP/IP protocol.
The data storage node contains files essential for the functioning of the artificial intelligence algorithms and for interaction with the user.
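The text does not specify the wire format used on the TCP/IP link between the Python scripts and the C# application. As a hedged illustration only, the Python sketch below assumes newline-delimited JSON, with one line per frame mapping each class label to its (x, y, z) coordinates; the function names and the message layout are hypothetical.

```python
import json
import socket


def encode_detections(detections):
    """Serialize {label: (x, y, z)} as one newline-terminated UTF-8 JSON line."""
    payload = {label: [round(c, 4) for c in xyz] for label, xyz in detections.items()}
    return (json.dumps(payload, sort_keys=True) + "\n").encode("utf-8")


def read_message(sock, buffer=b""):
    """Block until one full line is available.

    Returns (decoded dict, leftover bytes); pass the leftover bytes back in
    on the next call, because TCP is a byte stream without message boundaries.
    """
    while b"\n" not in buffer:
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("peer closed before a full message arrived")
        buffer += chunk
    line, _, rest = buffer.partition(b"\n")
    return json.loads(line.decode("utf-8")), rest
```

Explicit framing (here, the newline delimiter) matters because a single recv call may return part of a message or several messages at once. A length-prefixed binary format would work equally well and is the more common choice when throughput is critical.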

The second hardware node is a PLC running a real-time operating system (RTOS) and is responsible for direct control of the robotic equipment. The dedicated software is represented by the RobotControl.exe file, implemented and configured in the B&R Automation Studio environment.

The deployment diagram presented provides a clear and structured representation of how the software components are distributed and interact in a hybrid system for AI-assisted surgical robotic control. The proposed architecture provides a clear separation between the user interface level, AI-based processing, and physical control of the equipment.

The graphical user interface of the application, developed in C# and built on WinForms technology, adopts a Model–View–Presenter (MVP) architecture because it separates responsibilities efficiently and facilitates component testability. The UML class diagram [54], developed based on the MVP architecture and presented in Figure 4, is organized into three main packages: Model, Presenter, and View.

The Model package defines two classes responsible for managing interaction with external components. The AI_PythonModel class uses the System.Net.Sockets library to communicate via TCP/IP with the Python script dedicated to the classification process. The ModbusPLC class, designed for communication with the PLC, uses the EasyModbus library, implementing the Modbus industrial protocol for real-time data exchange.

The Presenter package includes presentation logic and acts as a mediator between the graphical user interface and the application model. Within this package, two classes and two interfaces are defined. The IArtificialIntelligenceGUI and IRobotControlGUI interfaces define the contracts that the graphical interface components must comply with to be used by the classes in the Presenter package. The View package includes two classes that define the graphical interface of the application, implemented using the System.Windows.Forms library. These classes implement the interfaces defined in the Presenter package, thus enabling dependency inversion and high testability.
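The testability benefit of this arrangement can be sketched in a few lines. The application itself is written in C#; the Python sketch below only mirrors the pattern. The interface name `IRobotControlGUI` comes from the text, while its methods, the `FakePLC` stand-in for the ModbusPLC class, and the `RecordingView` test double are hypothetical.

```python
from typing import Protocol

class IRobotControlGUI(Protocol):
    """Contract the view must satisfy (name from the text; methods are assumed)."""
    def show_status(self, ok: bool) -> None: ...
    def show_joint_values(self, q: list) -> None: ...

class FakePLC:
    """Stand-in for the ModbusPLC model class; returns fixed values."""
    def read_joint_values(self) -> list:
        return [10.0, 20.0, 30.0]

    def is_healthy(self) -> bool:
        return True

class RobotControlPresenter:
    """Mediates between the model and any view implementing the interface."""
    def __init__(self, view: IRobotControlGUI, model: FakePLC) -> None:
        self.view = view
        self.model = model

    def refresh(self) -> None:
        self.view.show_joint_values(self.model.read_joint_values())
        self.view.show_status(self.model.is_healthy())

class RecordingView:
    """Test double that records calls instead of drawing widgets."""
    def __init__(self) -> None:
        self.calls = []

    def show_status(self, ok: bool) -> None:
        self.calls.append(("status", ok))

    def show_joint_values(self, q: list) -> None:
        self.calls.append(("q", q))

view = RecordingView()
RobotControlPresenter(view, FakePLC()).refresh()
```

Because the presenter depends only on the interface, the same presentation logic can be exercised without a window or a PLC connection.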

3.2. Implementation and Integration

3.2.1. Development of Software Applications Using .NET 8

The surgical robot system is controlled and monitored through a software application developed in the C# 12 programming language on the .NET 8 platform, using the Visual Studio 2022 Preview development environment. The application includes a graphical user interface with two main options: “Artificial Intelligence” and “Robot Control”.

The “Artificial Intelligence” graphical user interface option (Figure 5) contains the following control elements:

Pressing the “Connect/Disconnect” button (Figure 5, 1) establishes the connection between the C# application and the Python script using the TCP/IP protocol;
The “Video Start/Stop” button (Figure 5, 2) is used to start the 3D stereoscopic camera;
The “Screenshot” button (Figure 5, 3) saves the images used to train the YOLO11 network and the “Saved images” field (Figure 5, 4) indicates their total number;
Pressing the “Start/Stop Detect” button (Figure 5, 5) activates or deactivates the object detection process with the 3D stereoscopic camera;
By pressing the “Save Coords” button (Figure 5, 6), the coordinates of the detected objects are saved in an Excel file;
By pressing the “Automatic Positioning” button (Figure 5, 7), the surgical robot automatically positions itself so that the surgical instrument can be inserted through the trocar;
By using the “Object 1” and “Object 2” sliders (Figure 5, 8), the user can extend the ends of the line connecting the two detected objects (trocar and instr.);
To correctly position the surgical instrument so that it enters through the trocar, three elements are detected using the 3D stereoscopic camera and then two lines are drawn: one between the trocar and the instrument (Figure 5, 9) and another between the trocar and the PM (Figure 5, 10);
By pressing the “Exit” button (Figure 5, 11), the interface can be closed.

The “Robot Control” graphical user interface option (Figure 6) contains the following elements:

Modbus Connect—pressing this button connects the user interface to the PLC;
Robot Q Values—displays the positions of the robot’s active joints relative to its coordinate system;
Remote center of motion—displays the RCM (remote center of motion) position of the spherical mechanism relative to its coordinate system;
Tool Center Point—displays the position of the end effector in space relative to the robot’s coordinate system;
Speed—slider used to adjust the speed at which the end effector moves;
Acceleration—slider used to adjust the maximum permissible acceleration;
Robot Status—indicates the status of the robot using green and red colors (green indicates normal operation, while red indicates an error or malfunction);
Power on—pressing this button powers the actuators;
Homing—button that initiates the referencing procedure;
Reset—this button resets the active error;
Emergency Stop—button that triggers the robot shutdown procedure;
Control—allows the user to choose between controlling the robot or the active tool; the choice between “Haptic” and “SpaceMouse” selects the peripheral device with which the user controls the robot’s positioning.

3.2.2. Development of the Learning Model

In the development process of the learning model, six main stages were followed, as illustrated in Figure 7: data acquisition, data preparation and annotation, intelligent model configuration, model training and validation, model performance evaluation, and comparison of the proposed model with previous versions. The initial stage of the process focuses on data acquisition, where images are captured using a 3D stereoscopic camera, generating RGB frames that include objects of interest. The next stage consists of manually annotating the images with a YOLO-compatible annotation tool, by defining the bounding box coordinates corresponding to the three object classes.

The third stage involves the selection of the detection model (YOLO11m), the identification of earlier model versions (YOLO8m, YOLO9m, and YOLO10m) for comparative analysis, as well as the determination of optimal hyperparameter configurations. The fourth stage involves the training and validation of the four YOLO model versions using a 5-fold cross-validation procedure. In the next stage, the four versions are evaluated using four validation metrics: mAP, precision, recall, and F1-score. The final stage involves comparing the models, emphasizing that the initially selected model (YOLO11m) stands out as the most performant.

In recent years, the YOLO model series has become a leading framework in object detection, consistently advancing the balance between lightweight design and high accuracy. Based on the studies [55,56], among the existing versions of the YOLO model, YOLO11 [56] was chosen. It is an advanced convolutional neural network (CNN) architecture, optimized for real-time object detection. The model introduces several significant improvements over previous versions of the YOLO family, renowned for its superior performance in identifying objects in images and videos by integrating efficient architectural blocks into the three main components of the network: backbone, neck, and head (Figure 8).

The backbone component, responsible for feature extraction, consists of eleven blocks: five CBS blocks, four C3k2 blocks, one SPPF block, and one C2PSA block. The CBS block is composed of a sequential arrangement of three fundamental components: a 2D convolutional layer, a batch normalization layer, and a sigmoid linear unit (SiLU) activation function. The C3k2 block is an efficient extension of the cross-stage partial architecture: it divides the input channels into parallel branches, processes one of them through a series of bottleneck or C3k modules depending on the configuration, and then concatenates the results and applies a final convolution. This provides faster feature extraction with fewer parameters than the C2f block of the YOLO8 version, which it replaces. The SPPF (Spatial Pyramid Pooling-Fast) block facilitates the aggregation of information at different scales and ensures sensitivity to objects of varying sizes. In the final stage of the backbone component, a C2PSA (Convolutional + Parallel Spatial Attention) block enables precise object localization through parallel spatial attention mechanisms [57,58].

The neck component plays a central role in aggregating and propagating the features extracted from the backbone to the head of the network, where the actual object detection takes place. It is responsible for combining and recalibrating feature maps at different resolution levels so that information relevant to the detection of small, medium, and large objects is preserved and integrated efficiently. The YOLO11 neck consists of a total of twelve blocks: two CBS blocks, four C3k2 blocks, two Upsample blocks, and four Concat blocks. The role of the Upsample blocks, which increase the resolution of feature maps, is to facilitate the detection of smaller objects and to ensure dimensional alignment between layers. The Concat blocks concatenate feature maps from different branches, combining information extracted from various levels and optimizing the final representation for the head.

Optimized for fast inference, YOLO11 [59] is capable of simultaneously detecting and classifying multiple objects in a single image, maintaining high accuracy even under variable lighting conditions or from different viewing angles. The model integrates advanced techniques such as an anchor-free detection head and modern regularization methods, which contribute to more accurate localization and better generalization, including for small or hard-to-distinguish objects. In addition, YOLO11 works efficiently on a wide range of hardware platforms, from high-performance servers to mobile devices with limited resources. With this balanced combination of speed, accuracy, and flexibility, YOLO11 stands out as a modern and efficient solution for applications such as video surveillance, robotics, autonomous vehicles, and other intelligent visual detection systems.

Among the five variants of the YOLO11 model, based on the experimental results presented in [60,61], the YOLO11m variant was selected, characterized by a total of 20.1 million parameters. This choice reflects an optimal compromise between feature representation capacity, object detection accuracy, and computational efficiency.

To train and validate the intelligent models implemented using the YOLO library, 1000 images were used. The images were captured using a 3D stereoscopic camera and contain objects relevant to robotic applications. The labeling process was performed manually, using a YOLO-compatible annotation tool to accurately mark the positions of the detected objects. The goal of the training was to obtain a network capable of robustly detecting three classes of objects (trocar, SM, and PM) under different lighting conditions and viewing angles. Using an initial split of the dataset into 800 training images and 200 validation images, the YOLO11m model hyperparameters were fine-tuned through a grid search approach. The optimal configuration (Table 1) employed the Adam optimizer, an initial learning rate of 0.0015, a batch size of 8, and 900 training epochs.
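As an illustration, the four hyperparameters named above map onto training arguments of the Ultralytics Python package as sketched below. This is a sketch under assumptions rather than the authors' training script: the dataset YAML path is a hypothetical placeholder, the pretrained weight file name `yolo11m.pt` is assumed, and any Table 1 settings not mentioned in the text are left at the library defaults.

```python
# Hyperparameters reported in the text; other settings are left at library defaults
TRAIN_CONFIG = {
    "epochs": 900,        # training epochs
    "batch": 8,           # batch size
    "lr0": 0.0015,        # initial learning rate
    "optimizer": "Adam",  # optimizer selected by the grid search
}

def train(data_yaml, weights="yolo11m.pt"):
    """Launch training; requires the `ultralytics` package and a dataset YAML file.

    `data_yaml` is a hypothetical path to a dataset description listing the
    training/validation image folders and the three class names.
    """
    from ultralytics import YOLO  # imported lazily so the config can be inspected without it

    model = YOLO(weights)
    return model.train(data=data_yaml, **TRAIN_CONFIG)
```

A call such as `train("athena_dataset.yaml")` would start training, where the file name is again only an example.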

After determining the optimal hyperparameter values, the training and validation sets are generated using a five-fold cross-validation procedure. The entire dataset is initially divided into five mutually exclusive subsets, each containing 200 images. In each iteration, one subset is used as the validation set, while the remaining four serve as the training set. The final result is obtained by averaging the performance across the five rounds. Figure 9 illustrates the structure of the five-fold cross-validation process.
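The fold construction described above can be sketched with the standard library alone. The shuffling seed and the function name are assumptions; the text does not state how images were assigned to folds.

```python
import random

def five_fold_splits(n_images=1000, k=5, seed=0):
    """Yield (train_indices, val_indices) for k mutually exclusive folds."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)  # assumed: random assignment to folds
    fold_size = n_images // k
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    for i in range(k):
        val = folds[i]
        # The remaining k-1 folds form the training set
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

splits = list(five_fold_splits())
```

With 1000 images and five folds, each round trains on 800 images and validates on 200, and every image appears in a validation set exactly once.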

After the training is complete, the model performance is evaluated based on the values obtained for various metrics. The precision metric indicates the accuracy [62] of the model’s positive predictions, reflecting the proportion of objects correctly detected. In the case of object detection with YOLO, this metric is calculated based on the IoU (Intersection over Union) value, which measures the degree of overlap between the predicted and actual box. The recall measures the proportion of real objects that the model correctly detected. In the context of YOLO, this metric is the average recall value calculated across all classes and all images in the validation set. The mean average precision (mAP) metric is an overall assessment of the performance of the detection model, integrating the information provided by precision and recall. It indicates how efficiently the model is able to detect and correctly locate all of the objects in the images, regardless of class. The F1-score metric [63] expresses the balance between the accuracy of correct detections (precision) and the model’s ability to identify all relevant objects (recall). It is calculated as the harmonic mean between precision and recall, providing an overview of the model’s performance in object detection.
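The relationship between IoU, precision, recall, and F1-score can be made concrete with a short sketch. The box format `(x1, y1, x2, y2)` and the example counts are assumptions chosen for illustration; they are not values from the experiments.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Two 10 x 10 boxes overlapping by half: intersection 50, union 150
overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
# Hypothetical counts: 90 true positives, 10 false positives, 30 missed objects
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

A detection counts as a true positive only when its IoU with a ground-truth box exceeds a chosen threshold, which is how IoU enters the precision and recall counts.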

Table 2 presents the values of the four metrics obtained during the training and validation process for the four YOLO variants, calculated for each of the five iterations corresponding to the k-fold cross-validation procedure. The table also includes the average metric values for each YOLO variant used, along with the corresponding standard deviations (SD). For the YOLO11m model, it can be observed that after 900 training epochs, the average metric values stabilize at very high levels: precision around 0.9878, recall around 0.9849, and mAP reaching 0.9946. These results highlight the model’s exceptional performance and remarkable stability. The high precision indicates that the model produces very few false detections, with almost all identified objects being correct. At the same time, the recall value, close to 1, shows that the model successfully detects nearly all real objects, with a minimal number of omissions (false negatives). The high mAP value confirms that the model predictions are almost perfectly aligned with the true positions of objects in the images, demonstrating strong localization and classification capabilities. The minor fluctuations observed are normal and do not significantly affect the overall performance.

The comparative analysis of the performance of the YOLO11m, YOLO10m, YOLO9m, and YOLO8m models, conducted using a rigorous 5-fold cross-validation protocol, reveals consistent differences in both accuracy and metric stability. The YOLO11m model consistently stands out as the best performer, achieving near-perfect values for mAP, precision, recall, and F1-score, accompanied by extremely low standard deviations. This indicates not only a high detection capability but also superior robustness to variations in the training and validation subsets. The graphical comparison of these four models is presented in Figure 10.

In contrast, the YOLO10m, YOLO9m, and YOLO8m models exhibit progressive decreases in performance, reflected both in the average metric values and in the higher standard deviations. These variations indicate a greater sensitivity to the data structure and reduced stability of the learning process. Although YOLO10m maintains an acceptable performance level, the YOLO9m and YOLO8m models show substantial fluctuations, particularly in mAP and recall, which limit their applicability in scenarios that require high precision.

The proposed YOLO11 model addresses a task-specific, geometry-constrained robotic localization problem, rather than generic intraoperative scene understanding. The detection task involves only three rigid object classes with fixed geometry and limited variability, for which large-scale datasets are not required. A dataset of 1000 RGB-D images was collected to explicitly cover the relevant operational conditions, including variations in illumination, camera angle, working distance, and relative object pose. Overfitting was monitored during training, with training and validation losses remaining closely aligned over 900 epochs, indicating stable convergence. High validation performance was consistently observed. Most importantly, robustness was confirmed at the physical system level through comparison with an independent OptiTrack ground-truth system, yielding submillimeter positional accuracy during autonomous robot positioning.

Overall, the results confirm that YOLO11m represents the optimal solution, offering a superior balance of accuracy, consistency, and generalization capability. The stability of the metrics across the five iterations, combined with the very low standard deviations, indicates that the set of 1000 images used in the experiment is sufficient to obtain a reliable evaluation of the model performance.

3.2.3. System Integration in the ATHENA Robot

As illustrated in Figure 3, the 3D camera provides the coordinates of the points trocar, PM, and instr. belonging to the patient, robot, and instrument, respectively. All coordinates are in the fixed coordinate system belonging to the camera. Using the forward kinematic model, the PM point coordinates in the fixed coordinate system attached to the robot are as follows:

\begin{cases} X_{PMR} = X_{P} - (l_{4} - l_{6}) \frac{X_{P}}{\sqrt{X_{P}^{2} + Z_{P}^{2}}} \\ Y_{PMR} = Y_{P} \\ Z_{PMR} = Z_{P} - (l_{4} - l_{6}) \frac{Z_{P}}{\sqrt{X_{P}^{2} + Z_{P}^{2}}} \end{cases}

(5)

Therefore, the coordinates of the camera in the robot’s fixed coordinate system are

\begin{cases} X_{CAM} = X_{PMR} + X_{PM} \\ Y_{CAM} = Y_{PMR} + Y_{PM} \\ Z_{CAM} = Z_{PMR} + Z_{PM} \end{cases}

(6)

from which all other coordinates of the points provided by the 3D camera can be obtained. Thus, the coordinates of the instr. point belonging to the instrument in the fixed coordinate system of the robot are

\begin{cases} X_{INSTRR} = X_{CAM} + X_{INSTR} \\ Y_{INSTRR} = Y_{CAM} + Y_{INSTR} \\ Z_{INSTRR} = Z_{CAM} + Z_{INSTR} \end{cases}

(7)

and the trocar coordinates in the same fixed coordinate system are

\begin{cases} X_{TROCARR} = X_{CAM} + X_{TROCAR} \\ Y_{TROCARR} = Y_{CAM} + Y_{TROCAR} \\ Z_{TROCARR} = Z_{CAM} + Z_{TROCAR} \end{cases}

(8)
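The chain of Equations (5)–(8) can be sketched as a small Python routine. Several assumptions apply: each axis is combined with its own component (X with X, Y with Y, Z with Z), the Z term of Equation (5) is scaled by Z_P by analogy with δ_3 in Equation (4), the sign convention follows the equations as printed, and the numeric link lengths and point coordinates in the example are hypothetical.

```python
import math

def pm_in_robot_frame(p, l4, l6):
    """Equation (5): PM point in the robot frame from P = (X_P, Y_P, Z_P)."""
    x_p, y_p, z_p = p
    r = math.hypot(x_p, z_p)  # sqrt(X_P^2 + Z_P^2)
    k = (l4 - l6) / r
    return (x_p - k * x_p, y_p, z_p - k * z_p)

def add(a, b):
    """Component-wise sum of two 3D points."""
    return tuple(ai + bi for ai, bi in zip(a, b))

def camera_points_to_robot(p, l4, l6, pm_cam, instr_cam, trocar_cam):
    """Equations (6)-(8): camera origin, instrument, and trocar in the robot frame."""
    pmr = pm_in_robot_frame(p, l4, l6)
    cam = add(pmr, pm_cam)            # Equation (6)
    instr_r = add(cam, instr_cam)     # Equation (7)
    trocar_r = add(cam, trocar_cam)   # Equation (8)
    return cam, instr_r, trocar_r

# Hypothetical values chosen so the arithmetic is easy to follow
cam, instr_r, trocar_r = camera_points_to_robot(
    p=(3.0, 1.0, 4.0), l4=10.0, l6=5.0,
    pm_cam=(1.0, 2.0, 3.0),
    instr_cam=(10.0, 20.0, 30.0),
    trocar_cam=(-1.0, 0.0, 1.0),
)
```

Pure translations like these are valid only when the camera and robot axes are parallel; a rotated camera would additionally need a rotation matrix in each step.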

3.2.4. System Data Flow and Experimental Setup

Data Flow Architecture

To illustrate the integration between perception, AI processing, communication, and robot control, Figure 11 presents the complete data flow of the proposed system. The architecture consists of six sequential stages:

Three-dimensional Camera Acquisition: The Intel RealSense D405 camera captures synchronized RGB and depth streams at 60 fps with a spatial resolution of 1280 × 720 pixels.
AI Inference Layer: The RGB + depth frames are transmitted to a Python 3.11 environment, where the trained YOLO11m model performs real-time object detection of the trocar, instrument, and parallel mechanism (PM). The resulting 3D coordinates of detected bounding boxes are computed by combining image pixel positions with depth information from the RealSense API.
Communication Layer (TCP/IP): The processed data—3D coordinates and class labels—are sent through a bidirectional TCP/IP socket to the C#/.NET 8 graphical interface. This ensures asynchronous real-time data transfer between the AI module and the robot control interface, with average latency measured at 67 ms. This value has been determined by decomposing the perception-to-control loop into camera acquisition, neural inference, post-processing, network communication, and PLC execution. Over N = 300 consecutive cycles, the mean end-to-end delay from RGB-D frame reception on the PC to PLC confirmation that the command was applied was 67.0 ± 3.1 ms, comprising 16.7 ± 1.9 ms for camera acquisition (60 fps operation), 14.7 ± 0.6 ms for YOLO11m inference, 5.4 ± 0.4 ms for 3D coordinate extraction and message formatting, 7.8 ± 1.1 ms for the Modbus/TCP write–ack transaction, and 22.4 ± 2.0 ms for PLC execution until the “applied” flag/cycle counter update was observed, yielding a sum of stage means of 67.0 ms.
Control Interface (C# Application): The interface visualizes detections, logs coordinates, and converts them into robot commands. The coordinate transformation algorithm maps camera-frame coordinates into the robot reference frame.
PLC Motion Control: The control commands are transmitted through the Modbus TCP protocol to the B&R PLC running a real-time operating system (RTOS). The PLC manages stepper motor drives and enforces safety constraints (velocity limits, emergency stop).
Robot Execution: The ATHENA parallel robot performs smooth motion following a trapezoidal velocity and acceleration profile, aligning its flange with the detected surgical instrument.
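The step in the AI inference layer that combines pixel positions with depth can be sketched with the pinhole camera model. This is a simplified illustration, not the RealSense API itself: lens distortion is ignored, the bounding-box center is used as the reference pixel, and the intrinsic parameters (`fx`, `fy`, `cx`, `cy`) in the example are hypothetical rather than the D405 calibration values.

```python
def bbox_center(box):
    """Center pixel (u, v) of a bounding box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0

def deproject(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth in meters to a 3D camera-frame point."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return x, y, depth_m

# Hypothetical intrinsics for a 1280 x 720 image
u, v = bbox_center((700, 300, 740, 340))
point = deproject(u, v, depth_m=0.5, fx=640.0, fy=640.0, cx=640.0, cy=360.0)
```

In practice the depth value at a single pixel can be noisy or missing, so a median over a small window around the center is a common alternative.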

Experimental Setup

Number of 3D Samples: A total of 1200 3D frames were recorded, including 1000 frames used for training and 200 frames for validation and testing. Each frame contained annotations for three object classes: trocar, instrument, and PM.
Lighting Conditions: Experiments were conducted under three illumination levels to simulate realistic operating room variability:
○ Bright (800–1000 lux)—standard laboratory lighting;
○ Moderate (400–600 lux)—typical endoscopic lighting conditions;
○ Low (150–250 lux)—simulated dim operating room scenario.
Camera–Object Distance and Angles: The camera was mounted at 0.6–0.8 m from the robot workspace at adjustable tilt angles of 20°, 30°, and 45° relative to the instrument axis.
Hardware Configuration:
○ CPU: AMD Ryzen 7 9700X (16 cores, 5.7 GHz).
○ GPU: NVIDIA RTX 5080 (16 GB VRAM).
○ RAM: 96 GB DDR5 6400 MHz.
○ Operating System: Windows 11 Pro 64-bit.
○ Storage: 2 TB NVMe SSD.
○ Frame Rate: 60 frames per second.
○ Average Inference Time: 14.7 ms per frame (YOLO11m on GPU).

This configuration allows YOLO11m inference at 14.7 ms per frame, which is shorter than the 16.7 ms frame period at 60 fps and therefore fulfills the real-time requirement.
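The timing figures reported for the perception-to-control loop can be cross-checked with a few lines of arithmetic. The stage means are those given in the data flow description; the variable names are illustrative.

```python
# Mean stage durations in milliseconds, as reported for N = 300 cycles
STAGE_MEANS_MS = {
    "camera acquisition": 16.7,
    "YOLO11m inference": 14.7,
    "3D extraction and formatting": 5.4,
    "Modbus/TCP write-ack": 7.8,
    "PLC execution": 22.4,
}

total_ms = sum(STAGE_MEANS_MS.values())   # end-to-end latency
frame_period_ms = 1000.0 / 60.0           # one frame at 60 fps
inference_fits_frame = STAGE_MEANS_MS["YOLO11m inference"] < frame_period_ms
latency_in_frames = total_ms / frame_period_ms
```

The stage means add up to the reported 67.0 ms, which corresponds to roughly four frame periods of end-to-end delay, while inference alone fits within a single 60 fps frame period.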

4. Experimental Results

The proposed algorithm has been integrated into the ATHENA robot control system. After the surgeon positions the robotic instrument according to the pancreatic surgical task, the user pushes the “Automatic Positioning” button, and the robot brings the flange to the instrument site for easy and safe docking. The motion parameters in this case are v = 10 mm/s and a = 5 mm/s², and a trapezoidal velocity profile was used. Figure 12 presents the motors’ displacement, velocity, and acceleration recorded during automatic positioning of the robot close to the surgical instrument. The motion results come from a simulated scenario under laboratory conditions that resembles a real operating scenario.
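For orientation, the duration of a rest-to-rest move under a trapezoidal velocity profile can be estimated as sketched below, taking 10 mm/s as the velocity limit and 5 mm/s² as the acceleration limit. The function name and the example travel distances are assumptions; the actual profile generation runs on the PLC and is not described in the text.

```python
import math

def trapezoidal_move_time(distance_mm, v_max=10.0, a_max=5.0):
    """Duration (s) of a rest-to-rest move under velocity and acceleration limits."""
    d_ramps = v_max ** 2 / a_max  # distance covered while accelerating plus decelerating
    if distance_mm >= d_ramps:
        # Cruise phase exists: two ramps of v_max / a_max each plus constant-speed travel
        return distance_mm / v_max + v_max / a_max
    # Short move: triangular profile, v_max is never reached
    return 2.0 * math.sqrt(distance_mm / a_max)

t_long = trapezoidal_move_time(50.0)   # 50 mm travel (hypothetical)
t_short = trapezoidal_move_time(5.0)   # 5 mm travel (hypothetical)
```

With these limits the robot needs 2 s and 10 mm to reach full speed, so moves shorter than 20 mm never reach the cruise phase.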

Figure 13 shows two images captured during the simulation of the instrument attachment in laboratory conditions. Thus, the surgeon positions the robotic instrument within the operating field according to the surgical task, close to the pancreas, with the position decided visually and using an endoscope (Figure 13a). Once the position is decided, the algorithm “reads” the instrument’s coordinates and the ATHENA robot is repositioned in the vicinity of these coordinates automatically (Figure 13b).

The current proof-of-concept relies on a single Intel RealSense RGB-D camera to minimize additional OR hardware, calibration burden, and setup time and to keep the system compact and easy to integrate as a first end-to-end demonstration of the perception-to-control pipeline. This design choice reduces the footprint and simplifies registration between the sensing frame and the robot/PLC coordinate frames, but it can be affected by depth noise, specular reflections, partial occlusions, and field contamination (e.g., blood/fluid droplets), which may reduce detection confidence and depth reliability. To mitigate this, the perception-to-control coupling is conservative: motion updates are applied only when detections remain stable across consecutive frames and depth values are valid for the detected regions; otherwise, the system holds the last safe target (or pauses autonomous updates). Independent of perception, the PLC layer enforces safety constraints (e.g., velocity limits and emergency stop). Future work will extend the system with sensor redundancy (e.g., multi-view RGB-D or complementary sensing) and explicit sensor-health monitoring to improve fault tolerance under operating-room edge cases.
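The conservative coupling described above can be sketched as a small gate placed between perception and motion commands. This is an illustrative sketch rather than the implemented logic: the class name, the number of required consecutive frames, and the maximum allowed jump between frames are hypothetical parameters.

```python
import math

class StableTargetGate:
    """Release a new target only after enough consecutive, consistent, valid detections."""

    def __init__(self, n_required=5, max_jump_mm=2.0):
        self.n_required = n_required    # assumed number of consecutive stable frames
        self.max_jump_mm = max_jump_mm  # assumed tolerance between consecutive frames
        self.streak = 0
        self.previous = None
        self.last_safe_target = None

    def update(self, point_mm):
        """point_mm is (x, y, z) in mm, or None when detection or depth is invalid."""
        valid = (
            point_mm is not None
            and all(math.isfinite(c) for c in point_mm)
            and point_mm[2] > 0
        )
        if not valid:
            # Invalid frame: reset the streak and hold the last safe target
            self.streak = 0
            self.previous = None
            return self.last_safe_target
        if self.previous is not None and math.dist(point_mm, self.previous) <= self.max_jump_mm:
            self.streak += 1
        else:
            self.streak = 1
        self.previous = point_mm
        if self.streak >= self.n_required:
            self.last_safe_target = point_mm
        return self.last_safe_target

gate = StableTargetGate(n_required=3, max_jump_mm=1.0)
outputs = [
    gate.update((0.0, 0.0, 100.0)),
    gate.update((0.1, 0.0, 100.0)),
    gate.update((0.2, 0.0, 100.0)),   # third stable frame: target released
    gate.update(None),                # invalid depth: hold last safe target
    gate.update((50.0, 0.0, 100.0)),  # sudden jump: still hold last safe target
]
```

The gate returns `None` until a first stable target exists, which a caller could treat as "do not move".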

To validate the proposed algorithm, an experimental plan was devised to compare the positioning of the robot PM near the instrument against a high-precision reference system, OptiTrack [64]. The experimental setup consists of the ATHENA robot, the OptiTrack motion capture system, and a human torso trainer. The main steps in performing the experiments are shown in Figure 14. The Prime 41 cameras of the OptiTrack system need to be placed in the vicinity of the ATHENA robot, allowing enough space for system calibration using the calibration wand (Figure 15(1)). The three rigid bodies targeted by the robot camera have been defined for OptiTrack using three sets of four markers each: for the trocar (RB_TROCAR), for the instrument body (RB_INSTR), and for the robot flange (RB_PM) (Figure 15(2)). Using Equations (5)–(8), the robot base has been calibrated within the RealSense camera frame. The same robot base needs to be calibrated within the OptiTrack world frame. This was performed using the markers attached to the robot flange and by recording eight poses distributed within the robot workspace. The PM position was computed first using the robot forward kinematics and then measured within the OptiTrack frame (RB_PM). A best-fit transform {}^{O}T_{B} between the two point sets was computed using the Umeyama/Horn method [64] and validated through RMSE.

The experimental conditions included three lighting levels (bright/medium/low), three camera tilt angles (20°, 30°, 45°), and three distances between camera and workspace (0.5 m, 0.6 m, and 0.7 m). For each object (RB_INSTR, RB_PM) the angular position error was computed as follows:

θ_{i} = \cos^{-1} \left( {\hat{u}}_{i}^{RealSense} \cdot {\hat{u}}_{i}^{OptiTrack} \right)

(9)

where {\hat{u}}_{i}^{RealSense} and {\hat{u}}_{i}^{OptiTrack} represent unit axis vectors of the instrument estimated by the RealSense pipeline and the OptiTrack system, respectively, both expressed in the robot base frame. The histogram of the angular errors θ_{i} is presented in Figure 16. Most samples are concentrated at small angles, while only a few cases exhibit larger deviations. This concentration near zero indicates good agreement between the RealSense-based estimation and the OptiTrack ground truth and confirms that the proposed localization method can reliably recover the orientation of the instrument shaft for the considered workspace and operating conditions. Over the 48 trials, the angular error between the instrument axis estimated by RealSense and the OptiTrack reference was 0.82° ± 0.31° (mean ± standard deviation, SD), with an RMSE of 0.88° and a maximum error of 1.76°.
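The angular error of Equation (9) is the arccosine of the dot product of the two unit axis vectors. A minimal sketch follows; the function name and the example vectors are illustrative, and the inputs are normalized internally so that non-unit vectors are also handled.

```python
import math

def angular_error_deg(u, v):
    """Angle in degrees between two 3D direction vectors (normalized internally)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cosine))

# Example: a reference axis and the same axis tilted by one degree about Y
axis_ref = (0.0, 0.0, 1.0)
axis_est = (math.sin(math.radians(1.0)), 0.0, math.cos(math.radians(1.0)))
theta = angular_error_deg(axis_est, axis_ref)
```

Clamping the cosine to the interval [-1, 1] avoids a domain error when rounding pushes a nearly parallel pair slightly outside that range.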

In addition to orientation, the translational accuracy of the proposed method has been evaluated. For each trial, the Euclidean distance (Equation (10)) between the instrument tip position estimated from the RealSense camera and the corresponding OptiTrack position was computed in the robot base frame. The resulting position error was 0.54 ± 0.19 mm (mean ± SD), with an RMSE of 0.57 mm and a maximum of 0.80 mm over all trials.

d_{i} = ‖P_{i}^{RealSense} - P_{i}^{OptiTrack}‖

(10)
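The summary statistics reported for the position and angular errors can be computed as sketched below. The sketch also uses the identity RMSE² = mean² + SD², which holds exactly for the population standard deviation, as a consistency check on the reported values. Whether the population or the sample SD was used is not stated in the text, so this is an assumption; the example points are hypothetical.

```python
import math

def summarize(errors):
    """Return (mean, population SD, RMSE, maximum) of a list of error values."""
    n = len(errors)
    mean = sum(errors) / n
    sd = math.sqrt(sum((e - mean) ** 2 for e in errors) / n)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return mean, sd, rmse, max(errors)

def rmse_from_mean_sd(mean, sd):
    """RMSE implied by a mean and a population SD."""
    return math.sqrt(mean ** 2 + sd ** 2)

# Equation (10) for one hypothetical pair of points (mm)
d = math.dist((10.0, 20.0, 30.0), (10.3, 20.4, 30.0))

# Consistency check on the reported statistics
rmse_position_mm = rmse_from_mean_sd(0.54, 0.19)
rmse_angle_deg = rmse_from_mean_sd(0.82, 0.31)
```

Rounded to two decimals, the implied values are 0.57 mm and 0.88°, which agree with the reported RMSE figures.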

All evaluations were performed in a controlled laboratory environment designed to closely replicate the spatial configuration and operational constraints of the intraoperative field. The setup emulated realistic trocar–instrument geometry, working distances, and illumination conditions, enabling consistent and clinically relevant assessment of detection accuracy and autonomous robot positioning. While the proposed perception module remained stable under the range of lighting variations evaluated in our laboratory setup, the operating room can present more extreme conditions such as under/over-exposure, strong directional shadows, specular highlights from wet tissue/metal, partial occlusions by staff, and contamination of the field-of-view (e.g., blood or fluid droplets). These factors may reduce detection confidence and depth reliability and therefore could degrade localization accuracy. In future work, we will extend the dataset and validation protocol to explicitly include such edge cases and quantify performance degradation (precision/recall/mAP and failure rates), and we will implement conservative safety gating (confidence thresholds and temporal consistency checks) to prevent motion commands from being issued when visual uncertainty is high.

The YOLO11 model was trained and evaluated on a dataset collected in a single laboratory environment, which limits environmental variability and may lead to optimistic performance estimates. While we use a held-out split and data augmentation to reduce overfitting, generalization to unseen settings (different rooms, lighting, occlusions, and background clutter) has not yet been quantified. Future work will include multi-site or multi-environment data acquisition and cross-domain evaluation (e.g., training in one setup and testing in a different setup), accompanied by robustness metrics and failure-case analysis prior to clinical translation.

The proposed platform currently functions as a fully integrated proof-of-concept system, combining 3D sensing, AI-based perception, coordinate transformation, and PLC-driven robot control. The experimental platform used a high-end GPU workstation to accelerate model training and iterative development; however, this is not a requirement for the final deployed device. Training is an offline process and can be performed on a dedicated development machine, while real-time inference can be executed on a smaller and cheaper computer unit. For example, the GPU used in the prototype (GeForce RTX 5080; NVIDIA, Santa Clara, CA, USA) has an MSRP starting at USD 999 (with retail pricing varying by market/availability), whereas embedded edge-AI alternatives such as the Jetson Orin NX 16GB are commercially available at approximately EUR 791.80 for the module (excluding carrier board and enclosure). The RGB-D sensor is also a commodity component; for the Intel RealSense D405, prices vary significantly by channel (e.g., around EUR 335 at some EU retailers versus higher marketplace pricing from distribution channels). In addition, the PLC/industrial control interface can be implemented with widely available compact PLCs; as a representative example, the Siemens S7-1200 CPU 1212C is listed with an RRP around EUR 288 (configurations vary). Overall, excluding the mechanical structure and the robot itself, an indicative cost for the perception-and-control electronics (camera + compute + PLC interface + basic integration hardware) is of the order of a few thousand euros, depending on the selected compute platform (workstation GPU vs. edge AI module), enclosure/medical-grade integration, and production scale.

5. Conclusions and Future Work

In this work, we presented an end-to-end, vision-guided pipeline for autonomous alignment and docking in a minimally invasive surgery context, combining RGB-D sensing, real-time deep learning-based detection, and a closed-loop positioning strategy integrated with industrial control (PLC communication). The proposed approach supports reliable 3D localization of the relevant elements in the scene and converts perception outputs into actionable motion commands, enabling consistent and repeatable alignment while maintaining real-time behavior suitable for intraoperative workflows. Experimental results in a controlled setup demonstrate that the system can achieve submillimeter positioning accuracy together with real-time performance, showing the feasibility of using a compact depth camera and a lightweight detection backbone as the perception core of an automated docking module. Overall, the contribution is not only the detection component, but the complete perception-to-control integration that enables practical autonomous positioning without relying on external tracking infrastructure.

Although the proposed system achieved submillimeter positioning accuracy and real-time performance, the evaluation was performed in a controlled laboratory setup designed to replicate key intraoperative constraints; therefore, the current platform should be regarded as a fully integrated proof-of-concept. Future work will include on-site assessment in a hospital environment to verify compatibility with real operating-room workflows. This assessment will focus on practical aspects that cannot be fully reproduced in the laboratory, such as the available space around the patient, sterility and draping requirements, occlusions caused by staff and equipment, camera placement constraints, and integration with existing clinical routines. Based on surgeon and OR-staff feedback, we will refine the mechanical design (compactness, mounting strategy, drape/sterilization strategy, cable management) and enhance software reliability (confidence-based gating, fault handling, user interaction, and safety monitoring), followed by broader clinical validation.

Author Contributions

Conceptualization, B.G., F.C. and C.V.; methodology, F.C., A.-E.I. and B.G.; software, F.C., A.-E.I. and I.Z.; validation, N.A.H., A.C. and C.P.; formal analysis, D.C. and A.P.; investigation, F.C., A.-E.I., B.G. and C.V.; resources, N.A.H.; data curation, C.P. and A.C.; writing—original draft preparation, F.C., A.-E.I. and B.G.; writing—review and editing, D.P. and P.T.; visualization, P.T.; supervision, N.A.H. and D.C.; project administration, D.P.; funding acquisition, D.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the project “Romanian Hub for Artificial Intelligence-HRIA”, Smart Growth, Digitization and Financial Instruments Program, MySMIS no. 351416.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. The vision-guided control flow for autonomous trocar docking of the ATHENA robot.

Figure 2. The ATHENA parallel robot: (a) Kinematic structure. (b) Experimental model.

Figure 3. UML deployment diagram.

Figure 4. UML class diagram corresponding to C# application.

Figure 5. Graphical user interface—artificial intelligence.

Figure 6. Graphical user interface—robot control.

Figure 7. Processing flow of the intelligent object detection module.

Figure 8. Model architecture of YOLO11.

Figure 9. Graphical representation of 5-fold cross-validation.

Figure 10. Comparing average metrics values.

Figure 11. Complete data flow from 3D camera acquisition to robot motion control.

Figure 12. Time history diagram for motor displacement (in green), velocity (in magenta), and acceleration (in red) of the ATHENA robot.

Figure 13. Experimental setting: (a) Manual position of the surgical instrument. (b) Instrument attachment to the robot after automatic positioning.

Figure 14. Stepwise representation of the experimental validation of the proposed methodology.

Figure 15. Experimental validation setup using the OptiTrack motion capture system.

Figure 16. Histogram of the angular error between the instrument axis estimated by the proposed RealSense-based vision pipeline and the corresponding ground-truth axis measured with the OptiTrack motion-capture system.

Table 1. Optimized hyperparameters.

Hyperparameter	Search Domain	Used Value
epochs	{500, 600, 700, 800, 900, 1000}	900
batch	{8, 16, 32}	8
optimizer	{“Adam”, “AdamW”}	“Adam”
initial learning rate	{0.001, 0.0015, 0.002, 0.0025, 0.003}	0.0015
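To give a sense of the size of the search space in Table 1, the domains can be enumerated directly. This is a sketch under an explicit assumption: exhaustive enumeration is used here purely for illustration and is not claimed to be the authors' search procedure. The dictionary keys are illustrative names, not the identifiers of any particular training library.

```python
from itertools import product

# Search domains as listed in Table 1.
search_domain = {
    "epochs": [500, 600, 700, 800, 900, 1000],
    "batch": [8, 16, 32],
    "optimizer": ["Adam", "AdamW"],
    "initial_learning_rate": [0.001, 0.0015, 0.002, 0.0025, 0.003],
}

# Values reported as used in Table 1.
selected = {
    "epochs": 900,
    "batch": 8,
    "optimizer": "Adam",
    "initial_learning_rate": 0.0015,
}

# Exhaustive enumeration of every combination (illustrative assumption).
names = list(search_domain)
candidates = [dict(zip(names, combo)) for combo in product(*search_domain.values())]

print(len(candidates))         # 180 combinations (6 * 3 * 2 * 5)
print(selected in candidates)  # True: the used values lie inside the domain
```

With 180 combinations and up to 1000 epochs each, a full sweep would be expensive, which is one reason such searches are usually run offline on a development workstation.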

Table 2. Metric values.

Model	Metric	Iteration 1	Iteration 2	Iteration 3	Iteration 4	Iteration 5	Average	SD
YOLO11m	mAP	0.99500	0.99421	0.99500	0.99500	0.99421	0.99468	0.00043
YOLO11m	Precision	0.99194	0.98288	0.97851	0.99542	0.99048	0.98785	0.00694
YOLO11m	Recall	0.99445	0.98333	0.97220	0.99692	0.97760	0.98490	0.01064
YOLO11m	F1-score	0.99319	0.98310	0.97534	0.99617	0.98400	0.98636	0.00837
YOLO10m	mAP	0.94313	0.92652	0.94823	0.92714	0.91788	0.93258	0.01264
YOLO10m	Precision	0.88011	0.90442	0.86409	0.87306	0.81917	0.86817	0.03122
YOLO10m	Recall	0.93333	0.87833	0.91862	0.90113	0.93661	0.91360	0.02420
YOLO10m	F1-score	0.90594	0.89118	0.89052	0.88687	0.87396	0.88970	0.01143
YOLO9m	mAP	0.84884	0.85594	0.81359	0.82295	0.80324	0.82891	0.02268
YOLO9m	Precision	0.78451	0.76561	0.74764	0.83242	0.80768	0.78757	0.03355
YOLO9m	Recall	0.76988	0.84440	0.85007	0.80436	0.86667	0.82708	0.03932
YOLO9m	F1-score	0.77713	0.80308	0.79557	0.81815	0.83614	0.80601	0.02240
YOLO8m	mAP	0.82057	0.76631	0.71485	0.76554	0.75876	0.76521	0.03756
YOLO8m	Precision	0.75652	0.69234	0.77212	0.76396	0.77687	0.75236	0.03444
YOLO8m	Recall	0.78333	0.71561	0.75759	0.78333	0.81667	0.77131	0.03754
YOLO8m	F1-score	0.76969	0.70378	0.76479	0.77352	0.79627	0.76161	0.03451
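The derived columns of Table 2 can be reproduced from the per-iteration values. The sketch below does this for the YOLO11m rows, assuming the F1-score is the harmonic mean of precision and recall and that SD denotes the sample standard deviation (n − 1 denominator). Both are assumptions, although the numbers agree with them. Variable and function names are illustrative.

```python
from statistics import mean, stdev

# YOLO11m per-iteration values copied from Table 2.
precision = [0.99194, 0.98288, 0.97851, 0.99542, 0.99048]
recall = [0.99445, 0.98333, 0.97220, 0.99692, 0.97760]
f1_reported = [0.99319, 0.98310, 0.97534, 0.99617, 0.98400]
map_scores = [0.99500, 0.99421, 0.99500, 0.99500, 0.99421]


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Recompute F1 for each iteration and compare with the reported value.
f1_computed = [f1_score(p, r) for p, r in zip(precision, recall)]
for i, (calc, rep) in enumerate(zip(f1_computed, f1_reported), start=1):
    print(f"Iteration {i}: computed F1 = {calc:.5f}, reported = {rep:.5f}")

# Average and sample standard deviation of mAP across the five iterations.
map_mean = mean(map_scores)
map_sd = stdev(map_scores)  # sample SD, divides by n - 1
print(f"mAP average = {map_mean:.5f}, SD = {map_sd:.5f}")
```

Under these assumptions, the recomputed F1 values agree with the reported ones to five decimal places, and the mAP average and SD come out as 0.99468 and 0.00043, matching the table.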
