Article

Deep Learning Strategies for Semantic Segmentation in Robot-Assisted Radical Prostatectomy

by Elena Sibilano 1, Claudia Delprete 1, Pietro Maria Marvulli 1, Antonio Brunetti 1, Francescomaria Marino 1, Giuseppe Lucarelli 2,3, Michele Battaglia 2,4 and Vitoantonio Bevilacqua 1,5,*
1 Department of Electrical and Information Engineering, Polytechnic University of Bari, Via Orabona 4, 70126 Bari, Italy
2 Urology, Andrology and Kidney Transplantation Unit, Department of Precision and Regenerative Medicine and Ionian Area, University of Bari “Aldo Moro”, 70121 Bari, Italy
3 Urology Unit, I.R.C.C.S. Istituto Tumori “Giovanni Paolo II”, 70124 Bari, Italy
4 Urology Unit, Mater Dei Hospital, 70124 Bari, Italy
5 The BioRobotics Institute, Scuola Superiore Sant’Anna, 56025 Pontedera, Italy
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(19), 10665; https://doi.org/10.3390/app151910665
Submission received: 5 September 2025 / Revised: 25 September 2025 / Accepted: 29 September 2025 / Published: 2 October 2025

Abstract

Robot-assisted radical prostatectomy (RARP) has become the most prevalent treatment for patients with organ-confined prostate cancer. Despite superior outcomes, a suboptimal vesicourethral anastomosis (VUA) may lead to serious complications, including urinary leakage, prolonged catheterization, and extended hospitalization. Fine-grained assessment of this task requires precise localization of both the surgical needle and the surrounding vesical and urethral tissues to be coapted. Nonetheless, identifying anatomical structures in endoscopic videos is difficult due to tissue distortion, changes in brightness, and instrument interference. In this paper, we propose and compare two Deep Learning (DL) pipelines for the automatic segmentation of the mucosal layers and the suturing needle in real RARP videos, exploiting different architectures and training strategies. To train the models, we introduce a novel annotated dataset collected from four VUA procedures. Experimental results show that the nnU-Net 2D model achieved the highest class-specific metrics, with a Dice Score of 0.663 for the mucosa class and 0.866 for the needle class, outperforming both transformer-based and baseline convolutional approaches on external validation video sequences. This work paves the way for computer-assisted tools that can objectively evaluate surgical performance during the critical suturing phase.

1. Introduction

Prostate cancer is the second most common malignant neoplasm and the fifth leading cause of cancer-related mortality among men worldwide [1]. For patients with organ-confined disease, radical prostatectomy (RP), which involves the complete removal of the prostate gland and seminal vesicles with or without pelvic lymphadenectomy, is the standard surgical treatment [2].
In recent years, robot-assisted radical prostatectomy (RARP) has emerged as the preferred approach due to its superior perioperative outcomes [3]. Compared to open or laparoscopic RP, RARP has been linked to reduced blood loss, fewer complications, improved early continence and sexual function, lower rates of positive surgical margins, shorter hospital stays, and quicker recovery [4]. Despite these benefits, the increased complexity of the procedure requires surgeons to acquire and maintain highly refined technical skills.
Among the most technically demanding steps of RARP is vesicourethral anastomosis (VUA), which requires the creation of a watertight and tension-free suture with minimal tissue trauma to ensure proper healing [5]. In addition to bladder neck dissection and neurovascular bundle preservation, VUA is a key contributor to postoperative success. Even with robotic assistance, suboptimal anastomosis can result in serious complications, including urinary leakage, prolonged catheterization, incontinence, and bladder neck sclerosis [6]. Notably, overall complication rates for RARP remain as high as 17.8%, even after the learning curve has been overcome [7,8]. Given the crucial role of suturing quality in determining surgical outcomes of RARP, objective and fine-grained assessment of suturing performance has become a central focus in robotic surgical training [9]. In this context, traditional methods for evaluating surgical competence, such as the number of procedures performed or the total time spent on the console, often fail to provide reliable feedback on actual performance [10].
More structured evaluation tools, including the Objective Structured Assessment of Technical Skills (OSATS) [11], the Global Operative Assessment of Laparoscopic Skills (GOALS) [12], the Multiple Objective Measures of Skill (MOMS) [13], and the Robotic-Assisted Surgery Competency Evaluation (RACE) [10], have been widely adopted to overcome these issues. However, these protocols depend heavily on expert human assessment, making them time-consuming, labor-intensive, and inherently subjective. For instance, only moderate inter-rater reliability has been reported for these established scales, with coefficients often in the 0.4–0.7 range depending on the task and rater training [14]. This is coupled with further burden when assessments rely on manual scoring and checklists, and with increased time and cost for expert surgical supervision [15].
This underscores the need for automated, objective tools for evaluating surgical performance efficiently and reliably, enhancing training quality, and ultimately improving patient outcomes.
Intelligent systems based on Deep Learning (DL) techniques can represent valuable tools to support skill assessment of training surgeons [16]. By learning discriminative features from complex, high-dimensional data, such as surgical videos or kinematic motion logs, these systems have demonstrated high reliability in recognizing surgical gestures, phases, and even predicting surgical skill levels [17,18].
In particular, image-based techniques have shown significant success in surgical data analysis, enabling accurate object detection (e.g., robotic instruments [19]), task segmentation (e.g., procedural step identification [20]), and action recognition (e.g., suturing gestures [21]).
These methods mainly leverage DL architectures such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), which have recently become the state of the art for medical image analysis, allowing a deeper understanding of the surgical scene. However, training such models requires hundreds of annotated samples, which are often difficult to obtain due to the complexity of surgical scenes and the need for domain expertise in annotation. As a result, many studies rely on simulated environments or phantom datasets [22,23,24], which lack clinical complexity and variability, thus limiting their applicability in real-world settings. Furthermore, recent studies have emphasized the importance of combining computational efficiency with sufficiently high segmentation accuracy to enable clinically meaningful real-time performance, both in endoscopic segmentation pipelines [25] and in frameworks targeting online surgical skill assessment [26].
In this paper, we address these limitations by collecting and annotating, at the pixel level, a novel dataset of endoscopic videos from RARP procedures, specifically focused on the suturing phase during VUA. We then develop two segmentation pipelines, one using a training-from-scratch approach and one using transfer learning, for the joint segmentation of mucosal tissues and the suturing needle. We employ three CNN-based models, i.e., U-Net [27] and the 2D and 3D variants of nnU-Net [28], and a pre-trained Vision Transformer, i.e., EndoViT [29], and compare their performance in delineating key elements of the robotic suturing phase. The contributions of this work are summarized below:
  • We formulate and tackle the clinically relevant pixel-level segmentation of mucosal tissue and suture needles specifically during the VUA step of RARP, using real annotated endoscopic videos combined with a DL-based framework;
  • We compare two pipelines employing convolutional- and transformer-based models for the fine-grained segmentation of endoscopic images, assessing the feasibility of applying DL methods to guide the development of objective assessment tools for robotic surgical skills;
  • We study the impact of transfer learning, data augmentation, and task-related loss functions on the delineation of subtle structures, and report considerations on latency and reproducibility of the proposed methods.
To our knowledge, this is the first work addressing this dual segmentation task in real clinical RARP videos, laying the foundation for a downstream analysis of suturing quality in robotic surgery.

2. Problem Formulation

In the context of RARP, the quality of VUA is responsible for preventing urinary leakage and stricture formation, and consequently for preserving continence. The basic idea is to obtain a perfect adaptation of the urethra to the reconstructed bladder neck, without compromising the integrity of the external sphincter [30]. To achieve high-quality anastomosis, a meticulous coaptation of the urethral and bladder mucosa is required, where the tissues are accurately aligned and sutured without excessive trauma [31,32]. To clarify this point, Figure 1 shows a schematic example of the anastomotic procedure (Figure 1a) and the final expected result of the adapted bladder and urethra (Figure 1b). In this context, one discriminative feature for assessing anastomotic quality during RARP is the correctness of suture needle placement and penetration through mucosal layers. Errors in this step, such as skipping the mucosa, asymmetric bites, or excessive tissue traction, may compromise suture integrity and lead to unwanted clinical outcomes.
Achieving fine-grained assessment of this surgical step requires the precise localization of both the surgical needle and the mucosal tissue. Semantic segmentation of these entities can enable a frame-by-frame analysis of the interaction between the needle and the tissue, as well as the characterization of the tissue in the region of interest.

3. Related Work

Recent research has focused on the development of automated tools for assessing surgical performance in robot-assisted procedures. This section provides an overview of the latest DL models for both the assessment of suture quality in simulated surgical environments and the segmentation of real-world surgical scenes in endoscopic data.

3.1. Deep Learning for Suture Quality Assessment

In the field of surgical skill evaluation, suturing represents a key subtask in all procedures, as it requires a high degree of precision and dexterity in tissue handling, factors that are critical for surgical success and postoperative outcomes. Various methods have been developed to assess suturing in open surgery. These algorithms are designed to assess both the overall quality of the suture through a global score [34] and specific aspects such as needle insertion and extraction points, suture placement accuracy, detection of tissue damage, and the time required to complete the task [35,36]. Other studies have assessed the surgeon’s performance during the execution of the surgical task by considering the gestures involved in the suturing process [23,37,38], or the quality of the suture once it has been completed, also employing post-processing image analysis pipelines [39]. In this context, CNN models have commonly been used to identify regions of interest and perform real-time segmentation of suture components. Specifically, advanced architectures such as U-Net [27] and Mask R-CNN [40] have been used for the semantic segmentation of images and videos [41,42,43,44] of continuous and interrupted sutures to identify elementary structures used in subsequent processing stages to calculate geometric quality indices. Despite their effectiveness in simulated or phantom-based environments, these frameworks lack the complexity of real endoscopic environments, leaving a substantial gap to practical deployment. A summary of works, their objectives, methodologies, and limitations is provided in Table 1.

3.2. Deep Learning for Surgical Scene Segmentation

Recent innovations in medical image analysis, such as attention mechanisms and vision–language models [45,46], have outlined complementary directions for improving downstream tasks under clinical constraints. A particularly challenging task is endoscopic image segmentation, in which the most recent DL models have shown impressive results [47]. In this context, the adoption of DL techniques has also brought advances in the semantic segmentation of surgical scenes from endoscopic videos in robotic surgery. Early efforts focused primarily on surgical instrument segmentation, which constitutes a relatively tractable task due to the distinct features of operational tools. Notably, the MICCAI EndoVis Challenge series [48,49] provided benchmark datasets that accelerated research in both instrument segmentation and tracking, fostering the development of increasingly accurate models. Architectures such as U-Net and DeepLabv3+ have been widely adopted, with studies like Shvets et al. [19] demonstrating high performance in binary and multi-class segmentation of robotic instruments.
Beyond tool segmentation, semantic delineation of anatomical structures represents a greater challenge due to visual similarity between tissues, deformation under manipulation, occlusions, and variable lighting. While some approaches have been proposed for organ segmentation in laparoscopic procedures [50,51], the availability of annotated datasets remains a limiting factor. The release of CholecSeg8k [52], a large-scale dataset for laparoscopic cholecystectomy, has enabled multi-class segmentation of both tools and anatomy, but few comparable resources exist for robot-assisted procedures. The EndoVis 2018 challenge [53] expanded the task to multi-class segmentation of both da Vinci tools and porcine anatomy in robotic kidney transplants, marking the first attempt at holistic surgical scene understanding. However, porcine data is significantly simpler than human tissue, limiting the applicability of the developed methods [53]. While results suggested certain advantages of CNN-based approaches in this context, transformer architectures have, since the seminal work of Vaswani et al. [54], achieved state-of-the-art results across diverse domains, including natural language processing, computer vision, remote sensing, and object detection. Building on this foundation, numerous Transformer variants have been developed to address task-specific challenges and efficiency constraints, further extending the applicability of the architecture [55,56,57]. The ability to model long-range dependencies and capture complex patterns across heterogeneous data modalities underscores their potential value for surgical data science, motivating their increasing adoption in medical image analysis. In this regard, an attempt similar to ours was presented by Pak et al. [58], who compared convolutional- and transformer-based models for the semantic segmentation of surgical instruments, bladder, prostate, vas deferens, and seminal vesicles in 3000 frames extracted from RARP procedures. They obtained high Dice Scores on all classes, suggesting better performance of CNN models compared to Transformers. In a subsequent work, the same authors investigated the possibility of applying a reinforced U-Net model for real-time segmentation in RARP procedures [59]. Nonetheless, both of their studies focused on organ-level rather than tissue-level segmentation, examining steps of RARP that did not include VUA.
Consequently, semantic segmentation in robotic prostatectomy, particularly involving delicate structures and surgical needle during suturing, remains an underexplored but critical area for advancing objective assessment of surgical skills and improving training protocols. A summary of works, their objectives, methodologies, and limitations is provided in Table 2.
Overall, prior studies have mainly focused either on the assessment of suture quality in simplified or simulated settings, or on the segmentation of surgical instruments and organs in endoscopic videos. However, the semantic segmentation of tissue-level structures and surgical needles during VUA in robotic prostatectomy remains largely unexplored. This gap is particularly relevant for enabling objective evaluation of surgical performance in real clinical environments. Our work addresses this need by developing and comparing convolutional and transformer-based Deep Learning pipelines specifically designed for endoscopic videos of VUA.

4. Materials and Methods

4.1. Data Collection and Preprocessing

Four videos of RARP procedures performed over the last 4 years at the Mater Dei Hospital of Bari were retrospectively collected. Prostatectomies were performed by the same expert surgeon through a transabdominal approach using the da Vinci Surgical System robotic platform (da VINCI® X™ IS4200, Intuitive Surgical, Inc., Sunnyvale, CA, USA). Videos were recorded at 30 fps using a 2D recorder (MediCapture® Inc., Plymouth Meeting, PA, USA) with a resolution of 896 × 720 pixels. All videos were shared as raw video material without identifying metadata; therefore, the need for ethical approval was waived. Endoscopic videos were inspected under the supervision of the same expert surgeon (>500 RARP procedures performed with the same da VINCI® X™ system) to identify the time intervals where VUA was performed. Furthermore, video sequences encompassing the reconstruction of the posterior rhabdosphincter, commonly known as the Rocco stitch, were selected. This surgical step, which is widely used to provide posterior support to the urethral sphincter complex and to reduce tension on the anastomosis, is performed right before VUA [60], within the same endoscopic environment with a clear view of the mucosal layers and suturing needle. Hence, for the purposes of this study, both procedures were considered suitable for building the dataset.
Video clips were resampled at 10 fps since this rate ensures sufficient diversity between consecutive frames while avoiding redundant images [61]. Each frame was manually tagged by the authors using CVAT (https://www.cvat.ai (accessed on 3 March 2025)) annotation framework and carefully reviewed by a senior urologist.
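As an illustration of this resampling step, the short sketch below (assuming OpenCV is available; file paths are placeholders) keeps every third frame of a 30 fps clip to obtain an approximately 10 fps sequence.

```python
import cv2

def extract_frames(video_path, out_dir, src_fps=30, target_fps=10):
    """Keep every (src_fps // target_fps)-th frame of a video clip."""
    step = src_fps // target_fps  # 3 for a 30 fps recording resampled to 10 fps
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:05d}.png", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```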
The final dataset consists of a total of 4,681 annotated images, each corresponding to a single frame and containing annotations of the mucosa and the surgical needle. Given the purpose of this work, bladder and urethral mucosa, which have the same morphological characteristics, were considered as a single class.
Figure 2 shows one representative frame for each video with an overlay of the original image and the corresponding annotated mask.
The dataset is structured for multi-class segmentation and includes the following classes: (i) bladder/urethral mucosa; (ii) suturing needle; all the remaining pixels were labeled as background. Table 3 reports the composition of the complete dataset, including the total number of original frames extracted from each video, and the mean percentage of pixels per frame for each target class in the ground truth masks. The number of annotated frames per sequence varies considerably, ranging from 140 to 2813 frames. The average pixel coverage of the mucosa ranges from 0.64% to 2.01%, while the presence of the needle is more variable and in some sequences completely absent (e.g., Video 3, where the needle is not visible in the annotated segment). The standard deviations reflect the high variability in visibility and size of these structures across frames, which is typical in dynamic surgical environments.

4.2. Segmentation Pipelines

In this paper, we propose and compare two segmentation pipelines based on DL models to provide an accurate identification of the mucosal layers and the suturing needle during VUA. The first pipeline employs a training-from-scratch approach with three convolutional models. The second pipeline involves a Transformer-based model with the use of transfer learning to overcome the limited size of the dataset.

4.2.1. Convolutional Models

In the first pipeline, three CNN models, namely U-Net [27], nnU-Net 2D, and nnU-Net 3D [28], were trained from scratch. U-Net is a convolutional neural network specifically designed for medical image segmentation, with a characteristic U-shaped encoder–decoder architecture, while nnU-Net provides a self-configuring framework that automatically adapts its architecture and training protocol to a given dataset and has been widely adopted for segmentation tasks. These architectures were chosen for their robustness in medical image analysis as well as their contained computational cost [62]. In addition, since three-dimensional approaches often incur higher memory and latency costs, which conflict with intraoperative constraints, we limited the three-dimensional experiments to the self-configuring 3D version of nnU-Net, which can leverage short-range clips as input.
The U-Net architecture consists of 4 downsampling steps and 4 upsampling steps. The network starts with 64 feature channels and doubles them at each downsampling step, reaching up to 1024 channels at the bottleneck. Each block uses 3 × 3 convolutions followed by Batch Normalization and ReLU activation.
nnU-Net supports both 2D and 3D configurations, enabling flexibility for different imaging modalities and spatial resolutions. In our study, we trained nnU-Net in two distinct configurations:
  • 2D Configuration: The model was trained using individual frames extracted from video sequences. This approach allowed us to leverage a larger number of training samples by treating each frame as an independent input.
  • 3D Configuration: In this setting, we trained the model on short video clips as volumetric inputs, preserving the temporal and spatial continuity across consecutive frames.
Since nnU-Net automatically adapts the network architecture to the dataset, each spatial dimension is padded so that it is divisible by the overall downsampling factor. Consequently, the input resolution changes from 720 × 896 to 768 × 896 for the 2D configuration, and to 32 × 256 × 256 for the 3D configuration. The 2D model includes multiple stages with feature map channels progressively increasing from 32 to 1024 and uses 2D convolutional layers, whereas the 3D model uses 7 stages with feature maps up to 320 and 3D convolutional layers. Both models employ ReLU activations, and each encoder and decoder stage includes two convolutional layers. The 2D nnU-Net architecture comprises approximately 33 million trainable parameters, whereas the 3D version includes about 30 million parameters. A complete overview of the 2D nnU-Net model architecture is provided in Figure 3.
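To make the baseline configuration concrete, the following minimal PyTorch sketch reproduces the encoder block pattern described above for U-Net (two 3 × 3 convolutions with Batch Normalization and ReLU, channels doubling from 64 to 1024 across four downsampling steps); it is an illustrative re-implementation, not the exact code used in the experiments.

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two 3x3 convolutions, each followed by BatchNorm and ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# Encoder path: 64 -> 128 -> 256 -> 512 -> 1024 feature channels,
# with 2x2 max pooling between the four downsampling steps.
channels = [64, 128, 256, 512, 1024]
encoder = nn.ModuleList(
    [DoubleConv(3 if i == 0 else channels[i - 1], channels[i]) for i in range(len(channels))]
)
pool = nn.MaxPool2d(2)
```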
A shared component across both models is the loss function employed during training. We adopted a composite objective function that combines a robust Cross-Entropy Loss ($\mathcal{L}_{\text{CE}}$) with a Soft Dice Loss ($\mathcal{L}_{\text{SD}}$). This combination enables pixel-level supervision as well as direct optimization of region-level overlap. The overall loss function is defined as follows:
$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + \mathcal{L}_{\text{SD}}$  (1)
The Cross-Entropy Loss takes into account pixel-wise classification and is defined as follows:
$\mathcal{L}_{\text{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} g_{i,c}\,\log(p_{i,c})$  (2)
where $p_{i,c}$ is the predicted probability for class c of pixel i, $g_{i,c}$ is the one-hot encoded ground truth, N is the total number of pixels, and C is the total number of classes.
The Soft Dice Loss directly maximizes spatial overlap between prediction and ground truth. Its standard probabilistic formulation is
$\mathcal{L}_{\text{SD}} = 1 - \frac{1}{C}\sum_{c=1}^{C}\frac{2\sum_{i=1}^{N} p_{i,c}\, g_{i,c} + \epsilon}{\sum_{i=1}^{N} p_{i,c} + \sum_{i=1}^{N} g_{i,c} + \epsilon}$  (3)
where $p_{i,c}$, $g_{i,c}$, c, C, and N are defined as in $\mathcal{L}_{\text{CE}}$. The Soft Dice Loss can be rewritten in terms of true positives (TP), false positives (FP), and false negatives (FN) as:
$\mathcal{L}_{\text{SD}} = 1 - \frac{2\sum_{c=1}^{C} \text{TP}_c}{\sum_{c=1}^{C}(\text{TP}_c + \text{FP}_c + \text{FN}_c) + \epsilon}$  (4)
where
  • $\text{TP}_c = \sum_{i=1}^{N} p_{i,c}\, g_{i,c}$,
  • $\text{FP}_c = \sum_{i=1}^{N} p_{i,c}\,(1 - g_{i,c})$,
  • $\text{FN}_c = \sum_{i=1}^{N} (1 - p_{i,c})\, g_{i,c}$,
  • and $\epsilon$ is a small constant for numerical stability.
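A minimal PyTorch sketch of this composite objective, under the definitions above, is shown below; the ε value and class names are illustrative, and details of the actual nnU-Net implementation (e.g., deep supervision) are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiceCELoss(nn.Module):
    """Cross-Entropy plus Soft Dice, as in Equation (1)."""
    def __init__(self, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.ce = nn.CrossEntropyLoss()

    def forward(self, logits, target):
        # logits: (B, C, H, W); target: (B, H, W) with integer class labels
        ce = self.ce(logits, target)
        probs = F.softmax(logits, dim=1)
        onehot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
        dims = (0, 2, 3)                                  # sum over batch and spatial dimensions
        intersection = (probs * onehot).sum(dims)
        cardinality = probs.sum(dims) + onehot.sum(dims)
        dice_per_class = (2.0 * intersection + self.eps) / (cardinality + self.eps)
        soft_dice = 1.0 - dice_per_class.mean()           # average over classes
        return ce + soft_dice
```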

4.2.2. Transformer Model

To overcome the limitations posed by the size of the dataset, we employed the pre-trained EndoViT model [29]. The original model, which follows an encoder–decoder design inspired by Vision Transformers, was trained on a large heterogeneous dataset of over 700,000 endoscopic images to learn general representations of the surgical environment. Specifically, during pretraining, EndoViT employs a Masked Image Modeling (MIM) strategy, where input images are first split into non-overlapping patches that are embedded into a sequence of tokens. A large proportion of these patches are randomly masked, and the encoder, based on a transformer backbone, processes the visible tokens to learn contextualized representations. The decoder reconstructs the missing patches by minimizing a per-patch Mean Squared Error loss. The encoder consists of multiple transformer layers that apply self-attention mechanisms to capture long-range dependencies across image regions. To improve pretraining efficiency, a layer-wise learning rate decay is applied, assigning higher learning rates to layers closer to the input and lower rates to layers near the latent representation. Furthermore, stochastic weight averaging is adopted in the final pretraining epochs to improve model generalization. After pretraining, the decoder is discarded, and the encoder is fine-tuned as a feature extractor for downstream tasks, including semantic segmentation, action triplet recognition, and surgical phase recognition.
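The random masking step of this MIM strategy can be sketched as follows; this is a generic MAE-style masking of patch tokens (the mask ratio and tensor shapes are illustrative assumptions, not the exact EndoViT code).

```python
import torch

def random_patch_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Randomly hide a fraction of patch tokens; return visible tokens and their indices.

    tokens: (B, N, D) sequence of embedded image patches.
    """
    B, N, D = tokens.shape
    n_keep = int(N * (1.0 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)   # random score per patch
    ids_shuffle = noise.argsort(dim=1)               # random permutation of the patches
    ids_keep = ids_shuffle[:, :n_keep]               # patches passed to the encoder
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_keep

# During pretraining, the decoder reconstructs the hidden patches and the loss is the
# mean squared error computed only on the masked positions.
```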
In our study, the pretrained EndoViT encoder was integrated into a Dense Prediction Transformer (DPT) architecture, replacing the original DPT encoder while retaining its randomly initialized decoder. The DPT framework aggregates multi-scale features extracted from different transformer layers and progressively refines them through a decoder to produce dense pixel-wise predictions. This design enables the extraction of both global contextual information and fine-grained local features, which is crucial for accurately segmenting complex surgical scenes. The transformer used in this study comprises 335 million parameters. Despite the substantially larger size of the transformer-based architecture, leveraging a pretrained EndoViT encoder significantly reduces the computational cost of training and enables the exploitation of domain-specific representations, ultimately improving overall efficiency. This fine-tuning process is illustrated in Figure 4.
In the original EndoViT implementation, the loss function for semantic segmentation consisted of a combination of Cross-Entropy Loss and Focal Loss, where the training initially employed standard Cross-Entropy, progressively switching to Focal Loss after a predefined number of epochs. In this work, a different loss function was adopted to better address class imbalance and segmentation stability on the new dataset. A Generalized Dice Loss $\mathcal{L}_{\text{GD}}$ was employed, which directly optimizes a class-weighted Dice formulation while incorporating stabilization terms to mitigate numerical instabilities in highly imbalanced medical datasets. The $\mathcal{L}_{\text{GD}}$ optimizes a modified generalized Dice formulation, where each class is weighted by a value $w_c$ that is inversely proportional to its total volume within the batch:
$w_c = \frac{1}{N_c + \epsilon}$  (5)
where $N_c$ is the total ground-truth volume (i.e., number of pixels) for class c, and $\epsilon = 10^{-6}$ is a small stabilization constant to prevent division by zero. To prevent unstable gradients when some classes are absent in a batch, the weights are set to zero for any class not present in the current batch. The loss is then computed as in Equation (4), with the class-specific weights $w_c$ applied to each term, yielding
$\mathcal{L}_{\text{GD}} = 1 - \frac{2\sum_{c} w_c\, \text{TP}_c}{\sum_{c} w_c\,(\text{TP}_c + \text{FP}_c + \text{FN}_c) + \epsilon}$  (6)
The overall Stable Generalized Dice Loss ($\mathcal{L}_{\text{SGD}}$) is then computed by combining a standard Cross-Entropy Loss with the Dice component using a balancing coefficient $\alpha$:
$\mathcal{L}_{\text{SGD}} = \alpha \cdot \mathcal{L}_{\text{CE}} + (1 - \alpha) \cdot \mathcal{L}_{\text{GD}}$  (7)
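A possible PyTorch realization of this combined loss, following the definitions above, is sketched below; the values of α and ε are placeholders rather than the exact training settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StableGeneralizedDiceLoss(nn.Module):
    """Cross-Entropy combined with a volume-weighted (generalized) Dice term."""
    def __init__(self, alpha=0.5, eps=1e-6):
        super().__init__()
        self.alpha, self.eps = alpha, eps
        self.ce = nn.CrossEntropyLoss()

    def forward(self, logits, target):
        # logits: (B, C, H, W); target: (B, H, W) with integer class labels
        ce = self.ce(logits, target)
        probs = F.softmax(logits, dim=1)
        onehot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
        dims = (0, 2, 3)
        tp = (probs * onehot).sum(dims)
        fp = (probs * (1.0 - onehot)).sum(dims)
        fn = ((1.0 - probs) * onehot).sum(dims)
        volume = onehot.sum(dims)                              # N_c: ground-truth pixels per class
        w = 1.0 / (volume + self.eps)
        w = torch.where(volume > 0, w, torch.zeros_like(w))    # ignore classes absent from the batch
        gd = 1.0 - (2.0 * (w * tp).sum()) / ((w * (tp + fp + fn)).sum() + self.eps)
        return self.alpha * ce + (1.0 - self.alpha) * gd
```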
All images were normalized using dataset-specific mean and standard deviation values, which were calculated by computing, for each color channel (R, G, B), the mean and standard deviation of all pixel values across the entire dataset.
As in the original implementation, data augmentation techniques were applied during training to improve model generalization. Specifically, each input image was randomly resized and cropped with a scale factor uniformly sampled between 0.6 and 1.0. In addition, random horizontal flipping was applied with a probability of 0.5.
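A sketch of the corresponding preprocessing and augmentation, using torchvision transforms with placeholder channel statistics and an assumed input size, is shown below; in practice, the geometric transforms must be applied jointly to the image and its segmentation mask.

```python
from torchvision import transforms

# Placeholder values: the actual per-channel mean and standard deviation are
# computed over all pixels of the training dataset.
DATASET_MEAN = (0.45, 0.30, 0.28)
DATASET_STD = (0.22, 0.18, 0.17)

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(size=(256, 448), scale=(0.6, 1.0)),  # random resize-and-crop
    transforms.RandomHorizontalFlip(p=0.5),                           # random horizontal flip
    transforms.ToTensor(),
    transforms.Normalize(mean=DATASET_MEAN, std=DATASET_STD),
])
```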

4.3. Training Setup and Evaluation Metrics

The dataset was split patient-wise into training and test sets. Specifically, sequences from videos 1, 2, and 3 were used as training samples, with 9 sequences from video 1 used for validation. Video 4 was used as unseen data for testing the models. Overall, the training set consisted of 3875 images, the validation set of 357 images, and the test set of 449 images.
Experiments were conducted on an NVIDIA RTX A6000 GPU (NVIDIA Corporation, Santa Clara, CA, USA), and the pipelines were developed using Python 3.12 with PyTorch (v 2.3), and nnU-Net (v2).
As in the original paper, for EndoViT, an input resolution of 256 × 448 was used, and fixed random seed was adopted in all experiments to ensure reproducibility. Full training configurations and hyperparameters of the models are reported in Table 4.
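The fixed-seed setup can be reproduced with a few lines; the sketch below shows one common way to seed Python, NumPy, and PyTorch (the seed value itself is arbitrary).

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    """Fix random seeds for Python, NumPy, and PyTorch to make runs repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True   # trade speed for determinism
    torch.backends.cudnn.benchmark = False

set_seed(42)
```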
Data augmentation techniques were applied during training to improve model generalization. Specifically, for all models, each input image was randomly resized and cropped using a RandomResizedCrop operation with a scale factor uniformly sampled between 0.6 and 1.0. In addition, random horizontal flipping was applied with a probability of 0.5. Given the strong pixel imbalance for the mucosa and needle classes, evaluation metrics were computed on the cumulative test set using Intersection over Union (IoU) (Equation (8)), Dice Score (Dice) (Equation (9)) and Generalized Dice Score (GDS) (Equation (10)) metrics to measure the segmentation performance.
Specifically, IoU is defined as:
$\text{IoU} = \frac{A_{\text{Overlap}}}{A_{\text{Union}}}$  (8)
where $A_{\text{Overlap}}$ is the area of overlap between the predicted segmentation and the ground truth, and $A_{\text{Union}}$ is the area of their union. This metric is normalized in the interval [0, 1], where 0 means no overlap and 1 means a perfectly superimposed segmentation. Each predicted pixel is compared with the corresponding pixel in the ground-truth segmentation image.
Dice score is a widely adopted metric for evaluating overlap between predicted and ground truth segmentations. It is defined as
$\text{Dice} = \frac{2\,|X \cap Y|}{|X| + |Y|}$  (9)
where X denotes the model’s predicted masks and Y the ground-truth labels. Dice score ranges from 0 to 1, with a score of 1 indicating perfect agreement between prediction and ground truth.
To address the strong class imbalance, particularly between the background, mucosa, and needle classes, we also computed the GDS [63]. Unlike the standard Dice Score, which may be biased toward larger structures, GDS introduces a class-specific weighting mechanism that reduces the influence of dominant classes.
GDS is defined as
$\text{GDS} = \frac{2\sum_{i=1}^{C} w_i \sum_{j=1}^{N} t_{j,i}\, p_{j,i}}{\sum_{i=1}^{C} w_i \left(\sum_{j=1}^{N} t_{j,i} + \sum_{j=1}^{N} p_{j,i}\right)}$  (10)
where C is the number of classes, N is the number of pixels, $p_{j,i}$ and $t_{j,i}$ denote the predicted and target tensors, and $w_i$ is the weight associated with class i, typically defined as $w_i = 1 / \left(\sum_{j=1}^{N} t_{j,i}\right)^2$.
This weighting strategy mitigates the impact of large structures dominating the metric by amplifying the contribution of underrepresented classes.
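For reference, the three metrics can be computed from binary (per-class) prediction and ground-truth masks as in the following NumPy sketch; ε is a small stabilizer and the array shapes are illustrative.

```python
import numpy as np

def iou(pred, gt, eps=1e-6):
    """Intersection over Union for binary masks of equal shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return (inter + eps) / (union + eps)

def dice(pred, gt, eps=1e-6):
    """Dice Score for binary masks of equal shape."""
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def generalized_dice(pred_onehot, gt_onehot, eps=1e-6):
    """Generalized Dice Score over (C, H, W) one-hot masks,
    with weights inversely proportional to the squared class volume."""
    tp = (pred_onehot * gt_onehot).sum(axis=(1, 2))
    sizes = pred_onehot.sum(axis=(1, 2)) + gt_onehot.sum(axis=(1, 2))
    w = 1.0 / (gt_onehot.sum(axis=(1, 2)) ** 2 + eps)
    return (2.0 * (w * tp).sum() + eps) / ((w * sizes).sum() + eps)
```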

5. Results

The performance of multiple segmentation models, namely U-Net, nnU-Net 2D, nnU-Net 3D, and EndoViT, was quantitatively evaluated using the IoU (Equation (8)), Dice (Equation (9)), and GDS (Equation (10)) metrics. For each model, both the mean IoU and the mean Dice scores were calculated across all classes to provide an overall assessment of segmentation quality. In addition, class-wise IoU and Dice Scores were reported to enable a more detailed analysis of performance in each anatomical or contextual region of interest. This evaluation framework allows for a comprehensive comparison between the models, highlighting their ability to segment fine structures (e.g., needle and mucosa). Performance metrics for all models are reported in Table 5.
nnU-Net 2D provided the best overall segmentation accuracy, achieving the highest average and class-specific results. In particular, it outperformed the other models on the clinically relevant needle class (IoU = 0.763; Dice = 0.866), while also yielding the strongest mucosa scores (IoU = 0.495; Dice = 0.663). EndoViT achieved lower results, with a GDS of 0.393, but outperformed both nnU-Net 3D (GDS = 0.347) and the baseline U-Net (GDS = 0.328), also providing a sharper needle segmentation (Dice = 0.748).
In addition to the segmentation accuracy, the computational efficiency of each model was evaluated. The average inference time per frame was approximately 0.12 s for EndoViT, 0.47 s for nnU-Net 3D, 0.28 s for nnU-Net 2D and 0.22 s for U-Net. Qualitative results on selected samples from the dataset are also provided to visually assess the segmentation performance and the spatial alignment between predictions and ground truth (Figure 5). We deliberately selected frames exhibiting common endoscopic artifacts including motion blur (camera/instrument), specular highlights, occlusions, illumination changes, and active bleeding, representing challenging scenarios.

6. Discussion

This study proposed and compared two DL-based segmentation pipelines for the joint identification of mucosal tissues and the suturing needle in endoscopic videos of RARP procedures. The analysis was conducted on a custom annotated dataset derived from endoscopic recordings of four RARP procedures performed by the same expert surgeon at a single clinical center. Video sequences corresponding to vesicourethral anastomosis (VUA) and the preceding Rocco stitch steps were selected under expert surgical supervision, ensuring visual consistency and relevance of anatomical targets.
The CNN-based pipeline, utilizing U-Net and nnU-Net, required tailored architectural configuration and loss function design. nnU-Net automatically adapted to the input data, supporting both 2D and 3D variants: the 2D configuration enabled efficient training by treating each frame independently, whereas the 3D setting preserved temporal coherence across consecutive frames. The transformer-based pipeline integrated a pretrained EndoViT [29] encoder with a Dense Prediction Transformer decoder, leveraging transfer learning to address the limited size and specificity of the dataset.
The best overall segmentation performance was achieved by the nnU-Net 2D model, which outperformed all other architectures in terms of global metrics (mIoU = 0.749, mDice = 0.841) and class-specific accuracy for the needle (IoU = 0.763, Dice = 0.866) and the mucosa (IoU = 0.495, Dice = 0.663). This suggests that the automatic configuration capabilities of nnU-Net were well-suited for the limited and heterogeneous dataset. However, given the complexity of the model, the inference time was relatively high (0.28 s/frame), which could limit its integration in real-time surgical skill assessment frameworks. Prior works employing kinematic data achieved near real-time assessment of stylistic behavior in surgical skills within time intervals of 0.25 s [64], although a minimum interval of 4 s has been shown to provide sufficient information for actionable feedback [26]. EndoViT provided a better trade-off in inference time (0.12 s/frame), achieving good mIoU (0.627) and mDice (0.735) scores, but lower class-specific metrics. Despite not reaching the top performance, EndoViT demonstrated robustness across both classes and was particularly effective in maintaining segmentation quality under high visual variability.
These results confirm the potential of transformer-based architectures, especially when combined with pretrained encoders and adaptive decoders. Nonetheless, the transfer learning and data augmentation strategies adopted were not sufficient to overcome the limited resources in terms of training data, supporting previous findings [58].
The 3D version of nnU-Net underperformed compared to its 2D counterpart, possibly due to the short video clips with limited temporal coherence. The nature of endoscopic videos, which often exhibit high variability between adjacent frames due to camera movement, dynamic lighting, instrument occlusions, and rapid changes in tissue appearance, can lead the model to extract volumetric features that are not very informative, potentially resulting in misclassifications. For these reasons, under conditions of high inter-frame variability and limited temporal coherence, a well-optimized 2D model may be better suited than its 3D counterpart for surgical video segmentation tasks. Finally, the baseline U-Net showed the lowest performance across all metrics, in particular for the mucosa class (Dice = 0.117), highlighting the need for more advanced architectures or automated tuning mechanisms to handle complex surgical scenes and class imbalance effectively.
Overall, the results reinforce the value of automated pipelines and tailored loss functions in addressing the challenges of real-world surgical scene segmentation. While transformer-based models are promising, CNN-based approaches such as nnU-Net still offer a strong baseline when properly configured.
In addition to quantitative metrics, qualitative results provide visual confirmation of segmentation effectiveness under realistic conditions. Figure 5 documents typical failure modes and routine cases across models. In A (specular highlights on instruments), U-Net detects the needle but misses most mucosa; nnU-Net 2D and nnU-Net 3D underestimate mucosa and fragment the needle, while EndoViT produces small false-positive mucosa islands near the instrument shaft. In B and C (relatively clean frames without marked artifacts), predictions are largely consistent across models: nnU-Net 2D provides the closest match to the ground truth for the mucosa, EndoViT is comparable with occasional small isolated blobs, nnU-Net 3D is slightly conservative with minor mask shrinkage, and U-Net tends to undersegment or output needle-only responses. In D (motion blur and partial occlusion), U-Net misses portions of the needle and nnU-Net 3D fragments it, whereas nnU-Net 2D still captures mucosa and most of the needle and EndoViT tracks the needle with a small spatial shift. In E (combined specular highlights and motion blur), all models degrade: U-Net often misses the mucosa entirely, nnU-Net 2D and EndoViT detect small patches with reduced area and occasional speckles, and nnU-Net 3D underpredicts both classes. Figure 6 illustrates representative examples from video 4 of the dataset, showing both the input images and the overlay maps of ground truth and predictions generated by the nnU-Net 2D model.
These examples underscore the model’s ability to accurately segment mucosal boundaries and detect the needle even in complex visual scenarios, including partial occlusions, specularities, and variable lighting. However, occasional false negatives, especially at the edges of thin structures such as the needle, suggest the need for further refinement of anatomical details.
Despite the promising performance, certain limitations should be acknowledged. The dataset used in this study was relatively small and derived from four surgical procedures. Although a leave-one-video-out validation strategy was employed to promote generalizability, the development of larger and more diverse datasets remains essential for ensuring robustness across surgical settings, patient anatomies, and imaging conditions. Furthermore, the dataset exhibited high inter-frame variability in both anatomical visibility and instrument presence, reflecting the dynamic nature of real-world surgical scenes. In addition, the segmentation task proved particularly challenging due to the extreme class imbalance present in the annotated data. As highlighted in Table 3, both the mucosal tissue and the needle occupy only a minimal fraction of the total image area, typically below 2% and, in some cases, well under 1%. This low label density is a known difficulty in medical image segmentation, as the model must accurately detect small, high-precision regions within a predominantly background-dominated field. Such imbalance increases the risk of biased predictions and requires the adoption of tailored loss functions and architectural adjustments to ensure reliable performance.

7. Conclusions

Understanding surgical scenes through image processing techniques remains a complex task, primarily due to the highly dynamic and variable nature of intraoperative video data. Moreover, the field lacks large-scale, annotated datasets, as producing high-quality manual labels for supervised learning is time-consuming, expensive, and requires expert knowledge. In this study, two DL segmentation pipelines targeting endoscopic videos of vesicourethral anastomosis (VUA) were developed and evaluated, comparing convolutional models with transformer-based architectures.
Quantitative results demonstrated that the nnU-Net 2D model achieved the highest segmentation accuracy across all metrics. The model proved effective in identifying both mucosal tissue and the suturing needle, despite significant class imbalance and visual variability in the dataset.
These findings confirm the feasibility of applying DL-based approaches to surgical scene understanding in real clinical settings. Semantic segmentation of anatomical structures and instruments represents a fundamental step towards the development of automated systems for evaluating surgical performance. Such advancements have the potential to support real-time surgical skill assessment and enhance training programs.
Future work will focus on expanding and diversifying annotated datasets, investigating strategies to improve model robustness and generalization, and developing efficient implementations suitable for real-time clinical integration. In particular, the implementation of lightweight segmentation models, as well as the introduction of techniques such as knowledge distillation, will be explored to reduce model complexity while maintaining high performance.
In addition, while the current work focuses primarily on segmentation accuracy as a foundational step, we plan to explore the clinical relevance and predictive utility of the segmentation outputs in follow-up works, also considering that studies have demonstrated the clinical relevance of segmentation results for improving the precision of catheter identification in endoscopic frames during RARP for in-vivo Augmented Reality applications [65]. Integrating such information into a comprehensive framework could support surgical training, competency assessment, and intraoperative decision making.

Author Contributions

Conceptualization, V.B. and E.S.; methodology, E.S., V.B. and A.B.; software, E.S., P.M.M. and C.D.; validation, V.B., M.B., F.M. and A.B.; investigation, V.B. and G.L.; resources, M.B. and G.L.; data curation, E.S., M.B. and G.L.; writing—original draft preparation, E.S., P.M.M., C.D. and A.B.; writing—review and editing, V.B., F.M. and M.B.; supervision, V.B., A.B., M.B., G.L. and F.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded under the National Recovery and Resilience Plan (NRRP), Mission 4-Component 2-Investment 1.1 “Fondo per il Programma Nazionale di Ricerca e Progetti di Rilevante Interesse Nazionale (PRIN)”, funded by the European Union-Next Generation EU, project “Learn a Cognitive framework to build a Robotic PROCtor (LeCoR-PROC)”, CUP: D53D23015970001.

Institutional Review Board Statement

This study was conducted following the ethical standards of the institutional review board (IRB) and/or research committee and in accordance with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Informed Consent Statement

As a retrospective study, written informed consent was waived. All patient information was removed, and the dataset was completely anonymized.

Data Availability Statement

The data presented in this study are available upon reasonable request from the corresponding author.

Acknowledgments

This work was supported by the NRRP project “BRIEF—Biorobotics Research and Innovation Engineering Facilities”, Mission 4: “Istruzione e Ricerca”, Component 2: “Dalla ricerca all’impresa”, Investment 3.1: “Fondo per la realizzazione di un sistema integrato di infrastrutture di ricerca e innovazione”, CUP: J13C22000400007, funded by European Union—NextGenerationEU. The authors express their gratitude to Ilenia Maria Novia for her contribution in the preparation of the dataset used in the study.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Cornford, P.; van den Bergh, R.C.; Briers, E.; Van den Broeck, T.; Brunckhorst, O.; Darraugh, J.; Eberli, D.; De Meerleer, G.; De Santis, M.; Farolfi, A.; et al. EAU-EANM-ESTRO-ESUR-ISUP-SIOG guidelines on prostate cancer—2024 update. Part I: Screening, diagnosis, and local treatment with curative intent. Eur. Urol. 2024, 86, 148–163. [Google Scholar] [CrossRef] [PubMed]
  2. Carbonara, U.; Minafra, P.; Papapicco, G.; De Rienzo, G.; Pagliarulo, V.; Lucarelli, G.; Vitarelli, A.; Ditonno, P. Xi nerve-sparing robotic radical perineal prostatectomy: European single-center technique and outcomes. Eur. Urol. Open Sci. 2022, 41, 55–62. [Google Scholar] [CrossRef]
  3. Binder, J.; Kramer, W. Robotically-assisted laparoscopic radical prostatectomy. BJU Int. 2001, 87, 128–132. [Google Scholar] [CrossRef]
  4. Carbonara, U.; Srinath, M.; Crocerossa, F.; Ferro, M.; Cantiello, F.; Lucarelli, G.; Porpiglia, F.; Battaglia, M.; Ditonno, P.; Autorino, R. Robot-assisted radical prostatectomy versus standard laparoscopic radical prostatectomy: An evidence-based analysis of comparative outcomes. World J. Urol. 2021, 39, 3721–3732. [Google Scholar] [CrossRef] [PubMed]
  5. Burttet, L.M.; Varaschin, G.A.; Berger, A.K.; Cavazzola, L.T.; Berger, M. Prospective evaluation of vesicourethral anastomosis outcomes in robotic radical prostatectomy during early experience in a university hospital. Int. Braz. J. Urol. 2017, 43, 1176–1184. [Google Scholar] [CrossRef]
  6. Chapman, S.; Turo, R.; Cross, W. Vesicourethral anastomosis using V-Loc™ barbed suture during robot-assisted radical prostatectomy. Cent. Eur. J. Urol. 2011, 64, 236. [Google Scholar] [CrossRef]
  7. Zorn, K.C.; Widmer, H.; Lattouf, J.B.; Liberman, D.; Bhojani, N.; Trinh, Q.D.; Sun, M.; Karakiewicz, P.I.; Denis, R.; El-Hakim, A. Novel method of knotless vesicourethral anastomosis during robot-assisted radical prostatectomy: Feasibility study and early outcomes in 30 patients using the interlocked barbed unidirectional V-LOC180 suture. Can. Urol. Assoc. J. 2011, 5, 188. [Google Scholar] [CrossRef]
  8. Hakimi, A.A.; Faleck, D.M.; Sobey, S.; Ioffe, E.; Rabbani, F.; Donat, S.M.; Ghavamian, R. Assessment of complication and functional outcome reporting in the minimally invasive prostatectomy literature from 2006 to the present. BJU Int. 2012, 109, 26–30. [Google Scholar] [CrossRef]
  9. Haque, T.F.; Knudsen, J.E.; You, J.; Hui, A.; Djaladat, H.; Ma, R.; Cen, S.; Goldenberg, M.; Hung, A.J. Competency in Robotic Surgery: Standard Setting for Robotic Suturing Using Objective Assessment and Expert Evaluation. J. Surg. Educ. 2024, 81, 422–430. [Google Scholar] [CrossRef]
  10. Khan, H.; Kozlowski, J.D.; Hussein, A.A.; Sharif, M.; Ahmed, Y.; May, P.; Hammond, Y.; Stone, K.; Ahmad, B.; Cole, A.; et al. Use of Robotic Anastomosis Competency Evaluation (RACE) tool for assessment of surgical competency during urethrovesical anastomosis. Can. Urol. Assoc. J. 2019, 13, E10–E16. [Google Scholar] [CrossRef] [PubMed]
  11. Anderson, D.D.; Long, S.; Thomas, G.W.; Putnam, M.D.; Bechtold, J.E.; Karam, M.D. Objective Structured Assessments of Technical Skills (OSATS) does not assess the quality of the surgical result effectively. Clin. Orthop. Relat. Res. 2016, 474, 874–881. [Google Scholar] [CrossRef]
  12. Gumbs, A.A.; Hogle, N.J.; Fowler, D.L. Evaluation of resident laparoscopic performance using global operative assessment of laparoscopic skills. J. Am. Coll. Surg. 2007, 204, 308–313. [Google Scholar] [CrossRef] [PubMed]
  13. Mackay, S.; Datta, V.; Chang, A.; Shah, J.; Kneebone, R.; Darzi, A. Multiple Objective Measures of Skill (MOMS): A new approach to the assessment of technical ability in surgical trainees. Ann. Surg. 2003, 238, 291–300. [Google Scholar] [CrossRef] [PubMed]
  14. Alibhai, K.M.; Fowler, A.; Gawad, N.; Wood, T.J.; Raîche, I. Assessment of laparoscopic skills: Comparing the reliability of global rating and entrustability tools. Can. Med. Educ. J. 2022, 13, 36–45. [Google Scholar] [CrossRef]
  15. Chen, J.; Cheng, N.; Cacciamani, G.; Oh, P.; Lin-Brande, M.; Remulla, D.; Gill, I.S.; Hung, A.J. Objective assessment of robotic surgical technical skill: A systematic review. J. Urol. 2019, 201, 461–469. [Google Scholar] [CrossRef]
  16. Lam, K.; Chen, J.; Wang, Z.; Iqbal, F.M.; Darzi, A.; Lo, B.; Purkayastha, S.; Kinross, J.M. Machine learning for technical skill assessment in surgery: A systematic review. NPJ Digit. Med. 2022, 5, 24. [Google Scholar] [CrossRef]
  17. Hung, A.J.; Chen, J.; Gill, I.S. Automated performance metrics and machine learning algorithms to measure surgeon performance and anticipate clinical outcomes in robotic surgery. JAMA Surg. 2018, 153, 770–771. [Google Scholar] [CrossRef]
  18. Hung, A.J.; Ma, R.; Cen, S.; Nguyen, J.H.; Lei, X.; Wagner, C. Surgeon automated performance metrics as predictors of early urinary continence recovery after robotic radical prostatectomy—A prospective bi-institutional study. Eur. Urol. Open Sci. 2021, 27, 65–72. [Google Scholar] [CrossRef] [PubMed]
  19. Shvets, A.A.; Rakhlin, A.; Kalinin, A.A.; Iglovikov, V.I. Automatic instrument segmentation in robot-assisted surgery using deep learning. In Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, USA, 17–20 December 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 624–628. [Google Scholar] [CrossRef]
  20. Kitaguchi, D.; Takeshita, N.; Matsuzaki, H.; Hasegawa, H.; Igaki, T.; Oda, T.; Ito, M. Deep learning-based automatic surgical step recognition in intraoperative videos for transanal total mesorectal excision. Surg. Endosc. 2022, 36, 1143–1151. [Google Scholar] [CrossRef]
  21. Luongo, F.; Hakim, R.; Nguyen, J.H.; Anandkumar, A.; Hung, A.J. Deep learning-based computer vision to recognize and classify suturing gestures in robot-assisted surgery. Surgery 2021, 169, 1240–1244. [Google Scholar] [CrossRef]
  22. Lajkó, G.; Nagyne Elek, R.; Haidegger, T. Endoscopic image-based skill assessment in robot-assisted minimally invasive surgery. Sensors 2021, 21, 5412. [Google Scholar] [CrossRef] [PubMed]
  23. Funke, I.; Mees, S.T.; Weitz, J.; Speidel, S. Video-based surgical skill assessment using 3D convolutional neural networks. Int. J. Comput. Assist. Radiol. Surg. 2019, 14, 1217–1225. [Google Scholar] [CrossRef]
  24. Lavanchy, J.L.; Zindel, J.; Kirtac, K.; Twick, I.; Hosgor, E.; Candinas, D.; Beldi, G. Automation of surgical skill assessment using a three-stage machine learning algorithm. Sci. Rep. 2021, 11, 5197. [Google Scholar] [CrossRef]
  25. Mohammed, E.; Khan, A.; Ullah, W.; Khan, W.; Ahmed, M.J. Efficient Polyp Segmentation via Attention-Guided Lightweight Network with Progressive Multi-Scale Fusion. ICCK Trans. Intell. Syst. 2025, 2, 95–108. [Google Scholar]
  26. Anh, N.X.; Nataraja, R.M.; Chauhan, S. Towards near real-time assessment of surgical skills: A comparison of feature extraction techniques. Comput. Methods Programs Biomed. 2020, 187, 105234. [Google Scholar] [CrossRef]
  27. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  28. Isensee, F.; Petersen, J.; Klein, A.; Zimmerer, D.; Jaeger, P.F.; Kohl, S.; Wasserthal, J.; Koehler, G.; Norajitra, T.; Wirkert, S.; et al. nnU-Net: Self-adapting Framework for U-Net-Based Medical Image Segmentation. arXiv 2018, arXiv:1809.10486. [Google Scholar]
  29. Batić, D.; Holm, F.; Özsoy, E.; Czempiel, T.; Navab, N. EndoViT: Pretraining vision transformers on a large collection of endoscopic images. Int. J. Comput. Assist. Radiol. Surg. 2024, 19, 1085–1091. [Google Scholar] [CrossRef]
  30. Gillitzer, R.; Thüroff, J. Technical advances in radical retropubic prostatectomy techniques for avoiding complications. Part II: Vesico-urethral anastomosis and nerve-sparing prostatectomy. BJU Int. 2003, 92, 178–184. [Google Scholar] [CrossRef] [PubMed]
  31. Webb, D.R.; Sethi, K.; Gee, K. An analysis of the causes of bladder neck contracture after open and robot-assisted laparoscopic radical prostatectomy. BJU Int. 2009, 103, 957–963. [Google Scholar] [CrossRef] [PubMed]
  32. Kumar, S.; Soni, P.K.; Chandna, A.; Parmar, K.; Gupta, P.K. Mucosal coaptation technique for early urinary continence after robot-assisted radical prostatectomy: A comparative exploratory study. Cent. Eur. J. Urol. 2021, 74, 528. [Google Scholar]
  33. Vis, A.N.; van der Poel, H.G.; Ruiter, A.E.; Hu, J.C.; Tewari, A.K.; Rocco, B.; Patel, V.R.; Razdan, S.; Nieuwenhuijzen, J.A. Posterior, anterior, and periurethral surgical reconstruction of urinary continence mechanisms in robot-assisted radical prostatectomy: A description and video compilation of commonly performed surgical techniques. Eur. Urol. 2019, 76, 814–822. [Google Scholar] [CrossRef]
  34. Handelman, A.; Keshet, Y.; Livny, E.; Barkan, R.; Nahum, Y.; Tepper, R. Evaluation of suturing performance in general surgery and ocular microsurgery by combining computer vision-based software and distributed fiber optic strain sensors: A proof-of-concept. Int. J. Comput. Assist. Radiol. Surg. 2020, 15, 1359–1367. [Google Scholar] [CrossRef]
  35. Kil, I.; Eidt, J.F.; Singapogu, R.B.; Groff, R.E. Assessment of Open Surgery Suturing Skill: Image-based Metrics Using Computer Vision. Int. J. Comput. Assist. Radiol. Surg. 2024, 81, 983–993. [Google Scholar] [CrossRef] [PubMed]
  36. Yamada, T.; Suda, H.; Yoshitake, A.; Shimizu, H. Development of an Automated Smartphone-Based Suture Evaluation System. J. Surg. Educ. 2022, 79, 802–808. [Google Scholar] [CrossRef] [PubMed]
  37. Liu, D.; Li, Q.; Jiang, T.; Wang, Y.; Miao, R.; Shan, F.; Li, Z. Towards unified surgical skill assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9522–9531. [Google Scholar]
  38. Nagaraj, M.B.; Namazi, B.; Sankaranarayanan, G.; Scott, D.J. Developing artificial intelligence models for medical student suturing and knot-tying video-based assessment and coaching. Surg. Endosc. 2023, 37, 402–411. [Google Scholar] [CrossRef]
  39. Frischknecht, A.C.; Kasten, S.J.; Hamstra, S.J.; Perkins, N.C.; Gillespie, R.B.; Armstrong, T.J.; Minter, R.M. The objective assessment of experts’ and novices’ suturing skills using an image analysis program. Acad. Med. 2013, 88, 260–264. [Google Scholar] [CrossRef] [PubMed]
  40. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  41. Noraset, T.; Mahawithitwong, P.; Dumronggittigule, W.; Pisarnturakit, P.; Iramaneerat, C.; Ruansetakit, C.; Chaikangwan, I.; Poungjantaradej, N.; Yodrabum, N. Automated measurement extraction for assessing simple suture quality in medical education. Expert Syst. Appl. 2024, 241, 122722. [Google Scholar] [CrossRef]
  42. Mansour, M.; Cumak, E.N.; Kutlu, M.; Mahmud, S. Deep learning based suture training system. Surg. Open Sci. 2023, 15, 1–11. [Google Scholar] [CrossRef]
  43. Lee, D.H.; Kwak, K.S.; Lim, S.C. A Neural Network-based Suture-tension Estimation Method Using Spatio-temporal Features of Visual Information and Robot-state Information for Robot-assisted Surgery. Int. J. Control. Autom. Syst. 2023, 21, 4032–4040. [Google Scholar] [CrossRef]
  44. Hoffmann, H.; Funke, I.; Peters, P.; Venkatesh, D.K.; Egger, J.; Rivoir, D.; Röhrig, R.; Hölzle, F.; Bodenstedt, S.; Willemer, M.C.; et al. AIxSuture: Vision-based assessment of open suturing skills. Int. J. Comput. Assist. Radiol. Surg. 2024, 19, 1045–1052. [Google Scholar] [CrossRef]
  45. Li, X.; Li, M.; Yan, P.; Li, G.; Jiang, Y.; Luo, H.; Yin, S. Deep learning attention mechanism in medical image analysis: Basics and beyonds. Int. J. Netw. Dyn. Intell. 2023, 2, 93–116. [Google Scholar] [CrossRef]
  46. Li, X.; Li, L.; Jiang, Y.; Wang, H.; Qiao, X.; Feng, T.; Luo, H.; Zhao, Y. Vision-Language Models in medical image analysis: From simple fusion to general large models. Inf. Fusion 2025, 118, 102995. [Google Scholar] [CrossRef]
  47. Salman, T.; Gazis, A.; Ali, A.; Khan, M.H.; Ali, M.; Khan, H.; Shah, H.A. ColoSegNet: Visual Intelligence Driven Triple Attention Feature Fusion Network for Endoscopic Colorectal Cancer Segmentation. ICCK Trans. Intell. Syst. 2025, 2, 125–136. [Google Scholar] [CrossRef]
  48. Allan, M.; Shvets, A.; Kurmann, T.; Zhang, Z.; Duggal, R.; Su, Y.H.; Rieke, N.; Laina, I.; Kalavakonda, N.; Bodenstedt, S.; et al. 2017 robotic instrument segmentation challenge. arXiv 2019, arXiv:1902.06426. [Google Scholar] [CrossRef]
  49. Roß, T.; Reinke, A.; Full, P.M.; Wagner, M.; Kenngott, H.; Apitz, M.; Hempe, H.; Mindroc-Filimon, D.; Scholz, P.; Tran, T.N.; et al. Comparative validation of multi-instance instrument segmentation in endoscopy: Results of the ROBUST-MIS 2019 challenge. Med. Image Anal. 2021, 70, 101920. [Google Scholar] [CrossRef]
  50. Scheikl, P.M.; Laschewski, S.; Kisilenko, A.; Davitashvili, T.; Müller, B.; Capek, M.; Müller-Stich, B.P.; Wagner, M.; Mathis-Ullrich, F. Deep learning for semantic segmentation of organs and tissues in laparoscopic surgery. In Current Directions in Biomedical Engineering; De Gruyter: Berlin/Heidelberg, Germany, 2020; Volume 6, p. 20200016. [Google Scholar]
  51. Kolbinger, F.R.; Rinner, F.M.; Jenke, A.C.; Carstens, M.; Krell, S.; Leger, S.; Distler, M.; Weitz, J.; Speidel, S.; Bodenstedt, S. Anatomy segmentation in laparoscopic surgery: Comparison of machine learning and human expertise–an experimental study. Int. J. Surg. 2023, 109, 2962–2974. [Google Scholar] [CrossRef] [PubMed]
  52. Hong, W.Y.; Kao, C.L.; Kuo, Y.H.; Wang, J.R.; Chang, W.L.; Shih, C.S. Cholecseg8k: A semantic segmentation dataset for laparoscopic cholecystectomy based on cholec80. arXiv 2020, arXiv:2012.12453. [Google Scholar]
  53. Allan, M.; Kondo, S.; Bodenstedt, S.; Leger, S.; Kadkhodamohammadi, R.; Luengo, I.; Fuentes, F.; Flouty, E.; Mohammed, A.; Pedersen, M.; et al. 2018 robotic scene segmentation challenge. arXiv 2020, arXiv:2001.11190. [Google Scholar] [CrossRef]
  54. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
  55. Chen, M.; Peng, H.; Fu, J.; Ling, H. Autoformer: Searching transformers for visual recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 12270–12280. [Google Scholar]
  56. Guo, J.; Li, T.; Shao, M.; Wang, W.; Pan, L.; Chen, X.; Sun, Y. CeDFormer: Community Enhanced Transformer for Dynamic Network Embedding. In Proceedings of the International Workshop on Discovering Drift Phenomena in Evolving Landscapes, Barcelona, Spain, 26 August 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 37–53. [Google Scholar]
  57. Kehinde, T.; Adedokun, O.J.; Joseph, A.; Kabirat, K.M.; Akano, H.A.; Olanrewaju, O.A. Helformer: An attention-based deep learning model for cryptocurrency price forecasting. J. Big Data 2025, 12, 81. [Google Scholar] [CrossRef]
  58. Pak, S.; Park, S.G.; Park, J.; Choi, H.R.; Lee, J.H.; Lee, W.; Cho, S.T.; Lee, Y.G.; Ahn, H. Application of deep learning for semantic segmentation in robotic prostatectomy: Comparison of convolutional neural networks and visual transformers. Investig. Clin. Urol. 2024, 65, 551–558. [Google Scholar] [CrossRef]
  59. Park, S.G.; Park, J.; Choi, H.R.; Lee, J.H.; Cho, S.T.; Lee, Y.G.; Ahn, H.; Pak, S. Deep Learning Model for Real-time Semantic Segmentation During Intraoperative Robotic Prostatectomy. Eur. Urol. Open Sci. 2024, 62, 47–53. [Google Scholar] [CrossRef]
  60. Rocco, F.; Carmignani, L.; Acquati, P.; Gadda, F.; Dell’Orto, P.; Rocco, B.; Casellato, S.; Gazzano, G.; Consonni, D. Early continence recovery after open radical prostatectomy with restoration of the posterior aspect of the rhabdosphincter. Eur. Urol. 2007, 52, 376–383. [Google Scholar] [CrossRef]
  61. Oh, N.; Kim, B.; Kim, T.; Rhu, J.; Kim, J.; Choi, G.S. Real-time segmentation of biliary structure in pure laparoscopic donor hepatectomy. Sci. Rep. 2024, 14, 22508. [Google Scholar] [CrossRef] [PubMed]
  62. Kamtam, D.N.; Shrager, J.B.; Malla, S.D.; Lin, N.; Cardona, J.J.; Kim, J.J.; Hu, C. Deep learning approaches to surgical video segmentation and object detection: A Scoping Review. Comput. Biol. Med. 2025, 194, 110482. [Google Scholar] [CrossRef] [PubMed]
  63. Sudre, C.H.; Li, W.; Vercauteren, T.; Ourselin, S.; Jorge Cardoso, M. Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: Third International Workshop, DLMIA 2017, and 7th International Workshop, ML-CDS 2017, Held in Conjunction with MICCAI 2017, Québec City, QC, Canada, 14 September 2017; Proceedings 3. Springer: Berlin/Heidelberg, Germany, 2017; pp. 240–248. [Google Scholar]
  64. Ershad, M.; Rege, R.; Majewicz Fey, A. Automatic and near real-time stylistic behavior assessment in robotic surgery. Int. J. Comput. Assist. Radiol. Surg. 2019, 14, 635–643. [Google Scholar] [CrossRef] [PubMed]
  65. Tanzi, L.; Piazzolla, P.; Porpiglia, F.; Vezzetti, E. Real-time deep learning semantic segmentation during intra-operative surgery for 3D augmented reality assistance. Int. J. Comput. Assist. Radiol. Surg. 2021, 16, 1435–1445. [Google Scholar] [CrossRef]
Figure 1. Schematic example of a robotic VUA procedure. (a) Initial placement of sutures to approximate the posterior bladder neck and urethra, thereby reinforcing the posterior support and facilitating tension-free anastomosis. (b) Completed anastomosis showing adaptation of the urethra to the bladder. Reprinted with permission from Ref. [33]. Copyright 2025, Elsevier.
Figure 2. Representative frames extracted from endoscopic video sequences used for annotation. The images depict VUA and Rocco stitch reconstruction steps during RARP, highlighting the bladder and urethral mucosa (green) as well as the suturing needle (blue).
Figure 3. Architecture of the 2D nnU-Net model used in this work. The network receives a 3-channel input image of size 768 × 896 and progressively downsamples the feature maps through a series of convolutional layers with increasing depth. The encoder path includes 8 stages with 2 convolutional blocks per stage, employing stride-2 convolutions to reduce spatial resolution. The decoder path performs upsampling through transpose convolutions followed by standard convolutions, combining features via skip connections from the encoder. The final output is generated through a 1 × 1 convolution layer producing the segmentation map.
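For readers who want to map the caption onto code, the block below is a minimal PyTorch sketch of the encoder–decoder pattern described above: stride-2 convolutions for downsampling, transpose convolutions for upsampling, skip connections between matching stages, and a final 1 × 1 convolution. It is only an illustrative approximation, not the automatically configured nnU-Net plan; the number of stages, channel widths, and normalization choices shown here are assumptions.

```python
# Minimal sketch of the stride-2 encoder / transpose-conv decoder pattern in Figure 3.
# Channel widths and stage count are assumptions, not the plan generated by nnU-Net.
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch, stride=1):
    """Two 3x3 convolutions; the first may use stride 2 to halve the resolution."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.InstanceNorm2d(out_ch), nn.LeakyReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
        nn.InstanceNorm2d(out_ch), nn.LeakyReLU(inplace=True),
    )


class UNet2D(nn.Module):
    def __init__(self, in_ch=3, n_classes=3, widths=(32, 64, 128, 256, 512)):
        super().__init__()
        # Encoder: first stage keeps resolution, later stages downsample with stride 2.
        self.encoder = nn.ModuleList(
            [conv_block(in_ch, widths[0])]
            + [conv_block(widths[i - 1], widths[i], stride=2) for i in range(1, len(widths))]
        )
        # Decoder: transpose-conv upsampling, then a conv block that fuses the skip
        # connection from the matching encoder stage.
        self.up = nn.ModuleList(
            [nn.ConvTranspose2d(widths[i], widths[i - 1], 2, stride=2) for i in range(len(widths) - 1, 0, -1)]
        )
        self.dec = nn.ModuleList(
            [conv_block(2 * widths[i - 1], widths[i - 1]) for i in range(len(widths) - 1, 0, -1)]
        )
        # 1x1 convolution producing one logit map per class (background, mucosa, needle).
        self.head = nn.Conv2d(widths[0], n_classes, kernel_size=1)

    def forward(self, x):
        skips = []
        for stage in self.encoder:
            x = stage(x)
            skips.append(x)
        skips = skips[:-1][::-1]  # deepest features stay in x; the rest become skips
        for up, dec, skip in zip(self.up, self.dec, skips):
            x = dec(torch.cat([up(x), skip], dim=1))
        return self.head(x)


if __name__ == "__main__":
    logits = UNet2D()(torch.randn(1, 3, 768, 896))
    print(logits.shape)  # torch.Size([1, 3, 768, 896])
```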
Figure 4. Fine-tuning pipeline of the EndoViT model for semantic segmentation. The input image is divided into non-overlapping patches and processed by a pretrained EndoViT encoder. The extracted features are passed to a downstream segmentation model, i.e., a DPT decoder, to produce dense pixel-wise predictions. The output is compared against the ground truth using the Stable Generalized Dice Loss.
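The data flow of Figure 4 can be summarised in a few lines of PyTorch. The sketch below replaces the pretrained EndoViT encoder and the DPT decoder with small stand-in modules (TinyViTEncoder and DenseDecoder are hypothetical names), and uses a plain cross-entropy loss in place of the Stable Generalized Dice Loss, so it illustrates only the overall pipeline: patchify, encode with a transformer, decode to dense pixel-wise predictions.

```python
# Simplified stand-in for the fine-tuning pipeline in Figure 4. The real pipeline uses
# the pretrained EndoViT encoder and a DPT decoder; both are replaced here by small
# hypothetical modules so that the sketch stays self-contained.
import torch
import torch.nn as nn


class TinyViTEncoder(nn.Module):
    """Patch embedding + transformer blocks (stand-in for the pretrained EndoViT encoder)."""
    def __init__(self, in_ch=3, dim=256, patch=16, depth=4, heads=8):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)  # non-overlapping patches
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        tokens = self.embed(x)                                     # (B, dim, H/16, W/16)
        b, d, h, w = tokens.shape
        tokens = self.blocks(tokens.flatten(2).transpose(1, 2))    # (B, N, dim)
        return tokens.transpose(1, 2).reshape(b, d, h, w)          # back to a feature map


class DenseDecoder(nn.Module):
    """Upsamples patch features to full resolution (stand-in for the DPT decoder)."""
    def __init__(self, dim=256, n_classes=3, patch=16):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(dim, dim // 2, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim // 2, n_classes, 1),
            nn.Upsample(scale_factor=patch, mode="bilinear", align_corners=False),
        )

    def forward(self, feats):
        return self.head(feats)


if __name__ == "__main__":
    encoder, decoder = TinyViTEncoder(), DenseDecoder()
    x = torch.randn(2, 3, 256, 448)              # input resolution used for EndoViT in Table 4
    target = torch.randint(0, 3, (2, 256, 448))  # 0 = background, 1 = mucosa, 2 = needle
    logits = decoder(encoder(x))                 # (2, 3, 256, 448)
    loss = nn.CrossEntropyLoss()(logits, target)  # the paper uses a Stable Generalized Dice loss instead
    loss.backward()
    print(logits.shape, float(loss))
```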
Figure 5. Qualitative results of each segmentation model. Each row shows the original input image, the ground truth mask (mucosa in green, needle in blue), and the raw model prediction using the same color scheme. Different endoscopic conditions are depicted: (A) Specular highlights on instruments; (B,C) No visible artifact; (D) Motion blur and partial occlusion; (E) Specular highlights and motion blur.
Figure 6. Qualitative segmentation results produced by the nnU-Net 2D model on selected frames from video 4 of the dataset. The two columns show the original input image and the overlay of ground truth and prediction on the input image. The overlay highlights the pixel-wise agreement and disagreement between ground truth and prediction: true positives (correct predictions) are shown in green, false negatives (missed ground truth) in red, and false positives (incorrect predictions) in blue.
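The overlay colouring used in Figure 6 is straightforward to reproduce; the sketch below shows one possible NumPy implementation, with function and variable names chosen for illustration rather than taken from the paper's code.

```python
# Minimal sketch of the overlay colouring in Figure 6: for a given class, pixels are
# painted green where ground truth and prediction agree (true positives), red where
# the ground truth is missed (false negatives), and blue where the prediction is
# spurious (false positives).
import numpy as np


def overlay_errors(image, gt_mask, pred_mask, class_id, alpha=0.5):
    """Blend TP (green), FN (red) and FP (blue) pixels of one class onto an RGB frame."""
    gt = gt_mask == class_id
    pred = pred_mask == class_id
    regions = [
        (np.logical_and(gt, pred), (0, 255, 0)),    # true positives: correct predictions
        (np.logical_and(gt, ~pred), (255, 0, 0)),   # false negatives: missed ground truth
        (np.logical_and(~gt, pred), (0, 0, 255)),   # false positives: incorrect predictions
    ]
    out = image.astype(np.float32).copy()
    for region, color in regions:
        out[region] = (1 - alpha) * out[region] + alpha * np.array(color, dtype=np.float32)
    return out.astype(np.uint8)


if __name__ == "__main__":
    frame = np.zeros((4, 4, 3), dtype=np.uint8)
    gt = np.array([[1, 1, 0, 0]] * 4)
    pr = np.array([[1, 0, 1, 0]] * 4)
    print(overlay_errors(frame, gt, pr, class_id=1)[0])  # green, red, blue, untouched pixels
```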
Table 1. Summary of recent works on Deep Learning for suture quality assessment.

| Study | Objective | Methodology | Limitations |
|---|---|---|---|
| Handelman et al. [34] | Assessment of suture quality and stitching flow in general and ocular surgery | Image processing for end-product sutures | Simulated and experimental setup; not real-time |
| Kil et al. [35]; Yamada et al. [36] | Identification of insertion/extraction points, accuracy, tissue damage, task time | Image- and video-based metrics | Simulated settings; no endoscopic data; focus on open surgery |
| Liu et al. [37]; Funke et al. [23]; Nagaraj et al. [38] | Automated assessment of surgical skill and gestures from video | Multi-path framework (skill aspects), 3D ConvNet (skill classification), CNN-based error detection (instrument/knot) | Focus on gestures; limited datasets; mostly simulated/training settings |
| Frischknecht et al. [39] | Objective assessment of expert vs. novice suturing skills | Image processing for end-product sutures (stitch length, bite size, travel, orientation, symmetry) | Post-hoc evaluation; not real-time; tested on limited sample |
| Noraset et al. [41]; Mansour et al. [42]; Lee et al. [43]; Hoffmann et al. [44] | Automated analysis of sutures for quality and skill assessment | CNN-based approaches: instance segmentation (suture geometry), image classification (success/failure), spatio-temporal models (tension prediction), video classification (skill levels) | Simulated/phantom or limited datasets; lack of endoscopic and clinical variability |
Table 2. Summary of recent works on Deep Learning for surgical scene segmentation.

| Study | Objective | Methodology | Limitations |
|---|---|---|---|
| Shvets et al. [19] | Instrument segmentation | CNN-based models | Focus on tools; limited anatomical context |
| Scheikl et al. [50]; Kolbinger et al. [51] | Semantic segmentation of organs/tissues in laparoscopic surgery | CNN- and Transformer-based models | Laparoscopic focus; limited data and anatomical coverage |
| Hong et al. [52] | Multi-class segmentation in cholecystectomy | CNN-based models | Laparoscopic focus; not validated in robotic surgery |
| Allan et al. [53] | Tool and anatomy segmentation in kidney transplant | CNN-based models | Porcine data; simpler than human tissues; limited anatomical realism |
| Pak et al. [58]; Park et al. [59] | Segmentation of instruments and organs in RARP | CNN- and transformer-based models; Reinforced U-Net for real-time segmentation | Organ-level focus; limited dataset diversity |
Table 3. Dataset structure: frame count and mask pixel statistics per annotated sequence. For each video, the table reports the number of frames and the mean and standard deviation of the pixel percentage per frame occupied by the mucosa and the needle.

| Video | Sequence | # Frames | Mucosa (% ± SD) | Needle (% ± SD) |
|---|---|---|---|---|
| 1 | 22 | 1279 | 0.64 ± 0.45 | 0.32 ± 0.28 |
| 2 | 11 | 2813 | 1.26 ± 1.19 | 0.49 ± 0.56 |
| 3 | 1 | 140 | 2.01 ± 0.61 | – |
| 4 | 3 | 449 | 1.67 ± 1.67 | 0.37 ± 0.56 |
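The per-frame pixel percentages in Table 3 can be obtained with a few lines of NumPy, assuming the annotations are stored as integer label masks (0 = background, 1 = mucosa, 2 = needle); the mask format and helper names in this sketch are assumptions.

```python
# Sketch of the per-frame class pixel statistics reported in Table 3.
import numpy as np


def class_pixel_stats(masks, class_id):
    """Mean and standard deviation of the percentage of pixels per frame for one class."""
    percentages = [100.0 * np.mean(mask == class_id) for mask in masks]
    return float(np.mean(percentages)), float(np.std(percentages))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_masks = [rng.integers(0, 3, size=(768, 896)) for _ in range(10)]  # stand-in annotations
    mean_pct, std_pct = class_pixel_stats(fake_masks, class_id=1)
    print(f"mucosa: {mean_pct:.2f} ± {std_pct:.2f} %")
```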
Table 4. Comparison of training configurations for the proposed models.

| Parameter | EndoViT | U-Net | nnU-Net 2D | nnU-Net 3D |
|---|---|---|---|---|
| Input resolution | 256 × 448 | 384 × 448 | 768 × 896 | 32 × 256 × 256 |
| Loss function | Stable Generalized Dice | Dice | CE + Soft Dice | CE + Soft Dice |
| Optimizer | AdamW | RMSprop | SGD | SGD |
| Base learning rate | 5 × 10⁻⁴ | 1 × 10⁻⁵ | 1 × 10⁻² | 1 × 10⁻² |
| Weight decay | 0 | 1 × 10⁻⁸ | 3 × 10⁻⁵ | 3 × 10⁻⁵ |
| Drop path rate | 0.1 | 0 | 0 | 0 |
| Batch size | 64 | 8 | 16 | 2 |
| Number of epochs | 20 | 20 | 200 | 200 |
| Normalization | Z-score | Z-score | Z-score | Z-score |
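Table 4 lists Dice-style losses for all models. The sketch below shows a multi-class soft Dice loss and, optionally, the inverse-squared-volume class weighting of the Generalized Dice loss [63], which keeps small structures such as the needle from being swamped by the background. The exact stabilisation used by the Stable Generalized Dice variant is not reproduced here, and the epsilon handling is an assumption.

```python
# Hedged sketch of the Dice-style losses listed in Table 4.
import torch
import torch.nn.functional as F


def soft_dice_loss(logits, target, generalized=False, eps=1e-6):
    """logits: (B, C, H, W); target: (B, H, W) integer labels in [0, C)."""
    n_classes = logits.shape[1]
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, n_classes).permute(0, 3, 1, 2).float()

    intersection = (probs * onehot).sum(dim=(0, 2, 3))   # per-class overlap
    cardinality = (probs + onehot).sum(dim=(0, 2, 3))    # per-class volume

    if generalized:
        # Generalized Dice: weight each class by the inverse square of its volume [63].
        weights = 1.0 / (onehot.sum(dim=(0, 2, 3)) ** 2 + eps)
        dice = (2 * (weights * intersection).sum() + eps) / ((weights * cardinality).sum() + eps)
    else:
        dice = ((2 * intersection + eps) / (cardinality + eps)).mean()
    return 1.0 - dice


if __name__ == "__main__":
    logits = torch.randn(2, 3, 64, 64, requires_grad=True)
    target = torch.randint(0, 3, (2, 64, 64))
    print(float(soft_dice_loss(logits, target)), float(soft_dice_loss(logits, target, generalized=True)))
```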
Table 5. Quantitative segmentation results of the models on the test set. Reported metrics include mIoU, mDice, GDS, class-specific IoU and Dice values for the mucosa and needle structures, and per-frame inference time.

| Model | mIoU | mDice | GDS | IoU Mucosa | Dice Mucosa | IoU Needle | Dice Needle | Inference Time (s/frame) |
|---|---|---|---|---|---|---|---|---|
| U-Net | 0.512 | 0.590 | 0.328 | 0.062 | 0.117 | 0.491 | 0.658 | 0.22 |
| nnU-Net 2D | 0.749 | 0.841 | 0.555 | 0.495 | 0.663 | 0.763 | 0.866 | 0.28 |
| nnU-Net 3D | 0.589 | 0.696 | 0.347 | 0.252 | 0.403 | 0.529 | 0.692 | 0.47 |
| EndoViT | 0.627 | 0.735 | 0.393 | 0.301 | 0.463 | 0.598 | 0.748 | 0.12 |
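The class-specific IoU and Dice values in Table 5 follow the standard overlap definitions; the sketch below computes them from integer label masks. Whether scores are averaged per frame or pooled over the whole test set is an aggregation detail not specified here and is treated as an assumption.

```python
# Minimal sketch of the per-class IoU and Dice scores reported in Table 5.
import numpy as np


def iou_and_dice(gt_mask, pred_mask, class_id, eps=1e-7):
    """Overlap metrics for one class between two integer label masks."""
    gt = gt_mask == class_id
    pred = pred_mask == class_id
    inter = np.logical_and(gt, pred).sum()
    union = np.logical_or(gt, pred).sum()
    iou = (inter + eps) / (union + eps)
    dice = (2 * inter + eps) / (gt.sum() + pred.sum() + eps)
    return float(iou), float(dice)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.integers(0, 3, size=(768, 896))  # 0 = background, 1 = mucosa, 2 = needle
    pred = gt.copy()
    pred[:50] = 0                              # corrupt part of the prediction
    for cls, name in [(1, "mucosa"), (2, "needle")]:
        print(name, iou_and_dice(gt, pred, cls))
```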
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
