1. Introduction
The evolution of surgery has been remarkable, stretching from rudimentary practices to the advanced techniques we see today [1]. The development of Future Operating Rooms (FORs) increasingly relies on Context-Aware Systems (CASs), which streamline surgical workflows and enhance patient outcomes through real-time data analysis [2]. Integrating CASs into FORs will enrich knowledge-based systems with data-driven surgical treatments. Such data-driven treatments require collaborative interactions for multi-perspective knowledge sharing among medical teams, such as the surgical and anesthetic teams [3,4]. In this context, CASs must be capable of envisioning workflows within the OR and understanding the current situation by fusing data from various perspectives, including both procedural and patient-related data [5].
The emerging field of Surgical Data Science (SDS) leverages big data analysis to model surgical processes and develop CASs, representing a significant leap forward in the advancement of surgical procedures [6]. The automatic identification of surgical tools is a crucial component of modeling surgical processes and is therefore addressed in this work. However, Minimally Invasive Surgery (MIS) presents unique challenges, particularly due to the narrow operative space and limited field of view [7,8]. These constraints make precise navigation of surgical tools critical. Moreover, the problem involves further challenging aspects such as rapid camera movements, variable tissue appearance, and obstructive elements like smoke and blood, as illustrated in Figure 1. These challenges hinder the performance of existing methodologies. Combining spatial and temporal features can overcome the limitations of previous approaches and provide more accurate and robust tool classification in MIS environments.
In this work, we propose a novel architecture to identify surgical tools in laparoscopic images. Our approach employs a modified DenseNet121 architecture [9] to learn spatial features, integrated with a Bidirectional LSTM (BiLSTM) network to capture temporal dependencies in laparoscopic videos. We developed two deep learning multi-label models for surgical tool classification. The first model incorporates multiple Squeeze-and-Excitation (SE) attention modules, a Feature Fusion Module (FFM), and a BiLSTM layer. The second model employs a single SE block after the FFM, followed by a BiLSTM layer. Additionally, we explored a simpler and computationally less expensive single-label classification model for cholecystectomy images in which only one tool is present at a time.
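A minimal PyTorch sketch of the multi-label CNN-SE-FFM-BiLSTM idea is given below. The backbone tap points, layer sizes, and fusion strategy shown here are illustrative assumptions rather than the exact configuration used in this work; SE blocks reweight mid- and high-level DenseNet features before pooling and fusion, and the BiLSTM operates over the per-frame fused features.

```python
# Minimal sketch of the CNN-SE-FFM-BiLSTM idea; sizes and tap points are assumptions.
import torch
import torch.nn as nn
from torchvision import models

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: channel-wise reweighting of feature maps."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                          # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))            # squeeze -> (B, C)
        return x * w.unsqueeze(-1).unsqueeze(-1)   # excite

class ToolClassifier(nn.Module):
    def __init__(self, num_tools=7, hidden=256):   # Cholec80 annotates 7 tools
        super().__init__()
        backbone = models.densenet121(weights="DEFAULT").features  # ImageNet pre-training assumed
        self.low = backbone[:7]        # up to denseblock2 (512 channels) -- assumed tap point
        self.high = backbone[7:]       # remaining blocks (1024 channels)
        self.se_mid, self.se_high = SEBlock(512), SEBlock(1024)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.ffm = nn.Linear(512 + 1024, hidden)   # feature fusion module (concatenate + project)
        self.bilstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_tools)

    def forward(self, clip):                       # clip: (B, T, 3, H, W) frame sequence
        b, t = clip.shape[:2]
        x = clip.flatten(0, 1)                     # treat frames independently in the CNN
        mid = self.se_mid(self.low(x))
        high = self.se_high(self.high(mid))
        fused = self.ffm(torch.cat([self.pool(mid).flatten(1),
                                    self.pool(high).flatten(1)], dim=1))
        seq, _ = self.bilstm(fused.view(b, t, -1)) # temporal modeling across the clip
        return self.head(seq)                      # (B, T, num_tools) logits; train with BCEWithLogits
```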
The primary objectives of this study are the following:
To develop a novel architecture that effectively combines spatial and temporal feature extraction for surgical tool classification.
To investigate the effect on classification performance of integrating a single SE block versus multiple SE attention modules, and of combining features at different levels.
To explore the classification performance of the proposed approaches on multi-label and single-label data.
To compare our method’s performance against state-of-the-art approaches in surgical tool classification.
2. State of the Art
Recent approaches have primarily employed deep learning techniques, with convolutional neural networks (CNNs) serving as a foundation, often combined with other methods to enhance performance and address specific challenges in the field. Alshirbaji et al. introduced a hierarchical neural network architecture that incorporates both spatial and temporal information. Their method combines a CNN with two long short-term memory (LSTM) models to capture temporal relationships across subsequent laparoscopic images. Evaluation on the Cholec80 dataset using Monte Carlo six-fold cross-validation showed a 3.00% improvement in mean average precision (mAP) compared to baseline methods [10]. However, the authors noted that integrating the three models into an end-to-end framework could potentially offer additional discriminative spatiotemporal properties.
To address the challenge of imbalanced datasets, Alshirbaji et al. [11] proposed, in another study, a CNN-based approach utilizing resampling methods and a loss-sensitive learning strategy. This work highlighted the bias in CNN training caused by the imbalanced distribution of the training data, as a result of which CNNs struggle with minority classes such as scissors and irrigator in the Cholec80 dataset. The proposed approach showed improved performance in categorizing rarely used surgical instruments.
In another study, data augmentation methods were explored to improve model generalization [12]. Uniform, random, or distinctive background patterns and artificially produced images were used as training patterns. Their experiments showed that classification accuracy increased by 10% when trained with augmented data. Interestingly, training on datasets with identical backgrounds rather than random backgrounds allowed the CNN model to focus more effectively on areas containing target items such as surgical tools. However, the low performance on original background datasets indicated a significant dependence on background patterns and weak generalization capacity, highlighting the ongoing challenges in developing robust models for surgical tool classification.
Attention mechanisms have also been explored to improve surgical tool detection. Jalal et al. [13] proposed a deep learning technique for surgical tool localization in laparoscopic images, which was trained on binary tool presence data. The model uses an attention-based CNN, from which gradient class activation maps (Grad-CAMs) were extracted. The Grad-CAMs were then processed to generate bounding boxes around the surgical instruments. The performance of this method was evaluated using the Dice similarity coefficient. Results showed that the proposed attention-CNN achieved a mean tool localization precision of 72.4%, significantly outperforming the base CNN model, which had a precision of 28.3%. However, the method lacked high accuracy for certain instruments such as irrigators, clippers, and scissors.
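For illustration, a Grad-CAM heatmap can be converted into a bounding box by thresholding the normalized map and taking the extent of the activated region, as sketched below; the threshold and post-processing are generic assumptions, not the authors' exact pipeline.

```python
# Generic sketch: Grad-CAM heatmap -> bounding box; threshold is an illustrative assumption.
import numpy as np

def cam_to_bbox(cam, threshold=0.5):
    """cam: 2-D Grad-CAM heatmap (H, W); returns (x_min, y_min, x_max, y_max) or None."""
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize activations to [0, 1]
    mask = cam >= threshold                                     # keep strongly activated pixels
    if not mask.any():
        return None                                             # no salient region for this tool
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```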
Recent work has also focused on multi-task learning and feature fusion. Jin et al. [14] developed a Multi-Task Recurrent Convolutional Network with Correlation Loss (MTRCNet-CL) to simultaneously improve phase recognition and surgical instrument identification. This approach demonstrated superior performance on the Cholec80 dataset, achieving a mAP of 89.1% for tool presence identification.
4. Results
4.1. Model Performance
The performance of the two main architectures, CNN-SE-FFM-BiLSTM and CNN-FFM-SE-BiLSTM, was evaluated on the tool presence classification task. The models were trained and validated on the Cholec80 dataset, and their performance was assessed using average precision metrics for each tool.
Figure 4 shows the performance of both models. The CNN-SE-FFM-BiLSTM model achieved better overall performance than the CNN-FFM-SE-BiLSTM model. The improvement is most pronounced for tools with more complex usage patterns, especially the scissors and irrigator, which are underrepresented in the dataset compared to other tools and are typically held near the grasper, causing them to be confused with it in some instances or even occluded by it in others.
Table 2 presents accuracy, precision, recall and F1 score calculated at a threshold of 0.5 for the CNN-SE-FFM-BiLSTM model.
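The thresholded metrics in Table 2 can be reproduced from per-frame tool probabilities as sketched below; metric definitions follow scikit-learn, and the per-tool aggregation of accuracy shown here is an assumption.

```python
# Sketch of per-tool metrics at a 0.5 threshold for multi-label tool presence.
import numpy as np
from sklearn.metrics import average_precision_score, precision_score, recall_score, f1_score

def tool_metrics(y_true, y_scores, threshold=0.5):
    """y_true, y_scores: arrays of shape (num_frames, num_tools) with binary labels and probabilities."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_scores) >= threshold).astype(int)
    return {
        "AP": average_precision_score(y_true, y_scores, average=None),  # per-tool AP; its mean is the mAP
        "accuracy": (y_pred == y_true).mean(axis=0),                    # per-tool accuracy (assumed aggregation)
        "precision": precision_score(y_true, y_pred, average=None, zero_division=0),
        "recall": recall_score(y_true, y_pred, average=None, zero_division=0),
        "f1": f1_score(y_true, y_pred, average=None, zero_division=0),
    }
```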
The single-label model achieved outstanding performance with only one SE block placed after the feature fusion module and without the additional temporal features extracted by the BiLSTM network in the multi-label setting. Figure 4 shows the performance of the CNN-FFM-SE single-label model evaluated on single-tool images of the Cholec80 dataset, alongside the CNN-SE-FFM-BiLSTM model, allowing a direct comparison of the single-label and multi-label classification approaches.
4.2. Ablation Study
An ablation study was conducted to analyze the contribution of the various components of the proposed multi-label model and the single-label architecture to the overall classification performance. The results are summarized in Table 3 and Table 4, respectively, validating the proposed design and network block selections in both settings.
For the multi-label classification problem, our experiments showed that the BCE loss coupled with the cosine-annealing learning-rate (LR) scheduler achieved slightly higher performance (mAP = 94.63%) than focal loss (FL) with the same scheduler (mAP = 94.49%). Focal loss with the ReduceLROnPlateau scheduler performed slightly worse (mAP = 94.27%). Although the differences are small, the cosine-annealing LR scheduler with the BCE loss remains optimal for the multi-label setting.
In contrast, the single-label task suffers from more pronounced class imbalance and fewer training samples overall, since the single-label model operates only on single-tool images. As a result, the best performance for the single-label task was achieved with focal loss and the ReduceLROnPlateau scheduler (mAP = 95.32%). Ablation runs on the single-label model show significant increases in the average precision of minority tools such as Scissors and Clipper, by 13–21% compared to the standard BCE loss, and an overall increase in mAP of 10%.
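A sketch of the two training configurations compared in this ablation is given below. The focal-loss formulation, learning rate, and scheduler arguments are common defaults and should be read as assumptions, not the tuned values reported in Section 4.4.

```python
# Sketch of the two loss/scheduler configurations compared in the ablation study.
import torch
import torch.nn as nn

class FocalLoss(nn.Module):
    """Binary focal loss for multi-label logits (sigmoid-based)."""
    def __init__(self, gamma=2.0, alpha=0.25):
        super().__init__()
        self.gamma, self.alpha = gamma, alpha

    def forward(self, logits, targets):
        bce = nn.functional.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p_t = torch.exp(-bce)                                        # probability assigned to the true label
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        return (alpha_t * (1.0 - p_t) ** self.gamma * bce).mean()

model = nn.Linear(512, 7)                                  # placeholder standing in for the tool classifier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is a placeholder; tuned via Optuna in practice

# Multi-label setting: BCE loss with a cosine-annealing LR schedule (best in the multi-label ablation).
criterion_ml = nn.BCEWithLogitsLoss()
scheduler_ml = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=25)  # T_max = max epochs

# Single-label setting: focal loss with ReduceLROnPlateau (best in the single-label ablation); patience assumed.
criterion_sl = FocalLoss(gamma=2.0, alpha=0.25)
scheduler_sl = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=3)
```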
4.3. Computation Time and Efficiency
The total computation time for the experiments was 5.7 h, with training performed on an NVIDIA RTX 4090 GPU (see Table 5). The models were trained for up to 25 epochs, with early stopping applied based on validation loss to prevent overfitting. The models demonstrated efficient convergence, particularly when employing the optimized hyperparameters identified through the Optuna framework.
The inclusion of multiple SE blocks before the fusion of mid- to high-level features led to slightly better performance than a single SE block after the fused features. Nonetheless, both multi-label configurations required the same training time of 1.60 h and the same inference time of 1.33 ms per image, based on a 32-frame batch size.
To assess feasibility for real-time deployment, the proposed CNN-SE-FFM-BiLSTM model was also evaluated using a 1-frame batch size, simulating single-frame inference typical in surgical assistance systems. Under this setting, the proposed CNN-SE-FFM-BiLSTM model required only 2.90 GFLOPs per image, had 8.27 million parameters, and consumed 0.27 GB of GPU memory. It achieved an average inference latency of 11.58 ms, corresponding to 86.3 FPS, compared to 25 FPS for the original video recording in the Cholec80 dataset.
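The reported single-frame latency and throughput can be measured with a standard warm-up-and-synchronize timing loop, sketched below under the assumption of a CUDA device and 224 × 224 inputs.

```python
# Sketch of single-frame latency / FPS measurement on a CUDA device.
import time
import torch

@torch.no_grad()
def measure_latency(model, device="cuda", n_warmup=20, n_runs=200):
    """Average single-frame inference latency (ms) and throughput (FPS)."""
    model.eval().to(device)
    frame = torch.randn(1, 1, 3, 224, 224, device=device)  # (batch=1, T=1, C, H, W); 224x224 assumed
    for _ in range(n_warmup):                               # warm-up to stabilize GPU clocks and caches
        model(frame)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        model(frame)
    torch.cuda.synchronize()                                # wait for all kernels before stopping the clock
    latency_ms = (time.perf_counter() - start) / n_runs * 1e3
    return latency_ms, 1000.0 / latency_ms

# e.g., latency_ms, fps = measure_latency(some_tool_classifier_instance)
```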
4.4. Hyperparameter Optimization
The hyperparameters optimized via Optuna were critical in achieving the best performance. The final values for the learning rate, batch size, focal loss parameters, and mixup alpha are as follows:
These hyperparameters resulted in the best combination of accuracy, precision, recall, and F1 score across all evaluated models.
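The sketch below illustrates how such a search can be set up with Optuna; the search ranges and the train_and_validate helper are hypothetical and only indicate the shape of the optimization, not the actual study configuration.

```python
# Sketch of an Optuna study over the hyperparameters listed above; ranges are illustrative assumptions.
import optuna

def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64])
    focal_gamma = trial.suggest_float("focal_gamma", 1.0, 3.0)
    focal_alpha = trial.suggest_float("focal_alpha", 0.1, 0.5)
    mixup_alpha = trial.suggest_float("mixup_alpha", 0.1, 0.4)
    # train_and_validate is a hypothetical helper that trains the model with these settings
    # and returns the validation mAP to be maximized.
    return train_and_validate(lr=lr, batch_size=batch_size, focal_gamma=focal_gamma,
                              focal_alpha=focal_alpha, mixup_alpha=mixup_alpha)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```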
5. Discussion
This work introduces CNN-based architectures for classifying surgical tools in laparoscopic images. Two training strategies were explored: training with images containing a single tool (single-label training) and training with images sampled from complete procedure videos at a rate of 1 Hz. Furthermore, the study demonstrates the benefits of attention modules, feature fusion, and temporal modeling for classifying the type of surgical tool in laparoscopic images.
The results in Figure 4 show a significant variation in the AP of different tools across the tested architectures. The CNN-SE-FFM-BiLSTM model generally outperformed the other model, particularly for tools like Scissors, Clipper and Irrigator, achieving AP values of 90.73%, 97.62% and 91.70%, respectively. The model’s effectiveness can be attributed to its use of multiple SE blocks within the CNN layers, which enhance feature representation.
The CNN-FFM-SE-BiLSTM model, while slightly underperforming compared to CNN-SE-FFM-BiLSTM, still showed promising results, especially in detecting frequently used tools such as the Hook, with an AP of 99.57% (see Figure 4). It showed lower performance on tools such as Scissors and Irrigator (89.12% and 90.46% AP, respectively). Introducing multiple SE blocks before fusing mid- to high-level features improved classification performance, making the CNN-SE-FFM-BiLSTM model the best performer across all surgical instruments in the multi-label classification problem.
The proposed models demonstrate that the inclusion of SE blocks and temporal modeling can significantly enhance the classification accuracy of certain tools, though challenges remain in consistently detecting tools that are less frequently used or more likely to be obscured in the surgical field.
Compared to state-of-the-art methods, the proposed CNN-SE-FFM-BiLSTM architecture demonstrated competitive performance. The mean average precision (mAP) of the CNN-SE-FFM-BiLSTM model is 94.63%, which marks a significant improvement over earlier approaches such as ToolMod [11], MTRC [14], and Nwoye [25] (see Table 6). These results demonstrate the efficacy of combining spatial and temporal features for classifying surgical tools.
In comparison to the CNN-LSTM model in [12], our approach shows comparable performance in mAP, validating the benefit of integrating SE blocks with feature fusion modules. However, the performance on minority classes such as Scissors and Irrigator still indicates an area where further improvements are necessary.
While the proposed models showed competitive performance, a slight but noticeable difference was observed between the single-label and the multi-label approaches. The CNN-FFM-SE model, when applied in a single-label context, achieved excellent performance with a mean average precision (mAP) of 95.32%. However, when the model was adapted for multi-label classification, as in the CNN-SE-FFM-BiLSTM and CNN-FFM-SE-BiLSTM architectures, the performance dropped to mAP values of 94.63% and 94.26%, respectively. This drop can be attributed to the increased complexity of multi-label classification. In multi-label scenarios, the model must simultaneously identify multiple tools that may be present in a single frame, often under challenging conditions such as tool overlap, occlusion, or varying scales; this is significantly more complex than single-label classification, where only one tool is present at a time. The presence of multiple tools introduces additional challenges, such as the need for more precise feature discrimination and the handling of inter-class dependencies, which the current architecture might not fully capture. The proposed approach mitigated these complexities through the inclusion of multiple SE blocks before fusing mid- to high-level features, followed by a BiLSTM network.
Despite the strong performance of deep neural networks, they remain vulnerable to adversarial attacks, where imperceptible modifications to input data can lead to incorrect model predictions [26]. This vulnerability presents a critical limitation for the deployment of AI models in clinical settings and highlights the need for further research into model robustness.
Finally, future work will explore how the proposed architecture generalizes across different surgical procedures and datasets, as the current study is limited to Cholec80. Also, investigating alternative temporal modules, such as temporal convolutions or Transformers, could further enhance temporal modeling and scalability, potentially improving both overall and per-tool classification performance.
6. Conclusions
This study introduced novel deep learning architectures for surgical tool classification, integrating DenseNet121 with SE blocks and BiLSTM layers to capture spatial and temporal features. Our results showed that the CNN-SE-FFM-BiLSTM model performed well in single-label and multi-label classification scenarios with a slight drop in overall performance in the multi-label setting compared to the single-label setting. This highlights the complexity of identifying multiple tools simultaneously and the continuous need for more refined approaches in multi-label settings.
The models were competitive with state-of-the-art methods, but further improvements are necessary, particularly in handling imbalanced classes and optimizing for real-time applications. Future work should focus on enhancing feature fusion techniques, broadening dataset evaluations, and refining models for practical use in surgical environments, aiming to improve both surgical workflows and patient outcomes.