Development of a Deep Learning-Based Algorithm to Detect the Distal End of a Surgical Instrument

This work aims to develop an algorithm to detect the distal end of a surgical instrument using object detection with deep learning. We employed nine video recordings of carotid endarterectomies for training and testing. We obtained regions of interest (ROIs; 32 × 32 pixels) at the end of the surgical instrument on the video images as supervised data, and applied data augmentation to these ROIs. We employed a You Only Look Once Version 2 (YOLOv2)-based convolutional neural network as the network model for training. The detectors were validated to evaluate average detection precision. The proposed algorithm used the central coordinates of the bounding boxes predicted by YOLOv2. Using the test data, we calculated the detection rate. The average precision (AP) for the ROIs without data augmentation was 0.4272 ± 0.108. The AP with data augmentation, 0.7718 ± 0.0824, was significantly higher than that without. The detection rates, for which the calculated center-point coordinates fell within 8 × 8 and 16 × 16 pixel regions, were 0.6100 ± 0.1014 and 0.9653 ± 0.0177, respectively. We expect that the proposed algorithm will be efficient for the analysis of surgical records.


Introduction
In recent years, deep learning techniques have been adopted in several medical fields. Image classification [1][2][3][4][5][6][7][8], object detection [9][10][11] and semantic segmentation [12][13][14][15][16][17] techniques have been applied to diagnostic medical images, such as X-ray, computed tomography, ultrasound and magnetic resonance images, to support diagnosis and/or image acquisition. Since minimally invasive surgical procedures, such as robotic and endoscopic surgery, in which the operative field is viewed through an endoscopic camera, are an emerging field in medicine, surgical video-based analyses have also become a growing research topic, as they realize markerless and sensorless analysis without infringing surgical safety regulations. These research approaches have been used for the objective assessment of surgical proficiency, education, workflow optimization and risk management [18][19][20]. To date, several techniques have been proposed for the recognition and analysis of surgical instrument motion and behavior. Jo et al. [21] proposed a new real-time detection algorithm for surgical instrument detection using convolutional neural networks (CNNs). Zhao et al. [22] also proposed an algorithm for surgical instrument tracking, based on deep learning with line detection and a spatio-temporal context. A review of the literature on surgical tool detection studies was recently reported [23]. However, there are still few related works that deal with micro-neurosurgical or microvascular specialties, because these surgeries require special surgical finesse and instruments, and robotic and endoscopic means cannot be employed [24].
We previously reported "tissue motion" measurement results, as representative of the "gentle" handling of tissue during exposure of the carotid artery in a carotid endarterectomy (CEA); therein, we employed off-the-shelf software for the video-based analysis [25]. However, we performed the analysis manually or semi-automatically; the analysis process was therefore only retrospective and quite time-consuming, and we could not realize real-time analysis. To address these issues, we introduced a deep learning technique for the analysis, because it is expected to enable automated analysis for real-time monitoring and feedback. Moreover, for the analysis of surgical performance, a deep learning technique is expected to be more objective [18,19].
In this work, we focus on surgical instrument motion tracking during a CEA, as motion tracking information could be utilized to avoid surgical risk (e.g., unexpected tissue injury), as well as for the objective assessment of surgical performance. For this purpose, we target the distal end of a surgical instrument, as this tip is the most important site that directly interacts with patient tissue (i.e., dissection, coagulation and grasping of tissues) and surgical materials (i.e., holding of needles and cotton). Although some studies have described procedures for detecting the shape or certain parts of surgical instruments (i.e., pose estimation of instruments) [21][22][23][24]26], we focus on the detection of the distal end as the goal of this work. This work aims to develop an algorithm to detect the distal end of a surgical instrument using object detection with deep learning, and we anticipate that the proposed algorithm will be applicable to the analysis of surgical records.

Subjects and Format of Video Records
We retrospectively analyzed video records of nine patients who underwent a CEA. The video records were captured throughout the entire operation using a microscope-mounted video camera device. The video data were saved in the Audio Video Interleave (AVI) format at 30 fps. Note that the institutional review board of Hokkaido University Hospital approved this work.

Preprocessing of Images
We preprocessed the dataset as illustrated in Figure 1. The scenes in the dataset were limited to the state in which the common, internal and external carotid arteries were exposed for the operation (Figure 2). This scene was chosen because it is the most likely to involve surgical risk during a CEA; the surgical video records contain many scenes, depending on the surgical procedures. We converted the AVI files to Joint Photographic Experts Group (JPEG) files. In total, we converted 512 images per patient to JPEG from the full-length videos at 30 fps, to be able to detect the subtle motion of the surgical instrument. Because the AVI files were obtained using cameras from different vendors with different aspect ratios (4:3 or 16:9), we converted the video images to squares based on the longer side, padding the shorter side with zeros (i.e., black). We performed these processes in-house with MATLAB software (The MathWorks, Inc., Natick, MA, USA).
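As a rough illustration of this preprocessing, the following MATLAB sketch reads frames from an AVI record, pads the shorter side with zeros to a square based on the longer side, and saves the frames as JPEG files. The file names are hypothetical, and the selection of the 512 images per patient is omitted.

```matlab
% Minimal preprocessing sketch (hypothetical file names; the selection
% of 512 frames per patient is omitted).
v = VideoReader('cea_case01.avi');            % 30 fps surgical record
outDir = 'frames_case01';
if ~exist(outDir, 'dir'), mkdir(outDir); end

k = 0;
while hasFrame(v)
    I = readFrame(v);
    k = k + 1;
    [h, w, ~] = size(I);
    s = max(h, w);                            % square side = longer side
    J = zeros(s, s, 3, 'like', I);            % zero (black) padding
    J(1:h, 1:w, :) = I;                       % frame placed at top-left
    imwrite(J, fullfile(outDir, sprintf('%05d.jpg', k)));
end
```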

Dataset
We created the supervised data to detect the end of the surgical instruments using in-house software (Figure 3). The software allowed the definition of an arbitrary 32 × 32-pixel ROI at the end of the surgical instrument. The ROI data were output as a text file that included the object name and ROI coordinates. The object name was set to "tip" for one-class object detection. Typically, object names are assigned to several objects in multi-class detection; however, in this work, we have only one, since the detection target is only the end of the surgical instrument.
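The exact text format of the annotation files is not specified in this paper. The following sketch assumes a hypothetical one-line-per-ROI format ("tip x y", with (x, y) the top-left corner of the 32 × 32-pixel ROI) and assembles the table layout expected by MATLAB's trainYOLOv2ObjectDetector, with one [x y w h] box list per image.

```matlab
% Sketch only: the "tip x y" annotation format is an assumption.
roiSize = 32;
imageFiles = {'frames_case01/00001.jpg'; 'frames_case01/00002.jpg'};
tip = cell(numel(imageFiles), 1);
for i = 1:numel(imageFiles)
    annFile = strrep(imageFiles{i}, '.jpg', '.txt');   % annotation file
    fid = fopen(annFile, 'r');
    c = textscan(fid, '%s %f %f');                     % name, x, y
    fclose(fid);
    tip{i} = [c{2}, c{3}, repmat(roiSize, numel(c{2}), 2)];  % [x y 32 32]
end
trainingData = table(imageFiles, tip, ...
    'VariableNames', {'imageFilename', 'tip'});
```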

We divided the supervised data into nine subsets for nested cross-validation [27]. Although there were several possible ways to separate the dataset, we assigned the supervised data to training and test data. Each patient contributed 512 images; we used 4096 images from eight patients for training, and the remaining 512 images for testing (Figure 4). Each subset was an independent combination of eight patients for training and one patient for testing, to prevent the mixing of patient images between the training and test images within a subset. We used the test dataset for the evaluation of the created model, and not for the training process. To learn the training dataset effectively, we performed data augmentation [28][29][30][31] using image rotation from −90° to 90° in 5° steps, as illustrated in Figure 5.
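A minimal sketch of this rotation augmentation is given below, assuming square input frames and MATLAB's imrotate with the 'crop' option so the image size is unchanged. The ROI centre is rotated about the image centre with the corresponding rotation matrix (image coordinates, y pointing down); boxes rotated outside the frame would need to be discarded.

```matlab
angles = -90:5:90;                           % 5-degree steps
I   = imread('frames_case01/00001.jpg');     % hypothetical frame
box = [200, 180, 32, 32];                    % hypothetical [x y w h] ROI
ctr = (size(I, 1) + 1) / 2;                  % centre of a square image

for a = angles
    J = imrotate(I, a, 'bilinear', 'crop');  % rotate image by a degrees
    c = box(1:2) + box(3:4)/2 - ctr;         % ROI centre, relative to ctr
    R = [cosd(a) sind(a); -sind(a) cosd(a)]; % rotation in image coords
    c2 = (R * c')' + ctr;                    % rotated ROI centre
    box2 = [c2 - box(3:4)/2, box(3:4)];      % rotated [x y w h] box
    % ... append (J, box2) to the augmented training data
end
```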

Training Images for Model Creation
We developed the software for object detection with a deep learning technique as in-house MATLAB software. We used a deep learning-optimized machine with an Nvidia Quadro P5000 graphics card (Nvidia Corporation, Santa Clara, CA, USA): 8.9 tera floating-point single-precision operations per second, 288 GB/s memory bandwidth, and 16 GB memory per board. We performed the image training as transfer learning, with initial weights, using You Only Look Once Version 2 (YOLOv2) [32] with the MATLAB Deep Learning Toolbox and Computer Vision System Toolbox. The training hyperparameters were as follows: maximum training epochs, 10; initial learning rate, 0.00001; mini-batch size, 96. We used stochastic gradient descent with momentum for optimization with the initial learning rate. We set the momentum and L2 regularization to 0.9 and 0.0001, respectively. We performed image training nine times, based on the training subsets in Figure 4.
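The configuration above might look as follows in MATLAB. This is a sketch under stated assumptions: the paper does not specify the base network, feature layer, input size or anchor boxes, so ResNet-50 with the 'activation_40_relu' layer and a single 32 × 32 anchor are placeholders.

```matlab
% Training sketch; base network, feature layer, input size and anchor
% boxes are assumptions, not taken from the paper.
inputSize   = [224 224 3];
numClasses  = 1;                             % single class: "tip"
anchorBoxes = [32 32];
lgraph = yolov2Layers(inputSize, numClasses, anchorBoxes, ...
    resnet50, 'activation_40_relu');         % transfer-learning base

opts = trainingOptions('sgdm', ...           % SGD with momentum
    'InitialLearnRate', 1e-5, ...
    'MaxEpochs', 10, ...
    'MiniBatchSize', 96, ...
    'Momentum', 0.9, ...
    'L2Regularization', 1e-4);

detector = trainYOLOv2ObjectDetector(trainingData, lgraph, opts);
```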

Evaluation of Created Models
We incorporated the predicted bounding boxes into the MATLAB software, to reveal the region of the distal end of the surgical instrument as a bounding box. We evaluated the detection of the region of the distal end using average precision (AP), log-average miss rate (LAMR) and frames per second (FPS), for the efficiency of the created model. We examined the bounding boxes against the supervised ROIs using the "evaluateDetectionPrecision" and "evaluateDetectionMissRate" functions in the MATLAB Computer Vision Toolbox.
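A minimal evaluation sketch, assuming an imageDatastore over the test frames and a ground-truth table testGroundTruth in the same [x y w h] box format (both hypothetical names):

```matlab
testImds = imageDatastore('frames_case09');   % hypothetical test set
results  = detect(detector, testImds);        % table of boxes and scores

ap   = evaluateDetectionPrecision(results, testGroundTruth);
lamr = evaluateDetectionMissRate(results, testGroundTruth);
fprintf('AP = %.4f, LAMR = %.4f\n', ap, lamr);
```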

Algorithm and Evaluation of Distal End Detection
The proposed algorithm to detect the distal end of a surgical instrument used the central coordinates of the bounding boxes predicted by YOLOv2. The predicted bounding box had a square shape around the center point of the distal end of the surgical instrument; therefore, we can use the coordinates of the center of the bounding box as a mark to display the distal end. We evaluated whether the calculated center-point coordinates were correctly located by comparing them with the positions of the supervised coordinates. The detection rate of the distal end of the surgical instrument was defined as the proportion of valid bounding boxes whose calculated center-point coordinates fell within an 8 × 8 or 16 × 16 pixel region (Figure 6). We calculated the detection rate of the distal end using the 512 images of the test data. We performed these calculations with MATLAB software, and present all results as mean and standard deviation (SD) over the nine subsets.
Figure 6. Process to calculate the detection rate of the distal end of the surgical instrument.
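The decision rule can be sketched as follows; whether the tolerance region is centred on the supervised point, and the use of the highest-scoring box, are assumptions for illustration.

```matlab
% Detection-rate sketch for one test frame I with supervised ROI "box".
tol = 16;                                     % region size: 8 or 16 px
[bboxes, scores] = detect(detector, I);
hit = false;
if ~isempty(bboxes)
    [~, k] = max(scores);                     % strongest detection
    cPred = bboxes(k, 1:2) + bboxes(k, 3:4) / 2;   % predicted centre
    cTrue = box(1:2) + box(3:4) / 2;               % supervised centre
    hit = all(abs(cPred - cTrue) <= tol / 2);      % inside the region?
end
```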


Statistical Analysis
We present AP, LAMR, FPS and the detection rate of the distal end of the surgical instrument as mean ± SD. We compared the results with and without data augmentation using the Mann-Whitney U-test at the 0.05 significance level.
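In MATLAB, the Mann-Whitney U-test is provided by ranksum; a sketch, with apNoAug and apAug as hypothetical vectors of the nine per-subset AP values:

```matlab
p = ranksum(apNoAug, apAug);    % Mann-Whitney U-test (two-sided)
significant = p < 0.05;         % 0.05 significance level
```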

Results
Table 1 summarizes the AP for the 32 × 32-pixel ROIs detected as bounding boxes trained by YOLOv2. The AP, LAMR and FPS for the 32 × 32-pixel ROIs without data augmentation are 0.4272 ± 0.108, 0.6619 ± 0.0703 and 43.3 ± 0.8, respectively, while with data augmentation they are 0.7718 ± 0.0824, 0.3488 ± 0.1036 and 29.9 ± 4.5, respectively. The AP with data augmentation is significantly higher than that without. Figure 7 illustrates representative examples of the bounding box detection results.
Table 2 summarizes the detection rate of the distal end of the surgical instrument. The detection rates for the 8 × 8 and 16 × 16 pixel regions around the center point are 0.6100 ± 0.1014 and 0.9653 ± 0.0177, respectively. Figure 7 also illustrates representative examples of the detection of the distal end of the surgical instrument.

Discussion
The purpose of this work was to develop and evaluate an algorithm to detect the distal end of a surgical instrument during surgery, using object detection with deep learning. With regard to detection as a bounding box, the AP with data augmentation was 0.7718 ± 0.0824. This result confirmed that data augmentation reduced the dependency on the location and angle of the distal end of the surgical instruments, even though such instruments have a variety of types and shapes. We set the rotation angle of the data augmentation to −90° to 90°, in 5° steps, because the view of the surgical field was commonly recorded with reference to the view of the main surgeon.

An AP of approximately 80% was comparable with that of another report [9] on the detection of small anatomical structures of the brain with deep learning. The FPS with data augmentation showed a slight variation, because the surgical instrument's motion differed across the images of the test data, depending on the surgical procedure of each scene. Relationships between the position or motion of the detectable object and the precision or FPS of the detection have been reported [33,34], even though the entire surgical instrument's motion through the test data was not evaluated. With regard to the proposed algorithm for the detection of the distal end of the surgical instrument, the detection rates for the 8 × 8 and 16 × 16 pixel regions around the center point were 0.6100 ± 0.1014 and 0.9653 ± 0.0177, respectively, even though the evaluated condition was limited to valid bounding boxes. Similarly to previous results on the assessment of surgical skill level [18,19], our results indicate that the proposed algorithm is efficient as an indicator for the analysis of a surgical operation. The proposed algorithm, which detects the distal end of the surgical instrument as a point calculated from the center coordinates of the bounding box, differed procedurally from those in previous reports [35,36]. CNN architectures for object detection have been widely used with many deep learning frameworks [32,[37][38][39]. The advantage of this algorithm is that it creates the dataset using the existing YOLOv2 [32]. Specifically, only the square ROIs around the distal end of the surgical instruments had to be defined, which was easy and did not require information on the whole instrument, because the states of the surgical instruments varied in the operative fields of the video records. For the detection of the distal end, no further calculation of the coordinates depicting the distal end was necessary, because the predicted bounding boxes were always square.
The limitations of the present work are as follows. First, the scenes of the surgical records were limited to CEAs. Although surgical procedures are performed on many parts of the body, it was necessary in a preliminary study to focus on a single procedure. There are many types of surgical instruments, such as monopolar and bipolar handpieces, biological tweezers, forceps, surgical scissors and scalpels. Furthermore, the evaluation of surgical instrument detection methods across many procedures and datasets has been reported [23]. Therefore, a direct comparison of surgical instruments under the same conditions, across various types of operations and techniques, should be considered in future work. Second, the present work only focused on the scene in which the carotid arteries were exposed. The primary reason for focusing on this scene was to evaluate a delicate operative technique with inherent risks, such as cranial nerve injury, the possibility of ischemic stroke caused by plaque disruption, and hemorrhage; this scene is known to involve the most demanding technical performance during a CEA [40]. Third, with regard to the conversion from surgical records to JPEG files, the training data in this work used a 30 fps temporal resolution for all conversions. The scenes in the dataset were selected to focus on detecting the features of the instruments at this temporal resolution, rather than letting the number of features obtained depend on the length of time. Reducing the frame rate for JPEG conversion would provide more image features; however, the repeating pattern of similar movements might not benefit from the new image features obtained. The frame rate of the conversion was adequate for the given procedure, because the approach to the blood vessels was a particularly important part of the surgery, involving the potential risk of bleeding. Nevertheless, this approach contributed novel findings and insights regarding the evaluation of methods for detecting the distal end of a surgical instrument.
Note that previous works [41][42][43] have reported the detection of joint points in animals and human body parts; however, the approaches reported in these works required a dedicated technique to train the dataset. Moreover, many works [35,36,44,45] have reported the detection of entire surgical instruments using semantic segmentation; these works used the Endoscopic Vision Challenge (http://endovis.grand-challenge.org) datasets at the international conference on Medical Image Computing and Computer Assisted Intervention. To the best of our knowledge, the detection of the distal end of a surgical instrument, with a particular focus on CEA operations, has not been reported before, even though our work was based on a currently limited dataset. The targeted surgery uses similar techniques and instruments, even though operations have a variety of purposes and employ a variety of techniques; as with all common procedures, the instruments were almost the same for all operations. In particular, the scene of the approach to the vessels focuses on the vessels and the instruments. Therefore, the detection of the distal end of the surgical instrument with deep learning will be useful for common surgical procedures.

Conclusions
We performed this research to develop an algorithm to detect the distal end of a surgical instrument in CEA operations, using object detection with deep learning. We determined that the detection of the distal end of the surgical instrument attained a high AP and detection rate. We expect that our proposed algorithm will be efficient for the analysis of surgical records.