Robust Real-Time Detection of Laparoscopic Instruments in Robot Surgery Using Convolutional Neural Networks with Motion Vector Prediction

Abstract: More than half of post-operative complications can be prevented, and operative performance can be improved, based on feedback gathered from operations or real-time notifications of risks during operations. However, existing surgical analysis methods are limited because they involve time-consuming processes and subjective opinions. Therefore, the detection of surgical instruments is necessary for (a) conducting objective analyses and (b) providing real-time risk notifications during a surgical procedure. We propose a new real-time detection algorithm for surgical instruments using convolutional neural networks (CNNs). The algorithm is based on the object detection system YOLO9000 and ensures continuity of detection of the surgical tools across successive frames based on motion vector prediction. The method exhibits consistent performance irrespective of the surgical instrument class, with a mean average precision (mAP) of 84.7 over all tools at a speed of 38 frames per second (FPS).


Introduction
According to the World Health Organization (WHO), complications in inpatient surgical operations occur for up to 25% of patients, and at least half of the cases in which surgery led to harm or damage are considered preventable [1]. This means that improvement of surgical performance can lead to better outcomes of surgical operations. Surgical performance can be improved through sophisticated remote manipulation of the robot [2][3][4], but surgical feedback also has a positive effect on surgical performance [5]. While manual evaluation methods such as the objective structured assessment of technical skills (OSATS) and the global operative assessment of laparoscopic skills (GOALS) can assess surgical skills and are beneficial for their improvement, they are both time- and labor-consuming, because surgeries can last multiple hours [6,7]. Manual assessment is also subject to observer bias and can lead to subjective outcomes [8]. Detection of surgical instruments is one of the indicators used for the analysis of surgical operations, and it can be useful for the effective and objective analysis of surgery [9]. It also helps prevent surgical tool collision by informing the operator during the procedure [10].
Various approaches have been published on surgical tool detection. Cai et al. [11] imaged markers, which were placed on surgical instruments, with the use of two infrared cameras. Kranzfelder et al. [12] presented an approach based on radiofrequency identification (RFID) for real-time tracking of laparoscopic instruments; however, at present, there is no proper and reliable antenna system for routine intraoperative applications [12]. Moreover, detection tools utilizing markers interfere with the surgical workflow and require modifications of the tracked instrument [13].
Efforts have been expended to develop vision-based and marker-less surgical tool detection using feature representations, based on color [14,15], gradients [16], or texture [17]. Speidel et al. [18] segmented the instruments in images and recognized their types based on three-dimensional models. Many researchers have also addressed surgical tool detection with the use of convolutional neural networks. Putra et al. [19] proposed for the first time the use of a CNN for multiple recognition tasks on laparoscopic videos. Several works [20][21][22] of surgical tool detection by CNNs have been proposed as a part of the M2CAI 2016 tool presence detection challenge [23]. Jin et al. [24] performed surgical tool localization and phase recognition in cholecystectomy videos based on faster R-CNN. Bodenstedt [25] proposed a new method to detect and to identify surgical tools by calculating a bounding box using a random forest algorithm, and then, extracting multiple features from each bounding box. Shvets et al. [26] introduced a method of robotic instrument semantic segmentation based on deep learning, in both binary and multiclass settings. However, these studies dealt with tool detection in a frame-wise manner, but did not employ time information, and did not detect tools in real time.
In this study, we address the issue of tool detection in laparoscopic surgical videos. Our method is faster and more accurate than cutting-edge technologies [20][21][22]24], and it can be applied during surgery or real-time analyses. We propose a new method to detect the surgical tool in laparoscopic images using YOLO9000 [27] and detect missing surgical tools based on motion vector prediction.

Surgical Tool Detection
The proposed algorithm consists of two stages. The first step aims to detect the surgical tool used in the current frame based on YOLO9000. The second step is to check for the presence of surgical tools that were not detected in the first step, and to detect them additionally (Figure 1).

Figure 1.
Flow chart of the proposed algorithm. The top rectangle is the detection step of the surgical tool in the current frame using YOLO9000, and the remaining patterned blocks correspond to the missing tool detection step.

Detection with YOLO9000
The proposed method detects a surgical tool using YOLO9000, which is based on a convolutional neural network (CNN). A CNN typically extracts feature maps from input images using convolutional and pooling layers. In the convolutional layer, each filter extracts the pattern corresponding to that filter over the entire image area based on a convolution operation. The pooling layer, in turn, reduces the size of the output of the preceding convolutional layer, thereby reducing the size of the feature maps input to the next layer, and consequently the total number of parameters required for training. Maximum pooling and average pooling are mainly used in CNNs.
You Only Look Once (YOLO) is one of the CNN-based detection methods and is suitable for real-time processing. As a general region-based CNN identifies regions of interest directly from every input image, it requires considerable computation to detect the position of an object; as a result, it is difficult to detect objects in real time. YOLO, in contrast, divides each input image into S × S grid cells, and the sizes of the bounding boxes are set in advance. To be specific, the bounding boxes are preset by clustering the box sizes of the ground truth in the training dataset. Therefore, during training or testing, only B pre-defined bounding boxes are calculated for each grid cell. This is a major difference between YOLO and region-based CNNs, such as the Faster R-CNN, and it is the reason why YOLO can detect objects in real time.
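The anchor-box presetting described above can be sketched as a k-means clustering over the ground-truth box sizes, using 1 − IoU as the distance, as in YOLO9000. The following is a minimal illustration; the function names and toy data are our own, not from the paper:

```python
import random

def iou_wh(box, anchor):
    """IoU of two boxes aligned at the origin, each given as (w, h)."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster ground-truth (w, h) pairs into k anchor boxes."""
    random.seed(seed)
    anchors = random.sample(boxes, k)
    for _ in range(iters):
        # Assign each box to the anchor with the highest IoU.
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda i: iou_wh(b, anchors[i]))
            clusters[best].append(b)
        # Recompute each anchor as the mean (w, h) of its cluster.
        anchors = [
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c))
            if c else anchors[i]
            for i, c in enumerate(clusters)
        ]
    return anchors
```

Run over the training set's ground-truth boxes, this yields the B box shapes fixed before training.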
The algorithm we used for the purposes of this study is YOLO9000, the second of the three versions of YOLO. YOLO9000 uses smaller grid cells and modified layers to improve accuracy over the previous version of YOLO. Figure 2 shows the differences between the first version (V1) [28] and the second version (V2) of YOLO. In V1, S is 7 and B is 2, whereas in V2, S is 13 and B is 5. V2 uses more bounding boxes than V1 because the size of each grid cell is reduced by increasing the number of grid cells; as a result, smaller objects are easier to detect. In V1, the configuration of each grid cell in the last layer is (5 × B + C), but it is (5 + C) × B in V2, where C is the number of classes. In V1, the probability that a grid cell corresponds to each class is calculated separately from the probability that each bounding box contains an object; by multiplying these two values, the class to which each bounding box corresponds can be determined. In V2, however, class and object probabilities are obtained for each bounding box individually. Furthermore, the fully connected layer of V1 is replaced with a convolutional layer in V2, so that spatial information is not lost. Finally, unlike V1, V2 uses batch normalization on the convolutional layers to enhance the learning effect in each mini-batch. Leaky ReLU [29] is applied as an activation function for nonlinearity between layers, and maximum pooling is applied. Based on these differences, we applied YOLO9000, which can better preserve the spatial location information of tools, for surgical tool detection.
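As a rough illustration of the two output layouts described above, the size of the final output tensor can be computed per version. We assume the standard 20-class VOC setting for V1 and this paper's 7 tool classes for V2; the function names are ours:

```python
def yolo_v1_output_size(S=7, B=2, C=20):
    # V1: each grid cell predicts B boxes (x, y, w, h, conf) plus ONE
    # shared class distribution over C classes: 5*B + C values per cell.
    return S * S * (5 * B + C)

def yolo_v2_output_size(S=13, B=5, C=20):
    # V2: every anchor box carries its own (x, y, w, h, conf) AND its own
    # class probabilities: (5 + C) * B values per cell.
    return S * S * (5 + C) * B
```

For example, V1 with 20 classes outputs 7 × 7 × 30 = 1470 values, while V2 with the 7 surgical tool classes outputs 13 × 13 × 12 × 5 = 10,140 values.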
Although YOLO9000 adjusts the size of the grid cell and uses identity mapping to detect small objects, it is still difficult to detect small surgical tools, because the input image is resized to 416 × 416, which is typically smaller than the original image. To solve this problem, the third version of YOLO [30] detects objects at three scale levels using residual skip connections and upsampling, and multi-label classification is possible. As a result, the object detection ratio increases, but the computational time also increases and the speed decreases; for this reason, the third version of YOLO is not suitable for surgical tool detection in real time. The surgical tool detection problem consists of seven classes and requires single-label classification in real time. Therefore, the proposed method applies YOLO9000, and the missing tools are additionally detected through motion vector prediction with tool mapping.

Missing Tool Detection with Motion Vector Prediction
The missing tool detection process is subdivided into two steps: tool mapping and tool redetection. In the tool mapping step, the presence of a missing tool is checked. To be specific, the tools identified in the current frame (t) are compared to those of the previous frame (t−1), based on their number and class. If one or more tools in the current frame (t) have the same class as a tool in the previous frame (t−1), the one closest to the tool of the previous frame (t−1) is considered to be the same tool in the current frame (t). Conversely, if a tool exists only in the previous frame (t−1), it is determined that a missing tool exists.
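The tool mapping step can be sketched as follows, under the simplifying assumption that each detection is reduced to a class label and a center point; the helper name `map_tools` is ours, not the paper's:

```python
import math

def map_tools(prev, curr):
    """Match same-class detections between frames by nearest center.

    prev, curr: lists of (cls, x, y) detections. Returns the
    previous-frame detections with no match in the current frame,
    i.e. the candidate missing tools.
    """
    unmatched = []
    available = list(curr)
    for cls, x, y in prev:
        same = [d for d in available if d[0] == cls]
        if same:
            # Closest same-class detection is considered the same tool.
            nearest = min(same, key=lambda d: math.hypot(d[1] - x, d[2] - y))
            available.remove(nearest)  # each detection matches at most once
        else:
            unmatched.append((cls, x, y))
    return unmatched
```

Any tool returned by this mapping is passed on to the redetection step.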
Once the existence of the missing tool is confirmed, motion vector prediction is performed as shown in Figure 2. As YOLO9000 classifies the surgical tool using a predetermined bounding box, if the main feature of the surgical tool lies at the boundary of the bounding box due to the movement of the tool, it cannot be detected. Therefore, the proposed algorithm predicts the position of the surgical tool in the current frame using its positions in the previous two frames. This prediction is based on the center point of the surgical tool. More specifically, the motion vector (MV) of the surgical tool is calculated from its positions in the previous two frames (Equation (1)):

MV = (x_{t−1} − x_{t−2}, y_{t−1} − y_{t−2}). (1)

By adding this motion vector to the position vector of the previous frame, (x_{t−1}, y_{t−1}), the position in the current frame, (x̂_t, ŷ_t), is predicted (Equation (2)):

(x̂_t, ŷ_t) = (x_{t−1}, y_{t−1}) + MV. (2)
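Equations (1) and (2) amount to a simple linear extrapolation of the tool center; a minimal sketch (the function name is ours):

```python
def predict_position(p_t2, p_t1):
    """Predict the tool center in the current frame from the two
    previous frames: MV = p(t-1) - p(t-2); p_hat(t) = p(t-1) + MV."""
    mv = (p_t1[0] - p_t2[0], p_t1[1] - p_t2[1])       # Equation (1)
    return (p_t1[0] + mv[0], p_t1[1] + mv[1])         # Equation (2)
```

For instance, a tool at (100, 100) two frames ago and (110, 105) in the last frame is predicted at (120, 110) in the current frame.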

Tool detection is performed again with the trained network by inputting a cropped image of a pre-determined size centered on the predicted position of the tool. The size of the newly input image is set to be less than or equal to 416 × 416, the input size of YOLO9000, so that smaller objects become more visible. The second result is then compared to the first result obtained in the tool detection step, and if the intersection over union (IoU) of the bounding boxes of the two results is more than 0.5, we regard the same tool as having been detected twice and discard the second result.
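The redetection merge described above can be sketched as follows; `iou` and `merge_results` are illustrative names, and boxes are assumed to be (x1, y1, x2, y2) corner coordinates:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def merge_results(first, second, thresh=0.5):
    """Keep second-pass boxes only if they do not overlap a first-pass
    box with IoU above the threshold (i.e. drop duplicate detections)."""
    kept = list(first)
    for box in second:
        if all(iou(box, f) <= thresh for f in first):
            kept.append(box)
    return kept
```

A redetected box that overlaps a first-pass box with IoU > 0.5 is treated as the same tool and discarded.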

Experimental Conditions and Results
We performed experiments on Ubuntu 16.04 using an NVIDIA GeForce GTX 1080 GPU, 16 GB of memory, and an Intel Core i7-4770K CPU. The training dataset was created by applying vertical flips, horizontal flips, or both to the 1st through 7th videos of m2cai16-tool-locations, resulting in 7492 images in total (Figure 3). In addition, the 10th video of m2cai16-tool-locations was used as the validation set. For the test set, the 8th and 9th videos of m2cai16-tool-locations [31] and videos 11-15 of the m2cai16-tool dataset [32] were used. The number of images in each class and the total number of images in the training and test videos are shown in Table 1.
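The flip augmentation used to build the training set can be sketched as follows; representing the image as a list of pixel rows and the boxes as (x1, y1, x2, y2) pixel coordinates are our simplifying assumptions:

```python
def flip_image_and_boxes(image, boxes, mode):
    """Flip an image (list of rows) and its boxes (x1, y1, x2, y2).

    mode: 'h' (horizontal), 'v' (vertical), or 'hv' (both), matching
    the three augmented copies generated from each training frame.
    """
    h, w = len(image), len(image[0])
    if 'h' in mode:
        image = [row[::-1] for row in image]            # mirror columns
        boxes = [(w - x2, y1, w - x1, y2) for x1, y1, x2, y2 in boxes]
    if 'v' in mode:
        image = image[::-1]                             # mirror rows
        boxes = [(x1, h - y2, x2, h - y1) for x1, y1, x2, y2 in boxes]
    return image, boxes
```

Applying all three modes to each original frame quadruples the training data while keeping the box annotations consistent.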
Appl. Sci. 2019, 9, x FOR PEER REVIEW

The weights used in training were pre-trained on the visual object classes (VOC) dataset, and non-maximal suppression (NMS) [28] was applied.
We compared the performance of the proposed method with results presented in other studies conducting experiments on the same dataset. Table 2 and Figure 4 show the performance estimates for our proposed algorithm, for the winner of the 2016 M2CAI Tool Presence Detection Challenge, and for the algorithm based on the Faster R-CNN [33]. We also compared the performance of the proposed method, which uses the second version of YOLO with motion vector prediction, with the results obtained in our previous work [34] using the first version of YOLO. Moreover, we compared the proposed algorithm with the deformable part models (DPM) [35] and EndoNet [19], which used different datasets to detect surgical tools. The performance comparison was conducted based on the mAP estimates [24]. As shown in Table 2, the proposed method has a higher mAP than the alternative algorithms, including the winner of the M2CAI Tool Presence Detection Challenge, based on the average over all considered tools. Figure 4 shows the mAP values for each class for the algorithms in Table 2, except the Raju method. The proposed algorithm showed lower performance than some algorithms for instruments such as the hook and the clipper, but its mAP was over 80 for all classes, showing uniform performance regardless of class.

Table 3 compares the speed of the proposed algorithm against that of three other algorithms: two algorithms with high performance according to the results provided in Table 2, and an algorithm using random forests [25]. The random forest algorithm automatically generates bounding boxes and determines the instrument type for each bounding box. The speed comparison is based on frames per second (FPS), alongside the accuracy of each algorithm.
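Since the comparison relies on mAP, it may help to recall how a per-class average precision (AP) is commonly computed in PASCAL-VOC style (11-point interpolation); mAP is then the mean of the per-class APs. This is a generic sketch, not the exact evaluation code of any compared method:

```python
def average_precision(recalls, precisions):
    """11-point interpolated AP over a precision-recall curve,
    as commonly used for PASCAL-VOC-style mAP."""
    ap = 0.0
    for r in [i / 10 for i in range(11)]:
        # Highest precision at any recall level >= r (0 if unreachable).
        p = max((prec for rec, prec in zip(recalls, precisions)
                 if rec >= r), default=0.0)
        ap += p / 11
    return ap
```

A detector that reaches recall 1.0 at precision 1.0 scores AP = 1.0; one whose recall saturates at 0.5 with perfect precision scores 6/11 ≈ 0.545.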
The accuracy estimate of each algorithm was taken from the corresponding paper; therefore, different criteria were considered. Compared to the alternative algorithms with a similar average mAP, the proposed algorithm is approximately 7 times faster. Moreover, the proposed algorithm is approximately 1.71 times faster and 1.72 times more accurate than the random forest algorithm. The results of the proposed method are shown in Figure 5. If a tool identified in the previous frame (a) is not found in the current frame (b), the missing tool detection algorithm is applied; (c) is the result of missing tool detection. After the presence of the missing tool is recognized, a white O symbol is displayed in the upper-left corner of the image (c, d). Taking Figure 5 as an example: an irrigator is detected in the previous frame (a), but no surgical tools are detected by YOLO9000 in the current frame (b). Through the surgical tool mapping applied to the previous and current frames, it is recognized that a missing tool exists, which is indicated by the white O symbol in the upper-left corner of the image. Thereafter, the missing irrigator is detected through the motion vector prediction step, and the class of the detected tool is displayed under the white O symbol.
Table 4 compares precision, recall, and F1 scores according to whether the missing tool detection algorithm is applied. Applying missing tool detection decreases the precision by approximately 0.63%, while the recall improves by 4.95% and the F1 score by approximately 2.35%.
The precision decrease is attributed to tools erroneously detected by YOLO9000 being judged as missing tools, upon which an additional (erroneous) detection process is executed.
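For reference, precision, recall, and F1 are computed from true-positive (TP), false-positive (FP), and false-negative (FN) counts as follows (a generic sketch; the function name is ours):

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

This makes the trade-off above concrete: redetecting missing tools removes false negatives (raising recall) but adds false positives (lowering precision).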

Error Analysis
In the object detection task, errors can be classified as false positives and false negatives. A false positive means that the ground truth is false, but the test result is true; in other words, a non-existent surgical tool is detected, for example, when the background is erroneously detected as a surgical tool, or the class of the surgical tool is identified incorrectly. A false negative, on the other hand, means that the ground truth is true, but the test result is false; that is, a surgical tool exists but is not detected.

Figure 6 shows false positives and false negatives observed when detecting surgical instruments using only YOLO9000. The top two images are examples of false positives: the background was detected as a surgical tool in the upper-left image, and a hook was incorrectly detected as a bipolar in the upper-right image. In the latter case, the non-existent bipolar is detected and the existing hook is not, so both the false positive and false negative counts increase by 1. The bottom images are examples of false negatives: the image on the left is a failure to detect a grasper, and the image on the right is a failure to detect a hook.

Figure 7 shows the errors in each considered surgical video when using YOLO9000 only, and when using both YOLO9000 and missing tool detection. Each of the six plots represents the errors in one video; the m2cai16-tool-locations videos are displayed together because their total number of frames is small. The numbers on the vertical axis represent the number of errors.
For example, if the number of surgical instruments erroneously detected in the same frame is two, the error is also registered as two. The bright blue region of each graph represents false positives, and the yellow dotted region represents false negatives. The orange line indicates the total number of errors. In each figure, the bar on the left shows the errors when using only YOLO9000, and the one on the right shows the result obtained using missing tool detection together with YOLO9000.
As shown in the figure, when using only YOLO9000, most errors are false negatives, which can be explained by the many errors due to missing tools. To solve this problem, we additionally applied the missing tool detection algorithm. As a result, the total number of errors decreased, as shown in the right bars, and the number of false negatives also decreased. On the other hand, the number of false positives increased, because wrongly detected surgical tools were judged to be missing tools and were consequently redetected. Figure 8 shows an example of an error caused by missing tool detection. The left image is the result obtained in the previous frame, in which a grasper was detected correctly through missing tool detection; however, a part of the background was detected as a specimenbag.
As a result, in the current frame (right image), the specimenbag was judged as a missing tool through mapping. Correspondingly, the background was detected incorrectly as a specimenbag again due to applying the missing tool detection algorithm. Appl. Sci. 2019, 9, x FOR PEER REVIEW 10 of 13 (a) (b) Figure 8. Example of errors in surgical tool detection using YOLO9000 and missing tool detection (a) The specimenbag was detected incorrectly in the previous frame, (b) The incorrectly detected specimenbag is judged as the missing tool, hence, it is redetected by applying the missing tool detection algorithm in the current frame.
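The mapping step described above can be sketched as follows; the dictionary layout and function name are illustrative assumptions of ours, not the paper's implementation. A class detected in the previous frame but absent from the current frame becomes a redetection candidate, which is also why a false positive such as the misdetected specimenbag can propagate:

```python
def find_missing_tools(prev_dets, curr_dets):
    """Return previous-frame detections whose class is absent from the
    current frame; these become candidates for redetection."""
    curr_classes = {d["cls"] for d in curr_dets}
    return [d for d in prev_dets if d["cls"] not in curr_classes]

# A wrongly detected class (e.g., a background patch labeled "specimenbag")
# is carried forward in exactly the same way, producing the false positives
# discussed above.
```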

Discussion and Conclusions
In this paper, we proposed a new method for detecting and classifying surgical instruments in laparoscopic images. This method has two main advantages: it can be used in real time during operations, and it is more robust than existing methods.
Firstly, the proposed method can detect surgical tools in real time by using the object detection system YOLO9000. Unlike conventional methods, You Only Look Once (YOLO) does not involve a separate region of interest (ROI) extraction step. Conventional methods first identify ROIs in an input image and then classify each ROI, whereas YOLO eliminates the time required to compute ROIs: it divides the input image into a set of grid cells and performs classification on each grid cell. Owing to this key feature of YOLO, the proposed algorithm can detect surgical tools in real time (Table 3).
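As a rough illustration of the grid-cell idea, an object's center directly determines which cell is responsible for detecting it. The 13 × 13 grid below matches YOLO9000's default 416 × 416 input, but the function itself is our sketch, not the paper's code:

```python
def grid_cell(cx, cy, img_w, img_h, S=13):
    """Map an object center (cx, cy) in pixel coordinates to the (row, col)
    of the responsible cell in YOLO's S x S partition of the image."""
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col
```

Because each cell is classified directly, no per-region proposal step is needed, which is where the real-time speed comes from.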
Moreover, the proposed method is robust; that is, it demonstrates uniformly high performance in the detection of surgical instruments of all classes. Based on the results in Table 2 and Figure 4, it can be concluded that, in comparison to other algorithms, the proposed method achieves a uniform mean average precision (mAP) of over 80 for all classes of surgical instruments, as well as the highest average mAP over all considered surgical tools. As shown in Figure 4, while the performance of other algorithms with a similar average mAP deteriorates for certain classes, the performance of the proposed algorithm is plotted as a flat graph, which confirms its robustness.
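For reference, mAP is simply the mean of the per-class average precision (AP) values, so two detectors can share the same mAP while differing greatly in uniformity across classes; the per-class numbers below are hypothetical:

```python
def mean_average_precision(ap_by_class):
    """mAP: the mean of the per-class average precision (AP) values."""
    return sum(ap_by_class.values()) / len(ap_by_class)

# Hypothetical per-class APs: both detectors have an mAP of 85,
# but only the first is uniform (robust) across all classes.
uniform = {"grasper": 86.0, "hook": 85.0, "scissors": 84.0}
skewed = {"grasper": 97.0, "hook": 95.0, "scissors": 63.0}
```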
This robustness is possible owing to the use of the upgraded version of YOLO, YOLO9000. As mentioned earlier, YOLO has a high processing speed because it operates on grid cells instead of ROIs; however, the first version of YOLO lacks accuracy. This can be seen by comparing the performance results of [34] and [24] in Table 2. The study [34] detected surgical instruments using the early version of YOLO, whereas [24] detected surgical instruments using the faster R-CNN, a typical ROI-based algorithm. The results in Table 2 and Figure 4 show that the performance of the early version of YOLO is lower than that of the ROI-based approach. YOLO9000 was developed to solve this problem: as shown in Figure 2, compared to the earlier version of YOLO, YOLO9000 subdivides the input image into smaller grid cells, resulting in more precise detection.
Another reason for the robustness of the proposed algorithm is that it improves the detection of surgical tools across successive frames by predicting missing tools. Missing tool detection leads to better performance because it enables the redetection of surgical tools that were present in the previous frame but are not detected in the current frame. As YOLO9000 uniformly divides the input image into grid cells, detection performance may deteriorate when the main feature is located at the boundary of a grid cell, which can occur, for example, as a surgical tool moves. It is therefore possible to improve detection performance by predicting the motion trajectory of the surgical tool and adjusting the position of the grid cell accordingly. Figure 7 shows the difference in error estimates depending on the presence or absence of the missing tool detection algorithm, and the results in Table 3 also demonstrate the improved performance owing to this algorithm.
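The motion-based prediction can be sketched as a linear extrapolation of the tool center from its two most recent positions. The constant-velocity assumption and the function name are ours; the paper's exact predictor may differ:

```python
def predict_center(prev2, prev1):
    """Extrapolate the next tool center assuming constant velocity:
    next = prev1 + (prev1 - prev2), where each argument is an (x, y) pair."""
    vx, vy = prev1[0] - prev2[0], prev1[1] - prev2[1]
    return (prev1[0] + vx, prev1[1] + vy)
```

The predicted center then indicates where to search for a tool that went missing, e.g., one whose features straddled a grid-cell boundary in the current frame.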
In conclusion, we applied two algorithms, YOLO9000 and missing tool detection, to perform robust detection of surgical instruments in real time. Although the proposed method reduces the error of YOLO9000 by using missing tool detection, detection errors still exist. In particular, missing tool detection requires information from previous frames; therefore, if YOLO9000 detects a surgical tool incorrectly in the previous frame, the error propagates to the current frame.
To solve these problems, it is necessary to incorporate missing tool detection into training. For example, better performance could be obtained by checking for the occurrence of missing tools during training and adaptively adjusting the probability of a surgical tool's presence based on the previous frame. Alternatively, using information from previous frames during training through time-sequence techniques such as long short-term memory (LSTM) [36] may help improve performance. Finally, increasing the accuracy of the dataset may further improve detection performance. In this paper, we used an open dataset that does not annotate whether a surgical tool appears small or occluded; consequently, the detection performance of the proposed method could be improved further if this problem were addressed.
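As a lightweight alternative to a full LSTM, per-class confidences could be smoothed over recent frames with exponential weighting, so a tool seen consistently in prior frames is less likely to drop out for a single frame. This is an illustrative sketch of ours, not a method from the paper:

```python
def temporal_smooth(conf_history, alpha=0.5):
    """Exponentially weight per-class confidences over recent frames.
    conf_history is a list of {class: confidence} dicts, oldest first;
    the most recent frame gets weight 1, the one before it alpha, etc.
    A class absent from a frame contributes 0 for that frame."""
    smoothed, weight_sum, w = {}, 0.0, 1.0
    for frame_conf in reversed(conf_history):  # most recent first
        for cls, c in frame_conf.items():
            smoothed[cls] = smoothed.get(cls, 0.0) + w * c
        weight_sum += w
        w *= alpha
    return {cls: v / weight_sum for cls, v in smoothed.items()}
```

A tool detected with full confidence in the previous frame but missed in the current one retains a nonzero smoothed score, which could be used to trigger redetection instead of a hard dropout.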