1. Introduction
Unilateral biportal endoscopic spine surgery (UBE) has emerged as an innovative minimally invasive technique that provides a wide surgical field and excellent maneuverability, allowing surgeons to perform decompression and discectomy through two independent portals under continuous irrigation. As its clinical indications continue to expand, UBE is expected to become a standardized procedure in spine surgery [1].
However, because UBE relies on continuous irrigation and a highly magnified view, even minor bleeding can rapidly obscure visualization. Intraoperatively, this visual obstruction not only increases the risk of dural tears and conversion to open surgery but also leads to prolonged operative times, increased surgeon workload, and the need for higher irrigation pressures. Postoperatively, subjective evaluation of hemostasis within the limited endoscopic field may leave small, persistent bleeders unnoticed, potentially resulting in spinal epidural hematomas. The hemostatic strategy in UBE has not yet been standardized, and variations in irrigation flow, pressure control, and use of hemostatic devices may further influence intraoperative visibility.
In spinal surgery in general, the incidence of symptomatic postoperative spinal epidural hematoma has been reported to be approximately 0.5%, with minimally invasive approaches showing slightly higher rates than conventional techniques [2]. Recent systematic reviews of UBE have reported postoperative epidural hematoma and dural tear rates ranging from 0.27% to 1% and 2% to 2.5%, respectively [3,4,5]. Although these complications are not directly caused by intraoperative bleeding, maintaining a clear visual field through adequate hemostasis is critical for procedural safety. Nevertheless, objective and standardized methods for evaluating intraoperative bleeding are lacking, leaving surgeons to rely on subjective visual judgment.
Recently, artificial intelligence (AI) has shown promising applications in surgical planning, imaging analysis, and intraoperative guidance. Its adoption has rapidly increased in parallel with the refinement of minimally invasive spine surgery techniques. AI has the potential to support personalized surgical decision-making and enhance intraoperative safety, although this approach remains in its early stages [6,7]. In gastrointestinal endoscopy, AI-based systems have enabled real-time bleeding detection and lesion identification [8]. However, despite these advances, AI has not yet been applied to bleeding detection or visual field assessment in spinal endoscopy. This gap highlights the novelty and clinical importance of the present study [9].
This pilot study aimed to develop and validate a deep-learning model as an initial approach for the objective detection and quantification of bleeding in spinal endoscopy. By exploring the feasibility and utility of a reliable means of assessing intraoperative bleeding, we sought to provide foundational evidence for future strategies to evaluate hemostatic control and to enhance the safety and effectiveness of UBE procedures.
2. Materials and Methods
2.1. Ethics Statement
This retrospective study was approved by the Institutional Review Board of our institution. Written informed consent for the use of surgical videos for research purposes was obtained from all participating patients.
2.2. Deep Learning Workflow Design
This study was designed based on the hypothesis that a two-step deep learning workflow improves the precision and clinical relevance of bleeding detection (Figure 1). In the first step, a base model was trained using a large-scale dataset of endoscopic images paired with hue, saturation, and value (HSV)-based masks. This extensive training allowed the network to comprehensively learn the unique visual patterns of the endoscopic field and automatically extract all red regions, achieving more detailed and consistent segmentation than manual annotation alone. In the second step, the base model was fine-tuned using a highly curated dataset in which a UBE expert manually excluded clinically irrelevant red areas, such as bone marrow and vasculature. The model’s performance was validated by comparing its outputs with ground-truth images manually annotated by experienced surgeons. Finally, we confirmed the model’s feasibility for video-level assessment by performing frame-by-frame inference on recorded surgical videos and reconstructing them into a processed output.
2.3. Training Dataset Collection
In this study, 20 videos of UBE procedures comprising discectomy, decompression, and fusion surgeries performed at a single institution by multiple surgeons were used to construct a training dataset. Endoscopic videos were recorded using either full high-definition or 4K image storage systems. Cases with specific intraoperative complications or poor video quality were excluded to ensure dataset reliability. From these videos, 223,568 original-resolution images were extracted at 2 frames per second (fps). Each extracted image was subsequently converted to the HSV color space, and a corresponding binary mask was generated to extract red regions within the endoscopic field.
The HSV threshold parameters were defined by a surgeon with experience in >300 UBE procedures, encompassing regions that could clinically be perceived as red, such as bleeding, vasculature, and cancellous bone. Specifically, the conditions were as follows: hue, 0–8° or 172–180°; saturation, 90–255; and value, 50–255. Concurrently, a field-of-view mask (saturation > 60, value > 70) was applied to explicitly exclude the lateral black margins of the endoscope from the region of interest, thereby eliminating meaningless background. The resulting masks were saved as 8-bit grayscale images, with the red regions represented in white.
Subsequently, the original images and the masks were resized to 512 × 512 pixels. Only images with a red area ratio exceeding 1% within the endoscopic field were selected as training data to exclude insignificant red noise. Overall, 145,763 image–mask pairs containing clinically meaningful red regions were used to train the deep learning model.
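The 1% inclusion criterion can be expressed as a short sketch (helper names are illustrative; the ratio is computed only over field-of-view pixels, as described above):

```python
import numpy as np

def red_area_ratio(red_mask, fov_mask):
    """Fraction of field-of-view pixels flagged as red (nonzero = positive)."""
    fov = fov_mask > 0
    if not fov.any():
        return 0.0
    return float((red_mask > 0)[fov].mean())

def keep_for_training(red_mask, fov_mask, min_ratio=0.01):
    """Apply the >1% red-area threshold for including an image-mask pair."""
    return red_area_ratio(red_mask, fov_mask) > min_ratio
```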
2.4. Base Model for Deep Learning
The U-Net++ architecture, which is specialized for image segmentation, with ResNet-34 as the encoder, was employed in the deep-learning model used in this study. All layers were randomly initialized without using any pretrained weights, such as those from ImageNet. The training labels comprised red mask images generated using the aforementioned HSV thresholding conditions.
All input images were in the RGB format and resized to 512 × 512 pixels. The pixel values were scaled to a range of 0–1 and subsequently standardized using a channel-wise mean of [0.485, 0.456, 0.406] and a standard deviation of [0.229, 0.224, 0.225] to stabilize the training process.
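The input standardization described above can be sketched as follows (a minimal NumPy version; the function name is ours):

```python
import numpy as np

# Channel-wise statistics used for standardization (ImageNet convention).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(rgb_uint8):
    """Scale an HxWx3 RGB uint8 image to [0, 1], then standardize per channel."""
    x = rgb_uint8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD
```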
Binary cross-entropy with logits loss was used as the loss function, and the Adam optimizer was applied at a learning rate of 1 × 10−4. The batch size was set to 4, and early stopping with a patience of 20 epochs was used to prevent overfitting.
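The early-stopping criterion (patience of 20 epochs on the loss here; the fine-tuning stage later uses a metric-based variant) can be sketched framework-agnostically. This is a generic implementation of the stated criterion, not the study code:

```python
class EarlyStopping:
    """Stop training when a monitored quantity stops improving.

    mode="min" suits a loss (base model, patience 20); mode="max" suits
    scores such as Dice/IoU (fine-tuning, patience 10).
    """

    def __init__(self, patience=20, mode="min"):
        self.patience = patience
        self.sign = 1.0 if mode == "min" else -1.0
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, value):
        """Record one epoch's value; return True when training should stop."""
        if self.sign * value < self.best:
            self.best = self.sign * value
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a typical loop, `step()` is called once per epoch with the validation loss, and training terminates when it returns True.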
2.5. Fine-Tuning Conditions
The base model, trained using HSV-based masks, was designed to indiscriminately extract all the red regions, including those with limited clinical relevance, such as vascular redness or cancellous bone. Therefore, fine-tuning was performed to specialize the model to detect clinically meaningful bleeding.
The fine-tuning dataset comprised 350 highly curated images selected from the first-stage training pool derived from the 20 surgeries used to train the base model. Rather than relying on random sampling, an experienced UBE surgeon purposively selected these frames across various surgical phases (e.g., bone drilling, soft tissue decompression) to ensure maximum morphological diversity. Annotations were meticulously delineated using precise polygon approximations via Labelme software (v5.6.0), which were subsequently converted into pixel-wise binary masks. Specifically, 120 red mask images representing definite bleeding and 230 zero-mask images representing non-bleeding red areas, such as vessels or cancellous bone, were used.
The model architecture used for fine-tuning was the same U-Net++ built with a ResNet-34 encoder, and the training was resumed using the weights from the base model. All input images were in the RGB format and were resized to 512 × 512 pixels. Data augmentation techniques such as horizontal flipping and variations in brightness and saturation were applied.
Binary cross-entropy with logits loss was used as the loss function, and the model was optimized using the Adam optimizer with a learning rate of 1 × 10−4. However, early stopping was not based on loss values; instead, it was implemented using the Dice coefficient and intersection-over-union (IoU) scores calculated against the ground-truth (GT) masks (described later). The training was terminated if no improvement in these metrics was observed for 10 consecutive epochs.
2.6. Model Evaluation
To prevent any risk of data leakage, the evaluation dataset was constructed from three UBE surgeries (lumbar disk herniation, spinal canal decompression, and foraminal decompression) that were strictly independent of the training and fine-tuning datasets. Specifically, these evaluation cases were completely excluded at the patient and video levels from all training phases. Although the evaluation surgeries were performed by the same primary surgeon who contributed to the training dataset, the videos were uniformly recorded under standardized 4K high-definition conditions—with consistent camera systems, recording devices, and lighting—to ensure a rigorous assessment of the model’s true performance.
From these independent cases, 60 images representing a balanced distribution of bleeding-area ratios were extracted. Initially, a panel of five orthopedic surgeons—each with 1 to 5 years of UBE experience and having performed 100 to 300 procedures—independently annotated the bleeding areas to generate GT masks. To ensure the reliability of the reference standard as a reflection of average expert perception, inter-observer agreement was assessed using pairwise Dice and IoU coefficients. Because excessive variation in the ground truth makes it impossible to accurately evaluate an AI model’s true performance, we restricted the final reference panel to the three surgeons who demonstrated the highest mutual consistency. This rigorous selection resulted in a final set of 180 highly reliable GT masks (60 images × 3 surgeons). The average Dice and IoU values among these three selected experts were subsequently used to define the GT agreement.
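The rater-consistency screening described above can be illustrated with a small sketch: mean pairwise Dice is computed for each rater, and the most mutually consistent raters are retained. Function names and the exact selection rule (ranking by mean pairwise Dice) are our assumptions, not the study code:

```python
import numpy as np
from itertools import combinations

def dice(a, b):
    """Pairwise Dice between two binary masks (defined as 1.0 if both empty)."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(a, b).sum() / denom

def most_consistent_raters(masks_by_rater, keep=3):
    """masks_by_rater: dict mapping rater -> list of masks in the same image order.
    Returns the `keep` raters with the highest mean pairwise Dice."""
    raters = list(masks_by_rater)
    mean_dice = {r: [] for r in raters}
    for r1, r2 in combinations(raters, 2):
        scores = [dice(m1, m2)
                  for m1, m2 in zip(masks_by_rater[r1], masks_by_rater[r2])]
        s = float(np.mean(scores))
        mean_dice[r1].append(s)
        mean_dice[r2].append(s)
    return sorted(raters, key=lambda r: -np.mean(mean_dice[r]))[:keep]
```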
For the binary detection of bleeding (presence/absence), we evaluated image-wise classification performance. The output probability maps from the fine-tuned model were binarized using a sigmoid function and a threshold of 0.5 to generate predicted masks. To exclude regions outside the surgical field, circular field-of-view masks were applied to both the GT and predicted masks. Based on these binarized outputs, we reported overall accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), along with the confusion matrix.
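The binarization and image-wise metrics described above can be sketched as follows (helper names are ours; an image is counted as a "bleeding" prediction if any positive pixel remains after masking, which is our reading of the image-wise classification):

```python
import numpy as np

def image_level_prediction(logits, fov_mask, thr=0.5):
    """Sigmoid + 0.5 threshold inside the field of view; image-level label
    is True if any positive pixel remains."""
    prob = 1.0 / (1.0 + np.exp(-logits))
    pred = (prob >= thr) & (fov_mask > 0)
    return bool(pred.any()), pred

def classification_metrics(gt_labels, pred_labels):
    """Accuracy, sensitivity, specificity, PPV, and NPV from image-wise labels."""
    gt = np.asarray(gt_labels, dtype=bool)
    pr = np.asarray(pred_labels, dtype=bool)
    tp = int(np.sum(gt & pr)); tn = int(np.sum(~gt & ~pr))
    fp = int(np.sum(~gt & pr)); fn = int(np.sum(gt & ~pr))
    return {
        "accuracy": (tp + tn) / len(gt),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "npv": tn / (tn + fn) if tn + fn else float("nan"),
    }
```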
For area-based segmentation performance, we focused on the overlap between the predicted and GT masks. Dice and IoU were calculated strictly on GT-positive images (GT bleeding area > 0) to avoid artificial overestimation from true-negative zero-mask cases, in which both GT and predictions were empty. These scores were computed by comparing the predicted masks with each individual surgeon’s GT mask, and the average values were used for evaluation.
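The overlap metrics can be written out directly; the sketch below follows the standard Dice and IoU definitions and, as in the evaluation above, is meant to be applied only to GT-positive masks (both scores are undefined when prediction and GT are simultaneously empty):

```python
import numpy as np

def dice_iou(pred, gt):
    """Dice = 2|A∩B| / (|A|+|B|); IoU = |A∩B| / |A∪B|. Assumes gt.sum() > 0."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(2.0 * inter / (pred.sum() + gt.sum())), float(inter / union)

def mean_scores_over_raters(pred, gt_masks):
    """Average a prediction's Dice/IoU over each rater's GT mask,
    restricted to GT-positive masks."""
    pairs = [dice_iou(pred, gt) for gt in gt_masks if gt.astype(bool).sum() > 0]
    dices, ious = zip(*pairs)
    return float(np.mean(dices)), float(np.mean(ious))
```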
To further characterize segmentation performance, we examined how Dice and IoU varied according to GT agreement and bleeding extent. First, the correlations between the model’s Dice/IoU scores and the GT agreement rates were analyzed. Based on the GT agreement rate, images were categorized into three groups (≥0.70, ≥0.80, and ≥0.90), and the model’s performance was summarized within each group. Furthermore, we restricted the area-based subgroup analysis to GT-positive images and categorized them into two groups according to the bleeding-area ratio: a low-bleeding group (>0–20%) and a high-bleeding group (>20%), enabling a comparison of model performance across different levels of bleeding extent.
2.7. Video Application
Frame-by-frame inference was performed on recorded UBE surgical videos to evaluate the model’s performance in a dynamic environment. Each frame was resized to 512 × 512 pixels and processed using the fine-tuned model. The bleeding-area ratio was quantified by calculating the proportion of pixels predicted as bleeding relative to the total number of pixels in each frame. For evaluation, composite videos were generated by synchronizing the original endoscopic footage, the model’s segmentation overlay, and a temporal plot representing the calculated bleeding-area ratio over time. Furthermore, to evaluate the feasibility of real-time clinical integration, the model’s computational performance was assessed. The inference speed (frames per second, FPS) and processing latency per frame were measured using 4K-resolution (3840 × 2160 pixels) endoscopic video inputs operating on an NVIDIA GeForce RTX 5070 Ti GPU (Blackwell architecture, 8960 CUDA cores, 16 GB GDDR7 memory).
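The per-frame bookkeeping (bleeding-area ratio, latency, and FPS) can be sketched as below. The `infer` argument is a placeholder for the fine-tuned model's forward pass and is an assumption; the reported 4K/GPU figures come from the actual model, not from this sketch:

```python
import time
import numpy as np

def analyze_frames(frames, infer, thr=0.5):
    """Run per-frame inference; return (ratios, mean latency in s, FPS).

    `infer` maps a frame to a per-pixel bleeding-probability map of the
    same height and width (placeholder for the fine-tuned model).
    """
    ratios, latencies = [], []
    for frame in frames:
        t0 = time.perf_counter()
        prob = infer(frame)                   # model forward pass
        mask = prob >= thr                    # binarize at 0.5
        latencies.append(time.perf_counter() - t0)
        ratios.append(float(mask.mean()))     # bleeding pixels / total pixels
    mean_latency = sum(latencies) / len(latencies)
    return ratios, mean_latency, 1.0 / mean_latency
```

The per-frame ratio series is what is plotted against time in the composite evaluation videos.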
2.8. Statistical Analysis
Statistical analyses were performed using Python software (version 3.8.20; Python Software Foundation). Continuous variables are presented as mean ± standard deviation or median with interquartile range, depending on distribution.
Diagnostic performance metrics including accuracy, sensitivity (recall), specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated based on the confusion matrix. Confidence intervals (95% CI) for sensitivity, specificity, PPV, and NPV were computed using the Clopper–Pearson exact method.
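The Clopper–Pearson exact interval can be computed from beta-distribution quantiles; a minimal sketch using SciPy (the function name is ours; the edge-case conventions for k = 0 and k = n are standard):

```python
from scipy.stats import beta

def clopper_pearson(successes, trials, alpha=0.05):
    """Exact (Clopper-Pearson) two-sided CI for a binomial proportion."""
    k, n = successes, trials
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi
```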
The assumption of normality for the performance metrics was assessed using the Shapiro–Wilk test. Accordingly, Spearman’s rank correlation coefficients were calculated to evaluate the relationship between the model performance metrics and GT agreement rates. Two-tailed p-values were reported, and statistical significance was set at p < 0.05.
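This analysis pipeline maps directly onto SciPy routines; a minimal sketch (function name is ours):

```python
from scipy.stats import shapiro, spearmanr

def correlation_with_agreement(scores, agreement, alpha=0.05):
    """Shapiro-Wilk normality check, then Spearman's rank correlation
    between per-image model scores and GT agreement rates."""
    normal = (shapiro(scores).pvalue >= alpha
              and shapiro(agreement).pvalue >= alpha)
    rho, p = spearmanr(scores, agreement)
    return {"normal": normal, "rho": float(rho), "p": float(p)}
```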
2.9. AI Usage in Model Development
During the implementation of the deep-learning architecture, an artificial intelligence tool (ChatGPT GPT-5; OpenAI, San Francisco, CA, USA) was utilized to assist in generating and refining the source code. All AI-assisted code segments were rigorously reviewed, tested, and validated by the authors to ensure technical accuracy and the integrity of the resulting model.
4. Discussion
Among the various spinal endoscopic techniques, UBE stands out because it is minimally invasive while providing a broad operative field. Its expanding adoption reflects its potential to become a standard procedure in spine surgery [10,11]. In existing reviews, the incidence of postoperative epidural hematoma reportedly ranges from 0.27% to 1% in lumbar UBE, whereas studies on cervical UBE have described rates of approximately 3% to 5% [4,5,12]. Another important consideration is that, owing to the continuous irrigation setting and magnified endoscopic view, even minor bleeding can swiftly compromise visibility, complicate surgical maneuvers, and potentially lead to open conversion [13]. Furthermore, because the procedure is performed under continuous irrigation, the exact amount of intraoperative bleeding cannot be accurately assessed. This limitation may result in surgery proceeding based on subjective assessments that may fail to reflect the true extent of bleeding, potentially leading to unrecognized hemodynamic changes or underestimated blood loss within the irrigated field [14].
Such postoperative hematomas and open conversions caused by impaired endoscopic visualization can ultimately lead to neurological deterioration and prolonged hospitalization, thereby undermining the greatest advantage of endoscopic surgery: its minimally invasive nature. Hemostasis, a fundamental aspect of UBE, is therefore one of the most critical technical elements. Proper intraoperative hemostasis is essential to reduce the occurrence of postoperative hematoma; however, there is currently no established quantitative framework or metric to objectively evaluate hemostatic techniques during surgery.
In this pilot study, we developed and evaluated a deep-learning model capable of quantitatively assessing visual obstruction caused by bleeding during UBE procedures. Bleeding regions during endoscopic surgery are highly dynamic and morphologically irregular. In the surgical field, blood often mixes with tissues and irrigation fluid, resulting in ambiguous boundaries that are difficult to define. Accurate segmentation under these conditions requires a model that can capture fine structural details and spatial variations across frames. We adopted U-Net++, an extension of the standard U-Net architecture that incorporates dense skip connections across multiple nested layers. This design enables more effective integration of low-level spatial information and high-level semantic features, facilitating accurate detection of both diffuse and sharply defined bleeding regions [15]. This capability is particularly beneficial for precise estimation of the bleeding area.
We first trained a base model to detect red areas using HSV thresholds. The ResNet-34 encoder was initialized randomly without conventional ImageNet pretraining. This decision was driven by the substantial domain gap between natural images and the unique structural and visual characteristics of endoscopic surgery. Specifically, endoscopic images possess a distinct geometric layout—a circular surgical field surrounded by a black background—which differs fundamentally from the edge-to-edge spatial distribution of objects in standard natural images. By learning from our endoscopic dataset from scratch, the network successfully acquired a domain-specific foundation optimized for this unique spatial layout as well as the uniquely red-dominated, constantly irrigated environment. The model achieved high Dice scores relative to the HSV masks, indicating strong learning of all red-colored structures, including vessels and cancellous bone. However, the model showed poor agreement with clinically relevant bleeding areas, suggesting that color thresholding alone was insufficient for isolating meaningful bleeding regions. To address this issue, the base model was subsequently fine-tuned using expert-annotated masks that explicitly distinguished clinically meaningful bleeding (red masks) from irrelevant red regions such as vessels, cancellous bone, and reflections (zero masks). Fine-tuning, in this context, refers to a transfer learning approach in which the weights of a pre-trained model are used for initialization, followed by additional training on task-specific data to refine performance for the clinically relevant target.
Although the fine-tuning dataset was relatively small, our preliminary experiments revealed an important finding: simply increasing the number of fine-tuning images did not necessarily improve the model’s agreement with the independent ground truth. During the developmental phase, we empirically tested approximately 10 different dataset configurations. We observed that including overly ambiguous or highly complex images tended to confuse the model and increase the risk of overfitting. Therefore, an experienced surgeon purposively curated a dataset of 350 representative and unambiguous images (120 with definite bleeding, 230 with clear non-bleeding red structures) to provide clear, high-quality educational examples for the model. By prioritizing annotation quality and clear morphological features over sheer volume, combined with robust data augmentation, the moderate capacity of the ResNet-34 encoder, and rigorous early stopping, the model effectively avoided overfitting and generalized well to the independent test set.
The high sensitivity (0.93) demonstrated by the model is directly attributable to our strategy of using a base model trained with HSV thresholds to comprehensively extract all red regions within the endoscopic field. This initial stage of “exhaustive red region extraction” served as the foundation for a “safety-first” design, prioritizing the avoidance of missed bleeding. However, this design led to a moderate specificity of 0.60, highlighting the inherent challenge of distinguishing clinically significant hemorrhage from ambiguous red signals. Our inter-rater analysis (Table 1) underscores this difficulty, revealing that even experienced surgeons vary considerably in their diagnostic thresholds; notably, the model’s performance falls within the range of expert clinical judgment, where some surgeons demonstrate specificities as low as 0.43. This suggests that the model’s current behavior reflects a cautious, high-sensitivity threshold similar to that of some clinicians, prioritizing the detection of potential bleeding over the exclusion of ambiguous signals. Crucially, the model’s detection challenges were primarily confined to minimal bleeding; as noted in our results, all false-negative images were limited to cases in which the bleeding area occupied less than 5% of the endoscopic field. While the model still faces challenges in detecting minimal bleeding points, its current performance provides a quantitative baseline that aligns with the protective decision-making process of surgeons. By providing an objective metric for visual obstruction, this study establishes a foundational framework for standardized hemostatic evaluation in UBE.
Regarding the segmentation quality, our two-stage training strategy effectively refined the model’s ability to delineate bleeding contours. In this context, the Dice coefficient and the IoU are widely used overlap-based metrics for evaluating image segmentation performance, quantifying the agreement between predicted and ground-truth masks. Both metrics range from 0 to 1, where 1 indicates perfect overlap and 0 indicates no overlap; Dice can be regarded as a pixel-wise F1-score, whereas IoU is a more conservative overlap measure defined as the ratio between the intersection and the union of the predicted and ground-truth masks [16]. The observed inter-rater variability (initial mean Dice: 0.66) likely reflects the inherent biological ambiguity of endoscopic bleeding, where blood frequently mixes with irrigation fluid, creating indistinct boundaries that pose challenges for manual annotation even among experts. In our study, the final ground truth was defined as the average expert perception of the most consistent raters, providing a robust and clinically realistic reference for meaningful hemorrhage. The fine-tuned model achieved a median Dice of 0.79 and a median IoU of 0.65 on GT-positive images. Given the inherent complexity of endoscopic bleeding patterns—including continuous irrigation, motion blur, and the presence of vessels and cancellous bone—these values indicate that the model can reproduce clinically relevant bleeding contours with high fidelity. Although these metrics are not directly equivalent to individual inter-surgeon Dice because of differences in reference standard construction, the model’s performance closely approaches the range of agreement observed among our experienced surgeons (median Dice: 0.78). This high fidelity suggests that the subsequent fine-tuning effectively taught the network to distinguish clinically significant bleeding from irrelevant red signals by focusing on essential morphological features.
The primary objective of this model was to reliably detect and quantify clinically meaningful bleeding during surgery. From this perspective, stratified performance evaluation based on the GT bleeding-area ratio provided additional insights. In GT-positive images with a bleeding-area ratio >20%, the model demonstrated high reproducibility, with median Dice and IoU scores of 0.83 and 0.71, respectively. By contrast, in frames with smaller bleeding areas (>0–20%), the Dice and IoU scores were more variable, particularly in cases with <1% bleeding. This discrepancy reflects the inherent difficulty of accurately aligning the precise morphology of subtle hemorrhages, where even a slight pixel-wise deviation between the predicted and ground-truth masks leads to a disproportionately large decrease in overlap-based scores. This pattern is consistent with the profile of false-negative cases and highlights an area for future improvement—namely, enhancing sensitivity and boundary precision for subtle bleeding while maintaining specificity. Collectively, these findings suggest that the model achieves highly reproducible performance for clearly defined, clinically significant bleeding, especially when there is strong consensus among experienced operators, and can therefore be used as an objective tool for intraoperative bleeding assessment and monitoring visualization quality.
In addition to still-image evaluation, this model can process complete surgical videos, calculate bleeding-area ratios for each frame, and visualize temporal changes, enabling an objective assessment of field clarity over time. Regarding its computational efficiency, the system achieved an average inference speed of 77.92 FPS with a latency of only 12.83 ms per frame when processing 4K video inputs on an NVIDIA GeForce RTX 5070 Ti GPU. This performance substantially exceeds standard video frame rates (30–60 FPS), demonstrating that the current architecture is capable of high-fidelity, real-time bleeding quantification without intraoperative delay. Such video-based applications may support quantitative evaluation of hemostatic techniques or irrigation strategies, contribute to standardized reporting of endoscopic visibility, and provide a basis for training and quality assurance in UBE.
This study had several limitations. First, external validation was not performed using data from other institutions, which limits the generalizability of the model across different surgical settings and equipment. Second, the training and evaluation datasets were derived from a limited number of surgeons, raising the possibility of operator-related bias. Third, our selection of the three most consistent surgeons to define the ground truth may have introduced a reference standard inflation bias. While this ensured high-quality training data for this pilot study, it may not fully capture the ecological validity of the broader, more ambiguous rater disagreements encountered in clinical practice. Fourth, as highlighted by the false-negative cases, the model’s sensitivity to minimal bleeding (<5% area) was suboptimal. While such minute bleeding may not immediately impair visibility under continuous irrigation, missing these “hidden” sources could potentially lead to postoperative complications, such as epidural hematomas, representing a clinically meaningful limitation. Finally, potential hardware and software constraints may affect the feasibility of applying the model to real-time video analysis, particularly in high-resolution or high-frame-rate settings.
To enhance generalizability and robustness, alternative model architectures should be explored in future studies, and comparative evaluations should be conducted across models with different learning structures. Because only data from the same source were used for training and fine-tuning, expanding the dataset to include multicenter cases from various surgeons and surgical environments will be essential to minimize institutional or operator bias. Furthermore, establishing clinically meaningful cutoff values for bleeding areas is a crucial next step. Future studies analyzing broader clinical datasets are necessary to investigate how AI-derived quantitative metrics correlate with actual surgical outcomes (e.g., operative time and complication rates), thereby determining appropriate thresholds that can actively guide intraoperative hemostasis. Integrating multimodal information—such as irrigation flow rate, surgical instrument tracking, and intraoperative surgeon maneuvers—may further refine model performance and facilitate its practical application in real-time surgical workflows.