1. Introduction
Esophageal cancer (EC) represents a major global health burden, ranking as the eighth most prevalent malignancy and the sixth leading contributor to cancer-related mortality worldwide. Epidemiologic analyses reveal striking geographic disparities, with nearly 80% of cases concentrated in socioeconomically disadvantaged regions. A pronounced gender disparity exists, as males constitute approximately 70% of EC patients, exhibiting 2–5 times higher incidence and mortality rates compared to females [
1,
2]. Population-level survival metrics further underscore systemic healthcare challenges, serving as critical benchmarks for evaluating national cancer control efficacy [
3]. According to 2019 estimates from the American Cancer Society, EC claims over 604,100 lives annually [
4], emphasizing the urgent need for improved diagnostic strategies. While early detection substantially enhances curative potential [
5], current clinical imaging modalities remain insufficient, as evidenced by a dismal 5-year relative survival rate of 20%, which positions EC among the malignancies with the poorest prognoses [
6].
Dysplasia is the earliest known precursor lesion of the esophageal mucosa and the first step in the progression leading to EC [
7]. It is a proliferation of disordered epithelial cells arising from genetic alterations, with a predisposition to invasion and metastasis [
8]. The primary histological subtypes of esophageal cancer are squamous cell carcinoma (SCC) and adenocarcinoma (ADC). SCC originates in the squamous epithelium of the mid and upper esophagus and is prevalent in Asia and Africa, whereas ADC arises from the glandular epithelium of the lower esophagus, is frequently linked to Barrett’s esophagus, and is more common in Western countries. The two subtypes are distinguished by their distinct cellular origins, population-specific incidence patterns, endoscopic appearances, and genetic profiles, which necessitate tailored screening and treatment strategies for each. Tobacco use and excessive alcohol consumption are the primary contributors to SCC and synergistically increase the susceptibility of the squamous mucosa to malignancy. ADC is more prevalent in individuals with obesity and metabolic syndrome, which can result in frequent gastroesophageal reflux disease (GERD) and subsequently Barrett’s esophagus. A low intake of fresh fruits and vegetables, coupled with a high intake of nitrosamine-rich foods, may further elevate the risk of esophageal cancer, particularly SCC, in regions where the disease is endemic. Place of residence and socioeconomic status also influence the likelihood of developing each subtype: individuals in low-resource areas are more susceptible to SCC owing to inadequate nutrition and widespread alcohol and tobacco use, whereas ADC is increasingly common in Western countries as obesity and GERD become more prevalent. SCC is the predominant histological subtype, accounting for approximately 80% of EC cases globally. It is defined as a malignant epithelial tumor characterized by squamous differentiation, rather than necessarily originating from the squamous epithelium. Advanced SCC typically appears as a protruding mass or a depressed, ulcerated lesion. SCC poses diagnostic challenges because its macroscopic features and color changes are frequently nonspecific; it typically presents as an irregular mucosal surface covered by a thin white coating or exhibiting a reddish tint [
9].
Recent advancements in machine learning have significantly propelled EC diagnostic methodologies. Tang et al. pioneered a deep learning framework leveraging esophageal wall thickness measurements from non-contrast chest CT scans, demonstrating sensitivity and accuracy ranges of 75–81% and 75–77%, respectively [
10]. While CT-based approaches provide anatomical insights, their clinical utility is constrained by ionizing radiation exposure and suboptimal soft tissue differentiation compared to white-light imaging (WLI). Addressing endoscopic automation challenges, Liu et al. implemented a dual-modal deep learning architecture integrating depth sensing and RGB data, achieving a 97.54% early EC detection rate and a 74.43% mean Dice coefficient for lesion segmentation [
11]. Nakagawa et al. further advanced depth-specific diagnostics through an AI system capable of distinguishing SM1 from SM2/3 submucosal invasions, with 91.0% accuracy, 90.1% sensitivity, and 95.8% specificity [
12]. Complementing these efforts, Smith et al. validated a multispectral endoscopic scattering technique, attaining 96% sensitivity and 97% specificity for high-grade dysplasia detection, exceeding American Society for Gastrointestinal Endoscopy (ASGE) benchmarks [
13]. Despite these innovations, conventional RGB and multispectral imaging remain fundamentally limited by sparse spectral sampling and discontinuous wavelength coverage at the individual pixel level.
Hyperspectral imaging (HSI), or imaging spectroscopy, captures spatial and spectral data across hundreds of contiguous wavelengths, generating a three-dimensional “hypercube” that maps tissue composition and physiological properties for each pixel [
14]. Unlike conventional imaging modalities, HSI provides a dense spectral signature for every spatial point, enabling the detection of subtle mucosal and submucosal abnormalities that are invisible under standard WLI [
15,
16,
17,
18]. Commercially available HSI systems operate across ultraviolet (200–380 nm), visible (380–780 nm), and near-infrared (780–2500 nm) spectra, offering unparalleled resolution for distinguishing pathological features [
19,
20]. Narrow-band imaging (NBI), a specialized form of HSI, enhances the vascular and mucosal contrast by selectively illuminating tissues with 415 nm (blue) and 540 nm (green) wavelengths [
21]. Blue light (415 nm) is strongly absorbed by hemoglobin in the superficial capillaries, enhancing the mucosal microvasculature, while green light (540 nm) penetrates more deeply to highlight the submucosal vessels, improving the visualization of early neoplastic changes [
22]. For instance, Zhang et al. demonstrated that magnifying endoscopy with NBI (ME-NBI) significantly outperformed WLI in diagnosing early gastric cancer, achieving sensitivity and specificity values of 0.83 and 0.96 compared to 0.48 and 0.67 for WLI [
23]. Despite its diagnostic potential, NBI remains underutilized due to hardware limitations in regard to many endoscopy systems. To address this gap, our study integrates HSI with a novel spectral band selection algorithm, the Spectrum-Aided Vision Enhancer (SAVE), to transform conventional WLI endoscopic images into high-fidelity spectral representations. By coupling the SAVE with advanced deep learning architectures (YOLOv9, YOLOv10, YOLO-NAS, RT-DETR, and Roboflow 3.0), we aim to improve the detection of EC stages and anatomical locations. This approach leverages hyperspectral data to train models for identifying dysplasia and squamous cell carcinoma (SCC), overcoming the spectral sparsity of traditional WLI.
3. Results
This study examined endoscopic images obtained from the KMUH, encompassing patients classified as normal, dysplastic, or diagnosed with SCC. Of these, 26.5% were classified as normal and 30.1% were diagnosed with SCC, with the remaining patients classified as dysplastic. Furthermore, 43.4% of the patients were classified as moderate smokers, 50.6% were smokers, and 6% had never smoked. Alcohol consumption was prevalent among 45.8% of the patients, 43.4% engaged in moderate drinking, and 10.8% reported being non-drinkers. Betel nut consumption was also common in this group: 65.1% of patients reported use, 16.9% had ceased use, and 18.1% reported never having used it. The dataset was randomly partitioned into training, validation, and testing subsets in a 7:2:1 ratio. This study aimed to assess the performance of the SAVE imaging modality in comparison with the WLI modality in diagnosing several esophageal conditions, including dysplasia, SCC, inflammation, and normal tissue. The evaluation was conducted using several object detection algorithms, namely YOLOv9, YOLOv10, RT-DETR, Roboflow 3.0, and YOLO-NAS.
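For clarity, a 7:2:1 random partition of this kind can be produced with two successive splits; the sketch below uses scikit-learn with purely illustrative file names and class counts (none of these values are taken from the KMUH dataset).

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the annotated endoscopic images and their lesion labels;
# the file names and class counts below are illustrative only.
paths = [f"img_{i:04d}.png" for i in range(1000)]
labels = ["normal"] * 300 + ["dysplasia"] * 400 + ["scc"] * 300

# 7:2:1 split: carve off 10% for testing, then 2/9 of the remainder for validation.
train_val_x, test_x, train_val_y, test_y = train_test_split(
    paths, labels, test_size=0.10, random_state=42)
train_x, val_x, train_y, val_y = train_test_split(
    train_val_x, train_val_y, test_size=2 / 9, random_state=42)

print(len(train_x), len(val_x), len(test_x))  # 700 200 100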
Based on the YOLOv9 model, the SAVE achieved higher diagnostic accuracy for most conditions than WLI (see Supplementary Material Figure S4 for the training loss and performance metric curves (precision, recall, and F1 score) of the YOLOv9 models under the WLI and SAVE imaging modalities, which illustrate differences in convergence behavior and overall performance between the two approaches). The precision for dysplasia detection increased from 70.1% with WLI to 81.3% with the SAVE (see Supplementary Figure S5 for the confusion matrices of the YOLOv9 model under WLI and SAVE, which show the classification outcomes for each lesion class (normal, inflammation, dysplasia, and SCC) and allow true positives, false positives, and false negatives to be compared between the two modalities). SCC detection also improved. Across the conditions analyzed, the model generalized well when provided with SAVE images. YOLOv9 performed well relative to the other models, although it was slightly outperformed by Roboflow 3.0 and YOLO-NAS (see Supplementary Figure S6 for the F1 score–confidence curves of the YOLOv9 model under WLI and SAVE, which show how the F1 score varies with the confidence threshold for each imaging approach). The SAVE-based approach was particularly accurate, in terms of precision and recall, for SCC and normal tissue.
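For reference, F1 score–confidence curves such as those in Supplementary Figure S6 are produced by sweeping the detection confidence threshold and recomputing precision, recall, and F1 at each cut-off; the sketch below illustrates the procedure on synthetic detections (all scores, match flags, and counts are fabricated).

```python
import numpy as np

# Synthetic detections: each has a confidence score and a flag indicating whether
# it matched a ground-truth lesion (true positive) or not (false positive).
rng = np.random.default_rng(1)
conf = rng.random(500)                       # detection confidence scores
is_tp = rng.random(500) < conf               # higher confidence -> more likely correct
n_gt = 300                                   # hypothetical number of ground-truth lesions

for t in np.linspace(0.1, 0.9, 9):
    kept = conf >= t
    tp = int(is_tp[kept].sum())
    fp = int(kept.sum()) - tp
    fn = n_gt - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    print(f"conf >= {t:.1f}: precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```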
The results of the YOLOv10 model underscored the effectiveness of the SAVE, especially for identifying SCC and normal tissue (see Supplementary Figure S1 for the training loss and performance metric curves (precision, recall, and F1 score) of the YOLOv10 models under the WLI and SAVE imaging modalities, which illustrate differences in convergence behavior and overall performance between the two approaches). Specifically, the precision for SCC detection increased from 86.6% with WLI to 88.7% with the SAVE (see Supplementary Figure S2 for the confusion matrices of the YOLOv10 model under WLI and SAVE, which show the classification outcomes for each lesion class (normal, dysplasia, SCC, and inflammation)). These gains indicate that the SAVE modality helps the model resolve finer detail, thereby improving detection performance (see Supplementary Figure S3 for the F1 score–confidence curves of the YOLOv10 model under WLI and SAVE, which show how the F1 score varies with the confidence threshold for each imaging approach). Although YOLOv10 performed similarly to YOLOv9, some of its results were slightly better. However, YOLO-NAS performed better for dysplasia and inflammation detection, and Roboflow 3.0 for SCC detection.
On average, the SAVE outperformed WLI across all the results obtained with the RT-DETR model (see Supplementary Figure S7 for the confusion matrices of the RT-DETR model under WLI and SAVE, which show the classification outcomes for each lesion class (normal, inflammation, dysplasia, and SCC)). The precision for detecting normal tissue increased from 66.09% with WLI to 75.4% with the SAVE, as shown in Table 1 (see Supplementary Figure S8 for the predicted lesion detections of the RT-DETR model on WLI and SAVE images, showing the bounding box predictions and class labels generated for each modality). However, the model detected dysplasia poorly, with a recall as low as 17% when using the SAVE. These findings indicate that RT-DETR is less well suited to this particular medical application than the other models, exhibiting lower precision and recall in almost all the conditions tested.
As demonstrated in Table 1, Roboflow 3.0 showed substantial performance improvements with the SAVE, especially for SCC detection, where it achieved a precision of 95.45%. This model obtained the highest accuracy, with an F1 score of 90.32%. For dysplasia detection, all measures improved over the WLI baseline, with the recall increasing from 51.23% with WLI to 53.1% with the SAVE. Roboflow 3.0 is therefore one of the best-suited models for this use case.
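As a quick consistency check, this F1 score follows from the harmonic mean of the reported precision (95.45%) and the Roboflow 3.0 SCC sensitivity with the SAVE reported later in this section (85.7%):

$$ F_1 = \frac{2PR}{P+R} = \frac{2 \times 0.9545 \times 0.857}{0.9545 + 0.857} \approx 0.903. $$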
The results revealed that the SAVE consistently outperformed WLI with the YOLO-NAS model, particularly for dysplasia and inflammation, as shown in Figure 2. The precision for dysplasia detection increased from 75.13% with WLI to 79.81% with the SAVE. The best results were observed for inflammation and dysplasia, although the overall performance differences between YOLO-NAS and Roboflow 3.0 were similar to those observed among the other models. The comparatively lower SCC detection performance indicates that further fine-tuning is needed for this condition.
A comparison of specificity and sensitivity across all the models revealed that the SAVE modality was more accurate than WLI. The SAVE improved the precision and F1 scores of most models, metrics that matter directly for patient care and treatment. Of all the models, Roboflow 3.0 and YOLO-NAS were the strongest, with Roboflow 3.0 performing best for SCC detection and YOLO-NAS showing balanced performance across all the conditions. These results demonstrate how well the proposed SAVE imaging modality enhances the performance of multiple ML models in clinically relevant scenarios.
With the YOLOv9 model, the proposed approach proved robust and showed promising results, especially for SCC detection, where it achieved high overall accuracies and F1 scores under both the WLI and SAVE modalities, as shown in Table 2. In contrast, the dysplasia detection rates were lower, and the recall values indicated difficulty in identifying this class. Inflammation detection was satisfactory, with moderate precision and recall under both modalities. The YOLOv9 model therefore appears reliable, but is less effective at detecting subtler classes such as dysplasia.
Compared with YOLOv9, the YOLOv10 model achieved higher precision for SCC detection, which improved its overall performance. However, dysplasia detection remained an issue, with relatively low recall values, suggesting that although YOLOv10 can identify unambiguous, well-defined malignant lesions such as SCC, it struggles with early-stage lesions such as dysplasia. Inflammation detection improved, but still produced variable results between the WLI and SAVE modalities.
Like most of the models, RT-DETR performed inconsistently. With the SAVE modality, its precision for SCC detection reached 100%; however, dysplasia discrimination, especially under WLI, was poor, with low recall and F1 scores. This implies that the RT-DETR model is implicitly biased toward detecting advanced carcinoma while overlooking early-stage lesions such as dysplasia. In summary, RT-DETR is particularly strong for SCC detection, but lacks the robustness required to detect all lesion categories reliably.
For Roboflow 3.0, the benefit of the SAVE was evident, with improved performance in most categories, especially dysplasia and SCC. Dysplasia, which was difficult for all the models, was identified with higher precision and recall using the SAVE, indicating that this imaging mode provides richer and more useful information in the early stages of the disease. SCC detection also achieved its best performance with the SAVE, reinforcing the modality’s advantage in identifying severe, well-defined cancerous lesions.
The same pattern persisted for the YOLO-NAS model, with the SAVE exceeding WLI on each of the three evaluation metrics. The normal, dysplasia, and SCC classes showed markedly improved recall and precision with the SAVE. The model’s ability to detect dysplasia, which is critical for early-stage EC, was substantially better with the SAVE, yielding higher recall and F1 scores than WLI. Figure 3 shows that SCC detection was highly accurate under SAVE conditions, with generally high evaluation metrics, indicating that YOLO-NAS offers broad detection capability when used with the SAVE modality.
These results provide evidence that the SAVE modality improves the detection capabilities of all the algorithms relative to WLI, as shown in Figure 4. The improvement is most apparent for classes such as dysplasia and SCC, where early-stage and well-defined lesions benefit from the enhanced spectral detail of the SAVE. The overall accuracy, sensitivity, and specificity analyses for most classes highlight the SAVE’s advantage over WLI in improving diagnostic accuracy and in identifying a broader range of EC conditions.
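The per-class sensitivity and specificity values discussed here can be derived from a confusion matrix in a one-vs-rest fashion; the sketch below illustrates the computation on a made-up 4 × 4 matrix (the counts are illustrative and are not the study's results).

```python
import numpy as np

# One-vs-rest accuracy, sensitivity (recall) and specificity from a 4 x 4
# confusion matrix (rows = true class, columns = predicted class).
# The counts below are made up purely for illustration.
classes = ["normal", "inflammation", "dysplasia", "SCC"]
cm = np.array([[50, 2, 1, 0],
               [3, 12, 1, 0],
               [2, 1, 40, 4],
               [0, 0, 3, 45]])

total = cm.sum()
for i, name in enumerate(classes):
    tp = cm[i, i]
    fn = cm[i].sum() - tp
    fp = cm[:, i].sum() - tp
    tn = total - tp - fn - fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / total
    print(f"{name:12s} sensitivity={sensitivity:.3f} "
          f"specificity={specificity:.3f} accuracy={accuracy:.3f}")
```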
Confidence intervals (CIs) at 95% were computed for the performance scores presented in Table 2. For SCC detection with YOLOv9, the WLI F1 score was 84.3% (CI: [71.7%, 96.9%]), whereas the SAVE F1 score was 90.4% (CI: [80.2%, 100%]), with p = 0.03. For dysplasia, the F1 score increased from 60.3% (CI: [51.5%, 69.1%]) with WLI to 65.5% (CI: [57.0%, 73.8%]) with the SAVE (p = 0.04). Inflammation and normal tissue also exhibited improvements; however, the difference for inflammation was not statistically significant (p = 0.12), likely attributable to the limited sample size (n = 16). The gains in precision, recall, and F1 score for the SAVE relative to WLI were statistically significant for the majority of classes, affirming the robustness of the SAVE across lesion types and models.
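One common way to obtain per-class 95% CIs of this kind is a percentile bootstrap over the test cases; the sketch below illustrates the idea with synthetic labels and predictions (the data, seed, and resample count are illustrative assumptions and do not reproduce the study's actual computation).

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Synthetic per-case ground truth and predictions for one class
# (1 = SCC, 0 = not SCC); in practice these come from the held-out test set.
y_true = rng.integers(0, 2, size=120)
y_pred = np.where(rng.random(120) < 0.85, y_true, 1 - y_true)  # ~85% agreement

# Percentile bootstrap: resample test cases with replacement and recompute F1.
scores = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), len(y_true))
    scores.append(f1_score(y_true[idx], y_pred[idx]))
low, high = np.percentile(scores, [2.5, 97.5])
print(f"F1 = {f1_score(y_true, y_pred):.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```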
This study was conducted using a static dataset of pre-collected, annotated endoscopic images, and real-time detection capabilities were not part of the model evaluation process. However, an application named the Esophageal Cancer Detection Application (EC Detector), built around the YOLOv9 model for cancer detection in esophageal images, was developed with PyQt5. The application simplifies use of the model and provides tools for image uploading and analysis. Figure S9a shows the initial interactive screen of the tool, and Figure S9b illustrates how images are selected. Once an image is chosen, the model highlights regions of possible cancerous tissue with bounding boxes for three lesion classes, as shown in Figure S9c (see Supplementary Figure S9 for screenshots of the Windows-based CAD application used for esophageal cancer detection). This application supports early diagnosis, thereby improving clinical outcomes, and is intended for future extension, such as identifying other cancers or incorporating additional ML algorithms. The current SAVE image conversion is performed offline; however, the application provides efficient processing and may be adapted for real-time use in subsequent iterations.
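For illustration, a comparable viewer can be assembled in a few dozen lines of PyQt5; the sketch below assumes an Ultralytics-style YOLO interface and a hypothetical fine-tuned weights file (ec_yolov9_save.pt), and is not the authors' EC Detector implementation.

```python
import sys

import numpy as np
from PyQt5.QtGui import QImage, QPixmap
from PyQt5.QtWidgets import (QApplication, QFileDialog, QLabel,
                             QPushButton, QVBoxLayout, QWidget)
from ultralytics import YOLO  # assumes an Ultralytics-style YOLO interface

MODEL_PATH = "ec_yolov9_save.pt"  # hypothetical fine-tuned weights, not provided


class ECDetectorSketch(QWidget):
    """Minimal viewer: load an endoscopic image and overlay predicted boxes."""

    def __init__(self):
        super().__init__()
        self.setWindowTitle("EC Detector (illustrative sketch)")
        self.model = YOLO(MODEL_PATH)
        self.view = QLabel("Load an endoscopic image to analyse")
        button = QPushButton("Open image")
        button.clicked.connect(self.open_image)
        layout = QVBoxLayout(self)
        layout.addWidget(button)
        layout.addWidget(self.view)

    def open_image(self):
        path, _ = QFileDialog.getOpenFileName(
            self, "Select image", "", "Images (*.png *.jpg *.jpeg)")
        if not path:
            return
        result = self.model(path)[0]        # run detection on the chosen frame
        annotated = result.plot()           # BGR ndarray with boxes and labels drawn
        rgb = np.ascontiguousarray(annotated[:, :, ::-1])
        height, width, _ = rgb.shape
        image = QImage(rgb.tobytes(), width, height, 3 * width, QImage.Format_RGB888)
        self.view.setPixmap(QPixmap.fromImage(image))


if __name__ == "__main__":
    app = QApplication(sys.argv)
    window = ECDetectorSketch()
    window.show()
    sys.exit(app.exec_())
```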
In contrast to NBI, which requires specialized endoscopic equipment, the SAVE enhances standard WLI images through a software-based method, offering greater accessibility and the ability to be applied retrospectively to previously acquired images. Furthermore, the SAVE enables adaptable spectral feature extraction and deep learning optimization, providing a customizable diagnostic tool that is not bound by the fixed parameters of conventional NBI systems. In future work, we intend to integrate SAVE imaging and the deep learning detection models into a real-time clinical workflow. This will entail improving model inference speed, reducing latency, and verifying performance during live endoscopic procedures. Real-time deployment could substantially improve the early detection of esophageal cancer during routine endoscopy, thereby enhancing decision-making and patient outcomes.
Overall, the findings demonstrated that the SAVE improved the performance of all the evaluated models. The SCC F1 score increased from 84.3% to 90.4% for YOLOv9 and from 87.3% to 90.3% for Roboflow 3.0. For dysplasia, the precision of YOLOv9 improved from 72.4% (WLI) to 76.5% (SAVE), and that of YOLO-NAS from 75.1% to 79.8%. Roboflow 3.0 achieved the highest sensitivity (85.7%) and F1 score (90.3%) for SCC detection using SAVE imaging, indicating its potential value in high-risk cancer detection, whereas YOLO-NAS delivered the most balanced performance across all lesion categories, making it a suitable option for general clinical applications. YOLOv10 and RT-DETR also benefited from the SAVE, albeit with somewhat less consistent performance across lesion types, while YOLO-NAS produced consistent and precise results for normal, dysplastic, and hemorrhagic tissues, offering a versatile solution for comprehensive esophageal lesion detection.
Although the SAVE improved the performance of most models, some variability was noted, particularly for dysplasia detection: the recall for dysplasia decreased slightly for the RT-DETR model, presumably owing to architectural sensitivity or class imbalance. These findings underscore that, although the SAVE offers substantial overall advantages, model-specific performance variations warrant further examination and refinement. Nonetheless, the combination of the SAVE with deep learning algorithms presents a promising non-invasive tool for the early diagnosis of esophageal cancer. The marked improvement in F1 scores and sensitivity achieved with the SAVE reflects its ability to reveal early-stage or subtle lesions that may be overlooked with WLI, helping endoscopists identify at-risk patients earlier, optimize the timing of intervention, and improve prognosis in a clinical context.
4. Discussion
This study assessed the effectiveness of the SAVE system in enhancing the detection of esophageal cancer lesions, specifically dysplasia and SCC, in comparison with traditional WLI. We evaluated five deep learning models (YOLOv9, YOLOv10, YOLO-NAS, RT-DETR, and Roboflow 3.0) for detecting esophageal lesions using HSI. Combining the SAVE system with deep learning models markedly improved detection precision, especially for SCC and dysplasia, relative to conventional WLI. Of the models assessed, Roboflow 3.0 performed best, with the largest gains in F1 score and sensitivity. These results illustrate the clinical potential of the SAVE system as a dependable tool for the early identification and diagnosis of esophageal cancer. The primary clinical application of this work is improved EC detection during endoscopic procedures. Although endoscopists will retain the central role in performing endoscopies, incorporating the SAVE system with deep learning models offers substantial assistance in improving diagnostic precision. The SAVE system extends the visual capabilities of conventional WLI, facilitating the detection of subtle lesions, particularly early-stage esophageal cancer, dysplasia, and SCC, that might otherwise be overlooked. This technology does not replace the endoscopist, but functions as a valuable diagnostic aid, offering real-time, AI-enhanced support for the identification and characterization of lesions. Its adoption may improve diagnostic accuracy and timeliness, enabling earlier interventions and better patient outcomes, especially among high-risk groups. At present, little technology exists that can directly transform conventional WLI into HSI for comparative analysis. Although HSI has been investigated in numerous medical applications, its integration with WLI for real-time detection and comparison remains a nascent field of research, and few studies directly compare WLI images with HSI for EC detection. This study addresses that gap by assessing the efficacy of the SAVE system in converting WLI images into HSI and improving the identification of esophageal lesions. This innovation offers a new route to improved diagnostic accuracy; however, further progress in HSI conversion technology is required to broaden its applicability in clinical practice.
A notable limitation is that the data were collected from a single hospital, which may introduce institutional and patient-population bias. The generalizability of the findings is therefore limited, and the study’s conclusions are not yet ready for broad application. To improve generalizability, future research should collect data from multiple centers across diverse countries and ethnicities [33]. A further limitation stems from the preprocessing phase, in which original images of varying dimensions were resized to a standardized 640 pixels per side. Although this reduced the computational load, it may have discarded image detail relevant to the models’ performance. Subsequent research should therefore consider preserving higher image resolution during preprocessing or employing adaptive-resolution strategies that retain critical detail while remaining computationally efficient. Another concern relates to the computational requirements of HSI and ML techniques, whose intricate calculations demand substantial time and resources. Future work should therefore investigate improved algorithms and advanced computing hardware, including graphics processing units (GPUs) and tensor processing units (TPUs). Real-time detection within microseconds requires improvements in both software and hardware, along with specialized accelerators for specific operations, such as GPUs, field-programmable gate arrays, application-specific integrated circuits, and TPUs [
34]. In addition, ensemble learning, which combines multiple models to enhance precision and bolster the system’s dependability for clinical use [35], could be explored (a minimal sketch of such an ensemble over per-model detections is given at the end of this section). These methods can improve the technology’s usability, greatly assisting healthcare providers, especially endoscopists, and enhancing the reliability of the results they produce. The SAVE methodology applied here to EC could readily be adapted to other cancer types, and possibly to other medical conditions for which imaging is the most effective route to early diagnosis; incorporating additional cancer types would demonstrate the system’s adaptability. Such advancements could significantly improve patients’ quality of life through earlier detection and timely treatment. Improvements in medical imaging methodologies, guided by the findings of this study and similar research, may substantially augment diagnostic capabilities and the assessment of treatment outcomes. Future research should therefore address the limitations of this study and investigate novel applications of HSI and ML in medical diagnosis; refining the technology, expanding the study population, and exploring the diverse conditions to which this approach may be relevant could aid in the creation of a dependable diagnostic instrument.
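As a purely illustrative example of the ensemble idea referenced above, detections from several models can be pooled and de-duplicated with a simple cross-model non-maximum suppression; the box coordinates, scores, and class indices below are invented.

```python
import numpy as np

# Minimal sketch of combining detections from several models (e.g. outputs of two
# detectors for the same frame) by pooling the boxes and applying non-maximum
# suppression. Box format: [x1, y1, x2, y2, confidence, class_id].
def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def ensemble_nms(detections_per_model, iou_thr=0.5):
    pooled = sorted(np.concatenate(detections_per_model).tolist(),
                    key=lambda d: d[4], reverse=True)
    kept = []
    for det in pooled:
        # Keep a box only if it does not heavily overlap an already-kept box
        # of the same class.
        if all(iou(det, k) < iou_thr or det[5] != k[5] for k in kept):
            kept.append(det)
    return kept

# Hypothetical detections from two models for one image (class 3 = SCC, 2 = dysplasia).
model_a = np.array([[100, 120, 220, 260, 0.88, 3]])
model_b = np.array([[105, 118, 225, 255, 0.81, 3],
                    [300, 310, 360, 380, 0.40, 2]])
print(ensemble_nms([model_a, model_b]))
```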