3.1. Autofluorescence Bronchoscopy
We collected and recorded a series of AFB airway exams for 20 lung cancer patients scheduled for diagnostic bronchoscopy at our University hospital. All participants provided informed consent in accordance with an IRB protocol approved by our university’s Office of Research Protections. All exams were performed in the operating room under standard clinical conditions. The physician started an exam in the trachea and then scanned the following major airways: right main bronchus (RMB), right upper lobe bronchus (RUL), right lower lobe bronchus (RLL), left main bronchus (LMB), left upper lobe bronchus (LUL), and left lower lobe bronchus (LLL). Olympus BF-P60 bronchoscopes and the Onco-LIFE autofluorescence light source and video camera were used for all airway exams. The 20 recorded videos were collected at a rate of 30 frames/s and consisted of 66,627 total video frames. The recorded video sequences ranged in duration from 1 min 3 s to 3 min 34 s (median, 1 min 45 s), with video frame counts ranging between 1890 and 6426 frames (median, 3149 frames).
To perform the experiments, we created a 685-frame AFB dataset. From the 20 recorded cases, we selected 208 frames depicting clear ground truth bronchial lesions, with our selection striving to capture variations in airway location, lesion size, and viewing angle.
Figure 5 gives sample lesion frames in the training and validation datasets. In addition, we incorporated 477 frames depicting normal conditions, chosen to represent a variety of airway locations and camera angles.
We point out that other researchers have noted that segmentation methods not trained with any normal images often generate false positives on normal images; to address this problem, a separate classification network may be used to classify frames as normal or abnormal [34,35]. For our application, training with normal frames in the dataset provides added immunity to false positives and improves detection precision, without affecting the recall and mean Dice metrics derived for the validation dataset during training (all metrics are discussed further below). We observe that, by doing so, all attention-based networks, such as CaraNet, SSFormer, and our proposed ESFPNet, do not erroneously detect lesions in normal images once the model converges. Lastly, the dataset includes more normal frames than lesion frames because such frames are far more common in a typical endoscopic exam.
An expert observer selected all lesion frames using the standard OpenCV CVAT annotation tool and defined segmentations with the MATLAB Image Labeler app [36,37]. Two to four hours were spent analyzing each video, with the inspection time dependent on the video length and number of lesions. Up to three passes were made for each video to confirm frame choices, with two other experienced observers helping to corroborate decisions. We did not produce inter- or intra-observer agreement results to measure observer variations. (Our anonymized dataset is available to the public on our laboratory’s web site under “Links/Public Databases” at Ref. [38].)
Per Table 2, the 685-frame AFB dataset was split into training, validation, and testing subsets using approximately a 50%, 25%, and 25% split, respectively. To avoid leakage between the data subsets, every lesion and normal frame from a given case was placed in the same subset to guarantee independence between the training, validation, and testing phases. Thus, because of this constraint, our actual splits into training, validation, and testing subsets were 47%, 28%, and 25%, respectively, as shown in Table 2. Lastly, the overall lesion regions roughly varied in size from 800 to 290,000 pixels within a video frame’s circular scan region made up of
$\pi \cdot 352^{2}$ (≈390,000) pixels.
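For illustration, a minimal sketch of the case-level (grouped) splitting described above, using scikit-learn's GroupShuffleSplit and hypothetical frame names and case identifiers (the realized frame-level percentages depend on how many frames each case contributes, which is why our actual split became 47%/28%/25%):

```python
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical per-frame metadata: each frame is tagged with the case it came from,
# so that no case contributes frames to more than one subset (prevents leakage).
frames = ["c01_f0001.png", "c01_f0042.png", "c02_f0310.png", "c03_f0057.png"]
case_ids = ["case01", "case01", "case02", "case03"]

# First carve off roughly 25% of the cases for testing...
outer = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_val_idx, test_idx = next(outer.split(frames, groups=case_ids))

# ...then split the remainder into training and validation, again by case.
sub_frames = [frames[i] for i in train_val_idx]
sub_groups = [case_ids[i] for i in train_val_idx]
inner = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, val_idx = next(inner.split(sub_frames, groups=sub_groups))
```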
Over the complete 208-frame lesion dataset, a total of 128 distinct lesions were identified during ground-truth construction. Because a particular lesion is generally visible across multiple consecutive frames in a video sequence, considerable similarity will, of course, exist between adjacent, or nearly adjacent, video frames depicting a lesion. To reduce the impact of frame correlation in the AFB dataset, 61 of the 128 distinct lesions were represented by only one frame in the dataset. For the remaining 67 lesions, we included one or more additional frames for a given lesion only if the added frames showed dramatic differences in size, viewing angle, or illumination. Because our focus is on single-frame detection, the lesion regions appearing in these added frames were all designated as distinct lesions in the dataset. Overall, the 208-frame lesion dataset depicts 311 regions representing lesions, with some frames depicting more than one lesion region.
Note that our strategy for selecting multiple frames for a particular lesion is similar to that employed by other endoscopic imaging researchers. For example, with respect to the public colonoscopy datasets used in the next section, the CVC-ClinicDB colon database often depicts a particular polypoid lesion over six or more frames from a video, with each frame offering a distinct look [39]. Also, Urban et al. sampled every fourth video frame depicting a polyp for their dataset [40].
We compared the Unet++, SSFormer-S, SSFormer-L, CaraNet, and three ESFPNet models [19,20,22], along with traditional image-processing methods based on the simple R/G ratio and a machine-learning approach using a support vector machine (SVM) [12,16]. Chang et al. give details for the R/G ratio and SVM (only #1) methods used here [12]. Note that the Unet++ model had no pretrained components [22], while the CaraNet drew on a pretrained Res2Net encoder (see Figure 3) [19]. Finally, the SSFormer-S and SSFormer-L models used the same pretrained MiT-B2 and MiT-B4 encoders, respectively, as those used by the ESFPNet-S and ESFPNet-L models.
All network models for the Unet++, CaraNet, SSFormer-S, SSFormer-L, ESFPNet-T, ESFPNet-S, and ESFPNet-L architectures were trained under identical conditions. We employed the Adam optimizer with learning rate $= 0.0001$, $\beta_1 = 0.9$, and $\beta_2 = 0.999$, similar to other recent endoscopic video studies conducted for the PraNet and CaraNet [19,41]. A network was trained for 200 epochs with batch size = 16 and image size = 352 × 352. To account for the imbalance in the number of normal and lesion frames, sampling weights for normal and lesion frames were set to 1.43 and 4.95, respectively, using the PyTorch function WeightedRandomSampler to ensure an equal number of normal and lesion frames (i.e., 8) in each training batch. We used the same loss function $\mathcal{L} = \mathcal{L}_{\mathrm{IoU}}^{w} + \mathcal{L}_{\mathrm{BCE}}^{w}$ used by Wei et al. and Lou et al., where $\mathcal{L}_{\mathrm{IoU}}^{w}$ and $\mathcal{L}_{\mathrm{BCE}}^{w}$ are the weighted global intersection over union (IoU) loss and weighted local pixel-wise binary cross-entropy (BCE) loss, respectively [19,42]. The training process drew upon the training and validation datasets. During each training epoch, data augmentation techniques were applied to increase and diversify the training dataset. In particular, we employed randomized geometric transformations (rotation and flipping) and color jittering (image brightness and contrast changes), using methods built into PyTorch. Data augmentation, which helps reduce overfitting and improve network robustness, has been a standard procedure for endoscopic video analysis, where large datasets are generally hard to compile [43]. Notably, all of the top teams in a recent gastroenterology challenge employed data augmentation [44].
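For concreteness, the following minimal PyTorch sketch illustrates the balanced sampling and weighted loss described above. The dataset object, per-frame labels, and model are hypothetical placeholders, and the structure_loss function follows the publicly available PraNet/CaraNet reference formulation of the weighted IoU plus weighted BCE loss [19,42]; it is a sketch of the training set-up, not our exact training script.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, WeightedRandomSampler

# Per-frame sampling weights: 1.43 for normal frames and 4.95 for lesion frames,
# so that each batch of 16 contains roughly 8 normal and 8 lesion frames.
# `frame_is_lesion` is a hypothetical list of booleans, one per training frame.
weights = [4.95 if is_lesion else 1.43 for is_lesion in frame_is_lesion]
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
train_loader = DataLoader(train_dataset, batch_size=16, sampler=sampler)

def structure_loss(pred, mask):
    """Weighted IoU + weighted BCE loss, per the public PraNet/CaraNet reference code."""
    # Pixels near mask boundaries receive larger weights.
    weit = 1 + 5 * torch.abs(F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction='none')
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))
    pred = torch.sigmoid(pred)
    inter = ((pred * mask) * weit).sum(dim=(2, 3))
    union = ((pred + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # default betas (0.9, 0.999)
for epoch in range(200):
    for images, masks in train_loader:
        optimizer.zero_grad()
        loss = structure_loss(model(images), masks)
        loss.backward()
        optimizer.step()
```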
To measure segmentation accuracy, we computed the mean Dice and mean IoU metrics:
$$\mathrm{Dice}(A,B) = \frac{2\,|A \cap B|}{|A| + |B|}, \qquad \mathrm{IoU}(A,B) = \frac{|A \cap B|}{|A \cup B|},$$
where $A$ and $B$ equal the segmented lesion and ground truth lesion, respectively, and $|A|$ is defined as the area of $A$. All metrics were computed using the evaluation tool provided with PraNet [41].
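For illustration, a minimal NumPy sketch of the per-frame versions of these two metrics (a simple worked example, not the PraNet evaluation tool itself):

```python
import numpy as np

def dice_and_iou(pred_mask: np.ndarray, gt_mask: np.ndarray, eps: float = 1e-8):
    """Compute Dice and IoU for one pair of binary masks (1 = lesion pixel)."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()   # |A ∩ B|
    union = np.logical_or(pred, gt).sum()    # |A ∪ B|
    dice = 2.0 * inter / (pred.sum() + gt.sum() + eps)
    iou = inter / (union + eps)
    return dice, iou

# Toy 4x4 frame: the prediction overlaps half of the ground truth lesion.
gt = np.zeros((4, 4), dtype=np.uint8); gt[1:3, 1:3] = 1      # 4 ground truth pixels
pred = np.zeros((4, 4), dtype=np.uint8); pred[1:3, 2:4] = 1  # 4 predicted pixels, 2 overlapping
print(dice_and_iou(pred, gt))  # Dice = 2*2/(4+4) = 0.5, IoU = 2/6 ≈ 0.33
```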
As an additional goal, we also assessed lesion detection performance for the AFB dataset. We point out in passing that colonoscopy researchers have universally limited their focus to pixel-based region segmentation and have not considered region detection [19,20,22,45]. For our studies, any segmented region that overlapped a ground truth lesion was designated as a true positive (TP). A false positive (FP) corresponded to a segmented region, whether on a lesion or normal test frame, that did not overlap a ground truth lesion segmentation. Lastly, a false negative (FN) corresponded to a ground truth lesion not identified by a method. Given these definitions, we also used the following standard metrics to measure detection performance:
$$\mathrm{recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}},$$
where recall, or sensitivity, denotes the percentage of ground truth lesions detected, while precision, or positive predictive value, measures the percentage of segmented regions corresponding to correctly detected lesions.
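A minimal sketch of this region-level bookkeeping for one frame is given below; it treats connected components of the binary masks as the "regions" and encodes one plausible reading of the overlap rules above, not our exact evaluation code.

```python
import numpy as np
from scipy import ndimage

def detection_counts(pred_mask: np.ndarray, gt_mask: np.ndarray):
    """Region-level TP/FP/FN tallies for one frame (illustrative sketch).

    Assumptions: a predicted region overlapping any ground truth lesion counts
    as a TP; a predicted region with no overlap is an FP; a ground truth lesion
    with no overlapping prediction is an FN.
    """
    pred_labels, n_pred = ndimage.label(pred_mask)
    gt_labels, n_gt = ndimage.label(gt_mask)

    tp = sum(1 for p in range(1, n_pred + 1) if np.any(gt_mask[pred_labels == p]))
    fp = n_pred - tp
    detected_gt = sum(1 for g in range(1, n_gt + 1) if np.any(pred_mask[gt_labels == g]))
    fn = n_gt - detected_gt

    recall = detected_gt / n_gt if n_gt else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    return tp, fp, fn, recall, precision
```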
Figure 6 first gives the training and validation results for the ESFPNet-S model. Both the segmentation accuracy and detection performance (Figure 6a,b, respectively) steadily improve until leveling off around epoch 120, with little indication of overfitting. Based on these results, we froze the model parameters at epoch 122. (Other models were similarly frozen by optimizing the mean Dice measure over the validation dataset.) Lastly, Figure 6c,d depict the impact of the significant-region-size parameter on detection performance. As this parameter varies over 100 (smaller regions retained), 400 (default value for later tests), and 800 pixels (stricter limit), the precision and recall results vary over a 5–10% range.
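The significant-region-size parameter simply discards segmented regions whose pixel area falls below a threshold before detection scoring. A minimal sketch using OpenCV connected components, with the default threshold of 400 pixels assumed:

```python
import cv2
import numpy as np

def filter_small_regions(pred_mask: np.ndarray, min_area: int = 400) -> np.ndarray:
    """Remove segmented regions whose pixel area is below `min_area`."""
    mask = (pred_mask > 0).astype(np.uint8)
    n_labels, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    keep = np.zeros_like(mask)
    for label in range(1, n_labels):  # label 0 is the background
        if stats[label, cv2.CC_STAT_AREA] >= min_area:
            keep[labels == label] = 1
    return keep
```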
Table 3 next gives results for the AFB test set, while Figure 7 depicts sample AFB segmentation results. The R/G ratio and SVM methods gave by far the worst results overall. The ESFPNet-S model gave superior segmentation and precision performance over all other models. In addition, the ESFPNet-S model’s 0.940 recall nearly matches the SSFormer-L model’s 0.949 recall. More specifically, the SSFormer-L and ESFPNet-S models detected 111 and 110 ground truth regions, respectively, over the 53-frame AFB test set, which contained 117 ground truth lesion regions. The seven regions missed by ESFPNet-S tended to be small (<1000 pixels) and/or appeared darker (less illuminated) and blurred, with the largest missed region made up of 10,383 pixels. Notably, ESFPNet-L exhibited slightly lower performance than ESFPNet-S. This could be attributed to (1) its significantly more complex MiT-B4 encoder, which was originally designed for the SegFormer to segment the much larger 1024 × 2048 Cityscapes images, and (2) the correspondingly more complex ESFP decoder [24,46]; i.e., the larger model implicitly requires more data to train it optimally. We also note that only the R/G ratio, SVM, and Unet++ methods produced false positive regions on a normal frame.
Regarding the segmentations in Figure 7, the ESFPNet-S model gave the best performance, with gradual declines in performance observed for the other deep learning models. Lastly, the R/G ratio method missed a lesion on frame #1627 of case 21405-195, whereas the SVM method consistently over-segmented the lesions in all examples.
3.2. Colonoscopy
We next considered ESFPNet’s performance for the problem of defining lesions (polyps) in colonoscopy video. The study’s aim was to demonstrate our proposed model’s robust performance and adaptability to a different endoscopy domain.
For the studies, we drew on five highly cited public video datasets that have been pivotal in the evaluation of polyp analysis methods [18]. These datasets include CVC-ClinicDB [39], Kvasir-SEG [47], ETIS-Larib [48], CVC-ColonDB [3], and CVC-T [49]. The total number of video frames in these datasets ranged from 60 to 1000, making each dataset similar in size to our AFB dataset.
Three distinct experiments, which considered learning ability, generalizability, and polyp segmentation, were completed using the datasets. The experiments mimicked the procedures performed by Wang et al. and Lou et al. for their respective SSFormer and CaraNet architectures [19,20]. For all experiments, we used the mean Dice and mean IoU metrics. For the generalizability experiment, we also considered the structural measurement $S_{\alpha}$ [50], the enhanced alignment metric $E_{\phi}$ [51], and the pixel-to-pixel mean absolute error (MAE) metric, as considered by Lou et al. [19]. All metrics again were computed using the evaluation tool provided with PraNet [41].
Learning ability experiment: We trained, validated, and tested the three ESFPNet models, along with the Unet++, DeepLabv3+ [52], MSRF-Net, and SSFormer-L models. Each model was trained and validated with data from a particular database and then tested on a test subset from the same database. This gave an indication of a model’s ability to learn and make predictions on data from a previously seen source. We followed the experimental scheme used for the MSRF-Net [53]. In particular, using the CVC-ClinicDB (612 frames) and Kvasir-SEG (1000 frames) datasets, we randomly split each dataset into three subsets: 80% training, 10% validation, and 10% testing. Following the same training procedures as for the AFB tests, we froze a model when it optimized the mean Dice measure on the validation dataset. The frozen models were then used to generate prediction results for the test dataset. For models from others’ works, we used their reported results in the comparison. See Table 4. For the CVC-ClinicDB dataset, ESFPNet-S and ESFPNet-L gave the best and second best results, respectively, while, for the Kvasir-SEG dataset, ESFPNet-L and ESFPNet-S gave the second and third best measures, nearly equaling those of SSFormer-L. Overall, the experiment demonstrates the effective learning ability of ESFPNet.
Generalizability experiment: For the three proposed ESFPNet models, we conducted the following experiment. First, each model was trained on dataset #1. Next, each model was tested on dataset #2, data from a previously unseen source. In particular, we applied the same dataset split recommended in the experimental set-up for the PraNet [41], i.e., 90% of the video frames constituting the CVC-ClinicDB and Kvasir-SEG datasets (1450 frames) were used for training. Next, all images from CVC-ColonDB (300 frames) and ETIS-LaribPolypDB (196 frames) were used for testing (the previously unseen datasets). We kept the best-attained performance for each dataset as a measure of a model’s predictive performance on an unseen dataset.
Table 5 clearly shows the generalizability of ESFPNet over all five metrics. The results demonstrate the proposed ESFP decoder’s sustained adaptability through the -T, -S, and -L models as the MiT encoder increases in complexity from B0 to B2 to B4. Notably, the ascending segmentation performance illustrates that the proposed ESFP decoder aligns well with the enhanced capabilities offered by the increased parameter count of the MiT encoder. Lastly, the results highlight our model’s capacity to assimilate common features of polyps from diverse datasets and predict effectively on unseen data.
Polyp segmentation efficacy experiment: We used the same training dataset as in the generalizability experiment, where each model was separately trained until its loss converged. The remaining 10% of the video frames from the CVC-ClinicDB and Kvasir-SEG datasets (62 and 100 frames, respectively) and all images from CVC-T (60 frames), CVC-ColonDB (300 frames), and ETIS-LaribPolypDB (196 frames) were used for testing, giving five distinct test datasets. The focus of the experiment was to evaluate segmentation performance over both familiar and unseen data across the five datasets. For the other models, we used the numerical results reported in the following studies: Unet++, Zhou et al. [22]; SFA, Fang et al. [45]; CaraNet, Lou et al. [19]; and SSFormer, Wang et al. [20]. Table 6 gives the results.
ESFPNet-L and ESFPNet-S gave superior performance for two unseen datasets (CVC-ColonDB, ETIS-LaribPolypDB) and one familiar dataset (Kvasir-SEG), respectively, with SSFormer-L giving the second best results for two of these datasets. The CaraNet gave the best performance on the remaining two datasets (the familiar CVC-ClinicDB and the unseen CVC-T), with ESFPNet-L and ESFPNet-S giving the second and third best results on these datasets. The sample lesion segmentations of Figure 8 anecdotally corroborate these numerical observations. The Unet++ and SFA models were not competitive in this test. Overall, the ESFPNet architecture gives exemplary segmentation performance over this diverse collection of datasets.
3.3. Computation Considerations and Ablation Study
The number of parameters defining a network and the number of floating-point operations (FLOPs) required to process an input give an indication of the network’s complexity and computational efficiency. Table 7 gives measures of model complexity and computational cost for seven of the network models studied in Section 3.1 and Section 3.2. The GFLOPs values were calculated using the FLOP-counting package provided under Facebook’s research platform [54]. With respect to the models that gave the best performance results in the previous tests, the ESFPNet-S model requires substantially fewer parameters and demands significantly less computation than CaraNet and SSFormer-L. Over all networks, the ESFPNet-T model requires by far the fewest parameters and processing operations. Since the earlier experiments indicate that ESFPNet-T can give potentially acceptable performance, its simplicity may warrant use in certain applications.
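As an illustration, counts of this kind can be obtained with the fvcore library from Facebook Research (the choice of fvcore here is an assumption for the sketch; the cited reference [54] identifies the exact package used). A small stand-in model replaces the actual segmentation network:

```python
import torch
import torch.nn as nn
from fvcore.nn import FlopCountAnalysis, parameter_count

# Hypothetical stand-in for a segmentation network such as ESFPNet-S;
# substitute the actual model object when reproducing Table 7.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 1, kernel_size=1),
)
model.eval()

dummy_input = torch.randn(1, 3, 352, 352)  # one 352 x 352 RGB frame
flops = FlopCountAnalysis(model, dummy_input)
print(f"GFLOPs: {flops.total() / 1e9:.3f}")
print(f"Parameters (M): {parameter_count(model)[''] / 1e6:.3f}")
```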
To gain a fuller picture of model practicality, we also considered the actual computation time in a real-world implementation. Because an end-to-end turnkey version of a network model requires additional image processing steps, such as cropping, resizing, and normalization (Section 2.1), the actual computation time depends on more than just a network’s parameter count. The actual computation time is also influenced by the power of the CPU and GPU employed. Table 8 presents the computation time measurements for various CPU/GPU configurations, using the hardware discussed in Section 2.4.
Leveraging CPU multi-threading cuts 5–15 ms per frame by parallelizing the image preparation, resizing, and display operations, but the overall computation time remains very high if the GPU is not used. GPU acceleration markedly decreases the overall computation time to a range of 26 to 88 ms per frame over all models. Lastly, adding CPU multi-threading to GPU processing cuts an additional 10–15 ms per frame, giving a computation time range of 17 to 73 ms per frame. Hence, CPU efficiency clearly helps reduce the computation time significantly and should not be neglected.
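A minimal sketch of how such per-frame timings can be gathered, assuming a hypothetical frame source and model, with preprocessing handled on a CPU worker thread and inference timed on the GPU:

```python
import time
import threading
import queue
import cv2
import numpy as np
import torch

frame_queue: "queue.Queue" = queue.Queue(maxsize=8)

def preprocess_worker(frames):
    """CPU thread: resize/normalize frames and hand them to the inference loop."""
    for frame in frames:
        img = cv2.resize(frame, (352, 352)).astype(np.float32) / 255.0
        tensor = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0)  # HWC -> NCHW
        frame_queue.put(tensor)
    frame_queue.put(None)  # sentinel: no more frames

def timed_inference(model, device="cuda"):
    model = model.to(device).eval()
    times = []
    with torch.no_grad():
        while (tensor := frame_queue.get()) is not None:
            start = time.perf_counter()
            _ = model(tensor.to(device))
            if device == "cuda":
                torch.cuda.synchronize()  # wait for the GPU before stopping the clock
            times.append(time.perf_counter() - start)
    print(f"mean per-frame inference time: {1000 * sum(times) / len(times):.1f} ms")

# Usage sketch: `frames` would be decoded video frames, `model` a trained network.
# threading.Thread(target=preprocess_worker, args=(frames,), daemon=True).start()
# timed_inference(model)
```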
Table 8 shows that the superior-performing ESFPNet-S model achieves a processing speed exceeding 30 frames per second, enabling real-time video processing, while ESFPNet-T achieves a processing speed of 48 frames per second. In addition, ESFPNet-S exhibits the second-lowest parameter count and GFLOPs measure, per Table 7. While Unet++ exhibits the lowest parameter count, it demands the highest computational load of all models due to the dense convolution operations in its skip connections, a load that escalates especially with larger input sizes. Coupled with its weaker analysis performance noted earlier, this makes it the least competitive of the network models. Notably, even though the ESFPNet-S and SSFormer-S models share the same backbone, ESFPNet-S requires fewer parameters and significantly fewer GFLOPs than SSFormer-S while also giving better segmentation performance. Similar observations can be made when comparing the ESFPNet-L and SSFormer-L models. Although the CaraNet’s analysis performance is often comparable to that of ESFPNet-S, it demands more parameters and computational resources than ESFPNet-S.
To summarize, for the endoscopy applications considered here, the results and discussion of Section 3.1 and Section 3.2 clearly demonstrate the strong analysis performance of ESFPNet-S. In addition, as discussed above, the results of Table 7 and Table 8 show the architectural efficiency of ESFPNet-S, both in terms of the number of parameters required and the computation time. Thus, ESFPNet-S strikes a favorable balance between analysis performance and architectural efficiency.
To conclude, we performed an ablation study of the ESFPNet-S model, which draws on the MiT-B2 encoder. In particular, we investigated the impact of each component comprising the model’s ESFP decoder (Figure 1). Table 9 gives the results (cf. Table 3). The table clearly shows that all components make a substantial contribution to the performance of the ESFP decoder.