Surgery is often the primary treatment for head and neck squamous cell carcinoma (HNSCC) [1]. Primary surgery is the modality of choice for resectable oral cavity cancers and late-stage disease of the larynx and hypopharynx [2]. Management of locally advanced SCC may also require a multimodal approach with adjuvant chemoradiation therapy [1]. Nearly 90% of cancers of the upper aerodigestive tract of the head and neck are SCC [4]. Depending on the extent of the disease, radiation therapy or chemotherapy alone may be selected as the primary curative modality, as can be the case with unresectable, recurrent, or metastatic cancers and with cases known to be susceptible to chemoradiation [1]. Human papillomavirus (HPV) is an identified cause of SCC, and the most common location for HPV-positive (HPV+) SCC is the oropharynx, with nearly 60% of oropharyngeal SCC cases being HPV+ [1]. Approximately two-thirds of patients with HNSCC present with advanced stage III or IV disease [7]. Adequate surgical removal of the primary SCC is vital to successful patient outcomes, improved quality of life, survival, and reduced recurrence [8]. Surgeons can use pre-operative imaging, such as CT or MRI, for planning, but during surgery they rely on experience, visual cues, and palpation to determine the extent of the disease. Excised samples and tissue biopsies can be sent for pathological analysis and consultation to determine whether the cancer has been sufficiently resected [9].
Intraoperative pathologist consultations (IPCs) can be time-consuming and may not fully reflect the extent of the disease because of limitations in tissue sampling and preparation. While the overall accuracy of frozen sections in IPCs is upwards of 97%, the accuracy for challenging cases, such as positive and close margins, ranges from 71 to 92%, with sensitivities reported as low as 34 to 77% [10]. These errors can compound, leading to reported positive margins in up to 12% and close margins in up to 19% of HNSCC surgeries, despite negative frozen sections during IPC [10]. For example, in oral cavity SCC, up to 30% of patients have positive margins after surgery [3].
Surgical guidance for SCC resections in the head and neck has been explored increasingly over the past five years using several imaging modalities coupled with machine learning [16]. Some methods use fluorescently tagged monoclonal antibodies, which require intravenous administration but have specific optical signatures in the near-infrared (NIR) spectrum and have shown successful outcomes in studies with 21 patients [3], and other methods use topical fluorescent dyes to target SCC [17]. Label-free optical imaging methods that use only narrow bands in the blue and green visible spectrum have also demonstrated success at delineating oral SCC margins in-vivo in studies with 20 patients [19].
Hyperspectral imaging (HSI) is an emerging technology in biomedicine [21] and has been used for cancer detection studies both ex-vivo and in-vivo [22]. HSI has been utilized for brain cancer detection in-vivo using machine learning algorithms and an optimized clinical workflow for neurosurgeons [24]. Additionally, HSI has been proposed for laparoscopic cancer detection in colorectal surgeries, with demonstrated potential [26].
Our group reported proof-of-concept studies on HSI for the detection of head and neck SCC in fresh surgical specimens from human patients [17]. In our previous pilot studies with HSI, manually selected regions of interest (ROIs) were classified, and image preprocessing was used to remove specular glare pixels before tissue classification [17]. In our other works [28], deep learning algorithms were developed for HSI tissue classification in both whole primary-tumor specimens and at the SCC cancer-normal margin, but only in limited sample sizes of 21 to 29 patients employing cross-validation. Previous works from other groups focus on SCC detection in excised tongue SCC specimens, using both proof-of-concept ROI-based detection of SCC in 7 specimens [31] and tumor semantic segmentation of the entire cancer-margin specimens [32], with promising results in leave-one-patient-out cross-validation experiments of 14 patients.
In this large study of 293 tissue specimens from 102 patients with SCC, we develop deep learning methods to classify whole tissue specimens instead of ROIs and thus further investigate the full potential of label-free HSI-based imaging methods for SCC detection. This is the first work to conduct fully independent training, validation, and testing directly on the SCC tumor margin with a large patient dataset (N = 102 patients), divided into conventional, keratinizing SCC with variants (N = 88) and HPV+ SCC (N = 14) cohorts. The tissues represent a variety of anatomical sites to give an accurate assessment of the feasibility of label-free, non-contact, and non-ionizing HSI-based imaging modalities for SCC detection. Additionally, this is the first study to investigate and quantify HSI-based methods for HPV+ SCC detection directly. It is hypothesized that deep learning algorithms can be developed to enable label-free HSI-based methods, namely reflectance-based HSI and autofluorescence imaging, to perform accurately enough to provide meaningful information for guiding complete surgical resections. Furthermore, it is hypothesized that label-free HSI-based methods will outperform the fluorescent dye-based methods because the dyes lack the target specificity needed for a sufficient signal-to-noise ratio in SCC tissues. The results of this study will inform whether HSI and fluorescence imaging modalities can be expected to provide specific benefits to cancer margin detection during SCC resection surgeries.
3.1. Histological Ground Truth and Registration
After HSI acquisition, the fresh, ex-vivo tissue specimens were inked to preserve the optical imaging orientation, fixed in formalin, and paraffin embedded. Using a microtome, only the first high-quality top section, corresponding to the surface that was optically imaged, was obtained, stained with hematoxylin and eosin, and digitized using whole-slide scanning with a 40× objective [35]. The digital histology images from each specimen were annotated by a board-certified pathologist with expertise in head and neck cancer to outline the cancerous and normal areas [17].
The digital histology ground truth served as the gold standard for the optical imaging modalities. The histology ground truth image was registered to the gross-level HSI in a semi-automated fashion according to a previously established deformable registration pipeline [36]. This registration was subject to errors from tissue deformation, uncertainty in the cancer margin with depth, and off-plane slices, which in total were estimated to be about 1 mm [36]. Moreover, the variation in photon penetration depth among the optical imaging modalities and the variation in the cancer margin throughout the depth of the tissue specimens were estimated to create another 1 to 2 mm of error in the margin, according to our previous work [37]. Therefore, a systematic and objective method for calculating classification performance was implemented by removing the area near the cancer margin in millimeter increments and reporting all values [37]. The regions near the cancer margin are both included in and excluded from performance calculations because the tissue near the margin can be degenerated or have undergone pre-cancerous transformation. Here the registered cancer margin is referred to as the ‘actual TN’ margin, and the millimeter increments are measured from it. At the ‘actual TN’ margin, performance metrics are calculated for all tissue right up to the pixels that form the interface of tumor and normal. For the distance-based calculations, ‘TN at 1 mm’, for example, means that evaluation metrics are calculated on all tissue approaching no closer than 1 mm to the margin, that is, with the band within 1 mm of the margin removed. The ‘TN margin at 2 mm’ is also reported, which calculates performance on tissue approaching no closer than 2 mm to the margin.
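The margin-exclusion evaluation described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the toy one-row mask, the brute-force distance search, and the choice to keep pixels at exactly the exclusion distance are all assumptions made for the example.

```python
import math

def margin_pixels(mask):
    """Pixels that touch a pixel of the other class (4-neighbourhood)."""
    rows, cols = len(mask), len(mask[0])
    border = []
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and mask[rr][cc] != mask[r][c]:
                    border.append((r, c))
                    break
    return border

def keep_mask(mask, exclude_mm, mm_per_pixel):
    """True where a pixel is at least exclude_mm away from the tumor-normal interface."""
    rows, cols = len(mask), len(mask[0])
    keep = [[True] * cols for _ in range(rows)]
    border = margin_pixels(mask)
    if exclude_mm <= 0 or not border:
        return keep  # 'actual TN' margin: every pixel is evaluated
    for r in range(rows):
        for c in range(cols):
            # Brute-force nearest-interface distance (fine for a toy example only).
            d_mm = min(math.hypot(r - br, c - bc) for br, bc in border) * mm_per_pixel
            keep[r][c] = d_mm >= exclude_mm
    return keep

def accuracy(pred, truth, keep):
    """Fraction of retained pixels whose predicted label matches the ground truth."""
    pairs = [(p, t) for pr, tr, kr in zip(pred, truth, keep)
             for p, t, k in zip(pr, tr, kr) if k]
    return sum(p == t for p, t in pairs) / len(pairs)

# Toy example: 1 = tumor, 0 = normal, hypothetical pixel size of 0.5 mm.
truth = [[1, 1, 1, 1, 0, 0, 0, 0]]
pred = [[1, 1, 1, 0, 0, 0, 0, 0]]  # one error, right at the interface
acc_actual = accuracy(pred, truth, keep_mask(truth, 0.0, 0.5))
acc_1mm = accuracy(pred, truth, keep_mask(truth, 1.0, 0.5))
```

In this toy case the single misclassified pixel lies on the interface, so it counts at the ‘actual TN’ margin but drops out once the 1 mm band is removed. For real specimen-sized masks, a distance transform would be far more efficient than the brute-force search used here.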
3.2. Intra-Patient Experiments
Intra-patient experiments used a patient’s known cancer and normal specimens to train a machine learning algorithm, simulating a personalized approach for on-the-fly SCC detection in the operating room. For these experiments, an ensemble of linear discriminant analysis (LDA) classifiers was trained, validated, and tested on SCC data from the same patient. Each SCC patient with all three tissue types (a purely normal specimen, a specimen containing only primary tumor, and a specimen of the cancer margin) was included and assigned to a cohort: conventional SCC (N = 41) or HPV+ SCC (N = 6). Although 102 patients were recruited for this study, only 47 had all three tissue types. For each patient independently, an ensemble LDA of 500 learners was trained and validated in 5 folds on that patient’s tumor and normal samples. After each patient’s model was developed, the patient’s tumor-normal margin specimen was used as the testing data. The LDA method was selected because our previous work demonstrated that it outperformed other regression-based machine learning algorithms [18]. Training time for 5 cross-validated folds of one patient’s model was about one to three minutes, depending on the size of the regions selected for training, which is reasonable for simulating training on a patient’s HSI data during surgery. All statistical analyses were performed using a paired, one-tailed t-test.
3.3. Inter-Patient Experiments
To explore the ability of HSI and fluorescence imaging modalities to detect SCC in patients fully independent from algorithm development, two experiments were performed. The first experiment consisted of training the CNN on primary tumor (T) and all normal (N) tissues and testing on T and N tissues from other patients. The second experiment consisted of training on primary tumor (T) and all normal (N) tissues and testing only on tumor-involved cancer-margin (TN) tissues from other patients.
To perform these experiments, patients within each SCC cohort were randomly divided into 5 folds. Each fold served in turn as the fully independent testing group, while training and validation were performed on the patients in the remaining 4 folds, which yields test-level performance metrics for all patients in our dataset. For the conventional SCC cohort, each model from each fold was trained and validated on approximately 25,000 patches from 110 tissue specimens from 70 patients, and the independent testing group from each fold was approximately 50 tissues from 20 patients. This was performed once for each fold, until the entire cohort dataset, comprising 70,000 patches from 255 tissue specimens from 88 patients, had been used as the testing group. For the HPV+ SCC cohort, training and validation were performed in the same fashion in 5 folds, until the entire cohort dataset of 16,000 patches from 38 tissue specimens from 14 patients had been used as the testing group. All statistical analyses were performed using a paired, one-tailed t-test.
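The patient-level fold assignment described above can be sketched as follows. This is a hedged illustration only: the function names, the fixed random seed, and the dictionary layout of a specimen record are assumptions, not details from the study. The point it shows is that folds are formed over patients rather than over specimens or patches, so no test patient contributes data to training.

```python
import random

def patient_folds(patient_ids, n_folds=5, seed=0):
    """Randomly assign each unique patient to exactly one of n_folds folds."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)  # seeded for reproducibility (assumption)
    return [ids[i::n_folds] for i in range(n_folds)]

def split_specimens(specimens, folds, test_fold):
    """Split specimen records so that test patients never appear in training."""
    test_patients = set(folds[test_fold])
    train = [s for s in specimens if s["patient"] not in test_patients]
    test = [s for s in specimens if s["patient"] in test_patients]
    return train, test

# Toy example: 88 hypothetical patients, each with one T and one N specimen.
specimens = [{"patient": p, "tissue": t} for p in range(88) for t in ("T", "N")]
folds = patient_folds(s["patient"] for s in specimens)
train, test = split_specimens(specimens, folds, test_fold=0)
```

Repeating the split for each value of `test_fold` lets every patient serve in the testing group exactly once, mirroring the rotation described in the text.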
3.4. Convolutional Neural Network
For inter-patient experiments, a convolutional neural network (CNN) was developed to quickly and efficiently classify cancerous and normal tissues at the cancer margin. Because of the uniqueness of HSI data, the Inception-v4 CNN architecture [38] was customized in several key ways to optimize the CNN for hypercube data in image patches of size 25 × 25 × C, where C is the number of spectral bands of each HS optical modality. The full CNN architecture schematic is presented in detail in Figure S1. The CNN was developed in TensorFlow on an Ubuntu machine running NVIDIA Titan-XP GPUs [39]. The early convolutional layers were modified to handle the selected patch size and to create smaller inception blocks that allow faster training and classification. Training was performed for up to 50 epochs, and one epoch of training data ran for up to 1 hour using HSI. Deployment of the fully trained CNN on a single GPU to classify a new HSI scene with hundreds of patches required only 25 ± 10 seconds. The relative saliency of spectral features for correctly predicting SCC or normal in HSI, shown in Figure 1b, was extracted from the CNN using the class-activated gradients per the grad-CAM algorithm [40].
3.5. Image Processing and Reconstruction
For all experiments, the 10 pixels, corresponding to 0.25 mm, at the edge of each tissue were discarded for performance calculations. Since the imaging protocol for tissue specimens required using a flat imaging surface, the tissue free edges created false curvature where the tissue was too thin to provide an adequate imaging signal. Implementation of the inter-patient CNN experiments involved a patch-based approach using a sliding window of size 25 × 25 × 91 and an overlap of 13 pixels. The overlapping regions of image-patches were averaged to produce a smoother result for calculation of the performance metrics of the inter-patient experiments.
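The sliding-window reconstruction described above can be sketched as follows. This is a minimal illustration under stated assumptions: `predict` is a hypothetical stand-in for the trained CNN that returns one cancer probability per patch, each patch's probability is spread over all of its pixels, and an extra window is placed flush with the image edge when the stride does not divide the image evenly. None of these details are confirmed by the text; only the patch size of 25 and the overlap of 13 pixels are.

```python
def window_starts(length, patch=25, overlap=13):
    """Start indices of sliding windows along one image axis."""
    if length < patch:
        raise ValueError("image is smaller than one patch")
    stride = patch - overlap  # 25 - 13 = 12 pixels between window starts
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        starts.append(length - patch)  # assumption: final window flush with the edge
    return starts

def average_overlaps(height, width, predict, patch=25, overlap=13):
    """Average per-patch probabilities wherever patches overlap."""
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for r0 in window_starts(height, patch, overlap):
        for c0 in window_starts(width, patch, overlap):
            p = predict(r0, c0)  # hypothetical classifier call for this patch
            for r in range(r0, r0 + patch):
                for c in range(c0, c0 + patch):
                    total[r][c] += p
                    count[r][c] += 1
    return [[total[r][c] / count[r][c] for c in range(width)] for r in range(height)]

# Toy example: two horizontally overlapping patches with different outputs.
prob_map = average_overlaps(25, 37, lambda r0, c0: 1.0 if c0 == 0 else 0.0)
```

In the toy case, columns covered by only one patch keep that patch's value, while the 13 shared columns take the mean of the two outputs, which is the smoothing effect the averaging is meant to provide.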
To evaluate the performance of the machine learning algorithms employed in the experiments for detecting SCC, the area under the curve (AUC) of the receiver operating characteristic (ROC) curve was calculated. The AUC score was selected because it summarizes performance across all possible thresholds for identifying the positive class and is less susceptible to errors when the classes are imbalanced. For each experiment, the optimal operating point on the ROC was calculated for the validation group data. This validation group threshold was then used as the threshold for the testing group to distinguish between cancer and normal objectively. Using this threshold, the overall accuracy was calculated. Sensitivity, the ratio of true positives to all actual positives, and specificity, the ratio of true negatives to all actual negatives, were also calculated and presented.
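The evaluation procedure above can be sketched as follows. This is a hedged example rather than the authors' code: the text does not state how the optimal operating point was chosen, so Youden's J statistic (sensitivity + specificity − 1) is used here as an assumption, and the function names and toy scores are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sens_spec_acc(scores, labels, threshold):
    """Sensitivity, specificity, and accuracy when score >= threshold means cancer."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    sensitivity = tp / (tp + fn)  # true positives over all actual positives
    specificity = tn / (tn + fp)  # true negatives over all actual negatives
    return sensitivity, specificity, (tp + tn) / len(labels)

def youden_threshold(scores, labels):
    """Threshold maximizing sensitivity + specificity - 1 (assumed criterion)."""
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        sens, spec, _ = sens_spec_acc(scores, labels, t)
        if sens + spec - 1 > best_j:
            best_t, best_j = t, sens + spec - 1
    return best_t

# Toy validation data (1 = cancer, 0 = normal) fixes the threshold...
val_scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
val_labels = [0, 0, 0, 1, 0, 1, 1, 1]
threshold = youden_threshold(val_scores, val_labels)

# ...which is then applied unchanged to the independent test data.
test_scores = [0.3, 0.5, 0.45, 0.2]
test_labels = [0, 1, 0, 1]
test_metrics = sens_spec_acc(test_scores, test_labels, threshold)
```

Selecting the threshold on validation data and applying it unchanged to the test data keeps the reported accuracy, sensitivity, and specificity free of any tuning on the test patients.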
The results of this large study of 293 tissue specimens from 102 patients with SCC show that label-free, reflectance-based hyperspectral imaging and autofluorescence imaging both outperform the fluorescent dye-based imaging methods, i.e., proflavin and 2-NBDG, and that this technology could aid in the detection of SCC. The fluorescent dyes employed are not specific enough to target SCC with a high signal-to-noise ratio in ex-vivo tissue specimens because of the large inter-patient variability. Proflavin allows visualization of nuclear structures but is washed out by excessive keratin. The regional metabolic uptake of 2-NBDG to localize cancerous areas was not evident in the results of this ex-vivo study. Label-free HSI techniques show potential, but the best machine learning protocol for training HSI classifiers remains undetermined. It may be task specific, but the results of this study show that with a large SCC HSI database, deep learning algorithms can be trained with high fidelity to work across a large number of anatomical sites in the upper aerodigestive tract.
IPC analysis with frozen sections remains the current standard for intraoperative guidance, but it is time and labor intensive. Across all 102 patients with SCC recruited for this study, an average of 2.1 IPCs were performed per surgery, each taking about 41 minutes in total. On average, each surgery investigated 3.4 tissues, each of which took about 25 minutes to reach a final reported diagnosis. The average imaging time for HSI was about 1 minute, with up to 35 seconds for HSI classification using the CNN, which is substantially less than the time required for an IPC.
Detection of SCC for surgical purposes is a challenging task, whether performed by a surgeon, pathologist, or computer-aided optical imaging modality. In the literature, the accuracy of detecting positive or close margins in frozen sections ranges from 71 to 92%, with sensitivity from 34 to 77% [10]. As sampling and tissue preparation are the main sources of error, careful sectioning of small biopsies and vigilant communication are recommended to reduce errors during IPCs [9]. Nonetheless, a significant need for guidance remains, with up to 20% to 30% of cases reported with close or positive margin results after SCC resections [10]. To put the SCC detection ability of HSI-based methods into context, we report a pathologist assistant's accuracy of 88% for tissue identification performed for research purposes only. Since current practice is imperfect, the potential benefit of HSI for SCC detection should be evaluated on two criteria: first, to establish no potential harm; and second, to assess whether HSI-based intraoperative information has clinical utility in achieving negative margins, especially considering the time advantage.
The results presented in this study using 293 specimens from 102 patients can be compared to previous pilot studies from our group. Lu et al. 2017 reported results from a small (N = 24) proof-of-concept study using manual ROIs, which showed that training and testing on the same patient with HSI yielded an intra-patient accuracy of 89–94% and an intra-patient AUC of about 0.96 [18]. Our objective and systematic approach yielded equivalent results using nearly double the intra-patient cohort (N = 47) for distances 1 mm beyond the cancer margin: 85–90% accuracy for conventional SCC and 88–97% accuracy for HPV+ SCC, across all anatomical tissue sites. Moreover, the AUCs obtained, from 0.82 to 0.91 for the conventional and HPV+ SCC cohorts at 2.25 mm from the cancer margin, are importantly not limited by the selection of manual ROIs and include specular glare pixels. Therefore, slightly lower results are to be expected, but they provide a more realistic performance estimate for HSI-based methods in the operating room. The experimental results of this study were a median AUC of 0.92 for HSI and 0.93 for autofluorescence for all conventional SCC T vs N tissues, using the most patient data in an HSI study to date. The previous proof-of-concept work by Lu et al. reported an accuracy of 85% for T versus N tissues only, an accuracy of 76% for manual ROIs near the cancer margin, and an overall average AUC of 0.88 for all tissues (T, TN, and N) for SCC at comparable tissues and anatomical sites. In comparison, in this study, for larynx, nasal cavity, and oropharyngeal SCC, we achieved AUC scores of 0.85, 0.93, and 0.95 with accuracies above 79%.
The optical imaging modalities in this study were all acquired using HSI technology, including proflavin, 2-NBDG, and autofluorescence, and all were saved as hypercubes for CNN training. Moreover, even RGB images were generated from HSI, and recent work has suggested that CNNs can recover the full HSI spectrum from RGB composites constructed from HSI [42]. Therefore, it is possible that the results from these modalities benefitted from being HS data, which is one possible explanation for not observing more statistically significant differences. The results of this study are promising at tissue sites that perform with high AUCs in both SCC cohorts. However, the results suggest that HPV+ SCC requires more data to perform well with deep learning. Therefore, the results of this study support the hypothesis that label-free HSI methods significantly outperform the dye-based methods and could provide value for clinical SCC detection.