A Deep Learning Model for Cervical Cancer Screening on Liquid-Based Cytology Specimens in Whole Slide Images

Simple Summary
In this pilot study, we aimed to investigate the use of deep learning for the classification of whole-slide images of liquid-based cytology specimens into neoplastic and non-neoplastic. To do so, we used large training and test sets. Overall, the model achieved good performance in classifying whole-slide images, demonstrating the promising potential of such models for aiding the screening process for cervical cancer.

Abstract
Liquid-based cytology (LBC) for cervical cancer screening is now more common than conventional smears; when digitised from glass slides into whole-slide images (WSIs), LBC specimens open up the possibility of artificial intelligence (AI)-based automated image analysis. Since conventional screening by cytoscreeners and cytopathologists using microscopes is limited in terms of human resources, it is important to develop new computational techniques that can automatically and rapidly diagnose a large number of specimens without delay, which would be of great benefit for clinical laboratories and hospitals. The goal of this study was to investigate the use of a deep learning model for the classification of WSIs of LBC specimens into neoplastic and non-neoplastic. To do so, we used a dataset of 1605 cervical WSIs. We evaluated the model on three test sets with a combined total of 1468 WSIs, achieving ROC AUCs for WSI diagnosis in the range of 0.89–0.96, demonstrating the promising potential of such models for aiding the screening process.


Introduction
According to the Global Cancer Statistics 2020 [1], cervical cancer is the fourth leading cause of cancer death in women, with an estimated 342,000 deaths worldwide in 2020. However, incidence and mortality rates have declined over the past few decades due to either increasing average socioeconomic levels or a diminishing risk of persistent infection with high risk human papillomavirus (HPV) [1]. In developed countries, cervical cytology screening systems have been organised to reduce mortality from cervical cancer [2][3][4][5][6][7][8][9].
The introduction of cervical cancer screening led to a fall in associated mortality rates; however, there is some evidence that the conventional smear method for screening is not consistent in reliably detecting cervical intraepithelial neoplasia (CIN) [10][11][12]. This is because conventional cervical smears, when spread on glass slides, tend to have the cells of interest mixed with blood, debris, and exudate. A number of new technologies and procedures are becoming available in various screening programs (e.g., liquid-based cytology (LBC), automated screening devices, computer-assisted microscopy, digital colposcopy with automated image analysis, HPV testing). The LBC technique preserves the cells of interest in a liquid medium and removes most of the debris, blood, and exudate. In this pilot study, we trained a deep learning model, based on convolutional and recurrent neural networks, using a dataset of 1605 cervical WSIs. We evaluated the model on three test sets with a combined total of 1468 WSIs, achieving ROC AUCs for WSI diagnosis in the range of 0.89–0.96.

Clinical Cases and Cytopathological Records
This is a retrospective study. A total of 3121 conventionally prepared LBC ThinPrep Pap test (Hologic, Inc.) glass slide specimens of human cervical cytology were collected from a private clinical laboratory in Japan after cytopathological review of those specimens by cytoscreeners and pathologists. The cases were selected mostly at random so as to reflect a real clinical scenario as much as possible; we also collected cases so as to compile a test set with an equal balance of neoplastic and NILM. The cytoscreeners and pathologists excluded cases that had poor scan quality (n = 32). Each WSI diagnosis was reviewed by at least two cytoscreeners and pathologists, with the final checking and verification performed by a senior cytoscreener or pathologist. All WSIs were scanned at a magnification of ×20 using the same Aperio AT2 digital whole-slide scanner (Leica Microsystems, Osaka, Japan) and were saved in SVS file format with JPEG2000 compression. Table 1 breaks down the distribution of the dataset into training, validation, and test sets. The split was carried out randomly, taking into account the proportion of each label in the dataset. The clinical laboratory that provided the LBC cases was anonymised. The test sets were composed of full agreement, clinical balance, and equal balance LBC specimen WSIs. The full agreement test set consisted of NILM and neoplastic LBC cases whose diagnoses were fully agreed upon by two independent cytoscreeners at different institutes. The clinical balance test set consisted of 95% NILM and 5% neoplastic LBC cases, based on a real clinical setting [48,49]. The equal balance test set consisted of 50% NILM and 50% neoplastic LBC cases. NILM and neoplastic LBC cases for the clinical and equal balance test sets were collected based on the diagnoses provided by the clinical laboratory; the cases in these two test sets were based only on the diagnostic reports.
From these two test sets, we have also created their reviewed counterparts (clinical balance reviewed and equal balance reviewed), where two independent cytoscreeners viewed all the cases and the ones they had a disagreement on were removed (see Table 1).
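The label-stratified random split described above can be sketched as follows. This is a minimal illustration with hypothetical split fractions; the actual split sizes in Table 1 were fixed by the study design, not by this code.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_frac=0.05, test_frac=0.1, seed=0):
    """Randomly split sample indices into train/val/test while keeping
    the proportion of each label roughly equal across the splits."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train, val, test = [], [], []
    for label, indices in by_label.items():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_frac)
        n_test = round(len(indices) * test_frac)
        val.extend(indices[:n_val])
        test.extend(indices[n_val:n_val + n_test])
        train.extend(indices[n_val + n_test:])
    return train, val, test

# Toy dataset: 80 NILM and 20 neoplastic WSIs (illustrative counts only).
labels = ["NILM"] * 80 + ["neoplastic"] * 20
train, val, test = stratified_split(labels)
```

Because the split is performed per label, both NILM and neoplastic cases appear in each split in roughly their overall proportions.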

Annotation
Senior cytoscreeners and pathologists who perform routine cytopathological screening and diagnosis in general hospitals and clinical laboratories in Japan manually annotated 352 neoplastic WSIs from the training sets. Coarse annotations were obtained by free-hand drawing (Figure 1) using an in-house online tool developed by customising the open-source OpenSeadragon tool (https://openseadragon.github.io/, accessed on 10 January 2020), a web-based viewer for zoomable images. On average, the cytoscreeners and pathologists annotated 150 cells (or cellular clusters) per WSI.
Neoplastic WSIs consisted of ASC (atypical squamous cell), LSIL (low-grade squamous intraepithelial lesion), HSIL (high-grade squamous intraepithelial lesion), CIS (carcinoma in situ), ADC (adenocarcinoma), and SCC (squamous cell carcinoma) cases; NILM cases were not included. For example, on the HSIL (Figure 1A–D) and SCC (Figure 1E–H) WSIs, cytoscreeners and pathologists drew annotations around the neoplastic cells (Figure 1B–D,F–H) based on representative neoplastic epithelial cell morphology (e.g., increased nuclear/cytoplasmic ratio, abnormalities of nuclear shape, hyperchromatism, irregular chromatin distribution, and prominent nucleoli). On the other hand, the cytoscreeners and pathologists did not annotate areas where it was difficult to cytologically determine whether the cells were neoplastic. The NILM subset of the training and validation sets (1301 WSIs) was not annotated, and the entire cell-spreading areas within those WSIs were used.
The average annotation time per WSI was about an hour. Annotations performed by the cytoscreeners and pathologists were modified (if necessary), confirmed, and verified by a senior cytoscreener.

Deep Learning Models
Our deep learning model consisted of a convolutional neural network (CNN) and a recurrent neural network (RNN) that were trained simultaneously end to end. For the CNN, we used the EfficientNetB0 architecture [50] with a modified input size of 1024 × 1024 px to allow a larger view; this is based on cytologists' input that they usually need to view the neighbouring cells around a given cell in order to diagnose more accurately. We then performed 7 × 7 max pooling with a stride of 5 × 5. The output of the CNN was reshaped and provided as input to an RNN with a gated recurrent unit (GRU) model (Cho et al. [51]) of size 128, followed by a fully connected layer. We used the partial fine-tuning approach [52] for tuning the CNN component, where only the affine weights of the batch normalisation layers are updated while the rest of the weights in the CNN remain frozen. We used the pre-trained weights from ImageNet as starting weights; the RNN component was initialised with random weights. Figure 2 shows a simplified overview of the model. WSIs tend to contain a large white background that is not relevant for the model. We therefore started the preprocessing by eliminating the white background using Otsu's method [53] applied to the greyscale version of the WSIs.
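Otsu's method selects the greyscale threshold that maximises the between-class variance, which separates the bright white background from the darker tissue. A minimal NumPy sketch follows; the toy image and its pixel values are illustrative, not from the study.

```python
import numpy as np

def otsu_threshold(grey):
    """Return the greyscale threshold (0-255) that maximises the
    between-class variance, following Otsu's method."""
    hist = np.bincount(grey.ravel(), minlength=256).astype(float)
    total = hist.sum()
    cum_w = np.cumsum(hist)                      # class-0 pixel counts
    cum_mean = np.cumsum(hist * np.arange(256))  # class-0 intensity sums
    best_t, best_var = 0, -1.0
    for t in range(255):
        w0 = cum_w[t] / total
        w1 = 1.0 - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0 = cum_mean[t] / cum_w[t]
        mu1 = (cum_mean[-1] - cum_mean[t]) / (cum_w[-1] - cum_w[t])
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

# Toy greyscale "slide": bright white background, darker tissue pixels.
img = np.array([[250, 252, 40], [251, 35, 45], [249, 250, 38]], dtype=np.uint8)
t = otsu_threshold(img)
tissue_mask = img <= t  # keep only the darker (non-background) pixels
```

Tiles are then extracted only where `tissue_mask` indicates cell-spreading regions, skipping the white background entirely.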
For training and inference, we then proceeded by extracting 1024 × 1024 px tiles from the tissue regions. We performed the extraction in real-time using the OpenSlide library [54]. To perform inference on a WSI, we used a sliding window approach with a fixed-size stride of 512 × 512 px (half the tile size). This results in a grid-like output of predictions on all areas that contained cells, which then allowed us to visualise the prediction as a heatmap of probabilities that we can directly superimpose on top of the WSI. Each tile had a probability of being neoplastic; to obtain a single probability that is representative of the WSI, we computed the maximum probability from all the tiles.
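The sliding-window geometry and the maximum-probability aggregation described above can be sketched as follows. Reading the actual pixel data (e.g., via OpenSlide) is omitted, and the slide dimensions and per-tile probabilities are hypothetical.

```python
import numpy as np

TILE = 1024
STRIDE = 512  # half the tile size, as in the sliding-window inference

def tile_origins(width, height, tile=TILE, stride=STRIDE):
    """Top-left coordinates of every tile in the sliding-window grid."""
    xs = range(0, width - tile + 1, stride)
    ys = range(0, height - tile + 1, stride)
    return [(x, y) for y in ys for x in xs]

def wsi_probability(tile_probs):
    """Aggregate per-tile neoplastic probabilities into a single WSI
    score by taking the maximum over all tiles."""
    return float(np.max(tile_probs))

# Hypothetical 4096 x 2048 px tissue region -> a 7 x 3 grid of tiles.
origins = tile_origins(4096, 2048)
heatmap = np.random.default_rng(0).uniform(0.0, 0.4, size=len(origins))
heatmap[5] = 0.93  # suppose one tile is strongly predicted neoplastic
score = wsi_probability(heatmap)
```

Because the stride is half the tile size, neighbouring tiles overlap, so a cell cluster falling on a tile boundary is still seen whole by at least one tile.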
During training, we maintained an equal balance of positively and negatively labelled tiles in the training batch. To do so, for the positive tiles, we extracted them randomly from the annotated regions of neoplastic WSIs, such that within the 1024 × 1024 px, at least one annotated cell was visible anywhere inside the tile. For the negative tiles, we extracted them randomly anywhere from the tissue regions of NILM WSIs. We then interleaved the positive and negative tiles to construct an equally balanced batch that was then fed as input to the CNN. In addition, to reduce the number of false positives, given the large size of the WSIs, we performed a hard mining of tiles, whereby at the end of each epoch, we performed full sliding window inference on all the NILM WSIs in order to adjust the random sampling probability such that false positively predicted tiles of NILM were more likely to be sampled.
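A minimal sketch of the balanced-batch interleaving and the hard-mining re-weighting follows. The tile identifiers, weight-update factor, and batch size are illustrative assumptions, not the authors' exact scheme.

```python
import random

def make_balanced_batch(pos_tiles, neg_tiles, neg_weights, batch_size, rng):
    """Interleave positive tiles (from annotated neoplastic regions) with
    negative tiles (from NILM tissue), sampling negatives according to
    per-tile weights that hard mining increases for false positives."""
    half = batch_size // 2
    pos = rng.choices(pos_tiles, k=half)
    neg = rng.choices(neg_tiles, weights=neg_weights, k=half)
    batch = []
    for p, n in zip(pos, neg):
        batch.extend([p, n])  # interleave: positive, negative, ...
    return batch

def harden(neg_weights, false_positive_idx, factor=2.0):
    """After an epoch, make falsely predicted NILM tiles more likely
    to be sampled (a simple multiplicative re-weighting sketch)."""
    for i in false_positive_idx:
        neg_weights[i] *= factor
    return neg_weights

rng = random.Random(0)
pos_tiles = ["pos_%d" % i for i in range(4)]
neg_tiles = ["neg_%d" % i for i in range(4)]
weights = [1.0] * 4
weights = harden(weights, [2])  # tile neg_2 was a false positive
batch = make_balanced_batch(pos_tiles, neg_tiles, weights, 8, rng)
```

Each batch thus stays 50/50 positive/negative regardless of the heavy NILM majority in the dataset, while hard mining steers the negative half towards the tiles the model currently gets wrong.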
During training, we performed real-time augmentation of the extracted tiles using variations of brightness, saturation, and contrast. We trained the model using the Adam optimisation algorithm [55], with the binary cross entropy loss, beta 1 = 0.9, beta 2 = 0.999, and a learning rate of 0.001. We applied a learning rate decay of 0.95 every 2 epochs. We used early stopping by tracking the performance of the model on a validation set, and training was stopped automatically when there was no further improvement on the validation loss for 10 epochs. The model with the lowest validation loss was chosen as the final model.
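The early-stopping rule and the stepwise learning-rate decay can be sketched as follows; the shortened patience and the toy loss curve are for illustration only (the study used a patience of 10 epochs).

```python
class EarlyStopper:
    """Stop training when the validation loss has not improved for
    `patience` consecutive epochs; remember the best epoch."""
    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def update(self, epoch, val_loss):
        """Record this epoch's loss; return True if training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

def decayed_lr(base_lr=0.001, decay=0.95, epoch=0, every=2):
    """Learning rate decayed by a factor of 0.95 every 2 epochs."""
    return base_lr * decay ** (epoch // every)

stopper = EarlyStopper(patience=3)  # patience shortened for the demo
losses = [0.9, 0.7, 0.65, 0.66, 0.68, 0.7]
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.update(epoch, loss):
        stopped_at = epoch
        break
```

The model saved at `stopper.best_epoch` (the lowest validation loss) would be kept as the final model.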

Interobserver Concordance Study
For the interobserver concordance study, a total of 10 WSIs (8 NILM cases and 2 neoplastic cases) of cervical LBC already reported by a clinical laboratory were retrieved from the records. Using the in-house online web-based virtual slide application, a total of 16 cytoscreeners (8 with over 10 years of experience and 8 with less than 10 years) reviewed the 10 WSIs and reported subclasses (NILM, ASC-US, ASC-H, LSIL, HSIL, SCC, ADC).

Software and Statistical Analysis
The deep learning models were implemented and trained using the open-source TensorFlow library [56].
The true positive rate (TPR) was computed as TPR = TP / (TP + FN), and the false positive rate (FPR) was computed as FPR = FP / (FP + TN), where TP, FP, TN, and FN represent true positives, false positives, true negatives, and false negatives, respectively. The ROC curve was computed by varying the probability threshold from 0.0 to 1.0 and computing both the TPR and FPR at each threshold.
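The threshold sweep described above can be sketched as follows. The labels and probabilities are toy values, and the trapezoidal AUC computation at the end is a standard addition for illustration, not part of the text.

```python
import numpy as np

def roc_points(labels, probs, thresholds=None):
    """Compute (FPR, TPR) pairs by sweeping the probability threshold
    from 0.0 to 1.0, using TPR = TP/(TP+FN) and FPR = FP/(FP+TN)."""
    labels = np.asarray(labels, dtype=bool)
    probs = np.asarray(probs, dtype=float)
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)
    points = []
    for t in thresholds:
        pred = probs >= t
        tp = np.sum(pred & labels)
        fp = np.sum(pred & ~labels)
        fn = np.sum(~pred & labels)
        tn = np.sum(~pred & ~labels)
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        points.append((fpr, tpr))
    return points

labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
points = roc_points(labels, probs)
# AUC by trapezoidal integration over the threshold-swept curve.
fprs, tprs = zip(*sorted(points))
auc = sum((fprs[i + 1] - fprs[i]) * (tprs[i + 1] + tprs[i]) / 2
          for i in range(len(fprs) - 1))
```

For these toy values the trapezoidal AUC equals 8/9, matching the fraction of positive/negative pairs that are correctly ranked.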

High AUC Performance of WSI Evaluation of Neoplastic Cervical Liquid-Based Cytology (LBC) Images
The aim of this retrospective study was to train a deep learning model for the classification of neoplastic cervical WSIs. We trained a model consisting of a convolutional and a recurrent neural network using a dataset of 1503 WSIs for training and 150 WSIs for validation. We evaluated the model on three test sets with a combined total of 1468 WSIs. Figure 3 shows the resulting ROC curves, and Table 2 lists the resulting ROC AUC and log loss, as well as the accuracy, sensitivity, and specificity computed at a probability threshold of 0.5. Table 3

True Positive Prediction
Our deep learning model satisfactorily predicted neoplastic epithelial cells (Figure 4C–G) in cervical LBC specimens (Figure 4A).

True Negative Prediction
Our model satisfactorily predicted NILM cases (Figure 5A,B) in cervical LBC specimens. The heatmap image shows true negative predictions (Figure 5B,D,E) for neoplastic epithelial cells. In both the zero-probability (Figure 5C) and very low probability tiles (Figure 5D,E), there are no neoplastic epithelial cells.

False Positive Prediction
A cytopathologically diagnosed NILM case (Figure 6A) was falsely predicted as positive for neoplastic epithelial cells (Figure 6B). The heatmap image (Figure 6B) shows false positive predictions of neoplastic epithelial cells (Figure 6C,E) with high probabilities. Cytopathologically, there are parabasal cells with a high nuclear/cytoplasmic (N/C) ratio (Figure 6C,D) and cell clusters of squamous epithelial cells and cervical gland cells with high N/C ratios (Figure 6E), which could be a major cause of false positives. In (E), there are cell clusters of squamous epithelial cells and cervical gland cells with slightly high N/C ratios and a dense chromatin appearance due to cellular overlapping.

Interobserver Variability
To evaluate practical interobserver variability among cytoscreeners, we asked a total of 16 cytoscreeners (8 with over 10 years of experience and 8 with less than 10 years) to review the same 10 LBC WSIs, consisting of 8 NILM and 2 neoplastic cases already diagnosed by a clinical laboratory. The results of each cytoscreener are summarised in Table 4, and the Fleiss' kappa statistics are summarised in Table 5. There was poor to moderate concordance in assessing subclasses, with Fleiss' kappas for NILM (range: 0.042–0.755), neoplastic (range: 0.098–0.500), and all cases (range: 0.364–0.716). On the other hand, there was poor to almost perfect concordance in assessing the binary class, with Fleiss' kappas for NILM (range: 0.073–0.815), neoplastic (1.000), and all cases (range: 0.568–0.861). Interestingly, concordance was robustly higher in both the subclass and binary class assessments among cytoscreeners with over 10 years of experience. However, overall, there was poor concordance in assessing NILM cases (range: 0.042–0.073).
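Fleiss' kappa for a fixed number of raters can be computed as follows. This is a standard implementation sketch; the toy ratings are illustrative, not the study data.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings`: one row per case, one entry per
    rater, each entry a category label. All rows must have the same
    number of raters."""
    n_raters = len(ratings[0])
    categories = sorted({c for row in ratings for c in row})
    # n_ij: how many raters assigned case i to category j
    counts = [[Counter(row)[c] for c in categories] for row in ratings]
    N = len(counts)
    # Per-case agreement P_i and per-category proportions p_j
    P_i = [(sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_j = [sum(row[j] for row in counts) / (N * n_raters)
           for j in range(len(categories))]
    P_bar = sum(P_i) / N          # mean observed agreement
    P_e = sum(p * p for p in p_j)  # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Toy example: 4 raters, 3 cases, binary NILM vs. neoplastic calls.
ratings = [
    ["NILM", "NILM", "NILM", "NILM"],                          # full agreement
    ["neoplastic", "neoplastic", "neoplastic", "neoplastic"],  # full agreement
    ["NILM", "NILM", "neoplastic", "neoplastic"],              # split call
]
kappa = fleiss_kappa(ratings)
```

With two unanimous cases and one evenly split case, the observed agreement is 7/9 against a chance agreement of 1/2, giving a kappa of 5/9 (moderate agreement).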

Discussion
In this pilot study, we trained a deep learning model for the classification of neoplastic cells in WSIs of LBC specimens. The model achieved overall a good performance, with ROC AUCs of 0.96 (0.92-0.99) on the full agreement, 0.89 (0.81-0.96) on the clinical balance reviewed, and 0.92 (0.89-0.94) on the equal balance reviewed test sets.
Looking at the interobserver concordance among cytoscreeners in Table 4, it is obvious that there is considerable interobserver variability, with poor concordance on NILM cases even for binary classification (NILM vs. neoplastic). In addition, there is the problem of human fatigue due to the continuous observation of a large number of cases. Therefore, when considering future quality control, it may be necessary to conduct screening using deep learning model(s) with guaranteed accuracy, such as those of this study, at least for the binary classification (NILM vs. neoplastic), and to have cytoscreeners and cytopathologists conduct detailed assessments for the subclassification (e.g., NILM, ASC-US, ASC-H, LSIL, HSIL, SCC, and ADC).
From our results in Table 2, it is clear that there was interobserver variability among cytoscreeners in different clinical laboratories and hospitals. The clinical balance and equal balance test sets were prepared based on diagnostic (screening) reports from a clinical laboratory. The only difference between clinical balance and clinical balance-reviewed (likewise equal balance and equal balance-reviewed) was whether the set was additionally reviewed by two more cytoscreeners in different clinical laboratories and hospitals. All scores (ROC AUC, accuracy, sensitivity, and specificity) increased in the clinical balance-reviewed and equal balance-reviewed test sets as compared to the clinical balance and equal balance test sets (Table 2). Hence, our deep learning model could be helpful for standardising the screening process.
In routine cervical cancer screening at clinical laboratories and hospitals, it is difficult to introduce a screening programme dependent on cervical smears due to limited cytoscreener resources. LBC techniques have opened new possibilities for systematic cervical cancer screening, as LBC slides are amenable to high-throughput automated analysis. Especially for the detection of rare events on LBC slides, WSI and subsequent image analysis are of crucial importance for guaranteeing a standardised, high-quality readout [25]. Practical automated cervical cytology screening devices have been under development since the 1950s. Technological development in semi-automated screening devices for cervical cancer screening is very rapid; however, currently, no machines are available that provide fully automated screening by computer without human intervention. There are two FDA-approved semi-automated slide scanning devices on the market: the BD FocalPoint GS Imaging System and the HOLOGIC ThinPrep Imaging System. Both are designed to perform computer-assisted analysis of cellular images followed by location-guided screening of limited fields of view. FocalPoint-assisted smear reading has been proposed prior to conventional manual reading; the latter may be unnecessary for cases reported as No Further Review (NFR) and would be required for cases reported as Review (REV) [61]. FocalPoint-assisted practice showed statistically superior sensitivity and specificity when compared to conventional manual smear screening for the detection of HSIL and LSIL [14,62,63]. However, ASC-US sensitivity and specificity were not significantly different between FocalPoint-assisted practice and conventional screening [62]. Overall, for neoplastic slides (ASC-US, LSIL, and HSIL), FocalPoint-assisted practice achieved sensitivity in the range of 81.1–86.1% and specificity in the range of 84.5–95.1% [62].
Another study showed that FocalPoint-assisted reading was comparable to conventional reading, and the very low observed negative predictive value of an NFR report (0.02%) suggested that these cases might safely return to periodic screening [61]. The ThinPrep Imaging System (TIS) is an automated system that uses location-guided screening to assist cytoscreeners in reviewing ThinPrep Pap LBC slides [64]. TIS scans the LBC slides and identifies 22 fields of view (FOVs) on each slide based on optical density measurements and other features [64]. It has been reported that TIS is ideally suited to the rapid screening of negative cases; however, the sensitivity and specificity of TIS (85.19% and 96.67%, respectively) were equivalent to those of manual screening (89.38% and 98.42%, respectively) [65]. In another study, for diagnostic categories of neoplastic slides (ASC-US, LSIL, and HSIL), TIS practice achieved sensitivity in the range of 79.2–82.0% and specificity in the range of 97.8–99.6% [64].
As shown in Table 2, our LBC cervical cancer screening deep learning model exhibited around 90% accuracy (range: 89–91%), 86% sensitivity (range: 84–89%), and 91% specificity (range: 90–92%) on the full agreement, clinical balance-reviewed, and equal balance-reviewed test sets; these scores are on par with or better than those of the existing assistance systems mentioned above.

Conclusions
In the present study, we trained a deep learning model for the classification of neoplastic cervical LBC in WSIs. We evaluated the model on three test sets, achieving ROC AUCs for WSI diagnosis in the range of 0.89–0.96. The main advantage of our deep learning model is that it can evaluate cervical LBC at the WSI level; it is able to infer whether a cervical LBC WSI is NILM (non-neoplastic) (Figure 5) or neoplastic (Figure 4). This makes it possible to use a deep learning model such as ours as a tool to aid the cervical screening process, which could potentially be used to rank cases by order of priority. Cytoscreeners would then perform full screening and subclassification (e.g., ASC-US, ASC-H, LSIL, HSIL, SCC, ADC) on the neoplastic output cases after the primary screening by our deep learning model. This could reduce their working time, as the model would have highlighted the potential suspected neoplastic regions, so they would not have to perform an exhaustive search through the entire WSI.

Funding:
The authors received no financial support for the research, authorship, or publication of this study.

Institutional Review Board Statement:
The experimental protocol in this study was approved by the ethical board of the private clinical laboratory. All research activities complied with all relevant ethical regulations and were performed in accordance with relevant guidelines and regulations in the clinical laboratory. Due to the confidentiality agreement with the private clinical laboratory, the name of the clinical laboratory cannot be disclosed.

Informed Consent Statement:
Informed consent to use cytopathological samples (liquid-based cytology glass slides) and cytopathological reports for research purposes had previously been obtained from all patients and the opportunity for refusal to participate in research had been guaranteed by an opt-out manner.

Data Availability Statement:
The datasets used in this study are not publicly available due to specific institutional requirements governing privacy protection; however, they are available from the corresponding author and from the private clinical laboratory in Japan on reasonable request. Restrictions apply based on the data use agreement, which was made according to the Ethical Guidelines for Medical and Health Research Involving Human Subjects as set by the Japanese Ministry of Health, Labour, and Welfare.