Pathomics and Deep Learning Classification of a Heterogeneous Fluorescence Histology Image Dataset

Abstract: Automated pathology image classification through modern machine learning (ML) techniques in quantitative microscopy is an emerging AI application area that aims to alleviate the increased workload of pathologists and improve diagnostic accuracy and consistency. However, very few efforts focus on fluorescence histology image data, which are challenging to classify, not least because variable image acquisition parameters in pooled data can diminish the performance of ML-based decision support tools. To this end, this study introduces a harmonization preprocessing protocol for image classification within a fluorescence dataset that is heterogeneous in terms of image acquisition parameters, and presents two state-of-the-art feature-based approaches for differentiating three classes of nuclei labelled by an expert: (a) pathomics analysis, scoring an accuracy (ACC) of up to 0.957 ± 0.105, and (b) a transfer learning model, exhibiting an ACC of up to 0.951 ± 0.05. The proposed analysis pipelines offer good differentiation performance in the examined fluorescence histology image dataset despite the heterogeneity caused by the lack of a standardized image acquisition protocol.


Introduction
With the rapid development of graphics processing units (GPUs), artificial intelligence (AI) applications are rapidly being introduced in the field of digital and quantitative pathology. In particular, convolutional neural networks (CNNs) through deep learning and pathomics have radically advanced the research opportunities in this field, leading to many novel diagnostic applications. Examples of AI in this field include tissue classification methods, nuclei segmentation, and the prediction of disease progression and therapy response.
The majority of published works using machine learning (ML) or deep learning (DL) techniques for classification or segmentation focus on H&E histopathology images across different types of tissue and disease [1][2][3]. Some of them use patches of samples [4][5][6], while more recent publications deal with whole slide images [7][8][9]. Notably, few scientific papers employ ML techniques on fluorescence data, possibly because the number of publicly available annotated fluorescence image datasets is limited. In addition, these datasets do not cover a broad range of tissues and preparations, while at the same time there is significant variability in imaging conditions, leading to large heterogeneity in data across centers. A possible explanation is that the main application of fluorescence in the field of surgical pathology is interphase FISH (fluorescence in situ hybridization), a high-cost, time-consuming technique. This technique is not used for all tumors but for the diagnosis of a limited number of neoplasms (mainly sarcomas, lymphomas and some solid tumors) by detecting recurrent chromosomal aberrations specific to each neoplasm (deletions, gains, translocations, amplifications, and polysomy), as well as for identifying chromosomal alterations with established therapeutic or even predictive implications. As a result, available datasets of fluorescence images of normal tissues, or of tumors without established diagnostic, therapeutic or predictive chromosomal alterations, are constructed mainly to serve training or research purposes, and the application of ML to fluorescence microscopy data is still sporadic and limited.
Based on the aforementioned considerations, this study constitutes an in-depth analysis of image classification by using a recent publicly available fluorescence dataset which exhibits a very high degree of heterogeneity in terms of imaging acquisition parameters. The main aim of this study is to address a challenging nucleus classification problem by using advanced AI methods and report the performance and robustness of the proposed pathomics and deep learning methodologies for fluorescence histology image classification.

Related Works
Since there are no prior published studies on the examined dataset for direct comparison, relevant works are briefly summarized. Regarding traditional ML approaches, our bibliographic search resulted in only a few relevant publications, indicating that this AI application field is still understudied. In particular, in [10] various machine learning techniques were evaluated to accurately detect myelin in multi-channel microscopy images of a mouse stem cell. Another study presents the application of machine learning (a classification pipeline) for the real-time visualization of tumor margins in excised breast specimens using fluorescence lifetime imaging [11]. Furthermore, in [12] the authors developed a machine learning classification method for annotating the progression through morphologically distinct biological states in fluorescence time-lapse imaging. Additionally, traditional texture and statistical features were extracted from both pathology and radiology images to investigate the underlying associations between cellular density and tumor heterogeneity [13]. Finally, in [14] the authors developed a deep learning framework that virtually generates hematoxylin and eosin (H&E) images from unstained-tissue 4′,6-diamidino-2-phenylindole dihydrochloride (DAPI) images.
Regarding deep learning-based analyses, several fluorescence imaging applications have been reported, including super resolution on microscopy images [15][16][17][18], conversion of standard hematoxylin and eosin stained histology images to UV light fluorescence images [19], and particle detection [20] on sub-cellular sized molecules and virus structures. Additionally, deep learning in pathology images has been successfully applied in cancer research [21,22], leading to state-of-the-art tissue sample characterization. Jang et al. [21] presented a deep learning-based normal versus tumor differentiation model that was trained on a specific type of cancer and evaluated on different cancer tissues such as liver, bladder, colon, and lung. Valieris et al. [22] proposed a patch-based methodology for whole-slide images, with the probability of DNA repair deficiency being assigned by a convolutional neural network and a recurrent neural network for an aggregated prediction on a slide basis. This analysis achieved an AUC of 0.8 for breast and 0.81 for gastric cancer.

Dataset Description and Labeling
The dataset used in this work is an annotated dataset of tightly aggregated nuclei from multiple tissues, suitable for training machine learning-based nuclear segmentation algorithms. The dataset is publicly available and covers sample preparation methods generally used in quantitative immunofluorescence microscopy. It includes N = 79 fluorescence images of immunofluorescence- and 4′,6-diamidino-2-phenylindole dihydrochloride (DAPI)-stained samples containing a total of 7813 nuclei. More specifically, 41 images were derived from a human keratinocyte cell line (normal tissue), 10 images from one Schwann cell stroma-rich tissue cryosection (from a ganglioneuroblastoma patient), 19 images from seven neuroblastoma patients, 1 image from a Wilms' tumor patient and 8 images from two neuroblastoma patients. From the data description, it is noteworthy that there is extensive heterogeneity in the dataset in terms of magnification, vendor, signal-to-noise ratio, image size, and diagnosis. More information about the dataset can be found in the work of Kromp F. et al. [23].
The dataset contained 41 images of normal cell nuclei from a human keratinocyte cell line and 38 images of pathological nuclei from three different malignant pediatric tumors: neuroblastoma (Schwannian stroma-poor), ganglioneuroblastoma (Schwannian stroma-rich; specifically, the Schwannian stroma-rich component of the tumor), and Wilms' tumor. Neuroblastoma and ganglioneuroblastoma belong to the group of peripheral neuroblastic tumors, which is quite heterogeneous in terms of biologic, genetic and morphologic features; these tumors evolve from immature sympathetic neuroblasts during development and constitute one of the commonest extra-cranial solid tumors of childhood. Microscopically, the Schwannian stroma-poor tumors are composed of neuroblastic cells forming groups or nests separated by delicate, often incomplete stromal septa (neuropil) with no or very limited Schwannian proliferation, while ganglioneuroblastomas are characterized by two distinctive components: (i) a mature Schwannian stromal component with individually scattered mature and/or maturing ganglion cells and (ii) a neuroblastic component [24]. Wilms' tumor (nephroblastoma) is a malignant embryonal neoplasm which affects 1 in 8000 children, mainly aged <10 years; it originates from nephrogenic blastemal cells and mimics the developing kidney, showing divergent patterns of differentiation [25].
Thus, in terms of classification, the 41 images of nuclei from the human keratinocyte cell line were labelled by an expert pathologist as "normal"; the 10 images from the Schwann cell stroma-rich component of ganglioneuroblastomas as "benign", as they consisted exclusively of nuclei of the mature and maturing ganglion cells scattered in between the mature Schwann cell stroma; and the remaining 28 images, belonging to the neuroblastoma and Wilms' tumor categories, as "malignant".

Data Pre-Processing
Since the dataset was very heterogeneous in terms of magnification, as illustrated in Figure 1 (left), an automated method was developed to normalize the sizes of the nuclei across the dataset. The main rationale for this preprocessing step is that pathomics features mainly rely on texture, which is well known to be scale-dependent [26]. In more detail, the average nucleus area (A) was computed for each image, and an algorithm adjusted the size of the images in order to achieve similar nuclei sizes in all images. This harmonization step was necessary in order to produce comparable and reliable shape and texture features from each nucleus. The first step of this process was to find the minimum value (M) of the average nucleus area across all images. Next, each image was resized in steps of 0.05% until its average nucleus area A matched the minimum value M. To ensure that all images had the same size prior to feature extraction, the final step was to pad all the processed images with zeros. The workflow is shown in Figure 1. Lastly, the same procedure was repeated for the annotated images (masks), which were also provided in the dataset. The area of each object (nucleus) was computed using the label function of the Mahotas library [27].
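The harmonization loop described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the nearest-neighbour resampling, and the tolerance `eps` are assumptions made for illustration, and the mask is assumed to be an integer label image (0 for background) such as a labelling function would produce.

```python
import numpy as np

def mean_nucleus_area(labels):
    """Mean pixel area of the labelled objects; label 0 is background."""
    ids, counts = np.unique(labels[labels > 0], return_counts=True)
    return float(counts.mean()) if ids.size else 0.0

def resize_nearest(arr, scale):
    """Nearest-neighbour resize, which never creates new label values."""
    h, w = arr.shape
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return arr[np.ix_(rows, cols)]

def harmonize(image, labels, target_area, step=0.0005, eps=1.0):
    """Shrink image and mask in 0.05% steps until the mean area A reaches M."""
    scale = 1.0
    out_img, out_lab = image, labels
    while mean_nucleus_area(out_lab) > target_area + eps and scale > step:
        scale -= step
        out_img = resize_nearest(image, scale)
        out_lab = resize_nearest(labels, scale)
    return out_img, out_lab, scale

def pad_to(arr, shape):
    """Zero-pad to a common shape (content kept in the top-left corner)."""
    out = np.zeros(shape, dtype=arr.dtype)
    out[:arr.shape[0], :arr.shape[1]] = arr
    return out
```

In this sketch, `target_area` plays the role of M, i.e., the smallest per-image mean nucleus area in the dataset, so every other image is only ever shrunk, never enlarged.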

Feature Extraction
Feature extraction was based on the annotations provided with the dataset and used a fixed bin size set to the default value, a choice that has been reported to preserve a higher number of reproducible features in radiomics studies [28][29][30]. Furthermore, we used all the available feature classes from the pyradiomics library [31]: first-order statistics, shape-based 2D features, and higher-order statistical texture features, namely the Grey-Level Co-Occurrence Matrix (GLCM), Grey-Level Run Length Matrix (GLRLM), Grey-Level Size Zone Matrix (GLSZM) and Grey-Level Dependence Matrix (GLDM). Additionally, 2D local binary patterns (LBP) and image transformations such as logarithmic, exponential, gradient, and wavelet transforms were used, leading to 1032 features.
Figure 1. Workflow for image pre-processing. "A" denotes the mean nucleus size/area of the current image, "M" the minimum mean nucleus area of all images and " " denotes a small number of pixels.
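To make the role of the fixed bin size concrete, the following sketch computes a grey-level co-occurrence matrix and two classic features (contrast and energy) with plain NumPy. It is an illustration only and does not reproduce the pyradiomics implementation: the function name, the single horizontal offset, and the illustrative `bin_width` value are assumptions.

```python
import numpy as np

def glcm_contrast_energy(image, mask, bin_width=25.0):
    """Contrast and energy of a symmetric, horizontal-offset GLCM inside a mask."""
    roi = mask > 0
    # Fixed-bin-width quantisation: levels start at 1 in the ROI, 0 is background.
    quantised = np.zeros(image.shape, dtype=int)
    lowest_bin = np.floor(image[roi].min() / bin_width)
    quantised[roi] = (np.floor(image[roi] / bin_width) - lowest_bin).astype(int) + 1
    n_levels = int(quantised.max())

    # Count horizontally adjacent pixel pairs that both lie inside the ROI.
    left, right = quantised[:, :-1], quantised[:, 1:]
    inside = (left > 0) & (right > 0)
    glcm = np.zeros((n_levels, n_levels), dtype=float)
    np.add.at(glcm, (left[inside] - 1, right[inside] - 1), 1.0)
    glcm = glcm + glcm.T  # make the matrix symmetric

    total = glcm.sum()
    if total == 0:
        return 0.0, 0.0
    p = glcm / total
    i, j = np.indices(p.shape)
    contrast = float(((i - j) ** 2 * p).sum())
    energy = float((p ** 2).sum())
    return contrast, energy
```

Because the quantisation depends on the absolute grey-level range, a fixed bin width keeps the number of levels tied to the intensity spread of each nucleus, whereas a fixed bin count would rescale every nucleus to the same number of levels.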

Feature Selection
To identify a meaningful group of features with minimum redundancy and relevant information characterizing the three labelled nuclei types, feature selection was performed on the training set with the pymrmr library [32] based on the mutual information difference (MID) method. The feature subset identified on the training set was then applied to the unseen testing set. For our experiments, we used a step size equal to 1 and computed the corresponding performances when selecting from 1 to 50 important features. As has been experimentally shown for the aforementioned feature selection methodology [32], the computational complexity increases exponentially and, after a certain number of selected features, the error rate reaches a plateau. Therefore, several feature counts had to be tested in our analyses in order to find the optimal number of selected features based on error minimization.
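The MID criterion can be sketched as a greedy search that, at each step, adds the feature maximizing its mutual information with the class label minus its mean mutual information with the features already selected. The code below is a simplified illustration, not the pymrmr implementation; the function names are hypothetical and the features are assumed to have been discretized beforehand.

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete 1D variables."""
    x_values, x_idx = np.unique(x, return_inverse=True)
    y_values, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((x_values.size, y_values.size), dtype=float)
    np.add.at(joint, (x_idx, y_idx), 1.0)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0
    return float((joint[nonzero] * np.log(joint[nonzero] / (px @ py)[nonzero])).sum())

def mrmr_mid(features, target, n_selected):
    """Greedy mRMR selection with the mutual information difference score."""
    n_features = features.shape[1]
    relevance = [mutual_information(features[:, j], target) for j in range(n_features)]
    selected = [int(np.argmax(relevance))]
    while len(selected) < min(n_selected, n_features):
        best_feature, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean(
                [mutual_information(features[:, j], features[:, s]) for s in selected]
            )
            score = relevance[j] - redundancy  # MID: relevance minus redundancy
            if score > best_score:
                best_feature, best_score = j, score
        selected.append(best_feature)
    return selected
```

In a cross-validation setting, such a selection would be run on the training folds only, and the resulting column indices would then be applied unchanged to the held-out fold.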

Deep-Analysis Specific Image Preprocessing
Deep-learning analysis requires a uniform pixel array dimensionality in vertical and horizontal axes. After the aforementioned data preprocessing by rescaling with respect to the nuclei size as described in Section 2.2, different image sizes were produced. Thus, additional preprocessing steps involving image cropping and padding were applied to ensure the same image size across every sample. Every pretrained model input was set to 250 by 250 pixels. Consequently, original images with higher pixel count were cropped and padded into sub-images to match the aforementioned input. This augmented the examined dataset from 79 to 105 images. The image identifier and nuclei characterization label of the additional 26 images were preserved to avoid compromising the cross-validation process. Therefore, the sample stratification was based on the unique image identifiers.
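A possible form of the cropping and padding step is sketched below. The exact cropping rule is not specified in the text, so the non-overlapping grid used here is an assumption, as are the function and key names; the point of the sketch is that every sub-image carries the identifier and label of its parent image.

```python
import numpy as np

def tile_image(image, image_id, label, size=250):
    """Split a 2D image into size x size tiles, zero-padding the border tiles."""
    tiles = []
    height, width = image.shape
    for top in range(0, height, size):
        for left in range(0, width, size):
            patch = image[top:top + size, left:left + size]
            padded = np.zeros((size, size), dtype=image.dtype)
            padded[:patch.shape[0], :patch.shape[1]] = patch
            # Keep the parent identifier so that all tiles of one image
            # can later be assigned to the same cross-validation fold.
            tiles.append({"image_id": image_id, "label": label, "pixels": padded})
    return tiles
```

Keeping `image_id` with each tile is what allows the later stratification to be performed per original image rather than per sub-image.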

Transfer Learning Analysis
A transfer learning approach was followed, in which models pretrained on the ImageNet dataset [33] served as an "off-the-shelf" feature extraction module. Thus, training of deep models on the examined dataset was avoided, since the limited size of the dataset was inappropriate for de novo network development. In particular, seven families of model architectures with their variations were tested, namely Xception [34], Inception [35], ResNet [36], VGG [37], MobileNet [38], DenseNet [39], and NASNet [40]. Their architectural differences in terms of number of layers and learned parameters, type of convolutional kernels and uniqueness of layer organization produced a diverse set of deep imaging descriptors.
The pretrained models were downloaded from the online repository of Keras [41]. The fully connected and classification layers were omitted because they were trained to differentiate among 1000 classes of natural images. The remaining weights of the convolutional layers were transferred to a new fully convolutional model for feature extraction on fluorescence microscopy images.
Additionally, three different approaches were implemented during feature extraction: raw features from the last convolutional layer of each model, and features obtained with global average pooling or global maximum pooling at a kernel level. Following extraction, an unsupervised variance-based feature selection process was applied to reduce the dimensionality of the deep vectors. Five thresholds, from 0.0 to 0.5 variance at a feature level, were examined. The resulting deep descriptors were standardized by rescaling values on a per-feature basis prior to classification. Finally, traditional machine learning algorithms (SVM RBF and Logistic Regression) were trained on these deep descriptors to distinguish among normal, benign, and malignant nuclei. A detailed depiction of the overall methodology for the proposed deep analysis is illustrated in Figure 2. The source code of the aforementioned analysis can be found at https://github.com/trivizakis/deepcell (access date: 9 April 2021).
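The three descriptor types and the subsequent variance filtering and standardization can be illustrated on a generic activation tensor. The sketch below is not taken from the linked repository: it assumes channels-last feature maps of shape (images, height, width, kernels), and all function names are hypothetical.

```python
import numpy as np

def pooled_descriptors(feature_maps):
    """Raw, global-average-pooled and global-max-pooled descriptors."""
    n_images = feature_maps.shape[0]
    return {
        "raw": feature_maps.reshape(n_images, -1),  # flattened activations
        "avg": feature_maps.mean(axis=(1, 2)),      # one value per kernel
        "max": feature_maps.max(axis=(1, 2)),       # one value per kernel
    }

def fit_variance_filter(train, threshold):
    """Boolean mask of the features whose training variance exceeds the threshold."""
    return train.var(axis=0) > threshold

def fit_standardizer(train):
    """Per-feature mean and standard deviation estimated on the training data."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return mean, std

def standardize(data, mean, std):
    return (data - mean) / std
```

In this sketch the filter mask and the scaling statistics are estimated on the training samples and then reused for the test samples, which is one way to keep the held-out data out of the preprocessing.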

Figure 2.
Pretrained models from the Keras repository were leveraged for the proposed deep learning analysis, specifically in feature extraction. The unsupervised threshold-based feature selection process was followed by a classifier, either SVM RBF or logistic regression.

Ternary Classification
In order to differentiate normal, benign and malignant nuclei images, two classifiers from the scikit-learn library [42] were used for both pathomics and deep descriptors: logistic regression implemented with the one-versus-rest (OVR) scheme, and the support vector machine (SVM) with the radial basis function (RBF) kernel. SVMs have been used extensively in medical image classification [43,44] for differentiating tissue by utilizing deep features. In the context of nuclei type differentiation, both classifiers were trained in a 10-fold cross-validation scheme on the extracted imaging and deep descriptors. The data stratification was applied on an image identifier basis with respect to the class representation across folds, to avoid sample selection bias and overfitted models.
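Stratification on an image identifier basis can be sketched as follows: folds are assigned per unique image rather than per sample, and images of each class are spread across the folds in turn. This is a simplified illustration under stated assumptions (hypothetical function name, round-robin assignment, fixed seed) and not necessarily the procedure used in the study.

```python
import random
from collections import defaultdict

def assign_folds(image_ids, labels, n_folds=10, seed=0):
    """Return one fold index per sample; samples sharing an image id share a fold."""
    # One label per unique image identifier.
    label_of = {}
    for image_id, label in zip(image_ids, labels):
        label_of.setdefault(image_id, label)

    ids_by_class = defaultdict(list)
    for image_id, label in label_of.items():
        ids_by_class[label].append(image_id)

    rng = random.Random(seed)
    fold_of = {}
    position = 0
    for label in sorted(ids_by_class):
        ids = ids_by_class[label]
        rng.shuffle(ids)
        for image_id in ids:
            # Round-robin assignment keeps class proportions similar across folds.
            fold_of[image_id] = position % n_folds
            position += 1
    return [fold_of[image_id] for image_id in image_ids]
```

With this assignment, sub-images cropped from the same original image can never appear on both sides of a train/test split.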

Model Performance Evaluation Metrics
In order to evaluate the performance of both pathomics and deep learning analyses, the mean AUC and ACC with their standard deviations were calculated on the unseen testing sets. For pathomics, the number of selected features was chosen by optimizing the classification accuracy.
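For a three-class problem, the AUC has to be aggregated over classes; the text does not state which averaging was used, so the macro-averaged one-versus-rest AUC below is an assumption. The sketch computes ACC and AUC for one fold and summarizes fold scores as mean and standard deviation; all function names are hypothetical.

```python
import numpy as np

def binary_auc(scores, is_positive):
    """AUC as the probability that a positive sample outranks a negative one."""
    pos, neg = scores[is_positive], scores[~is_positive]
    if pos.size == 0 or neg.size == 0:
        return float("nan")  # undefined when a class is absent from the fold
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (pos.size * neg.size))

def macro_ovr_auc(probabilities, y_true):
    """Unweighted mean of the one-versus-rest AUC over the classes."""
    aucs = [
        binary_auc(probabilities[:, c], y_true == c)
        for c in range(probabilities.shape[1])
    ]
    return float(np.nanmean(aucs))

def accuracy(probabilities, y_true):
    return float((probabilities.argmax(axis=1) == y_true).mean())

def mean_and_std(fold_scores):
    """Summary reported as mean ± standard deviation over the folds."""
    scores = np.asarray(fold_scores, dtype=float)
    return float(scores.mean()), float(scores.std())
```

Note that `mean_and_std` uses the population standard deviation; whether the reported deviations use this or the sample estimate is not stated in the text.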

Results
The examined fluorescence dataset has a class distribution of 51.9% for normal, 12.7% for benign and 35.4% for malignant samples. With varying magnification scales, the original image dimensions ranged from 550 by 430 to 1360 by 1024 pixels. A harmonization process prior to analysis, as defined in Section 2.2 and depicted in Figure 1, was motivated by the need for the nuclei's shape and texture features to be comparable. Additional cropping and padding were performed on the harmonized images for deep feature extraction, to achieve a uniform image shape of 250 by 250 pixels for each of the examined fluorescence images, as described in Section 2.4.1.

Pathomics
After the extraction of the 1032 textural and statistical features, a feature selection process was performed with the Minimum Redundancy Maximum Relevance (mRMR) algorithm that identifies the most relevant patterns in the training set. A step size equal to one was used by mRMR to compute the corresponding performances using one up to 50 selected features. However, for the sake of simplicity only indicative performance results for a subset of the used values are reported in Table 1. In more detail, for the case of the logistic regression classifier, the results varied from 0.956 to 0.996 for AUC and 0.8 to 0.957 for ACC. The SVM RBF classifier resulted in AUC values from 0.954 to 0.986 and ACC from 0.786 to 0.929.

Transfer Learning
The experiments were performed on computational infrastructure featuring a 10-core Xeon processor with 32 gigabytes of RAM and an Nvidia GTX 1070 graphics card with 8 gigabytes of VRAM. The extraction of deep features from a single image requires approximately 14 ms to 426 ms, depending on the architecture. Seven deep architecture families with a total of eighteen model variations were examined. The models were pretrained on the ImageNet dataset, the fully connected layers were discarded, and three methods for extracting the deep features were applied, as described in Section 2.4.2. The "off-the-shelf" feature extractor transfer learning technique includes: (a) preserving the convolutional layer weights of the Keras pretrained model, (b) introducing a new input for images of size 250 by 250 pixels, and (c) removing the fully connected and Softmax layers (Figure 2). To prevent samples from the same image being used in both the training and testing sets, a stratified 10-fold cross-validation technique was used on a unique image identifier basis.
Additionally, five variance thresholds ranging from 0.0 to 0.5 were applied, significantly reducing the dimensionality of the deep descriptors. Finally, traditional machine learning classifiers were trained on each set of deep descriptors with the corresponding labels to distinguish among normal, benign, and malignant nuclei, and they were capable of reaching a testing ACC of up to 0.945 ± 0.06. In terms of ACC performance, Xception (0.923-0.944), Inception (0.916-0.945), and DenseNet (0.916-0.951) were the top performing deep descriptors across all three feature types, as can be observed in Table 2. It is worth noting that models based on the VGG and ResNet families consistently gave an error (12.3-35.4%) higher than or comparable to the benign class share (12.7%), despite their relatively high separability score (AUC up to 0.896), indicating that these models have likely been biased with respect to the minority class.

Discussion
This study was mainly focused on the classification of a publicly available fluorescence dataset. The dataset contained 41 images of normal nuclei of human cells and 38 images of pathological nuclei from three different types of rare pediatric embryonal tumors. Two of them, neuroblastoma and ganglioneuroblastoma, are neuroblastic tumors of different grades of differentiation and malignancy which belong to the group of tumors arising from the sympathoadrenal lineage of the neural crest during development, while the third one, Wilms' tumor (nephroblastoma), is a malignant embryonal neoplasm derived from nephrogenic blastemal cells. Two different AI pipelines based on pathomics and deep learning were implemented for the automated classification of normal, benign and malignant types of nuclei as labelled by the expert. The classification was a challenging task for the AI pipelines since the dataset exhibited significant heterogeneity in terms of vendor, image size, magnification, and signal-to-noise ratio.
The first step of our analysis was to address the heterogeneity in nuclei sizes emanating from different magnifications and image sizes since size and shape related features from the pyradiomics library could potentially introduce exaggerated values not corresponding to actual differences and lead to unreliable results. To overcome this limitation, an adaptive pre-processing technique in Section 2.2 was proposed to ensure that in all images, nuclei sizes fall within a similar size range. This harmonization step was used for both presented ML analyses in order to ensure uniform nuclei image dimensions.
Regarding the classification performance with pathomics feature extraction, we experimentally showed, by repeating the classification with different numbers of features selected by the mRMR algorithm (pymrmr library), that the minimum error was achieved with 20 selected features, as presented in Table 1 for both classifiers. Additionally, it is noteworthy that when more than 20 selected features are used, the accuracy drops (Table 1). Despite the heterogeneous nature of the dataset, classification through pathomics analysis exhibited the highest performance, with an AUC of 0.986 and an ACC of 0.957 for the logistic regression classifier. The SVM RBF classifier performed almost as well as logistic regression, with an AUC of 0.965 and an ACC of 0.929.
An additional harmonization pre-processing stage involving image cropping and padding in the DL approach was necessary prior to performing the feature extraction from the pretrained models (Section 2.4.1) to ensure consistent input image size. Due to the small size of the examined dataset, only transfer learning techniques were considered.
DenseNet consistently achieved the highest performance (up to ACC 0.951 ± 0.05 and AUC 0.962 ± 0.04) regardless of the employed pooling technique, as shown in Table 2. The state-of-the-art performance of the proposed methodology demonstrates the feasibility of DL analysis in fluorescence histology image analysis and modelling despite the limited size of the available data. This encouraging result indicates that AI can be used to advantage in addressing unmet clinical needs in fluorescence pathology image analysis. To this end, creating larger, labelled datasets that are diverse in terms of vendor and image settings is of utmost importance for developing more generalizable and trustworthy AI models in this field.
The pathomics-based models with 3, 6, 10 (LR and SVM RBF) and 40, 50 (SVM RBF) features, as well as the deep descriptors from VGG and ResNet (SVM RBF with all three types of deep features) in Table 2, may have made biased predictions, because their prediction error (12.3-20%, Tables 1 and 2) is higher than or comparable to the minority class share (12.7%). Although the high AUC values suggest that the classifier is capable of effectively separating samples from the three classes, the lower accuracy score (ACC) in these cases indicates a biased classifier.
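The reasoning behind this threshold can be made explicit with a short worked example based on the class counts of the dataset (41 normal, 10 benign, 28 malignant). The scenario is hypothetical and is not a result of the study: it assumes a classifier that never predicts one class but labels every other image correctly.

```python
# Class counts of the examined dataset, as reported in the dataset description.
CLASS_COUNTS = {"normal": 41, "benign": 10, "malignant": 28}

def accuracy_if_class_ignored(class_counts, ignored_class):
    """Accuracy of a classifier that is perfect except for one ignored class."""
    total = sum(class_counts.values())
    return (total - class_counts[ignored_class]) / total

acc = accuracy_if_class_ignored(CLASS_COUNTS, "benign")
error_percent = 100.0 * (1.0 - acc)
print(f"ACC = {acc:.3f}, error = {error_percent:.1f}%")
```

Under these assumptions, the error equals the benign class share (10 of 79 images), which is why an error at or above roughly this level is compatible with a model that has effectively ignored the minority class.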
Leveraging AI for characterizing vast amounts of pathology image data can spare clinical experts from tedious and time-consuming tasks, thus alleviating their heavy workload. At the same time, the collaboration of humans and AI has the potential to augment the overall efficiency of the decision-making process based on pathology image analysis.
We are aware that our research has some limitations. The first limitation arises from the relatively small size of the dataset used (N = 79 images). That said, the analysis pipeline was carefully selected with the size of the dataset in mind, and this was the main reason that more traditional techniques were used. The proposed pipeline should be further evaluated on larger and even more diverse datasets to promote the generalizability of the results. Furthermore, alternative feature selection techniques can be tested in the context of a more extended study. In addition, we are aware that the image pre-processing method involving downsampling could lead to loss of image information, but this step was necessary for the DL models. Lastly, different tissue preparation processes for fluorescence imaging, as well as different imaging settings, lead to different noise distributions and increased data heterogeneity, posing additional challenges for AI classification algorithms.

Conclusions
The proposed classification with pathomics and DL methods demonstrated good performance (ACC up to 0.957 ± 0.105 for pathomics and 0.951 ± 0.05 for DL) in differentiating between normal, benign and malignant nuclei types. These results indicate that the proposed classification scheme is a promising framework for aiding pathology fluorescence image analysis and interpretation. To accelerate the clinical translation of such tools, a closer collaboration between AI researchers and clinicians is required. At the same time, the development of a larger fluorescence histology image database is a sine qua non for optimizing such DL models and increasing their robustness and generalizability.

Conflicts of Interest:
The authors declare no conflict of interest.