An Efficient Lightweight CNN and Ensemble Machine Learning Classification of Prostate Tissue Using Multilevel Feature Analysis

Abstract: Prostate carcinoma arises when cells and glands in the prostate change their shape and size from normal to abnormal. Typically, the pathologist's goal is to classify the stained slides and differentiate normal from abnormal tissue. In the present study, we used a computational approach to classify images and features of benign and malignant tissues using artificial intelligence (AI) techniques. Here, we introduce two lightweight convolutional neural network (CNN) architectures and an ensemble machine learning (EML) method for image and feature classification, respectively. Moreover, classification using pre-trained models and handcrafted features was carried out for comparative analysis. Binary classification was performed between the two grade groups (benign vs. malignant), and quantile-quantile plots were used to show the predicted outcomes. Our proposed models for deep learning (DL) and machine learning (ML) classification achieved promising accuracies of 94.0% and 92.0%, respectively, based on non-handcrafted features extracted from CNN layers. These models therefore predicted with high accuracy while using few trainable parameters and CNN layers, highlighting the importance of DL and ML techniques and suggesting that the computational analysis of microscopic anatomy will be essential to the future practice of pathology.


Introduction
Image classification and analysis have become popular in recent years, especially for medical images. Cancer diagnosis and grading are often performed and evaluated using AI, as these processes have become increasingly complex because of growth in cancer incidence and in the number of specific treatments. The analysis and classification of prostate cancer (PCa) are among the most challenging and difficult. PCa is the second most commonly diagnosed cancer among men in the USA and Europe, affecting approximately 25% of patients with cancer in the Western world [1]. PCa has always been an important challenge for pathologists and medical practitioners with respect to detection, analysis, diagnosis, and treatment. Recently, researchers have analyzed PCa in young Korean men (<50 years of age), considering the pathological features of radical prostatectomy specimens and biochemical recurrence of PCa [2].

Related Work
A CNN was first used on medical images by Lo et al. [35,36]. LeNet, developed by LeCun et al., later succeeded in a real-world application, recognizing hand-written digits [37]. Subsequent CNN-based methods showed the potential for automated image classification and prediction, especially after the introduction of AlexNet, which won the ImageNet challenge. Since then, machine-assisted categorization and automatic detection of cancer in histological sections have performed well in the early detection of cancer.
Zheng et al. [38] developed a new CNN-based architecture for histopathological images, using the 3D multiparametric MRI data provided by the PROSTATEx challenge. Data augmentation was performed through 3D rotation and slicing to incorporate the 3D information of the lesion. They achieved the second-highest AUC (0.84) in the PROSTATEx challenge, which shows the potential of deep learning for cancer imaging.
Han et al. [39] used breast cancer samples from the BreaKHis dataset to perform multi-class classification over the subordinate classes of breast cancer (ductal carcinoma, fibroadenoma, lobular carcinoma, adenosis, phyllodes tumor, tubular adenoma, mucinous carcinoma, and papillary carcinoma). The authors developed a new deep learning model and achieved an average accuracy of 93.2% on a large-scale dataset.
Kumar et al. [12] performed k-means segmentation to separate the background cells from microscopy biopsy images. They extracted morphological and textural features for the automated detection and classification of cancer. They used different types of machine learning classifiers (random forest, support vector machine, fuzzy k-nearest neighbor, and k-nearest neighbor) to classify connective, epithelial, muscular, and nervous tissues. Finally, the authors obtained an average accuracy of 92.19% with their proposed approach using a k-nearest neighbor classifier.
Abraham et al. [40] used multiparametric magnetic resonance images and presented a novel method for the grading of prostate cancer. They used a VGG-16 CNN and an ordinal class classifier with J48 as the base classifier. The authors used the PROSTATEx-2 2017 grand challenge dataset for their research work. Their method achieved a positive predictive value of 90.8%.
Yoo et al. [3] proposed an automated CNN-based pipeline for prostate cancer detection using diffusion-weighted magnetic resonance imaging (DWI) for each patient. Their dataset comprised 427 patients, 175 with PCa and 252 without PCa. The authors used five CNNs based on the ResNet architecture and extracted first-order statistical features for classification. The analysis was carried out at both the slice level and the patient level. Their proposed pipeline achieved its best result (AUC of 87%) using CNN1.
Turki [41] performed machine learning classification for cancer detection using data samples of colon, liver, and thyroid cancer. Different ML algorithms were applied, such as DeepBoost, AdaBoost, XGBoost, and support vector machines. The performance of the algorithms was evaluated using the area under the curve (AUC) and accuracy on real clinical data.
Veta et al. [42] reviewed different methods for the analysis of breast cancer histopathology images. They discussed techniques for tissue image analysis and processing such as tissue component segmentation, nuclei detection, tubule segmentation, mitotic detection, and computer-aided diagnosis. Before discussing the image analysis algorithms, the authors gave an overview of tissue preparation, slide staining processes, and the digitization of histological slides. Their approach is to perform clustering or supervised classification to acquire binary or probability maps for the different stains.
Moradi et al. [43] performed prostate cancer detection based on different image analysis techniques. The authors considered ultrasound, MRI, and histopathology images, and among these, ultrasound images were selected for cancer detection. For the classification of prostate cancer, feature extraction was carried out using ultrasound echo radio-frequency (RF) signals, B-scan images, and Doppler images.
Alom et al. [44] proposed a deep CNN (DCNN) model for breast cancer classification. The model combines the strengths of three powerful CNN architectures: the inception network (Inception-v4), the residual network (ResNet), and the recurrent convolutional neural network (RCNN). Their proposed model was accordingly named the inception recurrent residual convolutional neural network (IRRCNN). They used two publicly available datasets, BreaKHis and the Breast Cancer (BC) classification challenge 2015. The test results were compared against existing state-of-the-art models for image-based, patch-based, image-level, and patient-level classification.
Wang et al. [45] proposed a novel method for the classification of colorectal cancer histopathological images. The authors developed a bilinear convolutional neural network (BCNN) model that consists of two CNNs whose layer outputs are combined using the outer product at each spatial location. Color deconvolution was performed to separate the tissue components (hematoxylin and eosin) for BCNN classification. Their proposed model performed better than a traditional CNN in classifying colorectal cancer images into eight different classes.
Bianconi et al. [20] compared the combined effect of six different colour pre-processing methods and 12 colour texture features on the patch-based classification of H&E stained images. They found that classification performance was poor using the generated colour descriptors. However, they achieved promising results using some pre-processing methods such as co-occurrence matrices, Gabor filters, and Local Binary Patterns.
Kather et al. [31] compared image texture features and pre-trained convolutional networks against variants of local binary patterns for classifying different types of tissue sub-regions, namely stroma, epithelium, necrosis, and lymphocytes. They used seven different datasets of histological images and classified the handcrafted and non-handcrafted features using standard classifiers (e.g., support vector machines), obtaining overall accuracies between 95% and 99%.

Tissue Staining
For the identification of cancerous cells, the prostate tissue was sectioned at a thickness of 4 µm. Deparaffinization (i.e., removal of paraffin wax from slides prior to staining) is especially important after tissue sectioning because, otherwise, only poor staining may be achieved. Each tissue section was therefore deparaffinized and rehydrated in an appropriate manner, and H&E staining was carried out using an automated stainer (Autostainer XL, Leica). Hematoxylin and eosin are positively and negatively charged, respectively. The nucleic acids in the nucleus are negatively charged components of basophilic cells; hematoxylin reacts with these components. Amino groups in proteins in the cytoplasm are positively charged components of acidophilic cells; eosin reacts with these components [46][47][48]. Figure 1 shows a visualization of the stained slides. Note that the two slides (a,b) are highly dissimilar in texture, which is useful for analysis and classification.

Data Collection
The whole-slide H&E stained images of size 33,584 × 70,352 pixels were acquired from the pathology department of the Severance Hospital of Yonsei University. The slides were scanned at 40× optical magnification with a 0.3 NA objective using a digital camera (Olympus C-3000) attached to a microscope (Olympus BX-51), and the slide images were further processed to generate 2D patches of multiple sizes (256 × 256, 512 × 512, and 1024 × 1024). The extracted regions of interest (ROIs) were sent to the pathologist for prostate cancer (PCa) grading. Figure 2 shows an example of the cropped patches extracted from a whole-slide image. Regions containing background and adipose tissue were excluded. After the labeled patches were received, 6000 samples were selected, all with size 256 × 256 pixels (24 bit/pixel); the samples were divided equally into two classes: cancerous and non-cancerous. The tissue samples used in our research were extracted from 10 patients. These samples had an RGB color coding scheme (8 bits each for red, green, and blue).
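The patch generation step can be illustrated with a minimal sketch. It tiles an image into non-overlapping 256 × 256 patches and skips tiles that are mostly background. The function name `extract_patches`, the brightness threshold, and the background fraction are hypothetical choices for illustration, not values from this study, and the sketch does not attempt the exclusion of adipose tissue.

```python
import numpy as np

def extract_patches(slide, patch_size=256, background_threshold=220,
                    max_background_fraction=0.5):
    """Tile an RGB image into non-overlapping square patches.

    A pixel counts as background when its mean RGB intensity exceeds
    `background_threshold` (bright, unstained glass). Tiles whose background
    fraction exceeds `max_background_fraction` are skipped. Both thresholds
    are illustrative assumptions.
    """
    patches = []
    height, width, _ = slide.shape
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patch = slide[top:top + patch_size, left:left + patch_size]
            background = patch.mean(axis=2) > background_threshold
            if background.mean() <= max_background_fraction:
                patches.append(patch)
    return patches

# Synthetic stand-in for a slide: white background with one stained region.
slide = np.full((512, 768, 3), 255, dtype=np.uint8)
slide[0:256, 256:512] = (150, 60, 140)
patches = extract_patches(slide)
print(len(patches), patches[0].shape)
```

On a real whole-slide image the array would be read region by region rather than held in memory at once; the tiling logic stays the same.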


Proposed Pipeline
Image and feature classification based on DL and ML methods has shown promising results in categorizing microscopic images of benign or malignant tissues. Our proposed pipeline is shown in Figure 3. The analysis of the tissue image dataset was carried out in five phases: image pre-processing, CNN model analysis, feature analysis, model classification, and performance evaluation. In this study, we developed two LWCNN models (model 1 and model 2) and used state-of-the-art pre-trained models to carry out 2D image classification and perform a comparative analysis among the models. In addition, EML classification was performed on the handcrafted (OCLBP and IOCLBP) and non-handcrafted (CNN-based) colour texture features extracted from the tissue images.



Image Preprocessing
In this phase, the patches were resized to 224 × 224 pixels for CNN training, and a power-law (gamma) transformation [49,50] was applied to the resized images to adjust their contrast. Gamma is used to encode and decode luminance values in imaging systems. Figure 4 illustrates the images before and after this operation.
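As a minimal sketch of the power-law transformation s = c · r^γ, the snippet below applies γ = 2 (the value reported for Figure 4) to 8-bit intensities scaled to [0, 1]. The function name `gamma_transform` and the constant c = 1 are assumptions made for illustration.

```python
import numpy as np

def gamma_transform(image, gamma=2.0, c=1.0):
    """Apply s = c * r**gamma to an 8-bit image, with r scaled to [0, 1]."""
    r = image.astype(np.float64) / 255.0
    s = c * np.power(r, gamma)
    return np.clip(s * 255.0, 0, 255).round().astype(np.uint8)

pixels = np.array([[0, 64, 128, 255]], dtype=np.uint8)
adjusted = gamma_transform(pixels, gamma=2.0)
print(adjusted)  # [[  0  16  64 255]]
```

With γ > 1, mid-range intensities are pushed down while black and white stay fixed, which stretches the contrast among the brighter tissue regions.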
The dataset was split for training, validating, and testing the CNN models. The data samples were labeled with 0 (non-cancerous) and 1 (cancerous) and randomly assigned to one of three groups for training, validation, and testing, as shown in Table 1. The dataset used for DL and ML classification holds a total of 6000 samples. Of these, 3600 were used for training, 1200 for validation, and 1200 for testing. Before the samples were fed to the network for classification, data augmentation was performed on the training set, which enabled analysis of model performance, reduction of overfitting, and improvement of generalization [51]. To introduce variation in the images, the following augmentation transformations were applied: rotation by 90°, transposition, random_brightening, random_contrast, random_hue, and random_saturation, as shown in Figure 5c,d. Keras and TensorFlow functions were used to execute data augmentation.
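The augmentation step can be sketched as follows. This is a NumPy stand-in rather than the Keras and TensorFlow functions used in the study; the function name `augment`, the probabilities, and the brightness and contrast ranges are assumptions, and the hue and saturation jitter is omitted for brevity.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly transformed copy of a square H x W x 3 uint8 patch."""
    out = image.astype(np.float64)
    if rng.random() < 0.5:
        out = np.rot90(out, k=1, axes=(0, 1))        # rotation by 90 degrees
    if rng.random() < 0.5:
        out = np.transpose(out, (1, 0, 2))           # transposition
    out = out + rng.uniform(-25.0, 25.0)             # random brightness shift
    mean = out.mean(axis=(0, 1), keepdims=True)
    out = (out - mean) * rng.uniform(0.8, 1.2) + mean  # random contrast
    return np.clip(out, 0, 255).round().astype(np.uint8)

rng = np.random.default_rng(seed=0)
patch = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
augmented = [augment(patch, rng) for _ in range(4)]
print([a.shape for a in augmented])
```

Augmentation is applied to the training set only, so the validation and test patches remain untouched.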
Appl. Sci. 2020, 10
Figure 4. Because the images in (a,c) have low contrast, γ = 2 was applied to adjust their intensities, obtaining images in (b,d) that appear clear and "fresh." Therefore, the tissue components were more visible after transformation, which was important for CNN classification.

Convolutional Neural Network
To classify images of PCa, this paper introduces two LWCNN models to perform the classification of the GP and distinguish between two classes. Between them, model 1 and model 2 used the following layer types: input, convolution, rectified linear unit (ReLU), batch normalization (BN), max pooling, dropout, flattening, global average pooling (GAP), and classification. Model 1 contained four convolutional blocks, with a depth of 10 layers, which interleaved two-dimensional (2D) convolutional layers (3 × 3 kernel, strides, and padding) with ReLU and BN layers, followed by three max-pooling (2 × 2) and three dropout layers. For the fully connected part of the network [52,53], a flattening layer was followed by a sequence of three dense layers containing 1024, 1024, and 2 neurons, which performed feature classification and produced two probabilistic outputs. The sigmoid activation function [54,55] was used as a binary classifier. The numbers of filters in the four blocks were 32, 64, 128, and 256, respectively. These filters acted as a sliding window over the entire image.
Model 2 contained three convolutional blocks, with a depth of seven layers, in which the 2D convolutional, ReLU, and BN layers were identical to those of model 1 but were interleaved with two max-pooling (2 × 2) layers and one dropout layer. The numbers of convolutional filters in this model were 92, 192, and 384. A GAP layer was used instead of flattening. The classification section of this model also had three dense layers, containing 64, 32, and 2 neurons. Here, a softmax [56,57] classifier was used to reduce binary loss. The input shape was set to 224 × 224 × 3 when building the model. The detailed design and specification of our lightweight CNN (LWCNN) models are shown in Figure 6 and Table 2, respectively. Model 2 was modified from model 1 based on multilevel feature analysis to improve classification accuracy and reduce validation loss, as shown in Figure 7.
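To give a rough sense of why replacing flattening with GAP makes the network lightweight, the arithmetic below counts the trainable parameters of the dense layers only. It assumes 'same' padding and unit stride in model 1, so that three 2 × 2 poolings reduce 224 to 28; those settings are assumptions, and the helper names are hypothetical. The counts are therefore illustrative and are not the totals reported in Table 2.

```python
def dense_params(inputs, units):
    """Weights plus biases of one fully connected layer."""
    return inputs * units + units

def head_params(feature_count, layer_units):
    """Total parameters of a stack of dense layers fed with `feature_count` inputs."""
    total, inputs = 0, feature_count
    for units in layer_units:
        total += dense_params(inputs, units)
        inputs = units
    return total

# Model 1: flattening keeps every spatial position of the last block.
# Assumed feature map size after three poolings: 28 x 28 x 256.
model1_head = head_params(28 * 28 * 256, [1024, 1024, 2])

# Model 2: GAP leaves one value per filter of the last block (384 filters).
model2_head = head_params(384, [64, 32, 2])

print(model1_head, model2_head)  # 206573570 26786
```

Under these assumptions the dense part of model 2 is several thousand times smaller than that of model 1, because GAP removes the dependence on the spatial size of the last feature map.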


The multilevel feature maps were extracted after each convolutional block for pattern analysis and to understand the pixel distribution that the CNN detected, based on the number of convolution filters applied for edge detection and feature extraction. The convolution operation was performed by sliding the filter (kernel) over the input image. At each location, the kernel was multiplied element-wise with the underlying image region and the results were summed to generate the feature map. Max pooling was applied to reduce the spatial size of the input, limit memorization (overfitting), and retain the strongest response from each region of the feature map. The feature maps from the first block held most of the information present in the image; that block acted as an edge detector. Deeper in the network, however, the feature maps became more abstract and less similar to the original image (see Figure 7). In block-3, the image pattern was somewhat visible, and by block-4 it had become unrecognizable. This transformation occurred because deeper features encode high-level concepts, such as 2D information regarding the tissue (e.g., only spatial values of 0 or 1), while the CNN detects edges and shapes from low-level feature maps. Therefore, based on the observation that block-4 yielded unrecognizable images, model 2 was developed using three convolutional blocks to improve the performance of the LWCNN, and it was selected as the model that this paper proposes.
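The sliding-window convolution and max-pooling operations described in this paragraph can be sketched in a few lines of NumPy. The example uses no padding, a stride of 1, and a hand-written vertical-edge kernel; these choices and the function names are assumptions for illustration, whereas the filters of a trained CNN are learned.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the
    element-wise products at each location."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map

def max_pool_2x2(feature_map):
    """Keep the maximum of each non-overlapping 2 x 2 window."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A 6 x 6 image with a vertical edge between columns 2 and 3.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

feature_map = conv2d_valid(image, kernel)   # shape (4, 4)
pooled = max_pool_2x2(feature_map)          # shape (2, 2)
print(feature_map[0], pooled)
```

The feature map responds only where the window straddles the edge, which is the edge-detector behaviour attributed to the first block; pooling then halves each spatial dimension while keeping that response.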
To validate the performance of model 2 (LWCNN), we also included pre-trained CNN models (VGG-16, ResNet-50, Inception-V3, and DenseNet-121) for histopathology image classification. These models are powerful and effective for extracting and classifying deep CNN features. For each pre-trained network, the dense (classification) block was configured according to the model specification. The sigmoid activation function was used in all the pre-trained models to perform binary classification.

Feature Engineering
Texture features, both handcrafted and non-handcrafted, were extracted for ensemble machine learning (EML) classification. First, non-handcrafted (CNN-based) features were extracted from the GAP layer of the proposed LWCNN (model 2). Each CNN layer generated a different number of feature maps, and the GAP mechanism was used to calculate the average value of each feature map. Second, a total of 20 handcrafted colour texture features were extracted using the OCLBP and IOCLBP techniques, 10 with each. The handcrafted features were classified with the EML model, and the results were compared with those obtained for the non-handcrafted features.
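As a small sketch of how GAP turns a stack of feature maps into a feature vector, the snippet below averages each map over its spatial dimensions. The 56 × 56 spatial size and the random values are placeholders; only the channel count of 384 follows the last block of model 2.

```python
import numpy as np

def global_average_pooling(feature_maps):
    """Average each H x W feature map to one value: (H, W, C) -> (C,)."""
    return feature_maps.mean(axis=(0, 1))

rng = np.random.default_rng(0)
feature_maps = rng.random((56, 56, 384))   # placeholder activations
features = global_average_pooling(feature_maps)
print(features.shape)  # (384,)
```

Each patch is thus described by one number per filter, and these vectors are what a feature-based classifier receives.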
After the colour texture map was generated, the LBP technique was applied separately to each colour channel (red, green, and blue) for OCLBP and IOCLBP. These state-of-the-art methods are extensions of local binary patterns (LBP) and are effective for colour image analysis. OCLBP and IOCLBP are intra- and inter-channel descriptors with different local thresholding schemes (i.e., the peripheral pixels of OCLBP are thresholded at the central pixel value, whereas IOCLBP thresholding is based on the mean value) [30]. For each of these methods, the feature vector was obtained using general rotation-invariant operators (i.e., a neighbor set of p pixels placed on a circle of radius R) that can distinguish the spatial pattern and the contrast of local image texture. The operator parameters p = 8 and R = 2 were used to extract the colour features from the H&E stained tissue images.
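The difference between the two thresholding schemes can be illustrated with a deliberately simplified LBP on a single channel. This sketch uses the 8 immediate neighbours on the pixel grid (not a circle of radius 2), is not rotation-invariant, and ignores the opponent-colour pairing, so it is not OCLBP or IOCLBP themselves. Taking the mean over the full 3 × 3 neighbourhood is also an assumption.

```python
import numpy as np

# Offsets of the 8 neighbours, clockwise from the top-left pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_codes(channel, threshold="center"):
    """Return an LBP code for every interior pixel of a 2D array.

    threshold="center": neighbours are compared with the central pixel.
    threshold="mean":   neighbours are compared with the 3 x 3 mean.
    """
    h, w = channel.shape
    values = channel.astype(np.float64)
    center = values[1:h - 1, 1:w - 1]
    neighbours = [values[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
                  for dy, dx in OFFSETS]
    if threshold == "center":
        reference = center
    else:
        reference = (center + sum(neighbours)) / 9.0
    codes = np.zeros(center.shape, dtype=np.int64)
    for bit, neighbour in enumerate(neighbours):
        codes += (neighbour >= reference).astype(np.int64) * (1 << bit)
    return codes

patch = np.array([[5, 9, 1],
                  [4, 6, 7],
                  [2, 3, 8]])
code_center = int(lbp_codes(patch, "center")[0, 0])
code_mean = int(lbp_codes(patch, "mean")[0, 0])
print(code_center, code_mean)  # 26 27
```

The two codes differ in one bit because the pixel with value 5 lies below the centre value (6) but reaches the neighbourhood mean (5). A histogram of such codes over a patch, for example via `np.bincount`, would then serve as the texture feature vector.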

DL and ML Classification
Prior to training and testing the LWCNN, pre-trained, and EML [58] models, we fine-tuned different types of parameters for better prediction and to minimize model loss. To compute the feature maps in each convolutional layer, a non-linear activation function (ReLU) was used, which can be defined as:
$$A_{i,j,n} = \max\left(0,\; w_n^{T} I_{i,j} + b_n\right)$$
where $A_{i,j,n}$ is the activation value of the nth feature map at the location (i, j), $I_{i,j}$ is the input patch, and $w_n$ and $b_n$ are the weight vector and bias term, respectively, of the nth filter. BN was also used after each convolution layer to regularize the model, reducing the need for dropout. BN was used in our model because it is more effective than global data normalization. The latter transforms the entire dataset so that it has a mean of zero and unit variance, while BN computes approximations of the mean and variance after each mini-batch. Therefore, BN enables the use of the ReLU activation function without saturating the model. Typically, BN is performed using the following equation:
$$\hat{x}_n = \frac{x_n - \mu_{mb}}{\sqrt{\sigma_{mb}^{2} + c}}$$
where $\hat{x}_n$ is the normalized output, $x_n$ is the d-dimensional input, $\mu_{mb}$ and $\sigma_{mb}^{2}$ are the mean and variance, respectively, of the mini-batch, and c is a constant.
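The two operations discussed in this paragraph can be sketched in NumPy as follows. The sketch omits the learnable scale and shift that BN layers normally carry, and the value c = 1e-5 and the toy input are assumptions; the batch of eight rows mirrors the batch size used for training.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: max(0, z) element-wise."""
    return np.maximum(0.0, z)

def batch_norm(x, c=1e-5):
    """Normalize each feature with the mean and variance of the mini-batch.

    x has shape (batch, features); c is a small constant that keeps the
    denominator away from zero.
    """
    mu_mb = x.mean(axis=0)
    var_mb = x.var(axis=0)
    return (x - mu_mb) / np.sqrt(var_mb + c)

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(8, 4))  # 8 samples, 4 features

normalized = batch_norm(batch)
activated = relu(batch - 5.0)
print(normalized.mean(axis=0).round(6), normalized.std(axis=0).round(3))
```

After normalization each feature has approximately zero mean and unit variance within the mini-batch, regardless of the scale of the incoming activations.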
To optimize the weights of the network and analyze the performance of the LWCNN models, we performed a comparative analysis of four different optimizers, namely stochastic gradient descent (SGD), Adadelta, Adam, and RMSprop. The results of the comparative analysis are shown in the next section. The classification performance is measured using the cross-entropy loss, or log loss, which evaluates a model whose output is a probability value between 0 and 1. To train our network, we used binary cross-entropy. The standard loss function for binary classification is given by:
$$L(w) = -\frac{1}{N}\sum_{i=1}^{N}\left[ Y_i \log M_w(X_i) + (1 - Y_i)\log\left(1 - M_w(X_i)\right) \right]$$
where N is the number of samples over which the loss is averaged, $X_i$ and $Y_i$ are the input samples and target labels, respectively, and $M_w$ is the model with network weights w. The learning rate was controlled with the ReduceLROnPlateau function, using a minimum learning rate of 0.001, a factor of 0.8, and a patience of 10; thus, if no improvement was observed in validation loss for 10 consecutive epochs, the learning rate was reduced by a factor of 0.8. The batch size was set to eight for training the model, and regularization was applied by dropping out 25% and 50% of the units in the convolution and dense blocks of the LWCNN, respectively. The probabilistic output in the dense layer was computed using sigmoid and softmax classifiers.
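A minimal, framework-free sketch of the binary cross-entropy and of the reduce-on-plateau rule is given below. The helper names are hypothetical, the initial learning rate of 0.01 is an assumption (only the minimum of 0.001 is stated), and the schedule is a simplified imitation of the Keras callback rather than its exact implementation.

```python
import math

def binary_cross_entropy(targets, probabilities, eps=1e-12):
    """Mean binary cross-entropy between 0/1 targets and predicted probabilities."""
    total = 0.0
    for y, p in zip(targets, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(targets)

def plateau_schedule(val_losses, initial_lr, factor=0.8, patience=10, min_lr=0.001):
    """Return the learning rate used in each epoch.

    The rate is multiplied by `factor` (never dropping below `min_lr`) once
    the validation loss has failed to improve for `patience` epochs in a row.
    """
    lr, best, wait, schedule = initial_lr, float("inf"), 0, []
    for loss in val_losses:
        schedule.append(lr)
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)
                wait = 0
    return schedule

loss = binary_cross_entropy([1, 0], [0.9, 0.1])
schedule = plateau_schedule([0.50] * 12, initial_lr=0.01)
print(round(loss, 4), schedule[10], schedule[11])
```

In this toy run the validation loss never improves after the first epoch, so the rate stays at 0.01 through epoch 10 and drops to about 0.008 from epoch 11 onward.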
In addition to CNN methods, traditional ML algorithms including logistic regression (LR) [59] and random forest (RF) [60] were used for feature classification. In this paper, an ensemble voting method was proposed in which the LR and RF classifiers were combined to create an EML model. This ensemble technique was used to classify the handcrafted and non-handcrafted features and to compare the classification performance. The LWCNN, pre-trained, and EML models were tested using unseen data samples. For ML classification, cross-validation was used by splitting the training data into k folds (k = 5) to determine model generalizability, and the result was computed by averaging the accuracies from the k trials. Prior to ML classification [61][62][63], the feature values for training and testing were normalized using the standard normal distribution function, which can be expressed as:
$$Z_i = \frac{P_i - \mu}{\sigma}$$
where $Z_i$ is the normalized value, $P_i$ is the ith pixel in an individual tissue image, and µ and σ are the mean and standard deviation of the dataset. The DL and ML models were built with the Python 3 programming language using the Keras and TensorFlow libraries. Approximately 36 h were invested in fine-tuning the hyperparameters to achieve better accuracy. Figure 8 shows the entire process flow diagram for DL and ML classification. The hyperparameters that were used for the DL and ML models are shown in Table 3.
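A hedged sketch of such a voting ensemble with scikit-learn is shown below. The use of scikit-learn, the soft-voting setting, the classifier hyperparameters, and the synthetic data from `make_classification` are all assumptions for illustration; the study's actual settings are those listed in Table 3.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the extracted feature vectors and 0/1 labels.
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           class_sep=2.0, random_state=0)

# LR and RF combined by voting; soft voting averages predicted probabilities.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",
)

# Standardization sits inside the pipeline so that each fold is scaled with
# statistics computed from its own training portion only.
model = make_pipeline(StandardScaler(), ensemble)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```

Keeping the scaler inside the pipeline avoids leaking validation-fold statistics into training during the five-fold cross-validation.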
The models were trained, validated, and tested on a PC with the following specifications: an Intel Core i7 CPU (2.93 GHz), one NVIDIA GeForce RTX 2080 GPU, and 24 GB of RAM.


Experimental Results
This study mainly focuses on image classification based on AI. The proposed LWCNN (model 2) for tissue image classification and EML for feature classification produced reliable results, which met our requirements, at an acceptable speed. A CNN approach was used to develop the DL models because it has shown excellent performance in detecting specific regions for multiclass and binary classification. When splitting the dataset, a ratio of 8:2 was set for training and testing. Moreover, to validate the model after each epoch, the training set was further divided, such that 75% of the data was allocated for training and 25% for validation. Five-fold cross-validation was used during EML training. Algorithms used for preprocessing, data analysis, and classification were implemented in the MATLAB R2019a and PyCharm environments.

Performance Analysis
In this study, a binary classification approach was used to classify benign and malignant samples of prostate tissue. Two levels of classification were performed: DL (based on images) and ML (based on features). Table 4 shows the comparative analysis of the optimizers for model 1 and model 2. The developed LWCNN models were trained several times, each time with a different optimizer. Table 4 indicates that Adadelta performed best, giving the highest accuracies on test data for both architectures. SGD and Adam performed close to Adadelta for model 2, whereas RMSProp performed close to Adadelta for model 1. Adadelta (an extension of Adagrad) is a more robust optimizer that restricts the window of accumulated past gradients to some fixed size w instead of accumulating all past squared gradients. In our comparison, Adadelta was more stable and more rapid, and hence an overall improvement on SGD, RMSProp, and Adam. The behavior and performance of the optimizers were analyzed using the receiver operating characteristic (ROC) curve, a probabilistic curve that represents the diagnostic ability of a binary classifier system, including an indication of its effective threshold value. The area under the ROC curve (AUC) summarizes the extent to which a model can separate the two classes. Figure 9a,b shows the ROC curves and corresponding AUCs, which depict the effectiveness of the different optimizers used for model 1 and model 2, respectively. For model 1, the AUCs were 0.95, 0.94, 0.96, and 0.93, and for model 2, they were 0.98, 0.97, 0.98, and 0.97, using Adadelta, RMSProp, SGD, and Adam, respectively.
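To make the windowed accumulation concrete, the sketch below implements the standard Adadelta update for a single scalar parameter. In practice the window is realized as an exponentially decaying average of squared gradients rather than a stored buffer of size w. This is an illustration, not the training code used in the study: the decay rate `rho`, the constant `eps`, and the toy objective f(x) = x² are assumed values chosen for demonstration.

```python
import math


def adadelta_minimize(grad_fn, x0, steps=500, rho=0.95, eps=1e-6):
    """Minimize a scalar function with Adadelta, given its gradient function."""
    x = x0
    avg_sq_grad = 0.0  # decaying average of squared gradients, E[g^2]
    avg_sq_step = 0.0  # decaying average of squared updates, E[dx^2]
    for _ in range(steps):
        g = grad_fn(x)
        # Decaying average replaces the sum over all past squared gradients.
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        # Step is scaled by the ratio of the two running RMS values.
        step = -math.sqrt(avg_sq_step + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_step = rho * avg_sq_step + (1 - rho) * step * step
        x += step
    return x


# Toy objective f(x) = x^2 with gradient 2x (illustrative only).
x_final = adadelta_minimize(lambda x: 2.0 * x, x0=1.0)
```

Because the step size is derived from the two running averages, no global learning rate has to be chosen in this formulation.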
Further, based on the optimum accuracy in Table 4, we carried out EML classification using the CNN-extracted features from model 2 to analyze the efficiency of ML algorithms. Handcrafted feature classification was also performed to compare its performance with that of non-handcrafted feature classification. The EML model achieved promising results using the CNN-based features. Model 2 outperformed model 1 in overall accuracy, precision, recall, F1-score, and MCC, with values of 94.0%, 94.2%, 92.9%, 93.5%, and 87.0%, respectively. A confusion matrix (Figure 10) was generated for the LWCNN model that yielded the optimum results and thus most reliably distinguished malignant from benign tissue. Benign tissue was labeled as "0" and malignant tissue as "1" to plot the confusion matrix for this binary classification. The four squares in the confusion matrix represent true positives, true negatives, false positives, and false negatives; their values were calculated on the test dataset from the expected outcome and the number of predictions for each class. Tables 5 and 6 show the overall comparative analysis for the DL and ML classification.
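One common way to combine two classifiers such as LR and RF is soft voting, in which the class probabilities predicted by each model are averaged and the class with the highest mean probability is chosen. The sketch below assumes soft voting with equal weights; the text does not state whether hard or soft voting was used, and the function name `soft_vote` and the probability values are hypothetical.

```python
def soft_vote(probabilities_per_model):
    """Average class probabilities across models and pick the most likely class.

    probabilities_per_model: one entry per model, each a list of per-sample
    probability lists, e.g. [[p_benign, p_malignant], ...].
    """
    n_models = len(probabilities_per_model)
    n_samples = len(probabilities_per_model[0])
    averaged = []
    for i in range(n_samples):
        n_classes = len(probabilities_per_model[0][i])
        averaged.append([
            sum(model[i][c] for model in probabilities_per_model) / n_models
            for c in range(n_classes)
        ])
    labels = [max(range(len(p)), key=p.__getitem__) for p in averaged]
    return averaged, labels


# Made-up probabilities for two samples from two models (LR and RF).
lr_probs = [[0.6, 0.4], [0.2, 0.8]]
rf_probs = [[0.8, 0.2], [0.6, 0.4]]
averaged, labels = soft_vote([lr_probs, rf_probs])
```

In the second sample the two models disagree, and the averaged probability decides the label.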
The performance metrics used to evaluate the analysis results are accuracy, precision, recall, F1-score, and Matthews correlation coefficient (MCC).
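For reference, these five metrics can be computed directly from the four cells of a binary confusion matrix. The sketch below uses the standard definitions for the positive class only; the counts are hypothetical and are not taken from Figure 10, and the averaged precision, recall, and F1-score reported in this paper may have been computed over both classes.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute standard metrics from confusion-matrix counts.

    Assumes all denominators are non-zero.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=90, tn=85, fp=15, fn=10)
```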

Visualization Results
The CAM technique was used to visualize the results from an activation layer (softmax) of the classification block. CAM is used to deduce which regions of an image are used by a CNN to recognize the precise class or group it contains [22,64]. Typically, it is difficult to visualize the results from hidden layers of a black box CNN model. More complexity is observed in feature maps with increasing depth in the network; thus, each image becomes increasingly abstract, encoding less information than the initial layers and appearing more blurred. Figure 11 shows the CAM results, indicating the method by which our DL network detected important regions; moreover, the network had learned a built-in mechanism to determine which regions merited attention. Therefore, this decision process was extremely useful in the classification network.

Figure 11. Class activation map (CAM) visualization: (a) original image; (b) CAM image; (c) heat map obtained by overlaying (a,b), with spots indicating significant regions that the convolutional neural network used to identify a specific class in that image.
Our CNN detected specific regions using the softmax classifier by incorporating spatially averaged information extracted by the GAP layer from the last convolution layer, which had an output shape of 14 × 14 × 384. The detected regions depicted in Figure 11c were generated by the application of a heat map to the CAM image in Figure 11b and overlaying that on the original image from Figure 11a. A heat map is highly effective for tissue image analysis; in this instance, it showed how the CNN detected each region of the image that is important for cancer classification. Doctors can use this information to better understand the classification (i.e., how the neural network predicted the presence of cancer in an image, based on the relevant regions). The visualization process was carried out using the test dataset, which was fed into the trained network of model 2.
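The computation behind such a map can be sketched as a weighted sum of the last convolutional feature maps. The example below assumes the standard CAM formulation, in which the weights connecting the GAP output to one class score are reused to weight the feature maps; the classification block of model 2 may differ from this simplified setting. The function names and the tiny 2 × 2 feature maps are hypothetical stand-ins for the 14 × 14 × 384 output.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps (each H x W) using K class weights."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for r in range(height):
            for c in range(width):
                cam[r][c] += weight * fmap[r][c]
    return cam


def normalize_map(cam):
    """Rescale a map to [0, 1] so it can be rendered as a heat map."""
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in cam]
    return [[(v - lo) / span for v in row] for row in cam]


# Two toy 2 x 2 feature maps and their weights for one class.
feature_maps = [[[1, 0], [0, 0]], [[0, 0], [0, 2]]]
class_weights = [0.5, 1.0]
heat = normalize_map(class_activation_map(feature_maps, class_weights))
```

For display, the normalized map would be upsampled to the input image size and overlaid on the original image.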
In this study, supervised classification was performed for cancer grading, whereby our dataset was labeled with "0" and "1" to categorize benign and malignant tissue, respectively. The probability distributions of the data were similar in the training and test sets, but the test dataset was independent of the training dataset. After the model had been trained with binary-labeled cancer images, the unseen dataset was fed to the network to predict the two classes. Figure 12 shows examples of the binary classification results from our proposed model 2, including images that were and were not predicted correctly. Notably, some benign tissue images resembled malignant tissue images, and vice versa, in terms of their nuclei distribution, intensity variation, and tissue texture. It was challenging for the model to correctly classify these images into the two groups.
Appl. Sci. 2020, 10, x FOR PEER REVIEW

Discussion
The main aim of this study was to develop an LWCNN for benign and malignant tissue image classification based on multilevel feature map analysis and to show the effectiveness of the model. Moreover, we developed an EML voting method for the classification of non-handcrafted features (extracted from the GAP layer of model 2) and handcrafted features (extracted using OCLBP and IOCLBP). Generally, in DL, the features are extracted automatically from raw data and further processed for classification using a neural network approach. However, for ML algorithms, features are extracted manually using different mathematical formulae; these are regarded as handcrafted features. A CNN is suitable for complex detection tasks, such as analyses of scattered and finely drawn patterns in data.
Of particular interest, in the malignant and benign classification task, model 2 was more effective than model 1. Indeed, model 1 performed below expectation, so we modified it to improve performance, resulting in model 2. The modification comprised removal of the fourth convolutional block, flattening layer, and sigmoid activation function, as well as alterations of the filter number and kernel size. Moreover, GAP replaced flattening after the third convolutional block, minimizing overfitting by reducing the total number of parameters in the model. The softmax activation function replaced the sigmoid activation function in the third dense layer. These modifications, based on the multilevel feature map analysis, improved the overall accuracy and localization ability of tissue image classification.
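The effect of replacing flattening with GAP on the parameter count can be illustrated with simple arithmetic. The sketch below uses the 14 × 14 × 384 output shape reported for the last convolution layer of model 2; applying flattening to that same shape and following it with a dense layer of 64 units are assumptions made only for comparison, since the actual layer sizes are not restated here. The function names are hypothetical.

```python
def flatten(feature_maps):
    """Turn an H x W x C nested list into a one-dimensional list."""
    return [value for row in feature_maps for pixel in row for value in pixel]


def global_average_pool(feature_maps):
    """Return the mean of each channel of an H x W x C nested list."""
    height, width = len(feature_maps), len(feature_maps[0])
    channels = len(feature_maps[0][0])
    return [
        sum(feature_maps[r][c][k] for r in range(height) for c in range(width))
        / (height * width)
        for k in range(channels)
    ]


def dense_parameters(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


HEIGHT, WIDTH, CHANNELS = 14, 14, 384  # shape reported for model 2
DENSE_UNITS = 64  # hypothetical size of the next dense layer

params_after_flatten = dense_parameters(HEIGHT * WIDTH * CHANNELS, DENSE_UNITS)
params_after_gap = dense_parameters(CHANNELS, DENSE_UNITS)
```

Under these assumptions, the first dense layer needs 4,816,960 parameters after flattening but only 24,640 after GAP.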
Furthermore, in this study, we also compared our proposed CNN model with well-known pre-trained models: VGG-16, ResNet-50, Inception-V3, and DenseNet-121. Among these, DenseNet-121 gave the highest accuracy of 95%, followed by Inception-V3 with 94.6%. The pre-trained VGG-16 and ResNet-50 achieved 92% and 93%, respectively. Although DenseNet-121 achieved the highest accuracy among all the pre-trained models and our proposed model 2, it does not align with the aim of this paper, which was to develop a lightweight CNN with a simple structure and the minimum possible number of convolutional layers while achieving good classification performance. Model 2 supported this hypothesis by achieving an overall accuracy of 94%. Moreover, all the pre-trained models were trained on a huge dataset (ImageNet) that includes 1000 classes; it is therefore unsurprising that such models classify accurately. Nevertheless, the computational cost of the proposed LWCNN and the pre-trained models was compared in terms of memory usage, trainable parameters, and learning (training and testing) time, as shown in Table 7. First, according to Table 7, the number of trainable parameters in the LWCNN model was more than 75% lower than in VGG-16, ResNet-50, and Inception-V3, and 2% lower than in DenseNet-121. Second, the memory usage of the proposed model was significantly lower than that of the other models. Third, the time taken to train the proposed model was also drastically shorter. Among the pre-trained models, the results for VGG-16 and ResNet-50 support the objective of this work. From Tables 5 and 7, it is evident that our LWCNN (model 2) is competitive and inexpensive, whereas the state-of-the-art models were computationally expensive and achieved comparable results.
Therefore, from this perspective, model 2 of our proposed work performed better than VGG-16 and ResNet-50 in terms of accuracy while employing a simpler architecture. Through fine-tuning of the hyperparameters, the CNN layers were determined to be optimal using the validation and test datasets. The modified model 2 was adequate for the classification of benign and malignant tissue images. Our study examined the capability of the proposed LWCNN model to detect and predict the class of histopathology images; a single activation map was extracted from each block (see Figure 13) to visualize the detection results using a heat map. Notably, we used an EML method for non-handcrafted and handcrafted feature classification. The EML model was sufficiently powerful to classify the computational features extracted using the optimal LWCNN model, predicting the samples of benign and malignant tissues with high accuracy. Also, tissue samples that were classified and predicted using the softmax classifier are shown in quantile-quantile (Q−Q) plots of the prediction probability confidence for benign and malignant states in Figure 14a,b, respectively. These Q−Q plots allowed for the analysis of predictions. True and predicted probabilistic values were plotted according to true positive and true negative classifications of samples (see Figure 9), respectively.
In the Q−Q plots, the black bar at the top, parallel to the x-axis, shows the true probabilistic values; red (true positive) and blue (true negative) markers show the prediction confidence of each sample of a specific class. We used a softmax classifier, which normalizes the output of each unit to lie between 0 and 1 and ensures that the probabilities always sum to 1. The number of samples used for each class was 600; the numbers correctly classified were 565 and 557 for true positive and true negative, respectively. A predicted probability value > 0.5 signifies an accurate classification, and a value < 0.5 signifies a misclassification.
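The softmax normalization and the 0.5 decision threshold described above can be sketched as follows. This is a generic illustration rather than the authors' implementation; the function names and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Map raw scores to probabilities in [0, 1] that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_correct(probabilities, true_class, threshold=0.5):
    """A sample counts as correctly classified if the probability assigned
    to its true class exceeds the threshold."""
    return probabilities[true_class] > threshold


# Example logits for the two classes (benign = 0, malignant = 1).
probabilities = softmax([2.0, 0.0])
```

With two classes, a probability above 0.5 for the true class is equivalent to that class having the larger score.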
The combination of image-feature engineering and ML classification has shown remarkable performance in medical image analysis and classification. In contrast, a CNN adaptively learns various image features to perform image transformation, focusing on features that are highly predictive for a specific learning objective [65]. For instance, images of benign and malignant tissues could be presented to a network composed of convolutional layers with different numbers of filters that detect computational features and highlight the pixel pattern in each image. Based on these patterns, the network could use sigmoid and softmax classifiers to learn the extracted and important features, respectively. In DL, the "pipeline" of a CNN's processing (i.e., from inputs to any output prediction) is opaque [66], performed automatically like a passage through a "black box" tunnel, where the user remains unaware of the process details. It is difficult to examine a CNN layer by layer. Therefore, each layer's visualization results and prediction mechanism are challenging to interpret.
Overall, all models performed well in tissue image classification, achieving comparable results. The EML method also worked well with CNN-extracted features, yielding comparable results. We conclude that, for image classification, models with very deep layers performed well, classifying the data samples more accurately. We aimed to build an LWCNN model with few feature-map layers and hyperparameters for prediction of cancer grading based on binary classification (i.e., benign vs. malignant). Our results suggest that lightweight models can achieve good results if the parameters are tuned appropriately. Furthermore, model 2 effectively recognized the histologic differences in tissue images and predicted their statuses with high accuracy (94%). The application of DL to histopathology is relatively new; however, it performs well and delivers accurate results. DL methods provide outstanding performance through black box layers; the outputs of each of these layers can be visualized using a heat map. In this study, our model provided insights into the histologic patterns present in each tissue image and can thus assist pathologists as a practical tool for analyzing tissue regions relevant to the worst prognosis. Heat map analyses suggested that the LWCNN can learn visual patterns of histopathological images containing different features relating to nuclear morphology, cell density, gland formation, and variations in the intensity of stroma and cytoplasm. Performance improved significantly when the first model was modified based on the feature map analysis.

Conclusions
In this study, 2D image classification was performed using PCa samples by leveraging non-handcrafted and handcrafted texture features to distinguish a malignant state of tissue from a benign state. We have presented LWCNN- and EML-based image and feature classification using feature map analysis. The DL models were designed with only a few CNN layers and trained with a small number of parameters. The computed feature maps of each layer were fed into the fully connected layers through the flattening and GAP layers, enabling binary classification using sigmoid and softmax classifiers. GAP and softmax were used for model 2, the optimal network in this paper. The GAP layer was used, instead of flattening, to minimize overfitting by reducing the total number of parameters in the model. This layer computes the mean value of each feature map, whereas flattening combines all feature maps extracted from the final convolution or pooling layers by changing the shape of the data from a 2D matrix of features into a one-dimensional array for passage to the fully connected classifier. A comparative analysis was performed between the DL and EML classification results. Moreover, the computational cost was also compared among the models. The optimum LWCNN (i.e., model 2) and EML models (a combination of LR and RF classifiers) achieved high accuracy with significantly fewer trainable parameters. The proposed LWCNN model achieved an overall accuracy of 94%, an average precision of 94.2%, an average recall of 92.9%, an average F1-score of 93.5%, and an MCC of 87%. Using CNN-based features, the EML model achieved an overall accuracy of 92%, an average precision of 92.7%, an average recall of 91%, an average F1-score of 91.8%, and an MCC of 83.5%.
To conclude, the analysis presented in this study is very encouraging. However, a model built for medical images may not work well for other types of images. The hyperparameters need to be fine-tuned to control model overfitting and loss, thereby improving accuracy. The 2D LWCNN (model 2) developed in this study performed well, and the predicted true positive and true negative samples for benign and malignant tissue, respectively, were therefore plotted using Q-Q plots. The CAM technique was used to visualize the results of the black box CNN model. In the future, we will consider other methods, develop a more complex DL model, and compare it with our optimal LWCNN model and other transfer learning models. Further, we will extend the research to multi-class classification (beyond binary) to simultaneously classify benign tissues as well as grades 3-5.