Training of Deep Convolutional Neural Networks to Identify Critical Liver Alterations in Histopathology Image Samples

Abstract: Nonalcoholic fatty liver disease (NAFLD) is responsible for a wide range of pathological disorders. It is characterized by the prevalence of steatosis, which results in excessive accumulation of triglyceride in the liver tissue. At high rates, it can lead to a partial or total occlusion of the organ. Nonalcoholic steatohepatitis (NASH), in turn, is a progressive form of NAFLD that additionally involves hepatocellular injury and inflammation. Since there is no approved pharmacotherapeutic solution for either condition, physicians and engineers are constantly in search of fast and accurate diagnostic methods. The proposed work introduces a fully automated classification approach, taking into consideration the high discrimination capability of four histological tissue alterations. It utilizes a deep supervised learning method, with a convolutional neural network (CNN) architecture achieving a classification accuracy of 95%. The classification capability of the new CNN model is compared with a pre-trained AlexNet model, a visual geometry group (VGG)-16 deep architecture and a conventional multilayer perceptron (MLP) artificial neural network. The results show that the constructed model achieves better classification accuracy than VGG-16 (94%) and MLP (90.3%), while AlexNet emerges as the most efficient classifier (97%).


Introduction
Nonalcoholic fatty liver disease (NAFLD) is estimated to be the most common chronic liver disease, with one-quarter of the adult population suffering from it [1]. Nonalcoholic steatohepatitis (NASH) refers to an aggressive form of NAFLD, which is usually the leading cause of end-stage liver disease or liver transplantation, as it can progress to cirrhosis and hepatocellular cancer (HCC). The diagnosed prevalence of NASH is estimated to reach 18 million subjects by 2027 worldwide, especially in the US, Japan and the EU. Clinical trials have not yet established an effective form of pharmacotherapy for these two conditions. As disease rates tend to increase, even if medication becomes available, it will still be difficult to identify the target population for this treatment. Consequently, the interest of hepatologists in recent years has been in the definitive diagnosis of NAFLD, with histology being the gold standard in modern clinical trials. In this case, microscopy on needle biopsy samples makes it possible to examine all anatomical liver tissue structures, including those of NAFLD and NASH. In the context of NAFLD, steatosis is predominantly macrovesicular, with single, large intracytoplasmic lipid vacuoles pushing aside the hepatocellular nuclei [2]. Occasionally, microvesicular steatosis with multiple small vacuoles within the cytoplasm can be observed, as well as large areas of macrovesicular steatosis agglomeration. In contrast, ballooned hepatocytes present as enlarged round cells surrounded by a clear and vacuolar cytoplasm.
Even though liver biopsy is considered the gold standard for evaluating NAFLD and NASH activity, it is an invasive procedure for the patient [3]. In recent decades, many studies have relied on semi-quantitative predictions for chronic and end-stage liver diseases, which lack diagnostic accuracy due to obstacles such as inter-observer and intra-observer variability. Each case involved subjective microscopic interpretations by specialized hepatologists [4]. As Figure 1 illustrates, the visual counting of these tissue alterations is a difficult and time-consuming process. To overcome this obstacle, modern studies have focused on the development of automated examinations using digital image processing techniques, which can effectively diagnose NAFLD and NASH [5].
A significant number of research efforts focus on the quantification of liver steatosis. These approaches utilize a combination of image processing techniques (including regions of interest segmentation) with supervised machine learning techniques, using manually annotated features. The risk of hepatic obstruction, which refers to the blockage of the bile ducts, sinusoids, portal veins, etc., has led researchers to develop fat detection systems [6][7][8][9], combined with trained classifiers for the separation of fat tissue from other histological structures [10,11]. Thanks to the effectiveness of these diagnostic systems, histopathology has focused on more complex identification problems, including hepatocellular ballooning and tissue inflammation. These are two chronic diseases for which no automated diagnostic solutions existed until recently [12]. The field of histopathology needed a new generation of algorithms with more independent approaches to the segmentation and classification problems.
Figure 1. Representation of tissue alterations indicating NAFLD disease using a manual counting process. Ballooned hepatocytes are marked with a green contour line, while the areas of fat accumulation with a red one. This process is time-consuming and highly subjective among physicians, demonstrating the need for a fully automated recognition tool.
In recent years, deep learning methods have introduced innovative and effective solutions to many image analysis tasks. As a result, deep neural networks have expanded to the field of medical imaging, with the aim of automatically capturing the anatomy and physiology of diseases and quantifying their prevalence. Deep learning architectures have been applied to the prognosis of hepatic steatosis and the monitoring of complex chronic conditions, including regions of collagen fiber [13]. A detailed description of the contribution of the referenced research works is provided in the results and discussion sections.
This work presents a methodology for the classification of multiple hepatic structures from biopsy images, based on convolutional neural networks (CNNs). Particularly in medical image analysis, CNN architectures can overcome the problems caused by the hand-crafted features used in traditional techniques, due to their fully automated feature extraction, as seen in Figure 2. The purpose of the proposed deep network is to solve a 4-class classification problem, with (a) ballooned hepatocytes and (b) fat droplets forming the disease classes, and (c) sinusoids and (d) veins forming the healthy classes. In the future, the proposed method could be integrated into a complete prognostic tool for (a) differentiating the healthy from the diseased tissue structures and (b) measuring the severity of the two diseases in clinical trials.

Materials and Methods
A two-step classification method is proposed, which can lead to the automatic characterization of the four histological objects:
Step 1. Collection of a sufficient number of isolated training samples from digitized biopsies, representing the four classes of tissue alterations.
Step 2. Training of two convolutional neural networks that share the same architecture but employ different optimization algorithms, and estimation of their classification performance on several testing images. In addition, transfer learning updates are applied to well-known pre-trained CNN models, and their quantitative performance is compared with that of the new CNN topology. Finally, the same performance is compared with that of a conventional neural network algorithm.
Figure 2. Flowchart of the suggested classification method. Initially, the histological structures of interest are isolated, and the proposed CNN is trained. The last stage refers to a future object detection project that could lead to the quantification of ballooning and fat prevalence ratio.

Histological Features Isolation
All biopsy slides involved in this study were collected at St. Mary Hospital (Imperial College Healthcare NHS Trust of London, UK) and came both from NAFLD and NASH patients. All subjects gave their informed consent for the inclusion of their samples in the current study, which was conducted following the rules of the Declaration of Helsinki (revised in 2013). In recent years, various histological dyes have been used for clinical examinations, including picro-Sirius red and Masson's trichrome stains, particularly for the evaluation of liver fibrosis. However, for the following experiments, the gold standard Hematoxylin and Eosin (H&E) dye was selected to highlight the four tissue alterations. Generally, the dataset consists of 64 images digitized with a Hamamatsu microscope (Hamamatsu Photonics, Hamamatsu, Japan). Initially, these images exceeded 10,000 × 10,000 pixels, a size that could not be considered ideal for training deep learning algorithms. Downsampling the images at ×20 magnification proved to be an ideal solution, as it preserved all the anatomical details that form the four tissue structures.
Subsequently, a cropping tool was used to extract individual histological samples, in the form of image patches, from the whole tissue images. In total, 720 healthy and diseased structures were extracted to form a balanced image dataset (180 samples per class), stored in four categories, one per class. Accordingly, an identification label is assigned to each microscopic structure, namely: (a) ballooning, (b) fat, (c) sinusoid and (d) vein. Furthermore, the dataset is partitioned into training/validation/testing subsets, where 620 structures were used for training, 60 for validation and 40 for testing.
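The partition described above can be illustrated with a short sketch. It assumes a per-class (stratified) split with 15 validation and 10 test samples per class; the test figure matches the 10 structures per class reported in the testing results, whereas the per-class validation count, the file names and the function name `stratified_split` are hypothetical and not taken from the study.

```python
import random

CLASSES = ["ballooning", "fat", "sinusoid", "vein"]

def stratified_split(samples_by_class, n_val=15, n_test=10, seed=0):
    """Split every class separately so that each subset stays balanced."""
    rng = random.Random(seed)
    split = {"train": [], "val": [], "test": []}
    for label, samples in samples_by_class.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        # First n_test go to testing, next n_val to validation, rest to training
        split["test"] += [(s, label) for s in shuffled[:n_test]]
        split["val"] += [(s, label) for s in shuffled[n_test:n_test + n_val]]
        split["train"] += [(s, label) for s in shuffled[n_test + n_val:]]
    return split

# Hypothetical patch identifiers: 180 samples per class, 720 in total
dataset = {c: [f"{c}_{i:03d}.png" for i in range(180)] for c in CLASSES}
split = stratified_split(dataset)
sizes = {name: len(items) for name, items in split.items()}
print(sizes)  # {'train': 620, 'val': 60, 'test': 40}
```

Splitting each class separately keeps all three subsets balanced, which is one simple way to arrive at the 620/60/40 partition.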

Convolutional Neural Network Model Construction
In this stage, a CNN topology is defined to learn the most informative features from the extracted biopsy tissue structures. The convolution layer operations are accelerated with an NVIDIA GTX1050Ti graphics processing unit (GPU), a widely used way of training deep neural networks in a short time. Figure 3 displays the techniques used in each layer of the proposed CNN architecture.

Initially, in the input layer, each image patch is resized to 64 × 64 × 3 pixels (width, height, depth) using bicubic interpolation. Since this generates a large number of connection weights for modeling the image data, several dimensionality reduction techniques are used in the subsequent convolution layers:

• In the first convolution layer, 64 convolution filters with a 5-by-5 kernel size are defined to detect "low-level" features, such as edges, from the raw image data. In each convolution operation, zero-padding assigns 0 values around the inputs so that the output size equals the input size of each kernel filter [14]. Subsequently, batch normalization is applied to normalize the convolved values, followed by the Rectified Linear Unit (ReLU) as the nonlinear activation function, which is considered ideal for minimizing the vanishing gradient problem [15]. Even though ReLUs are widely used in most deep learning applications, their unboundedness on the positive side tends to cause overfitting. To circumvent this issue, max pooling with a stride of 2 is used to decrease overfitting by reducing the spatial size (width and height) of the data representation [16].

• The second convolution layer applies 32 filters with a 3-by-3 kernel size to search for "higher-level" features within each liver tissue object, including hepatocytes within a ballooning area, as well as recurring pixel patterns that indicate blood cells in hepatic veins. Batch normalization, the ReLU function and max pooling are included again, while dropout with a probability of 0.5 is applied to prevent overfitting [17].

• In the third convolution layer, 16 filters with a 3-by-3 kernel size aim to emphasize connected pixels that can differentiate the textural features among the four examined histological structures. Max pooling is no longer applied, and the training process makes a transition to the fully connected layer.

• The fully connected layer defines a dense layer with 4096 flattened neurons to gather the filtered anatomical features from the three convolution layers. These neurons are further connected to the final softmax layer. The dense and softmax layer connections act similarly to a multilayer perceptron (MLP) artificial neural network, with the softmax function allocating probability distributions during the prediction of the four hepatic classes [18].
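The layer sizes listed above can be checked with a few lines of shape arithmetic. The sketch below assumes stride-1 zero-padded convolutions and a pooling window equal to the stride of 2; the pooling window size is not stated in the text, and the helper names `conv_same` and `max_pool` are illustrative only. Under these assumptions, the 4096 flattened neurons of the dense layer follow directly from the 64 × 64 × 3 input.

```python
def conv_same(shape, n_filters):
    """Zero-padded ("same") convolution: width and height are preserved."""
    width, height, _ = shape
    return (width, height, n_filters)

def max_pool(shape, stride=2):
    """Max pooling; assumes the pooling window equals the stride."""
    width, height, depth = shape
    return (width // stride, height // stride, depth)

shape = (64, 64, 3)                     # resized input patch
shape = max_pool(conv_same(shape, 64))  # layer 1: 64 filters (5x5) + pooling
shape = max_pool(conv_same(shape, 32))  # layer 2: 32 filters (3x3) + pooling
shape = conv_same(shape, 16)            # layer 3: 16 filters (3x3), no pooling

flattened = shape[0] * shape[1] * shape[2]
print(shape, flattened)  # (16, 16, 16) 4096
```

Batch normalization, ReLU and dropout are omitted here because they do not change the tensor shape.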

Applied Optimization Algorithms
A brief reference is made to the parameter values defined in two modern optimization algorithms used to improve the training process. The first applied optimizer is adaptive moment estimation (Adam), which is known for its low memory requirements, as it takes into account first-order gradients only [19]. Since this optimization method is adaptive, it calculates different learning rates from the first (mean) and second raw (uncentered variance) moment estimates of the gradients. Therefore, the updated weights are calculated as follows:
$$\theta \leftarrow \theta - \varepsilon \, \frac{\hat{s}}{\sqrt{\hat{r}} + \delta}$$
where θ denotes the network weights, ε the initial learning rate, set equal to 0.001, ŝ the bias-corrected first-moment estimate and r̂ the corresponding second-moment estimate. δ refers to a numerical stabilization constant with a value of $10^{-8}$, assigned (by default) to reduce the variance in weight updates [20]. In Adam, an important parameter is the decay rate of the squared gradient moving average for penalizing large weights, which is set to a scalar value of 0.99. All the above configurations aim at a more efficient convergence of the loss function towards the global minimum.
The second optimization solution is the stochastic gradient descent with momentum (SGDM) algorithm. Specifically, the momentum term accumulates an exponentially decaying average of past gradients and continues to move in their direction [20]. In the general update rule, β is a hyperparameter set at 0.9 to prevent the momentum m from overspeeding, and θ denotes the updated network weights [14]. The θ values are obtained by subtracting from the weights the gradient of the loss function, ∇_θ J(θ), multiplied by a constant learning rate ε equal to 0.001.
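To make the two update rules concrete, the following sketch applies one commonly used formulation of each optimizer to a toy one-dimensional loss J(θ) = θ². Several choices here are assumptions rather than details of the study: the SGDM form m ← βm + ε∇J(θ), θ ← θ − m is only one of several equivalent conventions, the first-moment decay rate of 0.9 in Adam is a typical default that the text does not state, and the function names and toy loss are purely illustrative. The learning rate (0.001), momentum (0.9), squared-gradient decay (0.99) and stabilization constant (10^-8) follow the values given above.

```python
import math

def sgdm_step(theta, m, grad, lr=0.001, beta=0.9):
    """One SGDM update: decaying accumulation of past gradients."""
    m = beta * m + lr * grad
    return theta - m, m

def adam_step(theta, s, r, grad, t, lr=0.001, rho1=0.9, rho2=0.99, delta=1e-8):
    """One Adam update with bias-corrected moment estimates (t starts at 1)."""
    s = rho1 * s + (1 - rho1) * grad          # first moment (mean)
    r = rho2 * r + (1 - rho2) * grad * grad   # second raw moment
    s_hat = s / (1 - rho1 ** t)               # bias correction
    r_hat = r / (1 - rho2 ** t)
    theta = theta - lr * s_hat / (math.sqrt(r_hat) + delta)
    return theta, s, r

def grad_J(theta):
    """Gradient of the toy loss J(theta) = theta ** 2."""
    return 2.0 * theta

theta_sgdm, m = 1.0, 0.0
theta_adam, s, r = 1.0, 0.0, 0.0
for t in range(1, 201):
    theta_sgdm, m = sgdm_step(theta_sgdm, m, grad_J(theta_sgdm))
    theta_adam, s, r = adam_step(theta_adam, s, r, grad_J(theta_adam), t)

print(theta_sgdm, theta_adam)  # both move from 1.0 towards the minimum at 0
```

On this toy problem the Adam step size stays close to the learning rate regardless of the gradient magnitude, whereas the SGDM step scales with the gradient; this is the practical meaning of the adaptive learning rates mentioned above.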

Results
The proposed CNN model was trained in two separate processes, each with a different optimizer: Adam for the first and SGDM for the second. This section focuses on (1) measuring the performance of the constructed deep architecture on the validation samples (n = 60) and (2) its classification capability on the test set (n = 40). At a later stage, the CNN with the better-performing optimization algorithm is compared with well-known pre-trained CNN architectures, utilizing transfer learning updates. Subsequently, the prediction capability of the same model is compared to that of a conventional multilayer perceptron (MLP) neural network.

Training and Validation Results
Having defined the CNN topology, the focus is on the training process, which is set to run for 30 epochs. Every epoch comprises a full pass over the entire training set of 620 samples. The training data are also shuffled at the beginning of each epoch. Figure 4 presents a comparison of the validation graphs, each derived from the training procedure with one of the analyzed optimizers. According to the diagram, the training process is set to run for a maximum of 270 iterations (9 per epoch), during which the accuracy on the validation data is calculated. It is recalled that the validation set is not used to update the network weights, but to assess whether a model suffers from either overfitting or underfitting. Finally, a validation patience value of 3 is set to stop the training process in case the same validation value is produced at least three times, indicating that the CNNs have learned sufficiently from the image data.
At first, the CNN Adam validation graph (Figure 4) is monitored, which shows the training process of the neural network converging at the 162nd training iteration. This was the result of overfitting combined with the production of three similar validation values equal to 91.7%. In contrast, the CNN SGDM graph shows that the SGDM optimizer performed better, as it did not overfit the training data. This led to better results, as the deep classifier not only completed the learning process by running for all 30 epochs but also produced a higher validation value of 96.7%.
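The stopping behaviour can be sketched with a generic patience rule. The version below is an assumption about how such a criterion is commonly implemented: it monitors the validation loss and stops after `patience` consecutive checks without improvement, whereas the text describes the criterion in terms of repeated validation values. The function name and the loss values are hypothetical.

```python
def stopping_iteration(val_losses, patience=3):
    """Return the index of the validation check at which training stops,
    or None if the patience limit is never reached."""
    best = float("inf")
    stalls = 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best = loss      # improvement: remember it and reset the counter
            stalls = 0
        else:
            stalls += 1      # no improvement at this validation check
            if stalls >= patience:
                return i
    return None

# Hypothetical validation losses, one per validation check
losses = [1.20, 0.80, 0.55, 0.50, 0.52, 0.50, 0.51, 0.49]
print(stopping_iteration(losses))  # 6
```

In this example the best loss (0.50) is reached at the fourth check and the next three checks fail to improve on it, so training stops before the final value is ever evaluated.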

Testing Results
To test the reliability of the developed methodology, the two trained models (CNN Adam and CNN SGDM) are called to identify 40 unknown liver structures (10 per class). In the current task, the softmax function is asked to assign an input image, described by a vector x, to a class identified by a class label y ∈ {ballooning, fat, sinusoid, vein}. Thus, the function outputs a probability distribution over the four classes, with each value lying in the [0,1] interval. After the end of the testing process, the purpose is (a) to measure the classification accuracy for every individual liver class and (b) to retrieve further statistics from the classification report. These metrics include the mean accuracy, precision, recall (sensitivity) and F-score (Table 1).
Examples of the image patch test results are shown in Figure 5. Each of these images is accompanied by its estimated classification probability (%), indicating how confident the CNNs are of their predictions. According to the figure, in most cases an accurate discrimination result, marked by a green frame, is presented for the four hepatic tissue objects. It is observed that both neural networks have a reduced efficiency in identifying some sinusoids (red frames = misclassifications), which are among the most complex histological features to classify. However, the success lies in the fact that all ballooned cells and fat droplets, which characterize two of the most widespread liver diseases, have been identified with high confidence levels. Based on the exported percentages in Table 1, it is clear that the classifiers are more stable in detecting ballooned hepatocytes, as these consist of multiple changes in the values of their adjacent pixels. They also successfully discriminate circular structures that are not steatotic fat cells but hepatic veins, which tend to contain several red blood cells.
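The role of the softmax function in this step can be illustrated with a minimal sketch. The raw scores below are hypothetical values chosen for demonstration, not outputs of the trained networks.

```python
import math

CLASSES = ["ballooning", "fat", "sinusoid", "vein"]

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.1, 0.3, -1.0, 0.5]                # hypothetical final-layer scores
probs = softmax(scores)
predicted = CLASSES[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

The predicted label is the class with the highest probability, and that probability corresponds to the confidence percentage displayed next to each test patch.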
Proceeding to Table 1, additional information is provided for the two classifiers, with the mean precision and recall (sensitivity) values. First, CNN Adam shows a lower recall value (92.5%) than precision (93.6%). This indicates that CNN Adam missed some positive samples; consequently, it produced more false negative (FN) diagnoses than false positive (FP) ones. Examples of CNN Adam misclassifications are shown in Figure 5, in which two incorrect sinusoid characterizations are displayed. In contrast, CNN SGDM delivered balanced precision and recall rates (95%) by producing more true positives (TP). For verification purposes, the two measures are combined into a single F-score (F1-score) value, representing their harmonic mean. Thus, if one metric carries a lower value, the F-score lies closer to the smaller number than to the larger one, which gives the classification models a more appropriate score than a common arithmetic mean. CNN Adam then receives a 93% F-score, whereas CNN SGDM receives a higher 95% F-score due to its fully balanced performance.
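As a quick check of the harmonic-mean calculation, the sketch below recomputes the F-scores from the precision and recall values reported above; the function name `f_score` is illustrative.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1-score)."""
    return 2 * precision * recall / (precision + recall)

f1_adam = f_score(0.936, 0.925)   # CNN Adam: precision 93.6%, recall 92.5%
f1_sgdm = f_score(0.95, 0.95)     # CNN SGDM: precision and recall 95%
print(round(f1_adam, 3), round(f1_sgdm, 3))  # 0.93 0.95
```

Both values agree with the 93% and 95% F-scores quoted in the text.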

Performance Comparison with Pre-Trained CNN Models
In the next step, the performance of the optimal CNN SGDM classifier is compared with two of the most widely used pre-trained CNN models. These are the 8-layer AlexNet [21] and the deeper 16-layer VGG-16 [22] neural networks, two architectures that have yielded high classification results in recent years. In both pre-trained models, transfer learning updates have been applied to adapt them to the current classification problem of the four liver tissue alterations. Initially, to apply transfer learning to the AlexNet network, the biopsy image patches were resized to 227 × 227 × 3 pixels, a necessary step for them to fit as input samples. The output layers of the original AlexNet network were replaced accordingly to generate probabilities for the four histological structures. In the case of VGG-16, the samples were converted to 224 × 224 × 3 pixel size and the output layers were modified as before.
In both classifiers, the training process was set for a maximum of 10 epochs, and the SGDM algorithm was used as the optimizer. The validation patience number was again set equal to 3, with the networks completing their training in less than 10 epochs. Table 2 compares the accuracy, precision (positive predictive value, PPV), recall (sensitivity) and specificity (true negative rate, TNR) rates generated by AlexNet, VGG-16 and the previously built CNN SGDM architecture. According to the percentages in Table 2, the constructed CNN SGDM architecture achieves better classification performance (accuracy: 95%, precision: 95%, recall: 95%, specificity: 98.3%) than VGG-16 (accuracy: 94%, precision: 94.1%, recall: 94%, specificity: 98%), while AlexNet emerges as the best-performing classifier (accuracy: 97%, precision: 97%, recall: 97%, specificity: 99%). All performance differences are presented in Figure 6.
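The four reported rates can all be derived from a multi-class confusion matrix by averaging the per-class values. The sketch below uses a hypothetical 4 × 4 matrix for 40 test samples, chosen only because it is consistent with the CNN SGDM percentages quoted above; it is not the confusion matrix of the study, and the function name `macro_metrics` is illustrative. The code assumes every class is both present and predicted at least once, so no division by zero occurs.

```python
def macro_metrics(cm):
    """Accuracy plus macro-averaged precision, recall and specificity.
    cm[i][j] counts samples of true class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    precision = recall = specificity = 0.0
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                          # missed samples of class k
        fp = sum(cm[i][k] for i in range(n)) - tp     # wrongly assigned to class k
        tn = total - tp - fn - fp
        precision += tp / (tp + fp)
        recall += tp / (tp + fn)
        specificity += tn / (tn + fp)
    accuracy = sum(cm[k][k] for k in range(n)) / total
    return accuracy, precision / n, recall / n, specificity / n

# Hypothetical matrix (rows/columns: ballooning, fat, sinusoid, vein)
cm = [
    [10, 0, 0, 0],
    [0, 10, 0, 0],
    [0, 0, 9, 1],
    [0, 0, 1, 9],
]
accuracy, precision, recall, specificity = macro_metrics(cm)
print(round(accuracy, 3), round(precision, 3),
      round(recall, 3), round(specificity, 3))  # 0.95 0.95 0.95 0.983
```

With only two misclassified samples out of 40, specificity remains high because each class has 30 negative samples, of which at most one is a false positive.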

Performance Comparison with a Conventional Neural Network
The performance of a conventional artificial neural network algorithm is then investigated. In more detail, a Multilayer Perceptron (MLP) with 2 hidden layers of 6 nodes each was trained and tested on selected features (area, eccentricity, mean intensity, StD intensity, etc.) extracted from pre-processed images of the same biopsy data set. Once again, the output layer consisted of 4 nodes, one for each of the four tissue structures. According to the results of Table 2, the MLP produced lower classification rates than the CNN SGDM model (accuracy: 90.3%, precision: 90.3%, recall: 90.3%, specificity: 96.8%). Details of the produced measurements can again be found in Figure 6.

Visualization of Filtered Anatomical Features
This subsection focuses on investigating the feature activations of ballooned cells and fat droplets in all convolution layers for the CNN SGDM and AlexNet models (Figure 7). This visualization tool could help physicians determine the most critical anatomical patterns that characterize the two liver diseases examined in this study. A key characteristic of each convolution filter is that it converts each image patch into multiple feature maps that are more similar to the filter itself [14]. These feature maps are then rectified by the ReLU function, ensuring that they always carry non-negative activation values. In the ReLU function, since any positive value can be assigned to the activated pixels, a division of the gradient tensor by its L2-norm is proposed, normalizing the magnitude of the output to the closed interval [0,1] [23]. This ensures that the magnitude of all activations is always within the same range for each previously convolved image, making the final representations more visually intense. Therefore, bright white pixels represent strong positive activations, while pure black pixels represent strong negative ones, respectively.
As shown in Figure 7, in both CNNs the first ReLU1 activations concentrate on identified edges, which can synthesize the basic structure of the balloon cell and fat droplet. It is recalled that both the CNN SGDM and AlexNet models apply the max pooling operation to their convolution layers (CNN SGDM: layers 1, 2; AlexNet: layers 1, 2, 5). Unlike CNN SGDM, AlexNet applies a 3 × 3 max pooling in all cases, which causes more blur in the two liver samples as it aims to reduce overfitting of the convolved pixel data. Based on the ReLU2 activations, this technique is demonstrated to be efficient as it forces both neural networks to filter "higher-level" features that are less co-adapted and can lead to better generalization. Also in ReLU2, it turns out that CNN SGDM executes an earlier activation of pixels that indicate swollen hepatocytes, while AlexNet chooses to perform additional filtering on the detected edges. Moving on to the ReLU3 activations, it is noted that non-informative pixel activations have been significantly reduced in both deep models, with AlexNet performing a cleaner filtering of the necessary curves that form the perimeter of the ballooned hepatocyte and the lipid droplet. The same model also achieves better performance within the ballooning area, as only the most important pixels of the two hepatocytes are activated.
It is known that the AlexNet architecture consists of two additional convolution layers [21] which, according to the above figure, can lead to the activation of individual small patterns that could improve the overall classification performance. However, it seems that the first three convolution layers of CNN SGDM are sufficient for the necessary histological features to be filtered. On the other hand, a key prerequisite for CNN SGDM is to determine more optimal parameters which could further reduce the overfitting effect on the training data.
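The rectification, normalization and pooling steps discussed in this subsection can be sketched on a toy feature map. This is a minimal illustration under stated assumptions: the 4 × 4 response values are hypothetical, the L2 normalization is applied to the rectified map itself, and the pooling stride of 1 is chosen only to keep the example small.

```python
import math


def relu(feature_map):
    """Zero out negative responses, keeping positive activations unchanged."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def l2_normalize(feature_map, eps=1e-12):
    """Divide every value by the L2 norm of the whole map."""
    norm = math.sqrt(sum(v * v for row in feature_map for v in row))
    return [[v / (norm + eps) for v in row] for row in feature_map]


def max_pool(feature_map, size, stride):
    """Slide a size x size window and keep the maximum at each position."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, cols - size + 1, stride)]
            for r in range(0, rows - size + 1, stride)]


# Hypothetical 4 x 4 filter response (not taken from either network).
RESPONSE = [
    [-1.0, 2.0, 0.5, -0.3],
    [0.0, 3.0, -2.0, 1.0],
    [4.0, -1.5, 0.2, 0.0],
    [-0.7, 1.0, 2.5, -4.0],
]
ACTIVATION = l2_normalize(relu(RESPONSE))
POOLED = max_pool(ACTIVATION, size=3, stride=1)  # 3 x 3 window, as in AlexNet
```

Because the rectified values are non-negative and none can exceed the norm of the whole map, every normalized activation falls within [0,1], which is what allows the maps to be rendered directly as grayscale images.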

Discussion
Non-alcoholic fatty liver disease (NAFLD) is a common cause of liver disorder worldwide. Many studies investigating the natural history of NAFLD have verified its progression from chronic non-alcoholic steatohepatitis (NASH) to end-stage cirrhosis and hepatocellular carcinoma (HCC) [24]. Because a multitude of complications impede their accurate identification and treatment, their prevalence has been evaluated with a variety of diagnostic methods. Quantitative assessment through digital histological imaging has been established as the gold standard in clinical trials, with liver biopsies serving as the means for the detection and staging of NASH and NAFLD. However, biopsy is an invasive procedure and is therefore reserved for cases in which subjective evaluations are not acceptable.
The current study is an extension of an earlier project [25], the results of which were presented at the 42nd International Conference on Telecommunications and Signal Processing (TSP), held in Budapest in July 2019. It focuses on resolving the aforementioned diagnostic barrier by fully automating the supervised classification process using deep learning systems. In particular, a CNN architecture is defined for fast training and accurate classification of four liver tissue structures from biopsy images. The objects of interest are two disease-related liver structures, (a) ballooned hepatocytes and (b) fat droplets, as well as two non-disease-related objects, (c) sinusoids and (d) veins. The performance of this new deep topology is then compared with that of well-known pre-trained CNN models, as well as with a conventional MLP-ANN.
The forthcoming subsections comment on the techniques previously applied to the 4-class recognition problem, which eventually produced a 95% classification accuracy. They are followed by an overview of research efforts on histopathological liver specimens. The main goal is for the obtained results to be qualitatively compared with those from different diagnostic applications and different liver tissue examinations in recent years. Then, a brief description of the possibilities for extending the present methodology is given, continuing on the theme of fully automated object recognition and how it can offer effective solutions to medical diagnostic centers.
Figure 4 illustrates a dashboard showing the validation values, during the training phase, on the corresponding subset of validation images (n = 60). In the first validation step, CNN Adam performs better than CNN SGDM (CNN Adam: 85%, CNN SGDM: 78%), but in subsequent validations its performance is inferior to that of CNN SGDM, as it tended to overfit the training data. These results are in line with other published conclusions [26,27], which state that adaptive algorithms can boost the CNN computations by using a vector of changing learning rates, one for each parameter, which is adapted as the training algorithm progresses. This is in contrast to stochastic gradient descent (SGD) optimizers, which use a constant learning rate during the training process [27]. These publications emphasize that even with a small mini-batch size (64 image patches in the current study), Adam finds no solutions whose performance matches the state of the art. It has consistently been shown to be related to non-generalized results and, especially in this case, to non-convergence. In conclusion, it is usually noted that in systems with large computational resources, the use of SGD-type optimization techniques remains the ideal solution.
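The difference between the two update rules can be illustrated on a one-parameter toy loss. The sketch below implements the standard textbook forms of SGD with momentum and Adam; the loss f(w) = w², the starting point and all hyperparameter values are assumptions for illustration and are not the settings used in this study.

```python
import math


def grad(w):
    """Gradient of the toy loss f(w) = w**2."""
    return 2.0 * w


def sgdm(w, steps, lr=0.1, momentum=0.9):
    """SGD with momentum: one fixed learning rate plus a velocity term."""
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity - lr * grad(w)
        w += velocity
    return w


def adam(w, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: the step is rescaled by running moments of the gradient."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # first moment (mean of gradients)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (mean of squares)
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


W_SGDM = sgdm(5.0, steps=200)
W_ADAM = adam(5.0, steps=200)
print(f"SGDM: {W_SGDM:.6f}  Adam: {W_ADAM:.6f}")
```

In SGDM the learning rate is the same for every parameter and every step, whereas in Adam the effective step of each parameter depends on the history of its own gradients. A toy problem of this kind says nothing about generalization, which is the issue raised in [26,27].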

Testing Performance
The boxplot is a well-established technique for presenting a 5-number summary (minimum and maximum range values, upper and lower quartiles, and median value), offering a quick analysis of the models' classification performance [28]. Both the CNN Adam and CNN SGDM neural networks show a comparatively longer inter-quartile range (IQR) in the sinusoid class, ultimately yielding larger error values, resulting from Q3 + 1.5 * IQR, than the remaining hepatic tissue structures. Based on Figure 8, two false positive classifications of sinusoids as liver veins in CNN Adam have yielded a greater error variability for the corresponding vein class. The same is true for the CNN SGDM model diagram, where one incorrect classification of a balloon cell as a sinusoid (false positive) and another of a sinusoid as a ballooning area (false negative) have increased the inter-quartile error range for both class labels. All these performances, along with the error rates, ultimately produce a classification accuracy of up to 95%. It is noted that they show an improvement compared to the results of previous classification approaches [29], and it is expected that they will further reduce the overall fat and ballooning prevalence ratio error compared to human visual interpretations [30]. Importantly, the current outcomes also suggest a steady improvement in automated detection techniques and emphasize their diagnostic capabilities with respect to semi-quantitative methods.
Figure 8. Verification of the classification results. In this figure, the boxplots depict the prediction error probabilities for each hepatic class, produced by the CNN Adam and CNN SGDM deep models. According to the two diagrams, classes with higher accuracy have a median (the second quartile, located at the boundary between the two colors) close to zero and less variance, while outliers are marked with black circles, pointing to greater than normal error values that affect the overall data observation.
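The 5-number summary and the Q3 + 1.5 * IQR fence used in the boxplots can be computed as in the following minimal sketch. The error values are hypothetical, and the "inclusive" quartile method is an assumption, since plotting tools differ in how they interpolate quartiles.

```python
import statistics


def five_number_summary(errors):
    """Minimum, Q1, median, Q3 and maximum of a list of error values."""
    q1, median, q3 = statistics.quantiles(errors, n=4, method="inclusive")
    return min(errors), q1, median, q3, max(errors)


def upper_outliers(errors):
    """Values above the upper fence Q3 + 1.5 * IQR, plus the fence itself."""
    _, q1, _, q3, _ = five_number_summary(errors)
    fence = q3 + 1.5 * (q3 - q1)
    return [e for e in errors if e > fence], fence


# Hypothetical per-patch error probabilities (1 - probability of the true class).
SINUSOID_ERRORS = [0.01, 0.02, 0.02, 0.03, 0.04, 0.05, 0.06, 0.08, 0.62]
SUMMARY = five_number_summary(SINUSOID_ERRORS)
OUTLIERS, FENCE = upper_outliers(SINUSOID_ERRORS)
print(SUMMARY, FENCE, OUTLIERS)
```

In this toy list, the single patch with an error of 0.62 lies above the fence and would be drawn as an outlier circle, while the remaining values determine the box and whiskers.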

Methodology Performance Compared to Other Classification Models
According to Table 2 and the accompanying diagram in Figure 6, CNN SGDM performs better than VGG-16, demonstrating that it can contribute significantly to fully automated disease assessments. The AlexNet model, on the other hand, achieves better performance. However, the proposed deep CNN classifier focuses on short training from scratch and on minimizing the number of layers (4 layers in total, compared to 8 layers for AlexNet and 16 layers for VGG-16). By contrast, the poor performance of the conventional MLP-ANN relative to the CNN architectures is one of the main reasons that have led the research community to move to deep learning algorithms. Table 3 summarizes all the deep neural networks used in the liver biopsy dataset classification process along with their training times. It is emphasized that these training times cannot be directly compared, since only transfer learning updates were applied to the AlexNet [21] and VGG-16 [22] networks, which had originally been trained from scratch on the ImageNet dataset. Specifically, AlexNet was trained for 6 days on two NVIDIA GeForce GTX 580 GPUs and VGG-16 for 2-3 weeks on four NVIDIA Titan Black GPUs. Although these architectures are two of the most preferred options for extracting image features, they consist of a huge number of trainable parameters (AlexNet: 60 million, VGG-16: 138 million), which makes processing very demanding for average hardware systems. The AlexNet transfer learning process lasted 45 s, while for the much deeper VGG-16 model it took 5 min and 13 s, which is longer than the training of CNN SGDM from scratch, which was completed in 2 min thanks to its 16,825,876 trainable parameters. All these conclusions justify the effort of novel research works to develop new deep models that could achieve shorter training times on specific classification problems, without the use of high-budget hardware equipment.

Visualization of Learned Features
A characteristic of a convolution filter is that it decomposes each histological sample into multiple feature maps [23]. Figure 7 includes a commonly used technique for visualizing these maps into independent 2D images.
In most computer vision problems there is a constant change in the background scene, with trained models being called upon to achieve a more rational separation of common objects of interest. This differs from the present identification problem, where there is a recurring background consisting of pixels that usually carry an H&E histological stain, along with objects of non-interest, such as healthy tissue and hepatocytes. As a positive observation in Figure 7, background pixel activations are significantly minimized by the ReLU functions in both the CNN SGDM and AlexNet models, with tissue structures being successfully recognized. Thanks to the classifiers' deep architecture, a distinction is made between critical features, such as changes in adjacent pixel intensities that point to different edge types (e.g. straight or curved lines), as well as more detailed structures, including ballooned hepatocytes. However, carrying even a small proportion of background pixel activations remains a cause of overfitting, and future applications aim to limit this issue with solutions proposed in the last section.

Qualitative Performance Comparison with Prior Methodologies
The number of digitally scanned microscopic specimens (64), along with the extracted image patches (720), forms a sufficient image dataset for the implemented deep CNN architecture. This allows a qualitative comparison of the present work with recent innovative efforts aimed at locating diverse anatomical structures and chronic conditions exclusively on liver biopsy images.
Unfortunately, a direct quantitative comparison with other related works presented in the literature is not feasible. The current study employs a unique liver biopsy dataset, while the cited papers do not in all cases analyze the same histological areas of interest. Also, the cited methods do not rely on similar evaluation metrics when measuring their classification capability. Therefore, Table 4 provides a qualitative comparison of the referred methodologies, which combine digital image processing techniques with conventional machine learning algorithms or rely on fully automated deep learning architectures. The table first shows that preprocessing of H&E stained samples remains an essential step for image segmentation purposes. Also, unsupervised machine learning algorithms, such as K-means, are popular clustering techniques for separating biopsy samples from their background as well as tissue structures of interest. Subsequently, these methods lead well-known classifiers such as k-nearest neighbors (k-NN), decision tree (DT) and support vector machines (SVM) to high object recognition performance.
Nativ et al. [6] presented an image analysis method that could distinguish the main differences between small-droplet macrovesicular steatosis (sd-MaS) and large-droplet macrovesicular steatosis (ld-MaS). The methodology was based on an automated active contour modeling (ACM) technique for lipid droplet segmentation, the unsupervised K-means algorithm for clustering the two objects of interest and a decision tree classifier to improve the separation between the two categories. After the classification stage, specificity and sensitivity values were 93.7% and 99.3%, respectively. The coefficient of determination of the linear regression against the pathologists' semi-quantitative assessments was 0.97.
Sumitpaibul et al. [7] proposed an image processing-based method for estimating the fat ratio in liver biopsy images. This study adopted classic image processing techniques to extract the areas of candidate fat blobs, including grayscale and binary image conversion, background segmentation and average noise filtering. Next, the k-NN classifier was called to identify fat blobs that could lead to an accurate calculation of the fat prevalence ratio.
Hall et al. [8] investigated the relationships between liver fat, aminotransferases, and hepatic architecture in steatotic liver sample examinations. Binary segmentation of the red, green and blue (RGB) channels resulted in the distinction of fat vacuoles and the measurement of the fat proportionate area (mFPA). The results showed that there were significant increases in alanine aminotransferase (ALT), and aspartate aminotransferase (AST) when the fat content increased. Other data also indicated both 5% and 20% of mFPA as a cut-off for raised ALT. Moreover, significant growth in hepatic architecture (HA) and lobule radius (LR) were observed when fat accumulation increased (mFPA = 10%).
Roy et al. [9] proposed a segmentation method for extracting histological regions of interest from high-resolution biopsy images. This was followed by the application of image enhancement and morphological operation techniques to enhance and smooth the boundaries of steatosis components, as well as to remove small undesired objects. Furthermore, a sophisticated technique for assigning curvature points to differentiate overlapped fat droplets was presented. Finally, a supervised classification step was used and resulted in discrimination rates for both isolated and overlapped steatosis, where in most cases they were equal to 100%.
Vanderbeck et al. [10] focused on a multiple liver class recognition problem. The method relied on both image preprocessing and supervised classification techniques. The SVM algorithm achieved 89% classification accuracy and identified macrosteatosis, bile ducts, portal veins and sinusoids with precision and recall values ≥ 82%. In a subsequent study [12], the same team focused on the automatic detection and quantification of lobular inflammation and hepatocellular ballooning. As before, image preprocessing and supervised classification resulted in 70% and 49% precision and recall values for lobular inflammation and 91% and 54% for hepatocellular ballooning. In addition, the classifier had a 95% area under the curve (AUC) for lobular inflammation and 98% for hepatocellular ballooning. Spearman's correlation coefficient was applied to compare the method's performance with that of expert pathologists and was 45.2% for lobular inflammation and 46% for hepatocyte ballooning.
Segovia-Miranda et al. [11] applied a three-dimensional imaging technique to generate spatially-resolved geometrical and functional models for the diagnosis of liver tissue specimens at different NAFLD stages. The methodological approach identified a set of morphological changes associated with NAFLD progression. These morphological changes included the lipid droplet size distribution, nuclear texture homogeneity, and feature values which were used as tissue biomarkers to distinguish the different stages of NAFLD progression. Correlations between various diagnosed findings and NAFLD progression in individual patients were analyzed. The results indicated that gamma-glutamyl transpeptidase (GGT) had the strongest Pearson correlation coefficient (GGT = 0.680). The same coefficient was 0.473 for alkaline phosphatase (ALP), 0.505 for total bile acid (BA) and 0.518 for the corresponding primary BA.
More recently, there has been an increase in the selection of deep convolutional neural networks for the classification and monitoring of microscopic structures. Vicas et al. [13] aimed at fully automating the liver fibrosis detection process. The same group also focused on the objective quantification of steatosis, with classical computer vision techniques (image processing, conventional machine learning) and CNNs being the two diagnostic approaches. In the case of deep neural networks, the U-net proved to be the optimal approach for performing pixel-wise region segmentations. The validation of the automated quantitative analysis was performed using the R² correlation coefficient based on a physician's qualitative scores. Specifically, the R² was 0.748 for the classical computer vision approach and 0.893 for the CNN, respectively.
The above qualitative comparison shows in total that the full capabilities and strengths of the digital image processing field remain to be explored. Nonetheless, it becomes clear that deep learning algorithms can achieve high classification rates, by fully automating the image analysis process without the extensive need for image enhancement and object segmentation techniques. In the next subsection, new techniques, which have been implemented by many experts in the field of deep learning, are discussed and could lead to an improvement of the current methodology.

Future Thoughts and Ideas
As future work, several improvements could be included, such as (a) digitizing new biopsy samples and increasing the dataset to more than 1000 liver structures, (b) enhancing the neural network discrimination experience by applying transfer learning updates and (c) parameterizing the CNN architecture to adapt to the upcoming amount of data. The first step can be done in parallel with data augmentation techniques, which refer to classic 2D image transformations, including random rotations, random shearing and zooming, horizontal and vertical flips, etc. Applying more optimizers, including RMSprop and Adadelta, could also prove to be a good alternative to the proposed backpropagation algorithms.
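The simplest of the augmentation transformations mentioned above can be sketched on a small 2D patch. This is an illustrative example only: the patch is a hypothetical 3 × 3 grid of intensities, rotations are restricted to quarter turns, and shearing and zooming are omitted because they require interpolation.

```python
import random


def flip_horizontal(patch):
    """Mirror the patch left-to-right."""
    return [row[::-1] for row in patch]


def flip_vertical(patch):
    """Mirror the patch top-to-bottom."""
    return patch[::-1]


def rotate_90(patch):
    """Rotate the patch 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]


def augment(patch, rng):
    """Apply a random combination of flips and quarter-turn rotations."""
    if rng.random() < 0.5:
        patch = flip_horizontal(patch)
    if rng.random() < 0.5:
        patch = flip_vertical(patch)
    for _ in range(rng.randrange(4)):
        patch = rotate_90(patch)
    return patch


# Hypothetical 3 x 3 grayscale patch standing in for a biopsy image patch.
PATCH = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
AUGMENTED = [augment(PATCH, random.Random(seed)) for seed in range(5)]
```

Flips and quarter turns only rearrange pixels, so the stain intensities of each structure are preserved, which seems appropriate for histological patches that have no canonical orientation.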
Thereafter, autoencoder neural networks could make a significant contribution to the data preprocessing stage, further limiting the overfitting effect in the training set. Autoencoders (e.g. stacked, variational) refer to a sophisticated technique for learning efficient representations of input data without any supervision [14]. Typically, the autoencoder's output is a reconstruction of the input data in its most efficient form [16]. Such an unsupervised model will be employed to reduce the dataset's dimensionality, by preserving the most informative elements that compose a liver disease structure and eliminating the background pixel activations as much as possible.
Future projects will include an accurate method that could involve real-time histological classifications through a digital microscope. All current study implementations, as well as future improvements, will also be included for disease quantification purposes. During the scan of biopsy specimens, the learned feature weights will lead detection windows (as bounding boxes with active contour lines) to structures identified with the four liver classes (Figure 9). Typical examples of such detectors are (a) region-based convolutional neural networks (R-CNNs) [31,32], (b) you only look once (YOLO) [33,34] and (c) single shot detector (SSD) [35]. An interesting choice is also the U-Net architecture [36], which includes a variant of CNNs, for pixel-wise segmentation.
Figure 9. A blueprint of the object detection method.

Following the extraction of the histological sample area (e.g. with K-means) and the discrimination of structures of interest, the exclusion of anatomic features not related to pathological findings will be executed, performing an objective assessment of the liver diseases. As a result, clinicians will have at their disposal a quick and accurate diagnostic support tool, which will compute the percentages of ballooning degeneration and fat accumulation ratio. The quantification of the two conditions will be carried out using the following formula:
$$\text{Ballooning}\,(\%) = \frac{\sum_i b_i}{n_t} \times 100, \qquad \text{Fat}\,(\%) = \frac{\sum_i f_i}{n_t} \times 100,$$
where $b_i$ and $f_i$ are the total counts of pixels that form the classified balloon cells or fat droplets, respectively, and $n_t$ is the total area of tissue pixels.
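The pixel-ratio quantification described above (classified balloon-cell or fat-droplet pixels divided by the total tissue pixel count n_t) can be sketched on a labeled mask. The label codes, the toy mask and the choice to count lesion pixels as part of the tissue area are assumptions made for this illustration.

```python
# Hypothetical label codes for a classified tissue mask (not from the study).
BACKGROUND, TISSUE, BALLOON, FAT = 0, 1, 2, 3


def prevalence(mask):
    """Return ballooning and fat percentages relative to the tissue area."""
    counts = {BACKGROUND: 0, TISSUE: 0, BALLOON: 0, FAT: 0}
    for row in mask:
        for label in row:
            counts[label] += 1
    # n_t: all pixels belonging to the tissue sample, lesions included (assumed).
    tissue_pixels = counts[TISSUE] + counts[BALLOON] + counts[FAT]
    if tissue_pixels == 0:
        return 0.0, 0.0
    return (100.0 * counts[BALLOON] / tissue_pixels,
            100.0 * counts[FAT] / tissue_pixels)


# Toy 5 x 5 mask: 4 background, 4 balloon-cell and 3 fat-droplet pixels.
MASK = [
    [0, 1, 1, 1, 0],
    [1, 2, 2, 1, 1],
    [1, 2, 2, 3, 1],
    [1, 1, 3, 3, 1],
    [0, 1, 1, 1, 0],
]
BALLOONING_PCT, FAT_PCT = prevalence(MASK)
print(f"ballooning: {BALLOONING_PCT:.1f}%  fat: {FAT_PCT:.1f}%")
```

Excluding background pixels from the denominator keeps the ratios independent of how much empty slide area surrounds the specimen.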

Conclusions
The current work focuses on building a deep convolutional neural network architecture, aiming at short training time combined with the precise classification of four liver biopsy tissue alterations. The new CNN model was trained on two different occasions with the SGDM and Adam optimization algorithms, with SGDM producing the best classification accuracy (95%). The performance of the new CNN SGDM topology was then compared with that of the pre-trained AlexNet and VGG-16 models, in which transfer learning updates were applied, as well as with a conventional MLP artificial neural network. The results showed that the constructed CNN SGDM model can achieve better classification accuracy than VGG-16, while AlexNet emerged as the best-performing classifier. Also, the constructed model was superior to the conventional MLP-ANN, indicating the need to apply deep learning architectures to modern computer vision methodologies. In conclusion, CNN architectures are based on fully automated image analysis steps, without the extensive need for manual annotations. This is a decisive step in the objective distinction of hepatocyte ballooning and fat droplets, two tissue structures responsible for the increasing prevalence of NAFLD and NASH in recent years.