Enhanced Automatic Morphometry of Nerve Histological Sections Using Ensemble Learning

Abstract: There is a need for an automated morphometry algorithm to facilitate the otherwise labor-intensive task of quantitative histological analysis of neural microscopic images. A benchmark morphometry algorithm is the convolutional neural network Axondeepseg (ADS), which yields high segmentation accuracy for scanning and transmission electron microscopy images. Nevertheless, it shows decreased accuracy when applied to optical microscopy images, and it has been observed to yield sizable false positives when identifying small-sized neurons within the slides. In this study, ensemble learning is used to enhance the performance of ADS by combining it with the paired image-to-image translation algorithm PairedImageTranslation (PIT). Here, 120 optical microscopy images of peripheral nerves were used to train and test the ensemble learning model and the two base models individually for comparison. The results showed a weighted pixel-wise accuracy of 95.5% for the ensemble model, whereas ADS and PIT yielded accuracies of 93.4% and 90%, respectively. The automated measurements of axon diameters and myelin thicknesses from the manually marked ground truth images were not statistically different (at the 0.05 significance level) from the measurements taken from the same images when segmented using the developed ensemble model, while they were different when measured from the images segmented by the two base models individually. The automated measurement of the G-ratios indicated a higher similarity to the ground truth testing images for the ensemble model in comparison with the individual base models. The proposed model yielded automated segmentations of the nerve slides that were sufficiently equivalent to the manual annotations and could be employed for axon diameter and myelin thickness measurements in fully automated histological analysis of neural images.


Introduction
Communication between the central nervous system (CNS) and the body is established through the peripheral nervous system (PNS). Any problem that interrupts this communication can lead to various motor and sensory deficits (e.g., multiple sclerosis and neurodegenerative diseases). Such deficits can be attributed to several factors, such as demyelination, axonal loss, or a mixture of both. Motor and sensory disabilities are correlated with the severity of these defects [1,2]. Therefore, the ability to efficiently image, identify, and quantify nerve histological structures would allow researchers to track the progression and severity of these defects, and to assess the impact of therapeutic drugs [3,4].
In an attempt to automate this otherwise labor-intensive analysis, multiple traditional image segmentation algorithms have been applied for myelin and axon segmentation and labeling. Examples of these algorithms include thresholding, region growing, morphological operations, watershed, and active contours [5][6][7][8][9][10][11]. However, such methods did not gain much popularity due to their limited scope of application and accuracy [12].
In the last few years, deep learning methods, particularly convolutional neural networks (CNNs), have shown great capabilities in medical image segmentation [13]. A recent study presented the open-source software Axondeepseg (ADS), which implements the CNN approach to segment the nerve axon and myelin with a pixel-wise accuracy that reached up to 95% [14]. The performance of this software has been reported for images collected from scanning electron microscopy (SEM) and transmission electron microscopy (TEM), but not for the cheaper, widely used optical microscopy, raising the need for further exploration. Additionally, optical microscopy is prone to a high level of variability due to the preparation and staining protocols used, which may introduce variations that are not usually present in SEM and TEM images.
Another deep learning method for image transformation applications is paired image-to-image translation [15]. It is commonly used in computer vision, where an image from one domain is converted into a corresponding image in another domain, given sufficient training data [16]. It learns a joint distribution of images in different domains by using images from the marginal distributions of the individual domains [17]. This capability can be exploited in neural segmentation tasks, with the two image domains being the original unmarked images and the same images after segmentation or marking. This approach had not previously been applied to the neural image segmentation task.
Combining the strengths of the two aforementioned techniques is possible with the use of ensemble learning, where multiple learning algorithms are combined to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone [18]. In this study, the performance of ADS and a PIT-based classification software (PairedImageTranslation [15]) are first investigated individually for segmenting optical microscopy images of nerve slices; ensemble learning is then employed to enhance the segmentation performance by combining the ADS and PIT models. The resulting segmented images are exploited for the automated measurement of morphological parameters, such as the axon diameter and myelin thickness, and are contrasted with measurements based on manually annotated ground truth images.

Datasets
The images used in this study were obtained from sciatic nerves harvested from four mongrel dogs as part of a neural interface project. Following the perfusion process, the nerves were cut into sections 0.5-1.0 μm thick. Toluidine Blue staining (1% Toluidine Blue and 2% Borate) was used [19] to stain the myelin sheaths of the nerve axons.
The prepared slides were then imaged using optical microscopy with a resolution of 0.15 μm/pixel. All of the original animal experiments that produced the used images were approved by the Institutional Animal Care and Use Committee (IACUC) of the Case Western Reserve University (Cleveland, OH, USA). The data set consisted of 120 images of a size of 512 × 512 pixels. These images were distributed into 72 images for training, 24 for validation, and 24 for testing.
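The 72/24/24 partition of the 120 images can be reproduced with a simple random split. Whether the study's assignment was actually random is not stated, so the shuffling, seed, and filename pattern below are illustrative assumptions only:

```python
import random

def split_dataset(filenames, n_train=72, n_val=24, n_test=24, seed=0):
    """Randomly partition image filenames into train/validation/test sets."""
    assert len(filenames) == n_train + n_val + n_test
    rng = random.Random(seed)
    shuffled = filenames[:]                 # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

images = [f"nerve_{i:03d}.png" for i in range(120)]  # hypothetical filenames
train, val, test = split_dataset(images)
```

Fixing the seed makes the split reproducible across training runs of the two base models, which matters for a fair ensemble comparison.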

Ground Truth Labeling
Ground truth labels were created in two steps, as follows. (i) The inner and outer perimeters of the myelin were labeled manually by initially adjusting the image contrast, running an automated edge-detection algorithm, and then visually inspecting the image to retain the detected borders while erasing noise lines. (ii) The images were then edited using the open-source image editor GIMP to correct remaining incomplete boundaries and to fill the axon, myelin, and background regions with black, gray, and white colors, respectively. Figure 1 shows examples of the original images and their corresponding labeled images.

Network Architecture
Both the ADS and PIT frameworks originate from the U-Net model [20]. This model is structured in a U shape that contains two paths: the contracting path and the expanding path. In the contracting path, images are encoded into feature maps over several CNN layers. At each CNN layer, the number of feature maps is doubled, whereas the dimension of the image is reduced by half. The expanding path of the U-Net model works in the exact opposite way: at each CNN layer, the number of feature maps is halved, whereas the dimension of the image is doubled until it reaches the dimension of the original image [12,15].

ADS Network Architecture
Each contracting and expanding path of ADS consists of three blocks, and each block has three convolutional layers. In the first block, the convolutional layers use 5 × 5 kernels, while the convolutional layers in the remaining two blocks use 3 × 3 kernels. Each block in the contracting path ends with a convolution of stride 2 for dimensionality reduction. Each strided convolution layer has a corresponding up-convolution layer in the expansion path. The number of channels in the first block starts with 16 and doubles after each block, and then decreases by half in each block of the expansion path. The activation function used in all of the convolutional layers is rectified linear units (ReLU).

PIT Network Architecture
Each contracting and expanding path of PIT consists of four blocks, and each block has two convolutional layers. All of the blocks use 3 × 3 kernels. Each block in the contracting path ends with a convolution of stride 2 for dimensionality reduction. Each strided convolution layer has a corresponding up-convolution layer in the expansion path. The number of channels in the first block starts with 32, doubles after each block, and then decreases by half in each block of the expansion path. The activation function used in all of the convolutional layers is the rectified linear unit (ReLU). The output of each contracting-path block is connected directly to the corresponding expanding-path block through a skip connection. The idea of skip connections is to transfer the raw information directly to the final output, which helps mitigate the vanishing-gradient problem and accelerates learning.
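The channel and spatial-size progressions described for the two encoders can be traced with a short helper. The block counts and starting widths follow the text (ADS: three blocks from 16 channels; PIT: four blocks from 32), and the 512-pixel input size comes from the dataset description; the function name is illustrative:

```python
def encoder_shapes(in_size, base_channels, n_blocks):
    """Trace (channels, spatial size) after each contracting-path block:
    channels double from block to block, and the spatial dimension halves
    via the stride-2 convolution that ends each block."""
    shapes = []
    ch, size = base_channels, in_size
    for _ in range(n_blocks):
        size //= 2                  # stride-2 convolution halves the size
        shapes.append((ch, size))
        ch *= 2                     # channels double for the next block
    return shapes

ads = encoder_shapes(512, 16, 3)   # ADS: 3 blocks starting at 16 channels
pit = encoder_shapes(512, 32, 4)   # PIT: 4 blocks starting at 32 channels
```

The expanding path simply reverses these progressions, halving channels and doubling the spatial size back to 512 × 512.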

Ensemble Learning
Ensemble learning, also called committee-based learning and a multi-classifier system, uses two or more base models to enhance the classification performance and improve the robustness and quality of the individual models. First, a set of individual models are trained separately. Then, the classification results of the individual models are aggregated by one of the widely used ensemble strategies, such as voting, bagging, boosting, stacking, Bayes optimal classifier, and winner-take-all [21]. In the presented study, we aggregated the results via averaging the probability matrix as described in the following section, which we found to be sufficient to yield the anticipated improvement, as the two base models are distinct in their approach.

Ensemble Learning Pipeline
In the learning step, the training and validation images were fed into both the ADS and PIT models. Both trained models were then evaluated on the four intact testing images, and the two probability matrices produced after the Softmax function were averaged to obtain a single probability matrix. Finally, the argmax function was applied to obtain the prediction matrix, which was then compared with the testing ground truth (manually labeled images) using the evaluation metrics. A flowchart of the entire learning and testing process is provided in Figure 2.

Figure 2. Overview of the ensemble algorithm pipeline. During the learning step, the training set is fed into both models. The test images are then fed into both models and the probability matrices are collected following the Softmax function. The probability matrices are then averaged pixel-wise and the argmax function is used for prediction. Finally, the evaluation metrics are used to assess the performance of the ensemble model by comparing the ground truth test labels with the prediction matrices.
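The averaging-and-argmax fusion step reduces to a few lines of NumPy. A minimal sketch; the class ordering (0 = background, 1 = myelin, 2 = axon) and the toy probability values are assumptions for illustration:

```python
import numpy as np

def ensemble_predict(prob_ads, prob_pit):
    """Average two per-pixel class-probability maps of shape (H, W, 3)
    and take the argmax to obtain a single label map."""
    avg = (prob_ads + prob_pit) / 2.0      # pixel-wise probability average
    return np.argmax(avg, axis=-1)         # most probable class per pixel

# Toy 1 x 2 "image": the two models disagree on the second pixel.
p_ads = np.array([[[0.8, 0.1, 0.1], [0.2, 0.5, 0.3]]])
p_pit = np.array([[[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]])
labels = ensemble_predict(p_ads, p_pit)
```

For the disputed pixel, the averaged probabilities are (0.15, 0.35, 0.50), so the ensemble sides with PIT's axon call even though ADS preferred myelin.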

Morphological Analysis
The main morphological parameters of interest are the G-ratio, axon diameter, and myelin thickness [22]. The G-ratio is used to assess axonal myelination and reflects the efficiency with which neural pathways conduct neural signals [23]. It is estimated by dividing the axon diameter by the myelinated fiber diameter. The axon shapes were assumed to be circular; therefore, the axon and myelinated axon (axon + myelin) diameters were calculated by equating the measured areas of the original myelinated axon to the areas of two concentric circles, as depicted in Figure 3.
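Under the concentric-circles assumption, each axon's parameters follow directly from the two segmented areas. A minimal sketch; the function name and example areas are illustrative, areas are in pixels², and diameters can be converted to micrometers using the 0.15 μm/pixel resolution:

```python
import math

def morphometry(axon_area, myelin_area):
    """Derive axon diameter, myelin thickness, and G-ratio from pixel
    areas, assuming the axon and the myelinated fiber are two concentric
    circles whose areas equal the measured ones."""
    d_axon = 2.0 * math.sqrt(axon_area / math.pi)                    # inner circle
    d_fiber = 2.0 * math.sqrt((axon_area + myelin_area) / math.pi)   # outer circle
    thickness = (d_fiber - d_axon) / 2.0                             # one side of the sheath
    g_ratio = d_axon / d_fiber
    return d_axon, thickness, g_ratio

# Example: 100 px axon area, 156 px myelin area -> G-ratio sqrt(100/256) = 0.625
d, t, g = morphometry(100.0, 156.0)
```

Note that the G-ratio simplifies to sqrt(axon_area / (axon_area + myelin_area)), so it depends only on the area ratio, not on the pixel scale.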

Evaluation Metrics
To evaluate the performance of the trained models on the segmentation task, several evaluation metrics were used. The Dice coefficient and pixel-wise accuracy for axon, myelin, and background were used to indicate their segmentation quality. The Dice coefficient was calculated between the segmented image and the ground truth image for the testing image set. It is defined as follows:

Dice = 2(A ∩ B) / (A + B)

where A is the number of pixels of a given class that are present in the segmented image, B is the number of pixels of that class in the ground truth image, and (A ∩ B) is the number of pixels of the same type that are present in both images (correctly segmented pixels).
Pixel accuracy is a semantic segmentation metric that denotes the percentage of pixels that are accurately classified in the image. It is defined as follows:

Pixel accuracy = (number of correctly classified pixels) / (total number of pixels)

As the number of pixels per class is imbalanced, the total accuracy across the classes was calculated as a weighted pixel-wise accuracy, using weights of 1, 2.58, and 3.69 for the background, myelin, and axon, respectively.
Precision and sensitivity (also known as recall) for both the axon and myelin were calculated to assess the ability to detect true axon, myelin, and background regions while avoiding false positives [12]. They were calculated for each pixel type as follows:

Precision = TP / (TP + FP), Sensitivity = TP / (TP + FN)

where TP represents the count of pixels correctly classified as type A, FP represents the count of pixels of other types incorrectly classified as type A, and FN represents the count of pixels of type A incorrectly classified as other types. These two metrics were calculated for the three pixel types (i.e., "type A" denotes either axon, myelin, or background).
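These metrics can be computed directly from the predicted and ground-truth label maps. A minimal NumPy sketch, assuming the three classes are encoded as the integers 0 (background), 1 (myelin), and 2 (axon):

```python
import numpy as np

def dice(pred, truth, cls):
    """Dice = 2|A ∩ B| / (|A| + |B|) for one class label."""
    a, b = (pred == cls), (truth == cls)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def precision_recall(pred, truth, cls):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN) for one class label."""
    tp = np.logical_and(pred == cls, truth == cls).sum()
    fp = np.logical_and(pred == cls, truth != cls).sum()
    fn = np.logical_and(pred != cls, truth == cls).sum()
    return tp / (tp + fp), tp / (tp + fn)

# Toy 2 x 3 label maps: one myelin pixel is mislabeled as axon.
truth = np.array([[0, 1, 1], [2, 2, 0]])
pred  = np.array([[0, 1, 2], [2, 2, 0]])
d_myelin = dice(pred, truth, 1)                   # 2*1 / (1+2) = 2/3
p_axon, r_axon = precision_recall(pred, truth, 2) # 2/3 precision, 1.0 recall
```

The same functions applied per class, with the class weights of 1, 2.58, and 3.69, reproduce the weighted pixel-wise accuracy described above.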

Quantitative and Qualitative Analysis of Pixel-Wise Segmentation
The performance of the proposed model was quantitatively assessed by classifying each pixel of the analyzed image into one of three possible categories, namely axon, myelin, or background. Table 1 lists the evaluation metrics computed on the outputs of the ADS, PIT, and ensemble models. The weighted pixel-wise accuracy of the ensemble model was 95.5%, whereas the ADS and PIT models achieved weighted pixel-wise accuracies of 93.4% and 90%, respectively.
The axon Dice score of the ensemble model reached 94.3%, which was higher than that of ADS by 2%.
The confusion matrices for the three models are shown in Figure 4. The ensemble model had an axon recall of 0.94, which outperformed the two base models by 5% for ADS and 3% for PIT. Myelin recall was also slightly enhanced in the ensemble model at 0.95, compared with 0.94 for the ADS model. To express the performance of the developed model qualitatively, Figure 5 shows the segmentation results of three sample images against the original images and their ground truth labeled images. The quality of the segmentation results varied among the three models, which reflects the aforementioned variation in the accuracy values. Overall, ADS shows superior performance over PIT in classifying the transition borders between different cell types; hence, the PIT-classified images appear pixelated. Nevertheless, PIT appeared to yield a better outcome when eliminating artifacts (marked with red ellipses in the rightmost column of Figure 5) and yielded a more uniform marking of the axonal regions that are surrounded by myelin sheaths (marked with red rectangles in Figure 5). The ensemble learning model appeared to combine the individual proficiencies of each of the two base models and yielded a superior overall quality of the segmented images.

Quantitative Analysis of Automated Morphometry
Classifying the neural images into the three pixel classes does not have a direct utility per se in the histological assessment of the nerve tissue, for which axonal morphometry data are needed. These data encompass the distributions of the axonal diameters and the corresponding myelin thicknesses for the axonal population, along with the relative myelination index known as the G-ratio. Nevertheless, the automated measurements of the diameters and myelin thicknesses are impacted by the segmentation quality; hence, high-accuracy segmentation is needed. To assess the utility of the three segmentation models in obtaining these measurements, automated morphometry was applied to the images segmented by the three models, and the results were compared with the morphometric data resulting from the ground truth images. The resulting distributions of the axonal diameters, myelin thicknesses, and G-ratios are shown in Figure 6.
Statistical analysis was performed to test for any significant difference in the morphological parameters calculated from the ground truth, ensemble, ADS, and PIT models. The analysis was performed on a total of 628 axons contained in the testing images. The normality assumption was checked using the Kolmogorov-Smirnov test, as the number of axons was >300. The G-ratio values were found to be normally distributed; thus, a one-way ANOVA test was performed, followed by Student's t-tests. The axon diameter and myelin thickness data were found to be not normally distributed (p < 0.05); thus, the Friedman test was carried out, followed by pairwise Wilcoxon signed-rank tests. Statistical significance in this study was defined as p < 0.05.

Figure 6(a) shows the G-ratio distributions. All four distributions were found to be normally distributed. Both the ensemble and ADS models yielded G-ratio distributions that were statistically different from the ground truth G-ratios. Nevertheless, the p-value was higher for the ensemble G-ratios (p = 0.0002), indicating that the ensemble model's G-ratios were closer to the ground truth. Figure 6(b) shows the axon diameter distributions. All four distributions were found to be not normally distributed. Statistical analysis of the diameter distributions indicated no significant difference between the ensemble model and the ground truth, while the distributions yielded by the two base models individually were statistically different. Figure 6(c) shows the myelin thickness distributions. All four distributions were found to be not normally distributed, and further analysis indicated no statistical difference between the results from the ensemble model's segmented images and the ground truth images, while the base models yielded statistically different distributions. Table 2 lists the G-ratios calculated from the ground truth testing images and for the corresponding predicted images of the ADS, PIT, and ensemble models.
The average G-ratio for the four testing whole slides (1360 × 1024 pixels each) was 0.531 ± 0.029 when calculated from the ground truth images. The G-ratio calculated from the ensemble model was the closest to that of the ground truth images at 0.524 ± 0.019, a percentage difference of 1.4%. However, the average G-ratio calculated from the ensemble model was still statistically different from the ground truth results (p = 0.0002).

Table 2. Summary of the G-ratios calculated from the ground truth testing images and for the corresponding predicted images of the ADS, PIT, and ensemble models.

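The test-selection logic used above (a normality check, then a parametric or non-parametric paired comparison) can be sketched with SciPy. The synthetic measurements, function name, and seed below are assumptions for illustration, not the study's data:

```python
import numpy as np
from scipy import stats

def paired_comparison(gt, model, alpha=0.05):
    """Compare model-derived measurements with ground truth on the same
    axons: Kolmogorov-Smirnov normality check on the standardized paired
    differences, then a paired t-test if normal, otherwise a Wilcoxon
    signed-rank test. Returns (test_name, p_value)."""
    diffs = np.asarray(gt) - np.asarray(model)
    z = (diffs - diffs.mean()) / diffs.std(ddof=1)   # standardize for KS test
    _, p_norm = stats.kstest(z, "norm")
    if p_norm >= alpha:
        _, p = stats.ttest_rel(gt, model)            # parametric branch
        return "paired t-test", p
    _, p = stats.wilcoxon(gt, model)                 # non-parametric branch
    return "Wilcoxon signed-rank", p

rng = np.random.default_rng(0)
gt = rng.normal(0.53, 0.03, 300)          # synthetic G-ratio-like values
model = gt + rng.normal(0.0, 0.01, 300)   # a model tracking the ground truth
name, p = paired_comparison(gt, model)
```

With more than two groups, the parametric branch generalizes to one-way ANOVA and the non-parametric branch to the Friedman test, as described above.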

Discussion
The presented study evaluated the performance of an ensemble learning model that combines ADS and PIT in segmenting optical microscopy peripheral nerve images for automated morphometry applications. The weighted pixel-wise accuracy of the proposed model was 95.5%. The maximum pixel-wise accuracy found in the literature for axon and myelin segmentation on optical microscopy images was 82.3%, achieved using a deep convolutional encoder-decoder [23] on 32 × 32 pixel images of a mouse spinal cord.
A fully convolutional neural network was previously used for axon diameter distribution and resulted in a high correlation with the ground truth, but its utility was demonstrated on grayscale images of central neural structures using electron microscopy images [24].
Ensemble learning has been explored for in vitro axon segmentation, where a U-Net with a ResNet-50 encoder architecture was used [17]. Nevertheless, it was applied to axonal phase-contrast images to allow for 3D axonal tracing, not the analysis of high-resolution neural cross-sectional images.
The U-Net architecture has also been used alone for the segmentation task while incorporating a novel dropout scheme, where the Monte-Carlo dropout approach was used to assess the model uncertainty and to select the samples to be annotated for the next iteration [25]. It was able to improve Dice values by 10%, reaching up to 90%, with a relatively small training image count (<100). In our model, each base model was utilized to produce its own (weaker) predictions, which ensemble learning then fused into the final prediction. Hence, an incremental improvement in one of the base models is not guaranteed to lead to a corresponding overall improvement of the ensemble model.
Approaches that do not utilize machine learning were also explored, with accuracy values of single-axon detection up to 82%, and classification of the fiber myelination accuracy of 89% [26]. The segmentation process relied solely on the image processing steps and elliptical shape reconstruction for the input images under different levels of contrast and color saturation. Even though the approach is simple and computationally not expensive, it requires manual adjustment of the color levels by experienced operators, which hinders the full automation of the entire process.
The evaluation metrics calculated using the ensemble model showed improved performance over the two base models separately, except for axon precision (Table 1). The achieved accuracy of the developed ensemble model also surpassed the previously reported accuracy of the ADS base model when segmenting images collected from SEM and TEM [9]. This observation indicates the superiority of the ensemble model in detecting true positive axons, which can be attributed to the improved axon regions detected by the PIT model, especially for small axons (Figure 5), negating the need for the manual intervention that was often required [7,8,11,27].
The axon precision yielded by the ensemble model was lower than that obtained with the ADS model by 1%. The decreased axon precision in the ensemble model stems from the low axon precision of the PIT model (84.2%), as PIT predicts segments of the outer borders as axons instead of myelin or background. The ensemble learning approach addressed most of these mispredictions and reflected the superior performance of ADS at the outer borders.
The utility of the developed ensemble learning model in automated axonal morphometric analysis was demonstrated by the high consistency between the axon diameter and myelin thickness measurements based on the ensemble model's segmentation and the measurements based on the ground truth testing images. The G-ratios calculated from the ensemble model yielded a percentage difference of 1.4%, which was statistically different from the ground truth. The reason this difference was observed for the G-ratio but not for the axon diameters and myelin thicknesses can be attributed to the fact that the G-ratio calculation incorporates both the axon diameter and the myelin thickness; it can therefore be influenced by the two sources of variability cumulatively, resulting in the observed difference from the ground truth images. Nevertheless, the p-values of the three models' comparisons indicated a smaller degree of difference between the ensemble model and the ground truth than between the ground truth and either base model individually.

Conclusions
In this study, the accuracy of the CNN algorithm ADS in segmenting neural images was improved by combining it with a paired image-to-image translation algorithm using ensemble learning. The ensemble technique benefited from the superior specificity of PIT in removing artifacts and in the uniform regional classification within axons. As such, the ensemble learning model yielded improved pixel-wise segmentation accuracy. The utility of the ensemble technique in automating the morphometry analysis of neural images was demonstrated by measuring the axonal diameters, myelin thicknesses, and G-ratios. The ensemble model yielded morphometric parameters (i.e., axon diameter and myelin thickness) that were statistically similar to the parameters calculated from the ground truth images. The G-ratio measurements were statistically different from the ground truth measurements but were improved over those of the base models. In future work, universal automated segmentation software could be developed by increasing the number of models used in ensemble learning and by training the algorithm on larger, more varied datasets collected from various species using various preparation and staining protocols.