PixelBNN: Augmenting the PixelCNN with Batch Normalization and the Presentation of a Fast Architecture for Retinal Vessel Segmentation

Analysis of retinal fundus images is essential for eye-care physicians in the diagnosis, care and treatment of patients. Accurate fundus and/or retinal vessel maps give rise to longitudinal studies able to utilize multimedia image registration and disease/condition status measurements, as well as applications in surgery preparation and biometrics. The segmentation of retinal morphology has numerous applications in assessing ophthalmologic and cardiovascular disease pathologies. Computer-aided segmentation of the vasculature has proven to be a challenge, mainly due to inconsistencies such as noise and variations in hue and brightness that can greatly reduce the quality of fundus images. The goal of this work is to collate different key performance indicators (KPIs) and state-of-the-art methods applied to this task, frame computational efficiency–performance trade-offs under varying degrees of information loss using common datasets, and introduce PixelBNN, a highly efficient deep method for automating the segmentation of fundus morphologies. The model was trained, tested and cross-tested on the DRIVE, STARE and CHASE_DB1 retinal vessel segmentation datasets. Performance was evaluated using G-mean, Matthews Correlation Coefficient and F1-score, with the main success measure being computation speed. The network was 8.5× faster than the current state-of-the-art at test time and performed comparatively well, considering a 5× to 19× reduction in information from resizing images during preprocessing.


Introduction
The segmentation of retinal morphology has numerous applications in assessing ophthalmologic and cardiovascular disease pathologies, such as glaucoma and diabetes. 1 Diabetic retinopathy (DR) is one of the main causes of blindness globally, the severity of which can be rapidly assessed based on retinal vascular structure. 2 Glaucoma, another major cause of global blindness, can be diagnosed based on the properties of the optic nerve head (ONH). Analysis of the ONH typically retinal health conditions resulting in unnecessary complications. 11 Computer-aided detection (CAD) methods are being utilized to quantify the disease state of the retina; however, most traditional methods are unable to match the performance of clinicians. These systems underperform due to variations in image properties and quality, resulting from the use of varying capture devices and the experience of the user. 9 To properly build and train an algorithm for commercial settings would require extensive effort by clinicians in the labelling of each and every dataset, a feat that mitigates the value of CAD systems. Overcoming these challenges would give rise to longitudinal studies able to utilize multi-modal image registration and disease/condition status measurements, as well as make applications in surgery preparation and biometrics more viable. 9 The emergence of deep learning methods has enabled the development of CAD systems with an unprecedented ability to generalize across datasets, overcoming the shortcomings of traditional or "shallow" algorithms. Computational methods for image analysis are divided into supervised and unsupervised techniques. Prior to deep learning, supervised methods encompassed pattern recognition algorithms, such as k-nearest neighbours, decision trees and support vector machines (SVMs).
Examples of such methods in the segmentation of retinal vessels include 2D Gabor wavelet and Bayesian classifiers, 10 line operators and SVMs 3 and AdaBoost-based classifiers. 12 Supervised methods require training materials be prepared by an expert, traditionally limiting the application of shallow methods. Unsupervised techniques stimulate a response within the pixels of an image to determine class membership and do not require manual delineations. The majority of deep learning approaches fall into the supervised learning category, due to their dependence on ground truths during training. Often, unsupervised deep learning techniques refer to unsupervised pretraining for improving network parameter initialization as well as some generative and adversarial methods.
Deep learning overcomes shallow methods' inability to generalize across datasets through the random generation and selection of a series of increasingly dimensional feature abstractions from combinations of multiple non-linear transformations on a dataset. 13 Applications of these techniques for object recognition in images first appeared in 2006 during the MNIST digit image classification problem, of which convolutional neural networks (CNNs) currently hold the highest accuracy. 14 Like other deep neural networks (DNNs), CNNs are designed modularly with a series of layers selected to address different classification problems. A layer is comprised of an input, output, size (number of "neurons") and a varying number of parameters/hyper-parameters that govern its operation. The most common layers include convolutional layers, pooling/subsampling layers and fully connected layers.
In the case of retinal image analysis, deep algorithms utilize a binary system, learning to differentiate morphologies based on performance masks manually delineated from the images. The current limitation with most unsupervised methods is that they utilize a set of predefined linear kernels to convolve the images or templates that are sensitive to variations in image quality and fundus morphologies. 8 Deep learning approaches overcome these limitations, and have been shown to outperform shallow methods for screening and other tasks in diagnostic retinopathy. 15,16 A recent review chapter discusses many of these issues and related methodologies. 17 This paper presents PixelBNN, a novel variation of PixelCNN, 18 a dense fully convolutional network (FCN) that takes a fundus image as the input and returns a binary segmentation mask of the same dimension. The network was trained on resized images, deviating from other state-of-the-art methods, which use cropping. The network was able to evaluate test images in 0.0466 s, 8.5× faster than the state-of-the-art. Section 2 discusses the method and network architecture. Section 3 describes the experimental design. The resulting network performance is described in Section 4. Lastly, Section 5 discusses the results and future work, and then concludes the paper.

Methodology
Deep learning methods for retinal segmentation are typically based on techniques which have been successfully applied to image segmentation in other fields, and often utilize stochastic gradient descent (SGD) to optimize the network. 15 Recent work into stochastic gradient-based optimization has incorporated adaptive estimates of lower-order moments, resulting in the Adam optimization method, which is further described below. 19 Adam was first successfully applied to the problem of retinal vessel segmentation by the authors, laying the foundation for this work. 20 Herein, a fully-residual autoencoder batch normalization network ("PixelBNN") is trained via a random sampling strategy whereby samples are randomly distorted from a training set of fundus images and fed into the model. PixelBNN utilizes gated residual convolutional and deconvolutional layers activated by concatenated rectified linear units (CReLU), similar to PixelCNN 18, 21 and PixelCNN++. 22 PixelBNN differs from its predecessors in three areas: (1) varied convolutional filter streams, (2) gating strategy, and (3) the introduction of batch normalization layers, 23 from which it draws its name.
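As a point of reference for the Adam method mentioned above, the following is a minimal sketch of a single Adam update for one scalar parameter. The moment coefficients (`beta1`, `beta2`) and `eps` are the commonly cited defaults and are assumptions here, as the text does not state the values used; the default `lr` mirrors the initial learning rate reported in the training details. The function name `adam_step` is hypothetical.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count.

    beta1, beta2 and eps are assumed defaults, not values taken from the paper.
    """
    m = beta1 * m + (1.0 - beta1) * grad         # running estimate of the first moment
    v = beta2 * v + (1.0 - beta2) * grad * grad  # running estimate of the second moment
    m_hat = m / (1.0 - beta1 ** t)               # bias correction for the zero initialization
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# First update from theta = 1.0 on f(theta) = theta**2, whose gradient is 2 * theta.
theta_after, m_after, v_after = adam_step(theta=1.0, grad=2.0, m=0.0, v=0.0, t=1)
print(theta_after)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of the gradient's scale.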

DRIVE
The CNN was trained and tested against the Digital Retinal Images for Vessel Extraction (DRIVE) database 1, a standardized set of fundus images used to gauge the effectiveness of classification algorithms. 24 The images are 8 bits per RGBA channel with a 565×584 pixel resolution. The data set comprises 20 training images with manually delineated label masks and 20 test images with two sets of manually delineated label masks by the first and second human observers, as shown in

STARE
The Structured Analysis of the Retina database 2 has 400 retinal images which were acquired using a TopCon TRV-50 retinal camera with a 35° field of view and a pixel resolution of 700×605. The database was populated and funded through the US National Institutes of Health. 1 A subset of the data is labelled by two experts, thereby providing 20 images with labels and ground truths.
To compensate for the small number of images, four-fold cross validation was used. Therein, the network was trained over four runs, leaving five images out each time, resulting in all 20 images being evaluated without overlapping the training set, thus minimizing network bias.

CHASE DB1
The third dataset used in this study is a subset of the Child Heart and Health Study in England

Preprocessing
The most common and effective method for correcting inconsistencies within an image dataset is to compare the histogram of an obtained image to an ideal histogram describing the brightness, contrast and signal/noise ratio, and/or to determine image clarity by assessing morphological features. 25 Fundus images typically contain between 500×500 and 2000×2000 pixels, making training a classifier a memory- and time-consuming ordeal. Rather than processing the entire image, the images are randomly cropped and resized to 256×256 pixels, flipped, rotated and/or enhanced to extend the dataset.
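The flip and rotation steps of this augmentation can be sketched on a small nested-list image as follows. This is an illustrative sketch only: the 50% flip probability and the helper names are assumptions, the rotation choices follow the zero, 90° or 180° options given in the training details, and the random crop, resize and enhancement steps are omitted.

```python
import random

def flip_horizontal(img):
    """Mirror each row (left-right flip)."""
    return [row[::-1] for row in img]

def flip_vertical(img):
    """Reverse the row order (up-down flip)."""
    return [list(row) for row in img[::-1]]

def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def augment(img, rng):
    """Randomly flip, then rotate by 0, 90 or 180 degrees (assumed probabilities)."""
    if rng.random() < 0.5:
        img = flip_horizontal(img)
    if rng.random() < 0.5:
        img = flip_vertical(img)
    for _ in range(rng.choice([0, 1, 2])):  # number of quarter turns
        img = rotate90(img)
    return img

patch = [[1, 2], [3, 4]]
augmented = augment(patch, random.Random(0))
print(augmented)
```

The same transform would have to be applied to the label mask so that the image and its ground truth stay aligned.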

Continuous Pixel Space
It has been shown that a continuous domain representation of pixel colour channels vastly improves memory efficiency during training. 26 This is primarily due to dimensionality reduction from initial channel values to a distribution over [-0.5, 0.5]; features are learned with densely packed gradients rather than needing to keep track of very sparse values associated with typical channel values. 22
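A minimal sketch of such a mapping is given below, assuming 8-bit channel values and a simple linear rescaling; the exact transform used is not specified in the text, and the function names are hypothetical.

```python
def to_continuous(channel_value, max_value=255):
    """Linearly map an integer channel value in [0, max_value] to [-0.5, 0.5]."""
    return channel_value / max_value - 0.5

def to_discrete(x, max_value=255):
    """Inverse mapping back to the nearest integer channel value."""
    return int(round((x + 0.5) * max_value))

print(to_continuous(0), to_continuous(255))  # the two ends of the range
```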

Image enhancement
Local histogram enhancement methods greatly improve image quality and contrast, improving network performance during training and evaluation. Rather than sampling all pixels within an image once, histograms are generated for subsections of the image, each of which is normalized.
One limitation of local methods is the risk of enhancing noise within the image. Contrast limited adaptive histogram equalization (CLAHE) is one method that overcomes this limitation. CLAHE limits the maximum pixel intensity peaks within a histogram, redistributing the values across all intensities prior to histogram equalization. 27 This is the contrast enhancement method used herein.
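The clipping and redistribution step described above can be sketched for the histogram of a single image subsection. This is a simplified, single-pass illustration with integer bin counts and a hypothetical function name; the tiling, the subsequent equalization and the interpolation between subsections are omitted, and the clip limit shown is arbitrary.

```python
def clip_histogram(hist, clip_limit):
    """Clip each bin at clip_limit and spread the excess evenly over all bins.

    Single-pass version: bins may end up slightly above clip_limit after the
    excess is redistributed, which iterative variants would correct.
    """
    excess = sum(max(count - clip_limit, 0) for count in hist)
    clipped = [min(count, clip_limit) for count in hist]
    share, remainder = divmod(excess, len(hist))
    # Every bin receives `share`; the first `remainder` bins receive one extra count.
    return [c + share + (1 if i < remainder else 0) for i, c in enumerate(clipped)]

tile_hist = [0, 10, 2, 0]  # toy 4-level histogram with one dominant peak
limited = clip_histogram(tile_hist, clip_limit=4)
print(limited)  # [2, 6, 3, 1]
```

The total pixel count is preserved, while the dominant peak is flattened before equalization, which limits the amplification of noise in near-uniform regions.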

Network Architecture
PixelBNN is a fully-residual autoencoder with gated residual streams, each initialized by differing convolutional filters. It is based on UNET, 28 PixelCNN 21 as well as various works on the use of skip connections and batch normalization within fully convolutional networks. [29][30][31][32] It differs from prior work in the layer architecture, use of gated filter streams and regularization by batch normalization jointly with dropout during training. While nuanced, the network further differentiates from many state-of-the-art architectures in its use of Adam optimization, layer activation by CReLU and use of downsampling in place of other multi-resolution strategies. The network makes extensive use of CReLU to reduce feature redundancy and negative information loss that would otherwise be incurred with the use of rectified linear units (ReLU). CReLU models have been shown to consistently outperform ReLU models of equivalent size while reducing the number of parameters by half, leading to significant gains in performance. 33 The architecture was influenced by the human vision system:
• The use of two parallel input streams resembles bipolar cells in the retina, each stream possessing different yet potentially overlapping feature spaces initialized by different convolutional kernels.
• The layer structure is based on that of the lateral geniculate nucleus, visual cortices (V1, V2) and medial temporal gyrus, whereby each is represented by an encoder-decoder pair of gated resnet blocks.
• Final classification is executed by a convolutional layer which concatenates the outputs of the last gated resnet block, as the inferotemporal cortex is believed to do.
More detail on this subject is covered in prior work by the authors. 17
Fig 2: Processed image patches are passed through two convolution layers with different filters to create parallel input streams for the encoder. Downsampling occurs between each ResNet block in the encoder and upsampling in the decoder. The output is a vessel mask of equal size to the input.
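The CReLU activation used throughout the network can be illustrated on a plain list of feature values. This is a minimal sketch of the activation itself, with hypothetical function names, and not of the gated layers in which it is used: the positive and negative parts of each feature are rectified separately and concatenated, so negative responses are kept rather than zeroed out.

```python
def relu(x):
    """Rectified linear unit: pass positive values, zero out the rest."""
    return max(x, 0.0)

def crelu(features):
    """Concatenate ReLU(x) and ReLU(-x), doubling the number of output channels."""
    return [relu(x) for x in features] + [relu(-x) for x in features]

activated = crelu([1.5, -2.0, 0.5])
print(activated)  # [1.5, 0.0, 0.5, 0.0, 2.0, 0.0]
```

A plain ReLU would map the same input to `[1.5, 0.0, 0.5]`, discarding the magnitude of the negative response.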

Downsampling without Information Loss
A popular method for facilitating multi-resolution generalizability with fully convolutional networks is the use of dilated convolutions within the model. 21,34 Dilated convolutions can be computationally expensive, as they continuously increase in size through the utilization of zero padding to prevent information loss. Downsampling is another family of methods that sample features during strided convolution at one or more intermediate stages of an FCN, later fusing the samples during upsampling 29 and/or multi-level classifiers. 31 Such methods take advantage of striding to achieve similar processing improvements as dilated convolutions with increased computational efficiency, albeit with a loss in information. Variations in downsampling methods aim to compensate for this loss of information. Figure 2 illustrates the architecture of the proposed method. PixelBNN utilizes downsampling with a stride of 2, as well as long and short skip connections, resembling PixelCNN++. 22 Implementing both long and short skip connections has been shown to prevent information loss and increase convergence speed, 30 while mitigating losses in performance. 35 The method differs from (NIN) layer, which is a 1x1 convolutional layer like those found in Inception models. 35
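The effect of a stride of 2 and of a skip connection can be sketched in one dimension. The example below is illustrative only: the smoothing kernel, the zero "same" padding, the nearest-neighbour upsampling and the fusion by addition are all assumptions made for brevity, not details taken from the PixelBNN architecture.

```python
def conv1d(signal, kernel, stride=1):
    """Zero-padded ('same') 1D convolution of an odd-length kernel with a given stride."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    out = []
    for start in range(0, len(signal), stride):
        window = padded[start:start + len(kernel)]
        out.append(sum(w * k for w, k in zip(window, kernel)))
    return out

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
down = conv1d(signal, [0.25, 0.5, 0.25], stride=2)  # half the resolution
up = [value for value in down for _ in range(2)]    # nearest-neighbour upsampling
fused = [u + s for u, s in zip(up, signal)]         # long skip connection (addition)

print(down)  # [1.0, 3.0, 5.0, 7.0]
```

The upsampled signal alone has lost the even-indexed detail; adding the skip connection reintroduces the full-resolution input alongside the coarse features.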

Platform
Training and testing of the proposed method was done using a computer with an Intel(R) Core(TM) i7-5820K CPU clocked at 3.30 GHz, 32 GB of RAM and a GM200 GeForce GTX TITAN X graphics card with 3072 CUDA cores. On this platform, it took roughly 14 hours to train the network. At test time, the network processed a single image in 0.0466 seconds using the same system. In this study, TensorFlow 36 and other Python scientific, imaging and graphing libraries were used to evaluate the results.

Experimental Design
This paper presents PixelBNN, a novel network architecture for multi-resolution image segmentation and feature extraction based on PixelCNN. This is the first time this family of dense fully connected convolutional networks has been applied to fundus images. The specific task of retinal vessel segmentation was chosen due to the availability of different datasets that together provide ample variances for cross-validation, training efficiency, model performance and robustness.
Architectural elements of the network have been thoroughly evaluated in the literature, as mentioned in Section 2. An ablation study is beyond the scope of this paper and left for future work. Following the completion of the follow-on study, the code will be made available here: https://github.com/henryleopold/pixelbnn

Performance Indicators
Model performance is evaluated using a set of key performance indicators (KPIs), which are calculated by comparing the network output against the first set of manual delineations as the ground truth on a per-pixel basis. The test dataset has a second set of manual delineations which are used to benchmark the results against a second human observer (the '2nd observer'). There are four potential classification outcomes for each pixel: true positive (TP), false positive (FP), true negative (TN) and false negative (FN). These outcomes are then used to derive KPIs, such as sensitivity (SN; also known as recall), specificity (SP), accuracy (Acc) and the receiver operating characteristic (ROC), which can be a function of SN and SP, true positive rate (TPR) and false positive rate (FPR) or other similar KPI pairs. SN and SP are two of the most important KPIs to consider when developing a classification system as they are both representations of the "truth condition" and are thereby a far better performance measure than Acc. In an ideal system, both SN and SP will be 100%, however this is rarely the case in real life. The area under a ROC curve (AUC) as well as Cohen's kappa coefficient (κ) are two common approaches for measuring network performance.
κ is measured using the probability (n_ki) of an observer (i) predicting a category (k) for a number of items (N) and provides a measure of agreement between observers; in this case, the network's prediction and the ground truth. 37 The Matthews Correlation Coefficient (MCC), the F1-score (F1) and the G-mean (G) performance metrics were used to better assess the resulting fundus label masks. These particular metrics are well suited for cases with imbalanced class ratios, as with the abundance of non-vessel pixels compared to the low number of vessel pixels in this binary segmentation task. MCC has been used to assess vessel segmentation performance in several cases, and its value ranges from -1 to +1, respectively indicating total disagreement or alignment between the ground truth and prediction. 38 G-mean calculates the geometric mean of SN and SP. 39 The KPIs are defined in Table 1.
Table 1 (partial): Acc, the frequency a pixel is properly classified; MCC, a measure from -1 to 1 for agreement between manual and predicted binary segmentations; F1, the harmonic mean of precision (Pr) and recall, $F1 = \frac{2TP}{2TP + FP + FN} = \frac{2 \cdot Pr \cdot SN}{Pr + SN}$.
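The KPIs discussed above can be computed directly from the four per-pixel outcome counts. The following sketch uses the standard definitions of each measure; the function name and the example counts are hypothetical, and degenerate cases with a zero denominator are not handled.

```python
import math

def kpis(tp, fp, tn, fn):
    """Compute common binary segmentation KPIs from per-pixel outcome counts."""
    sn = tp / (tp + fn)                    # sensitivity (recall)
    sp = tn / (tn + fp)                    # specificity
    pr = tp / (tp + fp)                    # precision
    acc = (tp + tn) / (tp + fp + tn + fn)  # accuracy
    f1 = 2 * tp / (2 * tp + fp + fn)       # harmonic mean of precision and recall
    g = math.sqrt(sn * sp)                 # geometric mean of SN and SP
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"SN": sn, "SP": sp, "Pr": pr, "Acc": acc, "F1": f1, "G": g, "MCC": mcc}

# Hypothetical counts for an image with 100 vessel and 900 non-vessel pixels.
scores = kpis(tp=80, fp=20, tn=880, fn=20)
print(scores)
```

With these counts Acc is 0.96 while F1 is 0.8 and MCC is roughly 0.78, which illustrates why accuracy alone overstates performance when non-vessel pixels dominate.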

Training Details
For each dataset, the network parameters were randomly reinitialized using the Xavier algorithm. 40 Table 2 summarizes the three data sets as well as the test-train data distribution and approximate information loss incurred during preprocessing. Pre-training was never conducted and so the network was trained from scratch for each dataset; in the case of STARE and CHASE DB1, one set of parameters was trained from scratch for each fold. Images were reduced in size to alleviate the computational burden of the training task rather than using the original image to train the network.
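As a rough check on the scale of this reduction, the ratio of original to resized pixel counts can be computed for the two datasets whose resolutions are stated above. This sketch assumes that information loss is measured simply as the pixel-count ratio against the 256×256 training size; whether Table 2 uses exactly this definition is an assumption, and the function name is hypothetical.

```python
def reduction_factor(width, height, target=256):
    """Ratio of original pixel count to the pixel count after resizing to target x target."""
    return (width * height) / (target * target)

drive_loss = reduction_factor(565, 584)  # DRIVE resolution
stare_loss = reduction_factor(700, 605)  # STARE resolution
print(round(drive_loss, 2), round(stare_loss, 2))  # 5.03 6.46
```

The DRIVE figure is consistent with the lower end of the 5× to 19× range quoted in the abstract.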
To ensure each dataset was evaluated equivalently, image size was first normalized to 256×256. The images were randomly cropped to between 216 and 256 pixels along each axis and resized to 256×256. They were then randomly flipped both horizontally and vertically before being rotated by 0°, 90° or 180°. The brightness and contrast of each patch were randomly shifted to further increase network robustness. PixelBNN learns to generate vessel label masks from fundus images in batches of 3 for 100,000 iterations utilizing Adam optimization with an initial learning rate of 1e-5 and decay rate of 0.94 every 20,000 iterations. Batch normalization was conducted with an initial of 1e-5 and decay rate of 0.9 before the application of dropout regularization 41 with a keep probability of 0.6. It required approximately 11 hours to complete training for DRIVE and the same for each fold during cross validation.
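The learning rate schedule described above can be sketched as an exponential decay. Whether the decay was applied in discrete steps (staircase) or continuously is not stated, so the staircase default below is an assumption, as is the function name.

```python
def learning_rate(step, initial=1e-5, decay_rate=0.94, decay_steps=20_000, staircase=True):
    """Exponentially decayed learning rate at a given training iteration."""
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial * decay_rate ** exponent

for step in (0, 20_000, 100_000):
    print(step, learning_rate(step))
```

Under this assumption the rate falls from 1e-5 to roughly 7.3e-6 by the final iteration, so the schedule is a gentle refinement rather than an aggressive annealing.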

Results
The output of PixelBNN is a binary label mask, predicting vessel and non-vessel pixels and thereby segmenting the original image. Each dataset contains two experts' manual delineations; the first was used as the ground truth for training the model and the second was used for evaluating the network's performance against a secondary human observer. Independently, each dataset was used to train a separate model from scratch, resulting in three sets of model parameters.

Performance Comparison
The results were compared with those of other state-of-the-art methods for vessel segmentation with published results for at least one of the DRIVE, STARE or CHASE DB1 datasets. The results for the model trained and tested on DRIVE are shown in Table 3, STARE results are shown in Table 4 and CHASE DB1 results are in Table 5. Cross-testing was conducted using each of these sets to measure the performance of the network against each of the other datasets' test images. The results from cross-testing are summarized in  Figure 3, Figure 4 and Figure 5 show the best and worst scoring   Table 2).

Computation time
Computation time is a difficult metric to benchmark due to variances in test system components and performance. In an attempt to evaluate this aspect, recent works that share the same GPU (the NVIDIA Titan X) were compared. This is a reasonable comparison as the vast majority of computations are performed on the GPU when training DNNs. Table 7 shows the comparable methods
This incurs a loss of information as many pixels and details are discarded in the process, proportionately reducing the feature space by which the model can learn this task. The decision to use this strategy was primarily driven by computational efficiency, as the methods are intended for use in real time within CAD systems. The cross-testing demonstrates the model's ability to learn generalizable features from each dataset, making it a viable architecture for automated delineation of morphological features within CAD systems. The drop in model performance compared to the state-of-the-art is believed to be caused by the loss of information incurred during preprocessing and will be investigated in future work that also delves into an ablation study.

Conclusion
This paper proposed a method for segmenting retinal vessels using PixelBNN, a dense multi-stream FCN that uses Adam optimization, batch normalization during downsampling and dropout regularization to generate a vessel segmentation mask by converting the feature space of retinal fundus images. F1-score, G-mean and MCC were used to measure network performance, rather than Acc, AUC and κ. This novel architecture performed well even after a severe loss of information, outperforming state-of-the-art methods during cross-testing. This reduction in information also allowed the system to perform 8.5× faster than the current state-of-the-art at test time, making it a viable candidate for application in real-world CAD systems.