Novel Application of Long Short-Term Memory Network for 3D to 2D Retinal Vessel Segmentation in Adaptive Optics-Optical Coherence Tomography Volumes

Abstract
Adaptive optics-optical coherence tomography (AO-OCT) is a non-invasive technique for imaging retinal vascular and structural features at cellular-level resolution. Whereas retinal blood vessel density is an important biomarker for ocular diseases, particularly glaucoma, automated blood vessel segmentation tools in AO-OCT have not yet been explored. One reason for this is that AO-OCT allows for variable input axial dimensions, which are not well accommodated by 2D-2D or 3D-3D segmentation tools. We propose a novel bidirectional long short-term memory (LSTM)-based network for 3D-2D segmentation of blood vessels within AO-OCT volumes. This technique incorporates inter-slice connectivity and allows for variable input slice numbers. We compare this proposed model to a standard 2D UNet segmentation network considering only volume projections. Furthermore, we expanded the proposed LSTM-based network with an additional UNet to evaluate how it refines network performance. We trained, validated, and tested these architectures on 177 AO-OCT volumes collected from 18 control and glaucoma subjects. The LSTM-UNet showed statistically significant improvements (p < 0.05) in AUC (0.88) and recall (0.80) compared to the UNet alone (0.83 and 0.70, respectively). The LSTM-based approaches had longer evaluation times than the UNet alone. This study shows that a bidirectional convolutional LSTM module improves standard automated vessel segmentation in AO-OCT volumes, although at a higher time cost.


Introduction
Adaptive optics optical coherence tomography (AO-OCT) imaging is a non-invasive technique that provides improved lateral resolution compared to traditional optical coherence tomography (OCT) by using adaptive optics (AO) to correct for ocular aberrations. With AO-OCT, it is now possible to obtain three-dimensional (3D), cellular-level resolution of the retina and optic nerve head to study ocular physiology and diseases [1][2][3][4][5][6][7][8]. As an example of cellular resolution capabilities, AO-OCT-based methods can reliably quantify retinal ganglion cell (RGC) soma morphology [8][9][10][11] and distinguish individual retinal vessels [12][13][14]. OCT and AO-OCT collect and register cross-sectional scans of the retina in sequence that are combined to form 3D volumes. These volumes can be probed to observe vasculature changes in their natural depth-layered plexuses in the en face plane. The anatomic relationship of these plexuses has been previously well-characterized [15]. In the parafoveal and perifoveal macula, retinal vessels separate into three distinct vascular plexuses: the superficial vascular plexus (SVP), intermediate capillary plexus (ICP), and deep capillary plexus (DCP) [15]. The SVP primarily nourishes retinal ganglion cells in the ganglion cell layer (GCL). The SVP may also be connected to the radial peripapillary capillary complex (RPCP), which supplies the retinal nerve fiber layer (RNFL) in the peripapillary region. The ICP lies deep to the SVP and supplies the dendritic synapses in the GCL as well as the cells in the inner nuclear layer (INL). Finally, the DCP is at the base of the INL and mainly supplies the bipolar and horizontal cells and their connections to the photoreceptor outer nuclear layer [16].
Glaucoma is an ocular disease that significantly affects the inner retina, specifically the RGC somas and the vasculature that supplies them [17]. It is a leading cause of irreversible blindness with a projected global disease prevalence of greater than 111.8 million by 2040 [18]. Reduction of intraocular pressure (IOP) is the only current treatment for the disease, but glaucoma can worsen even with adequate IOP control, and up to one-third of patients develop glaucoma with an IOP in the normal range [19]. This indicates the need to understand non-IOP-related factors that contribute to the disease, including retinal vascular dysfunction [17,20]. Previous studies have shown that glaucoma, and specifically thinning of the GCL, is associated with lower retinal vascular density in OCT images [21]. However, GCL thickness is an approximate surrogate for RGC density. AO-OCT has provided the capability to simultaneously measure RGC and vessel characteristics. We have previously leveraged AO-OCT to determine the association between RGC density and vessel density using a laborious and semi-automated process [22].
To further explore the relationship between RGC damage and vascular dysfunction characterized by vessel drop-out, automated quantification methods to extract these metrics from AO-OCT volumes are needed. A weakly-supervised segmentation method using a deep learning algorithm has been investigated to automatically quantify individual ganglion cell layer soma in AO-OCT volumes [9]. However, automated vessel segmentation in AO-OCT volumes, which resolve vessels down to the capillary scale, has not yet been explored. Automated retinal vessel segmentation tools for this modality will be increasingly useful and relevant for AO-OCT clinical translation.
Retinal blood vessel segmentation using deep learning is an area of active research in other retinal imaging modalities, such as traditional OCT and OCT angiography (OCTA) [23,24]. While OCTA collects multiple axial cross-sections, commonly referred to as B-scans, which can be used to form a 3D volume, typically this volume is projected back onto an en face 2D plane for vessel segmentation and interpretation. 2D-2D segmentation loses inter-slice connectivity between the en face plane slices of the 3D volume that can potentially be valuable context for automated segmentation. However, training end-to-end 3D-3D segmentation models, which receive the 3D volume and output 3D vessel labels, is also challenging. 3D-3D convolutional techniques require uniform input sizes [25,26], which may be an obstacle given the variability in retinal layer thickness across patients [27,28]. Furthermore, acquisition of 3D labeled data for deep learning algorithm training is costly, requiring up to 50 h per scan to label, even for an expert grader [29]. Such models are also computationally expensive for high-resolution data and can potentially be more difficult for providers and researchers to interpret if they are accustomed to a 2D en face view of vascular networks. 3D-2D segmentation, which receives the entire 3D volume as input and generates a 2D label, is a possible tool to leverage inter-slice connectivity while using relatively low-cost labels from 2D segmentation maps. Recurrent neural networks (RNNs) are deep learning architectures originally designed to model sequential information with dependence on previous states, such as language processing or time-series forecasting. These tools notably do not require a uniform input size, which offers a unique advantage for the segmentation of AO-OCT vessel images with variable input thickness.
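The variable-length property of recurrent networks can be illustrated with a minimal sketch: a single PyTorch LSTM layer processes sequences of any length with the same weights, whereas a convolution over a fixed depth dimension could not. The layer sizes here are arbitrary illustrations, not the dimensions used in this study.

```python
import torch
import torch.nn as nn

# One LSTM layer: 8 features per step, 16 hidden units.
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# Two "volumes" with different numbers of slices (30 vs. 45 steps)
# pass through the same layer without resizing or padding.
out_short, _ = lstm(torch.randn(1, 30, 8))  # -> (1, 30, 16)
out_long, _ = lstm(torch.randn(1, 45, 8))   # -> (1, 45, 16)

print(out_short.shape, out_long.shape)
```

The output sequence length simply tracks the input length, which is the property exploited later for volumes with varying slice counts.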
The primary purpose of this study is to investigate the performance of a novel RNN, specifically a bidirectional convolutional long short-term memory (LSTM) network, for 3D-2D vessel segmentation of AO-OCT volumes compared to traditional 2D-2D segmentation.

In-Vivo Adaptive Optics Imaging
This study used AO-OCT data collected from a study of ganglion cell layer soma quantification in 18 glaucoma and control subjects (six glaucoma and 12 control subjects). The data from both glaucoma and control subjects were combined for this study and no analysis differentiated the disease state. The full details of subject clinical assessment and AO imaging are found in previously published papers [8,22]. Pupillary dilation and cycloplegia were achieved with 1% tropicamide in subjects, who were subsequently imaged using the FDA multimodal adaptive optics (mAO) device previously described [8,30]. We examined 1.5° × 1.5° regions located symmetrically 2.5° superior and inferior about the horizontal midline at eccentricities 3°, 6°, and 12° in the temporal retina (Figure 1A). The AO focus was set approximately to the ganglion cell layer and 300 AO-OCT volumes were collected, registered, and averaged at each location. In the AO-OCT volumes, the SVP, ICP, and DCP were segmented separately by creating en face average intensity projections across the axial pixels in which each plexus resides (Figure 2) [15]. After undergoing ImageJ automatic contrast enhancement, all vessels in each en face projection were then manually labeled by a single expert grader (co-author R.V.) using a uniform brush size for each capillary segment. For each capillary branch, the brush size was readjusted depending on the grader's visual estimate of that segment's vessel width. These tracings were binarized in ImageJ, reviewed for quality by the principal investigator, and used as the ground-truth standard for our models (Figure 2) [31].
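The en face average intensity projection described above can be sketched in a few lines of NumPy. The function name and the slab bounds are hypothetical stand-ins for the per-plexus axial limits determined during layer segmentation.

```python
import numpy as np

def en_face_projection(volume, z_start, z_stop):
    """Average-intensity en face projection over the axial slab
    [z_start, z_stop) in which a vascular plexus resides.

    volume: 3D array shaped (depth, height, width). The slab bounds
    are assumed inputs; the study derived them from layer anatomy.
    """
    slab = volume[z_start:z_stop]  # select the slices containing the plexus
    return slab.mean(axis=0)       # collapse depth into a 2D en face image

# Example: a synthetic 40-slice volume, projecting slices 10-20.
vol = np.random.rand(40, 64, 64)
proj = en_face_projection(vol, 10, 20)
print(proj.shape)  # (64, 64)
```

Each plexus (SVP, ICP, DCP) would get its own slab bounds, yielding one 2D projection per plexus for labeling.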

Nested Model Architectures
Three models, referred to as UNet, LSTM-UNet, and UNet-LSTM-UNet, were designed and examined. To sequentially evaluate the conferred benefit from each architecture's design, the models were nested such that LSTM-UNet incorporated the UNet architecture, and UNet-LSTM-UNet incorporated the LSTM-UNet architecture.
The UNet architecture was selected as a base architecture as is commonly done for medical image segmentation, including for previous OCT and retinal vascular imaging applications [23,24,32]. Our standard UNet received a 2D image as input and output a 2D image following a process comprising an encoder, skip connections, and a decoder. The encoder used convolutional layers to extract features at variable input resolutions. Max-pooling during the encoding process shrank input resolution, further allowing identification of segmentation features at different scales. At each resolution, skip connections were used to concatenate the convolution layer output to a corresponding resolution in the decoding pathway. The decoder used these convolution outputs along with the up-sampled images as inputs for deconvolution layers to generate the resulting segmentation output following sigmoid activation. The resulting output had two channels, one for vessel activation and one for background activation. Our base model had a network depth of 3, utilizing skip connections at each layer (Figure 1A). We also performed batch normalization and transformation with a leaky rectified linear unit (slope = 0.01) at each convolutional layer.
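The depth-3 UNet described above can be sketched as follows. This is a minimal illustration, not the published implementation: the channel widths (16/32/64) and the use of a double-convolution block are assumptions, since the paper does not list them, but the depth, skip connections, batch normalization, leaky ReLU slope, and two-channel sigmoid output follow the text.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions, each followed by batch normalization
    # and a leaky ReLU with slope 0.01, per the text.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.01),
        nn.Conv2d(out_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.01),
    )

class UNet2D(nn.Module):
    """Depth-3 2D UNet sketch with skip connections at each level."""
    def __init__(self):
        super().__init__()
        self.enc1, self.enc2, self.enc3 = conv_block(1, 16), conv_block(16, 32), conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(64, 32, 2, stride=2)   # deconvolution
        self.dec2 = conv_block(64, 32)
        self.up1 = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec1 = conv_block(32, 16)
        self.head = nn.Conv2d(16, 2, 1)  # vessel + background channels

    def forward(self, x):
        e1 = self.enc1(x)                  # full resolution
        e2 = self.enc2(self.pool(e1))      # 1/2 resolution
        e3 = self.enc3(self.pool(e2))      # 1/4 resolution (bottleneck)
        d2 = self.dec2(torch.cat([self.up2(e3), e2], dim=1))  # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return torch.sigmoid(self.head(d1))

x = torch.randn(1, 1, 64, 64)  # one 64 x 64 grayscale patch
y = UNet2D()(x)
print(y.shape)  # torch.Size([1, 2, 64, 64])
```

The 64 × 64 input matches the training patch size used later in the paper.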
The LSTM-UNet is our proposed 3D-2D deep learning framework that uses bidirectional convolutional LSTM networks to incorporate the 3D context (i.e., inter-slice connectivity) from 3D input slices with unfixed depths to output 2D segmentation maps. LSTM networks are a form of RNN that extend the base RNN's ability to model sequential data by updating a hidden state representation with input-to-state and state-to-state operations, namely an input gate, output gate, forget gate, and cell state, which can better account for long-term sequential data [33]. Importantly, an LSTM model allows for varying numbers of input slices, which is necessary for our task, as the volumes containing each vessel plexus can vary in size when considering retinal layer thickness variability across patient populations [27,28]. While the standard LSTM classically uses fully connected layers for language processing applications, for our purposes, we employed a convolutional LSTM, which uses convolutional structures in the input-to-state and state-to-state transitions to more efficiently handle sequential spatial data (Figure 1B) [34]. Furthermore, to incorporate the inter-slice context from both the previous and following slices, we presented slices to two distinct LSTM units in forward (top-to-bottom) and reverse (bottom-to-top) order and concatenated the outputs to generate a bidirectional convolutional LSTM (Figure 1C) [35]. In our LSTM-UNet architecture, we used bidirectional LSTM units at three input resolutions during image encoding (Figure 3). The encoded outputs were concatenated with up-sampled images during decoding, eventually resulting in a single-channel 2D output that could be fed into a UNet unit. The UNet-LSTM-UNet appends another UNet as an additional image pre-processing step before each slice is fed into the LSTM-UNet.
This design, inspired by cascading architectures for brain tumor segmentation [36], similarly allows for variable slice number volumes, 3D-2D segmentation, and provides increased numbers of trainable convolutions and parameters earlier within the architecture, potentially allowing for improved labeling of vessels within each slice.
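The bidirectional convolutional LSTM at the core of this design can be sketched as below. This is a simplified single-cell illustration under stated assumptions (3 × 3 gate convolutions, zero initial states, concatenation of only the final hidden states); the published architecture stacks such units at three encoder resolutions.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell: the four gates are computed with
    convolutions rather than fully connected layers, so hidden and cell
    states remain spatial feature maps."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.hid_ch = hid_ch
        # One convolution jointly produces input, forget, output, and cell gates.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, 3, padding=1)

    def forward(self, x, h, c):
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

def bidirectional_convlstm(slices, fwd_cell, bwd_cell):
    """Run a volume (list of B x C x H x W slices) through forward
    (top-to-bottom) and reverse (bottom-to-top) cells and concatenate
    the resulting hidden states, as described in the text."""
    def run(cell, seq):
        b, _, hgt, wid = seq[0].shape
        h = torch.zeros(b, cell.hid_ch, hgt, wid)
        c = torch.zeros_like(h)
        for x in seq:
            h, c = cell(x, h, c)
        return h
    return torch.cat([run(fwd_cell, slices), run(bwd_cell, slices[::-1])], dim=1)

# Volumes with any slice count pass through the same cells.
fwd, bwd = ConvLSTMCell(1, 8), ConvLSTMCell(1, 8)
vol = [torch.randn(1, 1, 64, 64) for _ in range(30)]  # a 30-slice volume
out = bidirectional_convlstm(vol, fwd, bwd)
print(out.shape)  # torch.Size([1, 16, 64, 64])
```

Because the loop runs over a Python list of slices, the same weights handle 20-slice and 40-slice volumes alike, which is the property motivating the 3D-2D design.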

Model Training and Performance Evaluation
AO-OCT volumes (n = 177) were randomly split into training, validation, and testing datasets in a 60%-20%-20% split, respectively, ensuring an equal composition of SVP, ICP, and DCP volumes in each split. The characteristics of these splits are shown in Table 1. Each slice for each volume underwent automatic contrast adjustment following the ImageJ automatic contrast function that is based on histogram stretching [31]. To improve model robustness, we performed image augmentation [37] on the training dataset with random horizontal flip, vertical flip, affine transformation with translation and scaling, and random cropping to patches of 64 × 64 pixels. All models were trained using a binary cross-entropy loss function. Convolutional and deconvolutional layer weights were initialized as described by He et al. [38]. All models were trained for 600 epochs using an Adam optimizer at a learning rate of 0.0001. We also employed early stopping criteria based on validation set performance with a patience of 180 epochs. During evaluation, the model with the lowest binary cross-entropy loss on the validation set for each architecture was selected as the "best model" and used to segment volumes from the testing set in 64 × 64 × N_i pixel patches, where N_i was the number of slices for volume i. The 64 × 64 segmented masks were reassembled to form the final 2D mask output. We evaluated each model's performance on the testing set with an average Dice coefficient, area under receiver operating characteristic curve (AUC), precision, recall, and accuracy [39][40][41]. The metrics were compared between the models using a one-way ANOVA and follow-up Tukey test with statistical significance of differences determined as p-value < 0.05. We also recorded the time it took each model to generate a segmentation for a single 30-slice volume. This volume was selected as it was the closest to the average number of slices per volume within our testing set.
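The pixel-wise metrics used for evaluation can be computed from the binary confusion counts, as in this minimal sketch. The function name is illustrative; AUC is omitted because it additionally requires the raw probability maps rather than binarized masks.

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Pixel-wise Dice coefficient, precision, recall, and accuracy
    for binary masks (0/1 arrays of the same shape)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)    # vessel pixels correctly labeled
    fp = np.sum(pred & ~truth)   # background labeled as vessel
    fn = np.sum(~pred & truth)   # vessel pixels missed
    tn = np.sum(~pred & ~truth)  # background correctly labeled
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / pred.size,
    }

# Toy example: one true positive, one false positive, one false negative.
pred = np.array([[1, 1, 0, 0]])
truth = np.array([[1, 0, 1, 0]])
m = segmentation_metrics(pred, truth)
print(m)  # every metric equals 0.5 for this toy example
```

In the study these metrics were averaged over the testing volumes and compared across architectures with a one-way ANOVA and Tukey test.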
All the algorithms were implemented on a computer with an NVIDIA GeForce RTX 2070 (8 GB) GPU and an AMD Ryzen 5 3600 6-Core Processor @ 3.6 GHz (16 GB RAM). All image processing, model training, and model evaluation were performed in ImageJ and PyTorch 1.7.1 using Python 3 [31,42].

Results
The UNet, LSTM-UNet, and UNet-LSTM-UNet were each trained with the same 109 training volumes and evaluated on the same 34 testing AO-OCT volumes. We evaluated each model's performance on the testing set with an average Dice coefficient, area under receiver operating characteristic curve (AUC), precision, recall, and accuracy, and report their relative performance in Figure 4 and Table 2. Of the three models, the LSTM-UNet had the greatest average Dice coefficient (0.69), recall (0.80), and AUC (0.88), while the UNet-LSTM-UNet had the best precision (0.65). The LSTM-UNet and UNet-LSTM-UNet had similar average pixel-wise classification accuracy (0.92). Both the LSTM-UNet and UNet-LSTM-UNet had significantly better AUC (p < 0.001 for both) and recall (p < 0.001 and p = 0.001, respectively) performance on the testing set than the UNet alone.

Table 2. Performance metrics (with standard deviation) on the testing set for the three segmentation architectures. Area under receiver operating characteristic curve (AUC). * indicates significant difference (p-value < 0.05 by one-way analysis of variance).

We show representative examples of all three models' qualitative performance in Figure 5. Visually, we observe that all three models were affected by shadowing artifacts in deeper layers (ICP and DCP), but generally the LSTM-UNet and UNet-LSTM-UNet were less affected than the UNet model. Table 3 shows the number of learnable parameters and the evaluation time on a sample 30-slice volume for each architecture, demonstrating that the size of the model was correlated with evaluation time. However, the relationship between the number of parameters and evaluation time was not linear: the LSTM-UNet and UNet-LSTM-UNet segmented the sample volume in 184.8 and 241.7 s, respectively, while the UNet alone, which had 20-25% of the number of parameters of the LSTM-based models, segmented a single 2D projection of the 30 slices in 1.6 s, including image loading time.

Discussion
This is the first study to explore automated vessel segmentation in AO-OCT volumes. In this work, we found that augmentation of a UNet with a LSTM could significantly improve vessel segmentation performance with respect to AUC and recall on our held-out testing set when compared to a UNet alone.
Previous studies have examined alternative 3D-2D vessel segmentation approaches in OCTA retinal imaging and other imaging modalities. Li et al. developed a novel image projection network (IPN), which uses a unidirectional pooling layer to effectively learn weights for each slice within the projection step [43]. This unidirectional pooling layer in the IPN necessitates consistent pixel volumes as input, which would require interpolation or compression of non-uniform input for individual retinal layer blood vessel segmentation. In a large dataset of 500 OCTA volumes, their most recent iteration has shown a best Dice coefficient of 0.93, representing a 0.03 improvement over the baseline 2D-2D UNet with a Dice coefficient of 0.90 in their dataset [44]. As an imaging modality, OCTA uses motion-based processing to improve vessel contrast [45], and it is expected that segmentation performance on OCTA data would be greater for both the base UNet and the IPN than on AO-OCT volumes, which, in their current form, do not undergo any additional processing to improve vessel contrast. In comparison to the added benefit demonstrated by the 3D-2D architecture of Li et al. over the baseline UNet, we achieved a similar improvement of 0.04, with a Dice coefficient of 0.69 for our LSTM-UNet relative to 0.65 for the UNet alone. Lee et al. developed a Spider U-Net, which similarly uses a bidirectional convolutional LSTM to capture inter-slice connectivity, but employs the LSTM between the encoding and decoding paths of several UNet modules for each slice [46]. This architecture was trained and evaluated on multiple modalities, specifically brain MRA, abdomen CT, and cardiac MRI, for 3D-3D segmentation of blood vessels, with Dice coefficients for Spider U-Net improving over the 2D UNet by 0.05, 0.13, and 0.06, respectively, for each dataset.
While this architecture differs from ours in that it requires annotations for each slice within a 3D volume for training and was evaluated on (non-retinal) vessel and organ segmentation tasks, Lee et al. found that incorporating an LSTM for inter-slice connectivity produced fewer false-negative pixels for their task. Our results are consistent with this finding, as the recall for our LSTM-based models was significantly improved over the UNet alone.
Our study also found that vessels in the ICP and DCP that are subject to shadowing artifacts evident in the raw image are more likely to be partially or fully segmented by the LSTM-based models than by the UNet alone. This finding is expected: an averaged projection of a shadowed vessel would have lower pixel intensity and would be more difficult to distinguish in the 2D-2D approach, but the inter-slice context gained from the LSTM could assist with identification of these vessels, indicating that deeper plexuses may particularly benefit from LSTM-based segmentation methods. When comparing the two LSTM-based architectures, we found that the UNet-LSTM-UNet architecture had similar performance to the LSTM-UNet architecture alone, with non-significantly different precision (p = 0.96), Dice coefficient (p = 0.99), recall (p = 0.61), accuracy (p = 0.99), and AUC (p = 0.65), indicating that increased parameters alone are not guaranteed to significantly improve performance. In fact, when factoring in the cost of increased evaluation time for the UNet-LSTM-UNet model, our work suggests that the LSTM-UNet approach is superior to the higher-capacity model. Additionally, while the LSTM-based models demonstrated greater segmentation performance than the UNet alone, the evaluation time for a 30-slice volume (184.8 or 241.7 s) was 100-200 times longer than the projection segmentation (1.6 s). These differences indicate a trade-off between segmentation performance and speed inherent in the two architectures. Whether the time cost of LSTM-based models is a significant barrier for real-world segmentation and outweighs the benefit of greater-fidelity performance will be an important consideration when implementing these models for research or clinical use in the future.
This work is not without limitations. Our sample size represents 177 volumes collected from a limited cohort of 18 patients. We note that this is a substantial number of volumes, comparable to or greater than previously published OCTA vessel segmentation datasets [47], and that theoretically the computer vision task of classifying pixels from grayscale images should be agnostic to individual patient identity or disease state. However, more studies will be needed to ensure these results generalize to a greater subject population and perform consistently across glaucoma and control eyes separately. Designing studies to verify that model segmentation performance does not differ between glaucoma and control eyes is especially important if these tools are intended to quantify biomarkers distinguishing the two disease states. Additionally, as our ground-truth labels were derived from annotating the 2D projection of each volume, rather than 3D annotations with masks for each slice, there is the possibility of human judgement impacting our models' training and performance. However, any error in manual labeling resulting from tracing the 2D projection would more likely bias our results towards the 2D-2D UNet alone, yet our 3D-2D approach remains significantly superior.

Conclusions
The results of this study demonstrate that augmenting traditional UNet approaches with LSTM enables improved automated vessel segmentation in AO-OCT volumes. This 3D-2D approach would enable researchers to continue using lower cost 2D labels on readily available 3D AO-OCT data to train deep learning tools for AO-OCT vessel segmentation.

Disclaimer
The mention of commercial products, their sources, or their use in connection with material reported herein is not to be construed as either an actual or implied endorsement of such products by the U.S. Department of Health and Human Services.

Informed Consent Statement:
Informed consent for the collection and analysis of data was obtained from all subjects involved in the study.

Data Availability Statement:
Computer code and best performance states for computational models can be found at https://github.com/ctnle/AO-OCT-Vessel-Segmentation (accessed on 10 October 2021).

Conflicts of Interest:
The authors declare no conflicts of interest.