Segmentation of Preretinal Space in Optical Coherence Tomography Images Using Deep Neural Networks

This paper proposes an efficient segmentation of the preretinal space between the inner limiting membrane (ILM) and the posterior cortical vitreous (PCV) of the human eye in images obtained with optical coherence tomography (OCT). The research was carried out using a database of three-dimensional OCT scans obtained with the Optovue RTVue XR Avanti device. Various neural network architectures (UNet, Attention UNet, ReLayNet, LFUNet) were tested for semantic segmentation; their effectiveness was assessed using the Dice coefficient and compared to graph-theory techniques. Improvement in segmentation efficiency was achieved through the use of relative distance maps. We also show that selecting a larger kernel size for convolutional layers can improve segmentation quality, depending on the neural network model. In the case of the PCV, we obtain an effectiveness of up to 96.35%. The proposed solution can be widely used to diagnose vitreomacular traction changes, which is not yet available in scientific or commercial OCT imaging solutions.


Introduction
The preretinal space of the human eye and the pathologies connected with its abnormal changes have attracted growing interest in recent years [1]. The field of ophthalmology has benefited greatly from the development of noninvasive diagnostic tools such as optical coherence tomography (OCT) [2,3]. This imaging modality uses near-infrared light reflected from the analyzed tissue to visualize changes in the eye morphology.
One of the significant advantages of OCT is that it makes it possible to visualize the progression of posterior vitreous detachment (PVD) and to monitor its possible pathological outcomes [4][5][6][7]. In the majority of cases, the process of PVD is asymptomatic [8]. It is a consequence of a naturally occurring liquefaction of the vitreous in the aging eye. This phenomenon leads to a progressive separation of the posterior cortical vitreous (PCV) [9] from the retina surface, starting from the weakest points of adhesion: the perifoveal quadrants. In the final stages of the detachment, the vitreous separates from the fovea and the optic nerve head.
Although the complete posterior detachment is prevalent in over 40% of healthy subjects at the age of 60 [8], its abnormal development can cause severe pathological changes such as vitreomacular traction (VMT), epiretinal membrane, and macular hole [10].
For example, Figure 1 illustrates OCT images of a healthy retina with a visible attachment of PCV and a pathological case of vitreomacular traction.
In VMT, the posterior vitreous with continuous adhesion to the macula exerts traction on the retina (mainly the fovea, the most sensitive part of the macula). Persistent traction can lead to deformation of the fovea, cystoid foveal thickening, and disorganization of the retinal layers. Such problems manifest with metamorphopsia (image deformity), deterioration of visual acuity, and blurred or impaired central vision, which significantly impair daily tasks (e.g., reading) [11].
It has also been reported that the prevalence of VMT increases significantly with age, from 1% in subjects 63-74 years old to 5.6% in patients over 85 years old [12]. The chance of spontaneous VMT resolution is up to 23% [13]. However, if left untreated, the probability of severe retina damage (frequently due to the development of a macular hole) and vision deterioration increases with time. Systematic monitoring of preretinal changes allows the physician to determine if (and when) surgical intervention is required. Currently, VMT advancement is determined based only on a single cross-section through the center of the macula. To the best of our knowledge, no research or commercial image informatics solutions allow for automatic segmentation of the vitreous cortex and thus the preretinal space. Therefore, manual measurements can only be made in a few points of the volume and are not sufficient to quantify the profile of this epiretinal pathology. A numerical analysis of preretinal space volume and adhesion area is required to precisely assess the development and current stage of VMT. Such an investigation can include: quantification of the preretinal space volume [14], statistical analysis of vitreoretinal interface parameters (e.g., adhesion area), and description of the stage of pathology development [15] and its changes over time. Therefore, the ophthalmology community highly desires the fully automated OCT image analysis presented in this paper as a necessary step in advancing the fundamental understanding of changes in vitreoretinal interface pathologies. Automated segmentation of the preretinal space, and thus 3D volumetric analysis, would optimize available treatment strategies, including ocriplasmin, sulfur hexafluoride, and octafluoropropane gas injection [14].

Key Contributions
The focus of this paper is the development of an automatic image informatics approach for the segmentation of the preretinal space in both healthy and pathological cases. The major contributions of this study are:
• Evaluation of state-of-the-art neural network methods for the new task of preretinal space segmentation, which, to the best of our knowledge, has not been previously attempted. The conducted experiments demonstrate that the deep learning approach is better suited for segmenting preretinal space than a standard graph-search method.
• Analysis of the influence of kernel shape and size on convolutional network performance for OCT images. With the experiments described in Section 6, we show that by changing the shape and size of a convolutional kernel, it is possible to overcome topology errors in pixel-wise retina layer segmentation.
• Analysis of relative distance map calculation to improve retina layers topology preservation using various network architectures. We propose two methods of obtaining a distance map for any given image. They do not require any prior knowledge about the retina layers (as is the case in the compared methods), are less computationally expensive, and give similar (or in some cases better) results.
• Collection of a unique dataset of 3D OCT images from 50 VMA and VMT subjects (7050 2D B-scans) with manual segmentation data of preretinal and retinal volumes [16]. The gathered data was statistically analyzed with respect to image quality and feature distribution.
The rest of the article is organized as follows. First, the recent state-of-the-art approaches for retina layer segmentation, as well as the methods selected for preretinal space segmentation, are described in Section 2. Section 3 contains the characterization of the data used in the study. Section 4 presents the methods implemented in this study, while Section 5 describes the experiment setup. The results of the designed experiments are presented in Section 6. Finally, Section 7 summarizes the conducted research, discusses its advantages and limitations, and provides insights into problems worth investigating in future studies.

Retinal Layers Segmentation
With the increased availability of OCT imaging devices, the detailed analysis of retina pathologies became possible. However, manual segmentation and measurement of tissue biomarkers are very time-consuming, and with an increased number of pathological subjects, not always an option. Thus, multiple image informatics algorithms have been developed in recent years to support the effort of ophthalmologists.
As can be derived from an analysis of the literature, the complexity of the proposed methods has advanced over the years, as have their retina layer segmentation accuracy and the number of layers that can be segmented. Since 2012, graph-search methods have proved to be among the most accurate for retina layer segmentation in healthy and pathological cases. Their disadvantage, however, is the need for extensive image preprocessing (primarily noise suppression) [27,28] and careful selection of parameters for each dataset to make the designed approach suitable for the task. Additionally, their complexity and high time consumption make them inadequate for real-time application in a clinical setting.
Progress has also been made with machine learning, pattern recognition, kernel, and clustering techniques [29][30][31]. Furthermore, after the expansion of convolutional neural networks (CNNs) in the field of image segmentation, fully convolutional networks (FCNs) became a useful tool for the segmentation of biomedical images [32][33][34]. In 2017, the first attempt was made to use a network called ReLayNet [35] (a variation of UNet [33] and DeconvNet [36]) for retina layer segmentation. Table 1 lists a summary of neural network topologies utilized in current pixel-wise approaches for retina layer segmentation (excluding papers that combine pixel classification with a graph search technique). In addition, the literature was analyzed in terms of the number and size of images utilized for training and the experiment settings (e.g., segmented layers, loss function, data augmentation methods). Any previously undefined disease abbreviations are explained at the end of the paper. The reviewed works focus on segmenting the main retina layers in normal and pathological subjects and, in some cases, accompanying fluids. Although the available databases consist of a limited number of training images, researchers compensate for that with data augmentation methods. The majority of published methods are based on the UNet architecture [33], its modifications (with dilated convolutions [40], batch normalization [35], dropout layers [42]), or combinations with other networks, such as ResNet [45]. Aside from accuracy, an advantage of employing neural networks for retina segmentation is their ability to simultaneously segment multiple layers as well as fluids and pathological tissues. Furthermore, contrary to classical solutions, they do not require a separate set of models and parameters for each specific disease or normal case.

Preretinal Space Segmentation
Despite the plethora of available retina layer segmentation methods, preretinal space segmentation from OCT images is not widely researched. To the best of our knowledge, deep learning methods have not yet been used for this task, and only a handful of reports of other approaches can be found in the literature.
For instance, Malagola et al. [46] showed that it might be possible to measure the volume of the preretinal space after re-calibrating the native OCT device segmentation algorithm to search for the preretinal space instead of the retina. This approach, however, does not allow performing any further numerical analysis concerning the retina morphology, is not fully automatic, requires re-calibration of the device for each scan, and, most of all, is not device-independent; thus, it cannot be employed in worldwide research.
The first graph search-based approach that segmented both the retina borders (e.g., the inner limiting membrane (ILM) and retinal pigment epithelium (RPE)) and the posterior vitreous cortex was published in 2014 [47]. However, as was reported, the graph search requires significant preprocessing (e.g., denoising, removing low-quality areas of the image). Even with such preparations, this method is prone to errors because the PCV line frequently lies at the noise level of the image or its reflectivity is too low. The main disadvantage of the graph search approach is the assumption that the PCV line is visible through the entire width of the image, which, due to its varying density, is not necessarily true.
Furthermore, automated segmentation was used for quantitative analysis of epiretinal abnormalities only in several studies [48][49][50] (for epiretinal membrane and macular hole). The lack of research in PCV segmentation is caused by the unavailability of the data (OCT scans and manual segmentation) with both VMA and VMT, and insufficient accuracy of state-of-the-art retina segmentation methods when applied to this task.

Issues with Layers Topology
As described above, convolutional neural networks typically output probability maps classifying each pixel as belonging to one of the designed classes. This means, however, that individual pixels are analyzed locally. While this methodology gives results that may match regional data well, a closer look at a B-scan (see Figure 1) reveals that many areas belonging to different layers or image regions have similar intensity or contrast characteristics. As a result, CNNs produce inconsistencies in the topological order of retina layers that are unacceptable in medical imaging [51].
One of the first approaches to address that issue was proposed by He et al. [44,51]. In their works, two separate networks were used: the first to learn the intensity features, and the second to correct the obtained topology by learning, in an adversarial manner, the implicit latent network features corresponding to shape factors. Some other works on topology-guaranteed predictions tried to directly predict the coordinates of layer boundaries while encoding the prediction of lower layers as a relative position to the upper ones [52]. Nevertheless, such an approach may easily lead to error propagation if the uppermost boundary is incorrectly segmented.
A promising approach was proposed by Lu et al. [53], in which they integrated information about the hierarchical structure of the image in the form of a relative distance map (RDM). This map, computed from an initial graph search-based retina borders segmentation in a preprocessing step, was provided to the neural network as a second channel of the input image. This work was further extended by Ma et al. [40] by substituting the graph search-based initial segmentation with cascading networks trained separately.
The relative distance map provides a way of introducing additional spatial information alongside the input image. Each pixel value of the map corresponds to the pixel position in the image as a function of retina location. Thus, knowing the coordinates of the inner and outer retina borders (namely ILM and RPE) across a B-scan, the intensity values of the relative distance map are computed for each pixel with indexes (x, y) as follows:

RDM(x, y) = (y − ILM(x)) / (RPE(x) − ILM(x)),     (1)

where ILM(x) and RPE(x) represent the y (vertical) coordinate of the previously segmented ILM and RPE lines in the image column x. According to this equation, the pixels above the ILM take values < 0, pixels positioned within the retina tissue take values in the range of [0, 1], and pixels below the retina take values > 1. Such a weighting scheme, concatenated (as a second channel) to the original OCT image, which is also in the range of [0, 1], allows the network to learn the topological dependence of the layers. This procedure boosts the precision of segmenting non-neighbouring layers with similar intensity patterns or lower contrast.

Our work explores the possibility of utilizing deep neural networks (DNN) for the task of preretinal space segmentation and investigates the challenges connected with the specificity of the preretinal space (shape and image intensity variations). We present the research results for preserving the topological correctness of the segmented OCT image areas, including the implementation of four different distance maps (based on prior segmentation or without it). We also investigate the influence of convolutional kernel size and shape on topology correctness and overall segmentation accuracy.
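Equation (1) can be sketched in a few lines of NumPy (an illustrative implementation, not the authors' code; `ilm` and `rpe` are assumed to hold one y-coordinate per image column):

```python
import numpy as np

def relative_distance_map(ilm, rpe, height):
    """Compute the relative distance map for one B-scan.

    ilm, rpe : 1-D arrays of y-coordinates (one per image column) of the
               previously segmented ILM and RPE lines.
    height   : number of rows in the B-scan.
    Pixels above the ILM get values < 0, pixels inside the retina fall
    in [0, 1], and pixels below the RPE get values > 1.
    """
    ilm = np.asarray(ilm, dtype=float)
    rpe = np.asarray(rpe, dtype=float)
    y = np.arange(height, dtype=float)[:, None]        # column vector of row indices
    return (y - ilm[None, :]) / (rpe - ilm)[None, :]   # broadcast over columns
```

The resulting (height, width) array is concatenated to the normalized B-scan as a second input channel.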

OCT Image Dataset
The goal of conducting research in OCT retina layer segmentation creates a need for numerous images with annotations of the biomarkers searched for in a specific task. Since 2018, only a few public OCT databases with 3D images have been established, and most of them are aimed at classifying a specific disease from a single B-scan. Other cohorts, focused on automatic retina layer segmentation, provide manual segmentations for 3 to 8 retina borders. Most contain 10 to 25 volumes [37,[54][55][56], although one database consists of over 380 scans [57]. The subjects included in those databases are either healthy or suffer from pathologies such as age-related macular degeneration (AMD), diabetic macular edema (DME), or diabetic retinopathy (DR). Nevertheless, none of the available databases concern patients with vitreoretinal interface pathologies, especially vitreomacular traction, not to mention annotations of the PCV.
Thus, a CAVRI (Computer Analysis of VitreoRetinal Interface) dataset [16] of OCT images with VMA and VMT has been created to analyze the characteristics of preretinal space. Subjects for this dataset were recruited at the Department of Ophthalmology, Chair of Ophthalmology and Optometry, Heliodor Swiecicki University Hospital, Poznan University of Medical Sciences in Poznan, Poland. The study was approved by the Bioethics Committee of Poznan University of Medical Sciences under resolution no. 422/14. All participants signed an informed consent document before enrollment.
The CAVRI database contains 3D images of the macula obtained with the Avanti RTvue device (Optovue, Incorporated, Fremont, CA, USA) using the 3D Retina scanning protocol. For this research, from a group of 73 cases a set of 50 OCT volumes was selected: 25 examples of the healthy retina (with asymptomatic vitreomacular adhesion (VMA)) and 25 subjects with VMT. Each 3D scan consists of 141 cross-sections with 640 × 385 px resolution representing 2 × 7 × 7 mm retina volume. The corresponding voxel sizes equal 3.125 µm in the vertical direction (further denoted as y), and 18.18 µm and 49.65 µm in fast-scanning (x) and non-fast scanning (z) directions, respectively. No multi-sampling and noise-reduction protocols were used during acquisition or preprocessing. All 7050 cross-sections (also called B-scans), visualized as gray-scale images, were analyzed separately.
The PCV line, visible as a hyperreflective line in a B-scan (see Figure 1), was manually annotated under the supervision of clinical experts (three ophthalmologists) using the custom-made public software OCTAnnotate [58]. The selected 50 subjects had a maximum inter-expert difference of manual segmentation of less than 3 px. In addition, two other lines denoting retina borders were also labeled for investigational purposes, namely the inner limiting membrane (ILM) and the retina pigment epithelium (RPE) outer border. These three lines (and consequently the four image regions they create) are the ground truth for this research. Based on the reference segmentation of the PCV and ILM lines, we calculated the preretinal space volumes as the main metric of comparison. The average preretinal space volume for VMA is 3.17 ± 1.96 mm³ and for VMT is 12.19 ± 6.09 mm³. The Wilcoxon test resulted in a p-value close to zero (3.6 × 10⁻¹⁰), which confirms significantly different preretinal space volumes for the VMA and VMT groups.
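Given such line annotations, the preretinal space volume can be estimated by counting the pixels between the PCV and ILM lines and scaling by the voxel dimensions stated above. A NumPy sketch (illustrative only; `pcv` and `ilm` are hypothetical arrays of per-column line positions, not the paper's actual code):

```python
import numpy as np

# Voxel dimensions from the scanning protocol (in mm).
VOXEL_Y = 3.125e-3   # vertical (depth) direction
VOXEL_X = 18.18e-3   # fast-scanning direction
VOXEL_Z = 49.65e-3   # non-fast scanning direction

def preretinal_volume(pcv, ilm):
    """Preretinal space volume in mm^3 from annotated line positions.

    pcv, ilm : 2-D arrays of shape (n_bscans, width) holding the y-coordinate
               of the PCV and ILM lines for every column of every B-scan.
    The space between the PCV (above) and the ILM (below) is summed column
    by column and scaled by the voxel volume.
    """
    heights = np.clip(np.asarray(ilm, float) - np.asarray(pcv, float), 0, None)
    return heights.sum() * VOXEL_Y * VOXEL_X * VOXEL_Z
```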

Data Anomaly Detection
As part of the data-driven field of artificial intelligence, deep neural networks are highly dependent on the data itself. It is widely considered that the more data used for training, the better the machine learning model [59]. Additionally, when providing data for training, one should ensure that all classes (data categories) are equally represented in the dataset to give the model a chance to learn the characteristic features of all cases that may occur in the evaluation set. Furthermore, the quality of the obtained model increases with better data samples [60]. Accordingly, anomalous examples may hinder the learning process regardless of the complexity of the model.
By performing statistical analysis of the data, it is possible to discern the anomalous examples in the set. The outliers (otherwise known as anomalies or exceptions) are data examples outside the normal distribution of data features [61,62]. For images, it is possible to distinguish multiple descriptive features (from size and color intensity to contrast and noise). Outliers then could be found in an n-dimensional space (of n-features) [63].
As was established in [64] the presence of outliers in the training dataset could significantly compromise the accuracy of a machine learning model. Therefore, detecting and removing data samples that in any way differ from the bulk of the training set have a favorable influence on obtaining a better and more robust prediction model [65,66].
A wide variety of outlier detection algorithms is constantly being developed and compared by many researchers [67][68][69]. In our experiments, a robust covariance method [70] was utilized for this task. The advantage of this unsupervised anomaly detection method is the fast estimation of unusual data without the need for labeling or any prior knowledge about the dataset. This technique calculates the elliptic envelope of the features (assuming a Gaussian distribution of the entire dataset) and regularizes the covariance matrix to determine the samples outside the boundary. The m% of the images with the lowest prediction scores are considered anomalous.
Utilizing the anomaly detection implementation provided by Kucukgoz [71], five image features are employed in our research, namely: noise score, contrast score, brightness-darkness score, blurriness score, and average pixel width score. Based on the estimated covariance predictions, the 3% of the data samples with the lowest scores were established as anomalies and excluded from the experiment. Figure 2 illustrates the anomaly scores (equal to 1 − covariance score) for the analyzed images.
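The elliptic-envelope idea can be illustrated with a simplified NumPy sketch. Note that this is a sketch under simplifying assumptions: it uses a plain (non-robust) Gaussian fit scored by Mahalanobis distance, whereas the actual experiments rely on the robust covariance implementation from [70,71]; `flag_anomalies` and its inputs are hypothetical names:

```python
import numpy as np

def flag_anomalies(features, contamination=0.03):
    """Flag the given fraction of samples as anomalous.

    features : (n_samples, n_features) array, e.g. columns for the noise,
               contrast, brightness-darkness, blurriness, and average
               pixel width scores.
    Fits a Gaussian to the data, scores each sample by its squared
    Mahalanobis distance from the mean (the elliptic-envelope idea), and
    marks the `contamination` fraction with the largest distances.
    """
    X = np.asarray(features, dtype=float)
    mu = X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    inv = np.linalg.pinv(cov)
    d = np.einsum('ij,jk,ik->i', X - mu, inv, X - mu)  # squared distances
    cutoff = np.quantile(d, 1.0 - contamination)
    return d > cutoff
```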

Methods
To segment the PCV using a deep learning approach, we trained four state-of-the-art convolutional neural networks. Then, we compared their output with the previously described graph search-based approach and the ground truth. This section gives a general description of the utilized fully convolutional network architectures, including UNet, Attention UNet, ReLayNet, and LFUNet.
The processing pipeline of the proposed system is presented in Figure 3. Our framework learns correct preretinal space segmentation by separately processing a cohort of 2D OCT cross-sections with their relative distance maps. The predicted probability maps are compared with the ground truth, and the resulting error (loss) is used to update the network weights. The final binary segmentation maps are used to calculate the borders between the segmented image regions, namely the PCV, ILM, and RPE lines.

The classical image segmentation utilizing graph search and dynamic programming is described in [26]. Here, based on the vertical image gradient, an adjacency matrix for a graph is calculated. Next, the algorithm finds the shortest path between the left and the right border of the image. This path represents a cut between image regions such as preretinal space and retina (i.e., the ILM line) or retina and choroid (i.e., the RPE line).
As was described in [47], the same principle can be applied to segmenting the edge of the posterior vitreous cortex. The areas between the subsequent lines correspond to preretinal and retinal regions.
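The column-to-column shortest path underlying this approach can be sketched with simple dynamic programming (an illustrative toy version, not the implementation from [26,47]; here the per-pixel cost is just the negated vertical gradient, and the path may shift by at most one row between neighbouring columns):

```python
import numpy as np

def shortest_path_boundary(image):
    """Trace a left-to-right boundary through a B-scan via dynamic programming.

    The cost of each pixel is the negated vertical gradient, so dark-to-bright
    transitions (such as the ILM) become cheap. Returns the row index of the
    boundary for every column.
    """
    cost = -np.gradient(image.astype(float), axis=0)
    h, w = cost.shape
    acc = np.empty((h, w))                 # accumulated path cost
    back = np.zeros((h, w), dtype=int)     # predecessor row offsets
    acc[:, 0] = cost[:, 0]
    for x in range(1, w):
        padded = np.pad(acc[:, x - 1], 1, constant_values=np.inf)
        # For each row: cheapest of the three reachable predecessors (up/same/down).
        choices = np.stack([padded[:-2], padded[1:-1], padded[2:]])
        best = choices.argmin(axis=0)
        acc[:, x] = cost[:, x] + choices[best, np.arange(h)]
        back[:, x] = best - 1              # offset: -1, 0, or +1
    y = int(acc[:, -1].argmin())
    path = np.empty(w, dtype=int)
    path[-1] = y
    for x in range(w - 1, 0, -1):          # backtrack from the cheapest endpoint
        y += back[y, x]
        path[x - 1] = y
    return path
```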

UNet
UNet is an architecture proposed by [33] that obtains good accuracy in the semantic segmentation of biomedical images. It consists of encoder and decoder paths, each with five levels of double convolution blocks. Each block incorporates a 3 × 3 px convolution followed by a ReLU (Rectified Linear Unit) activation function. Between the five encoder levels, a downsampling 2 × 2 px max-pool operation with a stride of 2 × 2 px is applied, while each level doubles the number of feature channels. Correspondingly, in the decoder path the feature maps are upsampled with a 2 × 2 px up-convolution while the number of feature channels is halved.
One of the beneficial procedures introduced in UNet is the skip connection: the feature maps at the end of each encoder level are concatenated to the upsampled decoder maps before being processed by the convolution blocks. Such an operation allows preserving relevant information from the input features. The probability maps are obtained by applying a final 1 × 1 px convolution after the last decoder block, transforming the 64-element feature map into a segmentation mask for each desired class.
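The resulting feature-map sizes can be tallied with a short bookkeeping helper (a sketch assuming padded convolutions, so only pooling changes the spatial size; the function name is ours):

```python
def unet_shapes(height, width, base_channels=64, levels=5):
    """Trace (channels, height, width) at each encoder level of a UNet.

    Each level applies two 3x3 convolutions (assumed padded, so the spatial
    size is unchanged) and, between levels, a 2x2 max-pool with stride 2
    halves the spatial dimensions while the channel count doubles.
    """
    shapes = []
    c, h, w = base_channels, height, width
    for _ in range(levels):
        shapes.append((c, h, w))
        c, h, w = c * 2, h // 2, w // 2   # transition to the next level
    return shapes
```

For a 640 × 384 px B-scan this yields (64, 640, 384) at the first level down to (1024, 40, 24) at the bottleneck; with the 32 initial features used later in this work, all channel counts halve.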

Attention UNet
An extension of the UNet architecture is the Attention UNet proposed by [72]. It introduces attention gates to highlight any significant features that are passed through the skip connection. Its advantage is maintaining a simple design while decreasing model sensitivity to the background regions.
The general design of this network is similar to the baseline UNet, with five double 3 × 3 px convolution blocks in the encoder and decoder paths. The attention module is applied to each encoding result before they are concatenated to the decoder blocks. The function of this grid-based gating mechanism is to minimize the influence of irrelevant or noisy features. The PyTorch implementation of the Attention UNet network utilized in this experiment was obtained from [73].

ReLayNet
ReLayNet [35] was the first CNN employed for the retina layer segmentation task. It is based on UNet, but with fewer convolution layers in each encoder and decoder block, a non-expanding number of features in each hidden layer, and only 3 (instead of 4) pooling/unpooling operations. An addition to such simplified architecture is the Batch Normalization procedure performed after each convolution and before the ReLU activation function.
ReLayNet also differs from the original UNet in the kernel size used for each convolution, which is 7 × 3 px instead of 3 × 3 px. As was reported in [35], this ensures that the receptive field at the lowest level in the network covers the entire retina depth. As will be further demonstrated in Section 6, increasing the receptive field of a convolution kernel has a significant impact on the segmentation accuracy.

LFUNet
The LFUNet network architecture is a combination of UNet [33] and FCN [74] with additional dilated convolutions [75]. In this network, the encoder part is the same as in the original UNet and consists of 4 blocks that contain two convolution layers with a 3 × 3 px kernel and a 2 × 2 px max-pooling layer with stride 2.
The decoder part consists of two parallel paths for UNet and FCN. The UNet path utilizes concatenation of up-sampled feature blocks with the corresponding blocks from the encoder part (a procedure also referred to as "skip-connections", that allows exploiting high-resolution information). The FCN path performs the addition of up-sampled feature blocks with the matching encoder blocks. The upsampling in both paths is performed with the 2 × 2 px up-convolution layer after each convolution block.
An additional strength of this network lies in its last part, which concatenates the final feature maps obtained from both decoder paths. These are subsequently processed with three separate dilated convolution kernels, and the resulting matrices are again concatenated before the final convolution. The output probability map for each pixel belonging to one of the C classes is obtained with the Softmax function. ReLU is used as the activation function in all hidden layers.
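The final Softmax step can be written directly in NumPy (a generic, numerically stable formulation of the standard function; in the network it is applied to the (C, H, W) logit tensor):

```python
import numpy as np

def softmax_probabilities(logits):
    """Convert per-class logits of shape (C, H, W) into probability maps.

    A numerically stable Softmax over the class axis: for every pixel the
    C values are exponentiated and normalised so that they sum to one.
    """
    z = logits - logits.max(axis=0, keepdims=True)   # stability shift
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)
```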

Relative Distance Map
It should be noted that the problem of preserving correct topology in retina layers segmentation is even more pronounced for the preretinal space since it has almost the same intensity range as the vitreous. Hence, in this work, we employed the available approach for preparing the RDM (here referred to as "2NetR") based on prior information of retina borders and proposed a modified version tailored to the problem of segmenting the preretinal space. In addition, we also tested if a more straightforward map that does not require two cascaded networks and is computationally less expensive could also facilitate this task.

RDM Based on Prior Segmentation
To increase the significance of the preretinal space as a region below the vitreous but above the retina, we propose to utilize a distance map (further called "2NetPR") that emphasizes this region with a dedicated range of values defined for each pixel. Nevertheless, as efficient as this idea is, it still requires prior knowledge of the retina borders in a given cross-section. As reported, this information can be obtained via a graph search approach [53] or by performing the segmentation twice, incorporating two neural networks [40].

RDM without Prior Segmentation
Thus, we also investigated approaches that do not require any a priori knowledge about the retina position within the analyzed image. The two following solutions are evaluated:
• Basic Map with Orientation: Firstly, we investigated whether a map of linearly spaced values in the range of [0, 1] would provide the network with sufficient information about the layers' hierarchy. Additionally, to account for the retina orientation in the image and the resulting rotation of the preretinal space, we propose to arrange the values according to the said retina orientation. For this purpose, the orientation is determined by first applying a Gaussian filter on the image (with σ = 3) and then calculating the arctangent of the vertical and horizontal image edges subsequently obtained with the use of Sobel edge detection. This map will be further called "BasicOrient".
• Cumulative Sum Map: The second method incorporates calculating a cumulative sum of image intensity values for each column of the image. This is based on the assumption that pixels in the vitreous and preretinal space regions have very low intensity values, as opposed to the retinal region. Additionally, the pixels below the retina have average intensity, hence providing lower variations in the cumulative sum. Furthermore, by performing a scaling operation, it is possible to obtain values similar to those produced by Equation (1), but at significantly less computational expense (no need for initial segmentation). Moreover, this method is not subject to error propagation (which may occur if the initial segmentation algorithm provides incorrect ILM and RPE borders). Thus, this map is further referred to as the "CumSum" map.
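The CumSum map is straightforward to prototype (an illustrative sketch; the exact scaling used in the experiments may differ, here we use a per-column min-max normalization):

```python
import numpy as np

def cumsum_distance_map(image):
    """Approximate a relative distance map without prior segmentation.

    For each column, intensities are accumulated from top to bottom: the
    sum stays near zero in the dark vitreous and preretinal space, rises
    steeply through the bright retina, and flattens again below it.
    Min-max scaling per column maps the values into [0, 1].
    """
    img = np.asarray(image, dtype=float)
    csum = np.cumsum(img, axis=0)
    lo = csum.min(axis=0, keepdims=True)
    hi = csum.max(axis=0, keepdims=True)
    return (csum - lo) / np.maximum(hi - lo, 1e-12)
```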

Kernel Size
The pixel intensity values of the preretinal space are similar to those of the vitreous. Therefore, the network may not have enough information about the surroundings to correctly assign a given pixel to a class. Furthermore, the area and shape of the preretinal space differ from B-scan to B-scan.
Another way of providing the network with the information of where a given pixel belongs within the image is using a bigger convolution kernel. In contrast to works described in the literature review in Section 2, we propose the use of a non-typical convolutional kernel. It has been reported that by utilizing a vertical convolutional kernel of the size 7 × 3 px for ReLayNet [35], the network captures the entire retina in the lowest convolution level. Nevertheless, this approach has not been sufficiently discussed or analyzed in retina layers segmentation to explain the selected kernel size.
Within the retina scan, pixel intensities vary significantly in the vertical direction; therefore, it can be beneficial to utilize a bigger kernel to detect those changes. In our experiments, we check the influence of:
• square kernels: 3 × 3, 5 × 5, and 7 × 7 px,
• vertical kernels: 5 × 3, 7 × 3, and 9 × 3 px,
• horizontal kernels: 3 × 5, 3 × 7, and 3 × 9 px.
Bearing in mind the computational cost of incorporating a bigger convolutional kernel, we posit that even a non-uniform filter can significantly improve the accuracy of pixel-wise segmentation.
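Why a taller kernel helps can be quantified with the standard receptive-field recurrence. The sketch below (our own helper, assuming a simplified four-level encoder of two convolutions plus a pooling per level) compares the vertical receptive field at the bottleneck for 3 px versus 7 px kernels:

```python
def receptive_field(layers):
    """Receptive field (in px) of a stack of layers along one dimension.

    layers : list of (kernel, stride) tuples.
    Uses the standard recurrence: r += (k - 1) * jump; jump *= stride.
    """
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

def encoder_rf(conv_k, levels=4):
    """Receptive field after `levels` of (two convs + 2x2 pool)."""
    layers = []
    for _ in range(levels):
        layers += [(conv_k, 1), (conv_k, 1), (2, 2)]
    return receptive_field(layers)
```

Under these assumptions, the vertical receptive field grows from 76 px with a 3-px kernel to 196 px with a 7-px kernel, which illustrates why a taller kernel can span a much larger portion of the retina depth.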

Experiment Design
We have performed a comprehensive set of experiments designed to measure the performance of various deep neural networks employed for the segmentation of preretinal space from OCT images. We evaluated the effect of removing anomalous data samples from the training set and the influence of data augmentation on the model accuracy. We addressed the issue of incorrect class topology common in pixel-wise image segmentation with the calculation of a relative distance map as guidance information for the system. With a set of tests we compared our proposed solution to a state-of-the-art method. Furthermore, we also proposed and evaluated an alternative method of changing the size of the convolution kernel while measuring the computational complexity.
In this section, we describe the experiment setup and parameters of the system for all segmentation methods. Next, we describe data augmentation techniques utilized in this research, and finally, we provide information regarding the evaluation metrics used to compare the obtained results quantitatively.

Training
The goal of the segmentation task is to predict 4 separate areas in the image, further described as the set of classes C = {0: Vitreous, 1: Preretinal Space, 2: Retina, 3: Space under Retina}. The network's multi-class classification task is to assign each pixel of the image to one of these classes.
Bearing in mind the specificity of the preretinal space, we consider the possibility that the PCV line is not sufficiently visible throughout the scan or is partially connected to the ILM. In such a situation, using narrow patches could mislead the network. Hence, we input to the network an entire B-scan (i.e., a gray-scale image) with a resolution of 640 × 384 px, which encourages a smoother layer surface across the image. In addition, before processing, each image was subjected to standard z-score normalization.
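A minimal sketch of the described preprocessing, assuming plain per-image z-score normalization of a 640 × 384 px B-scan (variable and function names are illustrative):

```python
import numpy as np

# The four target classes described above.
CLASSES = {0: "Vitreous", 1: "Preretinal Space", 2: "Retina", 3: "Space under Retina"}

def zscore(bscan: np.ndarray) -> np.ndarray:
    """Per-image z-score normalization: zero mean, unit standard deviation."""
    mean, std = bscan.mean(), bscan.std()
    return (bscan - mean) / (std + 1e-8)  # epsilon guards against flat images

bscan = np.random.rand(640, 384).astype(np.float32)  # H x W gray-scale scan
norm = zscore(bscan)
assert abs(norm.mean()) < 1e-5 and abs(norm.std() - 1.0) < 1e-3
```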
All neural networks described in this paper were implemented using Python 3.7 with PyTorch 1.8.1 and NVIDIA CUDA 11.2 libraries. The experiments were conducted on a 64-bit Ubuntu operating system with an Intel Core i7-7700K 4.20GHz computing processor and 32 GB RAM. The NVIDIA GeForce GTX 1080 Ti GPU card with 11 GB memory was used during training and evaluation.
The CAVRI dataset was randomly split into training, validation, and testing subsets with the ratio of 80%, 10%, and 10%, respectively. The images in the training set were used to learn the neural network weights. The validation set was used at the end of each epoch to check the model's accuracy and validation loss. Finally, the test set contains images previously unseen by the network and is used to evaluate all segmentation methods.
Using the PyTorch Lightning 1.3.5 library, we trained each network with the Adam optimizer and the following parameters: learning rate lr = 5·10^−6, β1 = 0.9, β2 = 0.999. Due to the random cropping procedure used for data augmentation, which produces images of various sizes, the batch size was set to one. Each network was trained for at least 50 epochs, and the training was stopped if the validation loss did not decrease for the last five epochs. Models were evaluated on the best checkpoint, corresponding to the lowest validation loss value. It should also be noted that, due to memory constraints, all networks were implemented with 32 initial feature vectors instead of the original 64. According to initial experiments, this change does not have a significant degrading impact on model accuracy. The hyper-parameters of the model (i.e., the weights of the loss function and the data augmentation techniques) were chosen experimentally, and the best values and techniques were used to obtain the presented results. The implementation code is available online at https://github.com/krzyk87/pcv_segmentation (accessed on 5 November 2021).
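The training configuration above can be sketched roughly as follows. The model and the dummy validation loss are placeholders, and a real setup would use the PyTorch Lightning `EarlyStopping` callback rather than this hand-rolled loop; the sketch only illustrates the optimizer settings and the patience-of-five stopping rule.

```python
import torch

# Placeholder "network" standing in for UNet/LFUNet/etc.
model = torch.nn.Conv2d(1, 4, kernel_size=3, padding=1)
# Adam with the parameters stated in the text: lr = 5e-6, betas = (0.9, 0.999).
optimizer = torch.optim.Adam(model.parameters(), lr=5e-6, betas=(0.9, 0.999))

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(50):
    # ... one pass over the training set with batch_size=1 would go here ...
    val_loss = 1.0 / (epoch + 1)  # dummy validation loss, decreasing each epoch
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0  # would also save a checkpoint here
    else:
        bad_epochs += 1
        if bad_epochs >= patience:          # stop after 5 epochs without improvement
            break
assert abs(best_val - 1.0 / 50) < 1e-12
```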
The graph search algorithm was implemented in the Matlab/Simulink environment [76] on a 64-bit PC workstation with Windows 10 operating system, Intel Core i7-3770 3.40 GHz processor, and 8 GB RAM.

Loss Function
The training is aimed at minimizing a loss function L(I, Î) between the arrays of ground truth I and prediction Î. It is designed as a weighted sum of a multi-class logistic loss (L_log) and a Dice loss (L_Dice). The utilized loss function is implemented as proposed in the referenced and compared methods [35,40] to maintain consistency between the network architectures.

The multi-class log loss, also known as Categorical Cross-Entropy, is a distribution-based criterion calculated as follows:

L_log(I, Î) = −Σ_{c∈C} Σ_{x∈X} Σ_{y∈Y} (ω_c(x, y) / n_c) · I_c(x, y) · log Î_c(x, y), (3)

where I_c(x, y) is a binary ground truth mask for class c ∈ C = {0, 1, 2, 3} taking value 0 or 1 at each location (x, y), for x ∈ X = {1, ..., w} and y ∈ Y = {1, ..., h}, where w and h denote the width and height of the image, respectively; Î_c(x, y) is the predicted probability of the pixel with indices x and y belonging to class c; n_c is the number of pixels in a given class c; and ω_c(x, y) is an additional weight given to each pixel depending on its class and position within it.

In detail, since the PCV line is very often at the level of noise in the image, and since the edges of the regions can be blurred due to OCT characteristics, the pixels at the edges are given an additional weight q_1 to boost the network's sensitivity to class boundaries. Furthermore, the pixels belonging to the classes of interest (namely, preretinal space and retina) are given an additional weight q_2 to compensate for their smaller area in the image (as opposed to the background). Equation (4) describes the overall pixel weight calculation:

ω_c(x, y) = 1 + q_1 · 1[(x, y) lies on a region boundary] + q_2 · 1[c ∈ {1, 2}], (4)

where 1[·] denotes the indicator function. The Dice score for a single class c is computed as:

DSC_c = 2|I_c ∩ Î_c| / (|I_c| + |Î_c|), (5)

where |·| denotes the sum of pixels in the corresponding ground-truth mask I_c and prediction Î_c for a class c. Consequently, the Dice loss L_Dice takes into account the Dice scores of all the classes and can be expressed as follows:

L_Dice(I, Î) = 1 − Σ_{c∈C} λ_c · DSC_c, (6)

where λ_c is a weight assigned to each class to compensate for their imbalance within the set.
Numerical analysis of all the pixels in the dataset shows that the preretinal space is the most underrepresented class, while the background (the vitreous region even more than the region below the retina) spans the largest area in each volume. We calculated the weights for each class as presented in Table 2 using the following equation:

λ_c = (1 / n_c) / Σ_{k∈C} (1 / n_k), (7)

where n_c is the number of pixels belonging to class c ∈ C. All the weights sum up to 1, so that a maximum Dice score for all the classes produces a Dice loss equal to 0, according to Equation (6). The overall loss function L(I, Î), being a weighted sum of the above-described components, is calculated as follows:

L(I, Î) = α·L_log(I, Î) + β·L_Dice(I, Î), (8)

where α and β are the weights assigned to each loss component. During the experiments, their values were empirically chosen as α = 1 and β = 0.5. The parameters for the pixel-wise weights in Equation (4) are also consistent with the compared methods: q_1 = 10 and q_2 = 5.
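A hedged sketch of the combined loss of Equation (8), with a per-pixel weight map for the cross-entropy term and per-class weights λ_c for the Dice term. The function signature and tensor layout (one-hot ground truth and softmax prediction of shape C × H × W) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def combined_loss(pred, target, pixel_w, lam, alpha=1.0, beta=0.5):
    """L = alpha * L_log + beta * L_Dice for (C, H, W) probability maps."""
    eps = 1e-7
    # Multi-class log loss with per-pixel weights omega_c(x, y).
    log_loss = -(pixel_w * target * torch.log(pred + eps)).sum() / target.numel()
    # Dice loss: 1 - sum_c lambda_c * DSC_c, computed per class.
    inter = (pred * target).sum(dim=(1, 2))
    sums = pred.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    dice = (2 * inter + eps) / (sums + eps)
    dice_loss = 1.0 - (lam * dice).sum()
    return alpha * log_loss + beta * dice_loss

C, H, W = 4, 8, 8
target = F.one_hot(torch.randint(0, C, (H, W)), C).permute(2, 0, 1).float()
lam = torch.full((C,), 0.25)          # equal class weights, summing to 1
# A perfect prediction drives the combined loss to (numerically) zero.
loss = combined_loss(target.clone(), target, torch.ones(C, H, W), lam)
assert abs(loss.item()) < 1e-4
```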

Data Augmentation
We utilized data augmentation techniques during training to improve the model's generalizability and increase the segmentation accuracy. Thanks to this, the number of image examples was expanded artificially with each technique, while maintaining data characteristics that may naturally occur. The following transformations, performed in 2D for each cross-section, were used:
• Rotation—In order to determine the range of random orientations to apply, we performed a statistical analysis of the retina orientation distribution within the CAVRI dataset. As can be seen in Figure 5, the obtained results for all subsets have a similar distribution and are within ±25 degrees. Thus, a rotation with a randomly chosen angle in the range of ±20 degrees was performed for each image.
• Vertical Translation—The automatic acquisition protocol in an OCT device aims at focusing the device's optics on the retina. Notably, the thickness of the retinal tissues extends to an average of 200 µm within the 2 mm depth of the scan. Therefore, we performed a statistical analysis of the retina position within the image (across all cross-sections in the database) to determine the distribution of the retina's vertical position. For that purpose, we estimated the center of mass in each image and plotted the obtained positions within the image dimension range, as illustrated in Figure 6. It can also be noted that each subset maintains a similar distribution, confirming an appropriate dissemination of samples between the subsets. Based on the gathered information, we set the range of vertical translation of the image to ±10% of the image height, equal to ±64 px.
• Random Crop—The wide variety of OCT acquisition devices allows for an even greater number of scanning protocols, with various image sizes and scanning widths. Thus, by performing the augmentation technique of random cropping, we train the network to perform well on any input image regardless of its size or fovea-width to image-width ratio. In our experiment, we employed a crop with randomly selected values for both width and height (within the range of 80-100% of the original values).
Utilizing such data augmentation techniques allowed us to increase the number of training examples, as shown in Table 3.
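Two of the augmentations above (vertical translation and random crop) can be sketched as follows; the rotation step would typically use a library routine such as `scipy.ndimage.rotate` and is omitted here to keep the sketch dependency-free. Note that `np.roll` wraps pixels around the image edge, whereas a production pipeline would pad instead; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def vertical_translate(img, mask, max_frac=0.10):
    """Shift image and mask together by up to +/-10% of the image height."""
    limit = int(img.shape[0] * max_frac)              # 64 px for a 640-row scan
    shift = rng.integers(-limit, limit + 1)
    return np.roll(img, shift, axis=0), np.roll(mask, shift, axis=0)

def random_crop(img, mask, min_frac=0.80):
    """Crop to a random 80-100% of the original height and width."""
    h = rng.integers(int(img.shape[0] * min_frac), img.shape[0] + 1)
    w = rng.integers(int(img.shape[1] * min_frac), img.shape[1] + 1)
    top = rng.integers(0, img.shape[0] - h + 1)
    left = rng.integers(0, img.shape[1] - w + 1)
    return img[top:top + h, left:left + w], mask[top:top + h, left:left + w]

img = rng.random((640, 384), dtype=np.float32)
mask = rng.integers(0, 4, (640, 384))                  # class labels 0-3
img_t, mask_t = vertical_translate(img, mask)
img_c, mask_c = random_crop(img, mask)
assert img_t.shape == (640, 384)
assert 512 <= img_c.shape[0] <= 640 and 307 <= img_c.shape[1] <= 384
```

Applying the same geometric transform to the image and its mask keeps the pixel-wise labels aligned, and the varying crop sizes are the reason the batch size was set to one.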

Evaluation Metrics
To compare the correctness of the proposed segmentation methods with the manual annotations, we employ the following evaluation metrics:

1. Dice Coefficient—A measure of overlap between the segmented region in the ground truth and the prediction. It is calculated for each of the segmented classes (Vitreous, Preretinal Space, Retina, and Space under Retina).

2. Mean Absolute Error (MAE) with standard deviation (SD)—The average vertical distance between an annotated line and a segmented boundary. It represents the error of the segmentation result relative to the ground truth. It is calculated as follows:

MAE_b = (1 / (Z·w)) Σ_{z=1}^{Z} Σ_{x=1}^{w} |P_z(x) − G_z(x)|, (9)

where P(x) and G(x) are the vertical positions of the class boundary line b ∈ B for the prediction and ground truth, respectively; x stands for the image column index within the image width w; and z denotes an individual image index. The MAE is computed for the three segmentation lines of interest, namely B = {0: PCV, 1: ILM, 2: RPE}, and is averaged over the number Z of all B-scans used in the test.

3. Topology Incorrectness Index (TII)—Indicates what percentage of tested images have incorrect layer topology in the vertical direction. It is computed based on the vertical gradient of the predicted segmentation mask.
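For a single predicted mask, the three metrics can be sketched as follows (illustrative helper names; the per-column `argmax` assumes the class is present in every column, which holds for well-formed masks):

```python
import numpy as np

def dice(pred, gt, c):
    """Dice overlap for class c between two label masks."""
    p, g = (pred == c), (gt == c)
    return 2.0 * (p & g).sum() / (p.sum() + g.sum())

def boundary_mae(pred, gt, c):
    """Mean absolute vertical distance between the top borders of class c."""
    P = np.argmax(pred == c, axis=0).astype(float)  # first row of class c per column
    G = np.argmax(gt == c, axis=0).astype(float)
    return np.abs(P - G).mean()

def topology_incorrect(pred):
    """Correct topology: class index never decreases going down a column,
    so any negative vertical gradient flags an incorrect mask."""
    return bool((np.diff(pred.astype(int), axis=0) < 0).any())

# Toy 8x4 mask: Vitreous (0), Preretinal Space (1), Retina (2), below (3).
gt = np.repeat(np.array([0]*3 + [1]*2 + [2]*2 + [3]*1)[:, None], 4, axis=1)
pred = gt.copy()
assert dice(pred, gt, 1) == 1.0
assert boundary_mae(pred, gt, 2) == 0.0
assert topology_incorrect(pred) is False
pred[0, 0] = 2  # a stray "Retina" patch inside the vitreous
assert topology_incorrect(pred) is True
```

The TII reported in the tables is then simply the fraction of test images for which `topology_incorrect` returns True.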

Preretinal Space Segmentation Accuracy
We conducted multiple experiments using several networks to segment the preretinal space and the retina in OCT images. This section presents a qualitative and quantitative comparison of preretinal space segmentation with a graph search approach and four DNN methods. Specifically, the UNet, LFUNet, Attention UNet, and ReLayNet networks were used to obtain precise segmentation masks. Table 4 presents baseline results for the whole dataset, without any of the additional strategies described in this paper. The Dice and MAE (SD) metrics show how accurate the performed segmentation is.
As can be noticed, all neural networks perform better than the graph search-based method, and among the CNNs, UNet has the best performance in all segmented areas and borders. On the other hand, ReLayNet gives the worst results, which may be explained by its relatively lower number of features compared to the other architectures. Additionally, the PCV boundary of the preretinal space and the image classes it separates (i.e., Vitreous and Preretinal Space) have worse accuracy than the clearly defined ILM and RPE borders and the two image regions they delimit. This confirms the difficulty of determining the preretinal space boundary, due to the similarity of this region's pixel intensities to those of the vitreous.

Effect of Removing Anomalous Data Samples
In Table 5, we demonstrate how removing anomalous data samples from the dataset improves the training of the networks. The results show that the accuracy improved for all the tested methods. Here, LFUNet and the original UNet present the best accuracy, while ReLayNet remains the worst in all areas, although its results improved the most.

Effect of Data Augmentation
Table 6 shows the effect of applying data augmentation with each of the employed CNN methods. Since the graph search-based approach is not a machine learning method, data augmentation is not applicable to it. As can be expected, the addition of more varied images helps to train the network. This strategy boosts the segmentation outcome for all the methods. A detailed analysis revealed that rotation and random cropping are the two strategies that improve the segmentation results the most. This supports the observation that the angle setting for each patient is an individual parameter that can change even between examinations. Figure 7 illustrates the preretinal space Dice score distribution for each of the above-discussed improvements. From the box plots, it can be deduced that not only has the average value increased, but the overall performance has also improved.
We performed a statistical test of the significance of differences among the models (UNet, LFUNet, Attention UNet, and ReLayNet). The data used for the test are the same as those used to calculate the average Dice scores for the Preretinal Space in Table 5. Based on the ANOVA test, the f-ratio value is 40.91 and the p-value is <0.00001. The result is significant at p < 0.05.
Standard UNet architecture and LFUNet provide the best probability maps, although UNet has slightly better performance in segmenting preretinal space, retinal area, and PCV border. The Attention UNet and ReLayNet performed poorly here, even if their scores are better when employing data augmentation. Figure 8 presents examples of the obtained segmentation masks.
Examples shown in Figure 8 include a set of representative cases of VMT and VMA and illustrate a qualitative comparison of the obtained results. It includes two VMT cases (rows 1 and 2), two VMA cases with perifoveal vitreous detachment (examples 3 and 4), and one VMA case of slight detachment over a wide area (last row).
In the presented images, it is visible that the poor evaluation scores in Table 6 for ReLayNet and Attention UNet are the effect of the networks' difficulty in discerning areas with similar intensities (namely: the vitreous, the preretinal space, and the region below the retina). As a result, patches of those classes appear in incorrect places in the prediction masks. Such occurrences are less common with UNet and LFUNet; nevertheless, those architectures are not immune to them, and further improvements are necessary. Both UNet and LFUNet correctly learn the preretinal and retinal space borders regardless of the PCV intensity in the image, which is a significant improvement over the graph search-based method. Visually, those networks perform very well for both VMT and VMA cases. Furthermore, their accuracy is not affected by the placement of the preretinal space in the image or the area it spans.
Compared with the classical approach based on the image intensity gradient, a neural network learns smooth region borders and is not affected by slight local intensity variations: it can robustly generalize the preretinal space structure. Moreover, the graph search-based approach has difficulty correctly detecting the PCV border whenever it connects with the ILM line. This is a distinct disadvantage compared to the neural network methods, which do not exhibit such a hindrance.
On the other hand, it should be noted that in cases where the preretinal space occupies only a narrow area of the image (see the last row of Figure 8), a slight thickening of the preretinal space in the prediction mask (e.g., a region border 1 px higher) would significantly affect the Dice score (e.g., decreasing it by half). Such a drop in the metric value may lead to the assumption that the designed network is not performing well; nevertheless, in such a case, the MAE value would be relatively small. This is why the Dice score for regions spanning various area sizes (and particularly for imbalanced classes with small regions) should not be the sole indicator of the network's performance.

Figure 8. Example B-scans with the corresponding reference masks and segmentation results for the four analyzed neural networks (UNet, LFUNet, Attention UNet, ReLayNet) and the graph search-based method. Each shade in the segmentation mask represents a separate class.

Preserving Layers Topology with Distance Map
To tackle the problem of topology incorrectness (in the form of patches of another class in the prediction masks, as visible in Figure 8), we tested the influence of providing topology information to the network in the form of a relative distance map. Table 7 includes the Dice and MAE scores for the four tested network architectures with each of the proposed maps and without any map. Table 7 also includes the Topology Incorrectness Index to indicate how each of the methods influences the network's ability to discern similar classes.
When analyzing the results from Table 7, it can be noticed that for the CNNs that previously performed relatively better with respect to vertical layer order (i.e., UNet and LFUNet), all maps improved the topology, while the "CumSum" map gave the best performance with respect to the Dice and MAE scores. For the other two networks (Attention UNet and ReLayNet), the "2NetPR" map gives the best segmentation accuracy.
The proposed maps improve the layer topology three- to five-fold, and both of the proposed relative distance maps ("CumSum" and "2NetPR") perform better than the state-of-the-art "2NetR" approach. Furthermore, the "CumSum" map does not require an initial segmentation and is less computationally expensive.
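The exact construction of the "CumSum" map is not reproduced in this section; as an illustration only, a cumulative-sum map of the kind described, normalized per column and stacked with the B-scan as a second input channel, could look like this:

```python
import numpy as np

def cumsum_map(bscan: np.ndarray) -> np.ndarray:
    """Per-column cumulative sum of intensities, normalized to [0, 1].
    Encodes each pixel's relative vertical position with respect to the
    bright retinal tissue, without any initial segmentation step."""
    cs = np.cumsum(bscan, axis=0)        # accumulate down each column
    total = cs[-1:, :]                   # per-column intensity sum
    return cs / np.maximum(total, 1e-8)  # avoid division by zero

bscan = np.zeros((8, 3), dtype=np.float32)
bscan[4:6, :] = 1.0                      # a bright "retina" band
dmap = cumsum_map(bscan)
# Pixels above the band map to 0, pixels below it map to 1.
assert dmap[0, 0] == 0.0 and dmap[-1, 0] == 1.0
net_input = np.stack([bscan, dmap])      # 2-channel network input
assert net_input.shape == (2, 8, 3)
```

Because the map is a deterministic function of the raw scan, it is cheap to compute on the fly, which matches the observation that "CumSum" avoids the cost of an initial segmentation required by "2NetR" and "2NetPR".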
Additionally, we observed that a simple linear map ("BasicOrient") not only does not preserve the correct layer topology but, in most cases, hinders the network's ability to segment the OCT image properly. On the other hand, for UNet and LFUNet, this map lowered the number of images with incorrect topology.

Improving Segmentation with Non-Typical Convolution Kernel
The previously described experiments showed that the UNet architecture performs best when segmenting the preretinal space and the PCV line. Therefore, this network was used for further experiments with various convolutional kernel sizes. Table 8 presents the effect of the convolutional kernel size on the performance of preretinal space segmentation. Understandably, the average Dice score increases for every segmented region as the kernel size increases. The same observation can be made for the average MAE of all searched borders. Square kernels provide the best performance in terms of both MAE and Dice scores for the retina borders. Interestingly, the best results for the preretinal space area and its borders are obtained with a horizontal kernel of size 3 × 9 px. From the numerical data and the averaged Dice scores presented in Figure 9, it can be concluded that rectangular kernels (regardless of their orientation) give better results than square ones when segmenting the preretinal space. Additionally, even a kernel of size 3 × 5 or 5 × 3 px performs better than a square kernel of 5 × 5 px (which spans a greater area).

Figure 9. Results of preretinal space segmentation calculated using the Dice score for segmentation performed with UNet with various kernel sizes.

Conclusions
In this work, we have evaluated a set of methods for the segmentation of the preretinal space in optical coherence tomography images of the retina. We proposed an OCT image segmentation system that can help doctors automatically quantify morphological parameters of the preretinal space in healthy eyes and in pathological cases of vitreomacular traction. Our approach provides robust end-to-end learning of the preretinal space borders, with performance higher than in previous works. Employing a CNN for this task does not require image preprocessing in the form of denoising, thresholding, or other methods, as is the case with standard computer vision algorithms.
We have shown the challenges associated with the segmentation of the preretinal space and proposed a solution based on applying convolutional neural networks to this task. With quantitative and qualitative tests, we analyzed four neural network architectures for this task: UNet, LFUNet, Attention UNet, and ReLayNet. Two standard metrics, the Dice score and MAE, were used for evaluation, together with an additional discussion of their interchangeability and limitations for preretinal space analysis. The evaluation tests were conducted on a unique dataset of OCT images from 25 vitreomacular traction patients and 25 vitreomacular adhesion subjects. By performing the segmentation at the 2D level, thus utilizing 7050 images, we avoid computationally expensive 3D convolutions.
In general, the original UNet and LFUNet performed relatively well, correctly segmenting the preretinal space borders with a Mean Absolute Error of 1.33 px and 1.5 px, respectively, whereas Attention UNet and ReLayNet gave MAE results of 4.02 px and 12.63 px, respectively. Nevertheless, all networks faced the challenge of incorrect vertical topology, which is inherent in semantic image segmentation but unacceptable in biological image analysis.
We have proposed two new approaches for improving the topological correctness of OCT image segmentation: relative distance maps tailored for the preretinal space and the use of a non-typical convolution kernel. Extensive experiments on all network architectures show that both of the proposed relative distance maps preserve the correct layer topology better (an improvement from 15.1% to 3.7% of images for UNet, and from 11.5% to 4.8% for LFUNet) than the state-of-the-art approach (9.4% for UNet and 6.3% for LFUNet). Additionally, we demonstrate that using a bigger kernel in a UNet-type network improves the topological correctness of the segmentation to a greater extent than utilizing an additional distance map (an improvement to only 2.4% of images with incorrect topology).
The presented results confirm that a CNN can reliably segment the preretinal space, with significantly better performance than the graph-based approach. The conducted experiments show that the best obtained Dice score for preretinal space segmentation is up to 0.964 when using a 3 × 9 px kernel in a UNet architecture with 32 initial features.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: